This is not how I'd expect any professional engineering team to do risk mitigation.
This is not hard.
Enumerate risks. List them. Talk about them.
If you want to turn this into something prioritisable, quantify each one: on a scale of 1 to 10, what's the likelihood? On a scale of 1 to 10, what's the impact? Multiply the numbers. Communicate these numbers and see if others agree with your assessment. As a team, if the product is more than 15, spend some time thinking about mitigation work you can do to reduce the likelihood, the impact, or both. The higher the number, the more important it is to put mitigations into your backlog or "definition of done". Below 15? Check with the team that you're going to ignore it.
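To make that scoring step concrete, here's a minimal sketch; the 1-10 scales and the threshold of 15 are the ones described above, while the example risks themselves are invented:

```python
# Rough sketch of the likelihood x impact scoring described above.
# The listed risks are illustrative only; 15 is the example threshold from the text.

RISKS = [
    # (description, likelihood 1-10, impact 1-10)
    ("Data migration corrupts existing rows", 3, 9),
    ("Third-party API rate-limits us at launch", 6, 4),
    ("New cache layer serves stale prices", 2, 5),
]

THRESHOLD = 15  # above this, plan explicit mitigation work

def prioritise(risks):
    scored = [(desc, likelihood * impact) for desc, likelihood, impact in risks]
    return sorted(scored, key=lambda r: r[1], reverse=True)

if __name__ == "__main__":
    for desc, score in prioritise(RISKS):
        action = "add mitigation tasks to backlog" if score > THRESHOLD else "accept (confirm with team)"
        print(f"{score:>3}  {desc}: {action}")
```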
Mitigations are extra work. They add time. They slow down delivery. That's fine, you add them to your backlog as dependent tasks, and your completion estimates move out. Need to hit a deadline? Look at descoping, and include in that descoping a conversation about removing some of the risk mitigations and accepting the risk likelihood and impact.
Having been EM, TL and X in this story (and the TPM, PM, CTO and other roles), I don't want a "knob" that people are turning in their heads about their subjective measure of "careful".
I want enumerated risks with quantified impact and likelihood and adult conversations about appropriate mitigations that lead to clear decisions.
If you can't enumerate risks with the work you're about to do either you don't have any, or you haven't thought enough about it and you're yolo'ing.
If you have a list of risks and you don't know how to mitigate them, you're just yolo'ing.
"We don't have time to plan" is the biggest source of nonsense in this industry. The process I just described takes about 15 minutes to go through for a month's worth of work. Nobody is so busy they can't spare 15 minutes to think about things that might cause major problems that soak up far, far, far more time.
I like the idea of having an actual 'carefulness knob' prop and making the manager asking for faster delivery/more checks actually turn the knob themselves, to emphasise that they're the one responsible for the decision.
Yep. The best way to pushback is to ask your manager, "We'll do it fast since you are asking for it. What is the plan for contingencies in case things break?"
I don’t think the manager needs a knob to remember he’s responsible for the decisions; usually it’s the other way around, with devs having to remember they’re not. I’ve been the manager putting it on zero too many times, only to hear, “You can’t do that, it has to be at least a nine!!!”
It’s not the right approach. Structural engineers shouldn’t let management fiddle with their safety standards to increase speed. They will still blame you when things fail. In software, you can’t just throw in yolo projects with much lower “carefulness” than the rest of the product, everything has maintenance. The TL in this case needs to establish a certain set of standards and practices. That’s not a choice you give away to another team on a per-feature basis.
It’s also a ridiculously low bar for engineering managers to not even understand the most fundamental of tradeoffs in software. Of course they want things done faster, but then they can go escalate to the common boss/director and argue about prioritization against other things on the agenda. Not just “work faster”. Then they can go manage those whose work output is proportional to stress, not programmers.
Management decides whether they build a cheap wooden building, a brick one, or a steel skyscraper. These all have different risk profiles.
Safety is a business/management decision, even in structural engineering. A pedestrian bridge could be constructed to support tanks and withstand nuclear explosions, but why. Many engineered structures are actually extremely dangerous - for example mountain climbing trails.
Also yes, you have many opportunities to just YOLO without significant consequences in software. A hackathon is a good example - I love them, always great to see the incredible projects at the end. The last one I visited was sponsored by a corporation and they straight up incorporated a startup next day with the winning team.
If management intends expected use to be a low-load-quick-and-dirty-temporary-use prototype to be delivered in days, it seems the engineers are not doing their job if they calibrate their safety process to a heavy-duty-life-critical application. And vice versa.
Making the decision about the levels of use, durability, reuse-ability, scalability, AND RISK is all management. Implementing those decisions as decided by management is on engineering. It is not on engineering to fix a bad-trade-off management decision beyond what is reasonably possible (if you can, great, but go look to work someplace less exploitative).
What you wrote didn’t really refute my point either.
If management allocates time and resources only for a quick-&-dirty prototype not intended for public use, then releases it to the public with bad consequences, they will definitely ask the engineers about it. If the engineers kept a proper paper trail, i.e., receipts for when management refused their requests for resources to build in greater safety, then engineering will not be responsible. Ethically, this is the correct model.
But if he & you are saying that management will try to exploit engineering and then blame failures on engineering when it was really bad management, yup, you should expect that kind of ethical failure from management. Yes, there are exceptions, but the structure definitely encourages such unethical management behaviors.
"Management" is generally a bunch of morons bikeshedding to try to distract everyone from the fact that their jobs are utterly worthless. The fewer decisions management makes, the better. If management feels like they need to make a decision, throw something in there they can tell you to remove later. It's a tried and tested method of getting those imbeciles off your back. Then build the damn thing that actually needs building. Management will take the credit, of course. But they'll be happy. Everyone wins.
Think about it this way: the person who suffers the consequences of the decision should be making the decision. That's not management; they will never, ever accept any level of blame for anything. They'll immediately pass that buck right on to you. So that makes it your decision. Fuck management; build what needs building instead.
Look at what happened when "management" started making decisions at Boeing about risk, instead of engineers making the decisions.
Building the wrong thing is exactly what happens when you listen to management too much. Talk to the client yourself. Learn the subject. Get the textbook. Read the materials. That's how you build the right thing.
> They are required because building the wrong thing is worse than not building anything at all
And yet, "manager" is usually[1] only responsible for ensuring the boards get carried from the truck to the construction site and that two workers don't shoot at each other with nail guns, not "we, collectively, are building the right house."
I freely admit that my cynicism is based on working in startups, where who knows what the right thing actually is, but my life experience is that managers for sure do not: they're just having meetings to ensure the workers are executing on the plan that the manager heard in their meeting
1: I am also 1000000% open to the fact that I fall into the camp of not having seen this mythical "competent manager" you started with
Fwiw in a real world scenario it'd be more helpful to hear "the timeline has risks" alongside a statement of a concrete process you might not be doing given that timeline. Everyone already knows about diminishing returns, we don't need a lesson on that.
and you'd be amazed, when you start really having these discussions with the client, how often stuff ends up not just not being needed, but going right past 'nice to have' straight to 'let's not'. The problem is often the initial problem being WAY overspecified in ways that don't ACTUALLY matter but generate tons of extra work.
Yeah, to me this kind of thing is much better than the carefulness knob.
Delaying or just not doing certain features that have low ROI can drastically shorten the development time without really affecting quality.
This is something that as an industry we seem to have unlearned. Sure, it still exists in the startup space, with MVPs, but elsewhere it's very difficult. In the last 20 years I feel like engineers have been pushed more and more away from the client, and very often you just get "overspecified everything" from non-technical Product Managers and have to sacrifice in quality instead.
I had one today that with one email went from "I want this whole new report over our main entity type with 3 user specified parameters" to "actually, just add this one existing (in the db) column to this one existing report and that totally solves my actual problem". My time going from something like 2 days to 15 minutes + 10 minutes to write the email.
For me the biggest one lately was due to miscommunication and bad assumptions.
The designer was working on a redesign in their own free time, so they were using this "new design" as a template for all recent mockups. The Product Manager just created tickets with the new design and was adamant that changing the design was part of the requirements. The feature itself was simple, but the redesign was a significant amount of work.
Talking with the business person revealed that they were not even aware of the redesign and it was blocked until next year.
One of my guys made a mistake while deploying some config changes to Production and caused a short outage for a Client.
There's a post-incident meeting and the client asks "what are we going to do to prevent this from happening in the future?" - probably wanting to tick some meeting boxes.
My response: "Nothing. We're not going to do anything."
The entire room (incl. my side) looks at me. What do I mean, "Nothing?!?".
I said something like "Look, people make mistakes. This is the first time that this kind of mistake had happened. I could tell people to double-check everything, but then everything will be done twice as slowly. Inventing new policies based on a one-off like this feels like an overreaction to me. For now I'd prefer to close this one as human error - wontfix. If we see a pattern of mistakes being made then we can talk about taking steps to prevent them."
In the end they conceded that yeah, the outage wasn't so bad and what I said made sense. Felt a bit proud for pushing back :)
I had a similar situation, but in my case it was due to an upstream outage in an AWS region.
The final assessment in the Incident Review was that we should have a multi-cloud strategy. Luckily we had a very reasonable CTO who stopped the team from doing that.
He said something along the lines that he would not spend 3/4 of a million plus 40% of our engineering time to cover something that rarely happens.
[preface that this response is obviously operating on very limited context]
"Wanting to tick some meeting boxes" feels a bit ungenerous. Ideally, a production outage shouldn't be a single mistake away, and it seems reasonable to suggest adding additional safeguards to prevent that from happening again[1]. Generally, I don't think you need to wait until after multiple incidents to identify and address potential classes of problems.
While it is good and admirable to stand up for your team, I think that creating a safety net that allows your team to make mistakes is just as important.
I didn't want to add a wall of text for context :) And that was the only time I've said something like that to a client. I was not being confrontational, just telling them how it is.
I suppose my point was that there's a cost associated with increasing reliability, sometimes it's just not worth paying it. And that people will usually appreciate candor rather than vague promises or hand-wavy explanations.
We've started to note these wont-fixes down as risks and started talking about probability and impact of these. That has resulted in good and realistic discussions with people from other departments or higher up.
Like, sure, people with access to the servers can run <ansible 'all' -m cmd -a 'shutdown now' -b> and worse. And we've had people nuke production servers, so there is some impact involved in our work style -- though redundancy and gradually ramping up people from non-critical systems to more critical systems mitigates this a lot.
But some people got a bit concerned about the potential impact.
However, if you realistically look at the amount of changes people push into the infrastructure on a daily basis, the chance of this occurring seems to be very low - and errors mostly happen due to pressure and stress. And our team is already over capacity, so adding more controls on this will slow all of our internal customers down a lot too.
So now it is just a documented and accepted risk that we're able to burn production to the ground in one or two shell commands.
The amount of deliberate damage anyone on my team can do is pretty much catastrophic. But we accept this as risk. It is appropriate for the environment. If we were running a bank, it would be inappropriate, but we're not running a bank.
I pushed back on risk management one time when The New Guy rebuilt our CI system. It was great, all bells and whistles and tests, except now deploying a change took 5 minutes. Same for rolling back a change. I said "Dude, this used to take 20 seconds. If I made a mistake I would know, and fix it in 20 seconds. Now we have all these tests which still allow me to cause total outage, but now it takes 10 minutes to fix it." He did make it faster in the end :)
Good, but I would have preferred a comment about 'process gates' somewhere in there [0]. I.e. rather than say "it's probably nothing let's not do anything" only to avoid the extreme "let's double check everything from now on for all eternity", I would have preferred a "Let's add this temporary process to check if something is actually wrong, but make sure it has a clear review time and a clear path to being removed, so that the double-checking doesn't become eternal without obvious benefit".
When you have zero incidents using the temporary process people will automatically start to assume it’s due to the temporary process, and nobody will want to take responsibility for taking it out.
Yep yep, exactly this. Sometimes an incident review reveals a fluke that flew past all the reasonable safeguards, a case the team may even have acknowledged when implementing those safeguards. Sometimes those safeguards are still adequate, as you can’t mitigate 100% of accidents, and it’s not worth it to try!
I’d go further and say that it’s a trap to try: it’s obvious that you can’t get 100% reliability, but people still feel uneasy doing nothing.
> If we see a pattern of mistakes being made then we can talk about taking steps to prevent them.
...but that's not really nothing? You're acknowledging the error, and saying the action is going to be watch for a repeat, and if there is one in a short-ish amount of time, then you'll move to mitigation. From a human standpoint alone, I know if I was the client in the situation, I'd be a lot happier hearing someone say this instead of a blanket 'nothing'.
Don't get me wrong; I agree with your assessment. But don't sell non-technical actions short!
The abbreviated story I told was perhaps more dramatic-sounding than it really played out. I didn't just say "Nothing.", mic drop, walk out :)
The client was satisfied after we owned the mistake, explained that we have a number of measures in place for preventing various mistakes, and that making a test for this particular one doesn't make sense. Like, nothing will prevent me from creating a cron job that does "rm -rf * .o". But lights will start flashing and fixing that kind of blunder won't take long.
If you want to go full corporate, and avoid those nervous laughs and frowns from people who can't tell if you're being serious or not, I recommend dressing it up a little.
Corollary is that Risk Management is a specialist field. The least risky thing to do is always to close down the business (can't cause an incident if you have no customers).
Engineers and product folk, in particular, I find struggle to understand Risk Management.
When juniors ask me what technical skill I think they should learn next, my answer is always: Risk Management.
(Heavily recommended reading: "Risk: The Science and Politics of Fear")
> Engineers and product folk, in particular, I find struggle to understand Risk Management.
How do you do engineering without risk management? Not the capitalized version, but you’re basically constantly making tradeoffs. I find it really hard to believe that even a junior is unfamiliar with the concept (though the risk they manage tends to be skewed towards risk to their reputation).
Yeah. Policies, procedures, and controls have costs. They can save costs, but they also have costs of their own. Some pay for themselves; some don't. Don't create the ones that don't.
This feels like a really good starting point to me, but I just want to point out that there's a very low ceiling on the effectiveness of "carefulness". I can spend 8 hours scrutinizing code looking for problems, or I can spend 1 hour writing some tests for it. I can spend 30 minutes per PR checking it for style issues, or I can spend 2 hours adding a linter step to CI.
The key here is automating your "carefulness" processes. This is how you push that effectiveness curve to the right. And, the corollary here is that a lack of IC carefulness is not to blame when things break. It is always, always, always process.
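In practice that often just means scripting the checks you'd otherwise rely on people to remember; a minimal sketch, assuming a repo with a test suite and a linter already set up (ruff and pytest here are stand-ins for whatever tools you actually use):

```python
#!/usr/bin/env python3
# Minimal pre-push check: run the automated "carefulness" steps so a human
# doesn't have to remember them. Assumes ruff and pytest are installed;
# swap in your project's real linter and test runner.
import subprocess
import sys

CHECKS = [
    ["ruff", "check", "."],   # style/lint issues
    ["pytest", "-q"],         # unit tests
]

def main() -> int:
    for cmd in CHECKS:
        print(f"running: {' '.join(cmd)}")
        if subprocess.run(cmd).returncode != 0:
            print("check failed; fix before pushing")
            return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```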
And to reemphasize the main point of TFA, things breaking is often totally fine. The optimal position on the curve is almost never "things never break". The gulf between "things never break" and "things only break .0001% of the time" is a gulf of gazillions of dollars, if you can even find engineers motivated enough and processes effective enough to get you anywhere close to there. This is what SLAs are for: don't give your stakeholders the false impression that things will always work because you're the smartest and most dedicated amongst all your competitors. All I want is an SLA and a compensation policy. That's professional; that's engineering.
In general, I've found that when I tell people to be careful on that code path (because it has bitten me before), I don't get the sense that it is a welcome warning.
It's almost as if I'm questioning their skill as an engineer.
I don't know about you but when I'm driving a road and there is black ice around the corner a warning from a fellow driver is welcomed.
I’ve seen this happen a number of ways. At some jobs I’ve worked, everyone’s ears perk up when they hear that and several people will get curious. I’ve worked places where people’s egos were in the way of basic good sense. I’ve seen well intentioned people who were just flat out abrasive.
I have no idea which situations you’re finding yourself in. It may help to sit back and see if you could word things differently. I have gotten better at communicating by asking the people I’m working with if there was a better way I could have said something. Some managers I’ve had had good advice. (I’ve also gotten myself dragged into their office for said advice.)
I have no idea how you approached it, but you could let them decide if they want your advice and have specific examples on how things went wrong if they do. “Hey, I noticed you’re working on this. We’ve had some problems in the past. If you have time, I can go into more detail.”
Then again you could just be working with assholes.
I think some of the blame _definitely_ lies with me. No doubt. The only sure fire way I've found that won't cause offence is to offer up a slack call/huddle so that they can hear my tone.
I have been failing quite successfully at communicating my tone over text for some time now, so I confess that and admit it upfront.
I did a lot of the work in my 40 year software career as an individual, which meant it was on me to estimate the time of the task. My first estimate was almost always an "If nothing goes wrong" estimate. I would attempt to make a more accurate estimate by asking myself "is there a 50% chance I could finish early?". I considered that a 'true' estimate, and could rarely bring myself to offer that estimate 'up the chain' (I'm a wimp ...). When I hear "it's going to be tight for Q2", in the contexts I worked in, that meant "there's no hope". None of this invalidates the notion of a carefulness knob, but I do kinda laugh at the tenor of the imagined conversations that attribute a lot more accuracy to the original estimate that I ever found in reality in my career. Retired 5 years now, maybe some magic has happened while I wasn't looking.
More than once I've used the xkcd method (Pull a gut number out of thin air, then double the numerator and increment the unit e.g. 1 hour -> 2 days, 3 weeks -> 6 months). When dealing with certain customers this has proven disappointingly realistic.
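For what it's worth, that rule of thumb is mechanical enough to write down; a toy sketch, where the unit ladder is my own guess at what "increment the unit" means:

```python
# Toy version of the "double the numerator, increment the unit" estimation
# joke described above. The unit ladder below is an assumption.
UNITS = ["minutes", "hours", "days", "weeks", "months", "years"]

def pad_estimate(amount: float, unit: str) -> str:
    """e.g. 1 hours -> 2 days, 3 weeks -> 6 months."""
    i = UNITS.index(unit)
    next_unit = UNITS[min(i + 1, len(UNITS) - 1)]
    return f"{amount * 2:g} {next_unit}"

print(pad_estimate(1, "hours"))   # 2 days
print(pad_estimate(3, "weeks"))   # 6 months
```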
> TL: If we did that, we’d just be YOLO’ing our changes, not doing validation. Which means we’d increase the probability of incidents significantly, which end up taking a lot of time to deal with. I don’t think we’d actually end up delivering any faster if we chose to be less careful than we normally are.
This is a really critical property that doesn't get highlighted nearly often enough, and I'm glad to see it reinforced here. Slow is smooth, smooth is fast. And predictable.
Lorin is always on point, and I appreciate the academic backing he brings to the subject. But for how many years do we need to tell MBAs that "running with scissors is bad" before it becomes common knowledge? (Too damn many.)
The dominant model in project management is "divide a project into a set of tasks and analyze the tasks independently". You'd imagine you could estimate the work requirement for a big project by estimating the tasks and adding them up, but you run into various problems.
Some tasks are hard to estimate because they have an element of experimentation or research. Here a working model is the "run-break-fix" model, where you expect to require an unknown number of attempts to solve the problem. In that case there are two variables you can control: (1) be able to solve the problem in fewer tries, and (2) take less time per try.
The RBF model points out various problems with carelessness as an ideology. First of all, being careless can cause you to require more tries. Being careless can cause you to ship something that doesn't work. Secondly, and more important, the royal road to (2) is automation and the realization that slow development tools cause slow development.
That is, careless people don't care if they have a 20 minutes build. It's a very fast way to make your project super-late.
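A back-of-envelope version of that model: total delivery time is roughly (number of tries) × (time per try), so both knobs matter, and the build time shows up in every single try. The numbers below are invented purely to show the shape of it:

```python
# Back-of-envelope run-break-fix arithmetic: total time is roughly
# tries x (build/deploy time + investigation time). Numbers are invented.
def rbf_hours(tries: int, build_minutes: float, debug_minutes: float) -> float:
    return tries * (build_minutes + debug_minutes) / 60

# Careless team: more tries needed, and nobody fixed the 20-minute build.
print(rbf_hours(tries=12, build_minutes=20, debug_minutes=30))  # 10.0 hours

# Careful team with fast tooling: fewer tries, 2-minute build.
print(rbf_hours(tries=5, build_minutes=2, debug_minutes=30))    # ~2.7 hours
```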
I worked at a place that organized a 'Hackathon' where we were supposed to implement something with our project in two hours. I told them, "that's alright, but it takes 20 minutes for us to build our system, so if we are maximally efficient we get 6 tries at this". The eng manager says "it doesn't take 20 minutes to build!" (he also says we "write unit tests" and we don't, he says we "handle errors with Either in Scala" which we usually don't, and says "we do code reviews", which I don't believe) I set my stopwatch, it takes 18 minutes. (It is creating numerous Docker images for various parts of the system that all need to get booted up)
That organization was struggling with challenging requirements from multiple blue chip customers -- it's not quite true that turning that 20 minute build into a 2 minute build will accelerate development 10x but putting some care in this area should pay for itself.
I like the idea of imagining that we can arbitrarily adjust the carefulness knob, but I don't think it works like that in reality. You can certainly spend more time writing tests, but a lot of the unforeseen problems I've hit over the years weren't caused by lack of testing--they were caused by unknown things that we couldn't have known regardless of how careful we were. It doesn't make for a very satisfying post mortem.
All I see here is an all-too-common organizational issue that something like this is having to be explained to someone in a management role. They should know these things. And they should know them well.
If your company is needing to have conversations like this more than rarely—let alone experiencing the actual issue being discussed—then that's a fundamental problem with leadership.
In real life you can't afford the reputation risk, but you also need to ship anyway. If you have an incident, guess who's going to be liable – the manager or the name on the commit?
Stop negotiating quality; negotiate scope and set a realistic time. Shipping a lot of crap faster is actually slower. 99% of the companies out there can't focus on doing _one_ thing _well_, that's how you beat the odds.
Others mentioned a parabola, which seems true to me. Imagine writing a database migration and not testing it. You could end up losing customer data, which is very expensive to "fix".
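For that specific failure mode, even a crude automated check is cheap compared to the cleanup; a sketch, using SQLite and a made-up table purely for illustration (real setups would run it against a snapshot of production data):

```python
# Sketch of a row-count sanity check around a migration. Uses an in-memory
# SQLite copy and an invented "customers" table purely for illustration.
import sqlite3

MIGRATION = "ALTER TABLE customers ADD COLUMN marketing_opt_in INTEGER DEFAULT 0"

def test_migration_preserves_rows():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")
    db.executemany("INSERT INTO customers (email) VALUES (?)",
                   [("a@example.com",), ("b@example.com",)])
    before = db.execute("SELECT COUNT(*) FROM customers").fetchone()[0]

    db.execute(MIGRATION)  # the change under test

    after = db.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
    assert after == before, "migration dropped customer rows"

if __name__ == "__main__":
    test_migration_preserves_rows()
    print("migration keeps all rows")
```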
Oracle is a thing because for the first several years of their existence they lost customer data on the regular, so the IBM DB people laughed at and ignored them.
I am probably missing an essential point here, but my first reaction was "this is literally the quality part of the scope/cost/time/quality trade-off triangle?"
Has that become forgotten lore? (It might well be. It's old, and our profession doesn't do well with knowledge transmission. )
> I mean, in some sense, isn’t every incident in some sense a misjudgment of risk? How many times do we really say, “Hoo boy, this thing I’m doing is really risky, we’re probably going to have an incident!” Not many.
Yeah, sure, that never happens. That's why "I told you so" is not at all a common phrase amongst folks working on reliability-related topics ;)
As I get older, I find myself enjoying these types of stories less and less. My issue comes from the fact that nobody seems comfortable having a conversation about facts and data, instead resorting to childish analogies about turning knobs.
That’s not how our jobs work. We don’t “adjust a carefulness meter.” We make conscious choices day to day based on the work we’re doing and past experience. As an EM, I’d be very disappointed if the incident post mortem was reduced to “your team needs to be more careful.”
What I want from a post mortem is to know how we could prevent, detect or mitigate similar incidents in future and to make those changes to code or process. We then need to lean on data and experience of what the trade offs of those changes would be. Adding a test? Go for it. Adding extra layers of approval before shipping? I’ll need to see some very strong reasons for that.
> What I want from a post mortem is to know how we could prevent, detect or mitigate similar incidents in future and to make those changes to code or process.
The answer this post gives to that bizarre question that always gets asked is ‘nothing’, unless you want to significantly adjust the speed at which we deliver features.
Any added process or check is going to impose overhead and make the team a little bit less happy. Occasionally you’ll have a unicorn situation where there is actually a relatively simple fix, but those are few and far between.
In extremis, you’re reduced to a situation in which you have zero incidents, but you also have zero work getting done.
That’s simply not true. Some processes are good: some post-mortem outcomes will focus on improving deployment speed (so you can revert changes faster), or on improving your monitors so you can detect (and mitigate) incidents faster.
On the other hand, enforcing a manual external QA check on every release WILL slow things down.
You’re repeating the same mistake as the article by assuming “process” sits on a grade that naturally slows work down. This is because you’re not being precise in your reasoning. Look at the specifics and make a decision based on the facts in front of you.
> This is because you’re not being precise in your reasoning. Look at the specifics and make a decision based on the facts in front of you
I agree with the premise here, but in my experience running incident reviews, the issue I see is a mixture of performative safetyism and reactivity.
Processes are the cheap bandaid to fix design, architectural and cultural issues.
Most of the net-positive micro-reforms we made after incident reviews were the ones that invested in safety nets, faster recoveries, and guardrails, rather than in a new process that would tax everyone.
> Processes are the cheap bandaid to fix design, architectural and cultural issues.
They can be, yes. I have a friend that thinks I'm totally insane for wanting to release code to production multiple times a day. His sweet spot is once every 2 weeks because he wants QA to check over every change. Most of his employers can manage once a month at best, and once a quarter is more typical.
> Most of the net-positive micro-reforms we made after incident reviews were the ones that invested in safety nets, faster recoveries, and guardrails, rather than in a new process that would tax everyone.
I 100% agree with this. Your comment also reminded me to say that incident reviews are necessary but not sufficient. You also need engineering leadership reviewing at a higher level to make bigger organisational or technical changes to further improve things.
> Occasionally you’ll have a unicorn situation where there is actually a relatively simple fix, but those are few and far between.
Perhaps we have different backgrounds, but even in late stage startups I find there is an abundance of low hanging fruit and simple fixes. I'm sure it's different at Google, though.
Thanks for saying this. This mindset has infiltrated all the engineering teams and made software development hell for those of us who actually like shipping. Being more careful, adding more checks and processes, has exponentially diminishing returns (just invert the graph). Somehow though, teams have been led to believe that each incident needs to be met with more process.
This is an even bigger problem outside of our profession. Organisations do everything in their power to reduce the agency of employees and the general public through process. A company would rather spend 20 hours verifying a purchase than allow an unnecessary purchase every now and then. German culture in particular seems to favour this to an extreme.
I've worked in highly autonomous and empowered teams that still root cause-analyzed every incident to death. The rationale being that if you'd get PagerDuty-ied in the middle of the night, it better be worth losing your sleep over. And it was great. I've also worked in slow, bureaucratic environments. They're not the same. Turning the magic dial (up) towards "more care" apparently doesn't move you along the same axis towards bureaucracy per se.
I like the German workplace (for white-collar jobs) for several reasons, but this one drives me away.
We used to have an issue of deployments breaking in production, and one of the reasons was that we did not have any kind of smoke test after deployment (in our case we only had a rolling update as a deployment strategy).
The rational solution was simply to create that post-deployment step. The solution our German managers demanded: cut deployment access for the entire team, an “on-call” to check the deployments, and a deployment spreadsheet to track it.
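For reference, the kind of post-deployment smoke test being described can be very small; a minimal sketch, assuming an HTTP health endpoint (the URL and retry budget below are placeholders):

```python
# Sketch of a post-deployment smoke test: hit a health endpoint a few times
# before declaring the rollout good. The URL and retry budget are placeholders.
import time
import urllib.request

HEALTH_URL = "https://example.internal/health"  # placeholder

def smoke_test(url: str = HEALTH_URL, attempts: int = 5, delay_s: float = 3.0) -> bool:
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass  # not up yet; retry after a short wait
        time.sleep(delay_s)
    return False  # caller should fail the deploy / trigger a rollback

if __name__ == "__main__":
    raise SystemExit(0 if smoke_test() else 1)
```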
Good lord, are all those non-tech managers the reason Europe can't seem to build viable tech companies?
In my experience it’s a cultural problem in Germany. Everything has to be done methodically, even if the method adds a disproportionate amount of friction. Often, the purported benefits are not even there. The thoroughness is full of holes, the diligence never done, the follow-ups never happening.
It leads to situations where you need a certificate from your landlord that you take to the certified locksmith that your landlord contracted and show a piece of ID to order a key double that arrives 3 business days later at a cost of 60€. A smart German knows that there’s a locksmith in the basement of a nearby shopping mall that will gladly duplicate any key without a fuss, but even then the price is inflated by the authorised locksmiths.
I document German bureaucracy for immigrants. Everything is like this. Every time I think “it can’t really be this ridiculous, I’m being uncharitable”, a colleague has a story that confirms that the truth is even more absurd.
It’s funny until you realise the cost it has for society at large. All the wasted labour, all the bottlenecks, and little to show for it.
The other end of the spectrum has snakeoil salesmen grifting from town to town. It's a hard equilibrium to balance.
> Thanks for saying this. This mindset has infiltrated all the engineering teams and made software development hell for those of us who actually like shipping. Being more careful, adding more checks and processes, has exponentially diminishing returns (just invert the graph).
I'm going to strongly disagree with this (when it's done well, not bureaucratically).
We review what can be improved after problems and incorporate it into our basic understanding of everything we do; it's the gaining of experience and muscle memory to execute fast while also accounting for things proactively.
It's a long-term process but the payoff is great. Reduced time and effort spent on problems after the fact ends up, in the long run, increasing the amount of valuable work produced.
The key is to balance this process pragmatically.
I call this the "grandpa's keys" problem!
When Grandpa was 20 years old he left the house and forgot to take his keys, so every time he left the house he checked his pockets for his keys.
When he was 24 he left the house and left the stove on. He learned to check the stove before leaving the house.
When he was 28 he left his wallet at home. He learned to check for his wallet.
...
Now Grandpa is 80. His leaving home routine includes: checks for his keys, his phone, his wallet. He ensures the lights are off, the stove is off, the microwave door is closed, the iron is off, the windows are closed in case it rains...
Grandpa has learned from his mistakes so well that it now takes him roughly an hour to leave the house. Also, he finds he doesn't tend to look forward to going out as much as he once did...
In response to this, I'd like to highlight what I wrote:
> We then need to lean on data and experience of what the trade offs of those changes would be.
As engineering leaders, this is a key part of our job. We don't just blindly add processes to prevent every issue. I should add that we also need to analyse our existing processes to see what should change or is not needed any more.
And he will be replaced by a younger version who hasn't learned so many lessons.
The road to bureaucrat hell is paved with good intentions
https://andrewchen.substack.com/p/bureaucrat-mode
What you're saying agrees with the article.
However, you say you're agreeing with scott_w, and scott_w is criticizing the article. So this is confusing.
I get where you are coming from, and I certainly do expect actual facts, data, and reasoning to be a part of any serious postmortem analysis. But those will almost always be in relation to a very specific circumstance. I think there is still room for generalized parables such as this article - otherwise, we would be reading a postmortem blog post, which are also common here and usually do contain what you are asking for.
I think you can generalise without resorting to silly games like the article does. I gave some examples in a sibling comment that are high level enough to give an idea of the types of things I’d think about, without locking in to a specific incident I was part of.
Isn't that the point of the story though? To say "we need to have a conversation about facts" rather than always saying "we need to be more careful next time" when there's a problem?
As far as I can tell, the author doesn't really give any generalized advice on how careful you should be, he's just pointing at the "carefulness dial" and telling people to make an informed decision.
I'm not really sure it is the author's point. I re-read the article to try and find your interpretation and I couldn't really find it there. Maybe a slight hint in the coda?
Agreed. I recall being taught in college physics labs: there is no such thing as “human error”. Instead, think about the causes and mechanisms of each source of error, which helps both quantify and mitigate them.
Same energy here. “Be more careful” is extraordinarily hand-wavy for a profession that calls itself engineering.
Exactly, these stories only appeal to children and childish adults.
I think we can criticise the analogy as reductive without insulting people, here.
you're probably on the spectrum, as most anyone here.
there's a tradeoff on shipping garbage fast which won't explode on your hands and getting promoted. and there's also the political art of selling what you want to other people by masking it as what they want.
you and i and most people here will never understand any of that. good luck. people who do understand will have the careless knob stuck at 11.
these analogies help us rational people point out the BS at least, without having to fight the BSer.
> you're probably on the spectrum, as most anyone here.
Your comment starts with an ableist slur so I’m sure it’s going to be good /s
> you and i and most people here will never understand any of that. good luck. people who do understand will have the careless knob stuck at 11.
Nah, reading this comment wasn’t worthwhile after all.
> these analogies help us rational people point out the BS at least, without having to fight the BSer.
How cute, you think you’re “rational.”
I’ve been in this industry a long time. I’ve read How to Lie with Statistics, and a bunch of Tufte. I don’t think it would be too much hyperbole to say I’ve spent almost half a year of cumulative professional time (2-3 hours a month) arguing with people about bad graphs. And it’s always about the same half dozen things or variants on them.
The line in your carefulness graph starts out with no slope at all. Which means you’re basically telling X that we can turn carefulness to 6 with no real change in delivery date. Are you sure that’s the message you’re trying to send?
Managers go through the five stages of grief every time they ask for a pony and you counteroffer with a donkey. And the charts often offer them a pony instead of a donkey. Doing the denial, anger and bargaining in a room full of people becomes toxic, over time. It’s a self goal but bouncing it off the other team’s head. Don’t do that.
> The line in your carefulness graph starts out with no slope at all. Which means you’re basically telling X that we can turn carefulness to 6 with no real change in delivery date.
This strikes me as a pedantic argument, since the graph was clearly drawn by hand and is meant to illustrate an upward curving line. Now, maybe there's essentially no clear difference between 5 and 5.1, but when you extrapolate out to where 6 would be (about 65 pixels to the right of 5, if I can be pedantic for a moment), there actually is a difference.
This is a conversation about human behavior, not pixels.
A flat line will lead to bargaining, as I said. Don’t paint yourself into uncomfortable conversations.
If you don’t want the wolf in the barn don’t open the door.
Doesn't the flat line in this context mean that you're at a local minimum, which is where you want to stay? Where being less careful would take more time due to increased number of incidents.
If you can explain to me how you convince a nontechnical person why 6 is not free, I’ll concede the point.
But if not then my original thesis that this is needlessly asking for a pointless argument that costs social capital stands.
> If you can explain to me how you convince a nontechnical person why 6 is not free, I’ll concede the point.
"I said the minimum is at 5, but if you want me to trace the line more accurately then let's take a 2 minute break and I'll do that".
I think the pointless argument here is the one you make.
Mostly free is not the same thing as free, and the scale of ‘carefulness’ is already completely arbitrary. How much more careful is 6 than 5?
You don’t have to argue about it, because the scale doesn’t represent anything. The only thing to say is, sure, we’ll set the carefulness to 6.
I agree with both of you, but consider that if this was a real meeting, you’ve both just wasted 10-15 minutes arguing about a line on the graph of a metaphor. (and yes I’ve seen this behaviour in meetings).
I agree with the original comment, as professionals we can do better than simplified analogies (or at least we should strive to)
Good insight, thanks for that
> you’ve both just wasted 10-15 minutes arguing about a line on the graph of a metaphor
So goddamned many times. And multiply your 2-10 minute sidebar by the number of people in the room. You just spent over $200 on a wobble in a line.
Plus you’ve usually lost the plot by that point.
Curve higher at one hand and one finger more.
Maybe we need to start with Zeno. Doing anything at all is relatively impossible.
Maybe an excellent joke for the pre meeting.
I don’t think the boss will find it funny though.
It's true, though: accomplishing anything at all is, relatively speaking, a feat. You have two ways to contrast a retrospective: what if we had been perfect, and what if we had done nothing. Sometimes a comparison against the latter is more informative.
Is it really that hard for you to intuit the intent that it's a graph that slopes upwards?
I’m saying people suck, particularly when you’re trying to give them what they see as bad news.
You seem to be concerned over a non-issue: I'm pretty sure you're the only person in this entire discussion who misinterpreted the graph. Does that mean the people you're talking about are yourself? :P
And I've done high school calculus. It's a picture of a parabola, the derivative is very small near the minimum, and it can look kinda flat if the picture isn't perfectly drawn.
Principles of Product Development makes the point that a lot of real world tradeoffs in your development process are U-shaped curves, which implies that you will have very small costs for missing the optimum by a little. A single decision that you get wrong by a lot is likely to dominate those small misses.
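To put rough numbers on the "small miss near the optimum" point, here's a sketch in my own notation (none of these symbols come from the article): treat delivery time as a parabola in carefulness c with its minimum at the chosen setting c* = 5,

```latex
T(c) \approx T_{\min} + k\,(c - c^{*})^{2}, \qquad c^{*} = 5.
```

The derivative at c* is zero, so nudging the knob from 5 to 5.1 costs about 0.01k, essentially nothing, while a big miss like going from 5 to 9 costs 16k. That's the sense in which the one decision you get badly wrong dominates all the small misses.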
A more sensible way to present the idea is to put the turning point of the parabola at the origin of the graph and then show that 5 is somewhere on the line of super-linearly increasing schedule risk.
The article stipulates that 5 is the value that minimizes execution time.
It could have put that on the y-axis, and labeled the left side “extremely rushed” and the right side “extremely careful”. Maybe that would’ve been clearer, though I really think it’s clear if you are charitable and don’t assume the author has made a mistake.
It’s a picture of a parabola if someone put the y-axis at the dotted line, not the origin. If you want to bargain with people - and this fictional conversation is a negotiation - then don’t anchor people on bad values.
The article stipulates that 5 is the value that minimizes execution time. So the values between 0 and 5 would’ve been higher. It doesn’t intersect the x-axis because you can’t finish the task in zero time.
See the quote below.
That said, while I didn’t think it was a confusing drawing, I now wish he’d drawn the rest of the parabola, because it would’ve prevented this whole conversation.
> EM: Woah! That’s no good. Wait, if we turn the carefulness knob down, does that mean that we can go even faster?
> TL: If we did that, we’d just be YOLO’ing our changes, not doing validation. Which means we’d increase the probability of incidents significantly, which end up taking a lot of time to deal with. I don’t think we’d actually end up delivering any faster if we chose to be less careful than we normally are.
How is it a bad value?
If you centered it on the y axis, that would mean your carefulness scale goes from -5 to 5, and that's just confusing.
I guarantee you the boss in this scenario hated calculus or skipped it.
Also, parabolas are algebra, not calculus, but the same counterargument stands.
It's a U shape, you don't need either one for a U shape.
Is "self goal" meaningful? I suspect you mean "own goal", but I can't tell from context
This is not how I'd expect any professional engineering team to do risk mitigation.
This is not hard.
Enumerate risks. List them. Talk about them.
If you want to turn it into something prioritisable, quantify each one: on a scale of 1 to 10, what's the likelihood? On a scale of 1 to 10, what's the impact? Multiply the numbers. Communicate these numbers and see if others agree with your assessment. As a team, if the product is more than 15, spend some time thinking about mitigation work you can do to reduce either the likelihood or the impact or both. The higher the number, the more important it is to put mitigations into your backlog or "definition of done". Below 15? Check with the team that you're going to ignore it.
Mitigations are extra work. They add time. They slow down delivery. That's fine, you add them to your backlog as dependent tasks, and your completion estimates move out. Need to hit a deadline? Look at descoping, and include in that descoping a conversation about removing some of the risk mitigations and accepting the risk likelihood and impact.
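For what it's worth, a minimal sketch of that scoring step in Python; the 1-10 scales and the threshold of 15 come from the description above, while the Risk type and the example entries are purely illustrative:

```python
# Minimal sketch of the likelihood x impact scoring described above.
# The 1-10 scales and the >15 threshold come from the comment; the
# Risk type and the example entries are purely illustrative.
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    likelihood: int  # 1 (unlikely) .. 10 (near certain)
    impact: int      # 1 (minor)    .. 10 (severe)

    @property
    def score(self) -> int:
        return self.likelihood * self.impact


risks = [
    Risk("Migration locks the orders table", likelihood=4, impact=8),
    Risk("Third-party API rate-limits us", likelihood=6, impact=2),
]

for r in sorted(risks, key=lambda r: r.score, reverse=True):
    action = "plan mitigations" if r.score > 15 else "agree with the team to accept"
    print(f"{r.name}: {r.score} -> {action}")
```

The point isn't the code, it's that the whole exercise fits in a spreadsheet or a few lines and takes minutes, not days.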
Having been EM, TL and X in this story (and the TPM, PM, CTO and other roles), I don't want a "knob" that people are turning in their heads about their subjective measure of "careful".
I want enumerated risks with quantified impact and likelihood and adult conversations about appropriate mitigations that lead to clear decisions.
You are just describing what would happen if the knob is turned up.
The whole point is whether or not to turn it up, not what the process is once it's turned up. Not everyone can afford to waste that much time on planning.
If you can't enumerate risks with the work you're about to do either you don't have any, or you haven't thought enough about it and you're yolo'ing.
If you have a list of risks and you don't know how to mitigate them, you're just yolo'ing.
"We don't have time to plan" is the biggest source of nonsense in this industry. The process I just described takes about 15 minutes to go through for a month's worth of work. Nobody is so busy they can't spare 15 minutes to think about things that might cause major problems that soak up far, far, far more time.
I like the idea of having an actual 'carefulness knob' prop and making the manager asking for faster delivery/more checks actually turn the knob themselves, to emphasise that they're the one responsible for the decision.
Yep. The best way to pushback is to ask your manager, "We'll do it fast since you are asking for it. What is the plan for contingencies in case things break?"
*when things break
*if, it’s an increasing probability, not a guarantee
I’ve seen code with next to no verification turn out great.
I don’t think the manager needs a button to remember he’s responsible for the decisions; usually it’s the other way around, with devs having to remember they’re not. I’ve too many times been the manager putting it on zero only to hear, “You can’t do that, it has to be at least a nine!!!”.
Letting your hands type an error is a trust fall that nobody is willing to make in an environment without a much higher than usual amount of trust.
Did you put it in writing, in no uncertain terms?
I’ve seen maybe one out of ten managers actually take responsibility when things go to shit.
It’s not the right approach. Structural engineers shouldn’t let management fiddle with their safety standards to increase speed. They will still blame you when things fail. In software, you can’t just throw in yolo projects with much lower “carefulness” than the rest of the product, everything has maintenance. The TL in this case needs to establish a certain set of standards and practices. That’s not a choice you give away to another team on a per-feature basis.
It’s also a ridiculously low bar for engineering managers to not even understand the most fundamental of tradeoffs in software. Of course they want things done faster, but then they can go escalate to the common boss/director and argue about prioritization against other things on the agenda. Not just “work faster”. Then they can go manage those whose work output is proportional to stress, not programmers.
Management decides whether they build a cheap wooden building, a brick one, or a steel skyscraper. These all have different risk profiles.
Safety is a business/management decision, even in structural engineering. A pedestrian bridge could be constructed to support tanks and withstand nuclear explosions, but why? Many engineered structures are actually extremely dangerous - for example, mountain climbing trails.
Also, yes, you have many opportunities to just YOLO without significant consequences in software. A hackathon is a good example - I love them, it's always great to see the incredible projects at the end. The last one I visited was sponsored by a corporation, and they straight up incorporated a startup the next day with the winning team.
Expected use and desired tolerance is a management decision. Safety is still on engineers.
Isn't the point what level of safety?
If management intends expected use to be a low-load-quick-and-dirty-temporary-use prototype to be delivered in days, it seems the engineers are not doing their job if they calibrate their safety process to a heavy-duty-life-critical application. And vice versa.
Making the decision about the levels of use, durability, reusability, scalability, AND RISK is all management. Implementing those decisions as decided by management is on engineering. It is not on engineering to fix a bad-trade-off management decision beyond what is reasonably possible (if you can, great, but go look to work someplace less exploitative).
What you wrote didn’t really refute the point you’re replying to.
What you wrote didn’t really refute my point either.
If management allocates time and resources only for a quick-&-dirty prototype not for public use, then releases it to the public with bad consequences, they will definitely ask the engineers about it. If the engineers properly covered their paper trail, i.e., kept receipts for when management refused their requests for resources to build in greater safety, then engineering will not be responsible. Ethically, this is the correct model.
But if he & you are saying that management will try to exploit engineering and then blame failures on engineering when it was really bad management, yup, you should expect that kind of ethical failure from management. Yes, there are exceptions, but the structure definitely encourages such unethical management behaviors.
I wasn’t trying
That's evident
"Management" is generally a bunch of morons bikeshedding to try to distract everyone from the fact that their jobs are utterly worthless. The fewer decisions management makes, the better. If management feels like they need to make a decision, throw something in there they can tell you to remove later. It's a tried and tested method of getting those imbeciles off your back. Then build the damn thing that actually needs building. Management will take the credit, of course. But they'll be happy. Everyone wins.
Think about it this way: the person who suffers the consequences of the decision should be making the decision. That's not management; they will never, ever accept any level of blame for anything. They'll immediately pass that buck right on to you. So that makes it your decision. Fuck management; build what needs building instead.
Look at what happened when "management" started making decisions at Boeing about risk, instead of engineers making the decisions.
I can tell you’ve never worked with a competent manager. They are required because building the wrong thing is worse than not building anything at all
Building the wrong thing is exactly what happens when you listen to management too much. Talk to the client yourself. Learn the subject. Get the textbook. Read the materials. That's how you build the right thing.
> They are required because building the wrong thing is worse than not building anything at all
And yet, "manager" is usually[1] only responsible for ensuring the boards get carried from the truck to the construction site and that two workers don't shoot at each other with nail guns, not "we, collectively, are building the right house."
I freely admit that my cynicism is based on working in startups, where who knows what the right thing actually is, but my life experience is that managers for sure do not: they're just having meetings to ensure the workers are executing on the plan that the manager heard in their meeting
1: I am also 1000000% open to the fact that I fall into the camp of not having seen this mythical "competent manager" you started with
Management can be good. Few engineers in a startup are doing it too, they just don't call it that.
Which is why liability is needed.
Fwiw in a real world scenario it'd be more helpful to hear "the timeline has risks" alongside a statement of a concrete process you might not be doing given that timeline. Everyone already knows about diminishing returns, we don't need a lesson on that.
My favorite tool when defining project timelines: What are we not doing?
There's an infinite number of nice-to-haves. A nice good deadline makes it super easy to clarify what you actually need vs what you only want.
And you'd be amazed, when you start really having these discussions with the client, how often stuff ends up not only not being needed, but going right past 'nice to have' and straight to 'let's not'. The problem is often the initial problem being WAY overspecified in ways that don't ACTUALLY matter but generate tons of extra work.
Yeah, to me this kind of thing is much better than the carefulness knob.
Delaying or just not doing certain features that have low ROI can drastically shorten the development time without really affecting quality.
This is something that as an industry we seem to have unlearned. Sure, it still exists in the startup space, with MVPs, but elsewhere it's very difficult. In the last 20 years I feel like engineers have been pushed more and more away from the client, and very often you just get "overspecified everything" from non-technical Product Managers and have to sacrifice in quality instead.
I had one today that with one email went from "I want this whole new report over our main entity type with 3 user specified parameters" to "actually, just add this one existing (in the db) column to this one existing report and that totally solves my actual problem". My time going from something like 2 days to 15 minutes + 10 minutes to write the email.
For me the biggest one lately was due to miscommunication and bad assumptions.
The designer was working on a redesign in their own free time, so they were using this "new design" as a template for all recent mockups. The Product Manager just created tickets with the new design and was adamant that changing the design was part of the requirements. The feature itself was simple, but the redesign was significantly harder.
Talking with the business person revealed that they were not even aware of the redesign and it was blocked until next year.
If everyone actually knew this stuff, this entire class of problem would cease to exist. Given that it has not...
I use a fairly simple “checkerboard” table to display risk, and encourage planning for mitigation/prevention/remedy[0].
In my experience, risk has two dimensions (probability and severity), and three ways to be handled (prevention, mitigation, and remediation).
[0] https://littlegreenviper.com/risky-business/
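For anyone who wants to see the shape of it, a rough sketch of such a checkerboard as data; this is my own Python, not the linked post's, and the bucket labels and example entry are made up — only the two axes and the three handling modes come from the comment above:

```python
# Sketch of a probability x severity "checkerboard", where each risk
# carries a plan along the three handling axes mentioned above
# (prevention, mitigation, remediation). Labels and the example are mine.
from dataclasses import dataclass

LEVELS = ("low", "medium", "high")


@dataclass
class RiskEntry:
    name: str
    probability: str  # one of LEVELS
    severity: str     # one of LEVELS
    prevention: str   # how we try to stop it happening
    mitigation: str   # how we limit the damage if it does
    remediation: str  # how we recover afterwards


# the checkerboard: one cell per (probability, severity) combination
board = {(p, s): [] for p in LEVELS for s in LEVELS}

entry = RiskEntry(
    name="Bad config pushed to production",
    probability="medium",
    severity="high",
    prevention="config linting in CI",
    mitigation="staged rollout behind a feature flag",
    remediation="one-command rollback",
)
board[(entry.probability, entry.severity)].append(entry)
```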
A personal anecdote:
One of my guys made a mistake while deploying some config changes to Production and caused a short outage for a Client.
There's a post-incident meeting and the client asks "what are we going to do to prevent this from happening in the future?" - probably wanting to tick some meeting boxes.
My response: "Nothing. We're not going to do anything."
The entire room (incl. my side) looks at me. What do I mean, "Nothing?!?".
I said something like "Look, people make mistakes. This is the first time that this kind of mistake had happened. I could tell people to double-check everything, but then everything will be done twice as slowly. Inventing new policies based on a one-off like this feels like an overreaction to me. For now I'd prefer to close this one as human error - wontfix. If we see a pattern of mistakes being made then we can talk about taking steps to prevent them."
In the end they conceded that yeah, the outage wasn't so bad and what I said made sense. Felt a bit proud for pushing back :)
I had a similar situation, but in my case it was due to an upstream outage in an AWS Region.
The final assessment in the Incident Review was that we should have a multi-cloud strategy. Luckily we had a very reasonable CTO who prevented the team from doing that.
He said something along the lines that he would not spend 3/4 of a million plus 40% of our engineering time to cover something that rarely happens.
[preface that this response is obviously operating on very limited context]
"Wanting to tick some meeting boxes" feels a bit ungenerous. Ideally, a production outage shouldn't be a single mistake away, and it seems reasonable to suggest adding additional safeguards to prevent that from happening again[1]. Generally, I don't think you need to wait until after multiple incidents to identify and address potential classes of problems.
While it is good and admirable to stand up for your team, I think that creating a safety net that allows your team to make mistakes is just as important.
[1] https://en.wikipedia.org/wiki/Swiss_cheese_model
I agree.
I didn't want to add a wall of text for context :) And that was the only time I've said something like that to a client. I was not being confrontational, just telling them how it is.
I suppose my point was that there's a cost associated with increasing reliability, sometimes it's just not worth paying it. And that people will usually appreciate candor rather than vague promises or hand-wavy explanations.
We've started to note these wont-fixes down as risks and started talking about probability and impact of these. That has resulted in good and realistic discussions with people from other departments or higher up.
Like, sure, people with access to the servers can run <ansible 'all' -m cmd -a 'shutdown now' -b> and worse. And we've had people nuke production servers, so there is some impact involved in our work style -- though redundancy and gradually ramping up people from non-critical systems to more critical systems mitigates this a lot.
But some people got a bit concerned about the potential impact.
However, if you realistically look at the number of changes people push into the infrastructure on a daily basis, the chance of this occurring seems to be very low - and errors mostly happen due to pressure and stress. And our team is already over capacity, so adding more controls on this will slow all of our internal customers down a lot too.
So now it is just a documented and accepted risk that we're able to burn production to the ground in one or two shell commands.
I hear ya, that sounds familiar.
The amount of deliberate damage anyone on my team can do is pretty much catastrophic. But we accept this as risk. It is appropriate for the environment. If we were running a bank, it would be inappropriate, but we're not running a bank.
I pushed back on risk management one time when The New Guy rebuilt our CI system. It was great, all bells and whistles and tests, except now deploying a change took 5 minutes. Same for rolling back a change. I said "Dude, this used to take 20 seconds. If I made a mistake I would know, and fix it in 20 seconds. Now we have all these tests which still allow me to cause total outage, but now it takes 10 minutes to fix it." He did make it faster in the end :)
Good, but I would have preferred a comment about 'process gates' somewhere in there [0]. I.e. rather than say "it's probably nothing let's not do anything" only to avoid the extreme "let's double check everything from now on for all eternity", I would have preferred a "Let's add this temporary process to check if something is actually wrong, but make sure it has a clear review time and a clear path to being removed, so that the double-checking doesn't become eternal without obvious benefit".
[0] https://news.ycombinator.com/item?id=33229338
Nothing more permanent than a temporary process.
When you have zero incidents using the temporary process people will automatically start to assume it’s due to the temporary process, and nobody will want to take responsibility for taking it out.
The infamous lion-repelling rock in action.
Yep yep, exactly this. When an incident review reveals a fluke that flew past all the reasonable safeguards (a case the team may even have acknowledged when implementing them), sometimes those safeguards are still adequate, as you can't mitigate 100% of accidents, and it's not worth it to try!
I'd go further and say that it's a trap to try. It's obvious that you can't get 100% reliability, but people still feel uneasy doing nothing.
> If we see a pattern of mistakes being made then we can talk about taking steps to prevent them.
...but that's not really nothing? You're acknowledging the error, and saying the action is going to be watch for a repeat, and if there is one in a short-ish amount of time, then you'll move to mitigation. From a human standpoint alone, I know if I was the client in the situation, I'd be a lot happier hearing someone say this instead of a blanket 'nothing'.
Don't get me wrong; I agree with your assessment. But don't sell non-technical actions short!
> You're acknowledging the error,
Which is important but not taking an action.
> and saying the action is going to be watch for a repeat
That watching was already happening. Keeping the status quo of watching is below the level of meaningful action here.
> if there is one in a short-ish amount of time, then you'll move to mitigation.
And that would be an action, but it would be a response to the repeat.
> I'd be a lot happier hearing someone say this instead of a blanket 'nothing'.
They did say roughly those things, worded in a different way. It's not like they planned to say "nothing" and then walk out without elaborating!
The abbreviated story I told was perhaps more dramatic-sounding than it really played out. I didn't just say "Nothing." mic drop walk out :)
The client was satisfied after we owned the mistake, explained that we have a number of measures in place for preventing various mistakes, and that making a test for this particular one doesn't make sense. Like, nothing will prevent me from creating a cron job that does "rm -rf * .o". But lights will start flashing and fixing that kind of blunder won't take long.
If you want to go full corporate, and avoid those nervous laughs and frowns from people who can't tell if you're being serious or not, I recommend dressing it up a little.
You basically took the ROAM approach, apparently without knowing it. This is a good thing. https://blog.planview.com/managing-risks-with-roam-in-agile/
Correct.
Corollary is that Risk Management is a specialist field. The least risky thing to do is always to close down the business (can't cause an incident if you have no customers).
Engineers and product folk, in particular, I find struggle to understand Risk Management.
When juniors ask me what technical skill I think they should learn next, my answer is always: Risk Management.
(Heavily recommended reading: "Risk: The Science and Politics of Fear")
> Engineers and product folk, in particular, I find struggle to understand Risk Management.
How do you do engineering without risk management? Not the capitalized version, but you’re basically constantly making tradeoffs. I find it really hard to believe that even a junior is unfamiliar with the concept (though the risk they manage tends to be skewed towards risk to their reputation).
Yeah. Policies, procedures, and controls have costs. They can save costs, but they also have their own costs. Some pay for themselves; some don't. For the ones that don't, don't create those procedures and controls.
Good manager, have a cookie.
This feels like a really good starting point to me, but I just want to point out that there's a very low ceiling on the effectiveness of "carefulness". I can spend 8 hours scrutinizing code looking for problems, or I can spend 1 hour writing some tests for it. I can spend 30 minutes per PR checking it for style issues, or I can spend 2 hours adding a linter step to CI.
The key here is automating your "carefulness" processes. This is how you push that effectiveness curve to the right. And, the corollary here is that a lack of IC carefulness is not to blame when things break. It is always, always, always process.
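As a hypothetical example of what "automating carefulness" can look like in practice, here's a tiny CI gate; ruff and pytest are just stand-ins for whatever linter and test runner a given project actually uses:

```python
# Sketch of "automated carefulness": a CI gate that runs mechanical
# checks so humans don't have to re-verify style and regressions by hand.
# ruff and pytest are stand-ins for whatever linter/test runner you use.
import subprocess
import sys

CHECKS = [
    ["ruff", "check", "."],  # style/lint issues caught without a reviewer
    ["pytest", "-q"],        # regression tests instead of hours of re-reading
]


def main() -> int:
    for cmd in CHECKS:
        if subprocess.run(cmd).returncode != 0:
            return 1  # fail the pipeline; the carefulness lives in the process
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Once checks like these are mechanical, "be more careful" stops being a human memory problem and becomes a property of the pipeline.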
And to reemphasize the main point of TFA, things breaking is often totally fine. The optimal position on the curve is almost never "things never break". The gulf between "things never break" and "things only break .0001% of the time" is a gulf of gazillions of dollars, if you can even find engineers motivated enough and processes effective enough to get you anywhere close to there. This is what SLAs are for: don't give your stakeholders false impressions that you'll always work forever because you're the smartest and most dedicated amongst all your competitors. All I want is an SLA and a compensation policy. That's professional; that's engineering.
> lack of IC carefulness is not to blame when things break
But if you can trust ICs a bit, you can move faster
LT is a member of the leadership team.
LT: Get it done quick, and don't break anything either, or else we're all out of a job.
EM: Got it, yes sir, good idea!
[EM surreptitiously turns the 'panic' dial to 10, which reduces a corresponding 'illusion of agency' dial down to 'normal']
In general, I've found that when I've told people to be careful on that code path (because it has bitten me before), I don't get the sense that it is a welcome warning.
It's almost as if I'm questioning their skill as an engineer.
I don't know about you but when I'm driving a road and there is black ice around the corner a warning from a fellow driver is welcomed.
I’ve seen this happen a number of ways. At some jobs I’ve worked, everyone’s ears perk up when they hear that and several people will be curious. I’ve worked places where people’s egos were in the way of basic good sense. I’ve seen well-intentioned people who were just flat out abrasive.
I have no idea which situations you’re finding yourself in. It may help to sit back and see if you could word things differently. I have gotten better at communicating by asking the people I’m working with if there was a better way I could have said something. Some managers I’ve had had good advice. (I’ve also gotten myself dragged into their office for said advice.)
I have no idea how you approached it, but you could let them decide if they want your advice and have specific examples on how things went wrong if they do. “Hey, I noticed you’re working on this. We’ve had some problems in the past. If you have time, I can go into more detail.”
Then again you could just be working with assholes.
I think some of the blame _definitely_ lies with me. No doubt. The only surefire way I've found that won't cause offence is to offer up a Slack call/huddle so that they can hear my tone.
I have been failing quite successfully at communicating my tone over text for some time now, so I confess that and admit it upfront.
I did a lot of the work in my 40-year software career as an individual, which meant it was on me to estimate the time of the task. My first estimate was almost always an "If nothing goes wrong" estimate. I would attempt to make a more accurate estimate by asking myself "is there a 50% chance I could finish early?". I considered that a 'true' estimate, and could rarely bring myself to offer that estimate 'up the chain' (I'm a wimp ...). When I hear "it's going to be tight for Q2", in the contexts I worked in, that meant "there's no hope". None of this invalidates the notion of a carefulness knob, but I do kinda laugh at the tenor of the imagined conversations that attribute a lot more accuracy to the original estimate than I ever found in reality in my career. Retired 5 years now, maybe some magic has happened while I wasn't looking.
More than once I've used the xkcd method (Pull a gut number out of thin air, then double the numerator and increment the unit e.g. 1 hour -> 2 days, 3 weeks -> 6 months). When dealing with certain customers this has proven disappointingly realistic.
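A toy sketch of that rule, just to show how mechanical it is (the unit list and function name are mine):

```python
# The "double the numerator, increment the unit" rule from the comment
# above, e.g. 1 hour -> 2 days, 3 weeks -> 6 months. Names are mine.
UNITS = ["minutes", "hours", "days", "weeks", "months", "years"]


def xkcd_estimate(value: float, unit: str) -> str:
    bumped = UNITS[min(UNITS.index(unit) + 1, len(UNITS) - 1)]
    return f"{value * 2:g} {bumped}"


print(xkcd_estimate(1, "hours"))  # 2 days
print(xkcd_estimate(3, "weeks"))  # 6 months
```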
It generally comes down to the age-old question: pick two out of Quality, Speed, or Cost.
> TL: If we did that, we’d just be YOLO’ing our changes, not doing validation. Which means we’d increase the probability of incidents significantly, which end up taking a lot of time to deal with. I don’t think we’d actually end up delivering any faster if we chose to be less careful than we normally are.
This is a really critical property that doesn't get highlighted nearly often enough, and I'm glad to see it reinforced here. Slow is smooth, smooth is fast. And predictable.
Lorin is always on point, and I appreciate the academic backing he brings to the subject. But for how many years do we need to tell MBAs that "running with scissors is bad" before it becomes common knowledge? (Too damn many.)
I think the kind of people that run with scissors for years and somehow have nothing happen have a higher tendency to become MBAs.
The dominant model in project management is "divide a project into a set of tasks and analyze the tasks independently". You'd imagine you could estimate the work requirement for a big project by estimating the tasks and adding them up, but you run into various problems.
Some tasks are hard to estimate because they have an element of experimentation or research. Here a working model is the "run-break-fix" model where you expect to require an unknown number of attempts to solve the problem. In that case there are two variables you can control: (1) be able to solve the problem in less tries, and (2) take less time to make a try.
The RBF model points out various problems with carelessness as an ideology. First of all, being careless can cause you to require more tries. Being careless can cause you to ship something that doesn't work. Secondly, and more important, the royal road to (2) is automation and the realization that slow development tools cause slow development.
That is, careless people don't care if they have a 20-minute build. It's a very fast way to make your project super-late.
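Roughly, in my own notation rather than anything from the model itself:

```latex
\mathbb{E}[T_{\text{total}}] \approx \mathbb{E}[N_{\text{tries}}] \times t_{\text{per try}}
```

Carelessness pushes up the first factor; a slow build or deploy pushes up the second, which is what the anecdote below is about.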
I worked at a place that organized a 'Hackathon' where we were supposed to implement something with our project in two hours. I told them, "that's alright, but it takes 20 minutes for us to build our system, so if we are maximally efficient we get 6 tries at this". The eng manager says "it doesn't take 20 minutes to build!" (he also says we "write unit tests" and we don't, he says we "handle errors with Either in Scala" which we usually don't, and says "we do code reviews", which I don't believe) I set my stopwatch, it takes 18 minutes. (It is creating numerous Docker images for various parts of the system that all need to get booted up)
That organization was struggling with challenging requirements from multiple blue chip customers -- it's not quite true that turning that 20-minute build into a 2-minute build will accelerate development 10x, but putting some care in this area should pay for itself.
[1] https://www.amazon.com/Have-Fun-at-Work-Livingston/dp/093706...
I like the idea of imagining that we can arbitrarily adjust the carefulness knob, but I don't think it works like that in reality. You can certainly spend more time writing tests, but a lot of the unforeseen problems I've hit over the years weren't caused by lack of testing--they were caused by unknown things that we couldn't have known regardless of how careful we were. It doesn't make for a very satisfying post mortem.
All I see here is an all-too-common organizational issue that something like this is having to be explained to someone in a management role. They should know these things. And they should know them well.
If your company is needing to have conversations like this more than rarely—let alone experiencing the actual issue being discussed—then that's a fundamental problem with leadership.
In real life you can't afford the reputation risk, and you need to ship anyway. If you have an incident, guess who's going to be liable – the manager or the name on the commit?
Stop negotiating quality; negotiate scope and set a realistic time. Shipping a lot of crap faster is actually slower. 99% of the companies out there can't focus on doing _one_ thing _well_, that's how you beat the odds.
I got the sack when I last did this, which was better than the alternative – to keep trudging until the inevitable incident.
So what does the graph look like from 1 to 5? Is zero defined? What does it mean?
Others mentioned a parabola, which seems true to me. Imagine writing a database migration and not testing it. You could end up losing customer data, which is very expensive to "fix".
Oracle is a thing because for the first several years of their existence they lost customer data on the regular, so the IBM DB people laughed at and ignored them.
I am probably missing an essential point here, but my first reaction was "this is literally the quality part of the scope/cost/time quality trade-off triangle?"
Has that become forgotten lore? (It might well be. It's old, and our profession doesn't do well with knowledge transmission.)
would it help if we had names for the various 2 of 3's? in my idiolect:
I'm going to cross here.
> I mean, in some sense, isn’t every incident in some sense a misjudgment of risk? How many times do we really say, “Hoo boy, this thing I’m doing is really risky, we’re probably going to have an incident!” Not many.
Yeah, sure, that never happens. That's why "I told you so" is not at all a common phrase amongst folks working on reliability-related topics ;)