I get all of these complaints. Why do I also have to be an infrastructure engineer? And why is my infrastructure not bespoke enough to do this weird thing I want to do? Why can't I use 5 different languages at this 30-person company?
The thing about immutable infrastructure is that it's straightforward. There is a set of assumptions others can make about your app.
Immutable infrastructure is boring. Deployments are uncreative. That's a good thing.
Repeat after me, "my creative energy should be spent on my customers"
> Repeat after me, "my creative energy should be spent on my customers"
"I should save my energy, so I won't exercise."
"I should save money, so I won't deploy it towards investments."
I don't think "creativity" is a zero-sum, finite resource; I think it's possible to generate more by spending it intelligently. And he pointed out how moving towards immutable infrastructure, while more "standard," directly hurt customers (the engineering team lost deployment velocity and functionality), so it's especially weird to end your comment the way you did.
To say "immutable infrastructure is just more straightforward" so definitively, from the limited information you have, is just you stating your biases. The stateful system he describes the company moving away from may also have been pretty "straightforward" and "boring," just with different fixed points. Beauty in the eye of the beholder and all that.
Creative energy is fairly zero sum though. You can't spend 100% of your day producing creative works. It's taxing, it takes time, it takes failures. Abstractions are one way engineers take away the requirement for creative energy and build on top of it. We don't start our jobs inventing an operating system for our new job so we can start to write some programs with it, we just use linux/macos/windows, and move on. You don't need to be creative with the video driver on your laptop while you build some business crud app, and you don't need to be creative with your infrastructure either. UNLESS that is where your company will succeed. Spend all of your creative energy there.
>> Repeat after me, "my creative energy should be spent on my customers"
I agree with you. But from the blog:
>> "Product requirements were changed to play with the adopted tech."
That's when things may have gone too far.
It's "weird" to want low downtime?
The general nastiness of updates is one of the largest customer friction points in many systems, but creative energy should be directed away from fixing it?
Gross.
I think like many things in engineering, it depends?
I'm sure most applications in life benefit from accepting a little downtime in order to simplify development. But there are certainly scenarios where we can use some "high quality engineering" to make downtime as low as possible.
who said anything about downtime? Immutable infrastructure does not require downtime.
> who said anything about downtime? Immutable infrastructure does not require downtime.
Um, the article did.
Every customer has one pet feature. They want that feature, then they want less downtime and more performance, then maybe they want the features other customers want.
The biggest problem with bespoke deployments like this is that they can widen the gap between a happy path and a disaster. If deployments are faster but not derisked, customer expectations are raised but failure cases become more costly.
How is downtime beneficial to the customer?
Nobody wants downtime, but it's easy to spend too much effort on avoiding it, taking time from actually important development. Plenty of customers don't mind occasional downtime, and it can mean the system is simpler and they get features faster.
I have been there! Duly upvoted!
Giving architects too much power worsens the situation: they have formal responsibility for keeping downtime low, but they are also appointed to find technical solutions rather than the sometimes technically mundane product improvements.
Also in the worst case, this solution becomes so cool that it attracts the best developers internally, away from building products.
Don’t forget startup innovation culture: everything has to be disrupted. Encourage it with tax exemptions for "innovative" jobs and you’ll have cohorts of engineers reinventing wheels from infra to UX in a glorified, innovative, modern "industry".
I think there's more to it than that.
You are correct for 90% of the cases, but this also kills innovation.
If you want to do infrastructure innovation, you are more than welcome to. There are lots of engineers dedicated to it. It's also not that hard to go from software engineer to infrastructure engineer, bringing your experience and unique perspective with you. But working at an SMB or startup (the 90%) doesn't justify innovation for innovation's sake. 1 acre of corn doesn't justify inventing the combine.
I love that last line. That’s the best analogy I’ve heard.
The way to enable fast ops evolution is by creating a small bubble with either a mutable facade or the immutability restrictions disabled, and go innovate there. Once you are ready, you can port the changes to the overall environment.
And the way to do the thing the article complains about is with partial deployments.
Both of those are much better behaved on a large-scale ops than the small-scale counterparts. K8s kinda "supports" both of them, but like almost everything in k8s, it's more work, and there are many foot-guns.
It's a real shame that we are steadily losing all the lessons of Erlang/Smalltalk/Lisp machines.
What are the specific lessons worth preserving, but being lost?
(I assume that "keep an image, it's too costly to rebuild everything from version-controlled sources" is not such a lesson.)
The biggest common lesson is being able to inspect/interrogate/modify the running system. Debugging distributed system failures purely based on logs/metrics output is not a particularly pleasant job, and most immutable software stacks don't offer a lot more than that.
However, for Erlang specifically, the lesson is pushing statelessness as far down the system as you possibly can. Stateless immutable containers that we can kill at will are great - but what if we could do the same thing at the request handler level?
Yes specific would be better.
Of course "keep an image" and "version-controlled sources" are not mutually exclusive.
https://www.google.com/books/edition/Mastering_ENVY_Develope...
> (I assume that "keep an image, it's too costly to rebuild everything from version-controlled sources" is not such a lesson.)
Exactly - that lesson is in no danger of being lost. Docker and other kinds of VM image are ubiquitous.
Society often overpowers good ideas. If the zeitgeist is about kube, companies will accept its cost without really knowing what benefits they've lost anyway.
I guess we'll have to wait to reinstate these features later on.
Exactly right. OC hits all the points. Fine granularity of failure, (re)warming caches, faster iteration by reducing cost of changes, etc.
I too lived this. Albeit with "poor man's Erlang" (aka Java). Our customers were hospitals, ERs, etc. Our stack could not go down. And it had to be correct. So sometimes that means manual human intervention.
There's another critical distinction, missed by the "whole freaking docker-meets-kubernetes" herd:
Our deployed systems were "pets". Whereas k8s is meant for "cattle".
Different tools for different use cases.
Ha, I recently wrote a system that does more or less the exact same thing for pushing updates to IoT devices. I can tell the system to update particular nodes to a given git commit, then I can roll it out to a handful of devices, then I can say "update all of them" but wait 30 seconds in between each update and so on.
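A rough sketch of that kind of staged rollout, in Python. Everything here is hypothetical illustration rather than the system described above: `staged_rollout`, the `update_node` callback, and the `canary_count` parameter are made-up names, and stopping at the first failed node is an assumption about what you would probably want.

```python
import time
from typing import Callable, Iterable, List


def staged_rollout(
    nodes: Iterable[str],
    commit: str,
    update_node: Callable[[str, str], bool],  # hypothetical: returns True on success
    canary_count: int = 3,
    delay_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Update a few canary nodes first, then the rest one at a time.

    Returns the nodes that were updated successfully. Stops at the first
    failure so a bad commit does not reach the whole fleet (an assumption,
    not something the comment above specifies).
    """
    ordered = list(nodes)
    updated: List[str] = []
    for index, node in enumerate(ordered):
        if not update_node(node, commit):
            break  # halt the rollout; remaining nodes keep the old commit
        updated.append(node)
        is_last = index == len(ordered) - 1
        # Canaries go out back to back; after the last canary, pause
        # between every subsequent update.
        if index >= canary_count - 1 and not is_last:
            sleep(delay_seconds)
    return updated
```

Passing `sleep` in as a parameter is only there so the pacing can be checked without actually waiting 30 seconds per device.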
The big thing about immutable infrastructure is that it is reproducible. I've seen both worlds and I do appreciate the simplicity and quickness of the upgrade solution presented in the post. The problem with this manual approach is that it is quite easy to end up with a few undocumented fixes/upgrades/changes to your pet server and suddenly upgrading or even just rebooting the servers/app becomes something scary.
Now, for immutable infrastructure, you have a whole different set of problems. All your changes are nicely logged in git, but to deploy you need to rebuild containers and roll them out over a cluster. To do this smoothly, the cluster also needs to have some kind of high availability setup, making everything quite complex and, in the end, you wasted minutes to hours of compute for something that a pet setup can do in a few seconds. But you can be sure that a server going down or a reboot are completely safe operations.
What works for you really depends on your situation (team size, importance of the app, etc.), but both approaches do have their uses and reducing the immutable infra approach to "people run k8s because it's hip" misses the point.
> undocumented fixes/upgrades/changes to your pet server and suddenly upgrading or even just rebooting the servers/app becomes something scary.
You can mostly prevent this by mandating that fresh nodes come up regularly. Have a management process that keeps a rolling window of ~5% of your fleet in connection-drain, and replaces the nodes as soon as they hit low-digits of connections.
Whole fleet is replaced every ~3 weeks, you learn about any deployment/startup failures within one day of new code landing in trunk, minimal disruption to client connections.
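For what it's worth, the selection logic can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a real fleet manager: the `Node` fields, the function names, and the "9 or fewer connections" threshold are all invented here, and the one-day-per-batch figure is just what a 5% window and a ~3 week cycle imply together.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str
    age_days: float
    connections: int
    draining: bool = False


def pick_nodes_to_drain(fleet: List[Node], window_fraction: float = 0.05) -> List[Node]:
    """Top up the draining set to about window_fraction of the fleet, oldest first."""
    target = max(1, round(len(fleet) * window_fraction))
    already_draining = sum(1 for n in fleet if n.draining)
    candidates = sorted(
        (n for n in fleet if not n.draining),
        key=lambda n: n.age_days,
        reverse=True,
    )
    return candidates[: max(0, target - already_draining)]


def ready_to_replace(fleet: List[Node], max_connections: int = 9) -> List[Node]:
    """Draining nodes whose connection count has dropped to low digits."""
    return [n for n in fleet if n.draining and n.connections <= max_connections]


def days_to_cycle_fleet(window_fraction: float, days_per_batch: float) -> float:
    """Rough turnover time, assuming each batch takes days_per_batch to drain and replace."""
    return days_per_batch / window_fraction
```

With a 5% window and roughly one day per batch, `days_to_cycle_fleet(0.05, 1.0)` comes out at about 20 days, which is in line with the ~3 weeks quoted above.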
This doesn't prevent anything, it just schedules possible breakages because your infra isn't 100% immutable.
IME, this doesn't work because companies won't implement/will deprioritize any infra changes that impact the development cycle.
It's not so very different to dropping your PR into any other automated-CI/CD-all-the-way-to-prod pipeline.
Albeit maybe a little easier to justify to management that you are dropping everything to fix the breakage when your CI/CD pipeline stops.
> The big thing about immutable infrastructure is that it is reproducible. I've seen both worlds and I do appreciate the simplicity and quickness of the upgrade solution presented in the post. The problem with this manual approach is that it is quite easy to end up with a few undocumented fixes/upgrades/changes to your pet server and suddenly upgrading or even just rebooting the servers/app becomes something scary.
The point is not really automation vs manual. Hot loading is amenable to automation too. The point is really that when you replace immutable servers with state with another set, there's a lengthy process to migrate the state. If you can mutate the servers, you save a lot of wall clock time, a lot of server cpu time, and a lot of client cpu time.
I deal with this issue at my current job. I used to work in Erlang and it took a couple minutes to push most changes to production. Once I was ready to move to production, it was less than 30 minutes to prepare, push, load, verify and move on with my life. I could push follow ups right away, or wrap up several issues, one at a time, in a single day. Coming from PHP was pretty similar, with caveats about careful replacement of files (to avoid serving half a PHP file) and PHP caching.
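The "careful replacement of files" part usually comes down to write-to-temp-then-rename. A minimal sketch in Python (not the PHP tooling mentioned above; `atomic_write` is a made-up helper name), assuming the temporary file lives on the same filesystem as the target so the rename is atomic:

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace the file at path so readers see the old or the new content, never half of each."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so both are on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".deploy-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the bytes are on disk before the swap
        # os.replace swaps the new file into place in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the swap.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Note that `mkstemp` creates the file with owner-only permissions, so a real deploy script would likely need to fix the mode before the swap. This also does nothing about opcode caches, which is the other caveat mentioned above.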
Now I work with Rust, terraform, and GCP; it takes about 12 minutes for CI to build production builds, it takes terraform at least 15 minutes to build a new production version deployment, and several more minutes for it to actually finish coming up, only then can I start to move traffic, and the traffic takes a long time to fully move, so I have to come back the next day to tear down the old version. I won't typically push a follow up right away, because then I've got three versions running. I can't push multiple times a day. If I'm working many small issues, everything has to be batched into one release, or I'll be spending way too much of my time doing deploys, and the deployment process will be holding back progress.
The funny thing here is BEAM is “immutable infrastructure as a programming language environment” which, to me, is strictly superior to the current disjunction between “infrastructure configuration” and “application code”.
Erlang defaults to pure code and every actor is like a little microservice with good tooling for coordination. There are mutable aspects like a distributed database, but nothing all that different from the mutable state that exists in every “immutable infrastructure” deployment I’ve seen.
> Now, for immutable infrastructure, you have a whole different set of problems. All your changes are nicely logged in git, but to deploy you need to rebuild containers and roll them out over a cluster.
The real issue is that it effectively forces externalizing nearly all state. On the surface, this seems like it's just a good thing, but if you think about the limitations and complexity it creates, it starts seeming less unquestionably good. Sometimes that complexity is warranted, but very frequently it is not.
That being said, I think modifying code in a running system without pretty strict procedures/control around it is... dangerous. More than a couple of times, I've seen hotfixes get dropped/forgotten because they only existed on the running system and not in source control.
Joe would be turning in his grave if he knew where the industry is right now on the k8s love-bomb.
The real issue is that the languages most use do not support, reliably, another approach.
k8s is not the issue. Worse Is Better languages and runtimes are.
What a lovely and well-written piece. I think the dev vs ops divide has caused so many problems like this. We just write systems differently when we have to run things versus when it gets thrown over the wall to other people to deal with.
Maybe that sounds like I'm blaming developers, but I'm not. I think this is rooted in management theories of work. They optimize for simple top-down understanding, not cross-functional collaboration. If people are rewarded for keeping to an over-optimistic managerial plan (or keeping up "velocity"), then they're mostly going to throw things over the wall.
Would a middle ground be possible? E.g. by default use stateless containers, but for certain stacks or popular app frameworks support automated stateful deploys?
Two years after writing A Pipeline Made of Airbags, I ended up prototyping a minimal way to do hot code loading from kubernetes instances by using generic images and using a sidecar to load pre-built software releases from a manifest in a way that worked both for cold restarts and for hot code loading: https://ferd.ca/my-favorite-erlang-container.html
It's more or less as close to a middle-ground as I could imagine at the time.
A well crafted app is great. It is also complex and generally only maintainable/supportable by those who built it.
A truly well crafted app requires very little maintenance or support, and that maintenance/support has already been thought through and made easy to learn and do.
These things are possible, and they fit economically somewhere in the 3-5 year maturity of a system. Years 1-3 are usually necessarily focused on features and releases, but far too many orgs just stop at that point and aren't willing to invest that extra year or two in work that will save time and money for many years to follow.
I believe this resistance is due to the short-sightedness of buyouts/IPOs or simply leadership churn.
If the original devs wrote good documentation then pretty much anybody can maintain it easily.
K8s is a Google thing for Google problems. It’s just not needed for most software delivery.
Edit: meanwhile I’m waiting 45 minutes on average for my most recent PR, a single-line change, to roll out to a k8s cluster at $massive_company that totally does this like everyone else.
Google doesn't use Kubernetes internally. Kubernetes is a simplified version of Borg. If it's taking 45 minutes to deploy a change, that's on your company's platform team, not Google.