Hey, author of the blog post here. If you have any questions or comments, please let me know!
It's also worth calling out that the first diagram shows dependencies between features for Ubicloud's managed Postgres. AWS, Azure, and GCP's managed Postgres services would have different diagrams. That's because we at Ubicloud treat write-ahead logs (WAL) as a first-class citizen.
From the article: “If the user isn’t actively working on the database, PostgreSQL won’t generate a WAL file each minute. WAL files are by default 16 MB and PostgreSQL waits for the 16 MB to fill up. So, your restore granularity could be much longer, and you may not be able to restore to a particular minute in time.
You can overcome this problem by setting the archive_timeout and forcing Postgres to generate a new WAL file every minute. With this configuration, Postgres would create a new WAL file when it hits the 1-minute or 16 MB threshold, whichever comes first.
The second issue with backup/restore is “no activity”. In this case, PostgreSQL wouldn’t create a new file even if the archive_timeout is set. As a solution, you can generate artificial write activity by calling pg_current_xact_id().”
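In concrete terms, the configuration described in that passage would look roughly like this. This is a sketch, not our exact setup; it assumes PostgreSQL 13+ (for pg_current_xact_id()) and a superuser session:

```sql
-- Force a WAL segment switch at most every minute, even under low write volume:
ALTER SYSTEM SET archive_timeout = '1min';
SELECT pg_reload_conf();

-- When there has been no write activity at all, generate a tiny bit of it so
-- that the next segment switch has something to archive:
SELECT pg_current_xact_id();
```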
Can you explain why you would create a WAL file even though there is no activity?
We want to ensure that the users are able to restore to any minute in the past.
Then, as @pintxo, @bjornsing, and @Doxin asked below: why not let users pick a time and then replay the closest backup? Why create empty backups? This is a valid question.
The answer is that without an empty backup, it's not possible to confirm there was no write activity. Maybe some write activity occurred but you failed to archive the WAL file. You need at least one WAL file (empty or not) with a timestamp equal to or more recent than the target restore time to ensure there was no write activity.
I didn't explain this in detail in the blog post for the sake of brevity, but you don't need to create an empty WAL file for each minute with no write activity. For instance, if there was no write activity for 10 minutes and then a WAL file is generated, PostgreSQL can deduce there was no write activity during those 10 minutes and is able to restore to any minute within that period. However, if the user wants to restore to 5 minutes ago and there has been no write activity for the past 10 minutes, PostgreSQL doesn't have a WAL file to confirm the lack of write activity. In this case, you need to call pg_current_xact_id() to generate a WAL file. So the trick is: when you get a restore request, call pg_current_xact_id() only once, just in case there was no write activity.
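A rough sketch of what a control plane might run on the primary when a restore request arrives; the pg_switch_wal() call is an assumption added here to force the current segment to be archived right away, the comment above only mentions pg_current_xact_id():

```sql
SELECT pg_current_xact_id();  -- create a small amount of write activity
SELECT pg_switch_wal();       -- close the current segment so it gets archived immediately
```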
This is a good example of the difference between running PostgreSQL for yourself vs. for others. When running PostgreSQL for yourself, such edge cases are not a concern. However, when managing it for others, an edge case like this could prevent a restore and create an operational burden. Automating this process is required, especially if you are managing lots of PostgreSQL databases.
Thank you, that makes a lot of sense. I should have intuited that, but I couldn’t quite make the connection. I really appreciate the write up, it gave me a lot to think about as someone who has only ever managed my own DB.
"Point in time restore" is a bad way to call the feature if you don't let your customers pick a moment in time to restore to, so those tricks ensure that there's enough WAL entries to allow people to pick with per-second granularity.
One can let users pick a time and then just replay the closest backup? Why create empty backups?
> those tricks ensure that there's enough WAL entries to allow people to pick with per-second granularity
How can they ensure that? Sounds more like they ensure that people can pick point in time with minute granularity.
I’m no expert, but doesn’t postgres write a timestamp with each record in the WAL? I know that when you start postgres in recovery mode it lets you specify a timestamp you want to restore to, and I don’t think that timestamp needs to be on a WAL file boundary. (So the tricks seem weird / unnecessary to me.)
But if there has been no activity you surely can just pick the most recent log that's older than the time the user picked?
How do you know the WAL was not lost during transmission, or the server crashed before transferring the WAL even though there were writes after the last WAL you have in your backup?
That's why.
The point is to create a WAL file if there is a little activity, but not enough to fill 16MB.
1. Do you use logical or physical replication for high availability? It seems like logical replication requires a lot of manual intervention, so I'd assume physical replication?
2. If that's the case, does that mean customers can't use logical replication for other use cases?
I'm asking because logical replication seems to become more and more common as a solution to automatically replicate data to external systems (another Postgres instance, Kafka, data warehousing, or offline sync systems in my case), but many cloud providers appear to not support it. Many others do (including AWS, GCP), so I'd also be interested in how they handle high availability.
Yes, we use physical replication for HA.
There are many reasons cloud providers don't want to support logical replication:
- It requires giving superuser access to the user. Many cloud providers don't want to grant that level of privilege. Some cloud providers fork PostgreSQL or write custom extensions to allow managing replication slots without requiring superuser access. However, doing it securely is very difficult; you suddenly open up a new attack vector for various privilege escalation vulnerabilities.
- If a user creates a replication slot but does not consume the changes, it can quickly fill up the disk. I've dealt with many different failure modes of PostgreSQL, and I can confidently say that disk-full cases are among the most problematic/annoying ones to recover from. (There's a monitoring query sketch below.)
- It requires careful management of replication slots in case of failover. There are extensions and 3rd-party tools that help with this, though.
So, some cloud providers don't support logical replication and some support it only weakly (i.e. don't cover all edge cases).
Thankfully, there are improvements being made in PostgreSQL core that simplify failover of logical replication slots (see https://www.postgresql.org/docs/17/logical-replication-failo... for more information), but it is still early days.
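As an aside, a query along these lines is one way to watch for the slot-induced disk-fill problem mentioned above. Just a sketch; thresholds and alerting are left out:

```sql
-- How much WAL each replication slot is forcing the server to retain;
-- an unconsumed slot shows ever-growing retained_wal until the disk fills up.
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;
```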
There are also cases where logical replication is not 100% complete. One of our applications uses LOBs for some reason and can't do logical replication.
Yeah, I've dealt with some of those edge cases on AWS and GCP.
Some examples:
1. I've seen a delay of hours without any messages being sent on the replication protocol, likely due to a large transaction in the WAL not committed when running out of disk space.
2. `PgError.58P02: could not create file "pg_replslot/<slot_name>/state.tmp": File exists`
3. `replication slot "..." is active for PID ...`, with some system process holding on to the replication slot.
4. `can no longer get changes from replication slot "<slot_name>". ... This slot has never previously reserved WAL, or it has been invalidated`.
All of these require manual intervention on the server to resolve.
And that's not even taking into account HA/failover, these are just issues with logical replication on a single node. It's still a great feature though, and often worth having to deal with these issues now and then.
Definitely agreed. It is a great feature and a building block for many complex features such as multiple primaries, zero-downtime migrations, etc. I'm also quite happy to see that with each PG version it becomes more stable and easier to use.
There are "strange" properties your users have... who are those people?
* You describe them as wanting to deploy the database in container: why would anyone do that (unless for throwaway testing or such)?
* The certificate issue also seems very specific to the case where something needs to go over the public Internet to some anonymous users... Most database deployments I've seen in my life fall into one of these categories: a database server for a Web server, which talk over a private local network, or a database backing some kind of management application, where, again, the communication between the management application and the database happens without looping in the public Internet.
* Unwilling to wait 10 minutes to deploy a database. Just how often do they need new databases? Btw. I'm not sure any of the public clouds have any ETAs on VM creation, but from my practice, Azure can easily take more than 10 minutes to bring up a VM. With some flavors it can take a very long time.
The only profile I can think of is the kind of user who doesn't care about what they store in the database, in the sense that they are going to do some quick work and throw the data away (eg. testing, or maybe BI on a slice of data). But then why would these people need HA or backups, let alone version updates?
* Sorry for the confusion, our customers do not necessarily want to deploy the database in a container. It's just that we encountered lots of folks who want to do that and asked us how to handle HA and backups. Though I don't think it is a rare situation to be in. Even in this thread, there are multiple people asking about running Postgres on K8s.
* We saw many deployments where communication between the web server and the database was going through the public internet. It doesn't need to be for anonymous users. It is also somewhat common for the web server and database to be managed by different SaaS providers, so they have to (in most cases) communicate over the public network.
* We (and all cloud providers) are trying to reduce overall provisioning time, mostly to reduce the friction for first-time users. There is no SLA, but for common instance types it would be unusual to wait more than 1 minute at AWS, Azure, or Ubicloud for VM provisioning.
Maybe you and I just experience different parts of the big computing ecosystem, hence what is "usual" for each of us is different. Out of curiosity, are you coming from an enterprise background?
Thanks for the article. Reading it makes me realize being a good SRE/DevOps is crazily difficult. Did you start as a DB admin?
I started as a software developer and I'm still a software developer. Though I've always worked on either core database development or building managed database services, by coincidence at the beginning and by choice later on. I was also fortunate to have the opportunity to work alongside some of the leading experts in this domain and learn from them.
Thanks for sharing. Nuff said, I need to find a new gig.
I enjoyed reading the article very much. Thanks for the write up!
Surprised the article doesn't mention the use of pgbouncer; do you not use a proxy in front of your pg instances?
There is a "Connection Pooling" box at the first diagram. Though, the article does not talk about all boxes. It would simply be very long wall of text if we had mention all components. Instead we picked some of the more interesting parts and focused on those.
Great write up. Do you plan to have similar for MySQL as well?
Do you mean writing something similar for MySQL or building a MySQL service at Ubicloud?
For the first one, the answer would be no, because I don't have any expertise in MySQL. I only used it years ago on my hobby projects. I'm not the best person to write something like this for MySQL.
For the second one, the answer would be maybe, but not any time soon. We have a list of services that we want to build at Ubicloud, and currently services like K8s have higher priority.
I meant the second one. Thank you for your answer.
Second, unrelated question that anyone might want to weigh in on: why is MySQL falling out of favour in recent years in comparison to Postgres?
Every time I've used MySQL for anything, I've come across something that annoys me. You can use ANSI_QUOTES mode for tables, columns, etc., but when making a foreign key reference it has to be backticks.
UTF8 isn't actually UTF-8 (it's an alias for the 3-byte utf8mb3), which is a holdover from early in UTF-8's design, but even then, it should have been retired and re-aliased in the next major release (as a breaking change that wouldn't break much).
Magic (stylized) single quotes work as apostrophes in queries, meaning you can't use simple escape logic if you're writing a custom client, and there's a whole new level of risk of security bugs. Yeah, you should use parameterized queries, but still.
Default indexing is case-insensitive, so if you create an index on a binary field, it's going to be case-insensitive (which can/will cause annoying issues).
There are so many more one-off quirks and workarounds people run into when working with MySQL. PostgreSQL, by contrast, has been far more consistent in its usage, development, and application of query support. Custom enhancements and so many other features just work, and work better.
https://news.ycombinator.com/item?id=35906604
From just over a year ago. And honestly the recent reasons look like all the reasons why I moved away from it in the past.
> ...why is MySQL falling out of favour in recent years in comparison to Postgres
I'm curious about this myself! Anyone know, or care to share?
Anecdotal evidence?
I used to work with/host lots of small PHP apps that tied in with MySQL. PHP has dropped in popularity over the years.
Add to this that hosting for Postgres has become common, so you're no longer tied to cheap MySQL-only hosting.
At least I think that the Oracle name being tied to MySQL made it icky for a lot of people who think the big O is a devil straight from hell. This is what got me looking at Postgres way back when.
There's probably a myriad of 'smaller' reasons just like this that reinforce the trend.
Until I switched to Postgres 8 years ago, I knew a lot about MySQL. I barely know anything about Postgres.
Thanks for the great write up.
What is MMW on the flowchart?
It is "managed maintenance window", which basically means letting the user pick a time window, such as Saturday 8PM-9PM, and as a service provider you ensure that non-critical maintenance happens in that window.
Your blog doesn't really mention any of the turn key PostgreSQL deployment options out there. These days, especially on Kubernetes, it has never been easier to run a SaaS-equivalent PostgreSQL stack.
I think you may benefit from researching the ecosystem some more.
The author seems to have spent the better part of a decade working professionally with Postgres. I think editorial choice, rather than ignorance, might be why they’re not mentioning more prior art.
Yes. :) We quite like k8s-based managed Postgres solutions. In fact, we convinced Patroni's author to come work with us in a previous gig at Microsoft. We find that a good number of companies successfully use k8s to manage Postgres for themselves.
In this blog post, we wanted to focus on running Postgres for others. AWS, Azure, and GCP's managed Postgres services, and those offered by startups like Crunchy, don't use k8s-based solutions. For us, one reason was operational simplicity. You can write a control plane for managed Postgres in 20K lines of code, including unit tests. This way, if anything breaks at scale, you can quickly figure out the issue without having to dive into dependencies.
We also understand different tradeoffs apply when you're running Postgres for yourself or for others. In this blog post, we wanted to focus on the latter scenario.
I'm not sure that 20K lines of code and writing a controller on k8s are really that incompatible. kubebuilder has made it pretty approachable.
Being critical without posting actual better solutions isn't so helpful.
If you have concrete knowledge, please share it and don't be cryptic!
> Security: Did you know that with one simple trick you can drop to the OS from PostgreSQL and managed service providers hate that? The trick is COPY table_name from COMMAND
I certainly did not know that.
If anyone actually needs the extra performance from avoiding streaming over the Postgres protocol, this could have been done with some dignity using pipes and splice or using SCM_RIGHTS. The latter technology has been around for a long time.
only on localhost
That's kinda the point: there is an argument that one should not be baking arbitrary shell command execution into a database server at all. Such execution will lack critical security features: setting the right user, process tracking, cleanup, etc.
If you need to execute commands on the database server for admin work, use something designed for this (such as ssh); this keeps rights management and logging simple, with only one source of shell command execution.
If you need to execute commands periodically, use some sort of task scheduler running as a dedicated user. To avoid a second connection, you may use Postgres-controllable job queues. Either way, limit execution to allowed commands only, so that even if Postgres credentials are leaked, no arbitrary commands can be executed.
In both approaches, this would have allowed high-speed, localhost-specific transports instead of a local shell, if PostgreSQL supported them.
Actually, I think what I said was wrong, because I guess the shell command generating the data to import runs on the database server, not the client, so there's no reason the database server couldn't be using pipes or file descriptor passing for this already. In fact, I'm not clear why amluto thinks it isn't.
huh?
Postgres already has the "COPY FROM STDIN" command, which makes the database server use the Postgres connection for the raw data. However, since it uses the existing connection, it needs to be compatible with the Postgres protocol, which means there is extra overhead in wrapping the streaming data.
On the other hand, "COPY FROM COMMAND" has no wrapping overhead, as it opens a direct pipe to/from the command, so no Postgres protocol parsing is involved: only a short command string is sent over the Postgres connection, while the bulk data goes over a dedicated channel. This makes it faster, although I am not sure how much this actually saves.
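For concreteness, a sketch of the two forms being compared (in PostgreSQL's syntax the server-side variant is spelled PROGRAM; the table name and command here are hypothetical):

```sql
-- Bulk data flows through the existing client connection, wrapped in
-- CopyData protocol messages:
COPY measurements FROM STDIN WITH (FORMAT csv);

-- The server runs the command itself and reads its stdout over a pipe,
-- bypassing the wire protocol for the bulk data. Requires superuser or
-- membership in pg_execute_server_program:
COPY measurements FROM PROGRAM 'zcat /tmp/measurements.csv.gz' WITH (FORMAT csv);
```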
amluto's point was that one can achieve the no-wrapping performance of "COPY FROM COMMAND" if one could pass a dedicated data descriptor to "COPY FROM STDIN". This could be done using SCM_RIGHTS (an AF_UNIX feature that passes descriptors between processes) or pipes+splice (not 100% sure how those would help). With SCM_RIGHTS, you'd have your psql client create a pipe, exec the process, and then pass the descriptor to the SQL server. This would have exactly the same speed as "COPY FROM COMMAND" (no overhead, dedicated pipe) but would not mix security contexts, since the external command would run under the client's user rather than the server's, an overall better solution.
Your point was "only on localhost", which I interpreted as "this approach would only work if psql runs on localhost (because SCM_RIGHTS only works on localhost); while "COPY FROM COMMAND" could even be executed remotely".
This is 100% correct, but as I said, I think an ability to execute commands remotely as a server user is a bad idea and should have never existed.
You need to be a superuser or a member of the pg_execute_server_program role for this to work.
In the Red Hat ecosystem there is an Ansible role to that end: https://github.com/linux-system-roles/postgresql
I don't know if it will help everyone, but it could be a good way to standardize and maintain an instance configuration.
Is there something similar for MySQL that covers backup and restore too?
Not that I'm aware of.
Here is the project homepage with the list of supported software: https://linux-system-roles.github.io/#toggleMenu
From the article:
"The problem with the server certificate is that someone needs to sign it. Usually you would want that certificate to be signed by someone who is trusted globally like DigiCert. But that means you need to send an external request; and the provider at some point (usually in minutes but sometimes it can be hours) signs your certificate and returns it back to you. This time lag is unfortunately not acceptable for users. So most of the time you would sign the certificate yourself. This means you need to create a certificate authority (CA) and share it with the users so they can validate the certificate chain. Ideally, you would also create a different CA for each database."
Couldn't you automate this with Let's Encrypt instead?
Thanks!
Sure you can, but Let's Encrypt, just like DigiCert, is a third-party provider, and they don't guarantee that you'd get a signed certificate within a few minutes. If they have an outage, it could take hours to get a certificate, and you wouldn't be able to provision any database servers during that time. In our previous gig at Microsoft, we had multiple DigiCert outages that blocked provisioning.
I personally, anecdotally, haven't had any problems with this in the last few years, and it doesn't seem like this is a big issue based on the information from the incident forum posts: https://community.letsencrypt.org/c/incidents/16/l/top
Self signing probably causes quite a few other issues, even though you have more control of the process, doesn't it?
Thanks!
I cannot comment on Let's Encrypt's reliability. Maybe I've just had too many bad experiences with DigiCert outages and I'm a bit pessimistic. However, their status page does not inspire much confidence: https://letsencrypt.status.io/pages/history/55957a99e800baa4...
I think if you need to generate a certificate once in a while, using Let's Encrypt or DigiCert is OK. Even if they are down, you can wait a few hours. If you need to generate a certificate every few minutes, a few hours of downtime means hundreds of failed provisionings. Hence, we opted for self-signing.
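For readers unfamiliar with the mechanics, self-signing roughly means creating your own CA and signing the server certificate with it locally, which is why it only takes seconds. A hedged sketch (names, hostnames, and validity periods are made up; this is not necessarily the exact tooling used):

```sh
# Create a per-database CA:
openssl req -x509 -newkey rsa:2048 -nodes -days 3650 \
  -keyout ca.key -out ca.crt -subj "/CN=db-1234 CA"

# Create the server key and certificate signing request:
openssl req -newkey rsa:2048 -nodes \
  -keyout server.key -out server.csr -subj "/CN=db-1234.example.com"

# Sign the server certificate with the per-database CA:
openssl x509 -req -days 397 -in server.csr \
  -CA ca.crt -CAkey ca.key -CAcreateserial -out server.crt
```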
In terms of reliability, it is great because we control everything. It is also quite fast; it takes a few seconds to generate and sign a certificate. The biggest drawback is that you need to distribute the CA certificate as well. Historically, this was fine, because you need to pass the CA cert to PostgreSQL as a connection parameter anyway, so the additional friction we introduced for users due to CA cert distribution was low. However, with PG16 there is now an option, sslrootcert=system, which automatically uses the OS's trusted CA root certs. Now the alternative is much more seamless and requires almost no action from the user, which tilted the balance in favor of globally trusted CAs, but it still doesn't give me enough reason to switch.
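To illustrate the difference from the client's side (hostname hypothetical):

```sh
# With a provider-distributed CA file:
psql "host=db-1234.example.com dbname=postgres sslmode=verify-full sslrootcert=ca.crt"

# With PostgreSQL 16+ clients and a certificate chaining to a publicly trusted CA,
# the OS trust store can be used instead, so no CA file has to be shipped:
psql "host=db-1234.example.com dbname=postgres sslmode=verify-full sslrootcert=system"
```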
I have a few ideas around simultaneously self-signing a cert and also requesting a certificate from Let's Encrypt. The database can start with the self-signed certificate, and we can switch to the Let's Encrypt certificate when it is ready. Maybe I'll implement something like that in the future.
Thanks for your detailed explanation!
I feel like cloudnative-pg takes away the majority of the pain by using the well-known Kubernetes operator pattern.
In another thread on this page I wrote more about this, but in summary: we also like k8s-based managed Postgres solutions. They are quite useful if you are running Postgres for yourself. In the managed Postgres services offered by hyperscalers or companies like Crunchy, though, k8s is not used very commonly.
I always assumed crunchy was using their own operator for their managed offering. Is that not the case?
https://github.com/CrunchyData/postgres-operator
They have a Ruby control plane; https://www.crunchydata.com/blog/crunchy-bridges-ruby-backen...
> it is not used very commonly.
Is this a problem of multi-tenancy in k8s specifically or something else?
In k8s, isolation is at the container level, so properly isolating system calls (for security purposes) is quite difficult. This wouldn't be a concern if you are running Postgres for yourself.
Also for us, one reason was operational simplicity. You can write a control plane for managed Postgres in 20K lines of code, including unit tests. This way, if anything breaks at scale, you can quickly figure out the issue without having to dive into dependencies.
I'm a little confused about the point-in-time restore functionality; I'm pretty sure there must be a way to avoid forcing those one-minute WAL boundaries. DigitalOcean's managed PostgreSQL, for instance, just allows you to specify a timestamp and restore, and when looking into the PostgreSQL docs I remember seeing an option to specify a timestamp as well.
You can still restore to a given minute even without one-minute WAL boundaries (most of the time). Consider the case where you have very low write activity and it would take an hour to fill up one WAL file (16 MB). That WAL file won't be archived until it is full, and if you lose your database for some reason, you won't have the last hour's data in your backups. That means you cannot restore to any minute in that one-hour window. Shorter WAL boundaries reduce your exposure. If you set archive_timeout to 1 minute, then you can restore to any minute in the past, with the exception of the last minute (in practice, it is possible to lose the last few minutes because their WAL files might not be archived yet, but the exposure is still much smaller).
DigitalOcean uses 5 minutes as archive_timeout, which is also a reasonable value. In our experience, most of our customers prefer less exposure, so we settled on 1 minute as our archive_timeout value.
Roughly speaking, archive_timeout defines your RPO (recovery point objective).
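For completeness, the restore side that the timestamp feeds into is standard PostgreSQL point-in-time recovery. A rough sketch (PostgreSQL 12+; paths and timestamp are hypothetical): these settings go into postgresql.conf on a restored base backup, along with an empty recovery.signal file in the data directory.

```
restore_command = 'cp /mnt/wal-archive/%f %p'
recovery_target_time = '2024-06-01 12:34:00+00'
recovery_target_action = 'promote'
```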
Thanks for the article; it's an important checklist.
I would think you'd want at most one certificate authority per customer rather than per database. Why am I wrong?
You are not wrong. There are benefits to sharing a certificate authority per customer. We actually considered using one authority per organization. However, in our case there is no organization entity in our data model. There are projects, which are similar but not exactly the same. So we weren't entirely sure where we should put the boundary and decided to go with the more isolated approach.
It is likely that we will add an organization-like entity to our data model in the future, and at that point sharing a certificate authority would make more sense.