Back in the day MySQL had huge issues with scaling to multi-core systems (maybe they've fixed it now, I haven't used MySQL for a long time) while Postgres never had this problem.
When designing an application around Postgres you just have to take into account that:
1. opening a new connection is rather expensive, so you should have a connection pool
2. opening too many connections is problematic, so you should keep the size of your pool rather small and return connections back into the pool ASAP (a minimal sketch of both rules follows below)
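In Python with psycopg2's built-in pool, those two rules look roughly like this (pool bounds and DSN are illustrative assumptions):

```python
# Minimal sketch of rules 1 and 2 with psycopg2's built-in pool.
# Pool bounds and DSN are illustrative assumptions.
from psycopg2.pool import ThreadedConnectionPool

pool = ThreadedConnectionPool(2, 20, "dbname=app user=app host=localhost")

def fetch_user(user_id):
    conn = pool.getconn()              # borrowing is cheap; connecting is not
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT name FROM users WHERE id = %s", (user_id,))
            return cur.fetchone()
    finally:
        pool.putconn(conn)             # return it ASAP, per rule 2
```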
MySQL never had a problem with scaling to multi-core systems; Linux on the other hand did have a problem, and its problems became everyone's problems. In the early 2000s the answer to multi-core Linux issues was to buy a system running one of the commercial UNIXes. By 2006 or so this became a non-issue for Linux (for the most part).
Postgres uses an old design model which predates threads; I have no idea if they have made any progress in updating their design. In the past I have heard the core devs talk about how difficult it would be to do this.
Oracle Database moved to a hybrid process/thread model at some point, this is the superior solution ( I have no idea if it was done well or not, but from standpoint of how to achieve better leverage of CPU vs IO, this is the way ).
If the PG devs had enough time/money, I am sure they would move towards a hybrid model where the focus would be on processor affinity with IO being all tied to events.
Why would you do 50k+ connections if they can't be active all at once either way? Unless you have 50k+ cores and IO beefy enough not to get overwhelmed due to that.
You can have as many connections as you want, but you'll have to trade it for lower work_mem numbers, which hurts performance. Traditional advice is to keep it below 500 per PostgreSQL instance (I'd say per physical host).
I've run dozens of microservices handling thousands of requests per second with a total connection limit of around 200, most of which was still unused - all without any server-side pooler.
> After the commit, the Connection object associated with that transaction is closed, causing its underlying DBAPI connection to be released back to the connection pool associated with the Engine to which the Session is bound.
I expect other reasonably sane libs for working with transactional databases do the same.
So, if you are doing pooling correctly, you can only run out of available connections if you want to have a lot of long running transactions.
So, why would you want every of your 50k frontends keep an open transaction simultaneously?
Because there's an overhead to make a connection, authenticate, set the default parameters on the connection, etc. I've never seen a framework that closed db connections between requests.
Of course, the better design is to write a nonblocking worker that can run async requests on a single connection, and not need a giant pool of blocking workers, but that is a major architecture plan that can't be added late in a project that started as blocking worker pools. MySQL has always fit well with those large blocking worker pools. Postgres less so.
As I said, you can return the connection to the connection pool.
From the perspective of keeping the number of open connections low it doesn't really matter if you close it or return to the pool, because in either case the connection becomes available to other clients.
I might not be understanding what you're pointing out here. It sounds to me like sqlalchemy is talking about a pool of connections within one process, in which case releasing back to that pool does not close the connection by that process to the database. Parent comment is talking about one connection per process with 50k processes. My comment was that you don't need that many processes if each process can handle hundreds of web requests asynchronously.
If you are saying that a connection pool can be shared between processes without pgbouncer, that is news to me.
The most common design for a Web app on Linux in the last 20 years is to have a pool of worker processes, each single-threaded and ready to serve one request. The processes might be apache ready to invoke PHP, or mod-perl, or a pool of ruby-on-rails or perl or python processes receiving the requests directly. Java tends to be threads instead of processes. I've personally never needed to go past about 100 workers, but I've talked to people who scale up to thousands, and they happen to be using MySQL. I've never used pgbouncer, but understand that's the tool to reach for rather than configuring Pg to allow thousands of connections.
Redis isn't exactly hard to deploy. May as well just use it instead of planning for a switch later that may come with god knows what unforeseen side effects.
The problem with this is that eventually the database becomes everything to everyone. You start seeing the frontend UIs suffer due to errors with unrelated tables or table locking operations (like migrations).
Redis is pretty awesome. So is rabbitmq. When you’re a < 100 person startup, probably stick with postgres. Over 100 and I would seriously start considering adding more specialized platform tools.
Back in the day I'd always have Redis around for my Rails cache. We already had it for cache, so also using it for job queues and Sidekiq was a no-brainer.
These days, with M.2 disks, database-backed caching (Solid Cache) with first-class support in Rails, and Postgres able to handle job queues via things like GoodJob, the choice isn't so clear-cut as "just use Redis".
I won’t be replacing sidekiq in our current production app any time soon but any new projects would get some serious consideration.
it’s generally better to avoid the distributed systems problems if you can. Postgres is very flexible and you get very far with it if you use all it has to offer
A point of view from over 30 years of a career: I love Postgres, and I've only gotten to know Redis recently. Each has great functionality. Every product has unique strengths in its early season, but gets fat by adding me-too features from others. Implementing a feature is not difficult, since it is software, but sometimes it makes the product lose its original strength. I've loved Postgres since its v3, but for a while it was overtaken by MySQL because of the LAMP boom. I hope Postgres does not become a fat system like Oracle, and I also want Redis to keep heading in the right direction.
I mean, Postgres did not need to support pub/sub and so on.
Actually pub/sub works best when you can commit it in a transaction along with your data changes. When they are separate systems, you get a race condition where you could commit the change but crash before the event gets published.
The queue approach of `SELECT...FOR UPDATE SKIP LOCKED` is great for low throughput needs. One thing to note: as your throughput increases you will have to deal with bloat.
When the processed message rows are deleted, the pages sitting on the disk are not reclaimed until vacuum is run. An online vacuum doesn't return the pages to the OS, and a full vacuum (which does free up the disk space) locks the table while it runs (which, if you're now dealing with high throughput, is not great).
One approach to address bloat without dealing with vacuum is to setup a timestamp-based partition on your queue table. This way you just drop old partitions and it frees the disk space without needing vacuum.
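For reference, the dequeue side of that pattern looks roughly like this (table and column names are assumptions):

```python
# Rough sketch of a SKIP LOCKED dequeue. Table and column names are
# assumptions for illustration.
import psycopg2

conn = psycopg2.connect("dbname=app")

def claim_one_job():
    with conn:                          # commits (or rolls back) on exit
        with conn.cursor() as cur:
            cur.execute("""
                DELETE FROM job_queue
                WHERE id = (
                    SELECT id FROM job_queue
                    ORDER BY created_at
                    LIMIT 1
                    FOR UPDATE SKIP LOCKED
                )
                RETURNING id, payload
            """)
            return cur.fetchone()       # None when the queue is empty
```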
To add some feedback as someone whose company replaced redis with PgSQL after I lobbied for it (leveraging my experience using it in bioinfo HPC beforehand):
- agreed, autovacuum is definitely something to keep a close watch on – but as long as disk I/Os are fast enough (i.e. not EBS), our queue tables could refill the space that plain VACUUM (the non-full VACUUM mentioned by my GP) had freed for PgSQL but not returned to the OS. So in the end, assuming the queue did not grow infinitely, actual disk usage stabilized, and we never had to VACUUM FULL outside of buggy (on our side) situations;
- high throughput is relative: we ran in the range of hundreds to thousands of tasks per second without any issue or any particular customization. So sure, PgSQL will not hold the load if you are Google, but I'd bet it satisfies the needs of all but the last centile of companies.
- I like your idea about timestamp-defined partitions, I will definitely keep this one in mind if the need arises.
Now on the pros:
- being able to store & manipulate basically all the data handled by our stack from a single source sounds like a minute detail, but actually helps a lot when we have to debug an issue;
- being able to use virtually any language and still access all the persistent data from our stack from a single source turned out to be an unexpected advantage when developing a quick'n'dirty script (e.g. Python/Nim/Racket) to explore data, debug, generate stats, etc.;
- our PgSQL machines ran much better “by themselves” than we feared at first, and finally were far from being the main point of contention on day-to-day ops – in the end we just altered a couple of settings and just... let them be;
- PgSQL provides a lot of tools to inspect it live, which provides additional help to identify bottlenecks in (i) your server setup and (ii) your stack;
- PgSQL support of JSON/JSONB is actually very good, and very helpful to store that hard-to-normal-form-ize last part of your data.
“Tool” may not be the right word; but being able to live monitor the vacuumer, to understand what table was taking how much space and why, to track min/mean/max execution time of queries, and, more generally, all the internal metrics that PgSQL exposes through a couple of SELECT has been a boon when debugging or optimizing.
Plain vacuum is generally all you want. At that point the space is freed for new inserts and updates to the table. Full vacuum is for giving the space back to the OS, which rarely makes sense if you'll still be writing more rows. In cases where freeing to the OS really is what you need, though, use something like pg_repack.
Do you need oranges? Apples have seeds, skin, and grow on trees.
Redis has many niches where it is more suitable than a database. In particular, the low latency and simple concurrency model make it a perfect choice for building shared token buckets, for example. For simple operations that need to be executed many times at low latency, SQL is usually not the best choice. And sometimes, you really just need K/V and don't want to deal with schema migrations, DBA chores, debugging auto vacuum...
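As a sketch of that kind of low-latency counter on Redis, here is a fixed-window limiter, a simpler cousin of the token bucket (key layout, window, and limit are assumptions):

```python
# Sketch of a shared fixed-window rate limiter on Redis.
# Key layout, window, and limit are illustrative assumptions.
import time
import redis

r = redis.Redis()

def allow_request(client_id, limit=100, window=60):
    key = f"ratelimit:{client_id}:{int(time.time() // window)}"
    count = r.incr(key)                # atomic; creates the key at 1
    if count == 1:
        r.expire(key, window)          # the window cleans itself up
    return count <= limit
```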
>Advisory locks allow you to leverage the same locking engine PostgreSQL uses internally for your own application-defined purposes.
That is like telling me that I can get across the river in a nuclear submarine instead of walking over the footbridge. PostgreSQL locks cause me plenty of headaches already without my application logic being dependent on them.
Disagree. If you need a global locking mechanism that all of your back ends can agree on without setting up an additional service, advisory locks are a totally valid use case. All of your back ends already have access to the DB, why bring in another whole service for just the locking that is required to implement some small thing in your giant SaaS product? Just lean on PG and you can go really far.
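A minimal sketch of that with psycopg2; the lock key and the guarded job are hypothetical:

```python
# Sketch: making sure only one backend runs a given job, using an
# advisory lock. The key (42) and the job function are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=app")
conn.autocommit = True

with conn.cursor() as cur:
    cur.execute("SELECT pg_try_advisory_lock(42)")
    (got_lock,) = cur.fetchone()
    if got_lock:
        try:
            run_nightly_report()       # hypothetical critical section
        finally:
            cur.execute("SELECT pg_advisory_unlock(42)")
    # else: some other backend already holds it; just skip
```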
I mean, yes, but a fairer comparison would be Redis vs. Memcache. Redis provides functionality missing from Memcache; it is not meant to be an alternative to PostgreSQL. This article reminds me of one about configuring MySQL to act like a NoSQL datastore, which I read when MongoDB hit the limelight. Apples to oranges.
Because everyone insists on insanely distributed architectures, most people will never really see the point of Redis, which is that if it is running on the same machine as the application, it can respond in much less than a millisecond. That lets you do stuff in the application that you just can't do with Postgres. Postgres kicks ass obviously but it is not running in memory on the same machine as the application.
If all you want to do is queues and whatnot, then sure, you don't need an in-memory KV store.
The point of an in-memory KV store is to do stuff that needs the performance characteristics of RAM. You obviously can't get the performance characteristics of RAM over a network connection. This is like, a tautology.
If the setup is that only one local process will use on-machine Redis as an in-memory cache, you're better off using the data structures available in your programming language.
Redis can cluster alongside your backend cluster, so you'll have the advantage that sometimes the data will be local.
Yes, that's assuming a single local process. Redis bills itself as an in-memory data structure store, and at least in my memory, it was mainly pitched for doing interprocess communication with real data structures.
This is the vanillajs.com of data stores.
Unless you need persistence across process respawning, ipc across lots of subprocesses, or want to trivially inspect and namespace your entire in memory data.
But yeah redis is not an answer for most single machine questions.
Why? Most languages are less memory efficient than redis, even in a separate process, and I don’t want to reimplement half of its features.
Because it's really, really slow compared to in-language variables? I've not yet run out of memory because of program variables, and speed shouldn't be an argument for a low memory footprint when it's already at best 1/10 of its performance.
Redis locally is good to do IPC though. You have many scenarios where multi local processes have to exchange data. You have alternatives but Redis works and can scale up.
You can do a mutex in like 10 bytes, how could redis be more memory efficient?
you are using a minority language if that's the case for you.
All the other reasons aside: because at some point I might want to move the data cache off of the single machine, and then I can do that with a couple of minutes' work. Yes, then I am back to "do I need Redis, why not use Postgres", but I got months of better performance out of Redis than I would have with Postgres, and if I need to move to a separate machine I am probably doing well enough that I can afford it.
Redis helps you get to the point (need cache on separate machines) where you might want to consider moving to postgres, or not because you don't feel it warrants the investment of programmer hours at that point.
on edit: got downvoted for ..? I guess what I said was so supremely stupid I deserved it. I do admit it was perhaps slightly obvious, but not so obvious that it warranted downvoting
Don't build today what you might need tomorrow.
The comment was quite clear that you are building today what you need today, and when you need another solution tomorrow you don't need to rebuild everything, just move it and update some connections. But if you build the other solution, keeping the cache in the application, then when you need the other solution you will need to rebuild everything.
How could that very clear statement be misunderstood as "well, I am building a bunch of stuff I might need tomorrow"? It is the exact opposite of that!
I guess it's my fault, nobody on HN seems to understand a single thing I say even though the words seem reasonably clear to me.
Not really. Because then once you restart that process you lose everything.
And it's far more likely you are continuously upgrading your process than Redis.
If you need that, you can use an embedded data store like leveldb/rocksdb or sqlite. Why bring another application running as its own process into the equation?
But having separate applications can give you better performance and isolation than one single-threaded process?
Or do you compare using separate processes to having multiple threads in a single process?
There's no universe where a single threaded embedded persistence implementation is slower than a single threaded application synchronously talking to a single threaded database over the network stack.
As far as isolation goes, if you are worried about the properties of reading and writing data to the disk then I simply don't know what to tell you. Isolation from what?
Running in its own process, and, better yet, in its own cgroup (container) makes potential bugs in it, including RCEs, harder to exploit. It also makes it easier to limit resources it consumes, monitor its functioning, etc. Upgrading it does not require you to rebuild and redeploy your app, which may be important if a bug or performance regression is triggered and you need a quick upgrade or downgrade with (near) zero downtime.
Ideally every significant part should live in its own universe, only interacting with other parts via well-defined interfaces. Sadly, it's either more expensive (even as Unix processes, to say nothing of Windows), slower (Erlang / Elixir), or both (microservices).
At the cost of requiring a lot of IPC and memory copying.
Either fork the process so the forked copy can dump its data (I think Redis itself does something like this), or launch a new process (with updated code if desired), then migrate the data and open sockets to it through some combination of unix domain sockets and maybe mmap. Or if we had OSes that really used the x86 segmentation capability as it was designed (this is one thing the 386 and later did cleverly), it could all be done with segments.
Now you're making it complicated again.
I think a good rule of thumb is:
* If you need it from a single process and it doesn't need to survive process restarts, just use basic in-process data structures.
* If you need it from multiple processes or it needs to survive process restarts, but not system restarts, use redis.
* If it needs persistence across system restarts, use a real DB.
Redis is nice but you take a huge speed hit depending on what you're doing, compared to using in-memory structures. Note that Redis can also persist to disk, either by forking or by writing an append-only op log that can be replayed for reloading or replication. Anyway, you've forgotten the case where you want not only the data structures, but also the open network connections to persist across process restarts. That's what passing file descriptors through unix domain sockets lets you do. It gets you the equivalent of Erlang's hot upgrades.
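For the curious: on Unix with Python 3.9+, that fd handoff looks roughly like this (the socket path is an assumption):

```python
# Sketch: handing an open listening socket to a successor process over a
# unix domain socket (Unix only, Python 3.9+). The path is an assumption.
import socket

def send_listener(listener: socket.socket):
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as uds:
        uds.connect("/tmp/handoff.sock")    # successor listens here
        socket.send_fds(uds, [b"takeover"], [listener.fileno()])

def receive_listener(uds_conn: socket.socket) -> socket.socket:
    msg, fds, flags, addr = socket.recv_fds(uds_conn, 1024, 1)
    return socket.socket(fileno=fds[0])     # rebuild around the inherited fd
```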
> I think Redis itself does something like this
It does... and it's a blocking operation (bad). It's used when storing data to a file to allow restarts without losing said data.
Do you mean fork is a blocking operation? That's surprising but hmm ok. I didn't realize that. I thought that the process's pages all became copy-on-write, so that page updates could trigger faults handled by in-memory operations. Maybe those could be considered blocking (I hadn't thought of it like that) or maybe you mean something different?
> but it is not running in memory on the same machine as the application.
What's the overhead of postgres vs redis, when run locally? Why do you think postgres isn't run locally?
There's nothing special about postgres. It's just a program that runs in another process, just like redis. For local connections, it uses fast pipes, to reduce latency, and you get access to some faster data bulk transfer methods. I've used it in this way on many occasions.
PostgreSQL does not have the concept of in-memory tables.
Except for temporary tables which are wiped after each session.
That's simplifying things a bit.
Postgres has a shared memory cache, which can be sized the same as Redis, so your operations will all happen in memory, with some background stuff putting it onto disk for you in case your computer shuts off. Storage won't be involved.
BUT, postgres still has ~6x the latency [1], even when run from memory.
[1] https://medium.com/redis-with-raphael-de-lio/can-postgres-re....
Pretty pointless benchmark since it does not use prepared statements (that much is obvious from the low TPS, plus I confirmed it with how he ran the benchmark). You need to pass "-M prepared" to pgbench. And it is very possible that the Redis benchmark is equally flawed.
If you have to parse and plan queries every time PostgreSQL is obviously much slower than Redis. It is much more interesting to see what happens if prepared statements are used.
The article is flawed, here's an issue of someone fixing some of the issues:
https://github.com/raphaeldelio/redis-postgres-cache-benchma...
Thanks, I spotted the lack of prepared statements and stopped looking after that but you spotted the rest of the issues too. And, agreed, it is likely that the Redis benchmark is flawed too but I do not know Redis well enough.
But the whole point here is that PostgreSQL will be used for other tasks e.g. storing all of your business data. So it will be fighting for the shared cache as well as the disk. And of course storage will still be involved as again you can't have in-memory only tables.
And having a locking system fluctuate in latency between milliseconds and seconds would cause all sorts of headaches.
If you are both small enough that you’re considering cohosting the app and DB, the odds are good that your working set is small enough to comfortably fit into RAM on any decently-sized instance.
> And having a locking system fluctuate in latency between milliseconds and seconds would cause all sorts of headaches.
With the frequency that a locking system is likely to be used, it’s highly unlikely that those pages would ever get purged from the buffer pool.
Once again: PostgreSQL does not have in-memory tables. So:
a) Writes will be happening on a slow, highly contended disk.
b) You have no guarantees that the data will be in memory.
Both of which make it a poor solution for use cases such as locking.
As pointed out by blackenedgem above: PostgreSQL has tablespaces, and one may simply declare the tables which should stay in RAM in a tablespace built upon a tmpfs with enough reserved RAM to store them all. There is only a small associated burden (restoring the tablespace while starting PG).
> And of course storage will still be involved as again you can't have in-memory only tables.
That's fine. If what you want is a single-node deployment combined with in-memory purism, then SQLite has your back.
No but it does have the concept of tablespaces. If you want you can map RAM to a disk location, set that up as a tablespace, then tell postgres to use that tablespace for your given table. Also set the table as UNLOGGED while you're at it.
A bit more work yes that could be simplified, but fully supported if you control the stack.
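Roughly, assuming the operator has already mounted a tmpfs for Postgres to use, the setup looks like this (names are illustrative):

```python
# Sketch of the tmpfs tablespace setup. Assumes the operator already ran:
#   mount -t tmpfs -o size=2g tmpfs /mnt/pg_ram   (owned by the postgres user)
# CREATE TABLESPACE also needs superuser privileges.
import psycopg2

conn = psycopg2.connect("dbname=app")
conn.autocommit = True   # CREATE TABLESPACE cannot run inside a transaction

with conn.cursor() as cur:
    cur.execute("CREATE TABLESPACE ram_space LOCATION '/mnt/pg_ram'")
    cur.execute("""
        CREATE UNLOGGED TABLE hot_cache (
            key   text PRIMARY KEY,
            value jsonb
        ) TABLESPACE ram_space
    """)
```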
Yes. Putting a Postgres tablespace on a RAM disk (tmpfs) does wonders. Even if NVMe may be comparable to RAM by bandwidth, whatever Postgres has to do to ensure that data are written to the durable store as a part of a commit is still significantly slower compared to RAM.
Highly recommended for running tests, especially in CI/CD pipelines. Doing this simple change can speed up DB-heavy tests 30-50%.
Well, if your working set fits in RAM, your tables will be stored in memory in the shared buffers.
> PostgreSQL does not have the concept of in-memory tables.
Is that relevant though? Some benchmarks on the web show Postgres outperforming Redis in reads as well as low-volume writes (less than 1k key-value pairs), and Redis only beating Postgres for high volume key-value writes.
> Redis only beating Postgres for high volume key-value writes
I wonder what type of use cases have high volume writes.
Maybe queuing, locking and pub/sub ?
Usually none of them. Queuing is almost never high volume and rarely is pub sub.
You know what performs better than Redis in this setup? A hashmap.
Why Postgres wouldn’t run on the same server as the app? It’s actually pretty common.
What if there are multiple processes that require access to a shared cache?
And I think this is the main use case they were looking for. If you have a web app where each request is a separate process/call (not uncommon), and you don’t have a good shared global state, Redis is a great tool. It is an in-memory data structure store that can respond to requests from different processes. I always considered it an evolution from memcached.
If you only have one long-lived process or good global variable control, then it is much less appealing in the single-server scenario. Similarly, if you require access from multiple hosts, it becomes a less obvious choice (especially if you already have a database in the loop). And Redis is also overkill if you're using it only as a cache.
>shared cache
As in a performance improvement - a cache should never be considered a datastore, i.e. you can pull the plug and nothing else happens (aside from losing performance). It'd be a lot more beneficial for all the processes to have a local cache themselves. The latter is at least 4 orders of magnitude faster than Redis. Now you may like some partitioning, too.
There's more to redis than just being a K/V store.
Perhaps you could elaborate? It would be helpful to understand what Redis can do that cannot be done easily with local memory.
Acting as shared memory for an inherently single-CPU language like JS is one I can think of. However, I don't use Redis, so you'd be better placed to drive the discussion forward with examples.
Redis provides low-level persistent data structures which can be used to implement business logic distributed safely across a number of machines. That’s a LOT harder than in-memory in-process.
My Sidekiq background job system runs entirely on top of Redis. Structures like Sorted Sets become the basis for indexes. Lists provide extremely fast queue behavior and Hashes map easily to persistent objects. Databases, traditionally, have not performed well when used as queues.
Those are the big 3 structures necessary to implement anything: trees, lists and maps.
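For a sense of scale, the list-as-queue piece of that is about this much code with redis-py (queue name and job shape are assumptions):

```python
# Sketch of the list-as-queue pattern with redis-py.
# Queue name and job shape are illustrative.
import json
import redis

r = redis.Redis()

# producer
r.lpush("jobs", json.dumps({"type": "email", "to": "user@example.com"}))

# consumer: blocks until a job arrives; lpush + brpop gives FIFO order
_queue, raw = r.brpop("jobs")
job = json.loads(raw)
```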
I'm a bit confused here, because the original comment was that "most people will never really see the point of Redis, which is that if it is running on the same machine as the application, it can respond in much less than a millisecond", to which the response was "there's more to redis than just being a K/V store".
I do see the point of Redis if you have multiple hosts, but I was unsure why someone would use it on just one host.
Right, it's still useful as shared data for multiple processes on the same machine. A SQL database mostly forces your data into one structure: the table. Redis instead provides commands which operate directly on those different data structures I mentioned.
I have Docker running in swarm mode where it will spin up multiple load balanced instances of my web app (where requests can get routed randomly to any instance). So I use Redis to store "User Session Information" so the session can be read/written from any instance, supposedly faster than using a DB.
When I've used redis to store web sessions, it is in fact acting as a k/v store. The session ID is the key, and the (serialized) session state is the value.
Yep, that's what I'm doing too. I bet it's the most common use case for Redis.
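In practice that usage is tiny; a sketch with redis-py, with the one-hour sliding TTL as an assumed policy:

```python
# Sketch of the session store: session ID as key, serialized state as
# value, with a sliding TTL. The one-hour policy is an assumption.
import json
import redis

r = redis.Redis()
SESSION_TTL = 3600

def save_session(session_id, state):
    r.setex(f"session:{session_id}", SESSION_TTL, json.dumps(state))

def load_session(session_id):
    raw = r.get(f"session:{session_id}")
    if raw is None:
        return None
    r.expire(f"session:{session_id}", SESSION_TTL)   # slide the TTL
    return json.loads(raw)
```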
Just to add -- sometimes I use Redis when I don't have a trusted impl of HyperLogLog or sorted set (and I have a vague suspicion that I am going to do IPC later -- so not worth it to wrap my own HLL).
> Perhaps you could elaborate?
IPC, I suppose?
> There's more to redis than just being a K/V store.
That's perfectly fine. You can compare Redis to other specialized tools just like you can compare it with Postgres and SQLite.
There’s more data structures in your favorite language std lib than an hashmap, at least, I hope so :)
Like?
https://redis.io/docs/latest/commands/
TIL about a Cuckoo Filter.
Also, never realized redis has native support for a Bloom Filter.
Overengineering/premature distribution is a real problem, but Redis stands for "Remote Dictionary Server." The purpose is very much not to run it locally! (Though that's a legitimate design choice, especially if your language's native dictionary doesn't support range queries.)
The purpose may not be primarily or originally to run it locally, but that has definitely become a common use case.
That said, antirez renaming it to lredis or reldis would be epic and one of my favorite moves of all time.
It very well could be running on the same machine, and communicating with the app using unix sockets, which is a hell of a lot faster than TCP. But no one seems to be doing that much either.
I feel that the virtualize-and-distribute-everything-to-hell-and-back trend might actually be about to break; there are signs, and G knows it's about time. The amounts wasted on cloud providers for apps that would run just fine on a single server, the effort wasted configuring their offerings: surreal.
> Because everyone insists on insanely distributed architectures, most people will never really see the point of Redis, which is that if it is running on the same machine as the application, it can respond in much less than a millisecond.
I don't think this is a realistic scenario at all.
If you need a KV store, you want it to be the fastest by keeping it in-memory, you want it to run on each node, and you don't care how much it cost to scale vertically, then you do not run a separate process on the same node. You just keep a plain old dictionary data structure. You do not waste cycles deserializing and serializing queries.
You only adopt something like Redis when you need far more than that, namely you need to have more than one process access shared data. That's at the core of all Redis usecases.
https://redis.io/docs/latest/develop/interact/search-and-que...
I develop and maintain multiple applications that use a worker pool, and are small enough to run on a single host. We used pg for the user sessions, which get read and written on every single page request. Some of our apps are Internet-facing, and web crawlers can create sessions that get read and written (recording recent pages) as they browse the site. We switched to a redis service on the same host as the app and saw 3 main benefits: faster session loading and saving, less disk activity on the Pg server (so all other queries run faster) and less writes to the Pg WAL, so our backups require drastically less GB per day of retention.
After the significant success of the first conversion, we've been working to convert all the rest of our apps.
And no, host language data structures aren't useful because they aren't in shared memory between all the worker processes, and even if we found a module that implemented them in shared memory, we like to be able to preserve the sessions across a host restart, and then we'd need a process to save the data structures to disk and load them back, and by the time we did that we'd have just reinvented redis.
This is the best response so far. Session churn creates lots of db activity but lots of it is of low business value. Better to offload to a separate process.
Also, session data is often blobs, which DBs don't process as efficiently as columnar data.
> then you do not run a separate process on the same node
You might if you want the KV to persist between app restarts (for warm starts.)
> which is that if it is running on the same machine as the application
in that case just use a regular hashmap - it has nanosecond performance compared to the sub-millisecond.
Over the network became feasible when HDD got markedly slower than NICs. It’s a nearer thing with NVMe.
I want a “redis” with something akin to the Consul client - which is a sidecar that participates in the Raft cluster and keeps up to date, cheaper lookups for all of the processes running on the same box.
The few bits of data we needed to invalidate on infrequent writes went into Consul, and the rest went into the dumbest (as in boring, not foolish) memcached cluster you can imagine.
But as you say there was the network overhead, and what would be lovely is a 2 tier KV store that cached recent reads locally and distributed cache invalidation over Raft. Consistent hashing for the systems of record, broadcast data on put or delete so the local cache stays coherent.
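A rough sketch of that two-tier idea, with Redis pub/sub standing in for the Raft-based invalidation; illustrative only, since a real version needs eviction and reconnect handling:

```python
# Sketch of a two-tier cache: a per-process dict in front of Redis, kept
# coherent by broadcasting key invalidations over pub/sub.
import threading
import redis

r = redis.Redis()
local = {}

def get(key):
    if key in local:
        return local[key]              # RAM-speed hit, no network
    value = r.get(key)
    local[key] = value
    return value

def put(key, value):
    r.set(key, value)
    r.publish("invalidate", key)       # every node drops its stale copy

def _listen():
    ps = r.pubsub()
    ps.subscribe("invalidate")
    for msg in ps.listen():
        if msg["type"] == "message":
            local.pop(msg["data"].decode(), None)

threading.Thread(target=_listen, daemon=True).start()
```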
I always liked the idea of a distributed sidecar like DB.
I wonder if something like Cockroach DB might even work for small clusters.
> [Redis] can respond in much less than a millisecond.
I have no idea how fast Redis can get, but it is entirely possible for an RDBMS to execute a query in well under a millisecond. I have instrumentation proving it. If everything is on the same machine, I would wager that IPC would ultimately be the bottleneck for both cases.
Django has caching built in with support for Redis, and it also has an in-memory caching option which they label as “not for production” (because if you have multiple instances of Django serving requests, their in-memory caches will diverge which is...bad).
But for lots of cases, especially internal business tools, we can scale up a single instance for a long time, and this in-memory caching makes things super fast.
There’s a library, django-cachalot [1], that handles cache invalidation automatically any time a write happens to a table. That’s a rather blunt way to handle cache invalidation, but it’s wonderful because it will give you a boost for free with virtually no effort on your part, and if your internal business app has infrequent updates it basically runs entirely in RAM, and falls back to regular database queries if the data isn’t in the cache.
[1] https://github.com/noripyt/django-cachalot
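For reference, the two options being weighed are a settings switch in Django; the Redis URL below is an assumption:

```python
# The two cache configurations being compared, as standard Django settings.
CACHES = {
    # per-process, in-memory: fast, but diverges across instances
    "default": {
        "BACKEND": "django.core.cache.backends.locmem.LocMemCache",
    },
    # shared across instances instead (built-in backend since Django 4.0):
    # "default": {
    #     "BACKEND": "django.core.cache.backends.redis.RedisCache",
    #     "LOCATION": "redis://127.0.0.1:6379",
    # },
}
```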
But then you can have it in memory in your app.
Redis is not just in-memory.
Many arguments here have been from the Redis point of view (and are quite defensive). Of course Redis has niches where it is the better choice.
But that’s not what I thought the point of the article was. I think the main argument from the article can be summed up with this line:
> PostgreSQL has a lot more capabilities than you may expect when you approach it from the perspective of just another SQL database or some mysterious entity that lives behind your ORM.
To me, this is the key line. If you’re only using your database from behind an ORM (any database, IMO), you’re probably missing functionality. And if you need to add another service (like Redis), you may be better served by using the database you already have setup, rather than adding another dependency.
It’s good to understand all of the things Postgres can do. It’s a powerful database.
A counterpoint is that the barrier to using redis is tiny and in exchange you have a very high performing system that has extensive library support and takes load from your primary database.
If you want to cache API responses, for example, could you do that in Postgres? Sure. And you could support things like TTLs with cron jobs sweeping stale cache values. Or you can just use redis.
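A sketch of that cron-swept Postgres cache, with the table name and schema as assumptions:

```python
# Sketch of the cron-swept cache table. Assumed schema:
#   CREATE UNLOGGED TABLE api_cache (key text PRIMARY KEY,
#                                    value jsonb, expires_at timestamptz);
import psycopg2

conn = psycopg2.connect("dbname=app")

def cache_set(key, value, ttl_seconds=300):
    with conn, conn.cursor() as cur:
        cur.execute("""
            INSERT INTO api_cache (key, value, expires_at)
            VALUES (%s, %s, now() + make_interval(secs => %s))
            ON CONFLICT (key) DO UPDATE
                SET value = EXCLUDED.value, expires_at = EXCLUDED.expires_at
        """, (key, value, ttl_seconds))

def sweep_expired():                    # the cron-job part
    with conn, conn.cursor() as cur:
        cur.execute("DELETE FROM api_cache WHERE expires_at < now()")
```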
Advisory locks are cool and useful. They can be a little problematic if you want something like PgBouncer, and you're stuck between session advisory locks and transaction interleaving.
Having separate systems has downsides (network calls, availability, domain knowledge), but the tradeoffs with Redis of all things are pretty low.
It's one less thing to manage in production. Start with postgresql then add niche systems for performance, scaling, cost.
This is a fairly old article, but this has become a very common pattern. For 90% of projects that just need a task queue for sending email or generating reports, you're not going to be doing millions of messages a second and anything that simplifies your stack is worth looking at. I've used these patterns so often to work around issues encountered with Celery that I've split them into their own framework, https://github.com/TkTech/chancy (feedback appreciated ;)
There are lots of these, several of which are commercial ventures, so there's definite interest behind the approach.
Thank you for sharing this. I’ve been trying to find a simple task queue and Huey is OK, but since I’m on Postgres and the tasks I’m queuing run like once an hour, anything Redis-based has always felt like overkill.
Another one for Node is PG-Boss (https://github.com/timgit/pg-boss)
PGQueuer provides job queuing, locking, and real-time notifications using PostgreSQL's FOR UPDATE SKIP LOCKED and LISTEN/NOTIFY. It's a minimalist alternative for those already using PostgreSQL, avoiding the need for Redis.
- https://github.com/janbjorge/PGQueuer
Disclaimer: I'm the author.
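For readers who haven't seen it, the core of the pattern all these libraries share is small enough to sketch with psycopg2 (the jobs table — id, payload, status — is illustrative):

    import psycopg2

    conn = psycopg2.connect("dbname=app")

    def claim_one_job():
        # The whole block is one transaction; the row lock is held until commit.
        with conn, conn.cursor() as cur:
            cur.execute("""
                SELECT id, payload FROM jobs
                WHERE status = 'pending'
                ORDER BY id
                LIMIT 1
                FOR UPDATE SKIP LOCKED  -- concurrent workers skip rows already claimed
            """)
            job = cur.fetchone()
            if job:
                cur.execute("UPDATE jobs SET status = 'done' WHERE id = %s", (job[0],))
            return job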
I love Postgres but it has some limitations:
- Want a KV store? Do you know autovacuum? Do you know the connection pool limit? Do you want throughput or safety?
- Want a queue? Is it sequential? Is it throttled? Is it fan-out? Is it split by topic?
- Want pub/sub? Do you care about more-than-once delivery? Do you care about losing messages? Do you care about replaying messages?
- Want a lock? Do you know the connection pool limit (again)? Do you know about statement_timeout?
Yes, you can solve almost all issues listed above, but it's not that trivial.
Also: Will your implementation fall over if there's a long running transaction that stops vacuum from removing tuples?
This. It's like the only thing the market hasn't done yet is replace my kids' nanny and make my wife jealous.
A big deal-breaker for Postgres pub/sub is that messages have a maximum size of 8000 bytes. The recommended workaround is to put the data in a table and send IDs, but then you have to garbage-collect that table if you don't want rows living forever, and you're adding operations to each message.
Obviously this can be fine, but for plenty of Redis use cases it really isn't equivalent with that limit.
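The workaround looks roughly like this with psycopg2 (the table, channel, and payload are illustrative). Only the id crosses NOTIFY, so the 8000-byte limit never bites, at the cost of an extra write and read per message:

    import select
    import psycopg2

    # Producer: store the big payload, notify with just the id.
    conn = psycopg2.connect("dbname=app")
    with conn, conn.cursor() as cur:
        cur.execute("INSERT INTO events (body) VALUES (%s) RETURNING id",
                    ("...a payload far larger than 8000 bytes...",))
        event_id = cur.fetchone()[0]
        cur.execute("SELECT pg_notify('events', %s)", (str(event_id),))

    # Consumer (a separate process): receive the id, fetch the body.
    listener = psycopg2.connect("dbname=app")
    listener.set_session(autocommit=True)
    with listener.cursor() as cur:
        cur.execute("LISTEN events")
        select.select([listener], [], [])      # block until a notification arrives
        listener.poll()
        while listener.notifies:
            note = listener.notifies.pop()
            cur.execute("SELECT body FROM events WHERE id = %s", (int(note.payload),))
            body = cur.fetchone()[0]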
Queuing, locking, pub/sub: check.
Caching, often the most important application of Redis: oops.
Updates in Postgres are notoriously more expensive than inserts (they generate dead tuples and require vacuuming), and durability guarantees, which don't matter for caching, make writes significantly slower.
Automatic expiration is very convenient and fool-proof.
You can disable most of what you don't need, e.g. by making a table unlogged: https://www.postgresql.org/docs/current/sql-createtable.html.... Synchronous commits, autovacuum, etc. can also be turned off. Redis of course is still going to be faster, just not in a way that the median company should care about.
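Turning those knobs is a one-time setup, sketched here with psycopg2 (the table name and schema are made up):

    import psycopg2

    conn = psycopg2.connect("dbname=app")
    with conn, conn.cursor() as cur:
        # Unlogged: no WAL for this table, so writes are much cheaper;
        # contents vanish on a crash, which a cache is happy to accept.
        cur.execute("""
            CREATE UNLOGGED TABLE IF NOT EXISTS cache (
                key   text PRIMARY KEY,
                value text
            )
        """)
        # Don't wait for the WAL flush at commit, for this session only.
        cur.execute("SET synchronous_commit = off")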
Let's see pgsql deal with 15,000 client connections.
Just use connection pooling, which is available point/click as AWS RDS Proxy.
It’s really shocking how wide the difference is here between Postgres and MySQL.
One of the PlanetScale guys did a podcast and said that at GitHub every MySQL instance is doing 50k+ connections, while if you need more than 100 connections in Postgres you already need PgBouncer.
Why is it "shocking"?
It's just an architectural decision Postgres made a long time ago: spawn a process per connection.
It's a tradeoff like most decisions.
Back in the day, MySQL had huge issues with scaling to multi-core systems (maybe they've fixed it by now, I haven't used MySQL in a long time), while Postgres never had this problem.
When designing an application around Postgres you just have to take into account that:
1. opening a new connection is rather expensive, so you should have a connection pool
2. opening too many connections is problematic, so you should keep the size of your pool rather small and return connections back into the pool ASAP
That's it.
It's not that hard in practice.
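In SQLAlchemy terms, both rules together are a couple of arguments (the DSN is a placeholder):

    from sqlalchemy import create_engine, text

    engine = create_engine(
        "postgresql+psycopg2://app@localhost/app",
        pool_size=5,         # small steady-state pool (rule 2)
        max_overflow=10,     # allow short bursts beyond it
        pool_pre_ping=True,  # quietly replace connections that died
    )

    with engine.connect() as conn:      # borrow from the pool (rule 1)...
        conn.execute(text("SELECT 1"))
    # ...and it's returned here: not closed, not held open.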
MySQL never had a problem with scaling to multi-core systems; Linux on the other hand did have a problem and its problems became everyone's problems. In the early 2000's the answer to multi-core Linux issues was to buy a system running one of the commercial UNIX. By 2006 or so this became a non-issue for Linux ( for the most part ).
Postgres uses an old design model which predates threads; I have no idea if they have made any progress in updating their design. In the past I have heard the core devs talk about how difficult it would be to do this.
Oracle Database moved to a hybrid process/thread model at some point, this is the superior solution ( I have no idea if it was done well or not, but from standpoint of how to achieve better leverage of CPU vs IO, this is the way ).
If the PG devs had enough time/money, I am sure they would move towards a hybrid model where the focus would be on processor affinity with IO being all tied to events.
Years back (around 10), MySQL did have multiple instances of false sharing [0].
Unrelated to MySQL: I do consider using Redis in any capacity a blunder (it's likely OK for Node.js users, I suppose).
[0]: http://dimitrik.free.fr/MySQL_Connect_2013/MySQL_Perf-Connec...
Why would you do 50k+ connections if they can't be active all at once either way? Unless you have 50k+ cores and IO beefy enough not to get overwhelmed due to that.
You can have as many connections as you want, but you'll have to trade that for lower work_mem numbers, which hurts performance. Traditional advice is to keep it below 500 per PostgreSQL instance (I'd say per physical host).
I've run dozens of microservices handling thousands of requests per second with a total connection limit of around 200, most of which was still unused, all without any server-side pooler.
Because people run large numbers of front ends and workers that create a significant number of connections. it doesn't matter if they are all active.
Why would you want every "frontend" to keep an open connection all the time?
> it doesn't matter if they are all active
It does. If a connection is inactive (doesn't hold an open transaction), you should close it or return it to the pool.
So you are suggesting you close a connection between queries?
Between queries in the same transaction? No
Between transactions? Yes, absolutely
In fact, many libraries do it automatically.
For example, the SQLAlchemy docs explicitly say [0]:
> After the commit, the Connection object associated with that transaction is closed, causing its underlying DBAPI connection to be released back to the connection pool associated with the Engine to which the Session is bound.
I expect other reasonably sane libs for working with transactional databases do the same.
So, if you are doing pooling correctly, you can only run out of available connections if you want to have a lot of long running transactions.
So why would you want every one of your 50k frontends to keep an open transaction simultaneously?
[0] https://docs.sqlalchemy.org/en/20/orm/session_basics.html#co...
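The per-transaction pattern, as a sketch (the DSN and query are placeholders):

    from sqlalchemy import create_engine, text
    from sqlalchemy.orm import sessionmaker

    engine = create_engine("postgresql+psycopg2://app@localhost/app")
    Session = sessionmaker(engine)

    # Begins a transaction, commits when the block exits, and immediately
    # returns the underlying connection to the pool.
    with Session.begin() as session:
        session.execute(text("UPDATE accounts SET active = true"))
    # Between such blocks this process holds no connection open, which is
    # why even many workers can stay well under the connection limit.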
Because there's an overhead to make a connection, authenticate, set the default parameters on the connection, etc. I've never seen a framework that closed db connections between requests.
Of course, the better design is to write a nonblocking worker that can run async requests on a single connection, and not need a giant pool of blocking workers, but that is a major architecture plan that can't be added late in a project that started as blocking worker pools. MySQL has always fit well with those large blocking worker pools. Postgres less so.
As I said, you can return the connection to the connection pool.
From the perspective of keeping the number of open connections low it doesn't really matter if you close it or return to the pool, because in either case the connection becomes available to other clients.
I might not be understanding what you're pointing out here. It sounds to me like sqlalchemy is talking about a pool of connections within one process, in which case releasing back to that pool does not close the connection by that process to the database. Parent comment is talking about one connection per process with 50k processes. My comment was that you don't need that many processes if each process can handle hundreds of web requests asynchronously.
If you are saying that a connection pool can be shared between processes without pgbouncer, that is news to me.
Of course, you're right, it is not possible to share a connection pool between processes without pgbouncer.
> Parent comment is talking about one connection per process with 50k processes.
It is actually not clear what the parent comment was talking about. I don't know what exactly they meant by "front ends".
The most common design for a Web app on Linux in the last 20 years is to have a pool of worker processes, each single-threaded and ready to serve one request. The processes might be apache ready to invoke PHP, or mod-perl, or a pool of ruby-on-rails or perl or python processes receiving the requests directly. Java tends to be threads instead of processes. I've personally never needed to go past about 100 workers, but I've talked to people who scale up to thousands, and they happen to be using MySQL. I've never used pgbouncer, but understand that's the tool to reach for rather than configuring Pg to allow thousands of connections.
To articulate the article's point further: start with pg, switch to redis when the need arises.
Keep the number of moving parts as low as possible.
Redis isn't exactly hard to deploy. May as well just use it instead of planning for a switch later that may come with God knows what unforeseen side effects.
The problem with this is that eventually the database becomes everything to everyone. You start seeing the frontend UIs suffer due to errors with unrelated tables or table locking operations (like migrations).
Redis is pretty awesome. So is rabbitmq. When you’re a < 100 person startup, probably stick with postgres. Over 100 and I would seriously start considering adding more specialized platform tools.
Back in the day I'd always have Redis around for my Rails cache. We already had it for cache, so also using it for job queues and Sidekiq was a no-brainer.
These days, with M.2 disks, database-backed caching (Solid Cache) with first-class support in Rails, and Postgres able to handle job queues via things like GoodJob, the choice isn't so clear-cut to "just use Redis".
I won’t be replacing Sidekiq in our current production app any time soon, but any new project would get some serious consideration.
I would like to be able to use pgsql for application locks, but the advisory lock API is not good for that, in my opinion:
There is no namespace or table for them, meaning the single common locking space is shared by all the users/apps of your database.
The key has to be an int with a quite low limit (like 30k), if I remember correctly, so you can't use the key for namespacing.
The key is two ints, not one. I use the first int for namespacing.
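For example (psycopg2; the namespace and key values are made up):

    import psycopg2

    MY_APP = 42        # first int: this app's "namespace"
    INVOICE_JOB = 7    # second int: the resource within it

    conn = psycopg2.connect("dbname=app")
    with conn, conn.cursor() as cur:
        # Transaction-scoped: released automatically at commit/rollback,
        # which also makes it safe behind PgBouncer in transaction mode.
        cur.execute("SELECT pg_advisory_xact_lock(%s, %s)", (MY_APP, INVOICE_JOB))
        # ... do the work that must not run concurrently ...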
Updates and replication become a whole lot trickier when you start using these functionalities in Postgres.
Sure, it's possible, but I prefer sticking to the more widespread uses of Postgres.
Reminder that redis has been forked into valkey - https://valkey.io/
Pretty pointless article without performance benchmarks.
The entire point of Redis is that it trades durability and convenience for performance.
It’s generally better to avoid distributed-systems problems if you can. Postgres is very flexible, and you can get very far with it if you use all it has to offer.
A point of view from over 30 years of career: I love Postgres, and I've only come to know Redis recently. Each has great features. Every product has unique strengths in its early season, but gets fat by adding features similar to everyone else's. Implementing a feature isn't difficult, since it's software, but sometimes it makes the product lose its original strength. I've loved Postgres since v3, but it was once overtaken by MySQL because of the LAMP boom. I hope Postgres doesn't become a fat system like Oracle, and I also want Redis to take the right direction. I mean, Postgres didn't need to support pub/sub and so on.
Actually, pub/sub works best when you can commit it in a transaction along with your data changes. When they are separate systems, you get a race condition where you could commit the change but crash before the event gets published.
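Concretely (psycopg2; the orders table and channel are illustrative):

    import psycopg2

    conn = psycopg2.connect("dbname=app")
    with conn, conn.cursor() as cur:
        cur.execute("INSERT INTO orders (total) VALUES (%s) RETURNING id", (99,))
        order_id = cur.fetchone()[0]
        # NOTIFY is transactional: it is only delivered if this commits.
        cur.execute("SELECT pg_notify('orders', %s)", (str(order_id),))
    # Commit happens here. A crash before this point publishes nothing,
    # so the data change and the event can never disagree.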
Consider using SQLite on a NAS if you don't care about performance to this extent.
Answer: No.
The queue approach of `SELECT...FOR UPDATE SKIP LOCKED` is great for low throughput needs. One thing to note: as your throughput increases you will have to deal with bloat.
When processed message rows are deleted, the pages sitting on disk are not freed until vacuum runs. An online vacuum doesn't return pages to the OS, and a full vacuum (which does free the disk space) takes an exclusive lock on the table while it runs (which, if you're now dealing with high throughput, is not great).
One approach to address bloat without dealing with vacuum is to set up a timestamp-based partition on your queue table. This way you just drop old partitions, and that frees the disk space without needing vacuum.
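A sketch of that layout (psycopg2; the names and the one-day granularity are illustrative):

    import psycopg2

    conn = psycopg2.connect("dbname=app")
    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS jobs (
                id         bigserial,
                payload    text,
                created_at timestamptz NOT NULL DEFAULT now()
            ) PARTITION BY RANGE (created_at)
        """)
        cur.execute("""
            CREATE TABLE IF NOT EXISTS jobs_2024_06_01
            PARTITION OF jobs
            FOR VALUES FROM ('2024-06-01') TO ('2024-06-02')
        """)
        # Once every row in an old partition is processed, dropping it
        # returns the disk space instantly; no VACUUM involved.
        cur.execute("DROP TABLE IF EXISTS jobs_2024_05_01")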
To add some feedback as someone whose company replaced redis with PgSQL after I lobbied for it (leveraging my experience using it in bioinfo HPC beforehand):
- agreed, autovacuum is definitely something to keep a close watch on – but as long as disk I/O is fast enough (i.e. not EBS), our queue tables freed space for PgSQL to reuse even though it wasn't returned to the OS (the non-full VACUUM my GP mentioned), so in the end, assuming the queue did not grow infinitely, actual disk usage stabilized, and we never had to VACUUM FULL outside of buggy (on our side) situations;
- high throughput is relative: we ran in the range of hundreds to thousands of tasks per second without any issue or any particular tuning, so sure, PgSQL will not hold the load if you are Google, but I'd bet it satisfies the needs of all but the last centile of companies;
- I like your idea about timestamp-defined partitions, I will definitely keep this one in mind if the need arises.
Now on the pros:
- being able to store & manipulate basically all the data handled by our stack from a single source sounds like a minute detail, but actually helps a lot when we have to debug an issue;
- being able to use virtually any language and still access all the persistent data in our stack from a single source turned out to be an unexpected advantage when developing a quick'n'dirty e.g. python/nim/racket script to explore data, debug, generate stats, etc.;
- our PgSQL machines ran much better “by themselves” than we feared at first, and finally were far from being the main point of contention on day-to-day ops – in the end we just altered a couple of settings and just... let them be;
- PgSQL provides a lot of tools to inspect it live, which provides additional help to identify bottlenecks in (i) your server setup and (ii) your stack;
- PgSQL support of JSON/JSONB is actually very good, and very helpful to store that hard-to-normal-form-ize last part of your data.
Great response!
> ”PgSQL provides a lot of tools to inspect it live”
Any particular tools you like for this?
“Tool” may not be the right word, but being able to live-monitor the vacuumer, to see which table is taking how much space and why, to track min/mean/max execution time of queries and, more generally, all the internal metrics that PgSQL exposes through a couple of SELECTs has been a boon when debugging or optimizing.
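A few of those SELECTs, for the curious (pg_stat_statements requires the extension to be enabled, and these column names are the PostgreSQL 13+ ones):

    import psycopg2

    conn = psycopg2.connect("dbname=app")
    with conn.cursor() as cur:
        # Live progress of any vacuum currently running.
        cur.execute("SELECT relid::regclass, phase FROM pg_stat_progress_vacuum")
        # The ten biggest tables, indexes and TOAST included.
        cur.execute("""
            SELECT relname, pg_size_pretty(pg_total_relation_size(oid))
            FROM pg_class WHERE relkind = 'r'
            ORDER BY pg_total_relation_size(oid) DESC LIMIT 10
        """)
        # Min/mean/max execution time per normalized query.
        cur.execute("""
            SELECT query, min_exec_time, mean_exec_time, max_exec_time
            FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10
        """)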
Plain vacuum is generally all you want. At that point the space is freed for new inserts and updates to the table. Full vacuum is for giving the space back to the OS, which rarely makes sense if you'll still be writing more rows. In cases where freeing to the OS really is what you need, though, use something like pg_repack.
"Do you need pgsql ? Redis does queuing, locking and pub/sub"
(I've not read the article: the title alone seems like an oranges-vs-apples sales pitch)
Do you need oranges? Apples have seeds, skin, and grow on trees.
Redis has many niches where it is more suitable than a database. In particular, the low latency and simple concurrency model make it a perfect choice for building shared token buckets, for example. For simple operations that need to be executed many times at low latency, SQL is usually not the best choice. And sometimes, you really just need K/V and don't want to deal with schema migrations, DBA chores, debugging auto vacuum...
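A shared limiter of that kind fits in a few lines of redis-py. This sketch is the fixed-window approximation of a token bucket (a precise bucket would refill tokens via a Lua script); the key, limit, and window are made up:

    import redis

    r = redis.Redis()

    def try_acquire(bucket="api:global", limit=100, window_s=1):
        key = f"bucket:{bucket}"
        count = r.incr(key)          # atomic across all clients
        if count == 1:
            r.expire(key, window_s)  # first request starts the window
        return count <= limit

    if try_acquire():
        pass  # proceed with the rate-limited call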
>Advisory locks allow you to leverage the same locking engine PostgreSQL uses internally for your own application-defined purposes.
That is like telling me that I can get across the river in a nuclear submarine instead of walking over the footbridge. PostgreSQL locks cause me plenty of headaches already without my application logic being dependent on them.
Disagree. If you need a global locking mechanism that all of your back ends can agree on without setting up an additional service, advisory locks are a totally valid use case. All of your back ends already have access to the DB, why bring in another whole service for just the locking that is required to implement some small thing in your giant SaaS product? Just lean on PG and you can go really far.
I mean, yes, but a fairer comparison would be Redis vs. Memcached. Redis provides functionality missing from Memcached; it is not meant to be an alternative to PostgreSQL. This article reminds me of one I read about configuring MySQL to act like a NoSQL datastore, back when MongoDB hit the limelight. Apples to oranges.