> What I had missed is that we deployed a new internal service last week that sent less than three GetPostRecord requests per second, but it did sometimes send batches of 15-20 thousand URIs at a time. Typically, we'd probably be doing between 1-50 post lookups per request.
That’ll do it.
The incredible part about this is that, because their backend is all TCP/IP, they were literally exhausting the ports by leaving all 65k of them in TIME_WAIT, and the workaround was to start randomizing the localhost address to give themselves another trillion ports or so.
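To make that workaround concrete, here's a minimal sketch (assuming Linux, where the whole 127.0.0.0/8 block routes to loopback; purely illustrative, not their actual code) of binding the client side of a connection to a random loopback address. A TCP connection is identified by the (src_ip, src_port, dst_ip, dst_port) 4-tuple, so ~16.7M loopback addresses times ~64k ephemeral ports gives roughly a trillion distinct local endpoints instead of one address's 64k:

    import random
    import socket

    def connect_from_random_loopback(dst_port: int) -> socket.socket:
        # Pick a random source address anywhere in 127.0.0.0/8.
        src_ip = "127.{}.{}.{}".format(
            random.randint(0, 255), random.randint(0, 255), random.randint(1, 254)
        )
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.bind((src_ip, 0))  # port 0: let the kernel pick an ephemeral port
        sock.connect(("127.0.0.1", dst_port))
        return sock

Each TIME_WAIT entry then pins a different (src_ip, src_port) pair, so no single address's ephemeral range gets exhausted.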
Ahh, the three relevant numbers in development: 0, 1, and infinity.
Zero, one, many, many thousands.
Less than ideal, if I had to be frank.
I don't really understand this architecture, but I thought Bluesky was distributed like Mastodon? How can it have an outage?
This writeup is useful for backend engineers: https://atproto.com/articles/atproto-for-distsys-engineers
The simple answer is that atproto works like the web & search engines, where the apps aggregate content from the distributed accounts. So the proper analogy here would be Yahoo going down in 1999.
This is a fantastic write-up, thanks for sharing!
Google and MSN Search were already available at the time. Also, websites used to publish webrings, and there were IRC and forums for asking people about things.
It’s more of a concept of a plan for being distributed. I even went through the trouble of hosting my own PDS, and still I was unable to use the service during the outage.
Mastodon infra can have outages, too.
It's just confined to one instance if it goes down, not all of Mastodon.
Tell us more about this buggy "new internal service" that's scraping batch data :P
> The timing of these log spikes lined up with drops in user-facing traffic, which makes sense. Our data plane heavily uses memcached to keep load off our main Scylla database, and if we're exhausting ports, that's a huge problem.
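For anyone who wants to check for this condition on their own boxes, here's a quick sketch (my own, not from the postmortem) that counts TIME_WAIT sockets by reading /proc/net/tcp on Linux; column 4 ("st") holds the state in hex, and 06 is TIME_WAIT. If the count sits near the size of your ephemeral port range, you're in the territory this writeup describes:

    # Count IPv4 sockets currently in TIME_WAIT (Linux only).
    def count_time_wait() -> int:
        with open("/proc/net/tcp") as proc_tcp:
            next(proc_tcp)  # skip the header row
            return sum(1 for line in proc_tcp if line.split()[3] == "06")

    if __name__ == "__main__":
        print(count_time_wait(), "sockets in TIME_WAIT")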
I expect this is common.
Did all 3 users notice?
Great write-up... curious about the RCA. Thanks!
Thank you for the post mortem on this outage.
nostr never goes down
If nostr went down would people even notice?
probably not
All support to other decentralizers, but nothing never goes down.
1000x redundancy makes it vanishingly unlikely. Although I know we're due for a pole shift, so all bets are off, I suppose.
Light blue on a dark blue background? That is a new one. I have seen grey text on light grey, but blue on blue?
The article does work in lynx; at least I can read it.