> During this incident, we discovered we had crossed a scale threshold where our log ingestion pipeline was being rate-limited and quietly discarding logs. Ironically, we ended up with less information as a result, which made it significantly harder to reconstruct what was actually happening.
Last year they posted about using New Relic, Datadog, and Grafana. Would this ‘silent deletion of log data due to quota’ problem be characteristic of any one of them in particular, or is it something we have to watch out for with all of them?
In general you do need to be aware of any agent-level rate limits as well as any ingestion limits from the provider. We do some pretty careful sampling and aggregation for most metrics, logs, and traces we store, and as mmcclure said, in this case it was the rules on the node agents themselves throwing the errors. The logging volume on some of the critical paths of the service got high enough that logs were dropped due to our configured rate limits.
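To make the failure mode concrete, here is a minimal sketch of an agent-side rate limit, assuming a simple token bucket. It is not the actual agent configuration from the incident; the class name `RateLimitedLogger` and its parameters are hypothetical. The point is only that a limiter which counts its drops makes the loss visible instead of silent.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RateLimitedLogger:
    """Hypothetical token-bucket limiter in front of a log shipper."""
    rate_per_sec: float            # tokens refilled per second (assumed knob)
    burst: int                     # bucket capacity (assumed knob)
    clock: Callable[[], float] = time.monotonic
    emitted: List[str] = field(default_factory=list)
    dropped: int = 0               # exported as a metric, drops stop being silent

    def __post_init__(self) -> None:
        self._tokens = float(self.burst)
        self._last = self.clock()

    def log(self, line: str) -> bool:
        """Forward the line if a token is available, otherwise count a drop."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        refill = (now - self._last) * self.rate_per_sec
        self._tokens = min(float(self.burst), self._tokens + refill)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            self.emitted.append(line)
            return True
        self.dropped += 1
        return False
```

Alerting on a non-zero `dropped` counter is one way to find out you crossed the threshold before you need the logs.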
We don't use New Relic or Datadog (and never have, afaik), so I'm not sure what post you could be referring to for those two? We have talked publicly about our Grafana use, though, and going from an in-house stack to their cloud product. Actual OP can probably hop in later with a better answer, but it was hitting rate limits on the logging agent, not the logging system.
Ah! Thank you, that makes a lot more sense. I misunderstood https://data.mux.com/blog/off-with-our-head-how-we-re-making... as suggesting that Mux was making Mux infrastructure ‘play nice with’ the various providers.
Why bother transcoding on the fly? Storage is cheaper than CPU and the work it takes to determine what needs encoding is excessive.
It implies that you guys are generating the playlists on the fly, tracking the client requests, then feeding that over to your transcoder - which then needs to get the original, seek, and transcode. Why bother?
Mux founder here :wave:
Two answers.
First, it does save money. A meaningful percentage of videos on the internet are never watched in the first place, and an even larger percentage are watched soon after upload and never watched again. We're able to prune unwatched renditions, and if they happen to be requested years later, they're still playable. Transcoding on the fly lets us save both CPU and storage.
Second, it is ridiculously fast. Our median time-to-publish for a 5-20 minute video is 9 seconds. We had a customer (God bless them) complaining a few months ago that it took us something like 40 seconds to transcode a 40 minute video, which actually was slower than normal for us. If you do an async transcode up front, you're looking at 20 minutes, not <1 minute.
Blog post on this: https://www.mux.com/blog/how-to-transcode-video-100x-faster-...
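For anyone who wants the idea in code, here is a rough sketch of the "transcode on first request, prune what nobody watches" pattern described above. It is an illustration under assumptions, not Mux's implementation: `JustInTimeRenditions`, `get_segment`, and `prune` are hypothetical names, and the `transcode` callable stands in for whatever fetches the original, seeks, and encodes one segment.

```python
from typing import Callable, Dict, Tuple

# (video_id, rendition, segment_index), all hypothetical identifiers
Key = Tuple[str, str, int]


class JustInTimeRenditions:
    """Lazily transcodes segments and forgets the ones that go unwatched."""

    def __init__(self, transcode: Callable[[str, str, int], bytes]) -> None:
        self._transcode = transcode          # stand-in for the real encoder
        self._cache: Dict[Key, bytes] = {}
        self._last_access: Dict[Key, float] = {}
        self.transcode_calls = 0

    def get_segment(self, video_id: str, rendition: str, index: int,
                    now: float) -> bytes:
        key = (video_id, rendition, index)
        if key not in self._cache:
            # First request (or first since pruning): pay the CPU cost now.
            self.transcode_calls += 1
            self._cache[key] = self._transcode(video_id, rendition, index)
        self._last_access[key] = now
        return self._cache[key]

    def prune(self, now: float, max_idle: float) -> int:
        """Drop segments idle for longer than max_idle; return how many."""
        stale = [k for k, t in self._last_access.items() if now - t > max_idle]
        for k in stale:
            del self._cache[k]
            del self._last_access[k]
        return len(stale)
```

A pruned segment is simply transcoded again the next time it is requested, which is why an old, unwatched video can stay playable without its renditions being stored.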
“We didn’t handle errors, didn’t have logs, and now we do cuz next time” saved you a few mins