I clicked because of the bait-y title, but ended up reading pretty much the whole post, even though I have no reason to be interested in ZFS. (I skipped most of the stuff about logs...) Everything was explained clearly, I enjoyed the writing style, and the mobile CSS theme was particularly pleasing to my eyes. (It appears to be Pixyll theme with text set to the all-important #000, although I shouldn't derail this discussion with opinions on contrast ratios...)
For less patient readers, note that the concise summary is at the bottom of the post, not the top.
We used to make extensive use of, and gained huge benefit from, dedup in ZFS. The specific use case was storage for VMWare clusters where we had hundreds of Linux and Windows VMs that were largely the same content. [this was pre-Docker]
"And this is the fundamental issue with traditional dedup: these overheads are so outrageous that you are unlikely to ever get them back except on rare and specific workloads."
This struck me as a very odd claim. I've worked with Pure and Dell/EMC arrays and for VMWare workloads they normally got at least 3:1 dedupe/compression savings. Only storing one copy of the base VM image works extremely well. Dedupe/compression works really well on syslog servers where I've seen 6:1 savings.
The effectiveness of dedupe is strongly affected by the size of the blocks being hashed: the smaller the block, the better. As the blocks get smaller the odds of having a matching block grow rapidly. In my experience 4KB is my preferred block size.
Couple of comments. Firstly, you are talking about highly redundant information when referencing VM images (e.g. the C drive on all Windows Server images will be virtually identical), whereas he was using his own laptop contents as an example.
Secondly, I think you are conflating two different features: compression & de-duplication. In ZFS you can have compression turned on (almost always worth it) for a pool, but still have de-duplication disabled.
Fair point. My experience is with enterprise storage arrays and I have always used dedupe/compression at the same time. Dedupe is going to be a lot less useful on single computers.
I consider dedupe/compression to be two different forms of the same thing. Compression reduces short-range duplication while deduplication reduces long-range duplication of data.
Base VM images would be a rare and specific workload. One of the few cases where dedupe makes sense. However you are likely using better strategies like block or filesystem cloning if you are doing VM hosting off a ZFS filesystem. Not doing so would be throwing away one of its primary differentiators as a filesystem in such an environment.
General purpose fileserving or personal desktop/laptop use generally has very few duplicated blocks and is not worth the overhead. Backups are hit or miss depending on both how the backups are implemented, and if they are encrypted prior to the filesystem level.
Compression is a totally different thing and current ZFS best practice is to enable it by default for pretty much every workload - the CPU used is barely worth mentioning these days, and the I/O savings can be considerable even ignoring any storage space savings. Log storage is likely going to see a lot better than 6:1 savings if you have typical logging, at least in my experience.
Certainly it makes sense not to have deep copies of VM base images, but deduplication is not the right way to do it in ZFS. Instead, you can clone the base image, and until changes are made it will take almost no space at all. This is thanks to the copy-on-write nature of ZFS.
ZFS deduplication instead tries to find existing copies of data that is being written to the volume. For some use cases it could make a lot of sense (container image storage maybe?), but it's very inefficient if you already know some datasets to be clones of the others, at least initially.
I haven't tried it myself, but the widely quoted number for old ZFS dedup is that you need 5GB of RAM for every 1TB of disk space. Considering that 1 TB of disk space currently costs about $15 and 5GB of server RAM about $25, you need a 3:1 dedupe ratio just to break even.
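To spell out the arithmetic behind that break-even figure (a back-of-the-envelope sketch using the prices just quoted, and assuming the dedup table scales with physical capacity): with a dedup ratio of r, each logical TB needs 1/r TB of disk plus the RAM to index it, so

    \frac{\$15_{\text{disk}} + \$25_{\text{RAM}}}{r} \le \$15_{\text{disk}}
    \quad\Longrightarrow\quad r \ge \frac{40}{15} \approx 2.7,\ \text{i.e. roughly 3:1}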
If your data is a good fit you might get away with 1GB per TB, but if you are out of luck the 5GB might not even be enough. That's why the article speaks of ZFS dedup having a small sweet spot that your data has to hit, and why most people don't bother.
Other file systems tend to prefer offline dedupe, which has more favorable economics.
Why does it need so much RAM? It should only need to store the block hashes which should not need anywhere near that much RAM. Inline dedupe is pretty much standard on high-end storage arrays nowadays.
Assuming something reasonable like 20TB Toshiba MG10 HDDs and 64GB DDR4 ECC RAM, quick googling suggests that 1TB of disk space uses about 0.2-0.4W of power (0.2 idle, 0.4 while writing) and 5GB of RAM about 0.3-0.5W. So your break-even on power comes a bit earlier depending on the access pattern, but it's in the same ballpark.
Not just rack space. At a certain number of disks you also need to get a separate server (chassis + main board + cpu + ram) to host the disks. Maybe you need that for performance reasons anyway. But saving disk space and only paying for it with some RAM sounds cost effective.
VMs are known to benefit from dedupe so yes, you'll see benefits there. ZFS is a general-purpose filesystem not just an enterprise SAN so many ZFS users aren't running VMs.
> Dedupe/compression works really well on syslog
I apologize for the pedantry but dedupe and compression aren't the same thing (although they tend to be bundled in the enterprise storage world). Logs are probably benefiting from compression not dedupe and ZFS had compression all along.
They are not the same thing, but when you boil it down to the raw math, they're not identical twins - they're absolutely fraternal twins.
Both are trying to eliminate repeating data, it's just the frame of reference that changes. Compression in this context is operating on a given block or handful of blocks. Deduplication is operating on the entire "volume" of data. "Volume" having a different meaning depending on the filesystem/storage array in question.
Well put. I like to say compression is just short range dedupe. Hash-based dedupe wouldn't be needed if you could just do real-time LZMA on all of the data on a storage array, but that just isn't feasible and hash-based dedupe is a very effective compromise.
Is "paternal twins" a linguistic borrowing of some sort? It seems a relatively novel form of what I've mostly seen referred to as monozygotic / 'identical' twins. Searching for some kind of semi-canonical confirmation of its widespread use turns up one, maybe two articles where it's treated as an orthodox term, and at least an equal number of discussions admonishing its use.
If anything I would expect the term “maternal” twin to be used, as whether or not a twin is monozygotic or “identical” depends on the number of eggs from the mother.
Even with the rudimentary Dedup features of NTFS on a Windows Hyper-V Server all running the same base image I can overprovision the 512GB partition to almost 2 GB.
You need to be careful and do staggered updates in the VMs or it'll spectacularly explode but it's possible and quite performant for less than mission critical VMs.
I think you mean 2TB volume? But yes, this works. But also: if you're doing anything production, I'd strongly recommend doing deduplication on the back-end storage array, not at the NTFS layer. It'll be more performant and almost assuredly have better space savings.
For text-based logs I'm almost entirely sure that just using compression is more than enough. ZFS supports compression natively at the block level and it's almost always turned on. Trying to use dedup alongside compression for syslog most likely will not yield any benefits.
That makes sense considering Advanced Format harddrives already have a 4K physical sector size, and if you properly low-level format them (to get rid of the ridiculous Windows XP compatibility) they also have 4K logical sector size. I imagine there might be some real performance benefits to having all of those match up.
In the early days of VMware people had a lot of VMs that were converted from physical machines, and this caused a nasty alignment issue between the VMDK blocks and the blocks on your storage array. The effect was to always add one block to every read operation, and in the worst case reading one block would double the load on the storage array. On NetApp this could only be fixed when the VM wasn't running.
> In my experience 4KB is my preferred block size.
This probably has something to do with the VM's filesystem block size. If you have a 4KB filesystem and an 8KB file, the file might be fragmented differently but is still the same 2x4KB blocks just in different places.
Now I wonder if filesystems zero the slack space at the end of the last block in a file in hopes of better host compression, vs. leaving whatever stale bytes were already there.
I figured he was mostly talking about using dedup on your work (dev machine) computer or family computer at home, not on something like a cloud or streaming server or other back end type operations.
I built a very simple, custom syslog solution, a syslog-ng server writing directly to a TimescaleDB hypertable (https://www.timescale.com/) that is then presented as a Grafana dashboard, and I am getting a 30x compression ratio.
Logrotate, cron, or simply having something like Varnish or Apache log to a pipe into something like bzip2 or zstd. The main question is whether you want to easily access the current stream - e.g. I had uncompressed logs being forwarded to CloudWatch, so I had daemons logging to timestamped files with a post-rotate compression command which would run after the last write.
That's one wrinkle of using storage-based dedupe/compression: you need to avoid doing compression on the client so you aren't compressing already-compressed data. When a company I worked at first got their Pure array they were using Windows file compression heavily and had to disable it, as the storage array was now doing it automatically.
I want "offline" dedupe, or "lazy" dedupe that doesn't require the pool to be fully offline, but doesn't happen immediately.
Because:
> When dedup is enabled [...] every single write and free operation requires a lookup and then a write to the dedup table, regardless of whether or not the write or free proper was actually done by the pool.
To me, this is "obviously" the wrong approach in most cases. When I'm writing data, I want that write to complete as fast as possible, even at the cost of disk space. That's why I don't save files I'm actively working on in 7zip archives.
But later on, when the system is quiet, I would love for ZFS to go back and figure out which data is duplicated, and use the BRT or whatever to reclaim space. This could be part of a normal scrub operation.
Lazy/off-line dedup requires block pointer rewrite, but ZFS _cannot_ and will not ever get true BP rewrite because ZFS is not truly a CAS system. The problem is that physical locations are hashed into the Merkle hash tree, and that makes moving physical locations prohibitively expensive as you have to rewrite all the interior nodes on the way to the nodes you want to rewrite.
A better design would have been to split every node that has block pointers into two sections, one that has only logical block pointers and all of whose contents gets hashed into the tree, and one that has only the physical locations (as if it were a cache) of the corresponding logical block pointers in the first section, with the second section _not_ hashed into the Merkle hash tree. Then BP rewrite would only require re-writing blocks that are not part of the Merkle hash tree.
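A rough sketch of the split being described, with entirely invented names and layouts (nothing like this exists in ZFS today):

    /* Hypothetical indirect-node layout for a more CAS-like design.
     * Only `logical` is fed into the Merkle hash; `phys` is treated as a
     * cache of current locations and can be rewritten without touching
     * any parent hashes - which is what would make BP rewrite cheap. */
    #include <stdint.h>

    struct logical_bp {
        uint8_t  checksum[32];   /* hash of the child block's contents */
        uint64_t logical_id;     /* stable identifier, never moves     */
    };

    struct physical_loc {
        uint64_t vdev;           /* which device the block lives on    */
        uint64_t offset;         /* where it currently lives           */
    };

    struct indirect_node {
        struct logical_bp   logical[128]; /* hashed into the Merkle tree */
        struct physical_loc phys[128];    /* NOT hashed; free to rewrite */
    };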
But as it is you can't get BP rewrite to work on ZFS, so you can't get what you're asking for.
Well... maybe. Perhaps on read hash mismatch ZFS could attempt to locate the pointed-to block in the dedup table using the hash from the pointer. Then ZFS could reallocate the dedup'ed block. The price you'd pay then is one pointless read -- not too bad. The impossibility of BP rewrite generally leads to band-aids like this.
> I just wish we had "offline" dedupe, or even "lazy" dedupe...
This is the Windows dedupe methodology. I've used it pretty extensively and I'm generally happy with it when the underlying hardware is sufficient. It's very RAM and I/O hungry but you can schedule and throttle the "groveler".
I have had some data eating corruption from bugs in the Windows 2012 R2 timeframe.
The neat thing about inline dedupe is that if the block hash already exists then the block doesn't have to be written. This can save a LOT of write IO in many situations. There are even extensions where a file copy between two VMs on a dedupe storage array will not actually copy any data but just increment the original block's reference counter. You will see absurd TB/s write speeds in the OS, it is pretty cool.
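As a toy model of why the hit path is so cheap (generic inline-dedup logic with invented names, not ZFS's actual write path; a real system would use a cryptographic hash and an on-disk table):

    #include <stdint.h>
    #include <stdio.h>

    #define BLOCK 4096
    #define TABLE 1024                 /* tiny in-memory stand-in for the DDT */

    struct ddt_entry { uint64_t hash; uint64_t location; uint64_t refcnt; };
    static struct ddt_entry ddt[TABLE];
    static uint64_t next_free_block;   /* stand-in for "the disk" */

    static uint64_t hash_block(const uint8_t *buf)   /* FNV-1a, toy only */
    {
        uint64_t h = 1469598103934665603ULL;
        for (size_t i = 0; i < BLOCK; i++) { h ^= buf[i]; h *= 1099511628211ULL; }
        return h;
    }

    /* returns the physical location the logical block ends up pointing at */
    static uint64_t dedup_write(const uint8_t *buf)
    {
        uint64_t h = hash_block(buf);
        struct ddt_entry *e = &ddt[h % TABLE];   /* one table lookup per write */
        if (e->refcnt && e->hash == h) {         /* hit: no data I/O at all    */
            e->refcnt++;
            return e->location;
        }
        uint64_t loc = next_free_block++;        /* miss: actually write it    */
        *e = (struct ddt_entry){ .hash = h, .location = loc, .refcnt = 1 };
        return loc;                              /* (collisions just evict in this toy) */
    }

    int main(void)
    {
        uint8_t a[BLOCK] = {1}, b[BLOCK] = {1};  /* identical contents */
        printf("first copy  -> block %llu\n", (unsigned long long)dedup_write(a));
        printf("second copy -> block %llu\n", (unsigned long long)dedup_write(b));
        return 0;
    }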
This is only a win if the dedupe table fits in RAM; otherwise you pay for it in a LOT of read IO. I have a storage array where dedupe would give me about a 2.2x reduction in disk usage, but there isn't nearly enough RAM for it.
This array is a bit long-in-the-tooth and only has 192GB of RAM, but a bit over 40TB of net storage, which would be a 200GB dedup table size using the back-of-the-envelope estimate of 5GB/TB.
A more precise calculation on my actual data shows that today's data would allow the dedup table to fit in RAM, but if I ever want to actually use most of the 40TB of storage, I'd need more RAM. I've had a ZFS system swap dedup to disk before, and the performance dropped to approximately zero; fixing it was a PITA, so I'm not doing that anytime soon.
Be aware that ZFS performance rapidly drops off north of 80% utilization; when you head into 90%, you will want to buy a bigger array just to escape the pain.
The ability to alter existing snapshots, even in ways that fully preserve the data, is extremely limited in ZFS. So yes that would be great, but if I was holding my breath for Block Pointer Rewrite I'd be long dead.
You don't need it to dedup writable files. But redundant copies in snapshots are stuck there as far as I'm aware. So if you search for duplicates every once in a while, you're not going to reap the space savings until your snapshots fully rotate.
The issue with this, in my experience, is that at some point that pro (exactly, and literally, only one copy of a specific bit of data despite many apparent copies) can become a con if there is some data corruption somewhere.
Sometimes it can be a similar issue in some edge cases performance wise, but usually caching can address those problems.
Efficiency being the enemy of reliability, sometimes.
Redundant copies on a single volume are a waste of resources. Spend less on size, spend more on an extra parity drive, or another backup of your most important files. That way you get more safety per gigabyte.
Notably, having to duplicate all data x2 (or more) is more of a waste than having 2 copies of a few files - if full drive failure is not the expected failure mode, and not all files should be protected this heavily.
It’s why metadata gets duplicated in ZFS the way it does on all volumes.
Having seen this play out a bunch of times, it isn’t an uncommon need either.
Well I didn't suggest that. I said important files only for the extra backup, and I was talking about reallocating resources not getting new ones.
The simplest version is the scenario where turning on dedup means you need one less drive of space. Convert that drive to parity and you'll be better off. Split that drive from the pool and use it to backup the most important files and you'll be better off.
If you can't save much space with dedup then don't bother.
> There was an implication in your statement that volume level was the level of granularity, yeah?
There was an implication that the volume level was the level of granularity for adding parity.
But that was not the implication for "another backup of your most important files".
> I’m noting that turning on volume-wide dedup can have the con that you can’t choose (but it looks like you can!) to manually duplicate data.
You can't choose just by copying files around, but it's pretty easy to set copies=2 on specific directories. And I'd say that's generally a better option, because it keeps your copies up to date at all times. Just make sure snapshots are happening, and files in there will be very safe.
Manual duplication is the worst kind of duplication, so while it's good to warn people that it won't work with dedup on, actually losing the ability is not a big deal when you look at the variety of alternatives. It only tips the balance in situations where dedup is near-useless to start with.
The author of the new file-based block cloning code had this in mind. A background process would scan files, identify dupes, delete the dupes and replace them with cloned versions.
There are of course edge cases to consider to avoid data loss, but I imagine it might come soon, either officially or as a third-party tool.
I get the feeling that a hypothetical ZFS maintainer reading some literature on concurrent mark and sweep would be... inspirational, if not immediately helpful.
You should be able to detect duplicates online. Low priority sweeping is something else. But you can at least reduce pause times.
So this is great if you're just looking to deduplicate read-only files. Less so if you intend to write to them. Write to one and they're both updated.
Anyway. Offline/lazy dedup (not in the ZFS dedup sense) is something that could be done in userspace, at the file level, on any filesystem that supports reflinks. When a tool like rdfind finds a duplicate, instead of replacing with a hardlink, create a copy of the file with `copy_file_range(2)` and let the filesystem create a reflink to it. Now you've got space savings, and they're two separate files, so if one is written to, the other remains the same.
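For what it's worth, the userspace copy step being proposed is about this much code (a hedged sketch; as the replies below point out, copy_file_range(2) only allows the filesystem to reflink, it doesn't guarantee it):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc != 3) { fprintf(stderr, "usage: %s SRC DEST\n", argv[0]); return 1; }

        int in = open(argv[1], O_RDONLY);
        if (in < 0) { perror("open src"); return 1; }
        struct stat st;
        if (fstat(in, &st) < 0) { perror("fstat"); return 1; }

        int out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, st.st_mode & 0777);
        if (out < 0) { perror("open dest"); return 1; }

        off_t remaining = st.st_size;
        while (remaining > 0) {
            /* the filesystem may reflink instead of copying, but need not */
            ssize_t n = copy_file_range(in, NULL, out, NULL, remaining, 0);
            if (n < 0) { perror("copy_file_range"); return 1; }
            if (n == 0) break;
            remaining -= n;
        }
        /* a dedup tool along the lines described above would then rename
         * DEST over the duplicate it found */
        return 0;
    }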
How would this work if I have snapshots? Wouldn’t then the version of the file I just replaced still be in use there? But maybe I also need to store the copy again if I make another snapshot because the “original “ file isn’t part of the snapshot? So now I’m effectively storing more not less?
copy_file_range already works on zfs, but it doesn't guarantee anything interesting.
Basically all modern dupe tools use fideduprange, which is meant to tell the FS which things should be sharing data, and let it take care of the rest.
(BTRFS, bcachefs, etc support this ioctl, and zfs will soon too)
Unlike copy_file_range, it is meant for exactly this use case, and will tell you how many bytes were dedup'd, etc.
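For the curious, calling that ioctl directly looks roughly like this (it's spelled FIDEDUPERANGE in <linux/fs.h>; this is plain Linux VFS usage, nothing ZFS-specific, and error handling is minimal):

    #include <fcntl.h>
    #include <linux/fs.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc != 3) { fprintf(stderr, "usage: %s SRC DEST\n", argv[0]); return 1; }

        int src = open(argv[1], O_RDONLY);
        int dst = open(argv[2], O_RDWR);
        if (src < 0 || dst < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(src, &st) < 0) { perror("fstat"); return 1; }

        /* one info record: dedupe all of SRC against offset 0 of DEST */
        struct file_dedupe_range *req =
            calloc(1, sizeof(*req) + sizeof(struct file_dedupe_range_info));
        if (!req) return 1;
        req->src_offset = 0;
        req->src_length = st.st_size;
        req->dest_count = 1;
        req->info[0].dest_fd = dst;
        req->info[0].dest_offset = 0;

        /* the kernel compares the ranges itself before sharing any blocks */
        if (ioctl(src, FIDEDUPERANGE, req) < 0) { perror("FIDEDUPERANGE"); return 1; }

        if (req->info[0].status == FILE_DEDUPE_RANGE_SAME)
            printf("deduped %llu bytes\n",
                   (unsigned long long)req->info[0].bytes_deduped);
        else
            printf("ranges differ or dedupe refused (status %d)\n",
                   (int)req->info[0].status);
        free(req);
        return 0;
    }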
Note - anyone bored enough could already make any of these tools work by using FICLONERANGE (which ZFS already supports), but you'd have to do locking - lock, compare file ranges, clone, unlock.
Because FIDEDUPRANGE has the compare as part of the atomic guarantee, you don't need to lock in userspace around using it, and so no dedup utility bothers to do FICLONERANGE + locking.
Also, ZFS is the only FS that implements FICLONERANGE but not FIDEDUPRANGE :)
Knowing what you had to know to write that, would you dare using it?
Compression, encryption and streaming sparse files together are impressive already. But now we get a new BRT entry appearing out of nowhere, dedup index pruning one that was there a moment ago, all while correctly handling arbitrary errors in whatever simultaneous deduped writes, O_DIRECT writes, FALLOC_FL_PUNCH_HOLE and reads were waiting for the same range? Sounds like adding six new places to hold the wrong lock to me.
"Knowing what you had to know to write that, would you dare using it?"
It's no worse than anything else related to block cloning :)
ZFS already supports FICLONERANGE, the thing FIDEDUPRANGE changes is that the compare is part of the atomic guarantee.
So in fact, i'd argue it's actually better than what is there now - yes, the hardest part is the locking, but the locking is handled by the dedup range call getting the right locks upfront, and passing them along, so nothing else is grabbing the wrong locks. It actually has to because of the requirements to implement the ioctl properly. We have to be able to read both ranges, compare them, and clone them, all as an atomic operation wrt to concurrent writes. So instead of random things grabbing random locks, we pass the right locks around and everything verifies the locks.
This means fideduprange is not as fast as it maybe could be, but it does not run into the "oops we forgot the right kind of lock" issue. At worst, it would deadlock, because it's holding exclusive locks on all that it could need before it starts to do anything in order to guarantee both the compare and the clone are atomic. So something trying to grab a lock forever under it will just deadlock.
This seemed the safest course of implementation.
ficlonerange is only atomic in the cloning, which means it does not have to read anything first, it can just do blind block cloning. So it actually has a more complex (but theoretically faster) lock structure because of the relaxed constraints.
No, because none of these tools use copy_file_range. Because copy_file_range doesn't guarantee deduplication or anything. It is meant to copy data. So you could just end up copying data, when you aren't even trying to copy anything at all.
All modern tools use FIDEDUPRANGE, which is an ioctl meant for explicitly this use case - telling the FS that two files have bytes that should be shared.
Under the covers, the FS does block cloning or whatever to make it happen.
Nothing is copied.
ZFS does support FICLONERANGE, which is the same as FIDEDUPRANGE but it does not verify the contents are the same prior to cloning.
Both are atomic WRT to concurrent writes, but for FIDEDUPRANGE that means the compare is part of the atomicness. So you don't have to do any locking.
If you used FICLONERANGE, you'd need to lock the two file ranges, verify, clone, unlock
FIDEDUPRANGE does this for you.
So it is possible, with no changes to ZFS, to modify dedup tools to work on ZFS by changing them to use FICLONERANGE + locking if FIDEDUPRANGE does not exist.
You get file-level deduplication. The command above performs a lightweight copy (a ZFS clone at the file level), where the data blocks are copied only when modified. It's a copy, not a hard link. The same should work in other copy-on-write transactional filesystems as well if they have reflink support.
I'm so excited about fast dedup. I've been wanting to use ZFS deduping for ArchiveBox data for years, as I think fast dedup may finally make it viable to archive many millions of URLs in one collection and let the filesystem take care of compression across everything. So much of archive data is the same jquery.min.js, bootstrap.min.css, logo images, etc. repeated over and over in thousands of snapshots. Other tools compress within a crawl to create wacz or warc.gz files, but I don't think anyone has tried to do compression across the entire database of all snapshots ever taken by a tool.
Big thank you to all the people that worked on it!
BTW has anyone tried a probabilistic dedup approach using something like a bloom filter so you don't have to store the entire dedup table of hashes verbatim? Collect groups of ~100 block hashes into a bucket each, and store a hyper-compressed representation in a bloom filter. On write, look up the hash of the block to write in the bloom filter, and if a potential dedup hit is detected, walk the 100 blocks in the matching bucket manually to look for any identical hashes. In theory you could do this with layers of bloom filters with different resolutions and dynamically swap out the heavier ones to disk when memory pressure is too high to keep the high resolution ones in RAM. Allowing the accuracy of the bloom filter to be changed as a tunable parameter would let people choose their preference around CPU time/overhead:bytes saved ratio.
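A toy sketch of the first layer of that idea, with every name invented: a small bit array gates lookups into the big bucket table, so a false positive only costs one wasted bucket walk and a false negative can't happen.

    /* Toy bloom filter in front of a dedup table: ~2 MiB of bits answers
     * "have we possibly seen this block hash before?" before touching the
     * (much larger, possibly on-disk) buckets of full hashes. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define BLOOM_BITS (1u << 24)
    static uint8_t bloom[BLOOM_BITS / 8];

    /* derive two probe positions from the block hash we already computed */
    static void probes(const uint8_t h[32], uint32_t *a, uint32_t *b)
    {
        memcpy(a, h, 4);     *a %= BLOOM_BITS;
        memcpy(b, h + 4, 4); *b %= BLOOM_BITS;
    }

    static void bloom_add(const uint8_t h[32])
    {
        uint32_t a, b;
        probes(h, &a, &b);
        bloom[a / 8] |= 1u << (a % 8);
        bloom[b / 8] |= 1u << (b % 8);
    }

    static bool bloom_maybe_contains(const uint8_t h[32])
    {
        uint32_t a, b;
        probes(h, &a, &b);
        return ((bloom[a / 8] >> (a % 8)) & 1) && ((bloom[b / 8] >> (b % 8)) & 1);
    }

    int main(void)
    {
        uint8_t seen[32] = {1, 2, 3}, unseen[32] = {9, 9, 9};
        bloom_add(seen);
        /* on a "maybe", you would go walk the ~100-hash bucket for real */
        printf("seen:   %d\n", bloom_maybe_contains(seen));    /* 1 */
        printf("unseen: %d\n", bloom_maybe_contains(unseen));  /* 0, almost surely */
        return 0;
    }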
Even with this change ZFS dedupe is still block-aligned, so it will not match repeated web assets well unless they exist at consistently identical offsets within the warc archives.
dm-vdo has the same behaviour.
You may be better off with long-range solid compression instead, or unpacking the warc files into a directory equivalent, or maybe there is some CDC-based FUSE system out there (Seafile perhaps)
I should clarify I don't use WARCs at all with ArchiveBox; it just stores raw files on the filesystem because I rely on ZFS for all my compression, so there is no offset alignment issue.
The wget extractor within archivebox can produce WARCs as an output but no parts of ArchiveBox are built to rely on those, they are just one of the optional extractors that can be run.
I get the use case, but in most cases (and particularly this one) I'm sure it would be much better to implement that client-side.
You may have seen in the WARC standard that they already do de-duplication based on hashes and use pointers after the first store. So this is exactly a case where FS-level dedup is not all that good.
That's not true, you commonly have CDX index files which allow for de-duplication across arbitrarily large archives. The internet archive could not reasonably operate without this level of abstraction.
[edit] Should add a link, this is a pretty good overview, but you can also look at implementations such as the new zeno crawler.
Ah cool, TIL, thanks for the link. I didn't realize that was possible.
I know of the CDX index files produced by some tools but don't know anything about the details/that they could be used to dedup across WARCs, I've only been referencing the WARC file specs via IIPC's old standards docs.
In addition to the copy_file_range discussion at the end, it would be great to be able to apply deduplication to selected files, identified by searching the filesystem for, say, >1MB files which have identical hashes.
General-purpose deduplication sounds good in theory but tends not to work out in practice. IPFS uses a rolling hash with variable-sized pieces, in an attempt to deduplicate data rsync-style. However, in practice, it doesn't actually make a difference, and adds complexity for no reason.
I've used ZFS dedupe for a personal archive since dedupe was first introduced.
Currently, it seems to be reducing on-disk footprint by a factor of 3.
When I first started this project, 2TB hard drives were the largest available.
My current setup uses slow 2.5-inch hard drives; I attempt to improve things somewhat via NVMe-based Optane drives for cache.
Every few years, I try to do a better job of things but at this point, the best improvement would be radical simplification.
ZFS has served very well in terms of reliability. I haven't lost data, and I've been able to catch lots of episodes of almost losing data. Or writing the wrong data.
Not entirely sure how I'd replace it, if I want something that can spot bit rot and correct it. ZFS scrub.
I'd love it if the dedicated hardware in disk controllers for calculating stuff like ECC could be enhanced to expose hashes of blocks to the system. Getting this for free for all your I/O would allow some pretty awesome things.
I really wish we just had a completely different API as a filesystem. The API surface of filesystem on every OS is a complete disaster that we are locked into via backwards compatibility.
Internally ZFS is essentially an object store. There was some work which tried to expose it through an object store API. Sadly it seems to not have gone anywhere.
Tried to find the talk but failed; was sure I had seen it at a Developer Summit, but alas.
High-density drives are usually zoned storage, and it's pretty difficult to implement the regular filesystem API on top of that with any kind of reasonable performance (device- vs host- managed SMR). The S3 API can work great on zones, but only because it doesn't let you modify an existing object without rewriting the whole thing, which is an extremely rough tradeoff.
It’s only a ‘disaster’ if you are using it exclusively programmatically and want to do special tuning.
File systems are pretty good if you have a mix of human and programmatic uses, especially when the programmatic cases are not very heavy duty.
The programmatic scenarios are often entirely human hostile, if you try to imagine what would be involved in actually using them. Like direct S3 access, for example.
Keep in mind ZFS was created at a time when disks were glacial in comparison to CPUs. And, the fastest write is the one you don't perform, so you can afford some CPU time to check for duplicate blocks.
That said, NVMe has changed that balance a lot, and you can afford a lot less before you're bottlenecking the drives.
If the block to be written is already being stored then you will match the hash and the block won't have to be written. This can save a lot of write IO in real world use.
Unless your data-set is highly compressed media files.
In general, even during rsync operations one often turns off compression on large video files, as the compression operation has low or negative impact on storage/transfers while eating ram and cpu power.
De-duplication is good for Virtual Machine OS images, as the majority of the storage cost is a replicated backing image. =3
When the lookup key is a hash, there's no locality over the megabytes of the table. So don't all the extra memory accesses to support dedup affect the L1 and L2 caches? Has anyone at OpenZFS measured that?
It also occurs to me that spatial locality on spinning rust disks might be affected, which would also hurt performance.
Knowing that your storage has really good inline dedupe is awesome and will affect how you design systems. Solid dedupe lets you effectively treat multiple copies of data as symlinks.
I wonder why they are having so much trouble getting this working properly with smaller RAM footprints. We have been using commercial storage appliances that have been able to do this for about a decade (at least) now, even on systems with "little" RAM (compared to the amount of disk storage attached).
Just store fingerprints in a database and run through that at night and fixup the block pointers...
You can also use DragonFlyBSD with Hammer2, which supports both online and offline deduplication. It is very similar to ZFS in many ways. The big drawback though, is lack of file transfer protocols using RDMA.
I've also heard there are some experimental branches that makes it possible to run Hammer2 on FreeBSD. But FreeBSD also lacks RDMA support. For FreeBSD 15, Chelsio has sponsored NVMe-oF target, and initiator support. I think this is just TCP though.
That's why. Due to reasons[1], ZFS does not have the capability to rewrite block pointers. It's been a long requested feature[2] as it would also allow for defragmentation.
I've been thinking this could be solved using block pointer indirection, like virtual memory, at the cost of a bit of speed.
But I'm by no means a ZFS developer, so there's surely something I'm missing.
It looks like they’re playing more with indirection features now (created for vdev removal) for other features. One of the recent summit hackathons sketched out using indirect vdevs to perform rebalancing.
Once you get a lot of snapshots, though, the indirection costs start to rise.
So many flaws. I want to see the author repeat this across 100TB of random data from multiple clients. He/she/whatever will quickly realize why this feature exists. One scenario I am aware of that uses another filesystem in a cloud setup saved 43% of disk space by using dedupe.
No, you won't save much on a client system. That isn't what the feature is made for.
When ZFS first came out I had visions of it being a turnkey RAID array replacement for nontechnical users. Pop out the oldest disk, pop in a new (larger one), wait for the pretty lights to change color. Done.
It is very clear that consumer was never a priority, and so I wonder what the venn diagram is of 'client system' and 'zfs filesystem'. Not that big right?
I assume the author is aware of why the feature exists, since they state in the second sentence that they funded the improvement over the course of two years?
My reaction also. Dedupe is a must-have when you are storing hundreds of VMs. You WILL save so much data, and inline dedupe will save a lot of write IO.
It's an odd notion in the age of containers, where dedupe is like, one of the core things we do (but stupidly: amongst dissimilar images there are definitely more identical files than different ones).
I tried two of the most non-random archives I had and was disappointed just as the author. For mail archives, I got 10%. For entire filesystems, I got.. just as much as with any other COW. Because indeed, I duplicate them only once. Later shared blocks are all over the place.
Can someone smarter than me explain what happens when, instead of the regular 4KB block size, kernel builds use a 16KB or 64KB block size? Or is that only for the memory part? I am confused. Will a larger block size make this thing better or worse?
Generally the smaller the dedupe block the better as you are far more likely to find a matching block. But larger blocks will reduce the number of hashes you have to store. In my experience 4KB is the sweet spot to maximize how much data you save.
My dream Git successor would use either dedupe or a simple cache plus copy-on-write so that repos can commit toolchains and dependencies and users wouldn’t need to worry about disk drive bloat.
The built in Photos duplicate feature is the best choice for most people: it’s not just generic file-level dedupe but smart enough to do things like take three versions of the same photo and pick the highest-quality one, which is great if you ever had something like a RAW/TIFF+JPEG workflow or mixed full res and thumbnails.
Or better yet: a single photo I take of the kids will be stored in my camera roll. I will then share it with family using three different messengers. Now I have 4 copies. Each of the individual (recoded) copies is stored inside those messengers and also backed up. This even happens when sharing the same photo multiple times in different chats with the same messenger.
Is there any way to do de-duplication here? Or just outright delete all the derivatives?
I clicked because of the bait-y title, but ended up reading pretty much the whole post, even though I have no reason to be interested in ZFS. (I skipped most of the stuff about logs...) Everything was explained clearly, I enjoyed the writing style, and the mobile CSS theme was particularly pleasing to my eyes. (It appears to be Pixyll theme with text set to the all-important #000, although I shouldn't derail this discussion with opinions on contrast ratios...)
For less patient readers, note that the concise summary is at the bottom of the post, not the top.
It scrolls horizontally :(
It's because of this element in one of the final sections [1]:
Typesetting code on a narrow screen is tricky![1] https://despairlabs.com/blog/posts/2024-10-27-openzfs-dedup-...
We used to make extensive use of, and gained huge benefit from, dedup in ZFS. The specific use case was storage for VMWare clusters where we had hundreds of Linux and Windows VMs that were largely the same content. [this was pre-Docker]
"And this is the fundamental issue with traditional dedup: these overheads are so outrageous that you are unlikely to ever get them back except on rare and specific workloads."
This struck me as a very odd claim. I've worked with Pure and Dell/EMC arrays and for VMWare workloads they normally got at least 3:1 dedupe/compression savings. Only storing one copy of the base VM image works extremely well. Dedupe/compression works really well on syslog servers where I've seen 6:1 savings.
The effectiveness of dedupe is strongly affected by the size of the blocks being hashed, with the smaller the better. As the blocks get smaller the odds of having a matching block grow rapidly. In my experience 4KB is my preferred block size.
Couple of comments. Firstly, you are talking about highly redundant information when referencing VM images (e.g. the C drive on all Windows Serer images will be virtually identical), whereas he was using his own laptop contents as an example.
Secondly, I think you are conflating two different features: compression & de-duplication. In ZFS you can have compression turned on (almost always worth it) for a pool, but still have de-duplication disabled.
Fair point. My experience is with enterprise storage arrays and I have always used dedupe/compression at the same time. Dedupe is going to be a lot less useful on single computers.
I consider dedupe/compression to be two different forms of the same thing. compression reduces short range duplication while deduplication reduces long range duplication of data.
Yeah agreed, very closely related - even more so on ZFS where the compression (AFAIK) is on a block level rather than a file level.
ZFS compression is for sure at the block level - it's fully transparent to the userland tools.
It could be at a file level and still transparent to user land tools, FYI. Depending on what you mean by ‘file level’, I guess.
Base VM images would be a rare and specific workload. One of the few cases dedupe makes sense. However you are likely using better strategies like block or filesystem cloning if you are doing VM hosting off a ZFS filesystem. Not doing so would be throwing away one of it's primary differentiators as a filesystem in such an environment.
General purpose fileserving or personal desktop/laptop use generally has very few duplicated blocks and is not worth the overhead. Backups are hit or miss depending on both how the backups are implemented, and if they are encrypted prior to the filesystem level.
Compression is a totally different thing and current ZFS best-practice is to enable it by default for pretty much every workload - the CPU used is barely worth mentioning these days, and the I/O savings can be considerable ignoring any storage space savings. Log storage is going to likely see a lot better than 6:1 savings if you have typical logging, at least in my experience.
Certainly it makes sense to not have deep copies of VM base images, but the deduplication is not the right way to do it in ZFS. Instead, you can clone the base image and before changes it will take almost no space at all. This is thanks to the copy-on-write nature of ZFS.
ZFS deduplication instead tries to find existing copies of data that is being written to the volume. For some use cases it could make a lot of sense (container image storage maybe?), but it's very inefficient if you already know some datasets to be clones of the others, at least initially.
I haven't tried it myself, but the widely quoted number for old ZFS dedup is that you need 5GB of RAM for every 1TB of disk space. Considering that 1 TB of disk space currently costs about $15 and 5GB of server RAM about $25, you need a 3:1 dedupe ratio just to break even.
If your data is a good fit you might get away with 1GB per TB, but if you are out of luck the 5GB might not even be enough. That's why the article speaks of ZFS dedup having a small sweet spot that your data has to hit, and why most people don't bother
Other file systems tend to prefer offline dedupe which has more favorable economics
Why does it need so much RAM? It should only need to store the block hashes which should not need anywhere near that much RAM. Inline dedupe is pretty much standard on high-end storage arrays nowadays.
The linked blog post covers this, and the improvements made to make the new dedup better.
That doesn't account for OpEx, though, such as power...
Assuming something reasonable like 20TB Toshiba MG10 HDDs and 64GB DDR4 ECC RAM, quick googling suggests that 1TB of disk space uses about 0.2-0.4W of power (0.2 in idle, 0.4 while writing), 5GB of RAM about 0.3-0.5W. So your break even on power is a bit earlier depending on the access pattern, but in the same ball park.
What about rack space?
Not just rack space. At a certain amount of disks you also need to get a separate server (chassis + main board + cpu + ram) to host the disks. Maybe you need that for performance reasons any way. But saving disk space and only paying for it with some ram sounds cost effective.
VMs are known to benefit from dedupe so yes, you'll see benefits there. ZFS is a general-purpose filesystem not just an enterprise SAN so many ZFS users aren't running VMs.
Dedupe/compression works really well on syslog
I apologize for the pedantry but dedupe and compression aren't the same thing (although they tend to be bundled in the enterprise storage world). Logs are probably benefiting from compression not dedupe and ZFS had compression all along.
They are not the same thing, but when you boil it down to the raw math, they aren't identical twins, but they're absolutely fraternal twins.
Both are trying to eliminate repeating data, it's just the frame of reference that changes. Compression in this context is operating on a given block or handful of blocks. Deduplication is operating on the entire "volume" of data. "Volume" having a different meaning depending on the filesystem/storage array in question.
Well put. I like to say compression is just short range dedupe. Hash based dedupe wouldn't be needed if you could just to real-time LZMA on all of the data on a storage array but that just isn't feasible and hash-based dedupe is a very effective compromise.
Is "paternal twins" a linguistic borrowing of some sort? It seems a relatively novel form of what I've mostly seen referred to as monozygotic / 'identical' twins. Searching for some kind of semi-canonical confirmation of its widespread use turns up one, maybe two articles where it's treated as an orthodox term, and at least an equal number of discussions admonishing its use.
If anything I would expect the term “maternal” twin to be used as whether or not a twin is monozygotic or “identical” depends on the amount of eggs from the mother.
compression tends NOT to use a global dictionary. So to me they are vastly different even if they have the same goal of reducing the output size.
Compression with a global dict would like do better than dedup yet it will have a lot of other issues.
If we're being pedants, then storing the same information in fewer bits than the input is by definition a form of compression, no?
(Although yes I understand that file-level compression with a standard algorithm is a different thing than dedup)
Even with the rudimentary Dedup features of NTFS on a Windows Hyper-V Server all running the same base image I can overprovision the 512GB partition to almost 2 GB.
You need to be careful and do staggered updates in the VMs or it'll spectacularly explode but it's possible and quite performant for less than mission critical VMs.
I think you mean 2TB volume? But yes, this works. But also: if you're doing anything production, I'd strongly recommend doing deduplication on the back-end storage array, not at the NTFS layer. It'll be more performant and almost assuredly have better space savings.
For text based logs I'm almost entirely sure that just using compression is more than enough. ZFS supports compression natively on block level and it's almost always turned on. Trying to use dedup alongside of compression for syslog most likely will not yield any benefits.
> In my experience 4KB is my preferred block size
That makes sense considering Advanced Format harddrives already have a 4K physical sector size, and if you properly low-level format them (to get rid of the ridiculous Windows XP compatibility) they also have 4K logical sector size. I imagine there might be some real performance benefits to having all of those match up.
In the early days of VMware people had a lot of VMs that were converted from physical machines and this causes a nasty alignment issue between the VMDK blocks and the blocks on your storage array. The effect was to always add one block to every read operation, and in the worst case of reading one block would double the load on the storage array. On NetApp this could only be fixed when the VM wasn't running.
> In my experience 4KB is my preferred block size.
This probably has something to do with the VM's filesystem block size. If you have a 4KB filesystem and an 8KB file, the file might be fragmented differently but is still the same 2x4KB blocks just in different places.
Now I wonder if filesystems zero the slack space at the end of the last block in a file in hopes of better host compression. Vs leaving it as past bytes.
I would think VMs qualify as a specific workload, since cloning is almost a given.
I figured he was mostly talking about using dedup on your work (dev machine) computer or family computer at home, not on something like a cloud or streaming server or other back end type operations.
> Dedupe/compression works really well on syslog servers where I've seen 6:1 savings.
Don’t you compress these directly? I normally see at least twice that for logs doing it at the process level.
Yes, that ratio is very small.
I built a very simple, custom syslog solution, a syslog-ng server writing directly to a TimescaleDB hypertable (https://www.timescale.com/) that is then presented as a Grafana dashboard, and I am getting a 30x compression ratio.
What software?
Log rotate, cron, or simply having something like Varnish or Apache log to a pipe which is something like bzip2 or zstd. The main question is whether you want to easily access the current stream - e.g. I had uncompressed logs being forwarded to CloudWatch so I had daemons logging to timestamped files with a post-rotate compression command which would run after the last write.
That is one wrinkle of using storage based dedupe/compression is you need to avoid doing compression on the client to avoid compressing already compressed data. When a company I worked at first got their Pure array they were using windows file compression heavily and had to disable it as the storage array was now doing it automatically.
Definitely. We love building abstraction layers but at some point you really need to make decisions across the entire stack.
Logrotate is the rhel utility, likely present in Fedora, that is easily adapted for custom log handling. I still have rhel5 and I use it there.
CentOS made it famous. I don't know if it has a foothold in the Debian family.
logrotate is used on Debian and plenty of other distros. It seems pretty widely used, though maybe not as much so now that things log through systemd.
Logrotate
I want "offline" dedupe, or "lazy" dedupe that doesn't require the pool to be fully offline, but doesn't happen immediately.
Because:
> When dedup is enabled [...] every single write and free operation requires a lookup and a then a write to the dedup table, regardless of whether or not the write or free proper was actually done by the pool.
To me, this is "obviously" the wrong approach in most cases. When I'm writing data, I want that write to complete as fast as possible, even at the cost of disk space. That's why I don't save files I'm actively working on in 7zip archives.
But later on, when the system is quiet, I would love for ZFS to go back and figure out which data is duplicated, and use the BRT or whatever to reclaim space. This could be part of a normal scrub operation.
Lazy/off-line dedup requires block pointer rewrite, but ZFS _cannot_ and will not ever get true BP rewrite because ZFS is not truly a CAS system. The problem is that physical locations are hashed into the Merkle hash tree, and that makes moving physical locations prohibitively expensive as you have to rewrite all the interior nodes on the way to the nodes you want to rewrite.
A better design would have been to split every node that has block pointers into two sections, one that has only logical block pointers and all of whose contents gets hashed into the tree, and one that has only the physical locations (as if it were a cache) of the corresponding logical block pointers in the first section, with the second section _not_ hashed into the Merkle hash tree. Then BP rewrite would only require re-writing blocks that are not part of the Merkle hash tree.
But as it is you can't get BP rewrite to work on ZFS, so you can't get what you're asking for.
Well... maybe. Perhaps on read hash mismatch ZFS could attempt to locate the pointed-to block in the dedup table using the hash from the pointer. Then ZFS could reallocate the dedup'ed block. The price you'd pay then is one pointless read -- not too bad. The impossibility of BP rewrite generally leads to band-aids like this.
> I just wish we had "offline" dedupe, or even "lazy" dedupe...
This is the Windows dedupe methodology. I've used it pretty extensively and I'm generally happy with it when the underlying hardware is sufficient. It's very RAM and I/O hungry but you can schedule and throttle the "groveler".
I have had some data eating corruption from bugs in the Windows 2012 R2 timeframe.
The neat thing about inline dedupe is that if the block hash already exists than the block doesn't have to be written. This can save a LOT of write IO in many situations. There are even extensions where a file copy between to VMs on a dedupe storage array will not actually copy any data but just increment the original blocks reference counter. You will see absurd TB/s write speeds in the OS, it is pretty cool.
This is only a win if the dedupe table fits in RAM; otherwise you pay for it in a LOT of read IO. I have a storage array where dedupe would give me about a 2.2x reduction in disk usage, but there isn't nearly enough RAM for it.
yes inline dedupe has to fit in RAM. Perhaps enterprise storage arrays have spoiled me.
This array is a bit long-in-the-tooth and only has 192GB of RAM, but a bit over 40TB of net storage, which would be a 200GB dedup table size using the back-of-the-envelope estimate of 5GB/TB.
A more precise calculation on my actual data shows that today's data would allow the dedup table to fit in RAM, but if I ever want to actually use most of the 40TB of storage, I'd need more RAM. I've had a ZFS system swap dedup to disk before, and the performance dropped to approximately zero; fixing it was a PITA, so I'm not doing that anytime soon.
Be aware that ZFS performance rapidly drops off north of 80% utilization, when you head into 90%, you will want to buy a bigger array just to escape the pain.
The ability to alter existing snapshots, even in ways that fully preserve the data, is extremely limited in ZFS. So yes that would be great, but if I was holding my breath for Block Pointer Rewrite I'd be long dead.
You need block pointer rewrite for this?
You don't need it to dedup writable files. But redundant copies in snapshots are stuck there as far as I'm aware. So if you search for duplicates every once in a while, you're not going to reap the space savings until your snapshots fully rotate.
The issue with this, in my experience, is that at some point that pro (exactly, and literally, only one copy of a specific bit of data despite many apparent copies) can become a con if there is some data corruption somewhere.
Sometimes it can be a similar issue in some edge cases performance wise, but usually caching can address those problems.
Efficiency being the enemy of reliability, sometimes.
Redundant copies on a single volume are a waste of resources. Spend less on size, spend more on an extra parity drive, or another backup of your most important files. That way you get more safety per gigabyte.
Notably, having to duplicate all data x2 (or more) is more of a waste than having 2 copies of a few files - if full drive failure is not the expected failure mode, and not all files should be protected this heavily.
It’s why metadata gets duplicated in ZFS the way it does on all volumes.
Having seen this play out a bunch of times, it isn’t an uncommon need either.
> having to duplicate all data x2
Well I didn't suggest that. I said important files only for the extra backup, and I was talking about reallocating resources not getting new ones.
The simplest version is the scenario where turning on dedup means you need one less drive of space. Convert that drive to parity and you'll be better off. Split that drive from the pool and use it to backup the most important files and you'll be better off.
If you can't save much space with dedup then don't bother.
There was an implication in your statement that volume level was the level of granularity, yeah?
I’m noting that during on volume wide dedup can have the con that you can’t choose (but it looks like you can!) to manually duplicate data.
Note: I assume volume means pool?
> There was an implication in your statement that volume level was the level of granularity, yeah?
There was an implication that the volume level was the level of granularity for adding parity.
But that was not the implication for "another backup of your most important files".
> I’m noting that during on volume wide dedup can have the con that you can’t choose (but it looks like you can!) to manually duplicate data.
You can't choose just by copying files around, but it's pretty easy to set copies=2 on specific directories. And I'd say that's generally a better option, because it keeps your copies up to date at all times. Just make sure snapshots are happening, and files in there will be very safe.
Manual duplication is the worst kind of duplication, so while it's good to warn people that it won't work with dedup on, actually losing the ability is not a big deal when you look at the variety of alternatives. It only tips the balance in situations where dedup is near-useless to start with.
You can use any of the offline dupe finders to do this.
Like jdupes or duperemove.
I sent PR's to both the ZFS folks and the duperemove folks to support the syscalls needed.
I actually have to go followup on the ZFS one, it took a while to review and i realized i completely forget to finish it up.
The author of the new file-based block cloning code had this in mind. A backround process would scan files and identify dupes, delete the dupes and replace them with cloned versions.
There are of course edge cases to consider to avoid data loss, but I imagine it might come soon, either officially or as a third-party tool.
I get the feeling that a hypothetical ZFS maintainer reading some literature on concurrent mark and sweep would be... inspirational, if not immediately helpful.
You should be able to detect duplicates online. Low priority sweeping is something else. But you can at least reduce pause times.
I run rdfind[1] as a cronjob to replace duplicates with hardlinks. Works fine!
https://github.com/pauldreik/rdfind
So this is great, if you're just looking to deduplicate read only files. Less so if you intend to write to them. Write to one and they're both updated.
Anyway. Offline/lazy dedup (not in the zfs dedup sense) is something that could be done in userspace, at the file level on any filesystem that supports reflinks. When a tool like rdfind finds a duplicate, instead of replacing with a hardlink, create a copy of the file with `copy_file_range(2)` and let the filesystem create a reflink to it. Now you've got space savings and they're two separate files so if one is written to the other remains the same.
How would this work if I have snapshots? Wouldn’t then the version of the file I just replaced still be in use there? But maybe I also need to store the copy again if I make another snapshot because the “original “ file isn’t part of the snapshot? So now I’m effectively storing more not less?
copy_file_range already works on zfs, but it doesn't guarantee anything interesting.
Basically all dupe tools that are modern use fideduprange, which is meant to tell the FS which things should be sharing data, and let it take care of the rest. (BTRFS, bcachefs, etc support this ioctl, and zfs will soon too)
Unlike copy_file_range, it is meant for exactly this use case, and will tell you how many bytes were dedup'd, etc.
Quite cool, though it's not as storage saving as deduplicating at e.g. N byte blocks, at block level.
But then you have to be careful not to remove the one which happens to be the "original" or the hardlinks will break, right?
No, pointing to an original is how soft links work.
Hard links are all equivalent. A file has any number of hard links, and at least in theory you can't distinguish between them.
The risk with hardlinks is that you might alter the file. Reflinks remove that risk, and also perform very well.
btrfs has this. You can deduplicate a filesystem after the fact, as an overnight cron job or whatever. I really wish ZFS could do this.
I sent a PR to add support for the necessary ioctl (FIDEDUPERANGE) to ZFS; I just have to clean it up again.
Once that is in, any of the existing dupe-finding tools that use it (e.g. jdupes, duperemove) will just work on ZFS.
Note: anyone bored enough could already make any of these tools work by using FICLONERANGE (which ZFS already supports), but you'd have to do locking: lock, compare file ranges, clone, unlock.
Because FIDEDUPERANGE has the compare as part of the atomic guarantee, you don't need to lock in userspace around using it, and so no dedup utility bothers to do FICLONERANGE + locking. Also, ZFS is the only FS that implements FICLONERANGE but not FIDEDUPERANGE :)
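To make that FICLONERANGE + locking fallback concrete, a rough sketch (mine, and only a sketch: the fcntl locks are advisory, so this only protects against writers that also take them, which is exactly why nobody bothers):

```c
/* Sketch: lock, compare, clone, unlock - the userspace fallback when only
 * FICLONERANGE is available. The compare step is elided for brevity. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/fs.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Dedupe `len` bytes at offset `off` of dst against src. Returns 0 or -1. */
int dedupe_with_clone(int src, int dst, __u64 off, __u64 len)
{
    struct flock lk = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                        .l_start = (off_t)off, .l_len = (off_t)len };

    /* 1. Lock both ranges (advisory OFD locks). */
    if (fcntl(src, F_OFD_SETLKW, &lk) < 0) return -1;
    if (fcntl(dst, F_OFD_SETLKW, &lk) < 0) return -1;

    /* 2. Compare the two ranges byte-for-byte (elided: pread both sides
     *    and memcmp; bail out before cloning if they differ). */

    /* 3. Clone: blindly share src's blocks into dst's range. */
    struct file_clone_range fcr = { .src_fd = src, .src_offset = off,
                                    .src_length = len, .dest_offset = off };
    int ret = ioctl(dst, FICLONERANGE, &fcr);

    /* 4. Unlock both ranges. */
    lk.l_type = F_UNLCK;
    fcntl(dst, F_OFD_SETLK, &lk);
    fcntl(src, F_OFD_SETLK, &lk);
    return ret;
}
```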
Knowing what you had to know to write that, would you dare use it?
Compression, encryption and streaming sparse files together are impressive already. But now we get a new BRT entry appearing out of nowhere, dedup index pruning removing an entry that was there a moment ago, all while correctly handling arbitrary errors in whatever simultaneous deduped writes, O_DIRECT writes, FALLOC_FL_PUNCH_HOLE calls and reads were waiting for the same range? Sounds like adding six new places to hold the wrong lock to me.
"Knowing what you had to know to write that, would you dare using it?"
It's no worse than anything else related to block cloning :)
ZFS already supports FICLONERANGE; the thing FIDEDUPERANGE changes is that the compare is part of the atomic guarantee.
So in fact, I'd argue it's actually better than what is there now. Yes, the hardest part is the locking, but the locking is handled by the dedupe-range call getting the right locks upfront and passing them along, so nothing else is grabbing the wrong locks. It actually has to, because of the requirements to implement the ioctl properly: we have to be able to read both ranges, compare them, and clone them, all as an atomic operation with respect to concurrent writes. So instead of random things grabbing random locks, we pass the right locks around and everything verifies the locks.
This means FIDEDUPERANGE is not as fast as it maybe could be, but it does not run into the "oops, we forgot the right kind of lock" issue. At worst, it would deadlock, because it holds exclusive locks on everything it could need before it starts doing anything, in order to guarantee both the compare and the clone are atomic. So something trying to grab a lock forever under it will just deadlock.
This seemed the safest course of implementation.
FICLONERANGE is only atomic in the cloning, which means it does not have to read anything first; it can just do blind block cloning. So it actually has a more complex (but theoretically faster) lock structure because of the relaxed constraints.
Shouldn't jdupes-like tools already work now that ZFS has reflink copy support?
No, because none of these tools use copy_file_range, since copy_file_range doesn't guarantee deduplication or anything. It is meant to copy data, so you could just end up copying data when you aren't even trying to copy anything at all.
All modern tools use FIDEDUPERANGE, which is an ioctl meant for exactly this use case: telling the FS that two files have bytes that should be shared.
Under the covers, the FS does block cloning or whatever to make it happen.
Nothing is copied.
ZFS does support FICLONERANGE, which is the same as FIDEDUPERANGE except that it does not verify the contents are the same prior to cloning.
Both are atomic with respect to concurrent writes, but for FIDEDUPERANGE that means the compare is part of the atomicity, so you don't have to do any locking.
If you used FICLONERANGE, you'd need to lock the two file ranges, verify, clone, and unlock.
FIDEDUPERANGE does this for you.
So it is possible, with no changes to ZFS, to modify dedup tools to work on ZFS by changing them to use FICLONERANGE + locking if FIDEDUPERANGE does not exist.
You should use:
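Presumably a reflink copy along the lines of `cp --reflink=auto SRC DST` (my assumption, based on the description below). At the syscall level that boils down to the FICLONE ioctl, roughly:

```c
/* Sketch: clone SRC into DST so they share blocks until one is modified.
 * FICLONE is what cp --reflink uses; btrfs, XFS and recent OpenZFS
 * (2.2+ block cloning) support it. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s SRC DST\n", argv[0]);
        return 1;
    }
    int src = open(argv[1], O_RDONLY);
    int dst = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (src < 0 || dst < 0) { perror("open"); return 1; }

    /* Share all of SRC's blocks with DST; no data is actually copied. */
    if (ioctl(dst, FICLONE, src) < 0) { perror("FICLONE"); return 1; }
    return 0;
}
```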
You get file-level deduplication. The command above performs a lightweight copy (a ZFS clone at the file level), where the data blocks are copied only when modified. It's a copy, not a hard link. The same should work in other copy-on-write transactional filesystems as well, if they have reflink support.
I'm so excited about fast dedup. I've been wanting to use ZFS deduping for ArchiveBox data for years, as I think fast dedup may finally make it viable to archive many millions of URLs in one collection and let the filesystem take care of compression across everything. So much of archive data is the same jquery.min.js, bootstrap.min.css, logo images, etc. repeated over and over in thousands of snapshots. Other tools compress within a crawl to create wacz or warc.gz files, but I don't think anyone has tried to do compression across the entire database of all snapshots ever taken by a tool.
Big thank you to all the people that worked on it!
BTW, has anyone tried a probabilistic dedup approach using something like a Bloom filter, so you don't have to store the entire dedup table of hashes verbatim? Collect groups of ~100 block hashes into a bucket each, and store a hyper-compressed representation in a Bloom filter. On write, look up the hash of the block to write in the Bloom filter, and if a potential dedup hit is detected, walk the 100 blocks in the matching bucket manually to look for any identical hashes. In theory you could do this with layers of Bloom filters at different resolutions, and dynamically swap the heavier ones out to disk when memory pressure is too high to keep the high-resolution ones in RAM. Allowing the accuracy of the Bloom filter to be changed as a tunable parameter would let people choose their preference around the CPU-time/overhead : bytes-saved ratio.
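For what it's worth, a toy sketch of the first layer of that idea: a plain Bloom filter over block hashes used only as a "maybe seen before" gate (sizes and the number of probes are made up; bucket walking, layering, and spilling to disk are left out):

```c
/* Toy Bloom filter over block hashes: a cheap in-RAM "possibly a duplicate"
 * gate in front of the full (on-disk) dedup table. False positives only cost
 * an extra table lookup; false negatives cannot happen. */
#include <stdbool.h>
#include <stdint.h>

#define FILTER_BITS (1u << 27)            /* 2^27 bits = 16 MiB of RAM */

static uint8_t filter[FILTER_BITS / 8];

/* Derive four probe positions from the block's already-computed checksum.
 * ZFS would have a 256-bit checksum here; we just fold it to 64 bits. */
static void probes(uint64_t h, uint32_t out[4])
{
    for (int i = 0; i < 4; i++) {
        out[i] = (uint32_t)(h % FILTER_BITS);
        h = h * 0x9e3779b97f4a7c15ULL + 1;   /* cheap remix per probe */
    }
}

static void filter_add(uint64_t block_hash)
{
    uint32_t p[4];
    probes(block_hash, p);
    for (int i = 0; i < 4; i++)
        filter[p[i] / 8] |= (uint8_t)(1u << (p[i] % 8));
}

/* true  -> possibly a duplicate: go walk the bucket / full dedup table.
 * false -> definitely new: write the block and skip the table entirely. */
static bool filter_maybe_contains(uint64_t block_hash)
{
    uint32_t p[4];
    probes(block_hash, p);
    for (int i = 0; i < 4; i++)
        if (!(filter[p[i] / 8] & (1u << (p[i] % 8))))
            return false;
    return true;
}
```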
Even with this change, ZFS dedupe is still block-aligned, so it will not match repeated web assets well unless they sit at consistently identical offsets within the WARC archives.
dm-vdo has the same behaviour.
You may be better off with long-range solid compression instead, or unpacking the WARC files into a directory equivalent, or maybe there is some CDC-based (content-defined chunking) FUSE system out there (Seafile, perhaps).
I should clarify I don't use WARCs at all with ArchiveBox; it just stores raw files on the filesystem, because I rely on ZFS for all my compression, so there is no offset-alignment issue.
The wget extractor within archivebox can produce WARCs as an output but no parts of ArchiveBox are built to rely on those, they are just one of the optional extractors that can be run.
I get the use case, but in most cases (and particularly this one) I'm sure it would be much better to implement that client-side.
You may have seen in the WARC standard that they already do de-duplication based on hashes and use pointers after the first store. So this is exactly a case where FS-level dedup is not all that good.
WARC only does deduping within a single WARC, I'm talking about deduping across millions of WARCs.
That's not true; you commonly have CDX index files, which allow for de-duplication across arbitrarily large archives. The Internet Archive could not reasonably operate without this level of abstraction.
[edit] Should add a link, this is a pretty good overview, but you can also look at implementations such as the new zeno crawler.
https://support.archive-it.org/hc/en-us/articles/208001016-A...
Ah cool, TIL, thanks for the link. I didn't realize that was possible.
I know of the CDX index files produced by some tools, but I didn't know the details or that they could be used to dedupe across WARCs; I've only been referencing the WARC file specs via IIPC's old standards docs.
While a slightly different use case, I suspect you’d like zbackup if you don’t know about it.
In addition to the copy_file_range discussion at the end, it would be great to be able to apply deduplication to selected files, identified by searching the filesystem for, say, >1MB files which have identical hashes.
General-purpose deduplication sounds good in theory but tends not to work out in practice. IPFS uses a rolling hash with variable-sized pieces, in an attempt to deduplicate data rsync-style. However, in practice it doesn't actually make a difference, and adds complexity for no reason.
I've used ZFS dedupe for a personal archive since dedupe was first introduced.
Currently, it seems to be reducing on-disk footprint by a factor of 3.
When I first started this project, 2TB hard drives were the largest available.
My current setup uses slow 2.5-inch hard drives; I attempt to improve things somewhat via NVMe-based Optane drives for cache.
Every few years, I try to do a better job of things but at this point, the best improvement would be radical simplification.
ZFS has served very well in terms of reliability. I haven't lost data, and I've been able to catch lots of episodes of almost losing data. Or writing the wrong data.
Not entirely sure how I'd replace it if I want something that can spot bit rot and correct it, the way a ZFS scrub does.
Do you have data that is very obviously dedupeable? Or just a mix of things? A factor of three is not to be sniffed at.
Cache or ZIL (SLOG device)?
I'd love if dedicated hardware existing in disk controllers for calculating stuff like ECC could be enhanced to expose hashes of blocks to the system. Getting this for free for all your I/O would allow some pretty awesome things.
I really wish we just had a completely different API as a filesystem. The API surface of filesystem on every OS is a complete disaster that we are locked into via backwards compatibility.
Internally ZFS is essentially an object store. There was some work which tried to expose it through an object store API. Sadly it seems to not have gone anywhere.
Tried to find the talk but failed; I was sure I had seen it at a Developer Summit, but alas.
Why is it a disaster and what would you replace it with? Is the AWS S3 style API an improvement?
High-density drives are usually zoned storage, and it's pretty difficult to implement the regular filesystem API on top of that with any kind of reasonable performance (device- vs host- managed SMR). The S3 API can work great on zones, but only because it doesn't let you modify an existing object without rewriting the whole thing, which is an extremely rough tradeoff.
It’s only a ‘disaster’ if you are using it exclusively programmatically and want to do special tuning.
File systems are pretty good if you have a mix of human and programmatic uses, especially when the programmatic cases are not very heavy duty.
The programmatic scenarios are often entirely human-hostile, if you try to imagine what would be involved in actually using them. Like direct S3 access, for example.
If writing performance is critical, why bother with deduplication at writing time? Do deduplication afterwards, concurrently and with lower priority?
Keep in mind ZFS was created at a time when disks were glacial in comparison to CPUs. And, the fastest write is the one you don't perform, so you can afford some CPU time to check for duplicate blocks.
That said, NVMe has changed that balance a lot, and you can afford a lot less before you're bottlenecking the drives.
Because to make this work without a lot of copying, you would need to mutate things that ZFS absolutely does not want to make mutable.
If the block to be written is already being stored then you will match the hash and the block won't have to be written. This can save a lot of write IO in real world use.
Kinda like a log-structured merge tree?
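For anyone who wants the inline idea spelled out, a toy sketch of the hash-then-maybe-skip-the-write path (made-up names and a tiny in-memory table; a real system like ZFS uses a strong cryptographic checksum and a persistent dedup table, not FNV-1a and an array):

```c
/* Toy inline-dedup write path: hash the block, and if the hash is already in
 * the table, bump a refcount instead of writing the block again. */
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE  4096
#define TABLE_SLOTS 4096

struct ddt_entry { uint64_t hash; uint64_t refcount; };
static struct ddt_entry table[TABLE_SLOTS];

static uint64_t fnv1a(const uint8_t *p, size_t n)
{
    uint64_t h = 0xcbf29ce484222325ULL;
    for (size_t i = 0; i < n; i++) { h ^= p[i]; h *= 0x100000001b3ULL; }
    return h;
}

/* Returns 1 if the block was deduped (no write needed), 0 if it was "written". */
static int write_block(const uint8_t block[BLOCK_SIZE])
{
    uint64_t h = fnv1a(block, BLOCK_SIZE);

    /* Linear probing; a real table is far larger and lives on disk. */
    for (size_t i = 0; i < TABLE_SLOTS; i++) {
        struct ddt_entry *e = &table[(h + i) % TABLE_SLOTS];
        if (e->refcount && e->hash == h) {
            e->refcount++;          /* duplicate: reference the existing block */
            return 1;
        }
        if (e->refcount == 0) {
            e->hash = h;
            e->refcount = 1;
            /* ...actually write the block out to storage here... */
            return 0;
        }
    }
    return 0;                       /* table full: just write it */
}
```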
Forget dedupe, just use ZFS compression; a lot more bang for your buck.
Unless your data-set is highly compressed media files.
In general, even during rsync operations, one often turns off compression for large video files, as compression has little or negative benefit on storage/transfer size while eating RAM and CPU power.
De-duplication is good for Virtual Machine OS images, as the majority of the storage cost is a replicated backing image. =3
When the lookup key is a hash, there's no locality over the megabytes of the table. So don't all the extra memory accesses to support dedup affect the L1 and L2 caches? Has anyone at OpenZFS measured that?
It also occurs to me that spatial locality on spinning-rust disks might suffer too, which would also hurt performance.
Knowing that your storage has really good inline dedupe is awesome and will affect how you design systems. Solid dedupe lets you effectively treat multiple copies of data as symlinks.
Any timing attacks possible on a virtualized system using dedupe?
E.g. find out what my neighbours have installed.
Or if the data before an SSH key is predictable, keep writing that out to disk guessing the next byte or something like that.
I don't think you even need timing attacks if you can read the zpool statistics; you can ask for a histogram of deduped blocks.
Guessing one byte at a time is not possible though because dedupe is block-level in ZFS.
Off topic: any tool to deduplicate files across different external hard disks?
Over the years I made multiple copies of my laptop HDD to different external HDDs, ended up with lots of duplicate copies of files.
How would you want the duplicates resolved? Just reported in some interface or would you want the duplicates deleted off some machines automatically?
There are a few different ways you could solve it but it depends on what final outcome you need.
I wonder why they are having so much trouble getting this working properly with smaller RAM footprints. We have been using commercial storage appliances that have been able to do this for about a decade (at least) now, even on systems with "little" RAM (compared to the amount of disk storage attached).
Just store fingerprints in a database, run through that at night, and fix up the block pointers...
You can also use DragonFlyBSD with Hammer2, which supports both online and offline deduplication. It is very similar to ZFS in many ways. The big drawback though, is lack of file transfer protocols using RDMA.
I've also heard there are some experimental branches that make it possible to run Hammer2 on FreeBSD. But FreeBSD also lacks RDMA support. For FreeBSD 15, Chelsio has sponsored NVMe-oF target and initiator support; I think this is just TCP, though.
> and fixup the block pointers
That's why. Due to reasons[1], ZFS does not have the capability to rewrite block pointers. It's been a long requested feature[2] as it would also allow for defragmentation.
I've been thinking this could be solved using block pointer indirection, like virtual memory, at the cost of a bit of speed.
But I'm by no means a ZFS developer, so there's surely something I'm missing.
[1]: http://eworldproblems.mbaynton.com/posts/2014/zfs-block-poin...
[2]: https://github.com/openzfs/zfs/issues/3582
It looks like they're now leaning more on the indirection machinery (created for vdev removal) for other features. One of the recent summit hackathons sketched out using indirect vdevs to perform rebalancing.
Once you get a lot of snapshots, though, the indirection costs start to rise.
Fixing up block pointers is the one thing ZFS didn't want to do.
So many flaws. I want to see the author repeat this across 100TB of random data from multiple clients; they will quickly realize why this feature exists. One cloud setup I am aware of, on another filesystem, saved 43% of disk space by using dedupe.
No, you won't save much on a client system. That isn't what the feature is made for.
When ZFS first came out I had visions of it being a turnkey RAID-array replacement for nontechnical users. Pop out the oldest disk, pop in a new (larger) one, wait for the pretty lights to change color. Done.
It is very clear that the consumer market was never a priority, and so I wonder what the Venn diagram of 'client system' and 'ZFS filesystem' looks like. Not much overlap, right?
I assume the author is aware of why the feature exists, since they state in the second sentence that they funded the improvement over the course of two years?
My reaction also. Dedupe is a must-have when you are storing hundreds of VMs: you WILL save so much data, and inline dedupe will save a lot of write IO.
It's an odd notion in the age of containers, where dedupe is, like, one of the core things we do (but stupidly: among dissimilar images there are definitely more identical files than different ones).
I tried two of the most non-random archives I had and was disappointed, just like the author. For mail archives, I got 10%. For entire filesystems, I got... just as much as with any other COW, because indeed I duplicate them only once; later shared blocks are all over the place.
Can someone smarter than me explain what happens when, instead of the regular 4KB block size, we use a 16KB or 64KB block size in kernel builds? Or is that only for the memory part? I am confused. Will a larger block size make this thing better or worse?
Generally the smaller the dedupe block the better as you are far more likely to find a matching block. But larger blocks will reduce the number of hashes you have to store. In my experience 4KB is the sweet spot to maximize how much data you save.
So in this case I think it would make sense to have a separate pool where you store large files like media, so you can skip the dedup overhead for them.
Is there an inherent performance loss of using 64kB blocks on FS level when using storage devices that are 4kB under the hood?
My dream Git successor would use either dedupe or a simple cache plus copy-on-write so that repos can commit toolchains and dependencies and users wouldn’t need to worry about disk drive bloat.
Maybe someday…
It does dedupe using SHA-1 on entire files. You might try git-lfs for your use case, though.
OT: does anyone have a good way to dedupe iCloud Photos? Or my Dropbox photos?
digiKam can dedupe on actual similarity (so different resizes and formats of the same image). But it does take some time to calculate all the hashes.
- https://github.com/markfasheh/duperemove
- https://codeberg.org/jbruchon/jdupes / https://www.jdupes.com/
- https://github.com/adrianlopezroche/fdupes
- https://github.com/pauldreik/rdfind
The built in Photos duplicate feature is the best choice for most people: it’s not just generic file-level dedupe but smart enough to do things like take three versions of the same photo and pick the highest-quality one, which is great if you ever had something like a RAW/TIFF+JPEG workflow or mixed full res and thumbnails.
Or better yet: a single photo I take of the kids will be stored in my camera roll. I will then share it with family using three different messengers. Now I have four copies; each of the individual (re-encoded) copies is stored inside those messengers and also backed up. This even happens when sharing the same photo multiple times in different chats within the same messenger.
Is there any way to do de-duplication here? Or just outright delete all the derivatives?
Edit: disregard this, I was wrong and missed the comment deletion window.
HN will automatically redirect the submitter to a recent submission, instead of allowing a new post, if it has a significant number of comments.
https://news.ycombinator.com/newsfaq.html