No file system attributes or metadata on records, which also means no xattrs/fattrs being written or updated, no checks to see whether it's a physical file or a pipe/symlink, no permission checks, no block size alignment mismatches, and a single open call.
Makes sense when you consider you're throwing out functionality and disregarding general purpose design.
If you used a FUSE mapping to SQLite, mounted that directory and accessed it, you'd probably see very similar performance (perhaps even slower) and storage use, as you'd need to add additional columns in the table to track these attributes.
I have no doubt that you could create a custom tuned file system on a dedicated mount with attributes disabled, minimized file table and correct/optimized block size and get very near to this perf.
Let's not forget the simplicity of being able to use shell commands (like rsync) to browse and manipulate those files without running the application or an SQL client to debug.
Makes sense for developers to use SQLite for this use case though for an appliance-type application or for packaged static assets (this is already commonplace in game development - a cab file is essentially the same concept)
For example, Ceph uses RocksDB as its metadata DB, and it's recommended to put it directly on a block device, with the WAL on yet another separate raw device
(smaller block size, 100,000 inode file table entries (tuned to the number of blobs), no journal, no checksumming, no extended file attributes, smaller integer file offset IDs (32-bit padded vs 64-bit))
Then mount it and run the same test.
You could go even further and tune stdio's BUFSIZ to be no greater than 12,000 bytes. You can even create this mount on a file inside your existing mount... which is essentially akin to having an SQLite file without needing a client library to read/write it.
Anyway - if the purpose is to speed up reads and save disk space on small blob files, there is little need to ditch the file system and its many upsides.
Let's also note the 4x speed increase on Windows 10, once again underlining just how slow Windows filesystem calls are compared to direct access and to other (kernel, filesystem) combinations.
NTFS is really horrible at handling many small files. When compiling/watching node modules (easily 10-100k files), we've seen a 10x difference internally (same hardware, just different OSes). At some point that meant a compile time difference of 10-30 sec vs 6-10 min. Not fun.
NTFS is perfectly fine at handling small files and performs on-par with other modern file systems.
The issue is Defender in sync mode/other AV/other file system filters.
DevDrive, as noted, uses ReFS and defaults to an async scanning technique. ReFS will suffer the exact same performance issues with Defender (or other AV/other file system filters) doing its thing when running in sync mode, which it does by default for ReFS-formatted drives in Windows Server.
> Except for CloseHandle(). These calls were often taking 1-10+ milliseconds to complete.
> While I didn't realize it at the time, the cause for this was/is Windows Defender. Windows Defender (and other anti-virus / scanning software) typically work on Windows by installing what's called a filesystem filter driver.
This doesn't take away from your point that _it is slow_, but the reasons are not due to the file system in use.
>The issue is Defender in sync mode/other AV/other file system filters.
I've had folders take a full minute to open on an SSD.
It got to the point where I went to open the folder and it started loading. I needed the file quickly, so I searched for it online, found it, and opened it before Windows finished loading that folder for me.
After exempting that folder from Windows Defender the folder loads instantly. For the life of me I cannot understand why Defender blocks Explorer.
Probably because Explorer hosts shell hooks which can potentially execute arbitrary code.
Just one example: File icons or thumbnails can be dynamically generated by shell extensions based on the file contents. A maliciously crafted file could potentially exploit a vulnerability in such a shell extension.
> For the life of me I cannot understand why Defender blocks Explorer.
I suppose if you wanted to find out, you could use dtrace/ETW.
Explorer has other things going on, though, including other apps that hook into it (shell extensions, like Adobe Reader, TortoiseGit/SVN, and so on) which can certainly cause performance issues.
Technically, they're because of the filesystem in use: it's providing the APIs these garbage-ware utilize... which causes the performance issues ( ◠ ‿ ・ ) —
File system filter drivers apply to all (RW) file systems on Windows. It's not exclusive to NTFS or ReFS.
Windows has an extensible model. It's a different approach from most (all?) other OSes. It offers a different set of features.
Sure, AV could perhaps be done in a different manner that would be more effective/faster, I can't comment on that as I lack the insight required -- only MSFTies that work on kernel code could respond in any authoritative way.
I was of the understanding that these sync APIs are only available on Windows filesystems, so a FAT32-formatted filesystem wouldn't suffer the same performance impact. That's why Windows provides "virtual drives" on their cloud instances that give you extra performance: precisely by formatting them with a filesystem that doesn't support these sync/blocking APIs.
But I'm not particularly knowledgeable either on this topic, just a (forced) consumer of the operating system with the occasional reading on the side
I find this stuff super fun to learn about. My knowledge is far from complete, but the more you know, the more pedantic you can be about things like 'virtual memory' and the myth of setting the min/max page file to 1.5*RAM (that was used to ensure full memory dumps, not for performance).
If you want to know more, grab the book Windows Internals.
I don't know if it was async I/O, AV, NTFS, or what, but some years ago we could demonstrate reliably that running a "big" Maven compile on a lot of .java files was faster on Linux than on Windows.
The kicker... Linux was running inside a VirtualBox VM inside the very same Windows host.
This could also be some variance in the `javac` command between OS's, granted.
The MFT holds resident (in-MFT) metadata about files, or non-resident metadata (external to the MFT, but pointed to by it) if the metadata does not fit within the MFT. The MFT itself is a database; Windows makes use of file system caching, be it in-memory or another caching mechanism, like an on-device disk cache.
So why is it still pretty slow for file operations on DevDrive (no filters, async antivirus)? I found the performance improvement not that significant.
> NTFS is really horrible handling many small files.
To pile onto NTFS, its performance is so notoriously bad that there are developer teams working on Windows projects that configure their build farms to cross-build from Linux to Windows just to avoid the performance penalty.
> Turns out there was zero difference, or negligible (Windows won), between compilation times.
I think there was something seriously flawed in your test. If you Google for a minute, you'll find multiple posts describing how moving the same builds to Linux cut build times by around 40%.
Some anecdotes even compare doing the same builds in Ubuntu with NTFS and see double-digit gains.
NTFS is notoriously awful in scenarios involving reading/writing many small files. This is the bottleneck in Windows builds. There is a myriad of benchmarks documenting this problem.
Nowadays there are plenty of cross-platform projects to serve as benchmarks. Checking this can be as easy as checking out a project, starting a full rebuild, and measuring how long it takes.
To begin with, it takes rustup 3m30s to install on Windows. After rejigging the code again and again away from its naive implementation which works fine on Linux, to perform sympathetically towards NTFS, it takes the same rustup 14s to install. That's quite a performance gain! But it needed a lot of changes to rustup, and to Windows itself.
Yeah, you can work around the filesystem perf problems on Windows to some extent. Unity builds are popular not just because they reduce duplicated compiler work but also because they reduce filesystem access. There is a reason why Windows games tend to pack their assets into archives.
> That's survivor bias. Fewer people blog about unsuccessful initiatives.
I don't understand what point you're trying to make. Should we give equal credit to unsuccessful initiatives when their failure is due to screwing up a critical part of the project?
I mean, the successful attempts document what is possible. What do you expect to report when you couldn't even manage to get something working?
Say there are N initiatives to move to cross-compilation. Of those, there are M that deliver the hoped-for benefits, where M <= N. But practically, M < N significantly, say, M < 0.9 N.
Assume bloggers blog mainly about content that contains a positive message. I'm asserting that people blog more readily about their success than their failures.
So when you look at the blog literature, your population is not N, it is M. You don't see the failures because they don't tell the tale.
Java/.NET pipelines don't have the Windows penalty. Both ship dependencies as compiled modules zipped into large jar/dll files, while even a simple Node project can have tens of thousands of files inside node_modules.
For the most part the two are pretty equivalent in performance, it's particularly constant blocking on small file IO where Windows falls way behind. Also a bit in the "tons of cores" and "tons of networking" categories but not as nearly as bad as small file IO.
You can't have a Windows laptop without running a ton of antivirus and other crapware, e.g. the MS Store, HP/Dell diagnostics, and MS Teams. If you factor that in, Linux will win every time.
Not the case, unfortunately; you can. I'm not defending Windows here (I've had a bitter hatred of Microsoft since they started charging 2k for the compilation tools back in the 90s).
In your particular corporate environment that might be the case, but not in this one: I had free run of a fresh install, no offensive AV, and detuned it to remove the crap.
Other posters have suggested certain optimizations (which I'm not sure would help; it was pure compilation, with no large files that I'm aware of). Just saying, it's always good to keep an open mind.
Were all your tests pure unit tests? Then sure: the OS isn't getting in the way (and if your Linux distribution didn't set the correct CPU scheduler, Windows could be faster!)
If, however, your tests make any filesystem calls or fork a child process, there's only a slim chance that Linux doesn't absolutely trounce Windows.
To throw in a morsel of anecdata: same laptop with a dual boot runs the PhotoStructure core test suite (some 11,000 tests which have a pleasing melange of system and integration tests, not just unit tests) in 8m30s on Windows 11. Almost all the same tests complete in 3m30s on Linux.
Seems weird to base the argument on compiler performance alone. I'd think that developer tools will be a much bigger factor for the time difference but that depends on what you are used to.
Anecdotally though, git is unbearably slow under Windows and compiles make all filesystem operations lag while I have never seen such problems under Linux.
Yeah sorry, I left out a lot of details for a quick anecdote. It was Java compilation, so the different JDKs may've made a difference (almost certainly). But I figured the better file system handling etc. would've gone in favour of Linux, versus Windows.
In any case it wasn't much of a fun codebase. But I think a good lesson was, always test it, always measure. Not casting shade on either OS.
Much of the cost is imposed by semantics and the security model; there's no silver bullet solution that can be imposed without turning it into a different filesystem with different semantics and/or on disk representation.
At one point they planned to replace it with a database filesystem, but that was too complicated and abandoned. That was probably the end of replacement work on NTFS. https://en.wikipedia.org/wiki/WinFS
WinFS was never designed as a replacement, as it rode on top of NTFS. It was a fancy SQL Server database that exposed files as .NET objects for API interaction.
Sans the API piece, think of it like storing blobs in SQL Server, just like SharePoint does.
I was lucky enough to play around with beta 1. Not much you could do with it, though.
> Why wouldn't they work on improving NTFS performance instead?
There are other old-school techniques which are far easier to implement and maintain, such as using RAM drives/partitions. Expensing 32GB of RAM is simpler than maintaining weird NTFS configurations.
Splitting your project into submodules/subpackages also helps amortize the impact of long build times. You can run multiple builds in parallel and then have a last build task to aggregate them all. Everyone can live with builds that take 5 minutes instead of 3.
Even Microsoft doesn't touch NTFS code (if they haven't lost it). All new file system features, like new compression methods, are implemented in layers above NTFS.
It's more a case of Linux programs not being written to work around the performance issues of Windows filesystems and the layers above them. NTFS doesn't offer magical features that fix the performance.
That, and the continuous scanning of all file IO that corporations love so much can still 2x-10x that time. Dev Drive is excluded from Defender scrutiny by default because of that.
I’ve benchmarked deleting files (around 65,000 small node_modules sort of files) and it takes 40 seconds through Explorer, 20 seconds with rd, and roughly a second inside WSL2 (cloned to the VM’s ext4 virtual hard drive).
Their "Windows Dev Drive" setup addresses this. I haven't tested it myself, but I saw a couple of inexpertly executed tests showing significant performance gains. I honestly have no idea if my compile times are quicker.
But the primary improvement comes from async A/V, not the file system.
ReFS offers other improvements such as CoW and instant file initialization which can benefit developers. From this standpoint, it is the correct choice over NTFS.
With the right code, NTFS is not much slower than ext4, for example. Nearly 3% for tar.gz and more than 30% for a heavily multi-threaded use like Pack.
It allows you to do something else while waiting for the request to complete. Linux has historically had bad hacks to enable this. Windows NT has had IOCP since its inception, and not just for storage devices (which io_uring was initially limited to) but also for networking: TCP sockets, mailslots, named pipes, etc.
"Are you done yet?" vs "Get back to me when you're ready, gonna go do something else". Aka blocking i/o vs nonblocking i/o.
(I always wonder why Windows NT doesn't get more high performance implementations for networking et. al. with this feature; licensing? trade off in perf elsewhere? can't tweak kernel params to your heart's desire?)
io_uring added submission/completion ring buffers as a further feature, and Microsoft has since followed with a similar one (I/O rings); I think it was first introduced in a later build of Windows 10 and should be in 11 & Server 2022.
I did some research in a database research lab, and we had a lot of colleagues working on OS research. It was always interesting to compare the constraints and assumptions across the two systems. I remember one of the big differences was the scale of individual records we expected to be working with, which in turn affected how memory and disk were managed. Most relational databases are very much optimized for small individual records and eventual consistency, which allows them to cache a lot more in memory. On the other hand, performance often drops sharply with the size of your rows.
This is precisely why I'm considering appending to a sqlite DB in WAL2 mode instead of plain text log files. Almost no performance penalty for writes but huge advantages for reading/analysis. No more Grafana needed.
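As a sketch of what that could look like (the file name, schema, and pragmas are my own assumptions, not the parent's setup; note that `wal2` mode only exists in SQLite's wal2 branch, so a stock build uses plain WAL):

```python
import os, sqlite3, tempfile, time

# hypothetical log database location
db_path = os.path.join(tempfile.mkdtemp(), "app-log.db")
con = sqlite3.connect(db_path)
con.execute("PRAGMA journal_mode=WAL")    # stock SQLite; wal2 requires the wal2 branch
con.execute("PRAGMA synchronous=NORMAL")  # common durability/throughput tradeoff for logs
con.execute("CREATE TABLE IF NOT EXISTS logs(ts REAL, level TEXT, msg TEXT)")

def log(level, msg):
    # one small transaction per record; batching would cut fsyncs further
    con.execute("INSERT INTO logs(ts, level, msg) VALUES (?, ?, ?)",
                (time.time(), level, msg))
    con.commit()

log("INFO", "service started")
n = con.execute("SELECT count(*) FROM logs WHERE level = 'INFO'").fetchone()[0]
```

Queries for analysis then become ordinary SQL instead of grep pipelines.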
Careful, some people will be along any second pointing out your approach limits your ability to use "grep" and "cat" on your log after recovering it to your pdp-11 running in your basement. Also something about the "Unix philosophy" :p
Seriously though, I think this is a great idea, and would be interested in how easy it is to write sqlite output adaptors for the various logging libraries out there.
Interesting, that's a requirement for an out of band tailing tool.
Maybe a user-defined function bound to an INSERT trigger. But I believe functions are also connection-specific, which is fine for the tail tool; what happens, though, when the connection that defined the function goes away?
> Careful, some people will be along any second pointing out your approach limits your ability to use "grep" and "cat" on your log after recovering it
I wish Splunk and friends would have an interface like that. Sure it does basic grep, and it is a much more powerful language, but sometimes you just needed some command line magic to find what you wanted.
I've learned so much about Splunk this month. I hate it. The UX is hot garbage. Why are settings scattered everywhere? Why does a simple word search not return any results? Why is there no obvious way to confirm data is being forwarded; like actual packets, not just what connections are configured.
I've been doing this for years. I keep SQLite log databases at varying grains depending on the solution. One for global logs, one per user/session/workflow, etc. I've also done things like in-memory SQLite trace databases that only get written to disk if an exception occurs.
SQLite doesn't look like a good fit for large logs - nothing beats linear writes, at least on HDD, and with plain text logs you get that (linear writes of compressed data are even better, but little software supports that out of the box). With SQLite I would expect more write requests for the same stream of logs (maybe not many more). Read analysis will be faster than grep over a plain text log only if you create indices, which add write cost (and space overhead).
ClickHouse works really well when you need to store and analyze large logs, but compared to SQLite it requires maintaining a server (or several). There is DuckDB, which is embedded like SQLite and could be a better fit than SQLite for logs, but I have no experience with DuckDB.
Logs can differ, but most of the time they are closer to analytics than to transaction processing. Also, the way to get a fast SELECT in SQLite (and most traditional relational databases) is to use indexes, and indexes IMHO don't work well with logs - you may need to use any column in a WHERE condition, and having all columns indexed requires a lot of space and reduces write speed. Additionally, many columns in logs have low cardinality, so an index doesn't significantly reduce query time compared to a full scan. So with logs one often has to use full-scan queries, and full scans are where columnar storage works much better than row-oriented storage. A columnar store also gets a better compression ratio, which saves space (often important for logs) and increases read speed.
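A quick way to see the full-scan behaviour described above (the schema here is made up for illustration): SQLite's EXPLAIN QUERY PLAN shows whether a WHERE clause can use an index, and a substring LIKE on an unindexed message column always forces a scan of the whole table.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE logs(ts INTEGER, level TEXT, msg TEXT)")
con.execute("CREATE INDEX idx_level ON logs(level)")  # low-cardinality column
con.executemany("INSERT INTO logs VALUES (?, ?, ?)",
                [(i, "INFO" if i % 2 else "ERROR", "msg") for i in range(1000)])

# A '%substring%' match cannot use any index, so the plan reports a SCAN
# over the whole table regardless of what indexes exist.
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM logs WHERE msg LIKE '%timeout%'"
).fetchall()
scan = any("SCAN" in row[-1] for row in plan)
```

This is exactly the access pattern columnar engines are built for.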
I recently had the idea to record every note coming out of my digital piano in real-time. That way if I come up with a good idea when noodling around I don’t have to hope I can remember it later.
I was debating what storage layer to use and decided to try SQLite because of its speed claims — essentially a single table where each row is a MIDI event from the piano (note on, note off, control pedal, velocity, timestamp). No transactions, just raw inserts on every possible event. It so far has worked beautifully: it’s performant AND I can do fun analysis later on, e.g. to see what keys I hit more than others or what my average note velocity is.
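A minimal sketch of such a recorder (the table and column names here are my invention, not the parent's actual schema):

```python
import sqlite3, time

con = sqlite3.connect(":memory:")  # a real recorder would point this at a file
con.execute("""CREATE TABLE midi_events(
    ts REAL, kind TEXT, note INTEGER, velocity INTEGER)""")

def on_event(kind, note, velocity):
    # one plain INSERT per incoming event, no explicit transaction
    con.execute("INSERT INTO midi_events VALUES (?, ?, ?, ?)",
                (time.time(), kind, note, velocity))

on_event("note_on", 60, 96)   # middle C pressed
on_event("note_off", 60, 0)   # and released
count = con.execute("SELECT count(*) FROM midi_events").fetchone()[0]
```

The later analysis ("which keys do I hit most?") is then a GROUP BY away.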
If you play ten-note chords - one for each finger - in quick succession, that can rack up a lot of inserts in a short time period (say, as a medium-worst case, 100 Hz for playing a chord like that five times per second, counting both “on” and “off” events).
It’s also worth taking into consideration damper pedal velocity changes. When you go from “off” (velocity 0) to fully “on” and depressed (velocity 127), a lot of intermediate values will get fired off at high frequency.
Ultimately though, you are right; that event rate is nowhere near enough to overload SQLite (or a file system), probably by several orders of magnitude.
A 1/100th of a second is still an eternity for a modern computer. It's easy to forget just how insanely fast computers are when user-facing software still lags all the time. You really do need to go out of your way to make performance an issue here (or stack too many abstractions).
> The performance difference arises (we believe) because when working from an SQLite database, the open() and close() system calls are invoked only once, whereas open() and close() are invoked once for each blob when using blobs stored in individual files. It appears that the overhead of calling open() and close() is greater than the overhead of using the database
When something built on top of the filesystem is "faster" than the filesystem, it just means "when you use the filesystem in a less-than-optimal manner, it will be slower than an app that uses it in a sophisticated manner." An interesting point, but perhaps obvious...
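To make the comparison concrete, here's a rough, unscientific sketch of the two access patterns (blob count and sizes are arbitrary): one open()/write()/close() per blob on the filesystem side, versus a single file and one transaction on the SQLite side.

```python
import os, sqlite3, tempfile, time

blobs = [os.urandom(256) for _ in range(1000)]
tmp = tempfile.mkdtemp()

t0 = time.perf_counter()
for i, data in enumerate(blobs):          # one open() + close() per blob
    with open(os.path.join(tmp, f"{i}.bin"), "wb") as f:
        f.write(data)
fs_secs = time.perf_counter() - t0

con = sqlite3.connect(os.path.join(tmp, "blobs.db"))
con.execute("CREATE TABLE b(id INTEGER PRIMARY KEY, data BLOB)")
t0 = time.perf_counter()
with con:                                  # the database file is opened once
    con.executemany("INSERT INTO b VALUES (?, ?)", enumerate(blobs))
db_secs = time.perf_counter() - t0

back = con.execute("SELECT data FROM b WHERE id = 0").fetchone()[0]
```

The absolute numbers depend heavily on OS, filesystem, and caching, which is the whole point of the thread.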
To read/write blobs, you have to serialize/deserialize your objects to bytes. This process is not only tedious, but also varies for different objects and it's not a first-class citizen in other tools, so serialization kept breaking as my dependencies upgraded.
As my app matured, I found that I often wanted hierarchical folder-like functionality. Rather than recreating this mess in db relationships, it was easier to store the path and other folder-level metadata in sqlite so that I could work with it in Python. E.g. `os.listdir(my_folder)`.
Also, if you want to interact with other systems/services, then you need files. SQLite can't be read over NFS (e.g. AWS EFS), and by design it has no server for requests, so I found myself caching files to disk for export/import.
SQLite has some settings for handling parallel requests from multiple services, but when I experimented with them I always wound up with a locked db due to competing requests.
For one reason or another, you will end up with hybrid (blob/file) ways of persisting data.
> As my app matured, I found that I often wanted hierarchical folder-like functionality. Rather than recreating this mess in db relationships, it was easier to store the path and other folder-level metadata in sqlite so that I could work with it in Python. E.g. `os.listdir(my_folder)`.
This is a silly argument; there's no reason to recreate the full hierarchy. If you have something like this:
CREATE TABLE files (path TEXT UNIQUE COLLATE NOCASE);
Then you can do this:
SELECT path FROM files WHERE path LIKE "./some/path/%";
This gets you everything in that path and everything in the subpaths (if you just want from the single folder, you can always just add a `directory` column). I benchmarked it using hyperfine on the Linux kernel source tree and a random deep folder: `/bin/ls` took ~1.5 milliseconds, the SQLite query took ~3.0 milliseconds (this is on a M1 MacBook Pro).
The reason it's fast is because the table has a UNIQUE index, and LIKE uses it if you turn off case-sensitivity. No need to faff about with hierarchies.
EDIT: btw, I am using SQLite for this purpose in a production application, couldn't be happier with it.
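For what it's worth, the trick is easy to verify end to end (the paths below are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE files (path TEXT UNIQUE COLLATE NOCASE)")
con.executemany("INSERT INTO files VALUES (?)", [
    ("./some/path/a.txt",),
    ("./some/path/sub/b.txt",),
    ("./other/c.txt",),
])
# LIKE is case-insensitive by default, which lets SQLite satisfy the
# prefix match from the NOCASE unique index instead of scanning.
rows = sorted(r[0] for r in con.execute(
    "SELECT path FROM files WHERE path LIKE './some/path/%'"))
```

The query returns the folder's contents and everything beneath it, matching the parent's description.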
I'm sure your suggestion exhibits creative step-one thinking, but I've described several reasons why it doesn't make sense to recreate a filesystem in SQLite, the combination of which should make it clear why doing so is naive
> To read/write blobs, you have to serialize/deserialize your objects to bytes. This process is not only tedious, but also varies for different objects and it's not a first-class citizen in other tools, so things kept breaking as my dependencies upgraded.
I'm confused what you mean by this. Files also only contain bytes, so that serialization/deserialization has to happen anyway?
Also note that SQLite does have an incremental blob I/O API (sqlite3_blob_xxx), so unlike most other RDBMS there is no need to read/write blobs as a contiguous piece of memory - handling large blobs is more reasonable than in those. Though the blob API is still separate from normal querying.
Do you have or know of a clear example of how to do this? I have to ask because I spent half of yesterday trying to make it work. The blob_open command wouldn't work until I set a default value on the blob column and then the blob_write command wouldn't work because you can't resize a blob. It was very weird but I'm pretty confident it's because I'm missing something stupid.
I’m pretty sure you can’t resize blobs using this API. The intended usage is to insert/update a row using bind_zeroblob and then update that in place (sans journaling) using the incremental API. It’s a major limitation especially for writing compressed data.
MS SQL Server: READTEXT, WRITETEXT, substring, UPDATE.WRITE
Oracle: DBMS_LOB.READ, DBMS_LOB.WRITE
PG: Large Objects
Most of my experience is with SQL Server, and it can stream large objects incrementally through a web app to a browser without loading the whole thing into memory, at hundreds of MB/sec on normal hardware.
I wasn’t aware of PG Large Objects, though I do know about MSSQL Filestream, which iirc are actual files on disk and remotely transparently accessed via SMB - which in this context would be the moral equivalent of storing just the path in a SQLite db. The functionality you quote seems to be deprecated?
SQL Server has image, text, varbinary(max) and varchar(max); none of those except varbinary(max) are FILESTREAM-enabled, and varbinary is only FILESTREAM if that is set up and the column is declared FILESTREAM when created; otherwise it's stored in the DB like the others.
image and text with READTEXT and WRITETEXT are deprecated but still work fine; varbinary(max) with SUBSTRING and UPDATE .WRITE are the modern equivalent and use the same implementations underneath.
FILESTREAM allows blobs larger than 2 GB and stores the blob data in the filesystem, but it is otherwise accessed like the other blobs, along with some special capabilities like getting an SMB file pointer from the client for direct access. It is also backed up and replicated like normal, so it's definitely not just like storing a path in the DB.
FILESTREAM performs worse than in-DB storage for blobs under about 100 KB, I believe, where keeping them in the DB is recommended for max perf.
I used MSSQL blobs long before FILESTREAM existed and it works well except for the 2 GB limit. Once the DB gets large, backup and log management get more unwieldy than if you just stored the blobs outside the DB, but it does keep them transactionally consistent, which FILESTREAM also does.
You can always use a chunked API to read data into a contiguous buffer. Just allocate a large enough contiguous block of memory, then copy chunk by chunk into that contiguous memory until you've read everything.
> so i found myself caching files to disk for export/import
Could use a named pipe.
I’m reminded of what I often do at the shell with psub in fish. psub -f creates and returns the path to a fifo/named pipe in $TMPDIR and writes stdin to that; you’ve got a path but aren’t writing to the filesystem.
e.g. you want to feed some output to something that takes file paths as arguments. We want to compare cmd1 | grep foo and cmd2 | grep foo. We pipe each to psub in command substitutions: `diff (cmd1 | grep foo | psub) (cmd2 | grep foo | psub)`.
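The same idea works from Python with os.mkfifo on POSIX systems: the consumer gets a real path without anything landing on disk. A sketch, with a thread standing in for the producing process:

```python
import os, tempfile, threading

path = os.path.join(tempfile.mkdtemp(), "export.fifo")
os.mkfifo(path)  # POSIX-only; opening either end blocks until both connect

def producer():
    with open(path, "wb") as w:
        w.write(b"exported rows\n")

t = threading.Thread(target=producer)
t.start()
with open(path, "rb") as r:   # any tool that takes a file path can read this
    data = r.read()
t.join()
```

This avoids the export/import caching the grandparent describes, at the cost of requiring the producer and consumer to run concurrently.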
> As my app matured, I found that I often wanted hierarchical folder-like functionality. Rather than recreating this mess in db relationships, it was easier to store the path in sqlite and work with it in Python. E.g. `os.listdir(my_folder)`
This makes total sense, and it is also "frowned upon" by people who take too purist a view of databases
(Until the time comes to back up, or extract files, or grow a hard drive, etc., and then you figure out how you shot yourself in the foot)
> As my app matured, I found that I often wanted hierarchical folder-like functionality.
In the process of prototyping some "remote" collaborative file systems, I always wonder whether it is a good idea to maintain a flat map from paths joined with "/" (like S3 does) to the file content, in terms of efficiency or elegance.
All fs/drive access is managed by the OS. No DB systems have raw access to sectors or direct raw access to files.
Having a database file on the disk gives you a "cluster" of successive blocks on the hard drive (if it's not fragmented), resulting in relatively short head movements to seek the necessary sectors. The same sectors stay occupied even after vast insert/write/delete operations; there is absolutely no change in the DB file's position on the hard drive. (This doesn't matter with SSDs, though.)
So, the following apply:
client -> DB -> OS -> Filesystem
I think you can already see the DB part is an extra layer. So if one didn't have it, it would be "faster" in terms of execution time. Always.
If it's slower, then you're using non-optimal settings for your use case/filesystem.
My father did this once. He took H2 and made it even faster :) incredibly fast on Windows in a direct comparison of stock H2 vs the modified H2 with the same data.
So a DBMS is built around design decisions that serve certain domains and their problems. Having one is convenient, but that doesn't mean it's the most optimized way of doing things.
> No DB systems have raw access to sectors or direct raw access to files.
Oracle can use raw disks without having a filesystem on them, though it's more common to use ASM (Automatic Storage Management) which is Oracle's alternative to raw or OS managed disks.
SQLite can be faster than the file system for small files. For big files, it can do more than 1 GB/s.
On Pack [1], I benchmarked these speeds, and you can go very fast. It can be even 2X faster than tar [2].
In my opinion, SQLite could be faster for big reads and writes too, but the team didn't optimise that path as much (e.g. by loading the whole content into memory), as maybe it was not the main use of the project. My hope is that we will see even faster speeds in the future.
Let's assume that filesystems are fairly optimized pieces of software. Let's assume that the people building them heard of databases and at some point along the way considered things like the costs of open/close calls.
How much more performance could you get by bypassing the filesystem and writing directly to the block device? Of course, you'd need to effectively implement your own filesystem, but you'd be able to optimise it more for the specific workload.
Because SQLite not being an FS is apparently the reason why it’s fast:
> The performance difference arises (we believe) because when working from an SQLite database, the open() and close() system calls are invoked only once, whereas open() and close() are invoked once for each blob when using blobs stored in individual files.
Additionally, it may also be that accessing a single file allows the OS to efficiently retrieve (and, IIRC in the case of Windows, predict) the working set, reducing access times; that is not the case if you open multiple files.
Not so much working set, as you only have to check the access control once. Windows does a lot of stuff when you open a file, and creating a process is even worse.
Proxmox puts the VM configuration information in a SQLite database and exposes it through a FUSE file system. It even gets replicated across the cluster using their replication algorithm. It’s a bespoke implementation, but it’s a SQLite-backed filesystem.
I remember reading someone's comments about how, instead of databases using their own data serialisation formats for persistence and then optimizing writes and reads over those, they should just use the FS directly and take advantage of all the optimisations built by FS authors.
I wish I could find that comment, because my explanation doesn't do it justice. Very interesting idea, someone's probably going to explain how it's already been tried in some old IBM database a long time ago and failed due to whatever reason.
I still think it should be tried with newer technologies though, sounds like a very interesting idea.
> instead of databases using their own data serialisation formats for persistence and then optimizing writes and read over that they should just utilize the FS directly and let all of the optimisations built by FS authors be taken advantage of.
The original article effectively argues the opposite: if your use case matches a database, then that will be way faster. Because the filesystem is both fully general, multi-process and multi-user, it's got to be pessimistic about its concurrency.
This is why e.g. games distribute their assets as massive blobs which are effectively filesystems - better, more predictable seek performance. Starting from the Doom WAD onwards.
For an example of databases that use the file system, both the mbox and maildir systems for email probably count?
ReiserFS was built on the premise that doing a filesystem right could get us to a point where the filesystem is the database for a lot of use cases.
It's now "somewhat possible" in that modern filesystems are overall mostly less broken at handling large numbers of small (or at least moderately small) files than they used to be.
But databases are still far more optimized for handling small pieces of data in the ways we want to handle data we put into databases, which typically also includes a need to index etc.
As far as I can remember MongoDB did not have any dedicated block caching mechanism in its earlier releases.
They basically mmap'ed the database file and argued that the OS cache should do its job. Which makes sense, but I guess it did not perform as well as a fine-tuned caching mechanism.
Early MongoDB design choices are probably not great to call out for anything other than ignorance. mmap is a very naive view on how to easily work with data but it falls over pretty hard for any situation where ensuring your data doesn't get corrupted is important. MongoDB has come a long way, but its early technical decisions were not based on technical insight.
What do you expect the value proposition of something loosely described as a sqlitefs to be?
One of the main selling points of SQLite is that you can statically link it into a binary and no one needs to maintain anything between the OS and the client application. I'm talking about things like versioning.
What benefit would there be to replace a library with a full blown file system?
The idea of embedding SQLite in the kernel reminds me of IBM OS/400 (the operating system of the IBM AS/400, nowadays known as IBM i). It contains a built-in relational database, although exactly how deeply integrated it is remains unclear, due to the lack of details about its inner workings.
Putting a relational database in the OS kernel is an interesting violation of standard layering. Obviously has the potential to unleash a lot of issues, but also could possibly enable novel features.
I don't think SQLite can run on a block device out of the box: it needs locking primitives and a second file for the journal or WAL, plus a shared memory file in WAL mode.
No file system attributes or metadata on records which also means no (xattrs/fattrs) being written or updated, no checks to see if it's a physical file or a pipe/symlink, no permission checks, no block size alignment mismatches, single open command.
Makes sense when you consider you're throwing out functionality and disregarding general purpose design.
If you use a FUSE mapping to SQLite, mount that directory and access it, you'd probably see very similar performance (perhaps even worse) and storage use, as you'd need to add additional columns in the table to track these attributes.
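A sketch of what the backing table for such a FUSE mapping might look like (column names are invented; a real implementation would also need symlink targets, link counts, and directory entries):

```python
import sqlite3

db = sqlite3.connect(":memory:")
# One row per file; the extra columns approximate the metadata a FUSE
# layer would have to track to look like a real filesystem.
db.execute("""
CREATE TABLE files (
    path  TEXT PRIMARY KEY,   -- stands in for the directory entry
    mode  INTEGER,            -- permissions + file type (regular/symlink/...)
    uid   INTEGER,
    gid   INTEGER,
    mtime REAL,
    data  BLOB
)""")
db.execute("INSERT INTO files VALUES (?,?,?,?,?,?)",
           ("/etc/example", 0o100644, 0, 0, 0.0, b"hello"))
row = db.execute("SELECT mode, data FROM files WHERE path = ?",
                 ("/etc/example",)).fetchone()
print(row)
```

Every one of those columns is storage and per-access work that the plain blob-in-a-table benchmark simply skips, which is the point being made above.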
I have no doubt that you could create a custom tuned file system on a dedicated mount with attributes disabled, minimized file table and correct/optimized block size and get very near to this perf.
Let's not forget the simplicity of being able to use shell commands (like rsync) to browse and manipulate those files without running the application or an SQL client to debug.
Makes sense for developers to use SQLite for this use case though for an appliance-type application or for packaged static assets (this is already commonplace in game development - a cab file is essentially the same concept)
> tuned FS + dedicated mount
For example, Ceph uses RocksDB as its metadata DB, and it's recommended to put it directly on a block device, with the WAL on yet another separate raw device
https://docs.ceph.com/en/latest/rados/configuration/bluestor...
More just this:
mke2fs -t ext4 -b 1024 -N 100000 -O ^has_journal,^uninit_bg,^ext_attr,^huge_file,^64bit [/dev/sdx]
(smaller block size; 100,000 inode table entries, tuned to the number of blobs; no journal; no checksumming; no extended file attributes; smaller integer file offsets, 32-bit instead of 64-bit)
Then mount it and run the same test.
You could go even further and tune the stdio buffer size (BUFSIZ, via setvbuf) to be no greater than 12,000 bytes. You can even create this mount on a file inside your existing mount... which is essentially akin to having an SQLite file without needing a client library to read/write to it.
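Under those assumptions, the loopback-file variant of the experiment might look like this (sizes and paths are illustrative; the mount steps need root, so they are left commented):

```shell
# Create a 256 MB backing file on the existing mount.
dd if=/dev/zero of=blobfs.img bs=1M count=256 status=none

# Small blocks, a fixed inode budget, and the optional features from
# the comment above disabled.
mke2fs -F -q -t ext4 -b 1024 -N 100000 \
       -O ^has_journal,^uninit_bg,^ext_attr,^huge_file,^64bit blobfs.img

# Mounting needs root; afterwards, rerun the blob test on /mnt/blobfs:
#   sudo mkdir -p /mnt/blobfs
#   sudo mount -o loop,noatime blobfs.img /mnt/blobfs
```

The image file then plays the same role as the single SQLite file: one object on the outer filesystem holding many blobs, but still browsable with ordinary shell tools once mounted.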
Anyway - if the purpose is to speed up reads and save disk space on small blob files, there is little need to ditch the file system and its many upsides.
> If you use a fuse mapping to SQLite, mount that directory and access it
Related ongoing discussion, if someone cares to test this:
https://news.ycombinator.com/item?id=41085856
Some nice write up on this topic, although not the resolution of all discussion. But rather hints and guesses to keep it alive :)
https://unixdigest.com/articles/battle-testing-php-fopen-sql...
Has been on HN lately..
That's a very rigorously written article.
Let's also note the 4x speed increase on Windows 10, once again underlining just how slow Windows filesystem calls are compared to direct access and to other (kernel, filesystem) combinations.
Maybe the malware detection program adds to the performance gap as well.
NTFS is really horrible at handling many small files. When compiling/watching node modules (easily 10-100k files), we've seen a 10x speed difference internally (same hardware, just different OSes). At some point that meant a compile time difference of 10-30 sec vs 6-10 min. Not fun.
NTFS is perfectly fine at handling small files and performs on-par with other modern file systems.
The issue is Defender in sync mode/other AV/other file system filters.
Dev Drive, as noted, uses ReFS and by default an async scanning technique. ReFS will suffer the exact same performance issues with Defender (or other AV/other file system filters) doing its thing when running in sync mode, which it does by default for ReFS-formatted drives in Windows Server.
https://gregoryszorc.com/blog/2021/04/06/surprisingly-slow/
https://news.ycombinator.com/item?id=26737521
> Except for CloseHandle(). These calls were often taking 1-10+ milliseconds to complete.
> While I didn't realize it at the time, the cause for this was/is Windows Defender. Windows Defender (and other anti-virus / scanning software) typically work on Windows by installing what's called a filesystem filter driver.
This doesn't take away from your point that _it is slow_, but the reasons are not due to the file system in use.
>The issue is Defender in sync mode/other AV/other file system filters.
I've had folders take a full minute to open on an SSD.
It got to the point where I went to open the folder, it started loading. I needed the file quickly, so I searched for it online, found it, and opened it before windows finished loading that folder for me.
After exempting that folder from Windows Defender the folder loads instantly. For the life of me I cannot understand why Defender blocks Explorer.
Probably because Explorer hosts shell hooks which can potentially execute arbitrary code.
Just one example: File icons or thumbnails can be dynamically generated by shell extensions based on the file contents. A maliciously crafted file could potentially exploit a vulnerability in such a shell extension.
> For the life of me I cannot understand why Defender blocks Explorer.
I suppose if you wanted to find out, you could use dtrace/ETW.
Explorer has other things going on, though, including other apps that hook into it (shell extensions, like Adobe Reader, TortoiseGit/SVN, and so on) which can certainly cause performance issues.
Technically, they're because of the filesystem in use: it's providing the APIs these garbage-ware utilize... which causes the performance issues ( ◠ ‿ ・ ) —
File system filter drivers apply to all (RW) file systems on Windows. It's not exclusive to NTFS or ReFS.
Windows has an extensible model. It's a different approach from most (all?) other OSes. It offers a different set of features.
Sure, AV could perhaps be done in a different manner that would be more effective/faster, I can't comment on that as I lack the insight required -- only MSFTies that work on kernel code could respond in any authoritative way.
I was of the understanding that these sync APIs are only available on Windows filesystems, so a FAT32-formatted filesystem wouldn't suffer the same performance impact. That's also why Windows provides "virtual drives" for performance on their cloud instances: they gain extra performance precisely by formatting them with a filesystem that doesn't support these sync/blocking APIs.
But I'm not particularly knowledgeable either on this topic, just a (forced) consumer of the operating system with the occasional reading on the side
Azure Temporary Storage disks are stored on the host hypervisor machine. Non-temp storage is stored in Azure Storage. That's the only difference.
FAT[32] does implement minifilters.
https://learn.microsoft.com/en-us/windows-hardware/drivers/i...
https://www.osr.com/nt-insider/2019-issue1/the-state-of-wind...
Thanks for broadening my horizon, esp. that first link was extremely interesting!
I guess my mental model on this was just lacking/simply wrong.
I find this stuff super fun to learn about. My knowledge is far from complete but the more you know, the more pedantic you can be about things like 'virtual memory' and the misnomer of setting the min/max page file to 1.5*RAM (was used to ensure full memory dumps, not for performance).
If you want to know more, grab the book Windows Internals.
I don't know if it was async I/O, AV, NTFS, or what, but some years ago we could demonstrate reliably that running a "big" maven compile on a lot of .java files was faster in Linux than Windows.
The kicker... Linux was running inside a VirtualBox VM inside the very same Windows host.
This could also be some variance in the `javac` command between OS's, granted.
Doesn't it (Windows VFS layer) also lack the equivalent of dentry cache, making all metadata lookups slow?
The MFT describes resident (in-MFT) metadata about files, or non-resident (external to the MFT, but the MFT has a pointer) if the metadata does not fit within the MFT. The MFT itself is a database; Windows makes use of file system caching, be it in-memory or another caching mechanism, like on-device disk cache.
So why is it still pretty slow for file operations on DevDrive (no filters, async antivirus)? I found the performance improvement not that significant.
Doesn't match what I found.
> NTFS is really horrible handling many small files.
To pile onto NTFS, its performance is so notoriously bad that there are developer teams working on Windows projects that configure their build farm to do cross builds from Linux to Windows just to avoid the performance penalty.
As an anecdote, we had a really long build time for our pipeline (going back prob 15 years).
I argued for a Linux laptop, and the boss said, "OK, prove it. Here's two equivalent laptops, time it.".
Turns out there was zero difference, or negligible (Windows won), between compilation times. That has always annoyed me.
> Turns out there was zero difference, or negligible (Windows won), between compilation times.
I think there was something seriously flawed in your test. If you Google for a minute, you find multiple posts on how moving the same builds to Linux led to performance improvements in the range of 40% drops in build times.
Some anecdotes even compare doing the same builds in Ubuntu with NTFS and still see double-digit gains.
NTFS is notoriously awful in scenarios involving reading/writing many small files. This is the bottleneck in Windows builds. There is a myriad of benchmarks documenting this problem.
Nowadays there are plenty of cross-platform projects to serve as benchmarks. Checking this can be as easy as checking out a project, start a full rebuild, and check how long it takes.
NTFS can be especially awful if you're used to Linux behaviour, port code over to Windows and expect the same performance.
Here's a talk on porting rustup to Windows: https://www.youtube.com/watch?v=qbKGw8MQ0i8
To begin with, it takes rustup 3m30s to install on Windows. After rejigging the code again and again away from its naive implementation which works fine on Linux, to perform sympathetically towards NTFS, it takes the same rustup 14s to install. That's quite a performance gain! But it needed a lot of changes to rustup, and to Windows itself.
Yeah, you can work around the filesystem perf problems on Windows to some extent. Unity builds are not just popular because they reduce duplicated compiler work but also because they reduce filesystem access. There is a reason why Windows games tend to pack their assets into archives.
That's survivor bias. Fewer people blog about unsuccessful initiatives.
> That's survivor bias. Fewer people blog about unsuccessful initiatives.
I don't understand what point you're trying to make. Should we give equal credit to unsuccessful initiatives when their failure is due to screwing up a critical part of the project?
I mean, the successful attempts document what is possible. What do you expect to report when you couldn't even manage to get something working?
Say there are N initiatives to move to cross-compilation. Of those, there are M that deliver the hoped-for benefits, where M <= N. But practically, M < N significantly, say, M < 0.9 N.
Assume bloggers blog mainly about content that contains a positive message. I'm asserting that people blog more readily about their success than their failures.
So when you look at the blog literature, your population is not N, it is M. You don't see the failures because they don't tell the tale.
It seems possible that their pipeline just had large files, or something like that.
Java/.NET pipelines don't pay the Windows penalty. Both have dependencies as compiled modules zipped into large jar/dll files, while even a simple Node project has tens of thousands of files inside its node_modules folder.
For the most part the two are pretty equivalent in performance, it's particularly constant blocking on small file IO where Windows falls way behind. Also a bit in the "tons of cores" and "tons of networking" categories but not as nearly as bad as small file IO.
Well yeah, if you avoid the OS (syscalls) as much as possible then all operating systems will have about the same performance.
Love that your boss was open to the idea but also wanted data to back it up. (imagine a slow clap)
A single datapoint from an experiment done by only one person you call it data? What's not to love...
It was the only data point which mattered for the decision that needed to be made.
Yep, was a good experience. I'm sure with fine tuning it (Linux) could've been better, but I ate that humble pie.
You can't have a Windows laptop without running a ton of antivirus and other crapware, e.g. MS Store, HP/Dell diagnostics, and MS Teams. If you factor that in, Linux will win every time.
If your company has windows and Linux but only guards against threats on windows, then you may have a different problem.
I believe you can lock a Linux box down tighter than a windows box, but then you're trading compile times for other costs.
Not the case here, fortunately; you can. I'm not defending Windows here (bitter hatred of Microsoft here, since they started charging 2k for the compilation tools back in the 90s).
In your particular corporate environment that might be the case, but not in this case, I had free run of a fresh install and no offensive AV there, and detuned to remove the crap.
Other posters have said certain optimizations (which I'm not sure would help, it was pure compilation, no large files that I'm aware of). Just saying, always good to keep an open mind.
Were all your tests pure unit tests? Then sure—the OS isn’t getting in the way (and if your Linux distribution didn’t set the correct cpu scheduler, windows could be faster!)
If, however, your tests make any filesystem calls or fork a child process, there’s slim chances that Linux doesn’t absolutely trounce Windows.
To throw in a morsel of anecdata: same laptop with a dual boot runs the PhotoStructure core test suite (some 11,000 tests which have a pleasing melange of system and integration tests, not just unit tests) in 8m30s on Windows 11. Almost all the same tests complete in 3m30s on Linux.
Seems weird to base the argument on compiler performance alone. I'd think that developer tools will be a much bigger factor for the time difference but that depends on what you are used to.
Anecdotally though, git is unbearably slow under Windows and compiles make all filesystem operations lag while I have never seen such problems under Linux.
Yeah sorry, I left out a lot of details for a quick anecdote. It was Java compilation, so the different JDK's may've made a difference (almost certainly). But I figured the better file system handling etc etc would've gone in favour of Linux, versus Windows.
In any case it wasn't much of a fun codebase. But I think a good lesson was, always test it, always measure. Not casting shade on either OS.
Do you remember which file system you used on Linux?
Didn't they introduce ReFS ("dev drives") to alleviate this?
Why wouldn't they work on improving NTFS performance instead?
Much of the cost is imposed by semantics and the security model; there's no silver bullet solution that can be imposed without turning it into a different filesystem with different semantics and/or on disk representation.
At one point they planned to replace it with a database filesystem, but that was too complicated and abandoned. That was probably the end of replacement work on NTFS. https://en.wikipedia.org/wiki/WinFS
WinFS was never designed as a replacement as it rode on top of NTFS. It was a fancy SQL Server database that exposed files as .NET Objects for API interaction.
Sans the API piece, think of it like storing blobs in SQL Server, just like SharePoint does.
I was lucky enough to play around with beta 1. Not much you could do with it, though.
> Why wouldn't they work on improving NTFS performance instead?
There are other old-school techniques which are far easier to implement and maintain, such as using RAM drives/partitions. Expensing 32GB of RAM is simpler than maintaining weird NTFS configurations.
Splitting your project into submodules/subpackages also helps amortize the impact of long build times. You can run multiple builds in parallel and then have a last build task to aggregate them all. Everyone can live with builds that take 5 minutes instead of 3.
Even Microsoft doesn't touch NTFS code (if they haven't lost it). All new file system features like new compression methods are implemented on layers above ntfs.
Because likely they a) aren't allowed to, and/or b) have no file system code experience or understanding.
That may be due to a combination of malware detection plus most Unix programs not really being written to take advantage of the features NTFS has to offer.
This is a great talk on the topic https://youtu.be/qbKGw8MQ0i8?si=rh6WJ3DV0jDZLddn
It's more a case of Linux programs not being written to work around the performance issues of Windows filesystems + layers above them. NTFS doesn't offer magical features that fix the performance.
Must be why windows 11 added that dev drive feature
That, and continuous scanning of all file IO corporations love so much can 2x-10x that time still. Dev drive is excluded from Defender scrutiny by default because of that.
Not excluded, but the scanning in a dev drive is asynchronous rather than synchronous.
It’s somewhat more complex than “NTFS is slow”. Here’s a good explanation: https://github.com/Microsoft/WSL/issues/873#issuecomment-425...
I’ve benchmarked deleting files (around ~65,000 small node_modules sort of files) and it takes 40 seconds through Explorer, 20 seconds with rd, and roughly a second inside WSL2 (cloned to the VM’s ext4 virtual hard drive).
Why not use https://github.com/bobranten/Ext4Fsd ??
So we should put node_modules into SQLite database?
Honestly, maybe? Would be a fun project to look at at least.
Their "windows dev drive" setup addresses this. I haven't tested it myself but I saw a couple of inexpertly executed tests showing significant performance gains. I honestly have no idea if my compile times are quicker.
You can get benchmarks from the horse's mouth.
https://devblogs.microsoft.com/visualstudio/devdrive/
But the primary improvement comes from async A/V, not the file system.
ReFS offers other improvements such as CoW and instant file initialization which can benefit developers. From this standpoint, it is the correct choice over NTFS.
https://devblogs.microsoft.com/engineering-at-microsoft/dev-...
With the right code, NTFS is not much slower than ext4, for example: nearly 3% slower for tar.gz, and more than 30% for a heavily multi-threaded use like Pack.
https://forum.lazarus.freepascal.org/index.php/topic,66281.m...
https://news.ycombinator.com/item?id=18783525 has previous discussion on why windows filesystem is slow
But is it faster than accessing the filesystem with io_uring as well? I feel like this article should be updated
io_uring/IOCP won't change how long it takes to access a file.
So what's exactly the performance benefits of using io_uring to access files?
It allows you to do something else while waiting for the request to complete. Linux historically had bad hacks to enable this. Windows NT has had IOCP since its inception, and not only for storage devices: networking, TCP sockets, mailslots, named pipes, etc. (io_uring started out storage-focused, though it has since grown networking support).
"Are you done yet?" vs "Get back to me when you're ready, gonna go do something else". Aka blocking i/o vs nonblocking i/o.
(I always wonder why Windows NT doesn't get more high performance implementations for networking et. al. with this feature; licensing? trade off in perf elsewhere? can't tweak kernel params to your heart's desire?)
io_uring added the extra feature of a submission/completion ring buffer, and Microsoft has since followed with a similar I/O ring API; I think it was first introduced in a later build of Windows 10 and should be in Windows 11 & Server 2022.
Microsoft has a graphical example of IOCP:
https://learn.microsoft.com/en-us/windows/win32/fileio/synch...
More info on a comparison of ring implementations in Windows/Linux:
https://windows-internals.com/ioring-vs-io_uring-a-compariso...
I did some research in a database research lab, and we had a lot of colleagues working on OS research. It was always interesting to compare the constraints and assumptions across the two systems. I remember one of the big differences was the scale of individual records we expected to be working with, which in turn affected how memory and disk was managed. Most relational databases are very much optimized for small individual records and eventual consistency, which allows them to cache a lot more in memory. On the other hand, performance often drops sharply with the size of your rows.
This is precisely why I'm considering appending to a sqlite DB in WAL2 mode instead of plain text log files. Almost no performance penalty for writes but huge advantages for reading/analysis. No more Grafana needed.
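A minimal sketch of that approach, assuming a hypothetical `logs` table (stock SQLite ships WAL; wal2 lives on a separate branch, so WAL stands in for it here):

```python
import os, sqlite3, tempfile, time

# File-backed database (WAL doesn't apply to :memory:); path is illustrative.
path = os.path.join(tempfile.mkdtemp(), "logs.db")
db = sqlite3.connect(path)
db.execute("PRAGMA journal_mode=WAL")
db.execute("CREATE TABLE IF NOT EXISTS logs(ts REAL, level TEXT, msg TEXT)")

def log(level, msg):
    # One small insert per event; WAL turns these into cheap appends.
    db.execute("INSERT INTO logs VALUES (?, ?, ?)", (time.time(), level, msg))
    db.commit()

log("INFO", "service started")
log("ERROR", "upstream timeout")

# The payoff: analysis with SQL instead of grep pipelines.
errors = db.execute("SELECT COUNT(*) FROM logs WHERE level='ERROR'").fetchone()[0]
print(errors)
```

Batching several events per commit would reduce the write overhead further, at the cost of losing the tail of the log on a crash.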
Careful, some people will be along any second pointing out your approach limits your ability to use "grep" and "cat" on your log after recovering it to your pdp-11 running in your basement. Also something about the "Unix philosophy" :p
Seriously though, I think this is a great idea, and would be interested in how easy it is to write sqlite output adaptors for the various logging libraries out there.
> some people will be along any second pointing out your approach limits your ability to use "grep" and "cat" on your log
And they won’t be wrong.
Unix philosophy still applies
sqlite3 logs.db "select log from logs" | grep whatever
by the same argument the systemd binary logs also follow the unix philosophy
As per the example above, for all practical purposes it does.
Indeed they do.
How would you tail or watch a sqlite log? (on a pdp-11 if necessary :)
Write a program using
https://www.sqlite.org/c3ref/update_hook.html
On a PDP-11, run this program via telnet, rsh, or rexec.
If you're more ambitious, porting SQLite to 2.11BSD would be a fun exercise.
Update hook doesn't trigger if the write happened from a different process.
Interesting, that's a requirement for an out of band tailing tool.
Maybe a user-defined function bound to an INSERT trigger. But I believe functions are also connection-specific, which is fine for the tail tool; the question is what happens when the connection that defined the function goes away.
Nope. Triggers and user defined functions don't help across processes either.
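A cross-process tail therefore has to poll. A minimal sketch, assuming a hypothetical `logs` table addressed by rowid (the demo uses a throwaway in-memory database; a real tail would open the shared file):

```python
import sqlite3

def new_rows(db, last_rowid):
    """Fetch log lines appended since last_rowid. Cross-process safe,
    because it simply re-queries the database instead of relying on
    connection-local hooks or triggers."""
    rows = db.execute(
        "SELECT rowid, log FROM logs WHERE rowid > ? ORDER BY rowid",
        (last_rowid,)).fetchall()
    new_last = rows[-1][0] if rows else last_rowid
    return [line for _, line in rows], new_last

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE logs(log TEXT)")
db.execute("INSERT INTO logs(log) VALUES ('first')")

lines, last = new_rows(db, 0)
print(lines)            # ['first']

db.execute("INSERT INTO logs(log) VALUES ('second')")
lines, last = new_rows(db, last)
print(lines)            # ['second']

# A real tail would loop: while True: ...; time.sleep(poll_interval)
```

With WAL mode the reader doesn't block the writer, which is what makes this polling approach practical for live logs.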
`watch -n 5 'sqlite3 logs.db "select log from logs" | grep whatever'` (the pipeline has to be quoted so grep runs inside watch, not on watch's screen output)
Just what I wanted, additional overhead and a five-second delay.
> Careful, some people will be along any second pointing out your approach limits your ability to use "grep" and "cat" on your log after recovering it
I wish Splunk and friends would have an interface like that. Sure it does basic grep, and it is a much more powerful language, but sometimes you just needed some command line magic to find what you wanted.
I've learned so much about Splunk this month. I hate it. The UX is hot garbage. Why are settings scattered everywhere? Why does a simple word search not return any results? Why is there no obvious way to confirm data is being forwarded; like actual packets, not just what connections are configured.
Is there a way to mount the sqlite tables as a filesystem?
I've been doing this for years. I keep SQLite log databases at varying grains depending on the solution. One for global logs, one per user/session/workflow, etc. I've also done things like in-memory SQLite trace databases that only get written to disk if an exception occurs.
I didn’t know what WAL/WAL2 mode was, so I looked it up. For anyone else interested: https://www.sqlite.org/wal.html
Some people have thought of this before. Here is one implementation
https://git.sr.ht/~martijnbraam/logbookd
Although I'm not sure it uses WAL2 mode, but that should be a trivial change.
There’s a tool called lnav that will parse logfiles into a temporary SQLite database and allows to analyse them using SQL features:
https://lnav.org/
It could probably work. For a peculiar application I even used sqlite to record key frame-only video. (There was a reason)
One could flip it around and store logs in a multimedia container, but then you won't have nice indices like with sqlite, just the one big time index
Are you using the wal2 branch of SQLite?
SQLite doesn't look like a good fit for large logs: nothing beats linear writes, at least on HDD, and with plain-text logs you get that (linear writes of compressed data are even better, but little software supports it out of the box). With SQLite I would expect more write requests for the same stream of logs (maybe not much more). Reading/analysis will be faster than grepping a plain-text log only if you create indices, which add write cost (and space overhead).
ClickHouse works really well when you need to store and analyze large logs, but compared to SQLite it requires maintaining a server (or several). There is DuckDB, which is embedded like SQLite and could be a better fit for logs, but I have no experience with DuckDB.
SQLite is meant for transactional data, DuckDB for analytical data.
I am not sure which one would be better for logs; I would need to play around with it. But I am not sure SQLite wouldn't be the better fit.
Logs can be different, but most of the time they are closer to analytics than to transaction processing. Also, the way to get a fast SELECT in SQLite (and most traditional relational databases) is to use indexes, and indexes IMHO don't work well with logs: you may need to use any column in a WHERE condition, and indexing all columns requires a lot of space and reduces write speed. Additionally, many columns in logs have low cardinality, so an index doesn't significantly reduce query time compared to a full scan. So with logs one often has to run full-scan queries, and full scans are where columnar storage works much better than row-oriented storage. A columnar store also gets a better compression ratio, which saves space (often important for logs) and increases read speed.
I recently had the idea to record every note coming out of my digital piano in real-time. That way if I come up with a good idea when noodling around I don’t have to hope I can remember it later.
I was debating what storage layer to use and decided to try SQLite because of its speed claims — essentially a single table where each row is a MIDI event from the piano (note on, note off, control pedal, velocity, timestamp). No transactions, just raw inserts on every possible event. It so far has worked beautifully: it’s performant AND I can do fun analysis later on, e.g. to see what keys I hit more than others or what my average note velocity is.
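A sketch of what such a recorder's storage layer could look like (schema and column names are guesses, not the commenter's actual code; a real recorder would write to a file, not :memory:):

```python
import sqlite3, time

db = sqlite3.connect(":memory:")
# One row per MIDI event, matching the description above.
db.execute("""CREATE TABLE events(
    ts       REAL,     -- wall-clock timestamp
    kind     TEXT,     -- note_on / note_off / control_change
    note     INTEGER,  -- MIDI note number; NULL for pedal events
    velocity INTEGER)""")

def record(kind, note, velocity):
    # Raw insert per event, no explicit transaction, as described.
    db.execute("INSERT INTO events VALUES (?,?,?,?)",
               (time.time(), kind, note, velocity))

record("note_on", 60, 90)    # middle C
record("note_off", 60, 0)

# The "fun analysis later": average strike velocity per note.
avg = db.execute("""SELECT note, AVG(velocity) FROM events
                    WHERE kind='note_on' GROUP BY note""").fetchall()
print(avg)   # [(60, 90.0)]
```

Even at the 100 events/second worst case discussed below, this is a trivial write load for SQLite.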
I wouldn't expect performance of pretty much any plausible approach to matter much. The notes just aren't going to be coming very quickly.
If you play ten note chords — one for each finger — in quick succession, that can rack up a lot of inserts in short time period (say, medium-worst case, 100Hz, for playing a chord like that five times per second, counting both “on” and “off” events).
It’s also worth taking into consideration damper pedal velocity changes. When you go from “off” (velocity 0) to fully “on” and depressed (velocity 127), a lot of intermediate values will get fired off at high frequency.
Ultimately though you are right; it’s not enough frequency of information to overload SQLite (or a file system), probably by several orders of magnitude.
A 1/100th of a second is still an eternity for a modern computer. It's easy to forget just how insanely fast computers are when user-facing software still lags all the time. You really do need to go out of your way to make performance an issue here (or stack too many abstractions).
What counts as plausible nowadays may surprise you given your grasp on true performance. Observe these professionals. https://www.primevideotech.com/video-streaming/scaling-up-th...
Insertion times are not a problem but maybe after a few years of playing you will be managing millions of records
Would you expect SQLite performance to degrade per insert once a table is very large (in a way that, say, an append-only log file wouldn’t)?
> The performance difference arises (we believe) because when working from an SQLite database, the open() and close() system calls are invoked only once, whereas open() and close() are invoked once for each blob when using blobs stored in individual files. It appears that the overhead of calling open() and close() is greater than the overhead of using the database
I wonder how io_uring compares.
RecordIO would be a good choice for this use case
https://mesos.apache.org/documentation/latest/recordio/
Yeah but imagine a Beowulf cluster^H^H io_uring of SQLites
It seems like nobody has suggested putting SQLite databases inside SQLite blobs yet... you could have SQLite all the way down.
https://news.ycombinator.com/item?id=41085856
Imagine the performance we could reach. With enough SQLite layers anything would be possible.
When something built on top of the filesystem is "faster" than the filesystem, it just means "when you use the filesystem in a less-than-optimal manner, it will be slower than an app that uses it in a sophisticated manner." An interesting point, but perhaps obvious...
TL;DR: don't do it.
I've used SQLite blob fields for storing files extensively.
Note that there is a 2GB blob maximum: https://www.sqlite.org/limits.html
To read/write blobs, you have to serialize/deserialize your objects to bytes. This process is not only tedious, but also varies for different objects and it's not a first-class citizen in other tools, so serialization kept breaking as my dependencies upgraded.
As my app matured, I found that I often wanted hierarchical folder-like functionality. Rather than recreating this mess in db relationships, it was easier to store the path and other folder-level metadata in sqlite so that I could work with it in Python. E.g. `os.listdir(my_folder)`.
Also, if you want to interact with other systems/services, then you need files. SQLite can't reliably be used over NFS (e.g. AWS EFS), and by design it has no server to field requests, so I found myself caching files to disk for export/import.
SQLite has some settings for handling parallel requests from multiple services, but when I experimented with them I always wound up with a locked db due to competing requests.
For one reason or another, you will end up with hybrid (blob/file) ways of persisting data.
> As my app matured, I found that I often wanted hierarchical folder-like functionality. Rather than recreating this mess in db relationships, it was easier to store the path and other folder-level metadata in sqlite so that I could work with it in Python. E.g. `os.listdir(my_folder)`.
This is a silly argument; there's no reason to recreate the full hierarchy. If you store the full path as a single column with a UNIQUE index, then a LIKE prefix query gets you everything in that path and everything in the subpaths (if you just want the single folder, you can always add a `directory` column). I benchmarked it using hyperfine on the Linux kernel source tree and a random deep folder: `/bin/ls` took ~1.5 milliseconds, the SQLite query took ~3.0 milliseconds (this is on an M1 MacBook Pro). The reason it's fast is that the table has a UNIQUE index, and LIKE uses it if you turn off case-sensitivity. No need to faff about with hierarchies.
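The comment's actual schema and query were lost in formatting; here is a plausible reconstruction of the path-column-plus-LIKE approach in Python's stdlib sqlite3 (names are illustrative). Declaring the column COLLATE NOCASE is one way to let the UNIQUE index serve LIKE prefix queries, since LIKE is case-insensitive for ASCII by default:

```python
import sqlite3

# One row per file, the full path stored in a single TEXT column.
# COLLATE NOCASE on the column lets SQLite's LIKE prefix optimization
# use the implicit UNIQUE index.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE files (path TEXT UNIQUE COLLATE NOCASE)")
con.executemany("INSERT INTO files VALUES (?)", [
    ("kernel/sched/core.c",),
    ("kernel/sched/fair.c",),
    ("kernel/fork.c",),
    ("mm/slab.c",),
])

# everything under kernel/sched/, including subpaths
rows = con.execute(
    "SELECT path FROM files WHERE path LIKE ? ORDER BY path",
    ("kernel/sched/%",),
).fetchall()
print([r[0] for r in rows])  # ['kernel/sched/core.c', 'kernel/sched/fair.c']
```

As a reply below notes, a user-supplied prefix should have its LIKE metacharacters escaped before it's used in a pattern.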
EDIT: btw, I am using SQLite for this purpose in a production application, couldn't be happier with it.
The postgres ltree extension does this beautifully. I'm not sure if similar things exist for sqlite or other rdbmses.
https://www.postgresql.org/docs/current/ltree.html
MSSQL has something similar but no, most other solutions involve recursive queries.
cool. i don't want to recreate a filesystem in my app logic
If that SELECT query is too much for you, I agree, SQLite is maybe not meant for you. Not a very solid argument against SQLite, though.
stop making personal attacks.
i'm sure your suggestion exhibits creative step-one thinking, but i've described several reasons why it doesn't make sense to recreate a filesystem in sqlite, the combination of which should make it clear why doing so is naive
calling someone's idea naive is also kind of a personal attack fwiw
Just pasting a % on your directories to list files is naive. At the very least you need to block/escape % characters in the path.
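One way to do that escaping, sketched with Python's stdlib sqlite3 and SQLite's LIKE ... ESCAPE clause (table and helper names are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE files (path TEXT UNIQUE)")
con.executemany("INSERT INTO files VALUES (?)", [
    ("reports/100%_done.txt",),
    ("reports/other.txt",),
])

def list_prefix(prefix):
    # escape LIKE metacharacters (and the escape char itself) in the
    # user-supplied prefix, then append our own trailing %
    escaped = (prefix.replace("\\", "\\\\")
                     .replace("%", "\\%")
                     .replace("_", "\\_"))
    return [r[0] for r in con.execute(
        "SELECT path FROM files WHERE path LIKE ? ESCAPE '\\' ORDER BY path",
        (escaped + "%",),
    )]

print(list_prefix("reports/100%"))  # ['reports/100%_done.txt']
```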
> To read/write blobs, you have to serialize/deserialize your objects to bytes. This process is not only tedious, but also varies for different objects and it's not a first-class citizen in other tools, so things kept breaking as my dependencies upgraded.
I'm confused what you mean by this. Files also only contain bytes, so that serialization/deserialization has to happen anyway?
The tools associated with every file type you support have to support reading/writing a buffer/bytestream or whatever it is called.
For example, `pd.read_parquet` accepts "file-like objects" as its first argument: https://pandas.pydata.org/docs/reference/api/pandas.read_par...
However, this is not the case for fringe tools
Maybe you need to encode bytes as text for sql?
Writing and reading bytes in sqlite is very easy: https://www.sqlite.org/c3ref/blob_open.html
> I've used SQLite blob fields for storing files extensively. Note that there is a 2GB blob maximum: https://www.sqlite.org/limits.html
Also note that SQLite does have an incremental blob I/O API (sqlite3_blob_xxx), so unlike most other RDBMS there is no need to read/write blobs as a contiguous piece of memory - handling large blobs is more reasonable than in those. Though the blob API is still separate from normal querying.
Do you have or know of a clear example of how to do this? I have to ask because I spent half of yesterday trying to make it work. The blob_open command wouldn't work until I set a default value on the blob column and then the blob_write command wouldn't work because you can't resize a blob. It was very weird but I'm pretty confident it's because I'm missing something stupid.
I’m pretty sure you can’t resize blobs using this API. The intended usage is to insert/update a row using bind_zeroblob and then update that in place (sans journaling) using the incremental API. It’s a major limitation especially for writing compressed data.
Great. I had not seen bind_zeroblob when reading yesterday. Maybe that’s what I needed to get it moving. Thanks.
Most RDBMS's have streaming blob apis:
MS SQL Server: READTEXT, WRITETEXT, substring, UPDATE.WRITE
Oracle: DBMS_LOB.READ, DBMS_LOB.WRITE
PG: Large Objects
Most of my experience is with SQL server and it can stream large objects incrementally through a web app to browser without loading the whole thing into memory at 100's Mbytes/sec on normal hardware.
I wasn’t aware of PG Large Objects, though I do know about MSSQL Filestream, which iirc are actual files on disk and remotely transparently accessed via SMB - which in this context would be the moral equivalent of storing just the path in a SQLite db. The functionality you quote seems to be deprecated?
SQL Server has image, text, varbinary(max) and varchar(max). None of those except varbinary(max) are FILESTREAM-enabled, and varbinary(max) only uses FILESTREAM if that is set up and the column is declared FILESTREAM when created; otherwise it's stored in the DB like the others.
image and text with READTEXT and WRITETEXT are deprecated but still work fine; varbinary(max) with SUBSTRING and UPDATE.WRITE are the modern equivalent and use the same implementations underneath.
Filestream allows larger than 2 gigs, stores the blob data in the filesystem but otherwise is accessed like the other blobs along with some special capability like getting a SMB file pointer from the client for direct access. It is also backed up and replicated like normal, definitely not just like storing a path in the DB.
FILESTREAM performs worse than in-DB storage for blobs under about 100 KB, I believe, where it would be recommended to keep them in the DB for max perf.
I have used MSSQL blobs since long before FILESTREAM existed, and it works well except for the 2 GB limit; once the DB gets large, backup and log management get more unwieldy than if you just stored the blobs outside the DB, but it does keep them transactionally consistent, which FILESTREAM also does.
I wish the API was compatible with iovec though, as that's what the POSIX vectored I/O calls (readv/writev) use for non-contiguous memory.
That sounds nice for chunking, but what if you need contiguous memory? E.g. viewing an image or running an AI model
You can always use a chunked API to read data into a contiguous buffer. Just allocate a large enough contiguous block of memory, then copy chunk by chunk into that contiguous memory until you've read everything.
> so i found myself caching files to disk for export/import
Could use a named pipe.
I’m reminded of what I often do at the shell with psub in fish. psub -f creates and returns the path to a fifo/named pipe in $TMPDIR and writes stdin to that; you’ve got a path but aren’t writing to the filesystem.
e.g. you want to feed some output to something that takes file paths as arguments. Say we want to compare cmd1 | grep foo and cmd2 | grep foo: we pipe each to psub -f inside a command substitution, and diff gets handed the two fifo paths.
That works as long as the tool doesn't seek around the file. (The caveats are numerous enough that without -f, psub uses regular files.)
> I'm reminded of what I often do at the shell with psub in fish.
ksh and bash too have this as <(…) and >(…) under Process Substitution.
There's an example in the ksh(1) man page.
bash (at least) has a built-in mechanism to do that
diff <(cmd1 | grep foo) <(cmd2 | grep foo)
> As my app matured, I found that I often wanted hierarchical folder-like functionality. Rather than recreating this mess in db relationships, it was easier to store the path in sqlite and work with it in Python. E.g. `os.listdir(my_folder)`
This makes total sense and it is also "frowned upon" by people who take a too purist view of databases
(Until it comes a time to backup, or extract files, or grow a hard drive etc and then you figure out how you shot yourself in the foot)
To make it more queryable, you can have different classes for dataset types with metadata like: file_format, num_files, sizes
> As my app matured, I found that I often wanted hierarchical folder-like functionality.
In the process of prototyping some "remote" collaborating file systems, I've always wondered whether it is a good idea to maintain a flat map from '/'-joined paths (as S3 does) to the file content, in terms of efficiency or elegance.
> As my app matured, I found that I often wanted hierarchical folder-like functionality
(1) Slim table "items"
- id / parent_id / kind (0 = file, 1 = folder) integer
- name text
- Maybe metadata.
(2) Separate table "content"
- id integer
- data blob
There you have file-system-like structure and fast access times (don't mix content in the first table)
Or, if you wish for deduplication or compression, add item_content (3)
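A runnable sketch of the two-table layout proposed above, in Python's stdlib sqlite3 (column details beyond the comment's list are my own guesses), including a recursive CTE to resolve an item's full path:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE items (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES items(id),  -- NULL for the root
        kind      INTEGER NOT NULL,              -- 0 = file, 1 = folder
        name      TEXT NOT NULL
    );
    CREATE TABLE content (
        id   INTEGER PRIMARY KEY REFERENCES items(id),
        data BLOB
    );
""")
con.executemany("INSERT INTO items VALUES (?, ?, ?, ?)", [
    (1, None, 1, "docs"),
    (2, 1,    1, "notes"),
    (3, 2,    0, "todo.txt"),
])
con.execute("INSERT INTO content VALUES (3, ?)", (b"buy milk",))

# walk parent_id links from the root downward to build full paths
row = con.execute("""
    WITH RECURSIVE up(id, path) AS (
        SELECT id, name FROM items WHERE parent_id IS NULL
        UNION ALL
        SELECT i.id, up.path || '/' || i.name
        FROM items i JOIN up ON i.parent_id = up.id
    )
    SELECT path FROM up WHERE id = 3
""").fetchone()
print(row[0])  # docs/notes/todo.txt
```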
Can you describe how you stored the paths in sqlite? I'm not entirely getting it.
just a string field that points to the file path
The idea to emulate hierarchical folder-like functionality ala filepaths is quite brilliant - I might try it out.
Storing Hierarchical Data in Relational Databases
https://medium.com/@rishabhdevmanu/from-trees-to-tables-stor...
I was looking at self hosted RSS readers recently. The application is single user. I don't expect it to be doing a lot of DB intensive stuff.
It surprised me that almost all required PostgreSQL, and most of those that didn't opted for something otherwise complex such as Mongo or MySQL.
SQLite, with no dependencies, would have simplified the process no end.
Last I checked, FreshRSS[1] can use a SQLite database.
[1] https://freshrss.org
Depends, depends... but just as a matter of logic:
All fs/drive access is managed by the OS. No DB system has raw access to sectors or direct raw access to files.
Having a database file on the disk gives you a "cluster" of consecutive blocks on the hard drive (if it's not fragmented), resulting in relatively short head movements to seek the necessary sectors. The same sectors stay occupied even after vast insert/write/delete operations; the DB file's position on the drive doesn't change at all. (Not that this matters with SSDs.)
So, the following applies:
client -> DB -> OS -> Filesystem
I think you can already see that the DB part is an extra layer. So if one didn't have this layer, it would always be "faster" in terms of execution time.
If it's slower, then you are using settings that aren't optimal for your use case/filesystem.
My father did this once. He took H2 and made it even faster :) Incredibly fast on Windows in a direct comparison of stock H2 against his modified H2 on the same data.
So a DBMS is convenient, and its design decisions serve certain domains and their problems. Having it is convenient, but that doesn't mean it's the most optimized way of doing things.
> No DB systems have raw access to sectors or direct raw access to files.
Oracle can use raw disks without having a filesystem on them, though it's more common to use ASM (Automatic Storage Management) which is Oracle's alternative to raw or OS managed disks.
Oh nooo, I forgot, there are real databases existing in the wild, too ... :)
https://docs.oracle.com/en/database/oracle/oracle-database/1...
Indeed. It offers 10-12 percent better performance when writing directly to a block device. But this block device is still attached to a running system :)
MySQL also supported this a million years ago, I'm not sure if it still does
MySQL can work off a block device without a file system.
numbers are from 2017, update would be cool
This discussion never gets old.
* https://news.ycombinator.com/item?id=14550060 7 years ago
* https://news.ycombinator.com/item?id=20729930 5 years ago
* https://news.ycombinator.com/item?id=27137834 3 years ago
* https://news.ycombinator.com/item?id=27897427 3 years ago
* https://news.ycombinator.com/item?id=34387407 2 years ago
SQLite can be faster than the file system for small files. For big files, it can do more than 1 GB/s. On Pack [1], I benchmarked these speeds, and you can go very fast. It can be even 2X faster than tar [2].
In my opinion, SQLite could be faster for big reads and writes too, but the team didn't optimise that path as much (e.g. by loading the whole content into memory), as maybe it was not the main use case of the project. My hope is that we will see even faster speeds in the future.
[1] https://pack.ac [2] https://forum.lazarus.freepascal.org/index.php/topic,66281.m...
Let's assume that filesystems are fairly optimized pieces of software. Let's assume that the people building them heard of databases and at some point along the way considered things like the costs of open/close calls.
What is SQLite not doing that filesystems are?
A discussion in the comments of the cost of opening a file on Windows: https://stackoverflow.com/questions/21309852/what-is-the-mem...
Access control is a big cost. Some AV systems (like everyone's favourite Crowdstrike) also hook every open/close.
How much more performance could you get by bypassing the filesystem and writing directly to the block device? Of course, you'd need to effectively implement your own filesystem, but you'd be able to optimise it more for the specific workload.
Oracle and some other databases did this back in the day, in the 00s, by working with block devices directly.
I am not sure if this is done anymore, because the performance gains were modest compared to the hassle of a custom formatted partition.
i.e. opening and closing many files from disk is slower than opening and closing one file and using memory.
It's important. But understandable.
Why hasn’t someone made sqlitefs yet?
Because SQLite not being an FS is apparently the reason why it’s fast:
> The performance difference arises (we believe) because when working from an SQLite database, the open() and close() system calls are invoked only once, whereas open() and close() are invoked once for each blob when using blobs stored in individual files.
Additionally, it may also be that going through a single file allows the OS to efficiently retrieve (and, IIRC in the case of Windows, predict) the working set, reducing access times; that isn't the case if you open multiple files.
Not so much working set, as you only have to check the access control once. Windows does a lot of stuff when you open a file, and creating a process is even worse.
Here you go:)
https://github.com/narumatt/sqlitefs
And it seems quite interesting:
"sqlite-fs allows Linux and MacOS to mount a sqlite database file as a normal filesystem."
"If a database file name isn't specified, sqlite-fs use in-memory-db instead of a file. All data will be deleted when the filesystem is closed."
Proxmox puts the VM configuration information in a SQLite database and exposes it through a FUSE file system. It even gets replicated across the cluster using their replication algorithm. It’s a bespoke implementation, but it’s a SQLite-backed filesystem.
I remember reading someone's comments about how, instead of databases using their own data serialisation formats for persistence and then optimizing writes and reads over that, they should just utilize the FS directly and let all of the optimisations built by FS authors be taken advantage of.
I wish I could find that comment, because my explanation doesn't do it justice. Very interesting idea, someone's probably going to explain how it's already been tried in some old IBM database a long time ago and failed due to whatever reason.
I still think it should be tried with newer technologies though, sounds like a very interesting idea.
> instead of databases using their own data serialisation formats for persistence and then optimizing writes and read over that they should just utilize the FS directly and let all of the optimisations built by FS authors be taken advantage of.
The original article effectively argues the opposite: if your use case matches a database, then that will be way faster. Because the filesystem is both fully general, multi-process and multi-user, it's got to be pessimistic about its concurrency.
This is why e.g. games distribute their assets as massive blobs which are effectively filesystems - better, more predictable seek performance. Starting from the Doom WAD onwards.
For an example of databases that use the file system, both the mbox and maildir systems for email probably count?
ReiserFS was built on the premise that doing a filesystem right could get us to a point where the filesystem is the database for a lot of use cases.
It's now "somewhat possible" in that modern filesystems are overall mostly less broken about handling large numbers of small (or at least moderately small) files than they used to be.
But databases are still far more optimized for handling small pieces of data in the ways we want to handle data we put into databases, which typically also includes a need to index etc.
As far as I can remember MongoDB did not have any dedicated block caching mechanism in its earlier releases.
They basically mmap'ed the database file and argued that the OS cache should do its job. Which makes sense, but I guess it did not perform as well as a fine-tuned caching mechanism.
Early MongoDB design choices are probably not great to call out for anything other than ignorance. mmap is a very naive view on how to easily work with data but it falls over pretty hard for any situation where ensuring your data doesn't get corrupted is important. MongoDB has come a long way, but its early technical decisions were not based on technical insight.
> Why hasn’t someone made sqlitefs yet?
What do you expect the value proposition of something loosely described as a sqlitefs to be?
One of the main selling points of SQLite is that you can statically link it into a binary and no one needs to maintain anything between the OS and the client application. I'm talking about things like versioning.
What benefit would there be to replace a library with a full blown file system?
There's quite a number of sqlite FUSE implementations around, if you want to head in that direction.
No mention of how it performs when you need random access (seek) into files. Perhaps it underperforms the file system at that?
Probably because you wouldn't seek into a database row?
I guess querying by PK has some similarities but it is not as unstructured and random as a seek.
Also side effects such as sparse files do not mean much from a database interface standpoint.
SQLite does have low-level C APIs for that:
https://www.sqlite.org/c3ref/blob_open.html
https://www.sqlite.org/c3ref/blob_read.html
I've not seen performance numbers for those. Could make for an interesting micro-benchmark.
POSIX interfaces (open, read, write, seek, close, etc) are very challenging to implement in an efficient/reliable way.
Using SQLite lets you tailor your data access patterns in a much more rigorous way and sidestep the POSIX tarpit.
There are at least two for macOS. But they run into trouble nowadays because FUSE wants kernel extensions.
Would you put sqlite in the kernel? Or using something like FUSE?
It seems to me that all the extra indirection from using FUSE would lead to more than a 35% performance hit.
Statically linking an sqlite into a kernel module and providing it with filesystem access seems like something non trivial to me.
Could we expect performance gains from Sqlite being in the kernel?
The idea of embedding SQLite in the kernel, reminds me of IBM OS/400 (the operating system of the IBM AS/400, nowadays known as IBM i). It contains a built-in relational database, although exactly how deeply integrated it is, is not entirely clear, due to lack of details of its inner workings.
Putting a relational database in the OS kernel is an interesting violation of standard layering. Obviously has the potential to unleash a lot of issues, but also could possibly enable novel features.
just point it at your block device
I don't think SQLite can run on a block device out of the box, it needs locking primitives and a second file for the journal or WAL plus a shared memory file in WAL mode
> I don't think SQLite can run on a block device out of the box, it needs locking primitives and a second file for the journal or WAL
It ships with an example VFS which shows you how to do this: https://www.sqlite.org/src/doc/trunk/src/test_onefile.c
Slight OT: does this apply to SQLite on OpenBSD?
Because with OpenBSD's introduction of pinning all syscalls to libc, doesn't this block SQLite from making syscalls directly?
https://news.ycombinator.com/item?id=38579913
Reminds me of this talk: https://www.destroyallsoftware.com/talks/the-birth-and-death...
Deleting a lot of rows from an sqlite database can be awfully slow, compared with deleting a file.
> Reading is about an order of magnitude faster than writing
not a native speaker, what does it mean?
Order of magnitude usually refers to a 10x difference. Two orders of magnitude would be 100x difference.
(Sometimes the phrase is casually used to just mean "a lot", but here I think they mean 10x).
An order of magnitude is for example from 10 to 100, or from 1,000 to 10,000, so typically an increase by 10x or similar.
(Very) approximately 10 times faster
Reading is about 10 times faster than writing
* for certain operations.
Which is a bit d'oh, since being faster for some things is one of the main motivations for a database in the first place.