Never write your own database or filesystem. It’s way harder than you expect.
— some wise programmer1
Databases and filesystems are hard to write in part because they have very strong requirements around never losing data (durability) even when unusual events happen such as unexpected process termination, hardware failure, out of disk space, or out of memory. I realized recently that Crystal - a program of mine for downloading websites - should strive to provide the same high reliability guarantees when saving websites to disk in its project format. And so began my ~3-week quest to shore up the ACID (atomicity, consistency, isolation, and durability) properties of how Crystal writes to its projects. Here are a few things I learned along the way.
A Crystal project is an ACID-compliant storage system for holding downloaded website data: multiple-reader, single-writer.
It contains a SQLite table of downloaded page revisions, each row of which corresponds to an on-disk file with the revision’s content.
Adding a revision requires creating both a database row and file at the same time.
It’s not safe to write a file directly to its final location because sudden process termination or disk disconnection could abort the write in the middle, leaving an incompletely written file in its final “committed” location.
Instead it’s safer to write a file first to a temporary location on the same filesystem as its committed location. After the write is complete, move it to its final place.
If sudden process termination stops the write, when Crystal reopens the project it will delete the temporary file. Additionally Crystal will check whether the highest numbered revision row in the database is missing its corresponding revision file, and delete the incomplete revision row if so.
Crystal projects using the Pack16 format have a more complex challenge: Each revision row now corresponds to an entry in a zip file that can be shared by up to 16 revisions. Thus writing one revision may require altering a zip file containing a different revision that is being concurrently read.
Similar to the write+rename strategy, we can use a write+replace strategy: Write a new revision to a new zip file and then replace the old zip file. Existing readers of the old zip file will not be disrupted; they will still be able to read from the copy of the zip file that they opened even though it is no longer visible in the filesystem.
On macOS and Linux the above strategy just works; it is possible to replace a file that is open for reading with a different file without doing anything special. However on Windows files are opened in an exclusive mode by default that prevents reading or writing the file while it is still open. You must explicitly open a file with FILE_SHARE_READ, FILE_SHARE_WRITE, or FILE_SHARE_DELETE modes on Windows if you want to allow concurrent readers, writers, or deleters. Crystal defines its own open_nonexclusive function to get this same non-exclusive open behavior across all operating systems.
However Windows presents another challenge: It does not allow you to replace a file that is opened for reading even if it is opened with all of FILE_SHARE_READ, FILE_SHARE_WRITE, and FILE_SHARE_DELETE, despite allowing you to delete it or rename it.
So on Windows, Crystal renames an old zip file to a transitional (but crash-durable) “moved-aside” location before renaming the new zip file to the original location of the old zip file. So the total process looks like:
FILE_SHARE_WRITEFILE_SHARE_DELETEThe above (non-atomic) process introduces several complexities:
In case B1, Crystal performs the repair at read time, when an entry from the zip file is first read after reopening the project.
When Crystal performs a read on a Pack16 project, looks for a zip file corresponding to a revision, and does not find it, it will then look for a moved-aside zip file that was left behind during scenario B1 above. If it finds such a moved-aside zip file it will repair it by moving it back to its normal location.
Note that this kind of repair is itself a kind of write which might happen during a high-level read operation. Since Crystal allows concurrent reads, the possibility of a repair happening during a read means that concurrent writes are now possible too… Yikes.
Concurrent writes to the same file are unsafe by default but can be made safe with mutual exclusion. Crystal maintains an in-memory mutex lock for each file it is in the middle of repairing to prevent concurrent writes from corrupting the project.
I expected this work to take 1–1.5 weeks. It took about three. Most of the overrun came from a cascade of complications I didn’t see coming:
FILE_SHARE_READ.FILE_SHARE_WRITE / FILE_SHARE_DELETE on the reader side.Each layer was reasonable on its own; the surprise was how much each one cost in aggregate.
Of the four ACID properties, atomicity and durability drove essentially all of this work. Consistency was largely a freebie given the architecture: the database lives in one place, only one process ever opens a project, and that process has at most one writer. Isolation was similarly cheap given the single-writer rule and the absence of in-place rewrites — with the wrinkle that read-repair turned a “read” into a {read + limited write}, requiring introducing per-revision locks.
It was a fun exercise in polishing one aspect of Crystal to a high standard. But I must say the caution around hand-rolling your own storage code is well earned: you may very well double or triple the time you expect it to take. 🙂
I can’t find a specific original person who said “Never write your own database or filesystem”. The closest discussion I was able to locate comes from Never Write Your Own Database. So perhaps this specific quote just comes from me now 🙂↩