Turning SQLite into a distributed database


mvSQLite is the open-source, SQLite-compatible distributed database. We decoupled SQLite's storage layer onto FoundationDB to offer massive yet "bottomless" scalability, point-in-time reads, and the strictest level of consistency.

On the surface

There are already so many nice "multi-machine" SQLite flavors: rqlite, dqlite, and Litestream. However I believe what mvSQLite offers is unique: it is not just replicated but really distributed, it offers not only read but also write scalability, it provides the strictest consistency, and it is a drop-in addition to SQLite.

Read and write scalability

The underlying technology of mvSQLite, FoundationDB, provides a solid foundation for scalability. FoundationDB is known to scale linearly at least until 500 cores. However, there is still something mvSQLite needs to do, especially on the write side.

Concurrent writes. Read scalability is trivial: you just add more machines to the cluster. Scaling writes without breaking serializability is the hard part though, and SQLite itself has not supported concurrent writes until very recently. Correctly implementing concurrent writes in a distributed system is even more difficult, and it looks like an interesting problem for us to solve!

In each transaction, SQLite reads a set of pages, and writes another set of pages. There is the following observation on serializability:

The decision to write to a page is correct under the causal assumption that none of the pages read in the same transaction have changed before the write.

Therefore, if the concurrent execution of two or more transactions do not break each other's causal assumptions, they don't conflict and can be committed together. From a transaction's perspective, it's always serializable and linearizable. This check is implemented in mvSQLite by checking at commit time the versions of the pages in the read set: if any version is larger than the read version of the current transaction, the commit is aborted and a conflict error is returned.

There are caveats though. For example, SQLite always modifies the “file change counter” field in the first page after each transaction. If this change is persisted, the entire concurrent-write mechanism would be useless - any two concurrent transactions on the same database will conflict. mvSQLite works around the problem by overwriting the change counter with zero in the persisted version, and maintaining a per-connection virtual change counter at the client side.

Automatic sharding. Other systems with synchronous replication like rqlite and dqlite are replicated systems, not fully distributed ones, because they replicate the entire database with a single consensus group. This works when you have 3 or 5 machines in the cluster and a single writer, but does not scale well further. In the mvSQLite case though, FoundationDB manages sharding automatically and safely, and scales well to hundreds of CPU cores.

Stricter than ACID

SQLite is an ACID database, and mvSQLite actually provides stricter guarantees than ACID!

External consistency. mvSQLite is externally consistent, which is stricter than strong consistency, and is the strictest consistency level for transaction-processing systems.

Synchronous replication. FoundationDB, does synchronous replication and ensures durability. When the SQLite API tells you a transaction is committed, it is. Application crashes, power failures, or losing replication_factor - 1 storage servers in the cluster do not cause data loss.

By the way, you still have the option to do global eventually-consistent low-latency reads when needed - FoundationDB supports asynchronous replication (DR) to different regions.

Point-in-time reads

As its name stands for, mvSQLite is a database with MVCC (multi-version concurrency control) support. It's not just the traditional type of MVCC where concurrent transactions see their own snapshots - you can actually open a snapshot of the database at any time in the past and read from it. So something like a DROP TABLE won't cause data loss any more, and the app can integrate with the mechanism to provide high-level versioning features.

Drop-in addition

Usually no changes to your SQLite-based application is needed to get it working with mvSQLite. Two integration methods are available:

  • LD_PRELOAD. If your application is dynamically linked to libsqlite3.so, this method is recommended because it is more efficient and has no OS-specific dependency. Set the LD_PRELOAD=libmvsqlite_preload.so environment variable and it should "just work".
  • Work-in-progress: FUSE (Filesystem in USErspace). This method is more universal and works with statically linked SQLite3 too.

However, if the application attempts to operate on the database in an unsupported way (e.g. enabling WAL by setting pragma journal_mode = wal, or reading/writing the database outside SQLite), mvSQLite will get confused and refuse to work. Fix the application if that's the case!

Inside the system

This section mostly contains answers to commonly-asked questions about mvSQLite.

Not just FoundationDB

From a first glance it might seem easy to glue a transactional block layer on top of FoundationDB: just map SQLite transactions onto FDB transactions, map SQLite VFS reads/writes to FDB ones, and you're done?

Well, you can do it, but the resulting system might not be very useful due to a few important limits of FoundationDB:

  • The longest transaction duration is 5 seconds.
  • The size (keys read and written + values written) of a transaction must not exceed 10MB.
  • All written data is kept in memory before commit.

Additionally, there is no way to read the version of the database from some time in the past, so we don't have the primitives to support point-in-time reads.

No local journals

SQLite was designed to run on real block devices. These devices do not support transactional writes across pages, so the application needs to come up with its own way to perform atomic commits. SQLite, along with most other databases, achieves that with journaling. SQLite supports two journal modes: the rollback journal and the WAL (write-ahead log).

mvSQLite does not rely on SQLite's own atomic commit mechanisms to ensure ACID properties. Instead, it treats the VFS “unlock” operation as the “transaction visibility fence” and performs atomic commit with primitives provided by FoundationDB. There's still a WAL all the way down on FDB's storage servers, but that's no longer something the SQLite layer needs to take care of.

Efficient data encoding

mvSQLite minimizes the overhead of storing every version of a page by using delta encoding. When a small portion of a page is changed, only the difference between the new and old versions are stored.

Maybe you feel that there are a lot of details uncovered in this post - and I think so too! I'm working on another blog post that introduces the deeper internals of mvSQLite. Stay tuned by subscribing to my RSS feed.

Worst case with beta software

Like with anything before 1.0, I need to put a big warning here: mvSQLite is beta software and contains bugs. It's normal to have concern about data reliability: although no data corruption bugs have yet been observed in the current version, what if under some special cases a rare code path is triggered and corrupts some random data? Let's think about the worst case: What if there is a bug in mvSQLite that somehow causes a clear_range('', '\\xff') to be executed and you lose all the data in the entire cluster?

That's where FoundationDB's backup system comes to be useful. You can do continuous, real-time backup to an external object storage service such as S3 and recover to any point in the past. No data is lost after a full range-clear. Even if all disks in the FDB cluster get broken at the same time, you only lose a few seconds of data. And with proper permissions on the S3 access keys, nobody can delete already backed up data even if they have the keys!

Discussions on Hacker News

© 2022 Heyang Zhou · 


Powered by Blueboat