Lite³
A JSON-Compatible Zero-Copy Serialization Format
Data serialization is a cornerstone of distributed systems, and even of single systems relying on inter-process communication. We go through the effort of making data communication possible in the first place: connecting wires, laying fibers, setting up pipelines, IPC, etc., so that we can receive our first bytes. But when the bytestream finally arrives, we face a dreadful problem: how do we actually find the bytes inside the stream that we care about? How should these bytes be sent and organized in the first place? This is the problem that data serialization solves. It is also referred to as 'the presentation layer', or layer 6 in the OSI model.
The Lite³ project was started in the spring of 2025 by Elias de Jong, out of dissatisfaction with existing serialization formats. All existing technologies boiled down to a choice:
Text formats are very flexible and easy to read, however they perform badly. This fact is hardly noticeable due to Moore's law and the incredible hardware we have today. Yet, anyone building high-performance applications or operating digital infrastructure at scale will know just how much performance is being left on the table. We force computers to parse text for our own convenience, and the computer will happily comply. But if we allowed computers to work on binary data, which their instruction sets are natively optimized to handle, we could unlock orders-of-magnitude performance improvements purely from software. Instead of pushing hardware people to double CPU speed every 2 years, maybe we should fix our software first?
We have binary formats to serve exactly this purpose. They are faster, more compact and more efficient. Computers love binary data. But why are so few people using them? To answer this question, all you have to do is pick a binary format and try to build an application yourself. You will find out very quickly that suddenly you have to deal with 'schema definitions', 'IDL', protoc compilers, 'schema evolution', 'backwards compatibility' and more. You need extra tooling, extra build steps, and the simplicity of JSON seems very far and very desirable. It's not just 'plug and play'. Of course any motivated individual can pick up these concepts and start developing binary data communications. After all, using tools is better than hard-coding byte offsets manually.
But most people have a very pragmatic attitude. Software should 'just work' as a tool to help them achieve a certain task. Mastering esoteric tools is not a line that is easily crossed. It takes time and commitment. In practice, performance is inversely proportional to convenience. This can best be described as a scale or gradient:
How far we walk down the scale depends on project scope, budget and overall requirements. This applies to programming languages, but in general to all the tools we use, even outside programming.
Notice that we can achieve performance in two ways: through faster hardware, or through more efficient software.
We talk about performance requirements a lot, but not enough about convenience requirements. In reality, we may have much stricter requirements for convenience. It affects development speed, flexibility, maintainability and the available developer hiring pool. We all have a budget for convenience (or inconvenience), either explicit or implicit.
'Just use JSON' would be an easy answer. But what if half-decent performance is a requirement?
To avoid the curse of schema definitions and esoteric tooling, this dissatisfaction motivated one last search for anything with a better convenience profile. This search yielded a paper published in December 2024 as Lite²:
Tianyi Chen, Xiaotong Guan, Shi Shuai, Cuiting Huang and Michal Aibin (2024). Lite²: A Schemaless Zero-Copy Serialization Format. https://doi.org/10.3390/computers13040089
This format is able to support full zero-copy reads + writes. That is, the ability to read and modify data without ever parsing or serializing it. Wow!
The paper describes a data format where all entries are organized as key-value pairs inside a B-tree. The authors took their inspiration from SQL databases: arbitrary keys can be inserted, making the format schemaless, and a key lookup can be performed without loading the entire database into memory, making it zero-copy.
They theorized that it would be possible to remove all the overhead associated with a full-fledged database system, such that it would be lightweight enough to be used as a serialization format. They chose the name Lite² since their format is lighter than SQLite.
Despite showing benchmarks, the paper authors did not include code artifacts.
The Lite³ project is an independent interpretation and implementation, with no affiliations or connections to the authors of the original Lite² paper.
Lite³ adds several improvements to the format as described in the paper:
- Support for the types NULL, BOOL, I64, F64, BYTES, STRING, and the nested types OBJECT & ARRAY. These types are implemented as tagged unions.
- JSON compatibility: for conversion to and from JSON, the yyjson library is used.

The idea of taking the SQLite format and applying it to a serialization context is genuinely innovative. Yet, as noted above, the original Lite² publication did not include any reference implementation or code artifacts, despite showing benchmarks of their format being competitive with industry-standard formats like JSON and Protobuf.
To the writer's knowledge, Lite³ is the first project to publish an implementation of this idea. It is published as a publicly consumable C library, hosted on GitHub and licensed under the MIT license.
Besides publication, the most significant contribution is the addition of JSON compatibility, which adds a great amount of usability and interoperability with existing systems.
Many computer science people associate B-trees exclusively with disks. For in-memory data structures, they opt instead for hashmaps and 'classic' binary or red-black trees. But in fact, in-memory B-trees outperform binary and red-black trees, at least in theory. The reason has to do with better spatial locality and higher 'fanout'. A binary or RB-tree requires log₂(N) lookups, where N is the total number of elements. For a B-tree, the fanout is not limited to 2, but rather dependent on configuration. This limits the number of lookups to between logₘ(N) and logₜ(N), where t is the minimum and m the maximum number of children per node. Lite³ uses a default configuration of t = 4, m = 8, configurable up to t = 32, m = 64.
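To put concrete numbers on this: for N = 1,000,000 entries, a binary tree needs log₂(10⁶) ≈ 20 lookups, while the default Lite³ configuration needs between log₈(10⁶) ≈ 7 and log₄(10⁶) ≈ 10, roughly halving the tree height.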
The downside of B-trees is that more key comparisons are required per node. However, with the current trend of memory latencies lagging behind processing speeds, the number of lookups required (the tree height) is the dominant performance factor, and the B-tree stands out in this regard.
Despite these advantages, B-trees are less common in general-purpose use because the algorithms are more complex to implement and hash maps often perform better for simple key-value workloads. Hashmaps provide 'unbeatable' theoretical lookup times of O(1) vs the B-tree's O(log N). However, this glosses over some of the downsides of hashmaps, namely:
- There is no ordered traversal; iterating keys in any meaningful order requires an O(N) scan.
- Lookup times degrade from O(1) up to a theoretical maximum of O(N). This is why most hashmaps are configured to trigger a 'rehash event' when the load factor exceeds some threshold (e.g. 0.85). During such an event, all the keys inside the map must be redistributed, causing a very large spike in latency. Realtime applications such as control systems and finance may have no tolerance for unpredictable latency. Unlike hash maps, B-trees are able to provide hard latency guarantees, as the number of lookups required never exceeds logₜ(N).

In Lite³, all tree nodes and data entries are serialized inside a single byte buffer. Tree hierarchies are established using pointers; all parent nodes have pointers to their child nodes:
Nodes consist of key hashes stored alongside pointers to key-value pairs (data entries), as well as child node pointers:
The key-value pairs are stored anywhere between nodes, or at the end of the buffer:
Over time, deleted entries cause the contiguous byte buffer to accumulate 'holes'. For a serialization format, this is undesirable as data should be compact to occupy less bandwidth. Lite³ will use a defragmentation system to lazily insert new values into the gaps whenever possible. This way, fragmentation is kept under control without excessive data copying or tree rebuilding (STILL WIP, NOT YET IMPLEMENTED).
NOTE: By default, deleted values are overwritten with NULL bytes (0x00). This is a safety feature, since not doing so would leave 'deleted' entries intact inside the datastructure until they are overwritten by other values. If the user wishes to maximize performance at the cost of leaking deleted data, `LITE3_ZERO_MEM_DELETED` should be disabled.
Also, Lite³ does not store raw pointers, but rather 32-bit indexes relative to the buffer pointer. The buffer pointer always points to the zero-index, and the root node is always stored at the zero-index.
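To make the layout concrete, here is a minimal sketch of what such a node could look like in C. The field names and ordering are hypothetical; the real definitions live in lite3.h and may differ.

```c
#include <stdint.h>

/* Sketch of a B-tree node with a maximum fanout of m = 8, which allows at
 * most 7 keys per node. All 'pointers' are 32-bit indexes into the buffer. */
#define MAX_CHILDREN 8
#define MAX_KEYS (MAX_CHILDREN - 1)

typedef struct {
    uint16_t nkeys;                     /* number of keys currently in use   */
    uint16_t is_leaf;                   /* leaf nodes have no children       */
    uint32_t key_hash[MAX_KEYS];        /* 4-byte key digests, kept sorted   */
    uint32_t entry_idx[MAX_KEYS];       /* indexes of key-value data entries */
    uint32_t child_idx[MAX_CHILDREN];   /* indexes of child nodes            */
} node_t;  /* 4 + 28 + 28 + 32 = 92 bytes, close to the 96 quoted later */
```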
Lite³ stores key hash digests inside the B-tree nodes instead of storing the key strings directly, as most databases do.
It turns out that there is an entire research field dedicated to optimizing the layout of strings inside B-tree nodes. A lot of the problem space is shared with that of memory allocators, which also deal with laying out variable-sized elements in memory. This is not a straightforward task, and there are many possible solutions with different trade-offs.
Despite the interesting solutions and work being done in this field, we cannot afford any complex solutions. To support a fast in-memory format, we must be able to find a given key quickly; remember that key comparisons are inside the critical path for node traversal. There is no dominating disk I/O latency that we can 'hide behind'. If we allow ourselves to compare against fixed-sized hashes of the keys, then the entire problem of dealing with variable-sized elements disappears. Variable string comparisons (loops, branching etc.) become single compare instructions at the machine code level. This yields enormous benefits in runtime performance, such that memory latency becomes the dominating factor.
Despite achieving the stated goal of fast runtime performance, we have acquired a new problem: collisions.
For the uninitiated, 'collisions' refer to a situation where two or more distinct inputs map to the same output. This follows from the pigeonhole principle and is an inevitable result of trying to squeeze arbitrary-size data into a fixed-size representation.
To remain JSON-compatible, Lite³ only supports string keys. A collision manifests itself when the DJB2 hashes of two different keys yield the same 4-byte output. Collisions can only happen within a single 'key space'. Every object has its own key space, so a key from object A can only collide with another key in object A, never with keys from another object B, regardless of nesting hierarchies.
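For reference, the classic DJB2 function naturally produces the 4-byte digest discussed here. This is the textbook variant; Lite³'s exact implementation details may differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Classic DJB2: h = h * 33 + byte, starting from 5381. */
static uint32_t djb2(const uint8_t *key, size_t len)
{
    uint32_t h = 5381;
    for (size_t i = 0; i < len; i++)
        h = ((h << 5) + h) + key[i];  /* equivalent to h * 33 + key[i] */
    return h;  /* the 4-byte digest stored inside the node */
}
```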
Inserting a colliding key will not corrupt your data or have side effects. It will simply fail to insert. The collision check in the source code works as follows: once the lookup reaches a matching hash, the target key is assumed to have been found; therefore, if the stored key does not match the target key, we have ourselves a collision.
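In code, the check could look roughly like this (a hypothetical self-contained sketch, not the verbatim source):

```c
#include <stdint.h>
#include <string.h>

/* Sketch: each data entry stores the full key next to the value, so a hash
 * match found during traversal is verified against the stored key bytes. */
typedef struct {
    uint32_t key_len;     /* length of the stored key */
    const uint8_t *key;   /* stored key bytes (UTF-8) */
} entry_t;

/* Returns 0 on a true match, -1 on a collision. */
static int verify_key(const entry_t *e, const uint8_t *key, uint32_t key_len)
{
    /* The 4-byte digests already matched; now compare the actual keys. */
    if (e->key_len != key_len || memcmp(e->key, key, key_len) != 0)
        return -1;  /* two distinct keys share one digest: collision */
    return 0;
}
```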
The mathematics of the birthday problem tell us that the probability of a collision equals 1 - exp(-n(n-1)/(2·2^x)), where n is the number of entries and x the number of hash digest bits. So with a 32-bit digest, the probability of a collision first exceeds 50% at exactly 77,164 object entries:
1 - exp(-77164*(77164-1)/(2*2^32)) ≈ 0.500006797931
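The formula is easy to evaluate yourself; here is a small self-contained program reproducing the number above:

```c
#include <math.h>
#include <stdio.h>

/* Birthday bound: P = 1 - exp(-n(n-1) / (2 * 2^x)) for n entries
 * and an x-bit digest. */
static double p_collision(double n, int x)
{
    return 1.0 - exp(-n * (n - 1.0) / (2.0 * pow(2.0, x)));
}

int main(void)
{
    printf("%.12f\n", p_collision(77164.0, 32));  /* prints ~0.500006797931 */
    return 0;
}
```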
This table illustrates collision probabilities for different numbers of object entries:
| Number of entries | Collision probability |
|---|---|
| 10 | 0.0000000104773789644 |
| 100 | 0.00000115251102195 |
| 1,000 | 0.000116292144049 |
| 10,000 | 0.0115728810571 |
| 20,000 | 0.0454963391113 |
| 30,000 | 0.0994686465822 |
| 50,000 | 0.252508603738 |
| 75,000 | 0.480468302096 |
| 100,000 | 0.687809461339 |
| 150,000 | 0.927148144667 |
| 200,000 | 0.990501197966 |
| 1,000,000 | 1.000000000 |
Notice the sharp rise between 10,000 and 100,000 entries? The probability rises quickly from ~1.1% to ~69%.
However, this calculation assumes a perfectly random key distribution. Given our choice of encoding (UTF-8), hash function (DJB2) and language patterns in our key names, the actual distribution is far from random. It is possible to insert millions of keys and convert Gigabytes of JSON data to Lite³ without ever encountering a collision. Alternatively, some unlucky 2nd key might collide immediately with the first. It is all dependent on the data. Therefore, take this as a rule-of-thumb rather than gospel.
So what is the takeaway? Use proper error handling to catch collisions during development, not in production. Luckily, competent engineers writing production code are already doing this anyways 😉.
One of the nice properties of hash functions is that they are deterministic. So if you know your keys beforehand, you also know whether there will be a collision. If you want to exhaustively check whether a key will collide, all you have to do is DJB2-hash every UTF-8 encoded key that will ever appear together inside an object, and look for duplicate digests.
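A minimal sketch of such an exhaustive check, reusing the djb2() function sketched earlier (the key list is of course application-specific):

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Pairwise digest comparison: O(n²), which is fine for the small key sets
 * of a typical application. Assumes djb2() as sketched earlier. */
static int keys_collide(const char **keys, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (djb2((const uint8_t *)keys[i], strlen(keys[i])) ==
                djb2((const uint8_t *)keys[j], strlen(keys[j])))
                return 1;  /* collision found: remodel these keys */
    return 0;
}

int main(void)
{
    const char *keys[] = { "id", "name", "created_at", "payload" };
    printf("collision: %d\n", keys_collide(keys, 4));
    return 0;
}
```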
Alternatively, write a test to serialize data using all possible object keys inside your app. If the test succeeds, you never need to worry about collisions. Otherwise, slightly remodel your data and try again.
Also, try to avoid patterns like the following, where every object key is some kind of UUID (made-up values for illustration):
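```json
{
  "550e8400-e29b-41d4-a716-446655440000": { "name": "Alice" },
  "f47ac10b-58cc-4372-a567-0e02b2c3d479": { "name": "Bob" }
}
```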
Inside the Lite³ format, all of these strings are hashed down to 4-byte representations. Sooner or later, this will trigger a collision. Instead, store your IDs inside the objects themselves, for example like this:
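```json
[
  { "id": "550e8400-e29b-41d4-a716-446655440000", "name": "Alice" },
  { "id": "f47ac10b-58cc-4372-a567-0e02b2c3d479", "name": "Bob" }
]
```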
Arrays can never have collisions, because indexes are unique: exactly one element is stored at every index.
"But my app needs fast UUID lookups at runtime"
Then you have several options:
- Store the UUIDs as LITE3_TYPE_BYTES values and build an in-memory lookup index (e.g. a hashmap) over them at load time.

Of course, the possibility of collisions is not ideal. Therefore, this shortfall will be addressed in a future release of Lite³. We can take inspiration from hash maps, which use 'probing strategies' to resolve key collisions.
When a single key maps to a single value, the mapping is said to be 1-way associative. Hardware people call this a 'direct mapped cache'. We can allow multiple logical keys to map to a single physical key. This turns the datastructure from 1-way associative to N-way associative.
The exact details and workings of this mechanism have not yet been determined. So as not to delay the public release of the Lite³ library, this feature has been left for the future. Daring and competent individuals should feel free to submit a pull request with a solution.
The ability of Lite³ to store semi-structured data makes it a suitable file format for small- to medium-sized documents. Currently, there is a size limit of 4GiB due to the use of 32-bit indexes. Objects and arrays can also store a maximum of 67,108,864 (2^26) elements or entries.
However, Lite³ is not a replacement for a database. Beware that when reading/writing Lite³ structures to disk, you do not have access to database features such as atomic transactions, crash recovery, or concurrent access.
If you need a high-performance on-disk key-value store, choose something like LMDB.
If you need complex queries and flexibility, opt for more traditional databases.
That said, the options for storing Lite³ structures on disk are:
- direct file I/O
- mmap()
- byte array/BLOB fields inside of a database engine

For read-only operations, the choice does not matter. For write operations, however, the programmer needs to take some care:
direct file I/O
This is dangerous, because if the system crashes while syncing data to disk, the structure can end up in a corrupt intermediate state. The most reliable atomic pattern is to write your new data to a temporary file, flush and sync it to disk, then use the rename() syscall to replace the target file. On Linux, rename() is atomic at the filesystem level.
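A sketch of this pattern on POSIX systems, with error handling trimmed for brevity; `buf`/`len` stand for a serialized Lite³ structure:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Write buf to path atomically: temp file -> fsync -> rename. */
static int save_atomic(const char *path, const void *buf, size_t len)
{
    char tmp[4096];
    snprintf(tmp, sizeof(tmp), "%s.tmp", path);

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;

    if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    close(fd);

    return rename(tmp, path);  /* atomic replacement on POSIX filesystems */
}
```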
mmap()
Write-maps do not have built-in atomicity. Even when calling fsync() manually, you rely on the OS to sync changes to disk. An unexpected crash could leave your data in a corrupted and permanently irrecoverable state. If you choose this option, you must manually implement a mechanism to ensure changes are atomic.
Using byte array/BLOB fields inside of a database engine
This is the most reliable option. The database of your choice likely already implements atomic transactions out-of-the-box. Lite³ structures should be stored as a byte array, BLOB or equivalent type. Some database engines support returning a 'view' object into the byte buffer which allows partial access through fixed chunks (~4-64 kB), but in most cases it is best to simply load the entire structure into memory.
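For example, storing a serialized Lite³ buffer in SQLite could look like this. The table schema is made up for illustration; the sqlite3_* calls are the real C API:

```c
#include <sqlite3.h>

/* Insert a serialized Lite³ buffer into a BLOB column. Assumes a table
 * created with: CREATE TABLE docs (id INTEGER PRIMARY KEY, body BLOB). */
static int store_doc(sqlite3 *db, const void *buf, int len)
{
    sqlite3_stmt *st;
    if (sqlite3_prepare_v2(db, "INSERT INTO docs (body) VALUES (?1)",
                           -1, &st, NULL) != SQLITE_OK)
        return -1;

    sqlite3_bind_blob(st, 1, buf, len, SQLITE_TRANSIENT);  /* copies the bytes */
    int rc = (sqlite3_step(st) == SQLITE_DONE) ? 0 : -1;
    sqlite3_finalize(st);
    return rc;
}
```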
While it may seem wasteful to store the metadata for an entire B-tree inside a serialized message, remember that this B-tree is an unusually compact variant, where a single node only takes up 96 bytes for a fanout of up to 8 (vs binary tree fanout 2). Doing this completely eliminates an otherwise required parsing/serializing step and reduces it to a single memcpy(). In effect, a tradeoff is made in favour of CPU cycles over bandwidth. To give an idea of size: in the simdjson benchmark using real Twitter API data, the same data stored in Lite³ format occupied ~40% more space than JSON. This is a dataset containing heavy nesting and Unicode strings.
While the Lite³ library does not directly include compression features, it employs various compacting techniques such as bitfields and varint key-length prefixes. For extra size reduction, it is possible to manually layer compression over Lite³, though this sacrifices the zero-copy capability. On the flip side, compression is very easy to parallelize on a separate thread, so in practice it may prove beneficial. Sliding-window type codecs are especially good at eliminating redundancy across the entire dataset, which would be difficult to achieve by the format itself. Furthermore, a manual choice of codec can be tuned for specific size/performance trade-offs depending on the project requirements. When deploying compression, it is desirable to compress on a stream basis rather than a per-message basis, as the streaming approach eliminates redundancy across an entire session as opposed to mere individual messages.
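As an illustration, layering a general-purpose codec such as zstd over a finished Lite³ buffer might look like this. The ZSTD_* calls are the real zstd API; everything else is a sketch:

```c
#include <stdlib.h>
#include <zstd.h>

/* Compress a serialized Lite³ buffer before sending it over the wire.
 * Zero-copy access is lost until the receiver decompresses it again. */
static void *compress_message(const void *buf, size_t len, size_t *out_len)
{
    size_t cap = ZSTD_compressBound(len);  /* worst-case compressed size */
    void *dst = malloc(cap);
    if (!dst) return NULL;

    size_t n = ZSTD_compress(dst, cap, buf, len, 3);  /* level 3: balanced */
    if (ZSTD_isError(n)) {
        free(dst);
        return NULL;
    }
    *out_len = n;
    return dst;
}
```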
Before using compression, make sure both `LITE3_ZERO_MEM_DELETED` and `LITE3_ZERO_MEM_EXTRA` are defined inside the lite3.h header, or pass the compiler flags `-DLITE3_ZERO_MEM_DELETED` and `-DLITE3_ZERO_MEM_EXTRA`.
A size benchmark to assess compression ratios for different codecs is on the TODO list.
Some applications require canonical encoding, meaning there should exist only one valid encoding of any given message. This property is important for cryptographic applications such as digital signatures and content hashing, where the same logical message must always map to the same bytes.
A format can be canonical on two levels: by specification, or by implementation.
Being canonical by specification is the strongest form of canonicalization: it follows that any correct encoder implementation should produce canonical encodings. An example of this is the ASN.1 format, specifically the DER variant.
Protobuf serialization is not (and cannot be) canonical for reasons related to handling of unknown fields.
Some formats are not canonical by default, but offer alternative specifications. An example of this is the JSON RFC 8785 standard. It provides additional rules for the ordering of keys, exact floating-point representation and whitespace/Unicode normalization.
For Lite³ there does not exist an official standard (yet), and currently this repository is the only existing implementation. However, it is possible to achieve canonical encodings using the following encoding rules:
Future formats building on this work can improve on it in a number of ways. Some ideas include:
Systems are rarely fully isolated. In the modern age of microservices, IoT, and the internet, distributed systems are the common case. As such, the convergence of the industry on a small number of data formats is completely expected. Standardization enables interoperability and compatibility, such that new technologies can attach themselves to existing ones without barriers. The alternative result: starving in isolation while suffering from 'data impedance mismatch'. New ideas might be created, but if they are incompatible, they do not get used and their fate is sealed.
This problem is still very visible in the context of full-stack applications, where the application layer and the database layer each have their own model of the data.
Translation between the two manifests itself in various ORM layers, ETL (Extract, Transform, Load) pipelines and other transformations that introduce friction, complexity, brittleness and overall fragility in the system as a whole. Schema evolution can break down at the ABI layer before it even reaches the application layer.
Fundamentally, these problems stem from viewing systems in isolation. Every programming language and framework has its own opinionated view of how data should be modeled and represented. Sometimes, this view might be justified. Many times, it complicates extensibility and interoperability. In a distributed world, one needs to think about the data source and destination. Standardized data formats have contributed immensely to innovation, enabling universal APIs, storage formats and more.
Yet as of today, no data format combines all of the following properties:
- JSON compatibility
- zero-copy reads and writes
- schemaless operation
The combination of all three could unify so many applications that a new standard is able to arise.
The name Lite³ was chosen since it is lighter than Lite².
TIP: To type `³` on your keyboard: on Linux, hold `Ctrl+Shift+U`, then type `00B3`. On Windows, hold `Alt` and type `0179` on the numpad.