Lite³
A JSON-Compatible Zero-Copy Serialization Format
Design and Limitations

Introduction

Lite³ is a zero-copy binary serialization format that encodes data as a B-tree inside a single contiguous buffer, allowing any field to be read or mutated in O(log n) time. Essentially, it functions as a serialized dictionary.

As a result, the serialization boundary has been broken: 'parsing' or 'serializing' in the traditional sense is no longer necessary. Lite³ structures can be read and mutated directly, much like hashmaps or binary trees, and since they live in a single contiguous buffer, they are always ready to send.

Compared to other binary formats, Lite³ is schemaless and self-describing (no IDL or schema definitions required) and supports conversion to/from JSON, enabling compatibility with existing datasets/APIs and easy debugging and inspection of messages.

Thanks to the cache-friendly properties of the B-tree and the minimalistic C implementation (9.3 kB), Lite³ outperforms the fastest JSON libraries (including those that use SIMD) by up to 120x depending on the benchmark. It also outperforms schema-only formats such as Google Flatbuffers (by up to 242x). Lite³ is possibly the fastest schemaless data format in the world.

Example to illustrate:

  1. A Lite³ message is received from a socket
  2. Without doing any parsing, the user can immediately:
    • Lookup keys and read values via zero-copy pointers
    • Insert/overwrite arbitrary key/value entries
  3. After all operations are done, the structure can be transmitted 'as-is' (no serialization required, just memcpy())
  4. The receiver then has access to all the same operations

Typically, such a scenario would require distinct 'serializing' and 'deserializing' steps. Lite³, however, blurs the line between memory and wire formats, allowing direct access, traversal and mutation of a serialized buffer.
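
The same flow as a minimal C sketch. The function names and signatures here are hypothetical stand-ins, not the actual Lite³ API; consult the repository headers for the real interface:

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical prototypes, for illustration only. */
    extern int  lite3_get_i64(const uint8_t *buf, const char *key, int64_t *out);
    extern int  lite3_set_i64(uint8_t *buf, size_t *len, size_t cap,
                              const char *key, int64_t val);
    extern void send_bytes(const uint8_t *buf, size_t len);

    void handle_message(uint8_t *buf, size_t *len, size_t cap)
    {
        int64_t count = 0;
        lite3_get_i64(buf, "count", &count);              /* zero-copy lookup, no parsing  */
        lite3_set_i64(buf, len, cap, "count", count + 1); /* in-place insert/overwrite     */
        send_bytes(buf, *len);                            /* already serialized: just send */
    }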

Original idea

The original idea came from a paper published in December 2024 as Lite²:

Tianyi Chen †, Xiaotong Guan †, Shi Shuai †, Cuiting Huang † and Michal Aibin †
(2024).
Lite²: A Schemaless Zero-Copy Serialization Format
https://doi.org/10.3390/computers13040089

The paper describes a serialization format where all entries are organized as key-value pairs inside a B-tree. The authors drew their inspiration from SQL databases: arbitrary keys can be inserted, making the format schemaless, and a key lookup can be performed without loading the entire database into memory, making it zero-copy.

They theorized that it would be possible to remove all the overhead associated with a full-fledged database system, such that it would be lightweight enough to be used as a serialization format. They chose the name Lite² since their format is lighter than SQLite.
Despite showing benchmarks, the paper authors did not include code artifacts.

Improvements

The Lite³ project is an independent interpretation and implementation, with no affiliations or connections to the authors of the original Lite² paper.

Lite³ adds several improvements to the format as described in the paper:

  1. The paper describes a B-tree algorithm as commonly seen in literature. This kind of algorithm travels down from the root node to find a leaf, then if a node split occurs, it cascades updates all the way up to the root node.
    The Lite³ implementation uses a more efficient 'one-pass' algorithm where all node updates happen during traversal, not after. This eliminates backtracking and cuts the number of traveled nodes in half.
  2. The paper uses tree node values of 3 bytes. While this allows you to store slightly more values inside of a node, this is an awkward size for a computer. Lite³ uses 4-byte values, matching native instruction sizes of many architectures. This also raises the maximum message size from 16MiB to 4GiB and reduces the chance of hash collisions by a factor of 16.
  3. Fields have been downsized and outright removed, such that the resulting message is more compact.
    For example:
    • The entire 'header' section has been removed, saving 15 bytes on every message.
    • The one-pass algorithm has removed the need for parent pointers ('PARENT_OFFSET' & 'PARENT_IDX') saving 4 bytes per node.
    • 'CTRL' and 'LEN' fields in entries have been removed, saving 4 bytes per entry.
    • The 4-byte key size tag ('KEY_LEN') is made variable length, such that it takes up 1 byte for most small key sizes (< 64 bytes).
    • The 3-byte 'VAL_LEN' tag is replaced by a 1-byte type tag.
  4. An entirely new type system has been added, supporting NULL, BOOL, I64, F64, BYTES, STRING, and nested types OBJECT & ARRAY. These types are implemented as tagged unions.
  5. The type system has enabled the implementation of full JSON interoperability, such that it is now possible to convert any JSON message to the Lite³ format and vice versa. For this, the yyjson library is used.

The idea of taking the SQLite format and applying it to a serialization context is genuinely innovative. Yet the original Lite² publication did not include any reference implementation or code artifacts, despite showing benchmarks of their format being competitive with industry-standard formats like JSON and Protobuf.

To the writer's knowledge, Lite³ is the first project to publish an implementation of this idea, in the form of a publicly consumable C library, hosted on GitHub and licensed under the MIT license.

Besides publication, the most significant contribution is the addition of JSON compatibility, which adds a great amount of usability and interoperability with existing systems.

Why B-tree?

Lite³ internally uses a B-tree data structure. B-trees are well-known, performant data structures, as evidenced by the many popular databases using them (SQLite, MySQL, PostgreSQL, MongoDB & DynamoDB). The reason for their popularity is that they allow dynamic insertions/deletions while keeping the structure balanced and always guaranteeing O(log n) lookups.

However, B-trees are rarely found in memory or serialization contexts. Many programmers associate B-trees exclusively with disks; for in-memory data structures, common wisdom opts for hashmaps or classic binary trees, and B-trees are seen as too 'heavyweight'. But the same approach works in memory, provided a much more compact representation is used, one specifically adapted for fast memory operations.

Databases typically use 4 kB to store a node in the tree, matching disk page sizes. In memory, however, the relevant unit is the cache line, not the disk page. Therefore the node size in Lite³ is set to a (configurable) 96 bytes, or 1.5 cache lines, by default. Literature suggests that modern machines have nearly identical latency for 64-byte and 256-byte memory accesses, though larger nodes also increase message size.

CPU performance in the current age is all about access patterns and cache friendliness. It may surprise some that, as a result, in-memory B-trees actually outperform classic binary trees or red-black trees, at least in theory. The reason is better spatial locality and higher 'fanout'. A binary or red-black tree requires log₂(N) lookups, where N is the total number of elements. For a B-tree, the fanout is not limited to 2 but depends on configuration. The number of required lookups is therefore bounded between logₘ(N) and logₜ(N), where t is the minimum and m the maximum number of children per node. Lite³ uses a default configuration of t = 4, m = 8, configurable up to t = 32, m = 64. B-trees require more key comparisons per node, but with memory latencies lagging ever further behind processing speeds, the number of lookups required (tree height) is the dominant performance factor. The access pattern is also much friendlier to the CPU than that of heavy random-access offenders (skip lists, LSM-trees etc.). This is why B-trees are still preferred in databases for maximum read performance.
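
As a worked example: for N = 1,000,000 entries, a binary tree needs around log₂(10⁶) ≈ 20 node visits, while a B-tree with t = 4 needs at most log₄(10⁶) ≈ 10, and one with t = 32 only ≈ 4, each visit being a potential cache miss.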

An algorithmic observer will note that B-trees have logarithmic time complexity, versus the constant-time lookups of hashmaps. But typical serialized messages are small (70% of JSON messages are < 10 kB) and only read once, so the overhead of other operations dominates, except with very frequent updates on large structures (larger than the last-level cache). More importantly, using Lite³ structures completely eliminates an O(n) parsing and serialization step.

Despite these advantages, B-trees are less common in general-purpose use because the algorithms are more complex to implement and hashmaps are perceived to perform better for simple key-value workloads. Hashmaps provide 'unbeatable' theoretical lookup times of O(1) vs B-trees O(log n). However, this fact glosses over some of the downsides of hashmaps, namely:

  1. Hashmaps do not (efficiently) support range queries. Since the keys are stored in pseudorandom order, the access pattern to support a range query is also highly scattered and random, potentially triggering many cache misses. Alternatively, a full O(N) scan is required.
    B-trees, on the other hand, store keys in sorted order. Any ordered access pattern consists of finding the start of the range, then traversing nodes in-order. This form of access is sequential and cache-friendly, potentially providing superlinear improvements over hashmaps.
  2. Hashmaps have unpredictable latency. A hashmap can never provide hard latency guarantees. As the load factor of a hashmap approaches 1, performance degrades and latency increases from O(1) up to a theoretical maximum of O(N). This is why most hashmaps are configured to trigger a 'rehash event' when the load factor exceeds some threshold (e.g. 0.85). During such an event, all the keys inside the map must be redistributed, causing a very large spike in latency. Realtime applications such as control systems and finance may have no tolerance for unpredictable latency. Unlike hashmaps, B-trees are able to provide hard latency guarantees, as the number of lookups required never exceeds logₜ(N).
  3. Hashmaps cannot be stored contiguously. For example, a map with load factor 0.75 leaves 25% of allocated memory unused. This may be relevant for resource-constrained systems or very large in-memory key-value stores (e.g. Redis). More importantly, the mapping of keys to addresses necessarily means that the data structure partially consists of 'empty buckets', the amount depending on the load factor. Hashmaps can therefore never be sent directly over a network without wasting bandwidth. The B-tree, on the other hand, knows no such constraints and can be stored fully contiguously inside a single buffer without any unused space.

Byte Layout

In Lite³, all tree nodes and data entries are serialized inside a single byte buffer. Tree hierarchies are established using (32-bit) relative pointers.

Relative pointers have a few advantages:

  1. They remain stable even if the message is copied to a different absolute address.
  2. They are more compact (4 bytes vs. a full 8-byte pointer on 64-bit systems).
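
To illustrate, resolving a relative pointer is a single addition to the buffer base. A minimal sketch (not the actual Lite³ code):

    #include <stdint.h>

    /* A child pointer is stored as a 32-bit offset from the start of the
     * buffer, so it stays valid no matter which absolute address the
     * buffer is loaded at. */
    static inline void *resolve(uint8_t *buf, uint32_t rel_off)
    {
        return buf + rel_off;
    }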

All parent nodes point to their child nodes:

|------|--------|----------|----------|--------|----------|--------
| ROOT | #DATA# | CHILD #1 | CHILD #2 | #DATA# | CHILD #3 | #DATA# ...
|------|--------|----------|----------|--------|----------|--------
   |              ^          ^                   ^
   |______________|          |                   |
   |_________________________|                   |
   |_____________________________________________|

The buffer pointer always points to the start of the buffer (zero-index). The root node is always stored here and this is where traversal of the B-tree begins.

New B-tree nodes and key-value pairs containing user values are simply appended to the end of the buffer. The algorithm ensures that new values can be added gracefully, updating the index to accommodate the new data. Throughout all insertions and modifications, the B-tree remains balanced and guarantees O(log n) access to any internal field.

B-tree nodes consist of key hashes stored alongside pointers to key-value pairs (data entries), as well as child node pointers:

NODE
|-----------------------------------------------------------------------------|
| h(key1) h(key2) .. h(keyN) | *kv1 *kv2 .. *kvN | *child1 *child2 .. *childN  |
|-----------------------------------------------------------------------------|
        |                            |                        |
   hashed keys                kv-pair (relative)       child node pointers
   used for key comparison    pointers pointing        used for tree traversal
                              to data entries

Object keys (think JSON) are hashed to a 4-byte digest and stored inside B-tree nodes, referring to a specific key-value pair. Thanks to a C macro, user applications can calculate hash digests at compile time to remove the runtime overhead.

As a result of storing hash digests instead of strings, tree traversal inside the critical path can be satisfied entirely using fixed 4-byte word comparisons, never actually requiring string comparisons except for detection of hash collisions. This design choice alone contributes to much of the runtime performance of Lite³.
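
One plausible C layout satisfying these constraints is sketched below. The field names and packing are assumptions for illustration, not the actual Lite³ definitions. Note how, for the default t = 4 / m = 8 configuration, 7 hashes + 7 kv pointers + 8 child pointers at 4 bytes each, plus 8 bytes of bookkeeping, land exactly on the 96-byte node size:

    #include <stdint.h>

    /* Hypothetical node layout: up to 7 entries and 8 children per node.
     * 7*4 + 7*4 + 8*4 = 88 bytes of payload + 8 bytes of bookkeeping
     * = 96 bytes, i.e. 1.5 cache lines. */
    typedef struct {
        uint16_t nkeys;          /* number of entries currently in this node */
        uint16_t flags;          /* e.g. leaf marker                         */
        uint32_t reserved;
        uint32_t hash[7];        /* DJB2 digests, kept in sorted order       */
        uint32_t kv_off[7];      /* relative pointers to key-value pairs     */
        uint32_t child_off[8];   /* relative pointers to child nodes         */
    } lite3_node_sketch;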

Key-value pairs containing user data are stored anywhere between nodes, or at the end of the buffer:

KEY-VALUE PAIR
|--------|-----------|-------------------------|
|  type  |    key    |          value          |
|--------|-----------|-------------------------|
    |          |
  1 byte    string
 type tag     key

Lite³ uses a 'tagged union' approach, where each value is prepended by a 1-byte type tag. This enables the schemaless design, and allows applications to traverse and discover the structure of arbitrary messages, even if the structure is unknown. This is one of the main benefits of being schemaless, as there is no schema definition or IDL required to read Lite³ data.

The key string is prefixed by a variable-length key length tag. This tag occupies 1 byte for key lengths <= 63 (the most common case). For longer keys, it can expand up to 4 bytes.

The actual value is sized according to its type. Some values, like LITE3_TYPE_NULL, can be stored entirely in 1 byte. LITE3_TYPE_I64 and LITE3_TYPE_F64 occupy 8 bytes. Variable-length types like LITE3_TYPE_STRING and LITE3_TYPE_BYTES use an additional 4-byte length prefix.

Arrays also store values with prefixed type tags, though without keys, since array indexes act as 'implicit keys'.

Over time, deleted entries cause the contiguous byte buffer to accumulate 'holes'. For a serialization format, this is undesirable as data should be compact to occupy less bandwidth. Lite³ will use a defragmentation system to lazily insert new values into the gaps whenever possible. This way, fragmentation is kept under control without excessive data copying or tree rebuilding (STILL WIP, NOT YET IMPLEMENTED).

NOTE: By default, deleted values are overwritten with NULL bytes (0x00). This is a safety feature: not doing so would leave 'deleted' entries intact inside the data structure until they are overwritten by other values. If the user wishes to maximize performance at the cost of leaking deleted data, LITE3_ZERO_MEM_DELETED should be disabled.

JSON comparison

JSON is a text format. This makes it very convenient for humans: we can read it in any text editor, understand it and modify it if necessary.

Unfortunately, a computer cannot directly work with numbers in text form. It can display them, sure. But to do useful calculations like addition or multiplication, it must convert them to binary, because processors can only operate on numbers in native word sizes (8-bit, 16-bit, 32-bit, 64-bit). This conversion from text to binary is done through parsing. It happens inside your browser when you visit a website, and all around the world, millions or billions of times per second, by computers communicating with each other through all kinds of protocols and APIs.

There are broadly 3 strategies or 'ways' to parse JSON:

  1. DOM-based approach: The Document Object Model (DOM) is a memory representation of a document as a logical tree, of which the contents can be programmatically accessed and modified. When receiving a JSON-message, it is parsed into a dictionary object or DOM-tree. This structure acts as the 'intermediate form' to support mutations and lookups. This approach is dominant, because for programmers it is by far the easiest. Yes, first all the text must be converted into the tree-model, but this can be done automatically by the computer, and once it is done, all of the document's data is directly available and mutable. This approach is used by virtually every JSON library and even by your browser to represent HTML web pages. To send data, the tree must be serialized again into a JSON string.
  2. Streaming / event-based approach (SAX): This approach consists of a pointer iterating over the document. The text is processed from the beginning, and each newly encountered component is an event. For example, the beginning of a string, the beginning of an array, the end of an array etc. The programmer may provide functions corresponding to each event, such as executing foo(str) every time a string is encountered. This approach allows for capturing only the information that the program needs, ignoring everything else. While more efficient, it is also very inconvenient, because you might discover your program 'missed' an event it now wants to read and has to go back and restart the iterator. The programmer must constantly keep track of their logical position within the document, which can lead to complexity and bugs. Therefore, this approach is not very popular.
  3. On-Demand approach: A relatively new technique employed by the simdjson C++ library. It resembles a DOM-based approach, except that the parser is 'lazy' and will not parse anything until the very last moment when you actually try to read a value. Unfortunately, some restrictions leak into the abstractions of the user API. For example, values can only be consumed once and must be stored separately if you plan to use them multiple times. Therefore it more closely resembles a streaming interface and users might still prefer the DOM-approach for practical reasons. Additionally, the performance of JSON On-Demand is sensitive to the ordering of keys within objects. Many JSON serializers do not preserve or guarantee key ordering in any way. Therefore, it could introduce unexpected variance in read latency. Simdjson also has another major restriction: parsing is read-only. The library does not allow you to modify the JSON document and write it back out.

Despite the DOM-based approach being easiest for the programmer, for the computer it represents a non-trivial amount of work. The text must be parsed to find commas, brackets, decimals etc. A block of memory must be allocated and then the tree must be built. The same data is now duplicated and stored in two different representations: as DOM-tree and string. All these operations add memory overhead, runtime cost and increased latency.

So we spend a lot of time constructing and serializing (separate) DOM-trees. Wouldn't it be great if we could, say, encode the tree directly inside the message?

That is exactly what we do. In Lite³, the DOM-tree index is embedded directly inside the structure of the message itself. The underlying datastructure is a B-tree. As a result, lookups, inserts and deletions (get(), set() & delete()) can be performed directly on the serialized form. Encoding to a separate representation becomes unnecessary. Since it is already serialized, it can be sent over the wire at any time. This works because Lite³ structures are always contiguous in memory. Think of the 'dictionary itself' being sent.
We can perform zero-copy lookups, meaning that we do not need to process the entire message, just the field we are interested in. Similarly, we can insert and delete data only by reading the index and nothing else.

Another side effect of having a fully functioning B-tree inside a serialized message is that data can be mutated directly in serialized form. Tree rebalancing, key inserts etc. all happen inside the message data. Typically, to mutate serialized data en route you would have to:

    Receive JSON string -> Parse -> Mutate -> Stringify -> Send JSON string

With Lite³, this simply becomes:

    Receive Lite³ message -> Mutate -> Send Lite³ message

Many internal datacenter communications consist of patterns where messages are received, a few things are modified, and the result is sent on to other services. In such patterns, the design of Lite³ shines because the reserialization overhead is entirely avoided. It is possible to insert entirely new keys and values, or to override existing values, while completely avoiding reserialization.

With JSON, if you change just a single field inside a 1 MB document, you typically cannot avoid reserializing the entire 1 MB document. But with Lite³, you call an API function to traverse the message, find the field and change it in-place. Such an operation can be done in tens of nanoseconds on modern hardware given the data is present in cache.

Lite³ also implements a BYTES type for native encoding of raw bytes. JSON however does not support this. When converting from Lite³ to JSON, bytes are automatically converted to a base64-encoded string.

Other Serialization Formats

'Binary JSON' formats

A number of formats advertise themselves as 'binary JSON'. Instead of brackets and commas, they typically use a system of tags (TLV: type-length-value) to encode types and values. Being binary also means they store numbers natively, avoiding the parsing of ASCII floating-point decimals, which is a known performance problem for text formats.

The notable contenders are:

  • BSON: or 'binary JSON', known as the underlying document format of MongoDB. It supports various native types such as Date objects.
  • Amazon Ion: A JSON-superset format used by Amazon with a richer type system, similar to BSON.
  • MsgPack: Very fast and small format. Can achieve good performance depending on implementation.
  • CBOR: Inspired by MessagePack but with IETF standardization and extended types.
  • Smile: A format that implements back-referencing (deduplication) to achieve smaller message sizes compared to JSON. Used in ElasticSearch.
  • UBJSON: Aims to achieve the generality of JSON while simplifying parser implementations.

While these formats store numbers as binary, they do not solve many other problems. A separate memory representation such as a DOM-tree is still required to conveniently read data and support meaningful mutations.

Fundamentally, this flaw arises out of the fact that all data is stored contiguously without much structure. To find an element inside, typically an O(n) linear search is required. This is particularly problematic for random access on large element counts. Additionally, the contiguous nature means that a change to an internal element could require the entire document to be rewritten.

In contrast, Lite³ is a zero-copy format storing all internal elements within a B-tree structure, guaranteeing O(log n) amortized time complexity for access and modification of any internal value, even inside arrays. The modification of an internal element will also never trigger a rewrite of the document. Only the target element might require reallocation and updating of the corresponding reference. Even throughout modifications, zero-copy access is maintained. The API exposed on Lite³ data is almost identical to a dictionary, except that Lite³ does not require any 'parsing' or transformations since the data format (B-tree) already supports it directly.

Other 'binary JSON' formats may therefore be interesting from a perspective of compactness or rich typing. From the standpoint of encode/decode performance, however, they exist in a lower category.

Schema-only Binary Formats

By making a number of assumptions about the structure of a serialized message, it is possible to (greatly) accelerate the process of encoding/decoding to/from the serialized form. This is where so-called 'binary formats' come in.

But if binary formats exist and are much faster, why is everyone still using JSON?

The answer lies in the fact that most binary formats require a schema file written in an IDL; basically an instruction manual for how to read a message. A binary message doesn't actually tell you anything. It is literally just a bunch of bytes. Only with the schema does it acquire meaning. Note that 'binary' does not necessarily mean schema-only, though in practice this is often implied.

When sending messages between systems you control, you can create your own schemas. But communicating with other people's servers? Now you need to use their schemas as well. And if you want to communicate with the whole world? Well you better start collecting schemas. Relying on out-of-band information eventually takes its toll. Imagine needing an instruction manual for every person you wanted to talk to. Crazy right?

Because of these restrictions, schema-only formats reside in their own special category, notably distinct from schemaless and self-describing formats like JSON, which can be directly read and interpreted without the requirement of extra outside information.

That said, these formats come in 3 primary forms:

  1. Static code generation: This is the approach taken by Protocol Buffers. Schema definition files (.proto) are compiled using the external protoc Protobuf compiler into encoding/decoding source code (C++, Go, Java) for each type of message. This is also the approach taken by Rust's "serde" crate, although the compilation is performed by the Rust compiler, not an external tool.
  2. Compile-time reflection: With 'reflection' (a.k.a. introspection), the compiler inspects the code's own structure during compilation to generate the serialization logic, avoiding the need for a separate schema file. This approach is used by C++ libraries such as Glaze and cpp-reflect. Furthermore, C++ is getting reflection in C++26. However, these libraries only support serialization to/from C++ classes. There is no cross-language support.
  3. Runtime reflection: In C#, the JsonSerializer collects metadata at run time by using reflection. Whenever it has to serialize or deserialize a type for the first time, it collects and caches this metadata. This approach resembles a JIT-like way to accelerate 'learned' message schemas. But the downside is that the metadata collection process takes time and uses memory.
    Other environments such as the V8 JavaScript engine employ a 'fast path' for JSON serialization if the message conforms to specific rules. Otherwise, the traditional 'slow path' must be taken.

Schema-only formats tend to be brittle and require simultaneous end-to-end upgrades to handle change. Although backwards-compatible evolution is possible, it requires recompilation and synchronization of IDL specifications. But updating all clients and servers simultaneously can be challenging. Major changes like renaming, removing fields or changing types can lead to silent data loss or incompatibility if not handled correctly. In some cases it is better to define a new API endpoint and deprecate the old one.

Here is a section taken from the Simple Binary Encoding's "Design Principles" page:

Backwards Compatibility
In a large enterprise, or across enterprises, it is not always possible to upgrade all systems at the same time. For communication to continue working the message formats have to be backwards compatible, i.e. an older system should be able to read a newer version of the same message and vice versa.

An extension mechanism is designed into SBE which allows for the introduction of new optional fields within a message that the new systems can use while the older systems ignore them until upgrade. If new mandatory fields are required or a fundamental structural change is required then a new message type must be employed because it is no longer a semantic extension of an existing message type.

For self-describing formats like Lite³ and JSON, adding and removing fields is easy. It is only positional formats that have this problem. More precisely:

  • Positional formats: Require backwards compatibility on the ABI layer + application layer.
  • Self-describing formats: Require backwards compatibility only on the application layer.

Zero-Copy Formats

There sometimes exists ambiguity and confusion around the term 'zero-copy'. What does it mean for a data format to be 'zero-copy', or any system in general?

In most contexts, zero-copy refers to a method of accessing and using data without physically moving or duplicating it from its original location. The guiding principle being that compute resources should be spent on real work or calculations, not wasting CPU cycles on unnecessarily relocating and moving bytes around. Applications, operating systems and programming languages may all use zero-copy techniques in various forms, under various names:

  • C-like languages: Taking pointers into buffers instead of passing around entire values
  • Linux kernel: splice() and sendfile() system calls
  • Rust: &[u8] slices, std::io::Cursor and bytes::Bytes
  • C++: std::string_view and std::span
  • Python: memoryview object
  • Java: java.nio.ByteBuffer and slice()
  • C#: Span<T>, Memory<T> and ReadOnlySpan<T>
  • Go: Slices (e.g., mySlice := myArray[1:4])
  • JavaScript / Node.js: Buffer.subarray(), TypedArray views on an ArrayBuffer

Moving around data is a real cost. In the best case, performance degrades linearly with the size of the data being moved. In reality, memory allocations, cache misses, garbage collection and other overhead mean that these costs can multiply non-linearly.

When we talk about 'zero-copy serialization formats', the format should support reading values from their original location, i.e. directly from the serialized message. If a format requires that its contents are first transformed into an alternative memory representation (i.e. a DOM-tree), then it does not classify as zero-copy.

In some cases, a format may support 'zero-copy' references to string or byte objects. However, the ability to access some member field by reference does not immediately make a format 'zero-copy'. Instead, every member field must be accessible without requiring any parsing or transformation step on a received message.

The 4 most notable existing zero-copy formats are:

  1. Flatbuffers (by Google): Mainly used in game development or performance-heavy applications. One caveat is that while decoding Flatbuffers is very fast, encoding is much slower. In fact, modern JSON libraries can beat it at encoding. Also, Flatbuffers has been criticized for having an awkward API. This is because logically, the data needs to be encoded inside-out: all children/leaf sizes must be known before the parent can be encoded.
  2. Flexbuffers (by Google): The schemaless version of Flatbuffers, made by the same author. It cannot match the instant decoding speed of Flatbuffers, but being schemaless it requires no schema definitions. It never gained much popularity or adoption.
  3. Cap'n Proto: Similar to Flatbuffers, but using a more standard builder API. Despite having good performance, it sees limited use in real applications and its author is only semi-actively maintaining it. Another downside is that optional fields take space over the wire, even when unused.
  4. SBE (Simple Binary Encoding): Mainly used in low-latency financial applications (realtime trades & price data). The format was designed specifically for this purpose and resembles little more than a struct sent over a network. It requires schema definitions written in XML.

Note that all of these formats except Flexbuffers require rigid, pre-defined schema compiled into your application. Also, none of these formats support arbitrary mutation of serialized data. If a single field must be changed, then the entire message must be re-serialized. Only the latter 2 formats support trivial in-place mutation of fixed-sized values.

Compression

While it may seem wasteful to store metadata for an entire B-tree inside a serialized message, remember that this B-tree is an unusually compact variant, where a single node only takes up 96 bytes for a fanout of up to 8 (vs. binary tree fanout 2). Doing this completely eliminates an otherwise required parsing/serializing step and reduces it to a single memcpy(). In effect, a tradeoff is made in favour of CPU cycles over bandwidth. To give an idea of size: in the simdjson benchmark using real Twitter API data, the same data stored in Lite³ format occupied ~40% more space than JSON. This is a dataset containing much nesting and many Unicode strings.

While the Lite³ library does not directly include compression features, it employs various compacting techniques such as bitfields and varint key-length prefixes. For extra size reduction, it is possible to manually layer compression over Lite³, though this loses the zero-copy capability. On the flip side, compression is easy to parallelize on a separate thread, so in practice it may prove beneficial. Sliding-window codecs are especially good at eliminating redundancy across the entire dataset, which would be difficult to achieve within the format itself. Furthermore, codecs can be manually tuned for specific size/performance trade-offs depending on project requirements. When deploying compression, it is desirable to compress on a stream basis rather than a per-message basis, as the streaming approach eliminates redundancy across an entire session rather than mere individual messages.

Before using compression, make sure both LITE3_ZERO_MEM_DELETED and LITE3_ZERO_MEM_EXTRA are defined inside the lite3.h header, or alternatively passed as the compiler flags -DLITE3_ZERO_MEM_DELETED and -DLITE3_ZERO_MEM_EXTRA.
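
For illustration, a per-message wrapper using the zstd library could look like the sketch below. This helper is layered on top of Lite³ and is not part of the library:

    #include <stdlib.h>
    #include <zstd.h>

    /* Compress a finished Lite³ buffer before putting it on the wire.
     * Returns the compressed size, or 0 on error. Caller frees *out. */
    size_t compress_message(const void *msg, size_t msg_len, void **out)
    {
        size_t cap = ZSTD_compressBound(msg_len);
        *out = malloc(cap);
        if (*out == NULL) return 0;
        size_t n = ZSTD_compress(*out, cap, msg, msg_len, 3 /* level */);
        if (ZSTD_isError(n)) { free(*out); *out = NULL; return 0; }
        return n;
    }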

A size benchmark to assess compression ratios for different codecs is on the TODO list.

Canonicalization

Some applications require canonical encoding, meaning there should exist only one valid encoding of any given message. This property is important for cryptographic applications such as:

  • using a message as a key inside a hash table
  • taking a fingerprint or checksum of a message
  • comparing serialized payloads for equality

A format can be canonical on two levels:

  1. by specification
  2. by implementation

Being canonical by specification is the strongest form of canonicalization: it follows that any correct encoder implementation should produce canonical encodings. An example of this is the ASN.1 format, specifically the DER variant.

Protobuf serialization is not (and cannot be) canonical for reasons related to handling of unknown fields.

Some formats are not canonical by default, but offer alternative specifications. An example of this is the JSON RFC 8785 standard. It provides additional rules for the ordering of keys, exact floating-point representation and whitespace/Unicode normalization.

For Lite³, there is no official standard (yet), and currently this repository is the only existing implementation. However, it is possible to achieve canonical encodings using the following encoding rules:

Lite³ canonical encoding rules:
1. `LITE3_ZERO_MEM_DELETED` and `LITE3_ZERO_MEM_EXTRA` features must be enabled
2. all numbers are encoded as little-endian
3. all object keys must be inserted in lexicographic order
4. all array elements must be inserted in ascending order
5. all nested objects/arrays must be inserted depth-first
6. all string data must be UTF-8 normalized as per [NFC / RFC 5198](https://www.rfc-editor.org/rfc/rfc5198.html)
7. IEEE float value `-0.0` must be normalized to `+0.0`
8. IEEE float with exponent bits all zero AND mantissa non-zero becomes `+0.0`
9. IEEE float value `NaN` must be normalized to `0x7FF8000000000000` (quiet `NaN`)
10. after message construction, it becomes immutable
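
A sketch of rules 7-9 in C, assuming double is an IEEE 754 binary64 (illustrative, not library code):

    #include <math.h>
    #include <stdint.h>
    #include <string.h>

    static double canon_f64(double v)
    {
        if (isnan(v)) {                        /* rule 9: canonical quiet NaN */
            uint64_t bits = 0x7FF8000000000000ULL;
            memcpy(&v, &bits, sizeof v);
            return v;
        }
        if (v == 0.0)                          /* rule 7: -0.0 -> +0.0 */
            return +0.0;
        if (fpclassify(v) == FP_SUBNORMAL)     /* rule 8: exponent all zero,
                                                  mantissa non-zero -> +0.0 */
            return +0.0;
        return v;
    }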

Hash Collisions

Problem statement

Lite³ stores key hash digests inside the B-tree nodes, instead of storing the keys directly as most databases do.

It turns out that there is an entire research field dedicated to optimizing the layout of strings inside B-tree nodes. Much of the problem space is shared with memory allocators, which also deal with laying out variable-sized elements in memory. This is not a straightforward task, and there are many possible solutions with different trade-offs.

Despite the interesting solutions and work being done in this field, we cannot afford any complex solutions. To support a fast in-memory format, we must be able to find a given key quickly; remember that key comparisons are inside the critical path for node traversal. There is no dominating disk I/O latency that we can 'hide behind'. If we allow ourselves to compare against fixed-sized hashes of the keys, then the entire problem of dealing with variable-sized elements disappears. Variable string comparisons (loops, branching etc.) become single compare instructions at the machine code level. This yields enormous benefits in runtime performance, such that memory latency becomes the dominating factor.

Despite achieving the stated goal of fast runtime performance, we have acquired a new problem: collisions.

For the uninitiated, 'collisions' refer to situations where two or more distinct inputs map to the same output. By the pigeonhole principle, this is an inevitable result of trying to squeeze arbitrary-size data into a fixed-size representation.

To remain JSON-compatible, Lite³ only supports string keys. Collisions manifest when the DJB2 hashes of two different keys yield the same 4-byte output. Collisions can only happen within a single 'key space'. Every object has its own key space, so a key from object A can only collide with another key in object A, never with keys from another object B, regardless of nesting hierarchies.

Collision Probability

The mathematics of the birthday problem tells us that the probability of a collision equals 1 - exp(-n*(n-1)/(2*2^x)), where n is the number of entries and x is the number of hash digest bits. With a 32-bit digest, we find that after 77,164 object entries, the probability of a collision first exceeds 50%:

    1 - exp(-77164*(77164-1)/(2*2^32)) ≈ 0.500006797931

This table illustrates the probability of at least one collision occurring, for different numbers of object entries:

Number of entries    Collision probability
10                   0.0000000104773789644
100                  0.00000115251102195
1,000                0.000116292144049
10,000               0.0115728810571
20,000               0.0454963391113
30,000               0.0994686465822
50,000               0.252508603738
75,000               0.480468302096
100,000              0.687809461339
150,000              0.927148144667
200,000              0.990501197966
1,000,000            1.000000000

However, this calculation assumes a perfectly random key distribution. Given the choice of encoding (UTF-8), hash function (DJB2) and language patterns in key names, the actual distribution is far from random. Therefore, take this as a rule-of-thumb rather than gospel.

Quadratic Probing

Without a strategy for resolving collisions, a colliding key could fail to insert, or worse, overwrite an unrelated key.

Therefore Lite³ uses a collision resolution technique called 'quadratic probing', as also used in hash maps.

The algorithm starts by attempting to insert a target key. If the hash digest of this key collides with that of another key, it will retry with digest + N * N, where N is the number of probing attempts so far. Probing keeps retrying until an empty slot is found or the LITE3_HASH_PROBE_MAX limit is reached (128 by default).

This way, collisions can be naturally resolved.
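
A sketch of the probing loop described above (slot_taken() is a hypothetical stand-in for "this digest already maps to a different key"):

    #include <stdbool.h>
    #include <stdint.h>

    #define LITE3_HASH_PROBE_MAX 128

    extern bool slot_taken(uint32_t digest);   /* hypothetical helper */

    /* Returns true and writes the first free digest to *out, or false if
     * the probe limit is exhausted. The arithmetic wraps mod 2^32. */
    static bool probe_digest(uint32_t digest, uint32_t *out)
    {
        for (uint32_t n = 0; n < LITE3_HASH_PROBE_MAX; n++) {
            uint32_t candidate = digest + n * n;
            if (!slot_taken(candidate)) {
                *out = candidate;
                return true;
            }
        }
        return false;
    }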

Hash Function

The function used for hashing of object keys is DJB2 (Daniel J. Bernstein hash 2), for two primary reasons:

  1. DJB2 somewhat preserves the lexicographical ordering inside the hash digests. So digests for keys "item_0001", "item_0002" and "item_0003" are actually more likely to also be placed sequentially inside the B-tree nodes. This can be useful when doing a sequential scan on the semantic key names. Otherwise a lot more random access would be required.
  2. DJB2 is so simple that it can be calculated entirely by the C preprocessor at compile time for string literals (using the LITE3_KEY_DATA(s) macro). This means message consumers using the C macro never need to pay the runtime cost of hashing. Such a technique would be difficult or impossible to do with other hash functions.

For now, DJB2 is working well. However more testing will be done before the choice of DJB2 is fixed inside the Lite³ format specification. It is entirely possible that a different hash function will be decided on, such as XXH32 (xxHash32).
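
For reference, the textbook DJB2 function is tiny. The exact variant and truncation used by Lite³ should be checked against the source; this is the classic additive version:

    #include <stdint.h>
    #include <stddef.h>

    /* Textbook DJB2: h = h * 33 + c, seeded with 5381, truncated here to
     * 32 bits. For string literals the same arithmetic can be unrolled in
     * a macro, letting the compiler fold the digest into a constant. */
    static uint32_t djb2_32(const char *s, size_t len)
    {
        uint32_t h = 5381;
        for (size_t i = 0; i < len; i++)
            h = (h << 5) + h + (uint8_t)s[i];  /* h * 33 + c */
        return h;
    }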

On-Disk Storage

The ability of Lite³ to store semi-structured data makes it a suitable file format for small- to medium-sized documents. Currently, there is a size limit of 4GiB due to the use of 32-bit indexes. Objects and arrays can also store a maximum of 67,108,864 (2^26) elements or entries.

However, Lite³ is not a replacement for a database. Beware that when reading/writing Lite³ structures to disk, you do not have access to database features like:

  • complex queries
  • fast disk seeks
  • multi-versioning (MVCC)
  • atomicity guarantees
  • crash recovery

If you need a high performance on disk key-value store, choose something like LMDB.
If you need complex queries and flexibility, opt for more traditional databases.

That said, the options for storing Lite³ structures on disk are:

  1. direct file I/O
  2. mmap()
  3. Using byte array/BLOB fields inside of a database engine

For read-only operations, the choice does not matter.

For write operations, however, the programmer needs to take some care:

Direct file I/O

This is dangerous, because if the system crashes while syncing data to disk, the structure can end up in a corrupt intermediate state. To atomically update the contents of a file on Linux, a very specific sequence of steps is required. If not followed exactly, data could be at risk during a crash or power-out event.

The steps are as follows:

  1. create a new temporary file (on the same file system!)
  2. write data to the temporary file
  3. fsync() the temporary file
  4. rename the temporary file to the appropriate name (using the rename() syscall)
  5. fsync() the containing directory

For more information, see: https://lwn.net/Articles/457667
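
A condensed sketch of these five steps (error handling abbreviated; assumes the temporary file lives in the current directory, on the same file system as the target):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int write_atomic(const char *path, const char *tmp, const void *buf, size_t len)
    {
        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);            /* step 1 */
        if (fd < 0) return -1;
        if (write(fd, buf, len) != (ssize_t)len) { close(fd); return -1; } /* step 2 */
        if (fsync(fd) != 0) { close(fd); return -1; }                      /* step 3 */
        close(fd);
        if (rename(tmp, path) != 0) return -1;                             /* step 4 */
        int dir = open(".", O_RDONLY);                                     /* step 5 */
        if (dir < 0) return -1;
        fsync(dir);
        close(dir);
        return 0;
    }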

mmap()

Write-maps do not have built-in atomicity. Even when calling msync() manually, you rely on the OS to sync changes to disk. An unexpected crash could leave your data in a corrupted and permanently irrecoverable state. If you choose this option, you must manually implement a mechanism to ensure changes are atomic.

Using byte array/BLOB fields inside of a database engine

This is the most reliable option. The database of your choice likely already implements atomic transactions out-of-the-box. Lite³ structures should be stored as a byte array, BLOB or equivalent type. Some database engines support returning a 'view' object into the byte buffer, allowing partial access through fixed chunks (~4-64 kB), but in most cases it is best to simply load the entire structure into memory.

Future work

Future formats building on this work can improve on it in a number of ways. Some ideas include:

  • Observing that leaf nodes allocate space for child pointers despite never having any children.
  • Reducing message size further by observing that many JSON schemas contain repeated keys/patterns. For this, inspiration can be drawn from deduplication/backreferencing techniques employed by the 'Smile' binary format.
  • Reducing branching pressure on the CPU during node traversal and key comparisons (SIMD?).
  • Better hiding latency via prefetching and/or batching overlapping lookups.
  • Improving performance limits related to 'type blindness' when converting between Lite³ and JSON.
  • Implementing conversion between Lite³ and JSON without memory allocations.
  • Employing varint encoding for length prefixes of string/bytes types.
  • Implementing arbitrary B-tree deletions with proper rebalancing logic and defragmentation for insertion of new values.
  • Enabling compatibility for 32-bit CPUs.
  • Adding functions for extraction and insertion of nested objects/arrays.

Worldwide Impact

As of writing, JSON remains the global standard for data serialization. Reasons include ease of use, human readability and interoperability.

Though it comes with one primary drawback: performance. When deploying services at scale using JSON, parsing/serialization can become a serious bottleneck.

The need for performance is ever-present in today's world of large-scale digital infrastructure. For parties involved, cloud and electricity costs are significant factors which cannot be ignored. Based on a report by the IEA, data centres in 2024 used 415 terawatt hours (TWh) or about 1.5% of global electricity consumption. This is expected to double and reach 945 TWh by 2030.

Building systems that scale to millions of users requires being mindful of cloud costs. According to a paper from 2021, protobuf operations constitute 9.6% of fleet-wide CPU cycles in Google’s infrastructure. Microservices at Meta (Facebook) also spend between 4-13% of CPU cycles on (de)serialization alone. Similar case studies of Atlassian and LinkedIn show the need to step away from JSON for performance reasons.

JSON is truly widespread and ubiquitous. If we estimate that inefficient communication formats account for 1-2% of datacenter infrastructure, this amounts to several TWh annually; comparable to the energy consumption of a small country like Latvia (7.17 TWh in 2023) or Albania (8.09 TWh in 2023). True figures are hard to obtain, but for a comprehensive picture, all devices outside datacenters must also be considered. Not just big tech, but also hardware devices, IoT and a myriad of other applications across different sectors have spawned a variety of 'application-specific' binary formats to answer the performance question.

But many binary formats are domain-specific, or they require rigid schema definitions, typically written in some IDL and required by both sender and receiver. Both must be in sync at all times to avoid communication errors. And if the schema must be changed (so-called 'schema evolution'), preserving backwards compatibility is often a complex and fragile task. This, combined with lacking integration in web browsers, means many developers avoid binary formats despite the performance benefits.

Purely schemaless* formats are simply easier to work with, a fact evidenced by the popularity of JSON. For systems talking to each other, fragmented communications and lack of standards become problematic, especially when conversion steps are required between different formats. In many cases, systems still fall back to JSON for interoperability.

Despite being schemaless, Lite³ directly competes with the performance of binary formats.

Final Statement

Systems are rarely fully isolated. In the modern age of microservices, IoT, and the internet, distributed systems are the common case. As such, the convergence of the industry on a small number of data formats is completely expected. Standardization enables interoperability and compatibility, such that new technologies can attach themselves to existing ones without barriers. The alternative is starving in isolation while suffering from 'data impedance mismatch'. New ideas might be created, but if incompatible, they do not get used and their fate is sealed.

This problem is still very visible in the context of full stack applications:

  1. The backend uses a relational database with table formats
  2. The frontend communicates in a dynamic, semi-structured text format

Translation between the two manifests itself in various ORM layers, ETL (Extract, Transform, Load) pipelines and other transformations that introduce friction, complexity, brittleness and overall fragility in the system as a whole. Schema evolution can break down at the ABI layer before it even reaches the application layer.

Fundamentally, these problems stem from a view of seeing systems in isolation. Every programming language and framework has its own opinionated view of how data should be modeled, represented etc. Sometimes, this view might be justified. Many times, it complicates extensibility and interoperability. In a distributed world, one needs to think about the data source and destination. Standardized data formats have contributed immensely to innovation, enabling universal APIs, storage formats and more.

Yet as of today, no data format contains all of the following properties:

  1. universal portability: no ties to any particular platform, framework or programming language
  2. representational power: the ability to compose and represent any complex real world data
  3. accessibility: the technology is approachable, pleasant to work with and does not rely on obscure or esoteric knowledge
  4. performance: the technology can be used in demanding and resource-constrained environments

The combination of all four could unify so many applications that a new standard is able to arise.

The Lite³ name

The name Lite³ was chosen since it is lighter than Lite².

TIP: To type ³ on your keyboard on Linux, hold Ctrl+Shift+U, then type 00B3. On Windows, use Alt+(numpad)0179.