Lite³
A JSON-Compatible Zero-Copy Serialization Format
Design and Limitations

Introduction

Lite³ is a zero-copy binary serialization format that encodes data as a B-tree inside a single contiguous buffer, allowing any field to be read or mutated in O(log n) time. Essentially, it functions as a serialized dictionary.

As a result, the serialization boundary has been broken: 'parsing' or 'serializing' in the traditional sense is no longer necessary. Lite³ structures can be read and mutated directly, much like hashmaps or binary trees, and since they live in a single contiguous buffer, they are always ready to send.

Compared to other binary formats, Lite³ is schemaless and self-describing (no IDL or schema definitions required) and supports conversion to/from JSON, enabling compatibility with existing datasets/APIs and easy debugging and inspection of messages.

Thanks to the cache-friendly properties of the B-tree and the minimalistic C implementation (9.3 kB), Lite³ outperforms the fastest JSON libraries (including those that use SIMD) by up to 120x depending on the benchmark. It also outperforms schema-only formats such as Google Flatbuffers (by up to 242x). Lite³ is possibly the fastest schemaless data format in the world.

Example to illustrate:

  1. A Lite³ message is received from a socket
  2. Without doing any parsing, the user can immediately:
    • Lookup keys and read values via zero-copy pointers
    • Insert/overwrite arbitrary key/value entries
  3. After all operations are done, the structure can be transmitted 'as-is' (no serialization required, just memcpy())
  4. The receiver then has access to all the same operations

Typically, such a scenario would require distinct 'serializing' and 'deserializing' steps. Lite³, however, blurs the line between memory and wire formats, allowing direct access, traversal and mutation of a serialized buffer.
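
The same flow as a minimal C sketch. The function names and signatures here are hypothetical stand-ins, not the actual Lite³ API; consult the repository headers for the real interface:

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical prototypes, for illustration only. */
    extern int  lite3_get_i64(const uint8_t *buf, const char *key, int64_t *out);
    extern int  lite3_set_i64(uint8_t *buf, size_t *len, size_t cap,
                              const char *key, int64_t val);
    extern void send_bytes(const uint8_t *buf, size_t len);

    void handle_message(uint8_t *buf, size_t *len, size_t cap)
    {
        int64_t count = 0;
        lite3_get_i64(buf, "count", &count);              /* zero-copy lookup, no parsing  */
        lite3_set_i64(buf, len, cap, "count", count + 1); /* in-place insert/overwrite     */
        send_bytes(buf, *len);                            /* already serialized: just send */
    }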

Original idea

The original idea came from a paper published in December 2024 as Lite²:

Tianyi Chen †, Xiaotong Guan †, Shi Shuai †, Cuiting Huang † and Michal Aibin †
(2024).
Lite²: A Schemaless Zero-Copy Serialization Format
https://doi.org/10.3390/computers13040089

The paper describes a serialization format where all entries are organized as key-value pairs inside a B-tree. The authors drew their inspiration from SQL databases: arbitrary keys can be inserted, making the format schemaless, and a key lookup can be performed without loading the entire database into memory, making it zero-copy.

They theorized that it would be possible to remove all the overhead associated with a full-fledged database system, such that it would be lightweight enough to be used as a serialization format. They chose the name Lite² since their format is lighter than SQLite.
Despite showing benchmarks, the paper authors did not include code artifacts.

Improvements

The Lite³ project is an independent interpretation and implementation, with no affiliations or connections to the authors of the original Lite² paper.

Lite³ adds several improvements to the format as described in the paper:

  1. The paper describes a B-tree algorithm as commonly seen in literature. This kind of algorithm travels down from the root node to find a leaf, then if a node split occurs, it cascades updates all the way up to the root node.
    The Lite³ implementation uses a more efficient 'one-pass' algorithm where all node updates happen during traversal, not after. This eliminates backtracking and cuts the number of traveled nodes in half.
  2. The paper uses tree node values of 3 bytes. While this allows you to store slightly more values inside of a node, this is an awkward size for a computer. Lite³ uses 4-byte values, matching native instruction sizes of many architectures. This also raises the maximum message size from 16MiB to 4GiB and reduces the chance of hash collisions by a factor of 16.
  3. Fields have been downsized and outright removed, such that the resulting message is more compact.
    For example:
    • The entire 'header' section has been removed, saving 15 bytes on every message.
    • The one-pass algorithm has removed the need for parent pointers ('PARENT_OFFSET' & 'PARENT_IDX') saving 4 bytes per node.
    • 'CTRL' and 'LEN' fields in entries have been removed, saving 4 bytes per entry.
    • The 4-byte key size tag ('KEY_LEN') is made variable length, such that it takes up 1 byte for most small key sizes (< 64 bytes).
    • The 3-byte 'VAL_LEN' tag is replaced by a 1-byte type tag.
  4. An entirely new type system has been added, supporting NULL, BOOL, I64, F64, BYTES, STRING, and nested types OBJECT & ARRAY. These types are implemented as tagged unions.
  5. The type system has enabled the implementation of full JSON interoperability, such that it is now possible to convert any JSON message to the Lite³ format and vice versa. For this, the yyjson library is used.

The idea of taking the SQLite format and applying it to a serialization context is genuinely innovative. Yet the original Lite² publication did not include any reference implementation or code artifacts, despite showing benchmarks of their format being competitive with industry-standard formats like JSON and Protobuf.

To the writer's knowledge, Lite³ is the first project to publish an implementation of this idea, in the form of a publicly consumable C library, hosted on GitHub and licensed under the MIT license.

Besides publication, the most significant contribution is the addition of JSON compatibility, which adds a great amount of usability and interoperability with existing systems.

Why B-tree?

Lite³ internally uses a B-tree data structure. B-trees are well-known, performant data structures, as evidenced by the many popular databases using them (SQLite, MySQL, PostgreSQL, MongoDB & DynamoDB). The reason for their popularity is that they allow dynamic insertions/deletions while keeping the structure balanced and always guaranteeing O(log n) lookups.

However, B-trees are rarely found in memory or serialization contexts. Many programmers associate B-trees exclusively with disks; for in-memory data structures, common wisdom opts for hashmaps or classic binary trees, and B-trees are seen as too 'heavyweight'. But the same approach works in memory, provided a much more compact representation is used, one specifically adapted for fast memory operations.

Databases typically use 4 kB to store a node in the tree, matching disk page sizes. In memory, however, the relevant unit is the cache line, not the disk page. Therefore the node size in Lite³ is set to a (configurable) 96 bytes, or 1.5 cache lines, by default. Literature suggests that modern machines have nearly identical latency for 64-byte and 256-byte memory accesses, though larger nodes also increase message size.

CPU performance in the current age is all about access patterns and cache friendliness. It may surprise some that, as a result, in-memory B-trees actually outperform classic binary trees or red-black trees, at least in theory. The reason is better spatial locality and higher 'fanout'. A binary or red-black tree requires log₂(N) lookups, where N is the total number of elements. For a B-tree, the fanout is not limited to 2 but depends on configuration. The number of required lookups is therefore bounded between logₘ(N) and logₜ(N), where t is the minimum and m the maximum number of children per node. Lite³ uses a default configuration of t = 4, m = 8, configurable up to t = 32, m = 64. B-trees require more key comparisons per node, but with memory latencies lagging ever further behind processing speeds, the number of lookups required (tree height) is the dominant performance factor. The access pattern is also much friendlier to the CPU than that of heavy random-access offenders (skip lists, LSM-trees etc.). This is why B-trees are still preferred in databases for maximum read performance.
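
As a worked example: for N = 1,000,000 entries, a binary tree needs around log₂(10⁶) ≈ 20 node visits, while a B-tree with t = 4 needs at most log₄(10⁶) ≈ 10, and one with t = 32 only ≈ 4, each visit being a potential cache miss.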

An algorithmic observer will note that B-trees have logarithmic time complexity, versus the constant-time lookups of hashmaps. But typical serialized messages are small (70% of JSON messages are < 10 kB) and only read once, so the overhead of other operations dominates, except with very frequent updates on large structures (larger than the last-level cache). More importantly, using Lite³ structures completely eliminates an O(n) parsing and serialization step.

Despite these advantages, B-trees are less common in general-purpose use because the algorithms are more complex to implement and hashmaps are perceived to perform better for simple key-value workloads. Hashmaps provide 'unbeatable' theoretical lookup times of O(1) vs B-trees O(log n). However, this fact glosses over some of the downsides of hashmaps, namely:

  1. Hashmaps do not (efficiently) support range queries. Since the keys are stored in pseudorandom order, the access pattern to support a range query is also highly scattered and random, potentially triggering many cache misses. Alternatively, a full O(N) scan is required.
    B-trees, on the other hand, store keys in sorted order. Any ordered access pattern consists of finding the start of the range, then traversing nodes in-order. This form of access is sequential and cache-friendly, potentially providing superlinear improvements over hashmaps.
  2. Hashmaps have unpredictable latency. A hashmap can never provide hard latency guarantees. As the load factor of a hashmap approaches 1, performance degrades and latency increases from O(1) up to a theoretical maximum of O(N). This is why most hashmaps are configured to trigger a 'rehash event' when the load factor exceeds some threshold (e.g. 0.85). During such an event, all the keys inside the map must be redistributed, causing a very large spike in latency. Realtime applications such as control systems and finance may have no tolerance for unpredictable latency. Unlike hashmaps, B-trees are able to provide hard latency guarantees, as the number of lookups required never exceeds logₜ(N).
  3. Hashmaps cannot be stored contiguously. For example, a map with load factor 0.75 leaves 25% of allocated memory unused. This may be relevant for resource-constrained systems or very large in-memory key-value stores (e.g. Redis). More importantly, the mapping of keys to addresses necessarily means that the data structure partially consists of 'empty buckets', the amount depending on the load factor. Hashmaps can therefore never be sent directly over a network without wasting bandwidth. The B-tree, on the other hand, knows no such constraints and can be stored fully contiguously inside a single buffer without any unused space.

Byte Layout

In Lite³, all tree nodes and data entries are serialized inside a single byte buffer. Tree hierarchies are established using (32-bit) relative pointers.

Relative pointers have a few advantages:

  1. They remain stable even if the message is copied to a different absolute address.
  2. They are more compact (4 bytes vs. a full 8-byte pointer on 64-bit systems).
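
To illustrate, resolving a relative pointer is a single addition to the buffer base. A minimal sketch (not the actual Lite³ code):

    #include <stdint.h>

    /* A child pointer is stored as a 32-bit offset from the start of the
     * buffer, so it stays valid no matter which absolute address the
     * buffer is loaded at. */
    static inline void *resolve(uint8_t *buf, uint32_t rel_off)
    {
        return buf + rel_off;
    }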

All parent nodes point to their child nodes:

|------|--------|----------|----------|--------|----------|--------
| ROOT | #DATA# | CHILD #1 | CHILD #2 | #DATA# | CHILD #3 | #DATA# ...
|------|--------|----------|----------|--------|----------|--------
   |              ^          ^                   ^
   |______________|          |                   |
   |_________________________|                   |
   |_____________________________________________|

The buffer pointer always points to the start of the buffer (zero-index). The root node is always stored here and this is where traversal of the B-tree begins.

New B-tree nodes and key-value pairs containing user values are simply appended to the end of the buffer. The algorithm ensures that new values can be added gracefully, updating the index to accommodate the new data. Throughout all insertions and modifications, the B-tree remains balanced and guarantees O(log n) access to any internal field.

B-tree nodes consist of key hashes stored alongside pointers to key-value pairs (data entries), as well as child node pointers:

NODE
|-----------------------------------------------------------------------------|
| h(key1) h(key2) .. h(keyN) | *kv1 *kv2 .. *kvN | *child1 *child2 .. *childN  |
|-----------------------------------------------------------------------------|
        |                            |                        |
   hashed keys                kv-pair (relative)       child node pointers
   used for key comparison    pointers pointing        used for tree traversal
                              to data entries

Object keys (think JSON) are hashed to a 4-byte digest and stored inside B-tree nodes, referring to a specific key-value pair. Thanks to a C macro, user applications can calculate hash digests at compile time to remove the runtime overhead.

As a result of storing hash digests instead of strings, tree traversal inside the critical path can be satisfied entirely using fixed 4-byte word comparisons, never actually requiring string comparisons except for detection of hash collisions. This design choice alone contributes to much of the runtime performance of Lite³.
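
One plausible C layout satisfying these constraints is sketched below. The field names and packing are assumptions for illustration, not the actual Lite³ definitions. Note how, for the default t = 4 / m = 8 configuration, 7 hashes + 7 kv pointers + 8 child pointers at 4 bytes each, plus 8 bytes of bookkeeping, land exactly on the 96-byte node size:

    #include <stdint.h>

    /* Hypothetical node layout: up to 7 entries and 8 children per node.
     * 7*4 + 7*4 + 8*4 = 88 bytes of payload + 8 bytes of bookkeeping
     * = 96 bytes, i.e. 1.5 cache lines. */
    typedef struct {
        uint16_t nkeys;          /* number of entries currently in this node */
        uint16_t flags;          /* e.g. leaf marker                         */
        uint32_t reserved;
        uint32_t hash[7];        /* DJB2 digests, kept in sorted order       */
        uint32_t kv_off[7];      /* relative pointers to key-value pairs     */
        uint32_t child_off[8];   /* relative pointers to child nodes         */
    } lite3_node_sketch;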

Key-value pairs containing user data are stored anywhere between nodes, or at the end of the buffer:

KEY-VALUE PAIR
|--------|-----------|-------------------------|
|  type  |    key    |          value          |
|--------|-----------|-------------------------|
    |          |
  1 byte    string
 type tag     key

Lite³ uses a 'tagged union' approach, where each value is prepended by a 1-byte type tag. This enables the schemaless design, and allows applications to traverse and discover the structure of arbitrary messages, even if the structure is unknown. This is one of the main benefits of being schemaless, as there is no schema definition or IDL required to read Lite³ data.

The key string is prefixed by a variable-length key length tag. This tag occupies 1 byte for key lengths <= 63 (the most common case). For longer keys, it can expand up to 4 bytes.

The actual value is sized according to its type. Some values, like LITE3_TYPE_NULL, can be stored entirely in 1 byte. LITE3_TYPE_I64 and LITE3_TYPE_F64 occupy 8 bytes. Variable-length types like LITE3_TYPE_STRING and LITE3_TYPE_BYTES use an additional 4-byte length prefix.

Arrays also store values with prefixed type tags, though without keys, since array indexes act as 'implicit keys'.

Over time, deleted entries cause the contiguous byte buffer to accumulate 'holes'. For a serialization format, this is undesirable as data should be compact to occupy less bandwidth. Lite³ will use a defragmentation system to lazily insert new values into the gaps whenever possible. This way, fragmentation is kept under control without excessive data copying or tree rebuilding (STILL WIP, NOT YET IMPLEMENTED).

NOTE: By default, deleted values are overwritten with NULL bytes (0x00). This is a safety feature: not doing so would leave 'deleted' entries intact inside the data structure until they are overwritten by other values. If the user wishes to maximize performance at the cost of leaking deleted data, LITE3_ZERO_MEM_DELETED should be disabled.

JSON comparison

JSON is a text format. This makes it very convenient for humans: we can read it in any text editor, understand it and modify it if necessary.

Unfortunately, a computer cannot directly work with numbers in text form. It can display them, sure. But to do useful calculations like addition or multiplication, it must convert them to binary, because processors can only operate on numbers in native word sizes (8-bit, 16-bit, 32-bit, 64-bit). This conversion from text to binary is done through parsing. It happens inside your browser when you visit a website, and all around the world, millions or billions of times per second, by computers communicating with each other through all kinds of protocols and APIs.

There are broadly 3 strategies or 'ways' to parse JSON:

  1. DOM-based approach: The Document Object Model (DOM) is a memory representation of a document as a logical tree, of which the contents can be programmatically accessed and modified. When receiving a JSON-message, it is parsed into a dictionary object or DOM-tree. This structure acts as the 'intermediate form' to support mutations and lookups. This approach is dominant, because for programmers it is by far the easiest. Yes, first all the text must be converted into the tree-model, but this can be done automatically by the computer, and once it is done, all of the document's data is directly available and mutable. This approach is used by virtually every JSON library and even by your browser to represent HTML web pages. To send data, the tree must be serialized again into a JSON string.
  2. Streaming / event-based approach (SAX): This approach consists of a pointer iterating over the document. The text is processed from the beginning, and each newly encountered component is an event. For example, the beginning of a string, the beginning of an array, the end of an array etc. The programmer may provide functions corresponding to each event, such as executing foo(str) every time a string is encountered. This approach allows for capturing only the information that the program needs, ignoring everything else. While more efficient, it is also very inconvenient, because you might discover your program 'missed' an event it now wants to read and has to go back and restart the iterator. The programmer must constantly keep track of their logical position within the document, which can lead to complexity and bugs. Therefore, this approach is not very popular.
  3. On-Demand approach: A relatively new technique employed by the simdjson C++ library. It resembles a DOM-based approach, except that the parser is 'lazy' and will not parse anything until the very last moment when you actually try to read a value. Unfortunately, some restrictions leak into the abstractions of the user API. For example, values can only be consumed once and must be stored separately if you plan to use them multiple times. Therefore it more closely resembles a streaming interface and users might still prefer the DOM-approach for practical reasons. Additionally, the performance of JSON On-Demand is sensitive to the ordering of keys within objects. Many JSON serializers do not preserve or guarantee key ordering in any way. Therefore, it could introduce unexpected variance in read latency. Simdjson also has another major restriction: parsing is read-only. The library does not allow you to modify the JSON document and write it back out.

Despite the DOM-based approach being easiest for the programmer, for the computer it represents a non-trivial amount of work. The text must be parsed to find commas, brackets, decimals etc. A block of memory must be allocated and then the tree must be built. The same data is now duplicated and stored in two different representations: as DOM-tree and string. All these operations add memory overhead, runtime cost and increased latency.

So we spend a lot of time constructing and serializing (separate) DOM-trees. Wouldn't it be great if we could, say, encode the tree directly inside the message?

That is exactly what we do. In Lite³, the DOM-tree index is embedded directly inside the structure of the message itself. The underlying datastructure is a B-tree. As a result, lookups, inserts and deletions (get(), set() & delete()) can be performed directly on the serialized form. Encoding to a separate representation becomes unnecessary. Since it is already serialized, it can be sent over the wire at any time. This works because Lite³ structures are always contiguous in memory. Think of the 'dictionary itself' being sent.
We can perform zero-copy lookups, meaning that we do not need to process the entire message, just the field we are interested in. Similarly, we can insert and delete data only by reading the index and nothing else.

Another side effect of having a fully functioning B-tree inside a serialized message is that data can be mutated directly in serialized form. Tree rebalancing, key inserts etc. all happen inside the message data. Typically, to mutate serialized data en route you would have to:

    Receive JSON string -> Parse -> Mutate -> Stringify -> Send JSON string

With Lite³, this simply becomes:

    Receive Lite³ message -> Mutate -> Send Lite³ message

Many internal datacenter communications consist of patterns where messages are received, a few things are modified, and the result is sent on to other services. In such patterns, the design of Lite³ shines because the reserialization overhead is entirely avoided. It is possible to insert entirely new keys and values, or to override existing values, while completely avoiding reserialization.

With JSON, if you change just a single field inside a 1 MB document, you typically cannot avoid reserializing the entire 1 MB document. But with Lite³, you call an API function to traverse the message, find the field and change it in-place. Such an operation can be done in tens of nanoseconds on modern hardware given the data is present in cache.

Lite³ also implements a BYTES type for native encoding of raw bytes. JSON however does not support this. When converting from Lite³ to JSON, bytes are automatically converted to a base64-encoded string.

Other Serialization Formats

'Binary JSON' formats

A number of formats advertise themselves as 'binary JSON'. Instead of brackets and commas, they typically use a system of tags (TLV: type-length-value) to encode types and values. Being binary also means they store numbers natively, avoiding the parsing of ASCII floating-point decimals, which is a known performance problem for text formats.

The notable contenders are:

  • BSON: or 'binary JSON', known as the underlying document format of MongoDB. It supports various native types such as Date objects.
  • Amazon Ion: A JSON-superset format used by Amazon with a richer type system, similar to BSON.
  • MsgPack: Very fast and small format. Can achieve good performance depending on implementation.
  • CBOR: Inspired by MessagePack but with IETF standardization and extended types.
  • Smile: A format that implements back-referencing (deduplication) to achieve smaller message sizes compared to JSON. Used in ElasticSearch.
  • UBJSON: Aims to achieve the generality of JSON while simplifying parser implementations.

While these formats store numbers as binary, they do not solve many other problems. A separate memory representation such as a DOM-tree is still required to conveniently read data and support meaningful mutations.

Fundamentally, this flaw arises out of the fact that all data is stored contiguously without much structure. To find an element inside, typically an O(n) linear search is required. This is particularly problematic for random access on large element counts. Additionally, the contiguous nature means that a change to an internal element could require the entire document to be rewritten.

In contrast, Lite³ is a zero-copy format storing all internal elements within a B-tree structure, guaranteeing O(log n) amortized time complexity for access and modification of any internal value, even inside arrays. The modification of an internal element will also never trigger a rewrite of the document. Only the target element might require reallocation and updating of the corresponding reference. Even throughout modifications, zero-copy access is maintained. The API exposed on Lite³ data is almost identical to a dictionary, except that Lite³ does not require any 'parsing' or transformations since the data format (B-tree) already supports it directly.

Other 'binary JSON' formats may therefore be interesting from a perspective of compactness or rich typing. From the standpoint of encode/decode performance, however, they exist in a lower category.

Schema-only Binary Formats

By making a number of assumptions about the structure of a serialized message, it is possible to (greatly) accelerate the process of encoding/decoding to/from the serialized form. This is where so-called 'binary formats' come in.

But if binary formats exist and are much faster, why is everyone still using JSON?

The answer lies in the fact that most binary formats require a schema file written in an IDL; basically an instruction manual for how to read a message. A binary message doesn't actually tell you anything. It is literally just a bunch of bytes. Only with the schema does it acquire meaning. Note that 'binary' does not necessarily mean schema-only, though in practice this is often implied.

When sending messages between systems you control, you can create your own schemas. But communicating with other people's servers? Now you need to use their schemas as well. And if you want to communicate with the whole world? Well you better start collecting schemas. Relying on out-of-band information eventually takes its toll. Imagine needing an instruction manual for every person you wanted to talk to. Crazy right?

Because of these restrictions, schema-only formats reside in their own special category, notably distinct from schemaless and self-describing formats like JSON, which can be directly read and interpreted without the requirement of extra outside information.

That said, these formats come in 3 primary forms:

  1. Static code generation: This is the approach taken by Protocol Buffers. Schema definition files (.proto) are compiled using the external protoc Protobuf compiler into encoding/decoding source code (C++, Go, Java) for each type of message. This is also the approach taken by Rust's "serde" crate, although the compilation is performed by the Rust compiler, not an external tool.
  2. Compile-time reflection: With 'reflection' (a.k.a. introspection), the compiler inspects the code's own structure during compilation to generate the serialization logic, avoiding the need for a separate schema file. This approach is used by C++ libraries such as Glaze and cpp-reflect. Furthermore, C++ is getting reflection in C++26. However, these libraries only support serialization to/from C++ classes. There is no cross-language support.
  3. Runtime reflection: In C#, the JsonSerializer collects metadata at run time by using reflection. Whenever it has to serialize or deserialize a type for the first time, it collects and caches this metadata. This approach resembles a JIT-like way to accelerate 'learned' message schemas. But the downside is that the metadata collection process takes time and uses memory.
    Other environments such as the V8 JavaScript engine employ a 'fast path' for JSON serialization if the message conforms to specific rules. Otherwise, the traditional 'slow path' must be taken.

Schema-only formats tend to be brittle and require simultaneous end-to-end upgrades to handle change. Although backwards-compatible evolution is possible, it requires recompilation and synchronization of IDL specifications. But updating all clients and servers simultaneously can be challenging. Major changes like renaming, removing fields or changing types can lead to silent data loss or incompatibility if not handled correctly. In some cases it is better to define a new API endpoint and deprecate the old one.

Here is a section taken from the Simple Binary Encoding's "Design Principles" page:

Backwards Compatibility
In a large enterprise, or across enterprises, it is not always possible to upgrade all systems at the same time. For communication to continue working the message formats have to be backwards compatible, i.e. an older system should be able to read a newer version of the same message and vice versa.

An extension mechanism is designed into SBE which allows for the introduction of new optional fields within a message that the new systems can use while the older systems ignore them until upgrade. If new mandatory fields are required or a fundamental structural change is required then a new message type must be employed because it is no longer a semantic extension of an existing message type.

For self-describing formats like Lite³ and JSON, adding and removing fields is easy. It is only positional formats that have this problem. More precisely:

  • Positional formats: Require backwards compatibility on the ABI layer + application layer.
  • Self-describing formats: Require backwards compatibility only on the application layer.

Zero-Copy Formats

There sometimes exists ambiguity and confusion around the term 'zero-copy'. What does it mean for a data format to be 'zero-copy', or any system in general?

In most contexts, zero-copy refers to a method of accessing and using data without physically moving or duplicating it from its original location. The guiding principle being that compute resources should be spent on real work or calculations, not wasting CPU cycles on unnecessarily relocating and moving bytes around. Applications, operating systems and programming languages may all use zero-copy techniques in various forms, under various names:

  • C-like languages: Taking pointers into buffers instead of passing around entire values
  • Linux kernel: splice() and sendfile() system calls
  • Rust: &[u8] slices, std::io::Cursor and bytes::Bytes
  • C++: std::string_view and std::span
  • Python: memoryview object
  • Java: java.nio.ByteBuffer and slice()
  • C#: Span<T>, Memory<T> and ReadOnlySpan<T>
  • Go: Slices (e.g., mySlice := myArray[1:4])
  • JavaScript / Node.js: Buffer.subarray(), TypedArray views on an ArrayBuffer

Moving around data is a real cost. In the best case, performance degrades linearly with the size of the data being moved. In reality, memory allocations, cache misses, garbage collection and other overhead mean that these costs can multiply non-linearly.

When we talk about 'zero-copy serialization formats', the format should support reading values from their original location, i.e. directly from the serialized message. If a format requires that its contents are first transformed into an alternative memory representation (i.e. a DOM-tree), then it does not classify as zero-copy.

In some cases, a format may support 'zero-copy' references to string or byte objects. However, the ability to access some member field by reference does not immediately make a format 'zero-copy'. Instead, every member field must be accessible without requiring any parsing or transformation step on a received message.

The 4 most notable existing zero-copy formats are:

  1. Flatbuffers (by Google): Mainly used in game development or performance-heavy applications. One caveat is that while decoding Flatbuffers is very fast, encoding is much slower. In fact, modern JSON libraries can beat it at encoding. Also, Flatbuffers has been criticized for having an awkward API. This is because logically, the data needs to be encoded inside-out: all children/leaf sizes must be known before the parent can be encoded.
  2. Flexbuffers (by Google): The schemaless version of Flatbuffers, made by the same author. It cannot match the instant decoding speed of Flatbuffers, but being schemaless it requires no schema definitions. It never gained much popularity or adoption.
  3. Cap'n Proto: Similar to Flatbuffers, but using a more standard builder API. Despite having good performance, it sees limited use in real applications and its author is only semi-actively maintaining it. Another downside is that optional fields take space over the wire, even when unused.
  4. SBE (Simple Binary Encoding): Mainly used in low-latency financial applications (realtime trades & price data). The format was designed specifically for this purpose and resembles little more than a struct sent over a network. It requires schema definitions written in XML.

Note that all of these formats except Flexbuffers require rigid, pre-defined schema compiled into your application. Also, none of these formats support arbitrary mutation of serialized data. If a single field must be changed, then the entire message must be re-serialized. Only the latter 2 formats support trivial in-place mutation of fixed-sized values.

Compression

While it may seem wasteful to store metadata for an entire B-tree inside a serialized message, remember that this B-tree is an unusually compact variant, where a single node only takes up 96 bytes for a fanout of up to 8 (vs. binary tree fanout 2). Doing this completely eliminates an otherwise required parsing/serializing step and reduces it to a single memcpy(). In effect, a tradeoff is made in favour of CPU cycles over bandwidth. To give an idea of size: in the simdjson benchmark using real Twitter API data, the same data stored in Lite³ format occupied ~40% more space than JSON. This is a dataset containing much nesting and many Unicode strings.

While the Lite³ library does not directly include compression features, it employs various compacting techniques such as bitfields and varint key-length prefixes. For extra size reduction, it is possible to manually layer compression over Lite³, though this loses the zero-copy capability. On the flip side, compression is easy to parallelize on a separate thread, so in practice it may prove beneficial. Sliding-window codecs are especially good at eliminating redundancy across the entire dataset, which would be difficult to achieve within the format itself. Furthermore, codecs can be manually tuned for specific size/performance trade-offs depending on project requirements. When deploying compression, it is desirable to compress on a stream basis rather than a per-message basis, as the streaming approach eliminates redundancy across an entire session rather than mere individual messages.

Before using compression, make sure both LITE3_ZERO_MEM_DELETED and LITE3_ZERO_MEM_EXTRA are defined inside the lite3.h header, or alternatively passed as the compiler flags -DLITE3_ZERO_MEM_DELETED and -DLITE3_ZERO_MEM_EXTRA.
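
For illustration, a per-message wrapper using the zstd library could look like the sketch below. This helper is layered on top of Lite³ and is not part of the library:

    #include <stdlib.h>
    #include <zstd.h>

    /* Compress a finished Lite³ buffer before putting it on the wire.
     * Returns the compressed size, or 0 on error. Caller frees *out. */
    size_t compress_message(const void *msg, size_t msg_len, void **out)
    {
        size_t cap = ZSTD_compressBound(msg_len);
        *out = malloc(cap);
        if (*out == NULL) return 0;
        size_t n = ZSTD_compress(*out, cap, msg, msg_len, 3 /* level */);
        if (ZSTD_isError(n)) { free(*out); *out = NULL; return 0; }
        return n;
    }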

A size benchmark to assess compression ratios for different codecs is on the TODO list.

Canonicalization

Some applications require canonical encoding, meaning there should exist only one valid encoding of any given message. This property is important for cryptographic applications such as:

  • using a message as a key inside a hash table
  • taking a fingerprint or checksum of a message
  • comparing serialized payloads for equality

A format can be canonical on two levels:

  1. by specification
  2. by implementation

Being canonical by specification is the strongest form of canonicalization: it follows that any correct encoder implementation should produce canonical encodings. An example of this is the ASN.1 format, specifically the DER variant.

Protobuf serialization is not (and cannot be) canonical for reasons related to handling of unknown fields.

Some formats are not canonical by default, but offer alternative specifications. An example of this is the JSON RFC 8785 standard. It provides additional rules for the ordering of keys, exact floating-point representation and whitespace/Unicode normalization.

For Lite³, there is no official standard (yet), and currently this repository is the only existing implementation. However, it is possible to achieve canonical encodings using the following encoding rules:

Lite³ canonical encoding rules:
1. `LITE3_ZERO_MEM_DELETED` and `LITE3_ZERO_MEM_EXTRA` features must be enabled
2. all numbers are encoded as little-endian
3. all object keys must be inserted in lexicographic order
4. all array elements must be inserted in ascending order
5. all nested objects/arrays must be inserted depth-first
6. all string data must be UTF-8 normalized as per [NFC / RFC 5198](https://www.rfc-editor.org/rfc/rfc5198.html)
7. IEEE float value `-0.0` must be normalized to `+0.0`
8. IEEE float with exponent bits all zero AND mantissa non-zero becomes `+0.0`
9. IEEE float value `NaN` must be normalized to `0x7FF8000000000000` (quiet `NaN`)
10. after message construction, it becomes immutable
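
A sketch of rules 7-9 in C, assuming double is an IEEE 754 binary64 (illustrative, not library code):

    #include <math.h>
    #include <stdint.h>
    #include <string.h>

    static double canon_f64(double v)
    {
        if (isnan(v)) {                        /* rule 9: canonical quiet NaN */
            uint64_t bits = 0x7FF8000000000000ULL;
            memcpy(&v, &bits, sizeof v);
            return v;
        }
        if (v == 0.0)                          /* rule 7: -0.0 -> +0.0 */
            return +0.0;
        if (fpclassify(v) == FP_SUBNORMAL)     /* rule 8: exponent all zero,
                                                  mantissa non-zero -> +0.0 */
            return +0.0;
        return v;
    }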

Hash Collisions

Problem statement

Lite³ stores key hash digests inside the B-tree nodes, instead of storing the keys directly as most databases do.

It turns out that there is an entire research field dedicated to optimizing the layout of strings inside B-tree nodes. Much of the problem space is shared with memory allocators, which also deal with laying out variable-sized elements in memory. This is not a straightforward task, and there are many possible solutions with different trade-offs.

Despite the interesting solutions and work being done in this field, we cannot afford any complex solutions. To support a fast in-memory format, we must be able to find a given key quickly; remember that key comparisons are inside the critical path for node traversal. There is no dominating disk I/O latency that we can 'hide behind'. If we allow ourselves to compare against fixed-sized hashes of the keys, then the entire problem of dealing with variable-sized elements disappears. Variable string comparisons (loops, branching etc.) become single compare instructions at the machine code level. This yields enormous benefits in runtime performance, such that memory latency becomes the dominating factor.

Despite achieving the stated goal of fast runtime performance, we have acquired a new problem: collisions.

For the uninitiated, 'collisions' refer to situations where two or more distinct inputs map to the same output. By the pigeonhole principle, this is an inevitable result of trying to squeeze arbitrary-size data into a fixed-size representation.

To remain JSON-compatible, Lite³ only supports string keys. Collisions manifest when the DJB2 hashes of two different keys yield the same 4-byte output. Collisions can only happen within a single 'key space'. Every object has its own key space, so a key from object A can only collide with another key in object A, never with keys from another object B, regardless of nesting hierarchies.

Collision Probability

The mathematics of the birthday problem tells us that the probability of a collision equals 1 - exp(-n*(n-1)/(2*2^x)), where n is the number of entries and x is the number of hash digest bits. With a 32-bit digest, we find that after 77,164 object entries, the probability of a collision first exceeds 50%:

    1 - exp(-77164*(77164-1)/(2*2^32)) ≈ 0.500006797931

This table illustrates the probability of at least one collision occurring, for different numbers of object entries:

Number of entries    Collision probability
10                   0.0000000104773789644
100                  0.00000115251102195
1,000                0.000116292144049
10,000               0.0115728810571
20,000               0.0454963391113
30,000               0.0994686465822
50,000               0.252508603738
75,000               0.480468302096
100,000              0.687809461339
150,000              0.927148144667
200,000              0.990501197966
1,000,000            1.000000000

However, this calculation assumes a perfectly random key distribution. Given the choice of encoding (UTF-8), hash function (DJB2) and language patterns in key names, the actual distribution is far from random. Therefore, take this as a rule-of-thumb rather than gospel.

Quadratic Probing

Without a strategy for resolving collisions, a colliding key could fail to insert, or worse, overwrite an unrelated key.

Therefore Lite³ uses a collision resolution technique called 'quadratic probing', as also used in hash maps.

The algorithm starts by attempting to insert a target key. If the hash digest of this key collides with that of another key, it will retry with digest + N * N, where N is the number of probing attempts so far. Probing keeps retrying until an empty slot is found or the LITE3_HASH_PROBE_MAX limit is reached (128 by default).

This way, collisions can be naturally resolved.
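
A sketch of the probing loop described above (slot_taken() is a hypothetical stand-in for "this digest already maps to a different key"):

    #include <stdbool.h>
    #include <stdint.h>

    #define LITE3_HASH_PROBE_MAX 128

    extern bool slot_taken(uint32_t digest);   /* hypothetical helper */

    /* Returns true and writes the first free digest to *out, or false if
     * the probe limit is exhausted. The arithmetic wraps mod 2^32. */
    static bool probe_digest(uint32_t digest, uint32_t *out)
    {
        for (uint32_t n = 0; n < LITE3_HASH_PROBE_MAX; n++) {
            uint32_t candidate = digest + n * n;
            if (!slot_taken(candidate)) {
                *out = candidate;
                return true;
            }
        }
        return false;
    }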

Hash Function

The function used for hashing of object keys is DJB2 (Daniel J. Bernstein hash 2), for two primary reasons:

  1. DJB2 somewhat preserves the lexicographical ordering inside the hash digests. So digests for keys "item_0001", "item_0002" and "item_0003" are actually more likely to also be placed sequentially inside the B-tree nodes. This can be useful when doing a sequential scan on the semantic key names. Otherwise a lot more random access would be required.
  2. DJB2 is so simple that it can be calculated entirely by the C preprocessor at compile time for string literals (using the LITE3_KEY_DATA(s) macro). This means message consumers using the C macro never need to pay the runtime cost of hashing. Such a technique would be difficult or impossible to do with other hash functions.

For now, DJB2 is working well. However more testing will be done before the choice of DJB2 is fixed inside the Lite³ format specification. It is entirely possible that a different hash function will be decided on, such as XXH32 (xxHash32).
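
For reference, the textbook DJB2 function is tiny. The exact variant and truncation used by Lite³ should be checked against the source; this is the classic additive version:

    #include <stdint.h>
    #include <stddef.h>

    /* Textbook DJB2: h = h * 33 + c, seeded with 5381, truncated here to
     * 32 bits. For string literals the same arithmetic can be unrolled in
     * a macro, letting the compiler fold the digest into a constant. */
    static uint32_t djb2_32(const char *s, size_t len)
    {
        uint32_t h = 5381;
        for (size_t i = 0; i < len; i++)
            h = (h << 5) + h + (uint8_t)s[i];  /* h * 33 + c */
        return h;
    }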

On-Disk Storage

The ability of Lite³ to store semi-structured data makes it a suitable file format for small- to medium-sized documents. Currently, there is a size limit of 4GiB due to the use of 32-bit indexes. Objects and arrays can also store a maximum of 67,108,864 (2^26) elements or entries.

However, Lite³ is not a replacement for a database. Beware that when reading/writing Lite³ structures to disk, you do not have access to database features like:

  • complex queries
  • fast disk seeks
  • multi-versioning (MVCC)
  • atomicity guarantees
  • crash recovery

If you need a high performance on disk key-value store, choose something like LMDB.
If you need complex queries and flexibility, opt for more traditional databases.

That said, the options for storing Lite³ structures on disk are:

  1. direct file I/O
  2. mmap()
  3. Using byte array/BLOB fields inside of a database engine

For read-only operations, the choice does not matter.

For write operations, however, the programmer needs to take some care:

Direct file I/O

This is dangerous, because if the system crashes while syncing data to disk, the structure can end up in a corrupt intermediate state. To atomically update the contents of a file on Linux, a very specific sequence of steps is required. If not followed exactly, data could be at risk during a crash or power-out event.

The steps are as follows:

  1. create a new temporary file (on the same file system!)
  2. write data to the temporary file
  3. fsync() the temporary file
  4. rename the temporary file to the appropriate name (using the rename() syscall)
  5. fsync() the containing directory

For more information, see: https://lwn.net/Articles/457667
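
A condensed sketch of these five steps (error handling abbreviated; assumes the temporary file lives in the current directory, on the same file system as the target):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int write_atomic(const char *path, const char *tmp, const void *buf, size_t len)
    {
        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);            /* step 1 */
        if (fd < 0) return -1;
        if (write(fd, buf, len) != (ssize_t)len) { close(fd); return -1; } /* step 2 */
        if (fsync(fd) != 0) { close(fd); return -1; }                      /* step 3 */
        close(fd);
        if (rename(tmp, path) != 0) return -1;                             /* step 4 */
        int dir = open(".", O_RDONLY);                                     /* step 5 */
        if (dir < 0) return -1;
        fsync(dir);
        close(dir);
        return 0;
    }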

mmap()

Write-maps do not have built-in atomicity. Even when calling msync() manually, you rely on the OS to sync changes to disk. An unexpected crash could leave your data in a corrupted and permanently irrecoverable state. If you choose this option, you must manually implement a mechanism to ensure changes are atomic.

Using byte array/BLOB fields inside of a database engine

This is the most reliable option. The database of your choice likely already implements atomic transactions out-of-the-box. Lite³ structures should be stored as a byte array, BLOB or equivalent type. Some database engines support returning a 'view' object into the byte buffer, allowing partial access through fixed chunks (~4-64 kB), but in most cases it is best to simply load the entire structure into memory.

Future work

Future formats building on this work can improve on it in a number of ways. Some ideas include:

  • Observing that leaf nodes allocate space for child pointers despite never having any children.
  • Reducing message size further by observing that many JSON schemas contain repeated keys/patterns. For this, inspiration can be drawn from deduplication/backreferencing techniques employed by the 'Smile' binary format.
  • Reducing branching pressure on the CPU during node traversal and key comparisons (SIMD?).
  • Better hiding latency via prefetching and/or batching overlapping lookups.
  • Improving performance limits related to 'type blindness' when converting between Lite³ and JSON.
  • Implementing conversion between Lite³ and JSON without memory allocations.
  • Employing varint encoding for length prefixes of string/bytes types.
  • Implementing arbitrary B-tree deletions with proper rebalancing logic and defragmentation for insertion of new values.
  • Enabling compatibility for 32-bit CPUs.
  • Adding functions for extraction and insertion of nested objects/arrays.

Worldwide Impact

As of writing, JSON remains the global standard for data serialization. Reasons include ease of use, human readability and interoperability.

Though it comes with one primary drawback: performance. When deploying services at scale using JSON, parsing/serialization can become a serious bottleneck.

The need for performance is ever-present in today's world of large-scale digital infrastructure. For parties involved, cloud and electricity costs are significant factors which cannot be ignored. Based on a report by the IEA, data centres in 2024 used 415 terawatt hours (TWh) or about 1.5% of global electricity consumption. This is expected to double and reach 945 TWh by 2030.

Building systems that scale to millions of users requires being mindful of cloud costs. According to a paper from 2021, protobuf operations constitute 9.6% of fleet-wide CPU cycles in Google’s infrastructure. Microservices at Meta (Facebook) also spend between 4-13% of CPU cycles on (de)serialization alone. Similar case studies of Atlassian and LinkedIn show the need to step away from JSON for performance reasons.

JSON is truly widespread and ubiquitous. If we estimate that inefficient communication formats account for 1-2% of datacenter infrastructure, this amounts to several TWh annually; comparable to the energy consumption of a small country like Latvia (7.17 TWh in 2023) or Albania (8.09 TWh in 2023). True figures are hard to obtain, but for a comprehensive picture, all devices outside datacenters must also be considered. Not just big tech, but also hardware devices, IoT and a myriad of other applications across different sectors have spawned a variety of 'application-specific' binary formats to answer the performance question.

But many binary formats are domain-specific, or they require rigid schema definitions, typically written in some IDL and required by both sender and receiver. Both must be in sync at all times to avoid communication errors. And if the schema must be changed (so-called 'schema evolution'), preserving backwards compatibility is often a complex and fragile task. This, combined with lacking integration in web browsers, means many developers avoid binary formats despite the performance benefits.

Purely schemaless* formats are simply easier to work with, a fact evidenced by the popularity of JSON. For systems talking to each other, fragmented communications and lack of standards become problematic, especially when conversion steps are required between different formats. In many cases, systems still fall back to JSON for interoperability.

Despite being schemaless, Lite³ directly competes with the performance of binary formats.

Final Statement

Systems are rarely fully isolated. In the modern age of microservices, IoT, and the internet, distributed systems are the common case. As such, the convergence of the industry on a small number of data formats is completely expected. Standardization enables interoperability and compatibility, such that new technologies can attach themselves to existing ones without barriers. The alternative is starving in isolation while suffering from 'data impedance mismatch'. New ideas might be created, but if incompatible, they do not get used and their fate is sealed.

This problem is still very visible in the context of full stack applications:

  1. The backend uses a relational database with table formats
  2. The frontend communicates in a dynamic, semi-structured text format

Translation between the two manifests itself in various ORM layers, ETL (Extract, Transform, Load) pipelines and other transformations that introduce friction, complexity, brittleness and overall fragility in the system as a whole. Schema evolution can break down at the ABI layer before it even reaches the application layer.

Fundamentally, these problems stem from a view of seeing systems in isolation. Every programming language and framework has its own opinionated view of how data should be modeled, represented etc. Sometimes, this view might be justified. Many times, it complicates extensibility and interoperability. In a distributed world, one needs to think about the data source and destination. Standardized data formats have contributed immensely to innovation, enabling universal APIs, storage formats and more.

Yet as of today, no data format contains all of the following properties:

  1. universal portability: no ties to any particular platform, framework or programming language
  2. representational power: the ability to compose and represent any complex real world data
  3. accessibility: the technology is approachable, pleasant to work with and does not rely on obscure or esoteric knowledge
  4. performance: the technology can be used in demanding and resource-constrained environments

The combination of all four could unify so many applications that a new standard is able to arise.

The Lite³ name

The name Lite³ was chosen since it is lighter than Lite².

TIP: To type ³ on your keyboard on Linux, hold Ctrl+Shift+U, then type 00B3. On Windows, use Alt+(numpad)0179.