Sia: an ultra-fast serializer in pure JavaScript
When it comes to serialization in JavaScript, the first and most obvious choice is JSON, and I must say JSON is very fast, much faster than a lot of serialization libraries I’ve tried for JavaScript. But JSON doesn’t preserve type information, it only supports a few data types, and using a reviver/replacer function to add support for custom types hurts performance badly. So what’s the solution, if performance matters?
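For context, here is roughly what adding a custom type through a replacer/reviver looks like; this is just a hedged sketch, the __type tag is a convention made up for this example, and notice the extra function call per value on both the write and the read path:

// Round-tripping a Date through JSON with a replacer/reviver pair.
const data = { createdAt: new Date(), title: "Hello" };

const json = JSON.stringify(data, function (key, value) {
  // this[key] is the original value, before JSON coerced the Date to a string
  if (this[key] instanceof Date) return { __type: "Date", value: this[key].toISOString() };
  return value;
});

const parsed = JSON.parse(json, (key, value) =>
  value && value.__type === "Date" ? new Date(value.value) : value
);

console.log(parsed.createdAt instanceof Date); // true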
Doing a little research, I found msgpackr and cbor-x, two schema-less serialization libraries developed by Kris Zyp. If you don’t know them already, they are super-fast implementations of the MessagePack and CBOR protocols written for JavaScript and Node. These libraries are very fast, and Kris has an amazing article on how he made them fast, which you can read here. Initially I planned to use one of these libraries for my project, but a few things made me decide against it.
What’s nice about JSON is that you get the same performance both in the browser and in Node. That is not the case with msgpackr and cbor-x: both libraries rely on an optional native extension for extracting strings, which doesn’t run in the browser, and without this extension the performance isn’t that great. Another reason I didn’t use these libraries is that custom type data is shared between all serializer and deserializer instances, and I needed more isolation there.
After trying many serialization libraries and implementations of various algorithms, I ended up deciding to create my own, and I must say I’m very satisfied with the results. I named it Sia, and I’m sharing it with anyone who needs a fast, pure JavaScript serializer. Below you can see a few charts comparing the performance of Sia to msgpackr, cbor-x and JSON:
For a small file, Sia, msgpackr and cbor-x are all faster than JSON, with Sia being ~18% faster than msgpackr and cbor-x, and ~66% faster than JSON. As soon as we remove the optional native extensions of msgpackr and cbor-x, the deserialization times go through the roof: without the native extensions, Sia is +116% faster than msgpackr. It’s also worth mentioning that the resulting file size for Sia is 15% smaller than the others. You can check the test data here.
On a medium file size, only Sia is faster than JSON. With this sample file, Sia is +18% faster than JSON, and ~30% faster than cbor-x. Removing the optional extensions again hurts the performance of msgpackr and cbor-x: without them, Sia is ~108% faster than cbor-x! Sia’s file size is 20% smaller than cbor-x and msgpackr. You can check the test data here.
On a huge file, Sia is +71% faster than JSON, ~28% faster than msgpackr, and ~135% faster than msgpackr without the native extension! It’s amazing to see a pure JavaScript library beat a native extension in performance! You can check the test data here. On smaller file sizes, all libraries have more or less the same performance, which is why I didn’t include a chart for that.
For custom data types, Sia is extremely fast. As you can see in the chart above, it’s ~1234% faster than JSON, +86.5% faster than cbor-x, and +336% faster than msgpackr! So how can a pure JavaScript serialization library beat libraries backed by native C++ extensions? Well, lots of optimizations! You can check out the project on GitHub, or continue reading to learn more about these optimizations.
Optimizations
There are lots of bottlenecks when writing a serializer for JavaScript. Type checking is heavy, object and array creation is time consuming, but the most expensive operation of all is working with strings: they’re expensive to read and expensive to write. Even in Node, the native functions such as Buffer.toString and Buffer.write, or the TextEncoder and TextDecoder APIs, can be really slow for smaller strings, because there is an overhead in passing objects from JavaScript to C++. So what is the solution?
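If you want to see that crossing overhead for yourself, here’s an illustrative micro-benchmark (the exact numbers vary by engine and Node version) that decodes a short ASCII string a million times with TextDecoder and with a plain JavaScript loop:

const bytes = Buffer.from("hello world"); // 11 bytes, ASCII only
const decoder = new TextDecoder();

// Pure-JS decode for short ASCII-only buffers: no JavaScript-to-native crossing.
const decodeAscii = (buf) => {
  let str = "";
  for (let i = 0; i < buf.length; i++) str += String.fromCharCode(buf[i]);
  return str;
};

console.time("TextDecoder x1e6");
for (let i = 0; i < 1e6; i++) decoder.decode(bytes);
console.timeEnd("TextDecoder x1e6");

console.time("pure JS x1e6");
for (let i = 0; i < 1e6; i++) decodeAscii(bytes);
console.timeEnd("pure JS x1e6");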
Optimization #1: caching
Ok, so it’s expensive to read and write strings, so why not just cache them? Some strings, for example the key names of JavaScript objects, are usually repeated many times: in an array of objects, most of the time all objects share the same key names, or at least a part of them. Sia caches the key names of objects, keeps a count of them, and uses pointers to refer to them, which gives us a huge performance boost. Caching big strings, or any string other than key names, is not a wise idea and hurts the performance a lot.
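Conceptually it works something like the sketch below (a simplified illustration, not Sia’s actual wire format): the first time a key name is seen it’s written in full and remembered, and after that only a small numeric reference is written.

function encodeKeys(objects) {
  const keyIndex = new Map(); // key name -> pointer
  const out = [];
  for (const obj of objects) {
    for (const key of Object.keys(obj)) {
      if (keyIndex.has(key)) {
        out.push({ ref: keyIndex.get(key) }); // cheap: just a pointer
      } else {
        keyIndex.set(key, keyIndex.size);
        out.push({ def: key }); // expensive: the full string, written once
      }
    }
  }
  return out;
}

// In an array of similar objects, each key name is written only once:
console.log(encodeKeys([{ id: 1, name: "a" }, { id: 2, name: "b" }]));
// [ { def: 'id' }, { def: 'name' }, { ref: 0 }, { ref: 1 } ]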
Optimization #2: utfz
I invented a new string encoding, or rather a new UTF-16 compression scheme, just to make it faster to read and write strings in JavaScript. When I investigated super-fast JavaScript serialization libraries, I learned that for reading and writing short strings they all use pure JavaScript functions to skip the overhead of passing JavaScript strings to C++. This gives them a huge performance boost, but they have to convert JavaScript’s UTF-16 strings into UTF-8, and that is time consuming.
The first thought that came to my mind was: why not just write UTF-16 into the buffer? That way we skip the overhead of passing strings to C++, and we also don’t spend time converting from UTF-16 to UTF-8. I tried that, and unfortunately it didn’t work well. First of all, it increased the size of the generated file, and secondly, since I had to write two bytes for every ASCII character instead of one, it actually reduced the performance a little. I realized we were writing a lot of unnecessary zeros for ASCII characters. That’s when I got this idea: what if we skip all the bytes that are repeated? For example, take a look at the hexadecimal representation of the word “Hello” in UTF-16:
48 00 65 00 6c 00 6c 00 6f 00
You see all those 00’s that are repeated? What if we had an encoding that could skip these repeated 00’s and just say, “Ok, we have a bunch of bytes and they all have a 00 after them”? Well, that’s exactly what utfz does. It works with all languages and symbols, and it also works nicely with multi-language strings. The magic is that within each language the UTF-16 code points are mostly consecutive, so each code point is effectively a shared high byte plus a character-specific low byte, and the shared part only needs to be written once. Using utfz, reading and writing strings becomes much faster, and it saves some extra space.
Optimization #3: pre-compilation
I was looking for the fastest way to read and write strings, so I checked msgpackr, cbor-x, avsc, utf8-buffer and some more packages. The packages I named all have a pure JS implementation of functions to read and write UTF-8, and for small strings they’re all faster than the native functions and APIs. Comparing them, I noticed something very interesting: for strings smaller than 16 characters, the msgpackr and cbor-x implementations of the UTF-8 functions are at least 3 or 4 times faster than the others!
Kris came up with an amazing optimization for his UTF-8 functions. He implements a generic solution for reading and writing UTF-8, and then implements special little pieces of code for reading strings of a specific length. With this technique he doesn’t have to create an array to keep track of code points, and he doesn’t need any loops or recursion. I thought to myself: what if I did the same, but with my utfz? So I got to work and wrote a generator function that takes a length n and generates a function for reading strings of length n. Doing this, I gained an extra 60% boost in performance!
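Here’s a sketch of the pre-compilation trick, simplified to plain ASCII (the real generator works on utfz-encoded data, and the 0..16 range below just mirrors the cutoff mentioned above): one specialized reader is generated per string length, so the hot path has no loop and no intermediate array of code points.

function makeAsciiReader(n) {
  // Builds: (buf, start) => String.fromCharCode(buf[start], buf[start + 1], ...)
  const args = Array.from({ length: n }, (_, i) => `buf[start + ${i}]`).join(", ");
  return new Function("buf", "start", `return String.fromCharCode(${args});`);
}

// Pre-compile readers for lengths 0..16 once, at startup.
const readers = Array.from({ length: 17 }, (_, n) => makeAsciiReader(n));

const buf = Uint8Array.from([0x48, 0x65, 0x6c, 0x6c, 0x6f]);
console.log(readers[5](buf, 0)); // "Hello"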
Optimization #4: a better serialization algorithm
Sia is designed to be fast. Unlike MessagePack, Sia doesn’t require a length header for objects or maps. Looking at popular hash map implementations in several low-level languages, I realized almost none of them require a length for initialization. Removing the length header saves at least a byte per object, and it improves performance a lot because we no longer have to count the number of keys in each object. Instead of specifying the length of an object, Sia uses one code point for objectStart and another for objectEnd.
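The byte values in the sketch below are invented just for illustration (the real code points are defined in the Sia spec), but it shows the difference: a length-prefixed map has to count its keys before writing anything, while a delimited object simply opens, streams its entries, and closes.

const OBJECT_START = 0xc6; // hypothetical marker, not Sia's real code point
const OBJECT_END = 0xc7;   // hypothetical marker, not Sia's real code point

// Toy entry writer: key length + ASCII key bytes + one value byte.
function writeEntry(key, value, out) {
  out.push(key.length);
  for (const ch of key) out.push(ch.charCodeAt(0));
  out.push(value & 0xff);
}

function writeObject(obj, out) {
  out.push(OBJECT_START);
  for (const key in obj) writeEntry(key, obj[key], out); // stream entries as we go
  out.push(OBJECT_END); // never had to count the keys up front
}

const out = [];
writeObject({ a: 1, b: 2 }, out);
console.log(out); // [ 198, 1, 97, 1, 1, 98, 2, 199 ]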
Sia caches object keys by default; keeping a record of them is part of the algorithm and the specification. In the msgpackr and cbor-x implementations, Kris had to add a custom pointer type on top of the original specifications in order to cache the keys in items. Sia is a performance-first algorithm, and I spent more than a year optimizing the algorithm and the specification, and I’m still working on improvements. Visit the project’s GitHub page to check out the code, benchmarks, specs and installation guide, as well as the documentation! Thanks for reading my article, and any feedback is appreciated!