Description
The simdjson library is highly optimized. Through clever optimizations, it avoids most bound checks.
There are a few limitations. For example, we require a few bytes of padding at the end of the input (#174). We also refuse to parse a single JSON document that exceeds 4 GB (#128).
To get around these limitations, we have an outstanding PR #1665 which undoes these clever optimizations, adds regular bound checking, and lowers the performance somewhat, but also allows you to lift the padding requirement.
A more daring approach would be not to go back to conventional bound checking and, instead, to push forward with our clever bound-free approach. Instead of doing all of these bound checks all over the place, we would examine the document when we get started and adjust the structural index so that, at a strategic location, you get a bogus error. This bogus error brings you into a distinct mode where you finish the processing with more careful code. Then you'd get the no-padding for free (given a large enough input).
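To make the idea concrete, here is a toy sketch in C++17. It is not simdjson code: the names `kBogus`, `kWide`, `is_true_fast`, `is_true_careful` and `count_true` are hypothetical, and the "document" is just a comma-separated list of `true`/`false`/`null`. The fast loop does unchecked 8-byte reads. Before it starts, the first index entry that sits too close to the end of the input is redirected to a byte that is never a valid token start, so the loop's existing error branch fires there and hands over to the careful loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <optional>
#include <string_view>
#include <vector>

constexpr std::size_t kWide = 8;  // bytes read by each unchecked wide load
static const char kBogus = '\0';  // hypothetical: never a valid token start

// Fast path: reads kWide bytes with no bounds check.
inline bool is_true_fast(const char* p) {
  char buf[kWide];
  std::memcpy(buf, p, kWide);
  return std::memcmp(buf, "true", 4) == 0;
}

// Careful path: copies only the bytes that really exist.
inline bool is_true_careful(const char* p, const char* end) {
  char buf[kWide] = {};
  const std::size_t n = static_cast<std::size_t>(end - p);
  std::memcpy(buf, p, n < kWide ? n : kWide);
  return std::memcmp(buf, "true", 4) == 0;
}

// Counts "true" tokens; returns nullopt on a genuine error.
std::optional<std::size_t> count_true(std::string_view doc) {
  const char* end = doc.data() + doc.size();

  // Toy stage 1: a pointer to the first byte of every token.
  std::vector<const char*> index;
  bool at_start = true;
  for (const char* p = doc.data(); p != end; ++p) {
    if (at_start) index.push_back(p);
    at_start = (*p == ',');
  }

  // Find the first token too close to the end for a wide read and
  // plant the bogus entry there, remembering the real pointer.
  std::size_t cut = 0;
  while (cut < index.size() &&
         static_cast<std::size_t>(end - index[cut]) >= kWide) ++cut;
  const char* saved = nullptr;
  if (cut < index.size()) { saved = index[cut]; index[cut] = &kBogus; }

  std::size_t count = 0, i = 0;
  bool careful = false;
  while (i < index.size() && !careful) {  // fast mode, no bounds checks
    const char* p = index[i];
    switch (*p) {
      case 't': count += is_true_fast(p); ++i; break;
      case 'f': case 'n': ++i; break;
      default:
        if (p != &kBogus) return std::nullopt;  // genuine error
        assert(i == cut);
        index[i] = saved;  // bogus error: restore the entry, switch modes
        careful = true;
    }
  }
  for (; i < index.size(); ++i) {  // careful mode for the tail
    const char* p = index[i];
    switch (*p) {
      case 't': count += is_true_careful(p, end); break;
      case 'f': case 'n': break;
      default: return std::nullopt;
    }
  }
  return count;
}
```

The point of the sketch is that the fast loop carries no extra comparison per token: the switch to careful mode rides on the error branch that the dispatch needs anyway.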
This "bogus error" approach is also how I would try to handle the "stage 1 in chunks". You give me a 6 GB JSON document. I index it in chunks of 1 MB. I change the index so that somewhere before the end of the chunk, I encounter a bogus error. Then I know to load a new index.
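A rough sketch of the chunked variant, again as a toy and not simdjson code: `ChunkedIndex`, `kReload`, `is_structural` and `collect` are hypothetical names, and the toy ignores strings, so a comma inside a string would be indexed too. Each chunk's index stores 32-bit offsets relative to a 64-bit chunk base and ends with a sentinel entry that tells the consumer to index the next chunk. In a real parser the sentinel would surface through the existing error path rather than an explicit comparison, and it would have to sit somewhat before the end of the chunk so that no token straddles the boundary.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <vector>

constexpr std::uint32_t kReload = UINT32_MAX;  // hypothetical sentinel entry

inline bool is_structural(char c) {
  return c == '{' || c == '}' || c == '[' || c == ']' || c == ':' || c == ',';
}

class ChunkedIndex {
 public:
  ChunkedIndex(std::string_view doc, std::size_t chunk_size)
      : doc_(doc), chunk_size_(chunk_size) {
    assert(chunk_size > 0 && chunk_size < kReload);
    load(0);
  }

  // Writes the absolute offset of the next structural character to `out`.
  // Returns false once the whole document has been consumed.
  bool next(std::uint64_t& out) {
    for (;;) {
      const std::uint32_t rel = index_[pos_++];
      if (rel != kReload) { out = base_ + rel; return true; }
      if (base_ + chunk_size_ >= doc_.size()) {
        --pos_;  // stay parked on the sentinel
        return false;
      }
      load(base_ + chunk_size_);  // sentinel hit: index the next chunk
    }
  }

 private:
  // Indexes one chunk; offsets are relative to `base`, so 32 bits suffice
  // even when the document itself is larger than 4 GB.
  void load(std::uint64_t base) {
    base_ = base;
    pos_ = 0;
    index_.clear();
    const std::uint64_t stop =
        std::min<std::uint64_t>(base + chunk_size_, doc_.size());
    for (std::uint64_t i = base; i < stop; ++i)
      if (is_structural(doc_[i]))
        index_.push_back(static_cast<std::uint32_t>(i - base));
    index_.push_back(kReload);
  }

  std::string_view doc_;
  std::size_t chunk_size_;
  std::uint64_t base_ = 0;
  std::size_t pos_ = 0;
  std::vector<std::uint32_t> index_;
};

// Convenience helper: gathers every structural offset in the document.
std::vector<std::uint64_t> collect(std::string_view doc, std::size_t chunk_size) {
  ChunkedIndex idx(doc, chunk_size);
  std::vector<std::uint64_t> out;
  std::uint64_t pos = 0;
  while (idx.next(pos)) out.push_back(pos);
  return out;
}
```

The index memory stays proportional to the chunk size instead of the document size, which is what would make a 6 GB input tractable.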
This would be a bit challenging, for sure. And it would require that we maintain a slow path with bound checking at times. The latter could be achieved with templates, maybe.
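One possible shape for the template idea, as a minimal sketch with made-up names (`unchecked_bounds`, `checked_bounds`, `parse_hex4`): the parsing routine is written once and takes a bounds policy as a template parameter. The unchecked policy compiles down to nothing, while the checked one does a real comparison, so the slow path does not need a second copy of the logic.

```cpp
#include <cstddef>
#include <string_view>

// Fast policy: the caller guarantees the bytes are readable.
struct unchecked_bounds {
  static bool readable(const char*, const char*, std::size_t) { return true; }
};

// Careful policy: verifies that n bytes remain before `end`.
struct checked_bounds {
  static bool readable(const char* p, const char* end, std::size_t n) {
    return static_cast<std::size_t>(end - p) >= n;
  }
};

inline int hex_value(char c) {
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

// Parses four hex digits (as in a \uXXXX escape); returns -1 on failure.
// With unchecked_bounds the readable() call is a constant `true`.
template <class Bounds>
int parse_hex4(const char* p, const char* end) {
  if (!Bounds::readable(p, end, 4)) return -1;
  int value = 0;
  for (int i = 0; i < 4; ++i) {
    const int digit = hex_value(p[i]);
    if (digit < 0) return -1;
    value = (value << 4) | digit;
  }
  return value;
}

// Convenience wrapper for callers holding a string_view.
template <class Bounds>
int parse_hex4_view(std::string_view s) {
  return parse_hex4<Bounds>(s.data(), s.data() + s.size());
}
```

The fast mode would instantiate everything with `unchecked_bounds`, and the mode entered after the bogus error would instantiate the same functions with `checked_bounds`.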
cc @jkeiser