Experiment with bogus-error approach for no-overhead bound-check-like behavior #1686

@lemire

The simdjson library is highly optimized. Through clever optimizations, it avoids most bound checks.

There are a few limitations. For example, we require a few bytes of padding at the end of the input (#174). We also refuse to parse a single JSON document that exceeds 4 GB (#128).

To get around these limitations, we have an outstanding PR #1665, which undoes these clever optimizations, adds regular bound checking, and lowers performance somewhat, but also lets you lift the padding requirement.

A more daring approach would be to not go back to conventional bound checking and, instead, push forward with our clever bound-check-free approach. Instead of doing all of these bound checks all over the place, we examine the document when we get started and adjust the structural index so that, at a strategic location, you get a bogus error. This bogus error brings you into a distinct mode where you finish the processing with more careful code. Then you'd get the no-padding for free (given a large enough input).
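To make the idea concrete, here is a minimal toy sketch, not simdjson code. It assumes a document made of comma-separated `true`/`false`/`null` atoms and an index of raw pointers (simdjson's real structural index stores 32-bit offsets). Every name in it (`WIDE`, `BOGUS`, `build_index`, `fast_path`, `careful_path`, `count_trues`) is hypothetical. The fast path does an unchecked 8-byte load at every token; the entry where that load would first overrun the input is swapped for a pointer to a padded invalid byte, so the existing error branch doubles as the switch to the careful path.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

constexpr std::size_t WIDE = 8;            // bytes read by each unchecked load
static const char BOGUS[WIDE] = {'\x01'};  // padded block, never a valid token start

struct result {
  bool ok;                 // false if a real error was found
  std::size_t stopped_at;  // index entry where processing stopped
  std::size_t trues;       // number of `true` atoms seen
};

// Toy "stage 1": one index entry per comma-separated token.
std::vector<const char *> build_index(const std::string &doc) {
  std::vector<const char *> index;
  if (doc.empty()) return index;
  index.push_back(doc.data());
  for (std::size_t i = 0; i < doc.size(); i++)
    if (doc[i] == ',') index.push_back(doc.data() + i + 1);
  return index;
}

// Fast path: no checks against the end of the input.
result fast_path(const std::vector<const char *> &index) {
  std::size_t trues = 0;
  for (std::size_t i = 0; i < index.size(); i++) {
    char word[WIDE];
    std::memcpy(word, index[i], WIDE);  // unchecked wide load
    switch (word[0]) {
      case 't': trues += (std::memcmp(word, "true", 4) == 0); break;
      case 'f': case 'n': break;
      default: return {false, i, trues};  // real or bogus error
    }
  }
  return {true, index.size(), trues};
}

// Careful path: checks the remaining length before every read.
result careful_path(const std::vector<const char *> &index, std::size_t start,
                    std::size_t trues, const char *end) {
  for (std::size_t i = start; i < index.size(); i++) {
    const char *p = index[i];
    assert(p <= end);
    const std::size_t avail = static_cast<std::size_t>(end - p);
    if (avail == 0) return {false, i, trues};
    switch (*p) {
      case 't': trues += (avail >= 4 && std::memcmp(p, "true", 4) == 0); break;
      case 'f': case 'n': break;
      default: return {false, i, trues};
    }
  }
  return {true, index.size(), trues};
}

result count_trues(const std::string &doc) {
  const char *end = doc.data() + doc.size();
  std::vector<const char *> index = build_index(doc);
  // Find the first entry where a wide load would run past the input.
  std::size_t trip = 0;
  while (trip < index.size() &&
         static_cast<std::size_t>(end - index[trip]) >= WIDE) trip++;
  const char *saved = nullptr;
  if (trip < index.size()) { saved = index[trip]; index[trip] = BOGUS; }
  result r = fast_path(index);
  if (!r.ok && saved != nullptr && r.stopped_at == trip) {
    index[trip] = saved;  // bogus error: restore the entry, finish carefully
    r = careful_path(index, trip, r.trues, end);
  }
  return r;
}
```

In this sketch, the only extra work on the fast path is the one-time scan that picks the trip point. Telling a bogus error from a real one happens on the error path, which is cold anyway.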

This "bogus error" approach is also how I would try to handle the "stage 1 in chunks". You give me a 6 GB JSON document. I index it in chunks of 1 MB. I change the index so that somewhere before the end of the chunk, I encounter a bogus error. Then I know to load a new index.
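A toy sketch of the chunked variant might look like the following. It is not simdjson code: the "document" is just a string of digits to be summed, the index again holds raw pointers, and `chunk_indexer`, `load_next`, `sum_digits`, and `BOGUS` are hypothetical names. Each chunk's index ends with a bogus entry, so the consumer never compares its position against the index size; it only learns that a chunk is exhausted by landing on the error branch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

static const char BOGUS = '\x01';  // not a digit, so it always hits the error branch

// Toy "stage 1" that indexes the document one chunk at a time.
struct chunk_indexer {
  const std::string &doc;
  std::size_t chunk_size;
  std::size_t offset;
  std::vector<const char *> index;

  chunk_indexer(const std::string &d, std::size_t c)
      : doc(d), chunk_size(c), offset(0) {
    assert(chunk_size > 0);
  }

  // Rebuild the index for the next chunk and terminate it with a bogus entry.
  // Returns false when no input was left to index.
  bool load_next() {
    index.clear();
    const std::size_t stop = std::min(doc.size(), offset + chunk_size);
    const bool any = offset < stop;
    for (; offset < stop; offset++) index.push_back(doc.data() + offset);
    index.push_back(&BOGUS);
    return any;
  }
};

// Toy "stage 2": sums the digits, returns -1 on a real error.
long sum_digits(const std::string &doc, std::size_t chunk_size) {
  chunk_indexer stage1(doc, chunk_size);
  stage1.load_next();
  long sum = 0;
  std::size_t i = 0;
  for (;;) {
    const char *p = stage1.index[i++];  // no check against index.size()
    const char c = *p;
    if (c >= '0' && c <= '9') { sum += c - '0'; continue; }
    // Error branch: decide whether the error is real or bogus.
    if (p != &BOGUS) return -1;           // real error in the document
    if (!stage1.load_next()) return sum;  // bogus error, but nothing left
    i = 0;                                // bogus error: continue in the new chunk
  }
}
```

The bogus entry is recognized by address, not by value, so a stray `0x01` byte inside the document is still reported as a real error.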

This would be a bit challenging, for sure. And it would require that we maintain a slow path with bound checking at times. The latter could be achieved with templates, maybe.
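As a rough illustration of the template idea, a single routine could be compiled twice, once with and once without the length check. The helper below is hypothetical (`is_true_atom` is not a simdjson function) and only shows the shape such code might take.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Does the token starting at p spell "true"?
// Checked = false: a single 4-byte load; the caller guarantees 4 readable bytes.
// Checked = true:  the remaining length is verified before loading.
template <bool Checked>
bool is_true_atom(const char *p, const char *end) {
  if (Checked) {  // compile-time constant, the branch vanishes when false
    if (static_cast<std::size_t>(end - p) < 4) return false;
  }
  std::uint32_t got, want;
  std::memcpy(&got, p, 4);
  std::memcpy(&want, "true", 4);
  return got == want;
}
```

The fast path would instantiate `is_true_atom<false>` and the slow path `is_true_atom<true>`, so both share one source definition.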

cc @jkeiser

Labels: research (Exploration of the unknown)