
index_dump skip parameter still loads the json lines previous to skip #50

@ziodave

The skip_docs parameter is passed to index_stream:

https://github.com/wetneb/opentapioca/blob/1a26df5328aeff18752e65a2d389e2cbd007c038/opentapioca/cli.py#L117-L119

The index_stream method will skip lines:

https://github.com/wetneb/opentapioca/blob/1a26df5328aeff18752e65a2d389e2cbd007c038/opentapioca/taggerfactory.py#L74-L75

But by that point the dump reader has already spent time parsing the JSON, even for the skipped lines:

https://github.com/wetneb/opentapioca/blob/1a26df5328aeff18752e65a2d389e2cbd007c038/opentapioca/readers/dumpreader.py#L26-L33
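
One possible direction for a fix, as a rough sketch only: drop the raw lines before they ever reach the JSON parser. The function name `iter_items` and its `skip` argument are made up for illustration and are not the existing opentapioca API; the handling of trailing commas and bracket lines assumes the usual one-entity-per-line layout of the Wikidata JSON dump.

```python
import itertools
import json


def iter_items(lines, skip=0):
    """Yield parsed JSON objects, discarding the first `skip` raw lines unparsed.

    Hypothetical helper, not part of opentapioca.
    """
    # islice consumes the skipped lines without handing them to json.loads
    for line in itertools.islice(lines, skip, None):
        line = line.strip()
        # assumed dump layout: each entity line ends with a trailing comma
        if line.endswith(','):
            line = line[:-1]
        # ignore blank lines and the array brackets wrapping the dump
        if not line or line in ('[', ']'):
            continue
        try:
            yield json.loads(line)
        except ValueError:
            # malformed line: ignore it and carry on
            continue
```

Note that `skip` here counts raw lines, not successfully parsed documents, so the offset may differ slightly from the current skip_docs semantics (for instance because of the opening bracket line).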

Until a fix for this lands, IMO a good workaround is to drop the leading lines with tail before piping them to the cli, e.g.:

pbzip2 -c -d -p8 latest-all.json.bz2 | tail -n +879322

(Note that I am using a multi-threaded bzip2 implementation here.)

Also, I suggest switching to orjson since it is faster than the standard library json.
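
A minimal sketch of how such a switch could stay optional, assuming orjson is installed from PyPI and falling back to the standard library otherwise. The alias `loads` is only an illustrative name, not something that exists in opentapioca today.

```python
try:
    # orjson is a third-party package; orjson.loads accepts str or bytes
    import orjson
    loads = orjson.loads
except ImportError:
    # fall back to the standard library parser if orjson is unavailable
    import json
    loads = json.loads

item = loads('{"id": "Q42", "type": "item"}')
```

Either parser returns a plain dict here, so the calling code would not need to change.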
