Lazy-splitting means getting the first n token-separated substrings without splitting the entire string first. I wrote a heuristic that does this in Python.
The method I provide has its ups, and downs. It is faster on larger streams, and slower on smaller streams. For smaller stream, just the calling convention, initializations, etc takes much more time than CPython's implementation. But with larger streams, of course, it's much faster, as it does not split the entire stirng all at once.
First let's go through how I prototyped this:
Let's generate a file of size 1MB, using Python's string mode (-c
flag):
touch rand && python3 -c "from random import choices;from pathlib import Path;Path('rand').write_bytes(bytearray(choices(list(range(65, 122)),k=1000000)))" && wc -c rand
touch rand
let's us create a file called rand
, this works on all UNIX-like systems and POSIX-compliant terminals. We then create a file containing random characters between ASCII byte 65 (A) and ASCII byte 122 (I think it's small-case z, not sure! But they are all printable). Now let's run Python split on it, and time it using time
utility:
$ python3 -c "from time import time_ns;t1=time_ns();f=open('rand','r').read();a,b,c=f.split('A')[:3];t2=time_ns();print(t2-t1,a,b,c,sep='\n')"
1819675
mbMdsYdSJORISOuPPrwdbBnL[y]_F
VJxWK`tKGXyYGpvlsDbOpWYIpT`__oTsbpx`bN[j_PL\CdHVJEgWiXbv
^IsHOiojsacmsj`P]F_Z^vvNCpUuDbWCQIi`lUiktW`Hs\`rcFwWssY\VnkyIoWpqEUNKqwBOgjPYG[eIniy`jIhyCZlTGCUaySNTU\rpSMaukfDvn[GbCBkeFMogq\ICoZ_jBjRklFtDXW\`gmieEmn\lBPmqZc\DjpnII]Rwbwd]GJVhSpbSEs\_BCQeC
It takes 1819675ns
for Python to split by A (65) and get the first 3 substrings.
Now lazysplit.py
's heuristic in string mode:
$ python3 -c "from time import time_ns;f=open('rand','rb').read();p1=0;p2=0;a='';b='';c='';t1=time_ns();list(map(lambda c: exec(f'p2+=f[p1 + p2:].index(65) + p1;{c}=f[p1:p2];p1=p2+1',globals()),['a','b','c']));t2=time_ns();print(t2 - t1,a,b,c,sep='\n')"
582256
b'mbMdsYdSJORISOuPPrwdbBnL[y]_F'
b'VJxWK`tKGXyYGpvlsDbOpWYIpT`__oTsbpx`bN[j_PL\\CdHVJEgWiXbv'
b'^IsHOiojsacmsj`P]F_Z^vvNCpUuDbWCQIi`lUiktW`Hs\\`rcFwWssY\\VnkyIoWpqEUNKqwBOgjPYG[eIniy`jIhyCZlTGCUaySNTU\\rpSMaukfDvn[GbCBkeFMogq\\ICoZ_jBjRklFtDXW\\`gmieEmn\\lBPmqZc\\DjpnII]Rwbwd]GJVhSpbSEs\\_BCQeC'
Much faster right?
Now take a look at how lazysplit.py
itself handles it:
Python result: [b'FwdKaHBKgBXX\\C]YKTs_l_womMKgNytRYEGlIXW_tmLo`aXQelimtYe^hgbCcaij[uBfZS', b'hTeG\\Uh^DKE\\HfVo]Z]ognryOWwXVPvLEowrYMmKQPOLdmpgrjWhbjgpdRuDvhoOIMveIo', b'bsGWKeyTaBWQNOjcpDdkScoQhpeShNROvp\\`WuT`W`^[ocE^uw]xUL\\hlFVFRVV_rGRhCn]COmUXgsgNmmy[iRSxeHnjISVUMoyqQeLSHRgtDMlut_XmJdvDqQSw[lHjatDmSCajQwxby']
Python time: 1010817
---
Custom result: [b'FwdKaHBKgBXX\\C]YKTs_l_womMKgNytRYEGlIXW_tmLo`aXQelimtYe^hgbCcaij[uBfZS', b'hTeG\\Uh^DKE\\HfVo]Z]ognryOWwXVPvLEowrYMmKQPOLdmpgrjWhbjgpdRuDvhoOIMveIo', b'bsGWKeyTaBWQNOjcpDdkScoQhpeShNROvp\\`WuT`W`^[ocE^uw]xUL\\hlFVFRVV_rGRhCn]COmUXgsgNmmy[iRSxeHnjISVUMoyqQeLSHRgtDMlut_XmJdvDqQSw[lHjatDmSCajQwxby']
Custom time: 602078
Diff pytime - custtime: 408739
Disclaimer: I am not to be held responsible if this code ends up with issues if you use this in-production. I did this as part of a job, but I would like to be held responsible only if the said project faces issues because of this, not someone else's.
Notice: This function does NOT handle errors. Do not pass it more n
s than it can handle.
The file has a shebang so you can do sudo chmod +x ./lazysplit.py && ./lazysplit.py
.
Contact me on Discord Chubak#7400
. I am currently working on another piece of code that I shall publish via Gist, implementation of djb2 with Aarch64 and x86-64 Assembly. I am also working on my own implementation of Zlib + DEFLATE in Rust. Please stay tuned, and follow my profile.
Thank you.