Rawscan
is written for programmers of utilities that read
data one line at a time. A line is any sequence of bytes terminated
by a designated delimiter byte.
Use the rawscan
library routines rs_open
(), rs_getline
(), and
rs_close
() to read input, line by line, from any any input stream
(any readable open file descriptor). See doc/HowToUse.md for the
details of how to use rawscan
.
As of January 10, 2020, I have (I hope) concluded my performance
tests and refinements on rawscan
. A couple of weeks ago, I
had planned on learning NumPy and JupyterLab to analyze these
test results, but it was quicker and easier for me to use the
classic Unix/Linux utilities that I already knew for this.
You can find the performance results (which I find to be quite interesting) below in the Further Performance Results section. For these results, I tested rawscan, rust, grep, sed, awk, python2, and python3 across a variety of input sizes.
Except for BurntSushi's Rust bstr crate found at BurntSushi's
bstr, rawscan is the fastest
over all input sizes, long or short lines, and few or many lines.
In order to beat this Rust bstr crate on large inputs of short
lines, I had to replace the ordinary C library call strncmp()
with hackish inline code hardwired for the specific test pattern,
because the Rust equivalent of strncmp()
, namely str.starts_with
,
is quite a bit faster. Even with that, I only "beat" Rust bstr
on this test case (many short lines) by the thinnest of margins,
likely within the margin of error.
Now that I have a good set of performance data across various commands
and input sizes, and now that rawscan
is beating all comers,
across all inputs (though in a few cases by razor thin margins),
perhaps I can turn my focus to presenting that performance data
in a more accessible manner, using NumPy and JupyterLab.
This rawscan project is seeking multiple goals for me:
- Providing a useful text line reader for some useful C/Linux commandline tools I intend to polish up and publish.
- A project to shape my learning (or re-leaning) of such tech as git, cmake, Python, NumPy, and JupyterLab.
- A set of well done C/Linux tools that I can rework in Rust, to learn and contribute to that language.
Rawscan
's advantages include:
- fast, safe and robust
- rarely moves or copies data internally
- fixed length caller specified buffer doesn't grow
- can be used in memory constrained situations
- caller controls buffer size at initialization
- handles arbitrarily long input lines
- all lines that fit in buffer returned in one chunk
- no confusing sscanf format risking bugs and overruns
- correctly and easily handles embedded nuls
- beats `gets`, `fgets`, `scanf`, `getline` and `getdelim`
- .. see "Comparative Analysis" below for details
- fast ... did I say fast?
rawscan
uses a single fixed length buffer that is allocated
during the rs_open
() call. Except as modified by the optional
rs_set_min1stchunklen()
setting, all lines short enough to fit in
that buffer are returned in one piece. Lines too long to fit are
returned in multiple chunks. Whether returned in one piece or in
multiple chunks, all returns use pointers directly into that buffer,
with zero copies.
Except as noted in the pause/resume discussion below, whenever
rs_getline
() finds that it only has a partial line left in
the upper end of its buffer, it shifts that partial line lower down
in its buffer and continues, always trying to return full lines in
one piece. For this reason in particular, note that each
rs_getline
() call might invalidate the pointers returned
in prior calls (unless the pause/resume feature is used.)
Thus rs_getline
() is not "zero-copy", but "infrequent copy",
so long as it's configured in the rs_open
() call to have an
internal buffer that is usually longer than the typical lines it's
returning.
This strategy works especially well when there is a given upper length to the line length that needs to be parsed conveniently (in one piece), and longer lines can be ignored or handled in chunks.
For applications that can ignore or quickly skip over overly
long lines, the rawscan
interface makes that especially easy
to do so. Just ignore RAWSCAN_RESULT's with result types of
rt_start_longline
, rt_within_longline
or rt_longline_ended
,
as described in doc/HowToUse.md.
The use of rawscan
on any particular input stream should be
single threaded. Multiple parallel threads trying to access and
update the same rawscan
stream will likely confuse the RAWSCAN
control structure for that stream, and might prematurely overwrite
some data still being used by another thread.
One key technique that is used to improve performance (several times
fewer user CPU cycles than stdio buffer based solutions) is the
use of strchr
, strchrnul
, or rawmemchr
to scan considerable
lengths of input. These routines are very heavily optimized, both by
modern compilers, and further by the Intel(tm) MultiMedia eXtension
(MMX) and Intel(tm) Advanced Vector Extensions (AVX) instructions.
Whenever a problem can be reduced to scanning large spans of buffer looking for a single particular character, these routines can race through the data at maximum speed, operating on data perhaps 128, 256 or 512 bits at a time, depending on compiler and CPU technology.
This is all much faster than doing character at a time:
while ((c = getchar()) != EOF) switch (c) { ... }
loops in hand-coded C from a stdio input buffer, with multiple tests
for state and the value of each character 'c' in each loop iteration.
Of the three byte scanners strchr
, strchrnul
, and rawmemchr
,
rawscan
uses the fastest one, which is rawmemchr
. It is fastest
because it only stops scanning when it sees the requested
character. The x86_64 assembly code in good libraries for these
routines is very fast, at least on "modern" (recent years as of
this writing in 2019) Intel and AMD x86_64 processors with AVX
vector instructions.
However rawmemchr
will happily run to, and beyond, the limits of the
memory segment it's searching, which could cause a SIGSEGV or other
such crash, if the character it's looking for is not found sooner.
So to use rawmemchr
safely, the rawscan
code sets up a read-only
page, just past the main buffer, with a copy of the configured
delimiterbyte in the first byte of that read-only page, ensuring
that calls to rawmemchr
will terminate there, if not sooner,
regardless of what's in the buffer at the time.
The following facilities extend the default behavior described above. If you use one of these features, then some of the specifics given above might not apply, as will be described in the following.
The rawscan
calling routine can gain some control over when the
contents of the buffer are invalidated by using the optional
pause/resume states.
To use pause/resume, first invoke rs_enable_pause
() on the
RAWSCAN stream. That call just sets a pause enable flag on the
stream's internal state and returns immediately.
Then whenever a rs_getline
() call hits the upper end of the buffer
and needs to invalidate the already returned portion to make room
for more data, that rs_getline
() call will instead return with a
RAWSCAN_RESULT.type
of rt_paused
, without altering the current
buffer contents.
When the calling routine has finished using or copying out
whatever data it's still needs from the buffer, it can then call the
rs_resume_from_pause
() function on that stream, which will unpause
the stream and enable the next rs_getline
() call to once again
return another line, and as a necessary side effect, overwrite some
or all of the previously returned data that had been in the buffer.
Such calls to the rs_resume_from_pause
() function do not alter
the setting of the pause enable flag set by rs_enable_pause
().
Rather such calls to the rs_resume_from_pause
() function just set
a one-time latch, enabling the next rs_getline
() call to one-time
overwrite stale data in the buffer in order to make room to read
in more data that it can scan and potentially return to the caller.
Use the rs_disable_pause
() call to disable the pause enable flag.
By default, rawscan
guarantees to return all lines that are
shorter than the buffer size as a single complete line. This can
result in having to copy almost an entire buffer of data lower in
the buffer, in an effort to get the rest of the line to fit in
the upper end of the buffer. Depending on the specified buffer
size and on the various line lengths typically see in the input,
this can substantially increase the amount of data copying done
internally by the rawscan
routines, and reduce performance.
In the abstract, this potential performance issue is due to conflating what could be two separate tuning parameters:
- The buffer size, which might be tuned for optimum i/o performance on the target system.
- The guaranteed minimum length of any full lines or initial chunks of longer lines.
The rs_set_min1stchunklen(RAWSCAN *rsp, size_t len)
routine, run
on an already opened rawscan
stream, sets the second parameter
of these two to any specified length len
less than or equal to
that rawscan
's buffer size.
Once set to some value len
, then that rawscan
stream will
always return any line shorter than or equal to length len
in
one piece, and for any longer line, will always return a first
chunk that is at least len
bytes long.
This can be useful if your code can efficiently handle shorter lines
and initial line chunks, and you don't want to pay the potential
performance penalty of forcing the rawscan
code to shift larger
initial chunks lower in the buffer in an effort to complete the
entire line in a single return.
The default len
value is whatever bufsz
was requested in the
rs_open()
call, and invoking rs_set_min1stchunklen(rsp, bufsz)
to restore that value will exactly restore that default.
The internal state of a rawscan
stream is hidden from the using
application. Application code that includes rawscan.h
and links
dynamically with librawscan.so
cannot not see the internals of
the RAWSCAN structure at all, and application code that includes
rawscan_static.h
can technically see the internals of the RAWSCAN
structure, but is not supposed to look.
There may be future opportunities to expose some of these internal properties of the RAWSCAN structure to using applications, preferably using small accessor functions for better portability across internal changes to internals of the RAWSCAN structure.
The current rs_open() code allocates, using the Linux sbrk(2) and brk(2) system calls, the memory space, in the invoking process's data region, for the buffer, the controlling RAWSCAN structure, and a copy of the delimiter byte in a separate read-only page just above the buffer.
A variant of this open could accept a pointer and size to memory that the caller wants rawscan to use, and with a couple more minor code tweaks, this buffer could even be read-only, allowing for example a table in ROM to be parsed, line by line.
By giving the invoking application more control over when and by how much partial data in the upper end of the buffer is shifted lower, an application could arrange to efficiently obtain a multi-line record all in one piece in the buffer, so long as it allfit.
The delimiter byte (e.g. '\n' or '\0') could be changed on the fly, with a routine that changed the page holding the sentinel copy of the delimiter to be writable, changed that sentinel byte, then set that page back to read-only.
If an application working in a very memory constrained situation
did not want to "waste" an entire memory page just to ensure that
the sentinel byte was read-only, then (barring bugs in the code) the
rawscan
library code could place the sentinel byte at the upper
end of the buffer, in the same writable page as the top of the buffer.
With some additional work, routines could be provided to resize the
internal buffer of a rawscan
stream, perhaps including copying
over what existing buffered data would fit.
The simple, most common case, of Windows style "\r\n" line endings could be trimmed with a wrapper to rs_getline(), that moved the line.end value returned by rs_getline down by one byte, when it was pointing at the '\n' character of a two character '\r\n' sequence.
As part of developing other personal Unix/Linux command line tools over the years that process data line by line, I've not been happy with the existing alternatives for scanning input lines:
- stdio's gets() is dangerously insecure, allowing buffer overflow.
- stdio's fgets(), scanf() and getchar() use slow double buffering.
- stdio's fgets() cannot correctly handle lines with embedded nuls.
- stdio's fgets() normal usage confuses continuations of lines longer than buffer with start of new lines.
- scanf() risks failed format matching and buffer overflows
- getchar() loops are accurate, but with slower per byte logic
- getline() grows (realloc's) its buffer to handle longer lines
- getdelim() is just like getline(), with a configurable delimiter
The buffer growing reallocations in getline
() present the risk
of a denial of service attack. Applications using getline
()
on untrusted input can be forced to keep growing getline
()'s
buffer until the application dies of memory exhaustion, or until
the supporting system slows unacceptably swapping a huge program.
Rust's LineReader is the
closest equivalent to rawscan
that I've found so far. However,
so far as I can tell, it will split lines that wrap around the end
of the buffer, even if the line could otherwise have fit in the
buffer, had it been at a different offset.
Instead, rawscan
never splits lines that it can fit in
its buffer, or that are shorter than specified by the optional
rs_set_min1stchunklen()
setting, and never reallocates a bigger
buffer than initially allocated in the rs_open
() call. If an
input line that would have fit in the buffer would spill off the
end of the buffer, rawscan
will shift up to min1stchunklen
bytes of that partially read line lower in the buffer and continue
reading the rest of it into the buffer, before returning the whole
line in one piece if possible.
The following test case highlights rawscan
's strengths.
The input file held 55,626,370 (55.6 million) lines of text, and had a size of 11,266,624,053 (11.3 Gb) bytes. The shortest line was 82 bytes long and the longest line was 2525 bytes long. Each line started with a long ASCII hex number over the alphabet [0-9a-f]. These initial numbers were largely, but not entirely, random in order and value.
A pattern was chosen that would be easy to code for in various ways, and that would only match a few lines, so as to focus the performance timings on the cost of reading and distinguishing each line, not on searching within each line for a pattern, and not on writing out the successful matches.
The pattern matched lines beginning with the eight hex characters "6fffbf42", using strncmp() to look for the pattern, or using the pattern "^6fffbf42" (anchored to start of line) with grep. This pattern matched 8 out of the 55,626,370 lines.
Here's the user CPU times, on a Ryzen 1700 with plenty of RAM and and with the input file (all 11.2+ GBytes of it) already loaded into RAM, for each of grep and three small programs coded to use fgets, getline, and rawscan. Times are averages of 10 runs, using "nice -n-10" to get higher CPU priority, which resulted in fairly consistent times from run to run (except for "grep", as noted below.)
----------------------
| Routine | CPU secs |
| ------- | -------- |
| grep | 4.38 |
| fgets | 2.61 |
| getline | 2.36 |
| rawscan | 1.60 |
----------------------
The "grep" runs displayed an anomaly that I don't understand. About 90% of the grep runs used very close to 4.38 seconds, while the other 10% of them used very close to 7.10 seconds. In the above table, I gave grep a mulligan for those runs that took over 7 seconds.)
As you can see from the above table, rawscan shaves about 30% to 40% off the user CPU times of the fgets and getline based code.
Rawscan does this without having fgets' problems with embedded nuls, getline's potentially unbounded malloc's (or leaving difficult to code reallocations to the caller, as fgets does), or the potential for dangerous buffer overruns (as gets and some scanf formats do).
Rawscan also provides more sophisticated options for managing memory usage, from very small pre-allocated buffers that require no malloc runtime, to sufficiently large buffers to contain the entire input, directly accessible in memory (without falling apart when an even larger than expected input shows up.)
Further tests comparing staticly included rawscan code in the "rawscan_static.h" header, with dynamically linked librawscan.so code as declared in the "rawscan.h" header, along with other common routines for reading lines of input, produced the following results.
Compared to including rawscan.h and dynamically linking with
librawscan.so, including "rawscan_static.h" was faster, across
almost all test sizes, large and small, short and long lines.
For sufficiently large test cases such as 65536 lines of 4096 bytes
each, there was no measurable difference between the two; in such
cases, I presume that the raw power of rawmemchr()
scanning for
the next delimiter dominates all other costs.
The performance gain obtained from using "rawscan_static.h", versus the more "normal" use of "rawscan.h" with a dynmically linked librawscan.so, depended on the average line length of the input data.
In tests with many short lines of 8 bytes (plus newline) each, the static built rawscan ran 10% faster. In tests with many lines of 64 bytes each, "static" was 2% faster. In tests with many lines of 512 bytes or longer, there was not a measurable difference between tests that included "rawscan_static.h", versus that included "rawscan.h" and dynamically linked to librawscan.so.
This probably reflects what part of the code is most heavily used.
With short lines, the per line call cost of rs_getline() matters most, whereas with long lines, the underling per character cost of the rawmemchr() scanning long runs of characters matters most. The "static" build improves the per rs_getline() call costs, but has no impact on the underlying per character rawmemchr() scanning costs that dominate on input with long lines.
The above performance comparisons apply for large file inputs, such as 65536 lines used to obtain the above results. For small file inputs, the cost of starting up the test command dominates.
For example, on test inputs of 8 lines of 8 chars (plus newline) each, the "static" builds had about a 3% advantage over the dynmaically linked build, and both variants had about an 18% advantage over their nearest competitors using rust, as well as a 27% advantage over the "grep" command line utility. The other utilities sed, awk, python2, and python3 were progressively slower, in that order, with python3 being over 19 times slower on these small inputs of 8 lines having 8+1 characters each.
I've been coding various C routines to read input line by line for many years, since before stdio even existed. As an old school C programmer, the complexities and inefficiencies of the stdio package, compared to raw read(2) and write(2) calls, have always annoyed me a bit.
As a coder of quite a few personal text processing tools, I've long experimented with various safe, fast, line readers. Those familiar with Rust's Result style of handling complex function call returns, or with the above mentioned LineReader routine, will recognize their influence on this current rawscan code.
Here's how to obtain rawscan
for your project.
Rawscan
is developed and maintained using the GNU C compiler
on Linux systems, though a cmake
build system is provided that
(so far only tested on my Ubuntu system) should enable porting
(including removing any gcc and Linux dependencies), building,
installing, and linking rawscan
on BSD, Windows and MacOS
systems, and with the clang and MSVC compilers.
Download or clone from
github, compile, and install the shared library (librawscan.so
) and header files
(rawscan.h
and rawscan_static.h
) as follows:
Once downloaded or cloned, cd into rawscan
's build directory and do:
cmake .. # run commands in the build subdirectory
make
sudo make install
One is then ready to use rawscan to help write one's own text or line processing utilities.
If you're interested in using or porting rawscan
, or would
like to suggest an additional feature that might make rawscan
more attractive to you, let me know, at my email address below,
or via the rawscan
repository on github. Thanks!
The source file source/tests/random_line_generator.c
generates
random input, of specified number of lines, of specified length(s)
of constant length or of random lengths within a specified range.
The source file source/tests/compare_various_apis.sh
provides
a framework to run tests on several competing commands at a time,
in parallel, from a range of test in put sizes, and collects
statistics such as user and system cpu time, elapsed time, and
page fault rates from the runs. A few other scripts in the
source/tests
directory perform preliminary analysis of the
resulting data.
Paul Jackson
[email protected]
Project began in November of 2019