There are many ways to extract data from strings or files. The scanf family of functions offers one of them.
These functions scan input according to a provided format string. The format string may contain conversion specifiers (also called conversion directives) that extract integers, floating-point numbers, characters, strings, etc. from the input and store them in the arguments.
For example, a format string for parsing the components of an IP address might look like “%3d.%3d.%3d.%3d”: the scanf function will parse four integers (at most three digits each) that are delimited by dots and return them to the caller.
There are basically two ways to implement scanf:
- as an interpreter that scans the format string and executes commands as they are retrieved;
- as a translator to an intermediate language that, in turn, is compiled into machine code.
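To make the second option concrete, here is a minimal sketch of a translator. It turns a format string into a Lisp form and hands that form to the standard COMPILE function. The name compile-format is hypothetical, only %d and literal characters are supported, and none of this is taken from an existing library.

```lisp
;; Hypothetical sketch of the "translator" approach; not part of trivial-scanf.
(defun compile-format (fmt)
  "Translate FMT into a compiled function of one argument (the input string).
Only %d and literal characters are understood; any other character, including
a % that is not followed by d, is matched literally."
  (let ((steps '()) (i 0) (n (length fmt)))
    (loop while (< i n)
          do (let ((c (char fmt i)))
               (cond ((and (char= c #\%)
                           (< (1+ i) n)
                           (char= (char fmt (1+ i)) #\d))
                      ;; emit code that reads one integer
                      (push '(multiple-value-bind (value end)
                                 (parse-integer input :start pos :junk-allowed t)
                               (unless value
                                 (return-from scan (nreverse results)))
                               (push value results)
                               (setf pos end))
                            steps)
                      (incf i 2))
                     (t
                      ;; emit code that matches one literal character
                      (push `(if (and (< pos (length input))
                                      (char= (char input pos) ,c))
                                 (incf pos)
                                 (return-from scan (nreverse results)))
                            steps)
                      (incf i)))))
    (compile nil `(lambda (input)
                    (block scan
                      (let ((pos 0) (results '()))
                        ,@(nreverse steps)
                        (nreverse results)))))))

(funcall (compile-format "%d.%d.%d.%d") "127.0.0.1") ; => (127 0 0 1)
```

The format string is walked only once, at translation time; the resulting function can then be reused for many inputs.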
The trivial-scanf package (which comes as part of the CL-STRING-MATCH library) takes the first approach. The implementation reads the format string one character at a time and, depending on the character read, performs the designated operation. Underneath, it uses the PROC-PARSE library to deal with the input. An outline of the function’s main loop looks as follows:
(iter
  (while (< fmt-pos fmt-len))
  (for c = (char fmt fmt-pos))
  (case c
    (#\%
     ;; process conversion directive
     )
    ((#\Space #\Tab #\Return #\Newline #\Page)
     ;; process white space characters
     )
    (otherwise
     ;; process ordinary characters
     )))
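The outline above can be fleshed out into a small runnable interpreter. The following is only a sketch under simplifying assumptions: the function name mini-scanf is hypothetical, it uses the standard LOOP macro and plain strings instead of iterate and PROC-PARSE, and it understands only %d, white space, and literal characters.

```lisp
;; Hypothetical miniature of the interpreter loop; not the trivial-scanf source.
(defun mini-scanf (fmt input)
  "Interpret FMT against INPUT and return the list of parsed integers.
Scanning stops silently at the first mismatch."
  (let ((fmt-pos 0) (fmt-len (length fmt))
        (in-pos 0) (in-len (length input))
        (results '()))
    (flet ((skip-whitespace ()
             (loop while (and (< in-pos in-len)
                              (member (char input in-pos)
                                      '(#\Space #\Tab #\Return #\Newline #\Page)))
                   do (incf in-pos))))
      (loop while (< fmt-pos fmt-len)
            do (let ((c (char fmt fmt-pos)))
                 (case c
                   (#\%
                    ;; conversion directive: only %d is understood here
                    (incf fmt-pos)
                    (unless (and (< fmt-pos fmt-len)
                                 (char= (char fmt fmt-pos) #\d))
                      (error "Unsupported directive in ~S" fmt))
                    (multiple-value-bind (value end)
                        (parse-integer input :start in-pos :junk-allowed t)
                      (unless value (return)) ; no integer here: stop scanning
                      (push value results)
                      (setf in-pos end)))
                   ((#\Space #\Tab #\Return #\Newline #\Page)
                    ;; white space in the format skips any white space in the input
                    (skip-whitespace))
                   (otherwise
                    ;; an ordinary character must match the input exactly
                    (unless (and (< in-pos in-len)
                                 (char= (char input in-pos) c))
                      (return))
                    (incf in-pos)))
                 (incf fmt-pos))))
    (nreverse results)))

(mini-scanf "%d.%d.%d.%d" "127.0.0.1") ; => (127 0 0 1)
```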
Conversion directives may have optional flags and parameters that must be taken into account. Simple directives, like %d, are handled in a straightforward way: the input that matches the designated data type (digits) is bound to a string, which is then parsed using the corresponding function (parse-integer in this case).
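As an illustration of this step, here is a minimal sketch of how a width-limited directive such as %3d could be handled. The helper scan-integer is hypothetical and ignores signs and flags; it only shows the idea of binding a run of digits and passing it to parse-integer.

```lisp
;; Hypothetical helper; not the trivial-scanf source.
(defun scan-integer (input start &optional max-width)
  "Bind the run of digits in INPUT at START (at most MAX-WIDTH characters)
and parse it. Returns the integer and the position after it, or NIL and
START when no digit is found."
  (let* ((limit (if max-width
                    (min (length input) (+ start max-width))
                    (length input)))
         ;; first non-digit inside the allowed window, or the window's end
         (end (or (position-if-not #'digit-char-p input
                                   :start start :end limit)
                  limit)))
    (if (> end start)
        (values (parse-integer input :start start :end end) end)
        (values nil start))))

(scan-integer "127.0.0.1" 0 3) ; => 127, 3
(scan-integer "12345" 0 3)     ; => 123, 3
```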
However, the standard scanf also specifies a directive that matches a set of designated characters. For example, the directive ‘%[a-z0-9-]’ scans the input from the current position and returns a string composed of lowercase letters, digits, and dashes, stopping at the first mismatch. If we were dealing with an octet string (a string where every character is guaranteed to be a single byte in size), it would be feasible to interpret this directive using a table that marks the characters belonging to the set. trivial-scanf takes another approach: the character-set directive is converted into a list of closures that serve as predicates for the input string binding operation. In our example, the list of closures would contain predicates for (range #\a…#\z), (range #\0…#\9), and (character #\-).
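The closure-based approach can be sketched roughly as follows. The names charset-predicates and scan-charset are hypothetical, and the sketch ignores negated sets (‘%[^…]’) and other corner cases; it is meant to show the idea, not to reproduce the package's code.

```lisp
;; Hypothetical sketch of the closure-based approach; not the trivial-scanf source.
(defun charset-predicates (spec)
  "Turn a set body such as \"a-z0-9-\" into a list of one-argument predicates."
  (let ((preds '()) (i 0) (n (length spec)))
    (loop while (< i n)
          do (let ((lo (char spec i)))
               (if (and (< (+ i 2) n)
                        (char= (char spec (1+ i)) #\-))
                   ;; a range such as a-z
                   (let ((hi (char spec (+ i 2))))
                     (push (lambda (c) (char<= lo c hi)) preds)
                     (incf i 3))
                   ;; a single character, including a trailing dash
                   (progn
                     (push (lambda (c) (char= c lo)) preds)
                     (incf i)))))
    (nreverse preds)))

(defun scan-charset (spec input &optional (start 0))
  "Return the longest run of INPUT from START whose characters satisfy at
least one predicate, and the position after that run."
  (let* ((preds (charset-predicates spec))
         (end (or (position-if-not
                   (lambda (c) (some (lambda (p) (funcall p c)) preds))
                   input :start start)
                  (length input))))
    (values (subseq input start end) end)))

(scan-charset "a-z0-9-" "foo-42 bar") ; => "foo-42", 6
```

Unlike a lookup table, the predicates do not depend on the size of the character repertoire, so the same code works for full Unicode strings.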
trivial-scanf will be accessible through Quicklisp after the next package update. For now, you can clone the repository and install it locally.
Some usage examples:
(ql:quickload :trivial-scanf)
(snf:scanf "%3d.%3d.%3d.%3d" "127.0.0.1") => (127 0 0 1)
(snf:scanf "%d %[A-C] %d" "1 ABBA 2") => (1 "ABBA" 2)
This is the first (almost alpha) release of the code, so some bugs are expected. Feel free to comment or report them.
trivial-scanf is part of the CL-STRING-MATCH library.