`reggy`

A friendly regular expression dialect for text analytics. Typical regex features are removed/adjusted to make natural language queries easier. Unicode-aware and able to search a stream with several patterns at once.

Should I Use `reggy`?

If you are working on a text processing problem with streaming datasets or hand-tuned regexes for natural language, you may find the feature set compelling.

Crate	Match Streams?	Case Insensitivity?	Pattern Flexibility?
`aho-corasick`	✅	simple ASCII	string set
`regex`	❌	Unicode best-effort	full-featured regex
`reggy`	✅	Unicode best-effort	regex subset

API Usage

Use the high-level Pattern struct for simple search.

let mut p = Pattern::new("dogs?")?;
assert_eq!(
    p.findall_spans("cat dog dogs cats"),
    vec![(4, 7), (8, 12)]
);

Use the Ast struct to transpile to normal regex syntax.

let ast = Ast::parse(r"dog(gy)?|dawg|(!CAT|KITTY CAT)")?;
assert_eq!(
    ast.to_regex(),
    r"\b(?mi:dog(?:gy)?|dawg|(?-i:CAT|KITTY\s+CAT))\b"
);

Stream a File

In this example, we will count the matches of a set of patterns within a file without loading it into memory. Use the Search struct to search a stream with several patterns at once.

Create a BufReader for the text.

use std::fs::File;
use std::io::{self, BufReader};

let f = File::open("tests/samples/republic_plato.txt")?;
let f = BufReader::new(f);

Compile the search object.

let patterns = [
    r"yes|(very )?true|certainly|quite so|I have no objection|I agree",
    r"\?",
];

let mut pattern_counts = [0; 2];

let mut search = Search::compile(&patterns)?;

Call Search::iter to create a StreamSearch. Any IO error or malformed UTF-8 is returned as a SearchStreamError.

for result in search.iter(f) {
    match result {
        Ok(m) => {
            pattern_counts[m.id] += 1;
        }
        Err(e) => {
            println!("Stream Error {e:?}");
            break;
        }
    }
}

println!("Assent Count:   {}", pattern_counts[0]);
println!("Question Count: {}", pattern_counts[1]);
// Assent Count:   1467
// Question Count: 1934
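The example above reads the file through a buffered reader, so chunk boundaries can fall in the middle of a multi-byte UTF-8 character. The following is a minimal, std-only sketch of one way a streaming reader might decode such chunks. It is not reggy's implementation: `for_each_utf8_chunk` and `ChunkError` are hypothetical names introduced here for illustration.

```rust
use std::io::{self, Read};
use std::str::from_utf8;

/// Hypothetical error type for this sketch (not reggy's SearchStreamError).
#[derive(Debug)]
enum ChunkError {
    Io(io::Error),
    Utf8,
}

/// Read `reader` in blocks of `buf_size` bytes and pass each valid UTF-8
/// piece to `f`. A character split across two reads is carried over and
/// completed by the next read.
fn for_each_utf8_chunk<R: Read>(
    mut reader: R,
    buf_size: usize,
    mut f: impl FnMut(&str),
) -> Result<(), ChunkError> {
    let mut buf = vec![0u8; buf_size];
    let mut carry: Vec<u8> = Vec::new(); // bytes not yet handed to `f`
    loop {
        let n = reader.read(&mut buf).map_err(ChunkError::Io)?;
        if n == 0 {
            // End of stream: leftover bytes are a truncated character.
            return if carry.is_empty() { Ok(()) } else { Err(ChunkError::Utf8) };
        }
        carry.extend_from_slice(&buf[..n]);
        let valid = match from_utf8(&carry) {
            Ok(s) => s.len(),
            // `error_len() == None` means the input ended mid-character,
            // so the missing bytes may still arrive with the next read.
            Err(e) if e.error_len().is_none() => e.valid_up_to(),
            Err(_) => return Err(ChunkError::Utf8),
        };
        let text = from_utf8(&carry[..valid]).expect("prefix validated above");
        if !text.is_empty() {
            f(text);
        }
        carry.drain(..valid);
    }
}
```

Each piece passed to `f` could then be handed to a chunk-wise search call such as Search::next, as shown in the next section.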

Walk a Stream Manually

let mut search = Search::compile(&[
    r"$#?#?#.##",
    r"(John|Jane) Doe"
])?;

Call Search::next to begin searching. It will yield any matches deemed definitely-complete immediately.

let jane_match = Match::new(1, (0, 8));
assert_eq!(
    search.next("Jane Doe paid John"),
    vec![jane_match]
);

Call Search::next again to continue with the same search state. Note that "John Doe" matched across the chunk boundary, and spans are relative to the start of the stream.

let john_match = Match::new(1, (14, 22));
let money_match_1 = Match::new(0, (23, 29));
let money_match_2 = Match::new(0, (41, 48));
assert_eq!(
    search.next(" Doe $45.66 instead of $499.00"),
    vec![john_match, money_match_1, money_match_2]
);

Call Search::finish to collect any not-definitely-complete matches once the stream is closed.

assert_eq!(search.finish(), vec![]);

See more in the API docs.

Pattern Language

reggy is case-insensitive by default. A space matches one or more whitespace characters (i.e. \s+). All the reserved characters mentioned below (\, (, ), {, }, ,, ?, |, #, and !) may be escaped with a backslash for a literal match. Patterns are surrounded by implicit Unicode word boundaries (i.e. \b). Empty patterns or subpatterns are not permitted.

Examples

Make a character optional with ?

dogs? matches dog and dogs

Create two or more alternatives with |

dog|cat matches dog and cat

Create a sub-pattern with (...)

the qualit(y|ies) required matches the quality required and the qualities required

the only( one)? around matches the only around and the only one around

Create a case-sensitive sub-pattern with (!...)

United States of America|(!USA) matches USA, not usa

Match digits with #

#.## matches 3.14

Match exactly n times with {n}, or between n and m times with {n,m}

(very ){1,4}strange matches very very very strange
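
The rules in this section can be approximated by a character-level translation into ordinary regex syntax. The sketch below is not the crate's parser (which is generated with lalrpop) and performs no validation. `to_regex_sketch` and `push_literal` are hypothetical helpers, and mapping # to \d is an assumption based on the description "match digits". It only aims to reproduce the shape of the Ast::to_regex output shown under API Usage.

```rust
/// Append `c` as a literal, escaping it if it is a regex metacharacter.
fn push_literal(out: &mut String, c: char) {
    if r"\.+*?()|[]{}^$".contains(c) {
        out.push('\\');
    }
    out.push(c);
}

/// Hypothetical helper: translate a well-formed reggy pattern to regex syntax.
fn to_regex_sketch(pattern: &str) -> String {
    // Implicit word boundary plus case-insensitive (and multi-line) flags.
    let mut out = String::from(r"\b(?mi:");
    let mut chars = pattern.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            // A backslash makes the next character a literal.
            '\\' => {
                if let Some(next) = chars.next() {
                    push_literal(&mut out, next);
                }
            }
            // `(!` opens a case-sensitive group, `(` a plain sub-pattern.
            '(' => {
                if chars.peek() == Some(&'!') {
                    chars.next();
                    out.push_str("(?-i:");
                } else {
                    out.push_str("(?:");
                }
            }
            // These operators are spelled the same way in regex syntax.
            ')' | '?' | '|' | '{' | '}' | ',' => out.push(c),
            // Assumption: `#` corresponds to a digit class.
            '#' => out.push_str(r"\d"),
            // A run of whitespace becomes a single `\s+`.
            c if c.is_whitespace() => {
                while chars.peek().map_or(false, |n| n.is_whitespace()) {
                    chars.next();
                }
                out.push_str(r"\s+");
            }
            other => push_literal(&mut out, other),
        }
    }
    out.push_str(r")\b");
    out
}
```

Under these assumptions, the pattern dog(gy)?|dawg|(!CAT|KITTY CAT) translates to \b(?mi:dog(?:gy)?|dawg|(?-i:CAT|KITTY\s+CAT))\b, and #.## translates to \b(?mi:\d\.\d\d)\b.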

Definitely-Complete Matches

reggy follows "leftmost-longest", greedy matching semantics. A pattern may match after one step of a stream, yet may match a longer form depending on the next step. For example, abb? will match s.next("ab"), but a subsequent call to s.next("b") would create a longer match, "abb", which should supersede the match "ab".

Search only yields matches once they are definitely complete and cannot be superseded by future next calls. Each pattern has a maximum byte length L, counting contiguous whitespace as 1 byte.¹ Once reggy has streamed L bytes past the start of a match without superseding it, that match will be yielded.
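
To make the maximum byte length L concrete, here is a minimal sketch that computes it over a simplified pattern tree. The `Node` enum and `max_len` function are hypothetical and are not the crate's Ast type. The sketch assumes each digit is a single ASCII byte and ignores case folding that changes a character's byte length.

```rust
/// Hypothetical, simplified pattern tree (not reggy's `Ast`).
enum Node {
    Literal(String),
    Digit,                    // `#`
    Whitespace,               // a space in the pattern, counted as 1 byte
    Seq(Vec<Node>),           // concatenation
    Or(Vec<Node>),            // `a|b`
    Optional(Box<Node>),      // `x?`
    Repeat(Box<Node>, usize), // `{n}` or `{n,m}`, storing the upper bound
}

/// Maximum number of bytes a match of `node` can span.
fn max_len(node: &Node) -> usize {
    match node {
        Node::Literal(s) => s.len(),
        Node::Digit => 1, // assumption: ASCII digits only
        Node::Whitespace => 1,
        Node::Seq(items) => items.iter().map(max_len).sum::<usize>(),
        Node::Or(items) => items.iter().map(max_len).max().unwrap_or(0),
        Node::Optional(inner) => max_len(inner),
        Node::Repeat(inner, upper) => max_len(inner) * *upper,
    }
}
```

For (very ){1,4}strange this gives 4 × 5 + 7 = 27, and for abb? it gives 3. Every variant has a finite bound, which is only possible because each repetition carries an explicit upper limit.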

As a consequence, results of a given Search are the same regardless of how a given haystack stream is chunked. Search::next returns Matches as soon as it practically can while respecting this invariant.
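
The chunking invariant can be illustrated with a toy searcher. This is a sketch under strong simplifications, not reggy's implementation: `ToySearch` is a hypothetical type that handles one pattern given as literal alternatives, works on raw bytes, and ignores word boundaries and case. It decides a position only once L bytes are buffered from that position or the stream is closed, which is somewhat more conservative than the rule described above.

```rust
/// Toy single-pattern searcher over literal alternatives (hypothetical).
struct ToySearch {
    alternatives: Vec<Vec<u8>>,
    max_len: usize,   // L: the longest alternative, in bytes
    buf: Vec<u8>,     // bytes not yet decided
    buf_start: usize, // stream offset of buf[0]
}

impl ToySearch {
    fn new(alternatives: &[&str]) -> Self {
        let alternatives: Vec<Vec<u8>> = alternatives
            .iter()
            .filter(|a| !a.is_empty()) // empty patterns are not permitted
            .map(|a| a.as_bytes().to_vec())
            .collect();
        let max_len = alternatives.iter().map(|a| a.len()).max().unwrap_or(0);
        ToySearch { alternatives, max_len, buf: Vec::new(), buf_start: 0 }
    }

    /// Feed one chunk; returns spans relative to the start of the stream.
    fn next_chunk(&mut self, chunk: &str) -> Vec<(usize, usize)> {
        self.buf.extend_from_slice(chunk.as_bytes());
        self.scan(false)
    }

    /// Close the stream and collect matches that were still pending.
    fn finish(&mut self) -> Vec<(usize, usize)> {
        self.scan(true)
    }

    fn scan(&mut self, closed: bool) -> Vec<(usize, usize)> {
        let mut found = Vec::new();
        let mut pos = 0;
        while pos < self.buf.len() {
            // With fewer than L bytes buffered from `pos`, a longer match
            // could still arrive, so wait for more input.
            if !closed && self.buf.len() - pos < self.max_len {
                break;
            }
            let rest = &self.buf[pos..];
            // Leftmost-longest: take the longest alternative starting here.
            let longest = self
                .alternatives
                .iter()
                .filter(|alt| rest.starts_with(alt.as_slice()))
                .map(|alt| alt.len())
                .max();
            match longest {
                Some(len) => {
                    found.push((self.buf_start + pos, self.buf_start + pos + len));
                    pos += len;
                }
                None => pos += 1,
            }
        }
        // Drop decided bytes while keeping stream-relative offsets.
        self.buf.drain(..pos);
        self.buf_start += pos;
        found
    }
}
```

With the alternatives "ab" and "abb", feeding "ab" yields nothing, a following "b" yields the span (0, 3), and the stream "ab abb ab" produces the same three spans however it is split into chunks.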

Implementation

The pattern language is parsed with lalrpop (grammar).

The search routines use a regex_automata::dfa::dense::DFA. Compared to other regex engines, the dense DFA is memory-intensive and slow to construct, but searches are fast. Unicode word boundaries are handled by the unicode_segmentation crate.

Footnotes

¹ This is why unbounded quantifiers are absent from reggy. When a pattern requires * or +, users should choose an upper limit ({0,n}, {1,n}) instead.
