Hyperscan-java provides Java bindings for Vectorscan, a fork of Intel's Hyperscan - a high-performance multiple regex matching library.
Vectorscan uses hybrid automata techniques to allow simultaneous matching of large numbers (up to tens of thousands) of regular expressions across streams of data with exceptional performance.
- High Performance: Scan text against thousands of patterns simultaneously with high performance
- Two Usage Modes:
- Direct Vectorscan API for maximum performance (with limited regex syntax support)
PatternFilter
utility for full Java Regex API compatibility
- UTF-8 Support: Proper handling of Unicode text with character-based matching results
- Cross-Platform: Pre-compiled native libraries for Linux and macOS (x86_64 and arm64)
- Database Serialization: Save and load compiled pattern databases to avoid recompilation costs
The library is available on Maven Central. The version number consists of two parts (e.g., 5.4.11-3.1.0
):
- First part: Vectorscan version (
5.4.11
) - Second part: Library version using semantic versioning (
3.1.0
)
<dependency>
<groupId>com.gliwka.hyperscan</groupId>
<artifactId>hyperscan</artifactId>
<version>5.4.11-3.1.0</version>
</dependency>
implementation 'com.gliwka.hyperscan:hyperscan:5.4.11-3.1.0'
libraryDependencies += "com.gliwka.hyperscan" %% "hyperscan" % "5.4.11-3.1.0"
The library offers two primary ways to use the regex matching capabilities:
PatternFilter
is ideal when you:
- Need full Java Regex API functionality and PCRE syntax
- Want to pre-filter a large list of patterns efficiently
- Need capture groups, backreferences, or other advanced regex features
It uses Vectorscan to quickly identify potential matches, then confirms them with Java's standard Regex engine.
Use the direct API when:
- You need the absolute highest performance
- Your patterns work within Vectorscan's supported syntax
- You understand the differences in matching semantics
Note: Vectorscan doesn't support backreferences, capture groups, or backtracking verbs.
import com.gliwka.hyperscan.util.PatternFilter;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import static java.util.Arrays.asList;
public class PatternFilterExample {
public static void main(String[] args) {
// Create a list of Java regex patterns (could be thousands)
List<Pattern> patterns = asList(
Pattern.compile("The number is ([0-9]+)", Pattern.CASE_INSENSITIVE),
Pattern.compile("The color is (blue|red|orange)"),
Pattern.compile("\\w+@\\w+\\.com")
// imagine thousands more patterns here
);
try (PatternFilter filter = new PatternFilter(patterns)) {
String text = "The number is 42 and the NUMBER is 123. Contact [email protected]";
// Quickly filter to just the potentially matching patterns
List<Matcher> potentialMatches = filter.filter(text);
// Use Java's Regex API to confirm matches and extract groups
for (Matcher matcher : potentialMatches) {
while (matcher.find()) {
System.out.println("Pattern: " + matcher.pattern());
System.out.println("Match: " + matcher.group(0));
// Handle capture groups if present
if (matcher.groupCount() > 0) {
System.out.println("Captured: " + matcher.group(1));
}
}
}
} catch (Exception e) {
e.printStackTrace();
}
}
}
import com.gliwka.hyperscan.wrapper.*;
import java.util.EnumSet;
import java.util.LinkedList;
import java.util.List;
public class DirectHyperscanExample {
public static void main(String[] args) {
// Define expressions to match
LinkedList<Expression> expressions = new LinkedList<>();
// Expression(pattern, flags, id)
expressions.add(new Expression("[0-9]{5}", EnumSet.of(ExpressionFlag.SOM_LEFTMOST), 0));
expressions.add(new Expression("test", EnumSet.of(ExpressionFlag.CASELESS), 1));
expressions.add(new Expression("example\\.(com|org|net)", EnumSet.of(ExpressionFlag.SOM_LEFTMOST), 2));
try (Database db = Database.compile(expressions)) {
try (Scanner scanner = new Scanner()) {
// Allocate scratch space matching the database
scanner.allocScratch(db);
// Scan text against all patterns simultaneously
String text = "12345 is a zip code. Test this at example.com!";
List<Match> matches = scanner.scan(db, text);
for (Match match : matches) {
Expression matchedExpression = match.getMatchedExpression();
System.out.println("Pattern: " + matchedExpression.getExpression());
System.out.println("Pattern ID: " + matchedExpression.getId());
// Note: start and end positions are character indices (inclusive)
System.out.println("Start position: " + match.getStartPosition());
System.out.println("End position: " + match.getEndPosition());
// The matched string is only available if SOM_LEFTMOST flag was used
if (matchedExpression.getFlags().contains(ExpressionFlag.SOM_LEFTMOST)) {
System.out.println("Matched text: " + match.getMatchedString());
}
System.out.println("---");
}
// You can also check if a pattern matches without getting match details
boolean hasAnyMatch = scanner.hasMatch(db, text);
System.out.println("Has any match: " + hasAnyMatch);
}
} catch (Exception e) {
e.printStackTrace();
}
}
}
import com.gliwka.hyperscan.wrapper.*;
import java.util.EnumSet;
public class ValidationExample {
public static void main(String[] args) {
// Create an expression to validate
Expression expr = new Expression(
"a++", // This pattern uses a feature not supported by Hyperscan
EnumSet.of(ExpressionFlag.UTF8)
);
// Check if the expression is valid for Hyperscan
Expression.ValidationResult result = expr.validate();
if (result.isValid()) {
System.out.println("Pattern is valid");
} else {
System.out.println("Pattern is invalid: " + result.getErrorMessage());
}
// Common expression flags:
// - ExpressionFlag.CASELESS - Case insensitive matching
// - ExpressionFlag.DOTALL - Dot (.) matches newlines
// - ExpressionFlag.MULTILINE - ^ and $ match on line boundaries
// - ExpressionFlag.UTF8 - Pattern and input are UTF-8
// - ExpressionFlag.SOM_LEFTMOST - Track start of match (enables getMatchedString())
// - ExpressionFlag.PREFILTER - Optimize for pre-filtering (used by PatternFilter)
}
}
import com.gliwka.hyperscan.wrapper.*;
import java.io.*;
import java.util.EnumSet;
public class DatabaseSerializationExample {
public static void main(String[] args) {
try {
// Create and compile a database
Expression expr = new Expression("\\w+@\\w+\\.(com|org|net)",
EnumSet.of(ExpressionFlag.SOM_LEFTMOST));
// Saving to a file
try (Database db = Database.compile(expr);
OutputStream out = new FileOutputStream("email_patterns.db")) {
db.save(out);
System.out.println("Database saved, size: " + db.getSize() + " bytes");
}
// Loading from a file later (much faster than recompiling)
try (InputStream in = new FileInputStream("email_patterns.db");
Database loadedDb = Database.load(in);
Scanner scanner = new Scanner()) {
scanner.allocScratch(loadedDb);
List<Match> matches = scanner.scan(loadedDb, "Contact us at [email protected]");
System.out.println("Found " + matches.size() + " matches");
}
} catch (Exception e) {
e.printStackTrace();
}
}
}
Vectorscan operates on bytes (exclusive end index), but Scanner.scan()
returns Match
objects with inclusive character indices:
getStartPosition()
- Character index where the match startsgetEndPosition()
- Character index where the match ends (inclusive)
This is especially important when working with UTF-8 text where byte and Java character positions differ. If you're not interested in Java characters, you might see some performance improvement in using a lower level API (see below).
- The native Vectorscan library is not thread-safe for the re-use of scratch spaces. Those are encapsulated in Scanners
- Create separate
Scanner
(scratch space) instances for each thread Database
instances are thread-safe for scanning- Always use try-with-resources or explicitly call
close()
onScanner
andDatabase
instances
In addition to the default scanning methods that return Match
objects, hyperscan-java provides callback-based scanning methods for more efficient processing:
-
String-based scanning with callbacks:
void scan(Database db, String input, StringMatchEventHandler eventHandler)
- Handles UTF-8 character to byte mapping automatically
- Invokes your callback with character indices (inclusive start and end)
- Return
false
from callback to stop scanning early
-
Byte-oriented scanning with callbacks:
void scan(Database db, byte[] input, ByteMatchEventHandler eventHandler) void scan(Database db, ByteBuffer input, ByteMatchEventHandler eventHandler)
- Works directly with raw bytes for maximum performance
- Avoids UTF-8 to character mapping overhead
- Provides byte offsets to callback (inclusive start, exclusive end)
- Ideal for binary data or when you need to handle byte offsets yourself
// String-based scanning with callback
scanner.scan(db, inputString, (expression, fromIndex, toIndex) -> {
System.out.printf("Match for pattern '%s' at positions %d-%d: %s%n",
expression.getExpression(),
fromIndex,
toIndex,
inputString.substring((int)fromIndex, (int)toIndex + 1)); // +1 because toIndex is inclusive
return true; // continue scanning
});
// Byte-oriented scanning with callback
byte[] inputBytes = "sample text".getBytes(StandardCharsets.UTF_8);
scanner.scan(db, inputBytes, (expression, fromByteIdx, toByteIdxExclusive) -> {
System.out.printf("Match for pattern '%s' at byte positions %d-%d%n",
expression.getExpression(),
fromByteIdx,
toByteIdxExclusive - 1); // -1 to convert to inclusive index for display
return true; // continue scanning
});
-
Use byte-oriented methods when:
- Working with binary data
- Processing high volumes of data where UTF-8 conversion is a bottleneck
- You want to avoid allocating
Match
objects for very frequent matches - Integrating with code that already works with byte offsets
-
Use string-oriented methods when:
- Working with text data where character positions matter
- You need to extract matched substrings
- The convenience of character-based indices outweighs the performance benefit
This library ships with native binaries for:
- Linux (glibc ≥2.17) - x86_64 and arm64
- macOS - x86_64 and arm64
Windows is no longer supported after version 5.4.0-2.0.0
due to Vectorscan dropping Windows support.
Feel free to raise issues or submit pull requests. Please see the native libraries repository here.
Special thanks to @eliaslevy, @krzysztofzienkiewicz, @swapnilnawale, @mmimica, @Jiar and @apismensky for their contributions.
Thanks to Intel for originally open-sourcing Hyperscan and @VectorCamp for actively maintaining the Vectorscan fork!