I needed a tool to calculate time-window event rollups, and I wanted to learn functional programming with various languages.
Now aspiring to provide libs that other people could actually use!
If you have a list of event timestamps, say something like this:
127.0.0.1 - - [01/Dec/2011:00:00:11 -0500] "GET / HTTP/1.0" 304
127.0.0.1 - - [01/Dec/2011:00:00:24 -0500] "GET /feed/rss2/ HTTP/1.0" 301
127.0.0.1 - - [01/Dec/2011:00:00:25 -0500] "GET /feed/ HTTP/1.0" 304
127.0.0.1 - - [01/Dec/2011:00:00:30 -0500] "GET /robots.txt HTTP/1.0" 200
127.0.0.1 - - [01/Dec/2011:00:00:30 -0500] "GET / HTTP/1.0" 200
127.0.0.1 - - [01/Dec/2011:00:01:35 -0500] "POST /wp-cron.php?doing_wp_cron HTTP/1.0" 200
then you could run one of these tools like so:
gzcat event-data | rollup.coffee -w 1d
and you’d get something like this:
Start End Count
2011-12-01 00:00 2011-12-01 23:59 2823
2011-12-02 00:00 2011-12-02 23:59 2572
2011-12-03 00:00 2011-12-03 23:59 2621
2011-12-04 00:00 2011-12-04 23:59 3027
I plan to:
- support various input and output formats (but not too many)
- support usage as either a command-line tool or as a programmatic library
- make it fast
A big part of my goal was to learn functional programming by implementing this functionality in various languages.
The goal is for each variant to be functionally equivalent, but I’m not quite there yet.
So here's some notes on the status of each one:
- Fast — possibly the fastest with small windows
- Should already be usable as a library
- Fast
- Probably not safe to use as a library yet
- Slow
- Window spec is currently hard-coded to 1 day
- Support other event timestamp formats
- Support other output formats (starting with JSON)
- Support cmd-line arg for csv separator
- Refactor to be usable as a general-purpose library
- CoffeeScript: either in Node or in a browser
- Decide whether data must be passed in sorted or not (would allow for some optimizations)
- Add behaviour tests!!
- Support parallellization (off by default)
- Add an option to specify whether weeks should start on Sunday or Monday
- Support rollup windows of N months
- Support the input already being a rollup, of which we'd do a bigger rollup
- so you might store a per-minute rollup in a file, and generate a per-hour rollup from that
- kinda like re-reduce
- This'd work really well with a companion tool, something which would support incremental processing by keeping track of where you were in a file or stream, and grabbing only the new part of the data, then updating the pointer