Relational pipes

One of the great parts of hacker culture [1] and the GNU way is the invention [2] of pipes and the idea [3] that one program should do one thing and do it well.

Each running program (process) has one input stream (called standard input or STDIN) and one output stream (called standard output or STDOUT) and also one additional output stream for logging/errors/warnings (STDERR). We can connect programs and pass the STDOUT of the first one to the STDIN of the second one (etc.) using pipes.
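The three streams can be seen in a small sketch. The function name produce and its messages are made up for this illustration; the point is only that a pipe carries STDOUT, while STDERR goes its own way.

```shell
# Hypothetical producer: data goes to STDOUT, diagnostics go to STDERR.
produce() {
  echo "record 1"
  echo "warning: something looks odd" >&2
  echo "record 2"
}

# The pipe connects only the STDOUT of produce to the STDIN of wc.
# STDERR bypasses the pipe; here it is discarded with 2>/dev/null.
# tr removes the padding that some wc implementations add.
line_count="$(produce 2>/dev/null | wc -l | tr -d ' ')"
echo "$line_count"
# prints: 2
```

Without the 2>/dev/null redirection the warning would still not reach wc; it would appear on the terminal instead.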

A classic pipeline example (explained):

cat animals.txt | grep dog | cut -d " " -f 2 | tr a-z A-Z
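To see what each stage does, the pipeline can be run on a small sample. The content of animals.txt is not shown here, so the data below (size, color, species separated by single spaces) is an assumption made only for this sketch, as is the wrapper function classic_pipeline.

```shell
# Hypothetical sample data; the real animals.txt is not shown in this text.
animals_file="$(mktemp)"
printf '%s\n' \
  'large white cat' \
  'big yellow dog' \
  'small white dog' \
  'medium green turtle' > "$animals_file"

# grep keeps lines containing "dog",
# cut takes the 2nd space-separated field,
# tr converts lowercase letters to uppercase.
classic_pipeline() {
  cat "$1" | grep dog | cut -d " " -f 2 | tr a-z A-Z
}

classic_pipeline "$animals_file"
# prints:
# YELLOW
# WHITE
```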

According to this principle we can build complex and powerful programs (pipelines) by composing several simple, single-purpose and reusable programs. Such single-purpose programs (often called filters) are much easier to create, test and optimize, and their authors don't have to worry about the complexity of the final pipeline. They don't even have to know how their programs will be used in the future by others. This is a great design principle that brings us flexibility, reusability, efficiency and reliability. In any role (author of a filter, builder of a pipeline etc.), we can always focus on our own task and do it well. [4] And we can collaborate with others even if we don't know about them and don't know that we are collaborating. Now think about putting this together with the free software ideas... How very!

But the question is how the data passed through pipes should be formatted and structured. There is a wide spectrum of options, from simple unstructured text files (just arrays of lines) through various DSV (delimiter-separated values) formats to formats like XML (YAML, JSON, ASN.1, Diameter, S-expressions etc.). Simpler formats look tempting but have many problems and limitations (see the Pitfalls section in the Classic pipeline example). On the other hand, the advanced formats are capable of representing arbitrary object tree structures or even arbitrary graphs. They offer unlimited possibilities – and this is their strength and weakness at the same time.
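A minimal sketch of the kind of problem that simple text formats tend to have follows. The sample records are invented for this illustration, and whether these are the same cases the Pitfalls section discusses is an assumption.

```shell
# Hypothetical records: size, color, species separated by single spaces.
# The second record has a two-word color; the third is not a dog at all.
data_file="$(mktemp)"
printf '%s\n' \
  'small white dog' \
  'big dark brown dog' \
  'small yellow hotdog' > "$data_file"

# Intent: print the color (2nd field) of every dog.
colors="$(grep dog "$data_file" | cut -d " " -f 2)"
printf '%s\n' "$colors"
# prints:
# white
# dark
# yellow
```

Two things go wrong silently: "dark brown" is cut to "dark" because the value contains the delimiter, and "hotdog" is matched because grep looks at the whole line rather than at the species attribute.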

It is not about the shape of the brackets, apostrophes, quotes or text vs. binary. It is not a technical question – it lies in the semantic layer and the human brain. Generic formats and their arbitrary object trees/graphs are (for humans, not for computers) difficult to understand and work with – compared to simpler structures like arrays, maps or matrices.

This is the reason why we have chosen the relational model as our logical model. This model dates from 1969 [5] and over the decades it has proven its qualities and viability. This logical model is powerful enough to describe almost any data and – at the same time – it is still simple and easy for humans to understand.

Thus, Relational pipes are streams containing zero or more relations. Each relation has a name, one or more attributes and zero or more records (tuples). Each attribute has a name and a data type. Records contain attribute values. We can imagine this stream as a sequence of tables (but the table is only one of many possible visual representations of such relational data).
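The structure described above can be pictured with a small sketch. It prints a purely illustrative text rendering and is NOT the actual Relational pipes format; the function names, the relation names and the type names string and integer are assumptions made for this example.

```shell
# One relation: a name, typed attributes and zero or more records.
# Illustrative text rendering only, not the real stream format.
print_relation() {
  name="$1"; header="$2"; records="$3"
  printf 'relation: %s\n' "$name"
  printf 'attributes: %s\n' "$header"
  # A relation may have no records at all.
  if [ -n "$records" ]; then
    printf '%s\n' "$records" | sed 's/^/record: /'
  fi
}

# A stream is a sequence of zero or more relations.
stream() {
  print_relation "animals" "name:string legs:integer" "$(printf 'dog 4\nspider 8')"
  print_relation "empty_one" "id:integer" ""
}

stream
# prints:
# relation: animals
# attributes: name:string legs:integer
# record: dog 4
# record: spider 8
# relation: empty_one
# attributes: id:integer
```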

What are Relational pipes?

Relational pipes are an open data format designed for streaming structured data between two processes. Simultaneously with the format specification, we are also developing a reference implementation (libraries and tools) as free software. Although we believe in the specification-first (or contract-first) approach, we always check whether the theoretical concepts are feasible and whether they can be reasonably and reliably implemented. So before publishing any new specification or its version, we will verify it by creating a reference implementation in at least one programming language.

More generally, Relational pipes are a philosophical continuation of the classic *NIX pipelines and the relational model.

What are Relational pipes not?

Relational pipes respect the existing ecosystem and are an improvement or supplement rather than a replacement. So Relational pipes are not a:

  • Shell – we use existing shells (e.g. GNU Bash), work with any shell and even without a shell (e.g. as a stream format passed through a network or stored in a file).
  • Terminal emulator – same as with shells, we use existing terminals and we can use Relational pipes also outside any terminal; if we interact with the terminal, we use standard means like Unicode, ANSI escape sequences etc.
  • IDE – we can use standard *NIX tools as an IDE (GNU Screen, Emacs, Make etc.) or any other IDE.
  • Programming language – Relational pipes are a language-independent data format and can be produced or consumed in any programming language.
  • Query language – although some of our tools do queries, filtering or transformations, we are not inventing a new query language – instead, we use existing languages like SQL, XPath, Scheme, AWK or regular expressions.
  • Database system, DBMS – we focus on stream processing rather than data storage, although sometimes it makes sense to redirect data to a file and continue with the processing later.

Project status

The main ideas and the roadmap are quite clear, but many things will change (including the format internals and interfaces of the libraries and tools). Because we understand how important API and ABI stability is, we are not ready to publish version 1.0 yet.

On the other hand, the already published tools (tagged as v0.x in the v_0 branch) should work quite well (should compile, should run, should not segfault often, should not wipe your hard drive or kill your cat), so they might be useful for someone who likes our ideas and who is prepared to update their own programs and scripts when the new version is ready.

We promise to fulfill all requirements of the Sane software manifesto before we release version 1.0:

Sane software manifesto

Big picture

A typical relational pipeline consists of an input filter (left), an output filter (right) and zero, one or more transformations (middle) as outlined in this diagram: [6]

Data can flow through a sequence of several transformations or directly from the input filter to the output filter.
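The shape of such a pipeline can be sketched with three shell functions. They are hypothetical stand-ins, not the project's tools, and the tab-separated text passed between them is an assumption used only to keep the example self-contained; it is not the Relational pipes format.

```shell
# Input filter: turns some external source into the stream.
input_filter() {
  printf 'dog\t4\nspider\t8\nsnake\t0\n'
}

# Transformation: reads the stream and writes a modified stream.
# Here it keeps records whose second attribute is greater than zero.
transformation() {
  awk -F '\t' '$2 > 0'
}

# Output filter: converts the stream into a form for humans or other software.
output_filter() {
  awk -F '\t' '{ printf "%s has %d legs\n", $1, $2 }'
}

# With one transformation in the middle:
input_filter | transformation | output_filter
# prints:
# dog has 4 legs
# spider has 8 legs

# With no transformation at all:
input_filter | output_filter
# prints:
# dog has 4 legs
# spider has 8 legs
# snake has 0 legs
```

Because every stage reads STDIN and writes STDOUT, transformations can be added, removed or reordered without touching the input and output filters.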

[1] formerly UNIX (sometimes called *NIX), now mostly GNU/Linux (see also hacking and GNU/Linux FAQ)

[2] which is attributed to Doug McIlroy, see The Art of Unix Programming: Pipes, Redirection, and Filters

[3] see The Art of Unix Programming: Basics of the Unix Philosophy

[4] see cluelessness by Jaroslav Tulach in his Practical API Design. Confessions of a Java Framework Architect

[5] invented and described by Edgar F. Codd, see Derivability, Redundancy, and Consistency of Relations Stored in Large Data Banks, Research Report, IBM from 1969 and A Relational Model of Data for Large Shared Data Banks from 1970, see also Relational model

[6] The diagram was made in the Tiled Map Editor and the tiles (particular pipe graphics parts) come from the game called Pipepanic. The tiles used in this diagram are licensed under the Free Art License.

Relational pipes, open standard and free software © 2018-2022 GlobalCode