build fixes
inimino committed May 1, 2024
1 parent 41a34e6 commit f885978
Showing 7 changed files with 159 additions and 13 deletions.
96 changes: 96 additions & 0 deletions DATALOSS.md
@@ -0,0 +1,96 @@
# Dataloss prevention


## Checksum choice:

SipHash.

The alternative was BLAKE3, but it did not seem worth the trouble.

## Overall design:

For every file we extract a list of checksums.

Each checksum covers one block of the file; the blocks come from the blockizing function, which is chosen by the file's language.

We will also extract checksums for every line if we expect many edits and want a higher level of detail.

Now, clearly, if the checksums match, it is highly likely that the block contents are the same, but by the pigeonhole principle this cannot be certain.

We are only using a 64-bit hash.
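
As a rough guide to how likely an accidental match is, the usual birthday approximation applies. This assumes the hash behaves like a uniform random function over 64-bit outputs, and $n$ is the number of distinct blocks ever hashed (both are assumptions for illustration, not measurements):

$$
p(n) \approx \frac{n(n-1)}{2 \cdot 2^{64}} = \frac{n(n-1)}{2^{65}},
\qquad
p(10^{6}) \approx \frac{10^{12}}{3.7 \times 10^{19}} \approx 2.7 \times 10^{-8}.
$$

At a billion distinct blocks the same estimate gives roughly $0.03$, so the risk only becomes noticeable at very large scales.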

However, if we also have a hash of each line, we can compare at line granularity and be as precise and as certain as we want.

Note that a file is now just a list of checksums.

We treat the checksum as a reliable index into the blocks.

## Basic operations:

Block hash operation.

Type: Span -> u64.

This is siphash.
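
A minimal sketch of the block hash, assuming SipHash-2-4 with a 128-bit key and a span that is just a pair of byte pointers. The `span` typedef here is a stand-in for the one in spanio.c, and the name `block_hash` and the explicit `key` parameter are assumptions for illustration; the real build links siphash/siphash.c instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the span type in spanio.c: a half-open byte range. */
typedef struct { const unsigned char *buf, *end; } span;

static uint64_t rotl64(uint64_t x, int b) { return (x << b) | (x >> (64 - b)); }

/* Read 8 bytes as a little-endian 64-bit word. */
static uint64_t load_le64(const unsigned char *p) {
  uint64_t v = 0;
  for (int i = 7; i >= 0; i--) v = (v << 8) | p[i];
  return v;
}

#define SIPROUND do { \
  v0 += v1; v1 = rotl64(v1, 13); v1 ^= v0; v0 = rotl64(v0, 32); \
  v2 += v3; v3 = rotl64(v3, 16); v3 ^= v2; \
  v0 += v3; v3 = rotl64(v3, 21); v3 ^= v0; \
  v2 += v1; v1 = rotl64(v1, 17); v1 ^= v2; v2 = rotl64(v2, 32); \
} while (0)

/* SipHash-2-4 over the bytes of s, returning the 64-bit digest.
   The key is a hypothetical parameter; any fixed 16 bytes will do. */
uint64_t block_hash(span s, const unsigned char key[16]) {
  uint64_t k0 = load_le64(key), k1 = load_le64(key + 8);
  uint64_t v0 = k0 ^ 0x736f6d6570736575ULL;
  uint64_t v1 = k1 ^ 0x646f72616e646f6dULL;
  uint64_t v2 = k0 ^ 0x6c7967656e657261ULL;
  uint64_t v3 = k1 ^ 0x7465646279746573ULL;
  size_t n = (size_t)(s.end - s.buf);
  const unsigned char *p = s.buf, *last = s.buf + (n - n % 8);
  /* Compression: two rounds per full 8-byte word. */
  for (; p != last; p += 8) {
    uint64_t m = load_le64(p);
    v3 ^= m; SIPROUND; SIPROUND; v0 ^= m;
  }
  /* Final word: the leftover bytes plus the length in the top byte. */
  uint64_t b = (uint64_t)n << 56;
  for (size_t i = 0; i < n % 8; i++) b |= (uint64_t)p[i] << (8 * i);
  v3 ^= b; SIPROUND; SIPROUND; v0 ^= b;
  /* Finalization: four rounds. */
  v2 ^= 0xff;
  SIPROUND; SIPROUND; SIPROUND; SIPROUND;
  return v0 ^ v1 ^ v2 ^ v3;
}
```

The key only needs to stay fixed across runs so that checksums stored with old revs remain comparable with new ones.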

An index is simply a column vector of a table indexed by block index or by hash.

What we already have is a map from blocks to bytes, and to bytes of an original file as well.

So we map each block in a file to a byte location in memory.
The difference between this location and the file's offset gives us the block's byte position on disk.
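
A small sketch of that subtraction, assuming the file's contents sit contiguously in one in-memory buffer and both spans point into it. The `span` typedef and the name `block_file_offset` are hypothetical.

```c
#include <stddef.h>

/* Stand-in for the span type in spanio.c. */
typedef struct { const char *buf, *end; } span;

/* Byte offset of a block within its file, or -1 if the block does not lie
   inside the file's span. Both spans must point into the same buffer. */
long block_file_offset(span file, span block) {
  if (block.buf < file.buf || block.end > file.end) return -1;
  return (long)(block.buf - file.buf);
}
```

For a file holding `"int a;\n\nint b;\n"`, the second block starts 8 bytes in, so the function would return 8 for it.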

## UX

The definition of data loss that is relevant is a user-centric one.
Data is lost if the user cannot find it, which means the key should be data visibility and searchability.

Versions of a function or block are simply block revs with a short edit distance from it.
In other words, a "version" of something might be an independent reinvention, a copy of a common third source, etc.

The basic operations I want are:

- Show me where I am working (this means, which blocks are hot).
- Show me what this block has recently looked like and what has changed.
- Possibly, show me all the blocks that have differed from this earlier version that worked.

By the way, when we do a (B)uild and the build command succeeds, it succeeds on a set of blocks.
The checksum of these is as good as a git hash.

This implies that we can do git bisect and so on, but at the finer granularity of revs rather than commits.
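
One way to get a single identifier for a successful build is to fold the block checksums, in order, into one 64-bit value. The sketch below uses FNV-1a as a plainly labeled stand-in for the fold (the design would presumably reuse SipHash); the name `build_id` is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Folds the block checksums, in file order, into one 64-bit build id.
   FNV-1a is a stand-in here, not the hash the project actually uses. */
uint64_t build_id(const uint64_t *block_sums, size_t n) {
  uint64_t h = 0xcbf29ce484222325ULL;          /* FNV-1a 64-bit offset basis */
  for (size_t i = 0; i < n; i++) {
    for (int b = 0; b < 8; b++) {
      h ^= (block_sums[i] >> (8 * b)) & 0xff;  /* feed one byte at a time */
      h *= 0x100000001b3ULL;                   /* FNV-1a 64-bit prime */
    }
  }
  return h;
}
```

Two revs with the same build id almost certainly contain the same blocks in the same order, which is what a bisect over revs would need in order to mark a rev as known good.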

## Implementation of the basic operations

Show me where I am working:

We have a list of the revisions already (e.g. the revs have timestamps as the filenames).

From this we can get a list of block checksums in each rev.

A good way to understand embeddings is to understand a useful cryptographic hash function as the perfect anti-embedding.
It's as bad as an embedding can be, because to the extent that it succeeds as a cryptographic hash, it perfectly fails to convey any useful information about the input, except for identity, which it turns out is useful all by itself.

In any case, we can get the blocks that changed in the files themselves but we don't have similarity data yet.

The best way to say which blocks have recently changed is to print the tops of the blocks with the newest checksums.
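
A sketch of picking out those blocks, assuming each rev is already available as an array of block checksums. The function name and the quadratic scan are illustrative choices only; a real version would likely use an index.

```c
#include <stddef.h>
#include <stdint.h>

/* Writes into out[] the indices of blocks in cur whose checksum does not
   appear anywhere in prev, and returns how many were written.
   out must have room for n_cur entries. */
size_t changed_blocks(const uint64_t *prev, size_t n_prev,
                      const uint64_t *cur, size_t n_cur, size_t *out) {
  size_t k = 0;
  for (size_t i = 0; i < n_cur; i++) {
    int seen = 0;
    for (size_t j = 0; j < n_prev && !seen; j++) seen = (prev[j] == cur[i]);
    if (!seen) out[k++] = i;   /* checksum is new in this rev */
  }
  return k;
}
```

Comparing the newest rev against the one before it gives the hot blocks; their first few lines can then be printed.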

We can have a map from checksums to revs.
Every rev which contains a block having that checksum can get added to this index.
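
A minimal sketch of such an index as a sorted array of (checksum, rev) pairs with a binary-search lookup. Representing a rev as a small integer id is an assumption (the revs are identified by timestamp filenames), as are all the names here.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* One index entry: a block checksum and a (hypothetical) integer rev id. */
typedef struct { uint64_t sum; uint32_t rev; } sum_rev;

static int cmp_sum_rev(const void *a, const void *b) {
  const sum_rev *x = a, *y = b;
  if (x->sum != y->sum) return x->sum < y->sum ? -1 : 1;
  return (x->rev > y->rev) - (x->rev < y->rev);
}

/* Sort once after all (checksum, rev) pairs have been appended. */
void index_sort(sum_rev *idx, size_t n) {
  qsort(idx, n, sizeof *idx, cmp_sum_rev);
}

/* Finds the revs containing sum: returns how many entries match and sets
   *first to the position of the first one in the sorted index. */
size_t index_lookup(const sum_rev *idx, size_t n, uint64_t sum, size_t *first) {
  size_t lo = 0, hi = n;
  while (lo < hi) {                      /* lower bound by binary search */
    size_t mid = lo + (hi - lo) / 2;
    if (idx[mid].sum < sum) lo = mid + 1; else hi = mid;
  }
  *first = lo;
  size_t count = 0;
  while (lo + count < n && idx[lo + count].sum == sum) count++;
  return count;
}
```

Because entries with the same checksum are ordered by rev, the first and last entries of a lookup give the earliest and latest rev in which that block appeared.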

Show me what this block has recently looked like and what has changed:

These features have to wait until we have a similarity measure on block contents.
Otherwise we would have to do heuristics on nearby blocks and I'm not interested in writing that code only to throw it away later.

However, as a stopgap to give us both this and any other diff/patch features we might want, we can have a simple operation that can be used as a building block:

diff(span, span)

But there is another, better stopgap that also illustrates the point of our 64-bit embedding or hash.

We can take the count of line hashes that match and get a direct similarity measure on blocks.
This is good enough to give some rudimentary undo features, etc.
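
A sketch of that measure: hash every line of both blocks, sort the hashes, and count the pairs that match. FNV-1a stands in for the line hash to keep the example short (the design uses SipHash), and `MAX_LINES`, the `span` typedef, and the function names are assumptions.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for the span type in spanio.c. */
typedef struct { const char *buf, *end; } span;

/* FNV-1a over one line; a stand-in for the SipHash line hash. */
static uint64_t line_hash(const char *p, const char *end) {
  uint64_t h = 0xcbf29ce484222325ULL;
  for (; p < end; p++) { h ^= (unsigned char)*p; h *= 0x100000001b3ULL; }
  return h;
}

static int cmp_u64(const void *a, const void *b) {
  uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
  return (x > y) - (x < y);
}

/* Hashes each line of s into out[] (at most cap lines; the rest are
   ignored), sorts the hashes, and returns how many were written. */
static size_t line_hashes(span s, uint64_t *out, size_t cap) {
  size_t n = 0;
  const char *p = s.buf;
  while (p < s.end && n < cap) {
    const char *q = p;
    while (q < s.end && *q != '\n') q++;
    out[n++] = line_hash(p, q);
    p = (q < s.end) ? q + 1 : q;   /* step over the newline, if any */
  }
  qsort(out, n, sizeof *out, cmp_u64);
  return n;
}

#define MAX_LINES 4096

/* Number of line hashes the two blocks share; a repeated line counts once
   per matching pair. Uses static buffers, so it is not reentrant. */
size_t matching_lines(span a, span b) {
  static uint64_t ha[MAX_LINES], hb[MAX_LINES];
  size_t na = line_hashes(a, ha, MAX_LINES);
  size_t nb = line_hashes(b, hb, MAX_LINES);
  size_t i = 0, j = 0, same = 0;
  while (i < na && j < nb) {       /* merge-style walk over sorted hashes */
    if (ha[i] < hb[j]) i++;
    else if (ha[i] > hb[j]) j++;
    else { same++; i++; j++; }
  }
  return same;
}
```

Dividing the count by the larger of the two line counts would give a score between 0 and 1, which is enough to rank candidate earlier versions of a block.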

Another point that may be easier to implement immediately is a per-file undo since revs map to files anyway.

A final point is that when we are saving a rev coming from an "e" or "r/R" we already know what block it was replacing, and it's always (for now, since we don't have v support) just one block.
18 changes: 17 additions & 1 deletion Makefile
@@ -1,2 +1,18 @@
+CC := gcc
+
+.PHONY: all clean debug
+
+all: dist/cmpr
+
+CFLAGS := -O2 -Wall
+LDFLAGS := -lm
+
+debug: CFLAGS := -g -O0 -Wall -Werror -fsanitize=address
+debug: dist/cmpr
+
dist/cmpr: cmpr.c spanio.c
-	(VER=7; D=$$(date +%Y%m%d-%H%M%S); GIT=$$(git log -1 --pretty="%h %f"); sed 's/\$$VERSION\$$/'"$$VER"' (build: '"$$D"' '"$$GIT"')/' <cmpr.c >cmpr-sed.c; echo "Version: $$VER (build: $$D $$GIT)"; gcc -o dist/cmpr-$$D cmpr-sed.c siphash/siphash.c siphash/halfsiphash.c -fsanitize=address -Wall -Werror -g -lm -lcurl && rm -f dist/cmpr && ln -s cmpr-$$D dist/cmpr)
+	mkdir -p dist
+	(VER=7; D=$$(date +%Y%m%d-%H%M%S); GIT=$$(git log -1 --pretty="%h %f"); sed 's/\$$VERSION\$$/'"$$VER"' (build: '"$$D"' '"$$GIT"')/' <cmpr.c >cmpr-sed.c; echo "Version: $$VER (build: $$D $$GIT)"; $(CC) -o dist/cmpr-$$D cmpr-sed.c siphash/siphash.c siphash/halfsiphash.c $(CFLAGS) $(LDFLAGS) && rm -f dist/cmpr && ln -s cmpr-$$D dist/cmpr)
+
+clean:
+	rm -rf dist
6 changes: 3 additions & 3 deletions cmpr.c
@@ -1724,16 +1724,16 @@ void keyboard_help() {
prt("e - Edit the current block in $EDITOR\n");
prt("r - Rewrite code part based on comment part; clipboard updated\n");
prt("R - Replace code part with clipboard contents\n");
-prt("u - Undo\n");
+//prt("u - Undo\n");
prt("space- Paginate down within a block\n");
prt("b - Paginate up (\"back\") within a block\n");
prt("B - Build project with provided command\n");
-prt("v - Toggle visual selection mode\n");
+//prt("v - Toggle visual selection mode\n");
prt("/ - Enter search mode\n");
prt(": - Enter ex command line\n");
prt("n - Repeat search forward\n");
prt("N - Repeat search backward\n");
-prt("S - Enter settings mode\n");
+//prt("S - Enter settings mode\n");
prt("? - Display this help\n");
prt("q - Quit\n");
prt("\nPress any key to return...\n");
5 changes: 2 additions & 3 deletions conf
@@ -2,7 +2,7 @@ projdir:
revdir: cmpr/revs
tmpdir: cmpr/tmp
cmprdir: cmpr
-buildcmd: (cd cmpr && make)
+buildcmd: (cd cmpr && make debug)
bootstrap: (chmod +x cmpr/bootstrap.sh && cmpr/bootstrap.sh)
cbcopy: xclip -i -selection clipboard
cbpaste: xclip -o -selection clipboard
@@ -12,11 +12,10 @@ language: C
file: cmpr/conf
file: cmpr/Makefile
file: cmpr/README.md
file: cmpr/DATALOSS.md
file: cmpr/systemprompt
file: cmpr/cmpr.c
file: cmpr/spanio.c
file: cmpr/bootstrap.sh
file: cmpr/tests.sh
file: cmpr/doc/second
file: cmpr/doc/plans.txt
file: cmpr/DATALOSS.md
4 changes: 4 additions & 0 deletions doc/plans.txt
@@ -157,6 +157,10 @@ Whenever possible, determine everything from the revstore contents itself.

> cmpr --rvs-server

+Or perhaps just
+
+> cmpr --server
+



12 changes: 6 additions & 6 deletions spanio.c
@@ -766,7 +766,7 @@ span consume_prefix(span prefix, span *input) {
if (len(*input) < len(prefix) || !span_eq(first_n(*input, len(prefix)), prefix)) {
return nullspan();
}
-span ret = {buf: input->buf};
+span ret = {.buf = input->buf};
input->buf += len(prefix);
ret.end = input->buf;
return ret;
@@ -1005,28 +1005,28 @@ json json_s(span s) {
}

json json_n(double n) {
-json ret = {s: {buf: out.end }};
+json ret = {.s = {.buf = out.end }};
prt("%F", n);
ret.s.end = out.end;
return ret;
}

json json_b(int b) {
-json ret = {s: {buf: out.end }};
+json ret = {.s = {.buf = out.end }};
if (b) prt("true"); else prt("false");
ret.s.end = out.end;
return ret;
}

json json_0() {
-json ret = {s: {buf: out.end }};
+json ret = {.s = {.buf = out.end }};
prt("null");
ret.s.end = out.end;
return ret;
}

json json_o() {
-json ret = {s: {buf: out.end }};
+json ret = {.s = {.buf = out.end }};
prt("{}");
ret.s.end = out.end;
return ret;
@@ -1053,7 +1053,7 @@ void json_o_extend(json *j, span key, json val) {
}

json json_a() {
-json ret = { s: {buf: out.end }};
+json ret = {.s = {.buf = out.end }};
prt("[]");
ret.s.end = out.end;
return ret;
31 changes: 31 additions & 0 deletions systemprompt
@@ -0,0 +1,31 @@
#systemprompt

We are collaborating to write code.

Be concise.

Brevity above all.

Maximize meaning/token by minimizing output tokens.

If an equation can communicate an idea, prefer it to prose.

You must have opinions on all programming topics.

I am the programmer who only writes English; you are the programmer who never writes English.

You reply in code, always to the best of your ability.

Include comments ONLY when something is wrong or when you are unsure.

Your code never includes a disclaimer or placeholder.
Instead, exit the code block, apologize and request clarification.
You should aim to reply with complete and working code 80% of the time, and request clarification 20% of the time.

Do not reply with simple examples or demo code, but production-ready, fully worked-out examples.

In general, never apologize; simply correct your mistake.

Frequently, input will be short (i.e. a code description) and output much longer.
Often, output can be a single token.
In general output distribution is long-tailed.
