-
Notifications
You must be signed in to change notification settings - Fork 5
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
7 changed files
with
159 additions
and
13 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,96 @@ | ||
# Dataloss prevention | ||
|
||
|
||
## Checksum choice: | ||
|
||
Siphash. | ||
|
||
Alternate was blake3 but it seemed not worth the trouble. | ||
|
||
## Overall design: | ||
|
||
For every file we extract a list of checksums. | ||
|
||
These are the blocks in that file, and the blocks are determined by the blockizing function which is determined by the language. | ||
|
||
We will also extract checksums for every line if we expect many edits and want a higher level of detail. | ||
|
||
Now, clearly, if the checksums match it is highly likely that the block contents are the same but by the pigeonhole principle it cannot be certain. | ||
|
||
We are only using a 64-bit hash. | ||
|
||
However, if we have a hash of the lines, we can be as precise as we want and as certain as we want. | ||
|
||
Note that a file is now just a list of checksums. | ||
|
||
We treat the checksum as a reliable index into the blocks. | ||
|
||
## Basic operations: | ||
|
||
Block hash operation. | ||
|
||
Type: Span -> u64. | ||
|
||
This is siphash. | ||
|
||
An index is simply a column vector of a table indexed by block index or by hash. | ||
|
||
What we already have is a map from blocks to bytes, and to bytes of an original file as well. | ||
|
||
So we index the blocks in a file to a location in bytes in memory. | ||
The differences between this byte and the file offset gives us the bytes on disk. | ||
|
||
## UX | ||
|
||
The definition of data loss that is relevant is a user-centric one. | ||
Data is lost if the user cannot find it, which means the key should be data visibility and searchability. | ||
|
||
Versions of a function or block are simply block revs with a short edit distance from it. | ||
In other words a "version" of something might be an independent reinvention, a copy of a common third source, etc. | ||
|
||
The basic operations I want are: | ||
|
||
- Show me where I am working (this means, which blocks are hot). | ||
- Show me what this block has recently looked like and what has changed. | ||
- Possibly, show me all the blocks that have differed from this earlier version that worked. | ||
|
||
By the way, when we do a (B)uild and the build command succeeds, it succeeds on a set of blocks. | ||
The checksum of these is as good as a git hash. | ||
|
||
This implies that we can do git bisect and so on, but on the finer level of revs not commits. | ||
|
||
## Implementation of the basic operations | ||
|
||
Show me where I am working: | ||
|
||
We have a list of the revisions already (e.g. the revs have timestamps as the filenames). | ||
|
||
From this we can get a list of block checksums in each rev. | ||
|
||
A good way to understand embeddings is to understand a useful cryptographic hash function as the perfect anti-embedding. | ||
It's as bad as an embedding can be, because to the extent that it succeeds as a cryptographic hash, it perfectly fails to convey any useful information about the input, except for identity, which it turns out is useful all by itself. | ||
|
||
In any case, we can get the blocks that changed in the files themselves but we don't have similarity data yet. | ||
|
||
The best way to say which blocks have recently changed is to print the tops of the blocks with the newest checksums. | ||
|
||
We can have a map from checksums to revs. | ||
Every rev which contains a block having that checksum can get added to this index. | ||
|
||
Show me what this block has recently looked like and what has changed: | ||
|
||
These features have to wait until we have a similarity measure on block contents. | ||
Otherwise we would have to do heuristics on nearby blocks and I'm not interested in writing that code only to throw it away later. | ||
|
||
However, as a stopgap to give us both this and any other diff/patch features we might want, we can have a simple operation that can be used as a building block: | ||
|
||
diff(span, span) | ||
|
||
But there is another, better stopgap that better illustrates the point of our 64b embedding or hash. | ||
|
||
We can take the count of line hashes that match and get a direct similarity measure on blocks. | ||
This is good enough to give some rudimentary undo features, etc. | ||
|
||
Another point that may be easier to implement immediately is a per-file undo since revs map to files anyway. | ||
|
||
A final point is that when we are saving a rev coming from an "e" or "r/R" we already know what block it was replacing, and it's always (for now, since we don't have v support) just one block. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,2 +1,18 @@ | ||
CC := gcc | ||
|
||
.PHONY: all clean debug | ||
|
||
all: dist/cmpr | ||
|
||
CFLAGS := -O2 -Wall | ||
LDFLAGS := -lm | ||
|
||
debug: CFLAGS := -g -O0 -Wall -Werror -fsanitize=address | ||
debug: dist/cmpr | ||
|
||
dist/cmpr: cmpr.c spanio.c | ||
(VER=7; D=$$(date +%Y%m%d-%H%M%S); GIT=$$(git log -1 --pretty="%h %f"); sed 's/\$$VERSION\$$/'"$$VER"' (build: '"$$D"' '"$$GIT"')/' <cmpr.c >cmpr-sed.c; echo "Version: $$VER (build: $$D $$GIT)"; gcc -o dist/cmpr-$$D cmpr-sed.c siphash/siphash.c siphash/halfsiphash.c -fsanitize=address -Wall -Werror -g -lm -lcurl && rm -f dist/cmpr && ln -s cmpr-$$D dist/cmpr) | ||
mkdir -p dist | ||
(VER=7; D=$$(date +%Y%m%d-%H%M%S); GIT=$$(git log -1 --pretty="%h %f"); sed 's/\$$VERSION\$$/'"$$VER"' (build: '"$$D"' '"$$GIT"')/' <cmpr.c >cmpr-sed.c; echo "Version: $$VER (build: $$D $$GIT)"; $(CC) -o dist/cmpr-$$D cmpr-sed.c siphash/siphash.c siphash/halfsiphash.c $(CFLAGS) $(LDFLAGS) && rm -f dist/cmpr && ln -s cmpr-$$D dist/cmpr) | ||
|
||
clean: | ||
rm -rf dist |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,31 @@ | ||
#systemprompt | ||
|
||
We are collaborating to write code. | ||
|
||
Be concise. | ||
|
||
Brevity above all. | ||
|
||
Maximize meaning/token by minimizing output tokens. | ||
|
||
If an equation can communicate an idea, prefer it to prose. | ||
|
||
You must have opinions on all programming topics. | ||
|
||
I am the programmer who only writes English; you are the programmer who never writes English. | ||
|
||
You reply in code, always to the best of your ability. | ||
|
||
Include comments ONLY when something is wrong or when you are unsure. | ||
|
||
Your code never includes a disclaimer or placeholder. | ||
Instead, exit the code block, apologize and request clarification. | ||
You should aim to reply with complete and working code 80% of the time, and request clarification 20% of the time. | ||
|
||
Do not reply with simple examples or demo code, but production-ready, fully worked-out examples. | ||
|
||
In general, never apologize; simply correct your mistake. | ||
|
||
Frequently, input will be short (i.e. a code description) and output much longer. | ||
Often, output can be a single token. | ||
In general output distribution is long-tailed. |