build fixes

inimino · May 1, 2024 · f885978
1 parent 41a34e6
commit f885978
Showing 7 changed files with 159 additions and 13 deletions.
diff --git a/DATALOSS.md b/DATALOSS.md
@@ -0,0 +1,96 @@
+# Dataloss prevention
+
+
+## Checksum choice:
+
+Siphash.
+
+Alternate was blake3 but it seemed not worth the trouble.
+
+## Overall design:
+
+For every file we extract a list of checksums.
+
+These are the blocks in that file, and the blocks are determined by the blockizing function which is determined by the language.
+
+We will also extract checksums for every line if we expect many edits and want a higher level of detail.
+
+Now, clearly, if the checksums match it is highly likely that the block contents are the same but by the pigeonhole principle it cannot be certain.
+
+We are only using a 64-bit hash.
+
+However, if we have a hash of the lines, we can be as precise as we want and as certain as we want.
+
+Note that a file is now just a list of checksums.
+
+We treat the checksum as a reliable index into the blocks.
+
+## Basic operations:
+
+Block hash operation.
+
+Type: Span -> u64.
+
+This is siphash.
+
+An index is simply a column vector of a table indexed by block index or by hash.
+
+What we already have is a map from blocks to bytes, and to bytes of an original file as well.
+
+So we index the blocks in a file to a location in bytes in memory.
+The differences between this byte and the file offset gives us the bytes on disk.
+
+## UX
+
+The definition of data loss that is relevant is a user-centric one.
+Data is lost if the user cannot find it, which means the key should be data visibility and searchability.
+
+Versions of a function or block are simply block revs with a short edit distance from it.
+In other words a "version" of something might be an independent reinvention, a copy of a common third source, etc.
+
+The basic operations I want are:
+
+- Show me where I am working (this means, which blocks are hot).
+- Show me what this block has recently looked like and what has changed.
+- Possibly, show me all the blocks that have differed from this earlier version that worked.
+
+By the way, when we do a (B)uild and the build command succeeds, it succeeds on a set of blocks.
+The checksum of these is as good as a git hash.
+
+This implies that we can do git bisect and so on, but on the finer level of revs not commits.
+
+## Implementation of the basic operations
+
+Show me where I am working:
+
+We have a list of the revisions already (e.g. the revs have timestamps as the filenames).
+
+From this we can get a list of block checksums in each rev.
+
+A good way to understand embeddings is to understand a useful cryptographic hash function as the perfect anti-embedding.
+It's as bad as an embedding can be, because to the extent that it succeeds as a cryptographic hash, it perfectly fails to convey any useful information about the input, except for identity, which it turns out is useful all by itself.
+
+In any case, we can get the blocks that changed in the files themselves but we don't have similarity data yet.
+
+The best way to say which blocks have recently changed is to print the tops of the blocks with the newest checksums.
+
+We can have a map from checksums to revs.
+Every rev which contains a block having that checksum can get added to this index.
+
+Show me what this block has recently looked like and what has changed:
+
+These features have to wait until we have a similarity measure on block contents.
+Otherwise we would have to do heuristics on nearby blocks and I'm not interested in writing that code only to throw it away later.
+
+However, as a stopgap to give us both this and any other diff/patch features we might want, we can have a simple operation that can be used as a building block:
+
+diff(span, span)
+
+But there is another, better stopgap that better illustrates the point of our 64b embedding or hash.
+
+We can take the count of line hashes that match and get a direct similarity measure on blocks.
+This is good enough to give some rudimentary undo features, etc.
+
+Another point that may be easier to implement immediately is a per-file undo since revs map to files anyway.
+
+A final point is that when we are saving a rev coming from an "e" or "r/R" we already know what block it was replacing, and it's always (for now, since we don't have v support) just one block.
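
Reviewer note on DATALOSS.md: the "count of line hashes that match" similarity measure described above can be sketched in a few lines of C. This is only an illustration under stated assumptions, not code from this commit: `line_hash` is an FNV-1a stand-in for the siphash call the notes name, and `hash_lines`, `matching_lines`, and the `MAX_LINES` cap are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in 64-bit hash (FNV-1a). The notes use siphash here; this is an assumption. */
uint64_t line_hash(const char *p, size_t n) {
    uint64_t h = 14695981039346656037ULL;
    for (size_t i = 0; i < n; i++) {
        h ^= (unsigned char)p[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* Hash each line of a NUL-terminated text into out[]; returns the line count, capped at max. */
size_t hash_lines(const char *text, uint64_t *out, size_t max) {
    size_t count = 0;
    const char *p = text;
    while (*p && count < max) {
        const char *nl = strchr(p, '\n');
        size_t n = nl ? (size_t)(nl - p) : strlen(p);
        out[count++] = line_hash(p, n);
        p += n + (nl ? 1 : 0);
    }
    return count;
}

/* Count lines of block a whose hash also occurs in block b.
   Each line of b is matched at most once, so duplicates are not overcounted. */
size_t matching_lines(const char *a, const char *b) {
    enum { MAX_LINES = 1024 }; /* hypothetical cap for the sketch */
    uint64_t ha[MAX_LINES], hb[MAX_LINES];
    unsigned char used[MAX_LINES] = {0};
    assert(a != NULL && b != NULL);
    size_t na = hash_lines(a, ha, MAX_LINES);
    size_t nb = hash_lines(b, hb, MAX_LINES);
    size_t matches = 0;
    for (size_t i = 0; i < na; i++) {
        for (size_t j = 0; j < nb; j++) {
            if (!used[j] && ha[i] == hb[j]) {
                used[j] = 1;
                matches++;
                break;
            }
        }
    }
    return matches;
}
```

The quadratic scan keeps the sketch short; sorting the hashes or using a hash set would be the obvious refinement for large blocks. Dividing the count by the larger of the two line counts would give a score between 0 and 1, if a normalized measure is wanted.
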
diff --git a/Makefile b/Makefile
@@ -1,2 +1,18 @@
+CC := gcc
+
+.PHONY: all clean debug
+
+all: dist/cmpr
+
+CFLAGS := -O2 -Wall
+LDFLAGS := -lm
+
+debug: CFLAGS := -g -O0 -Wall -Werror -fsanitize=address
+debug: dist/cmpr
+
 dist/cmpr: cmpr.c spanio.c
-	(VER=7; D=$$(date +%Y%m%d-%H%M%S); GIT=$$(git log -1 --pretty="%h %f"); sed 's/\$$VERSION\$$/'"$$VER"' (build: '"$$D"' '"$$GIT"')/' <cmpr.c >cmpr-sed.c; echo "Version: $$VER (build: $$D $$GIT)"; gcc -o dist/cmpr-$$D cmpr-sed.c siphash/siphash.c siphash/halfsiphash.c -fsanitize=address -Wall -Werror -g -lm -lcurl && rm -f dist/cmpr && ln -s cmpr-$$D dist/cmpr)
+	mkdir -p dist
+	(VER=7; D=$$(date +%Y%m%d-%H%M%S); GIT=$$(git log -1 --pretty="%h %f"); sed 's/\$$VERSION\$$/'"$$VER"' (build: '"$$D"' '"$$GIT"')/' <cmpr.c >cmpr-sed.c; echo "Version: $$VER (build: $$D $$GIT)"; $(CC) -o dist/cmpr-$$D cmpr-sed.c siphash/siphash.c siphash/halfsiphash.c $(CFLAGS) $(LDFLAGS) && rm -f dist/cmpr && ln -s cmpr-$$D dist/cmpr)
+
+clean:
+	rm -rf dist
diff --git a/cmpr.c b/cmpr.c
@@ -1724,16 +1724,16 @@ void keyboard_help() {
     prt("e    - Edit the current block in $EDITOR\n");
     prt("r    - Rewrite code part based on comment part; clipboard updated\n");
     prt("R    - Replace code part with clipboard contents\n");
-    prt("u    - Undo\n");
+    //prt("u    - Undo\n");
     prt("space- Paginate down within a block\n");
     prt("b    - Paginate up (\"back\") within a block\n");
     prt("B    - Build project with provided command\n");
-    prt("v    - Toggle visual selection mode\n");
+    //prt("v    - Toggle visual selection mode\n");
     prt("/    - Enter search mode\n");
     prt(":    - Enter ex command line\n");
     prt("n    - Repeat search forward\n");
     prt("N    - Repeat search backward\n");
-    prt("S    - Enter settings mode\n");
+    //prt("S    - Enter settings mode\n");
     prt("?    - Display this help\n");
     prt("q    - Quit\n");
     prt("\nPress any key to return...\n");

diff --git a/conf b/conf
@@ -2,7 +2,7 @@ projdir:
 revdir: cmpr/revs
 tmpdir: cmpr/tmp
 cmprdir: cmpr
-buildcmd: (cd cmpr && make)
+buildcmd: (cd cmpr && make debug)
 bootstrap: (chmod +x cmpr/bootstrap.sh && cmpr/bootstrap.sh)
 cbcopy: xclip -i -selection clipboard
 cbpaste: xclip -o -selection clipboard
@@ -12,11 +12,10 @@ language: C
 file: cmpr/conf
 file: cmpr/Makefile
 file: cmpr/README.md
-file: cmpr/DATALOSS.md
 file: cmpr/systemprompt
 file: cmpr/cmpr.c
 file: cmpr/spanio.c
 file: cmpr/bootstrap.sh
-file: cmpr/tests.sh
 file: cmpr/doc/second
 file: cmpr/doc/plans.txt
+file: cmpr/DATALOSS.md
diff --git a/doc/plans.txt b/doc/plans.txt
@@ -157,6 +157,10 @@ Whenever possible, determine everything from the revstore contents itself.
 
 > cmpr --rvs-server
 
+Or perhaps just
+
+> cmpr --server
+
 
 
 

diff --git a/spanio.c b/spanio.c
@@ -766,7 +766,7 @@ span consume_prefix(span prefix, span *input) {
   if (len(*input) < len(prefix) || !span_eq(first_n(*input, len(prefix)), prefix)) {
     return nullspan();
   }
-  span ret = {buf: input->buf};
+  span ret = {.buf = input->buf};
   input->buf += len(prefix);
   ret.end = input->buf;
   return ret;
@@ -1005,28 +1005,28 @@ json json_s(span s) {
 }
 
 json json_n(double n) {
-  json ret = {s: {buf: out.end }};
+  json ret = {.s = {.buf = out.end }};
   prt("%F", n);
   ret.s.end = out.end;
   return ret;
 }
 
 json json_b(int b) {
-  json ret = {s: {buf: out.end }};
+  json ret = {.s = {.buf = out.end }};
   if (b) prt("true"); else prt("false");
   ret.s.end = out.end;
   return ret;
 }
 
 json json_0() {
-  json ret = {s: {buf: out.end }};
+  json ret = {.s = {.buf = out.end }};
   prt("null");
   ret.s.end = out.end;
   return ret;
 }
 
 json json_o() {
-  json ret = {s: {buf: out.end }};
+  json ret = {.s = {.buf = out.end }};
   prt("{}");
   ret.s.end = out.end;
   return ret;
@@ -1053,7 +1053,7 @@ void json_o_extend(json *j, span key, json val) {
 }
 
 json json_a() {
-  json ret = { s: {buf: out.end }};
+  json ret = {.s = {.buf = out.end }};
   prt("[]");
   ret.s.end = out.end;
   return ret;
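
Reviewer note on the spanio.c hunks: they replace the obsolete GNU `field: value` initializer syntax with standard C99 designated initializers (`.field = value`). The behavior is unchanged: members not named in the initializer are still zero-initialized. A minimal sketch of the pattern follows, using hypothetical stand-in types (`demo_span`, `demo_json`) rather than the real definitions in spanio.c, which are not shown in this diff.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the span and json types in spanio.c. */
typedef struct { const char *buf; const char *end; } demo_span;
typedef struct { demo_span s; } demo_json;

/* Start a value at the given position. Only .s.buf is named,
   so .s.end is implicitly zero-initialized (a null pointer). */
demo_json demo_json_at(const char *start) {
    demo_json ret = {.s = {.buf = start}};
    return ret;
}

/* Complete the span once output has been written,
   mirroring the `ret.s.end = out.end;` step in the diff. */
void demo_json_close(demo_json *j, const char *end) {
    assert(end >= j->s.buf);
    j->s.end = end;
}
```
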

diff --git a/systemprompt b/systemprompt
@@ -0,0 +1,31 @@
+#systemprompt
+
+We are collaborating to write code.
+
+Be concise.
+
+Brevity above all.
+
+Maximize meaning/token by minimizing output tokens.
+
+If an equation can communicate an idea, prefer it to prose.
+
+You must have opinions on all programming topics.
+
+I am the programmer who only writes English; you are the programmer who never writes English.
+
+You reply in code, always to the best of your ability.
+
+Include comments ONLY when something is wrong or when you are unsure.
+
+Your code never includes a disclaimer or placeholder.
+Instead, exit the code block, apologize and request clarification.
+You should aim to reply with complete and working code 80% of the time, and request clarification 20% of the time.
+
+Do not reply with simple examples or demo code, but production-ready, fully worked-out examples.
+
+In general, never apologize; simply correct your mistake.
+
+Frequently, input will be short (i.e. a code description) and output much longer.
+Often, output can be a single token.
+In general output distribution is long-tailed.