1.
  1.

    1.

      So if I understand this right, the intended usage is that you model the expected behavior, you instrument your real code to produce the traces, and then you deploy it or fuzz it to get a lot of traces you can check for linearizability violations?

      In practice, what’s the easiest way to capture the traces? Just log some JSON? Derive it from OTel spans?

      1.

        You do need to ensure that whatever system records the history of events is itself Linearizable (hand-waving a bit here; you can actually get away with slightly weaker properties). That could be a single-node program which records invocations and completions in a Linearizable in-memory structure. Or you could journal them (synchronously!) to a Linearizable database before invocation and after completion, in production.

        What you can’t do is, say, stamp log lines or OTel events with local node clocks–those aren’t necessarily synchronized, and the checker might tell you your program violated Linearizability when it didn’t. You also can’t send them to a database asynchronously, because the message recording the invocation might be delayed and arrive at the database after the operation was actually performed. You also can’t use a Serializable or Causally consistent database, because it might fail to record your events in real-time order. A Sequential log is OK, but the scope of the Sequential object has to be the entire log; otherwise you could (you guessed it!) get reorderings.
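        To make that concrete, here’s a minimal sketch of the single-node approach, assuming nothing beyond Go’s standard library; the Recorder and Event names are invented for illustration. Because every invocation and completion is appended under one mutex with a strictly increasing index, the log itself is Linearizable:

        ```go
        package recorder

        import "sync"

        // Event is one half of an operation: its invocation or its completion.
        // This type and the Recorder API are illustrative, not from any library.
        type Event struct {
            Index    int64       // strictly increasing, assigned under the lock
            ClientID int         // which client performed the operation
            Invoke   bool        // true = invocation, false = completion
            Payload  interface{} // operation input (on invoke) or output (on return)
        }

        // Recorder is a single-node, in-memory history log. Appending under a
        // single mutex guarantees the recorded order is consistent with
        // real-time order.
        type Recorder struct {
            mu     sync.Mutex
            next   int64
            events []Event
        }

        // RecordInvoke must be called before issuing the operation.
        func (r *Recorder) RecordInvoke(client int, input interface{}) {
            r.mu.Lock()
            defer r.mu.Unlock()
            r.events = append(r.events, Event{Index: r.next, ClientID: client, Invoke: true, Payload: input})
            r.next++
        }

        // RecordReturn must be called after the operation has completed.
        func (r *Recorder) RecordReturn(client int, output interface{}) {
            r.mu.Lock()
            defer r.mu.Unlock()
            r.events = append(r.events, Event{Index: r.next, ClientID: client, Invoke: false, Payload: output})
            r.next++
        }

        // History returns a snapshot of the recorded events for the checker.
        func (r *Recorder) History() []Event {
            r.mu.Lock()
            defer r.mu.Unlock()
            return append([]Event(nil), r.events...)
        }
        ```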

        1.

          Another thing to be aware of is that if you log to the same datastore that’s under test, you could influence the results. For example, if your data store has an issue with stale reads in read-only transactions, and you add an “insert log line” write between every read transaction on the same connection, the bug could become invisible.

        2.

          In practice, is it enough to synchronize the clocks using something like NTP?

          1.

            I mean… no? Imagine you detected a Linearizability violation in a history recorded using timestamps on, say, Linux machines. From this you could conclude that either a.) the system being measured violated Linearizability, or b.) the history-collecting system had, at some time, poor synchronization; or the VM paused; or the OS paused; or the application paused; and so on. Timestamps alone don’t allow you to make a provable claim of correctness or incorrectness.

          2.

            Maybe?

            If you’re talking about something like EC2 time sync, where clock errors are ~50µs, and your storage system is remote, with best-case latencies of a couple hundred microseconds, and you’re careful about which end of the clock bound to pick, you can still get good results (see the sketch below).

            NTP in general, probably not, unless you’re very careful about the way you set it up and have great time infrastructure.
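            To illustrate picking the right end of the bound, a hedged sketch; the now() helper here is hypothetical, standing in for a real bounded-clock source such as a ClockBound-style daemon. Stamping invocations with the earliest possible time and completions with the latest widens every operation’s window, so any ordering the checker infers from these timestamps really did hold in real time; the cost is that violations shorter than the error bound go undetected:

            ```go
            package tracing

            import "time"

            // now returns [earliest, latest] bounds on the true time. Hypothetical:
            // a stand-in for a real bounded-clock source (e.g. a ClockBound daemon).
            func now() (earliest, latest time.Time) {
                t := time.Now()
                bound := 50 * time.Microsecond // assumed worst-case clock error
                return t.Add(-bound), t.Add(bound)
            }

            // Stamped records one operation with a conservatively widened window.
            type Stamped struct {
                Call   time.Time // earliest possible invocation time
                Return time.Time // latest possible completion time
            }

            // StampInvoke takes the earliest end of the bound, so the recorded
            // invocation is never later than the true one.
            func StampInvoke(op *Stamped) {
                op.Call, _ = now()
            }

            // StampReturn takes the latest end of the bound, so the recorded
            // completion is never earlier than the true one. If recorded
            // A.Return < B.Call, then A really did complete before B began.
            func StampReturn(op *Stamped) {
                _, op.Return = now()
            }
            ```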

        3.

          Would it work to use Lamport clocks? Or to stamp messages with the ones they’ve seen and then topologically sort (functionally the same thing, just expressed differently)?

          1.

            Sort of. You can use messages between nodes and a strategy like Lamport clocks to establish a partial order which is consistent with the real-time order (because messages always flow forwards in time). However, you’d fail to detect violations of Linearizability over timescales shorter than the message propagation interval.
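            For reference, a minimal Lamport clock in Go (a generic sketch of the textbook algorithm, not tied to any particular system). Events connected by a chain of messages get ordered timestamps; events with no chain between them can be stamped in either order, which is exactly the blind spot described above:

            ```go
            package lamport

            import "sync"

            // Clock is a standard Lamport logical clock.
            type Clock struct {
                mu   sync.Mutex
                time uint64
            }

            // Tick advances the clock for a local event (e.g. an invocation or
            // completion) and returns its timestamp.
            func (c *Clock) Tick() uint64 {
                c.mu.Lock()
                defer c.mu.Unlock()
                c.time++
                return c.time
            }

            // Observe merges a timestamp carried by an incoming message: the local
            // clock jumps past it, so causally later events get larger timestamps.
            func (c *Clock) Observe(remote uint64) uint64 {
                c.mu.Lock()
                defer c.mu.Unlock()
                if remote > c.time {
                    c.time = remote
                }
                c.time++
                return c.time
            }
            ```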

            1.

              hrm—forgive me if these are stupid questions, but

              1. wouldn’t any really linearisable logging scheme induce synchronisation between nodes, potentially also causing you to fail to detect some bugs? are there bugs that are still systematically more likely to be caught or can only be caught that way? (it seems not very nice because it sacrifices scalability)

              2. IIUC, the difference between sequential consistency and linearisability is effectively that linearisability is consistent with communication ‘outside the system’, where seqcst doesn’t have to be. And Lamport clocks or causal ordering should not miss sequential consistency violations. But if there are any classes of ‘external communication’ that we’re interested in in some context, couldn’t we try to bring those into the testing/logging system too? (I think this is similar to using a linearisable logging system, except that you don’t record all the ordering relationships—so you don’t get to full linearisability—only some of them, but what you do record you get in a distributed way)

              1.

                > wouldn’t any really linearisable logging scheme induce synchronisation between nodes, potentially also causing you to fail to detect some bugs?

                Yep! It’s actually impossible to measure precisely. Jepsen mitigates this by putting all clients on a single node, allowing fast connections between clients and servers, and (optionally) adding latency between servers; this lets it see errors “faster than light”, so to speak.

                > (it seems not very nice because it sacrifices scalability)

                In general, I wouldn’t worry about scalability. Linearizability checking is NP-hard and relies on total histories–both of those are already bad for scalability. Even if that weren’t the case, you’ll get the most bang for your buck testing in a controlled environment where you can induce failures, rather than in production, where failures are (I hope!?) comparatively infrequent.

                > Lamport clocks or causal ordering should not miss sequential consistency violations.

                I’m not sure what this means exactly, but I should note that Causal is, in general, weaker than Sequential.

                > But if there are any classes of ‘external communication’ that we’re interested in in some context, couldn’t we try to bring those into the testing/logging system too?

                Sure, you could do that, but then the property you’re measuring is something like Causal or Sequential, rather than Linearizable.

      2.

        Yes, you’ve got it! If the project is in Go, you could just write the fuzzer and test harness in Go. Otherwise, yes, you’d want to emit some intermediate format from your system and then write a Go program to convert it into Porcupine’s history format.
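        Concretely, a minimal sketch of that conversion step; Model, Operation, and CheckOperations come from Porcupine’s public API, while the regOp trace record, the toy trace, and the register model are invented for illustration:

        ```go
        package main

        import (
            "fmt"

            "github.com/anishathalye/porcupine"
        )

        // regOp is an assumed intermediate trace record, e.g. decoded from JSON.
        // The format is ours, not Porcupine's.
        type regOp struct {
            Write    bool
            Value    int   // value written (for writes) or value observed (for reads)
            ClientID int
            Call     int64 // invocation timestamp from the linearizable recorder
            Ret      int64 // completion timestamp
        }

        func main() {
            // A model of a single linearizable integer register.
            model := porcupine.Model{
                Init: func() interface{} { return 0 },
                Step: func(state, input, output interface{}) (bool, interface{}) {
                    in := input.(regOp)
                    if in.Write {
                        return true, in.Value // a write always succeeds and sets the state
                    }
                    return output.(int) == state.(int), state // a read must see the current value
                },
            }

            // A toy trace, standing in for whatever your system emitted.
            trace := []regOp{
                {Write: true, Value: 42, ClientID: 0, Call: 0, Ret: 10},
                {Write: false, Value: 42, ClientID: 1, Call: 11, Ret: 20},
            }

            // Convert the intermediate format into Porcupine operations.
            history := make([]porcupine.Operation, 0, len(trace))
            for _, op := range trace {
                var out interface{}
                if !op.Write {
                    out = op.Value // what this read actually returned
                }
                history = append(history, porcupine.Operation{
                    ClientId: op.ClientID,
                    Input:    op,
                    Call:     op.Call,
                    Output:   out,
                    Return:   op.Ret,
                })
            }

            fmt.Println("linearizable:", porcupine.CheckOperations(model, history))
        }
        ```

        In a real harness you’d decode your intermediate format (JSON lines, say) into the trace slice instead of hard-coding it.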