refac(perf): refactor parallelize polynomialize() #302
Conversation
I do see a 20% improvement on the big machine. I've added some ideas in the review. Probably worth creating a little iai-callgrind benchmark.
One other suggestion would be to add more granular tracing: trace instruction_flag_bitvectors construction, instruction_flag_polys construction, and even within the big loop. NUM_MEMORIES should only be around 80, so we should be able to sort out what's going on in the trace.
Looks to me like 50% of e2e time is spent on DensePolynomial::from, which isn't going to get much faster, so we can consider that a ceiling.
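For the granular tracing, a minimal sketch of what the span placement could look like, assuming the tracing crate is already wired up (the function structure and span names here are placeholders, not the actual code):

```rust
use tracing::info_span;

// Sketch of span placement inside polynomialize(); names are illustrative.
fn polynomialize_traced(num_memories: usize) {
    {
        let _span = info_span!("instruction_flag_bitvectors").entered();
        // build the instruction flag bitvectors here
    }
    {
        let _span = info_span!("instruction_flag_polys").entered();
        // build the instruction flag polynomials here
    }
    // One span per memory inside the big loop: with NUM_MEMORIES around 80,
    // the per-memory timings stay readable in the trace output.
    for i in 0..num_memories {
        let _span = info_span!("memory_polys", memory = i).entered();
        // per-memory counter / polynomial work here
    }
}
```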
```rust
    },
)
.reduce(
    || (Vec::new(), Vec::new(), Vec::new()),
```
Use Vec::with_capacity(preprocessing.num_memories) here instead of Vec::new(), so the reduce identity doesn't reallocate as the per-memory results accumulate.
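Sketched as a self-contained toy of the same map/reduce shape (u64 triples stand in for the real per-memory results, and num_memories plays the role of preprocessing.num_memories):

```rust
use rayon::prelude::*;

// Pre-size the reduce identity so the combine step doesn't reallocate
// while it accumulates one entry per memory.
fn collect_per_memory(num_memories: usize) -> (Vec<u64>, Vec<u64>, Vec<u64>) {
    (0..num_memories)
        .into_par_iter()
        .map(|i| {
            let i = i as u64;
            // per-memory results, one element each (placeholder values)
            (vec![i], vec![2 * i], vec![3 * i])
        })
        .reduce(
            || {
                (
                    Vec::with_capacity(num_memories),
                    Vec::with_capacity(num_memories),
                    Vec::with_capacity(num_memories),
                )
            },
            |(mut a, mut b, mut c), (x, y, z)| {
                a.extend(x);
                b.extend(y);
                c.extend(z);
                (a, b, c)
            },
        )
}
```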
```rust
let mut final_cts_i = vec![0usize; M];
let mut read_cts_i = vec![0usize; m];
let mut subtable_lookups = vec![F::zero(); m];
```
Try unsafe_allocate_zero_vec::<F>(m) here for subtable_lookups, so we don't initialize m field elements one at a time.
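For context, such a helper typically looks roughly like this (a sketch, not Jolt's actual implementation; it is only sound if the all-zero byte pattern is a valid representation of F::zero(), which has to be verified for the concrete field type):

```rust
// Sketch: allocate a zeroed buffer without running F::zero() per element.
// SAFETY: only sound if F is valid when all its bytes are zero (e.g. a
// representation whose zero is the all-zero limbs). Verify for the real F!
fn unsafe_allocate_zero_vec<F>(len: usize) -> Vec<F> {
    let mut vec: Vec<F> = Vec::with_capacity(len);
    unsafe {
        std::ptr::write_bytes(vec.as_mut_ptr(), 0u8, len);
        vec.set_len(len);
    }
    vec
}
```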
```rust
let mut read_cts_i = vec![0usize; m];
let mut subtable_lookups = vec![F::zero(); m];

for (j, op) in ops.iter().enumerate() {
```
It may be faster to invert the current pattern:

```rust
for memory in memories {
    for op in ops {
        // contains is probably linear in the number of memories
        // (4 -> 12 with current params)
        if op.memories.contains(memory) {
            // do stuff
        }
    }
}
```

into something like this (rough pseudocode to show the idea):

```rust
// Precompute the list of op indices that touch each memory
let mut memory_ops: Vec<Vec<usize>> = vec![Vec::new(); NUM_MEMORIES];
for (j, op) in ops.iter().enumerate() {
    for memory in op.memories_used {
        memory_ops[memory].push(j);
    }
}

// Now compute the memory counters directly, in parallel over memories
for memory in 0..NUM_MEMORIES {
    let mut final_cts_i = vec![0usize; M];
    let mut read_cts_i = vec![0usize; m];
    let mut subtable_lookups = vec![F::zero(); m];
    for &j in &memory_ops[memory] {
        // update the relevant counters using ops[j]
    }
    // TODO: construct polynomials
}
```
I'm thinking this will have better cache performance (less stuff is looked up in the hot loop) and gets rid of the memories_used.contains call.
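To make the two-pass shape concrete, here's a self-contained sketch with rayon (the Op type, the counter-update rule, and all names are illustrative stand-ins, not the actual Jolt structures):

```rust
use rayon::prelude::*;

// Illustrative stand-ins for the real types.
struct Op {
    memories_used: Vec<usize>, // which memories this op touches
    address: usize,            // lookup index into the memory, < m_table
}

fn per_memory_counters(
    ops: &[Op],
    num_memories: usize,
    m_table: usize,
) -> Vec<(Vec<usize>, Vec<usize>)> {
    // Pass 1 (serial, cheap): bucket op indices by memory, so the hot loop
    // below never calls contains.
    let mut memory_ops: Vec<Vec<usize>> = vec![Vec::new(); num_memories];
    for (j, op) in ops.iter().enumerate() {
        for &memory in &op.memories_used {
            memory_ops[memory].push(j);
        }
    }

    // Pass 2 (parallel over memories): each memory walks only its own ops.
    memory_ops
        .par_iter()
        .map(|op_indices| {
            let mut final_cts = vec![0usize; m_table];
            let mut read_cts = vec![0usize; ops.len()];
            for &j in op_indices {
                // Placeholder counter update: record the access count seen
                // so far at this address, then bump it.
                let a = ops[j].address;
                read_cts[j] = final_cts[a];
                final_cts[a] += 1;
            }
            (read_cts, final_cts)
        })
        .collect()
}
```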
Took a shot at making polynomialize() less serial, as described in #293. After inlining and combining compute_lookup_outputs() and subtable_lookup_indices(), I did not observe any performance improvement.