refac(perf): refactor parallelize polynomialize() #302
Conversation
I do see a 20% improvement on the big machine. I've added some ideas in the review. Probably worth creating a little iai-callgrind benchmark.
One other suggestion would be to add more granular tracing: trace instruction_flag_bitvectors construction, instruction_flag_polys construction, and even within the big loop. NUM_MEMORIES should only be around 80, so we should be able to sort out what's going on in the trace.
Looks to me like 50% of e2e time is spent on DensePolynomial::from, which isn't going to get much faster, so we can consider that a ceiling.
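For the granular tracing, a minimal sketch of what the span placement could look like, assuming the tracing crate is already wired up (the function structure and span names here are placeholders, not the actual code):

```rust
use tracing::info_span;

// Sketch of span placement inside polynomialize(); names are illustrative.
fn polynomialize_traced(num_memories: usize) {
    {
        let _span = info_span!("instruction_flag_bitvectors").entered();
        // build the instruction flag bitvectors here
    }
    {
        let _span = info_span!("instruction_flag_polys").entered();
        // build the instruction flag polynomials here
    }
    // One span per memory inside the big loop: with NUM_MEMORIES around 80,
    // the per-memory timings stay readable in the trace output.
    for i in 0..num_memories {
        let _span = info_span!("memory_polys", memory = i).entered();
        // per-memory counter / polynomial work here
    }
}
```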
```rust
    },
)
.reduce(
    || (Vec::new(), Vec::new(), Vec::new()),
```
Use Vec::with_capacity(preprocessing.num_memories) here instead of Vec::new(), so the reduce identity doesn't reallocate as the per-memory results accumulate.
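Sketched as a self-contained toy of the same map/reduce shape (u64 triples stand in for the real per-memory results, and num_memories plays the role of preprocessing.num_memories):

```rust
use rayon::prelude::*;

// Pre-size the reduce identity so the combine step doesn't reallocate
// while it accumulates one entry per memory.
fn collect_per_memory(num_memories: usize) -> (Vec<u64>, Vec<u64>, Vec<u64>) {
    (0..num_memories)
        .into_par_iter()
        .map(|i| {
            let i = i as u64;
            // per-memory results, one element each (placeholder values)
            (vec![i], vec![2 * i], vec![3 * i])
        })
        .reduce(
            || {
                (
                    Vec::with_capacity(num_memories),
                    Vec::with_capacity(num_memories),
                    Vec::with_capacity(num_memories),
                )
            },
            |(mut a, mut b, mut c), (x, y, z)| {
                a.extend(x);
                b.extend(y);
                c.extend(z);
                (a, b, c)
            },
        )
}
```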
```rust
let mut final_cts_i = vec![0usize; M];
let mut read_cts_i = vec![0usize; m];
let mut subtable_lookups = vec![F::zero(); m];
```
Try unsafe_allocate_zero_vec::<F>(m) here for subtable_lookups, so we don't initialize m field elements one at a time.
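For context, such a helper typically looks roughly like this (a sketch, not Jolt's actual implementation; it is only sound if the all-zero byte pattern is a valid representation of F::zero(), which has to be verified for the concrete field type):

```rust
// Sketch: allocate a zeroed buffer without running F::zero() per element.
// SAFETY: only sound if F is valid when all its bytes are zero (e.g. a
// representation whose zero is the all-zero limbs). Verify for the real F!
fn unsafe_allocate_zero_vec<F>(len: usize) -> Vec<F> {
    let mut vec: Vec<F> = Vec::with_capacity(len);
    unsafe {
        std::ptr::write_bytes(vec.as_mut_ptr(), 0u8, len);
        vec.set_len(len);
    }
    vec
}
```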
```rust
let mut read_cts_i = vec![0usize; m];
let mut subtable_lookups = vec![F::zero(); m];

for (j, op) in ops.iter().enumerate() {
```
It may be faster to invert the current pattern:

```rust
for memory in memories {
    for op in ops {
        // contains is probably linear in the number of memories
        // (4 -> 12 with current params)
        if op.memories.contains(memory) {
            // do stuff
        }
    }
}
```

into something like this (rough pseudocode to show the idea):

```rust
// Precompute the list of op indices that touch each memory
let mut memory_ops: Vec<Vec<usize>> = vec![Vec::new(); NUM_MEMORIES];
for (j, op) in ops.iter().enumerate() {
    for memory in op.memories_used {
        memory_ops[memory].push(j);
    }
}

// Now compute the memory counters directly, in parallel over memories
for memory in 0..NUM_MEMORIES {
    let mut final_cts_i = vec![0usize; M];
    let mut read_cts_i = vec![0usize; m];
    let mut subtable_lookups = vec![F::zero(); m];
    for &j in &memory_ops[memory] {
        // update the relevant counters using ops[j]
    }
    // TODO: construct polynomials
}
```
I'm thinking this will have better cache performance (less stuff is looked up in the hot loop) and gets rid of the memories_used.contains call.
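To make the two-pass shape concrete, here's a self-contained sketch with rayon (the Op type, the counter-update rule, and all names are illustrative stand-ins, not the actual Jolt structures):

```rust
use rayon::prelude::*;

// Illustrative stand-ins for the real types.
struct Op {
    memories_used: Vec<usize>, // which memories this op touches
    address: usize,            // lookup index into the memory, < m_table
}

fn per_memory_counters(
    ops: &[Op],
    num_memories: usize,
    m_table: usize,
) -> Vec<(Vec<usize>, Vec<usize>)> {
    // Pass 1 (serial, cheap): bucket op indices by memory, so the hot loop
    // below never calls contains.
    let mut memory_ops: Vec<Vec<usize>> = vec![Vec::new(); num_memories];
    for (j, op) in ops.iter().enumerate() {
        for &memory in &op.memories_used {
            memory_ops[memory].push(j);
        }
    }

    // Pass 2 (parallel over memories): each memory walks only its own ops.
    memory_ops
        .par_iter()
        .map(|op_indices| {
            let mut final_cts = vec![0usize; m_table];
            let mut read_cts = vec![0usize; ops.len()];
            for &j in op_indices {
                // Placeholder counter update: record the access count seen
                // so far at this address, then bump it.
                let a = ops[j].address;
                read_cts[j] = final_cts[a];
                final_cts[a] += 1;
            }
            (read_cts, final_cts)
        })
        .collect()
}
```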
Took a shot at making polynomialize() less serial, as described in #293. After inlining and combining compute_lookup_outputs() and subtable_lookup_indices(), I did not observe any performance improvement.