@holiman
Contributor

@holiman holiman commented Nov 12, 2024

Alternative to #30746, potential follow-up to #30743. This PR makes the stacktrie always copy incoming value buffers and reuse them internally.
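
As a rough illustration of the "copy incoming values, reuse buffers internally" idea, here is a minimal sketch. It is not the code from this PR: `bytesPool`, `get`, `put` and `store` are hypothetical names, the 32-byte starting capacity is arbitrary, and the pool is assumed to be used from a single goroutine.

```go
package main

import "fmt"

// bytesPool is a hypothetical, single-goroutine free list of byte slices.
// It only illustrates the pattern; it is not the pool used by the stacktrie.
type bytesPool struct {
	free [][]byte
}

// get returns a zero-length slice, reusing a pooled backing array if one exists.
func (p *bytesPool) get() []byte {
	if n := len(p.free); n > 0 {
		b := p.free[n-1]
		p.free = p.free[:n-1]
		return b[:0]
	}
	return make([]byte, 0, 32) // arbitrary starting capacity (assumption)
}

// put hands a slice back so that a later get can reuse its backing array.
func (p *bytesPool) put(b []byte) {
	p.free = append(p.free, b)
}

// store copies val into a pooled buffer. The caller keeps ownership of val
// and may overwrite it as soon as store returns.
func (p *bytesPool) store(val []byte) []byte {
	return append(p.get(), val...)
}

func main() {
	pool := &bytesPool{}

	scratch := []byte("first")
	kept := pool.store(scratch)
	copy(scratch, "XXXXX")    // caller reuses its own buffer
	fmt.Println(string(kept)) // prints "first": the stored copy is unaffected

	pool.put(kept) // value no longer needed, return the buffer
	again := pool.store([]byte("2nd"))
	fmt.Println(string(again), cap(again) == cap(kept)) // prints "2nd true"
}
```

The point of the pattern is that the caller can keep encoding into one scratch buffer, while the callee pays for a copy but no longer needs a fresh allocation per value once the pool is warm.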

Improvement in #30743:

goos: linux
goarch: amd64
pkg: github.com/ethereum/go-ethereum/core/types
cpu: 12th Gen Intel(R) Core(TM) i7-1270P
                          │ derivesha.1 │             derivesha.2              │
                          │   sec/op    │    sec/op     vs base                │
DeriveSha200/stack_trie-8   477.8µ ± 2%   430.0µ ± 12%  -10.00% (p=0.000 n=10)

                          │ derivesha.1  │             derivesha.2              │
                          │     B/op     │     B/op      vs base                │
DeriveSha200/stack_trie-8   45.17Ki ± 0%   25.65Ki ± 0%  -43.21% (p=0.000 n=10)

                          │ derivesha.1 │            derivesha.2             │
                          │  allocs/op  │ allocs/op   vs base                │
DeriveSha200/stack_trie-8   1259.0 ± 0%   232.0 ± 0%  -81.57% (p=0.000 n=10)

This PR further enhances that:

goos: linux
goarch: amd64
pkg: github.com/ethereum/go-ethereum/core/types
cpu: 12th Gen Intel(R) Core(TM) i7-1270P
                          │ derivesha.2  │          derivesha.3           │
                          │    sec/op    │    sec/op     vs base          │
DeriveSha200/stack_trie-8   430.0µ ± 12%   423.6µ ± 13%  ~ (p=0.739 n=10)

                          │  derivesha.2  │             derivesha.3              │
                          │     B/op      │     B/op      vs base                │
DeriveSha200/stack_trie-8   25.654Ki ± 0%   4.960Ki ± 0%  -80.67% (p=0.000 n=10)

                          │ derivesha.2 │            derivesha.3             │
                          │  allocs/op  │ allocs/op   vs base                │
DeriveSha200/stack_trie-8   232.00 ± 0%   37.00 ± 0%  -84.05% (p=0.000 n=10)

So the total derivesha improvement over both PRs is:

goos: linux
goarch: amd64
pkg: github.com/ethereum/go-ethereum/core/types
cpu: 12th Gen Intel(R) Core(TM) i7-1270P
                          │ derivesha.1 │             derivesha.3              │
                          │   sec/op    │    sec/op     vs base                │
DeriveSha200/stack_trie-8   477.8µ ± 2%   423.6µ ± 13%  -11.33% (p=0.015 n=10)

                          │  derivesha.1  │             derivesha.3              │
                          │     B/op      │     B/op      vs base                │
DeriveSha200/stack_trie-8   45.171Ki ± 0%   4.960Ki ± 0%  -89.02% (p=0.000 n=10)

                          │ derivesha.1  │            derivesha.3             │
                          │  allocs/op   │ allocs/op   vs base                │
DeriveSha200/stack_trie-8   1259.00 ± 0%   37.00 ± 0%  -97.06% (p=0.000 n=10)

Since this PR always copies the incoming value, it adds a small penalty to the previous insert benchmark, which copied nothing (it always passed the same empty slice as input):

goos: linux
goarch: amd64
pkg: github.com/ethereum/go-ethereum/trie
cpu: 12th Gen Intel(R) Core(TM) i7-1270P
             │ stacktrie.7  │          stacktrie.10          │
             │    sec/op    │    sec/op     vs base          │
Insert100K-8   88.21m ± 34%   92.37m ± 31%  ~ (p=0.280 n=10)

             │ stacktrie.7  │             stacktrie.10             │
             │     B/op     │     B/op      vs base                │
Insert100K-8   3.424Ki ± 3%   4.581Ki ± 3%  +33.80% (p=0.000 n=10)

             │ stacktrie.7 │            stacktrie.10            │
             │  allocs/op  │ allocs/op   vs base                │
Insert100K-8    22.00 ± 5%   26.00 ± 4%  +18.18% (p=0.000 n=10)
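
To get a feel for what a copy costs in allocations, independent of the trie, one option is `testing.AllocsPerRun`. The sketch below is not the `Insert100K` benchmark; `copyFresh`, `copyReuse` and `sink` are names made up for the illustration. It compares copying a value into a freshly allocated slice with copying it into a buffer that is reused across calls.

```go
package main

import (
	"fmt"
	"testing"
)

// sink keeps the fresh copy reachable so the allocation cannot be optimized away.
var sink []byte

// copyFresh allocates a new slice for every value it copies.
func copyFresh(val []byte) {
	sink = append([]byte(nil), val...)
}

// copyReuse copies val into buf's existing backing array and returns the result.
// It only allocates if buf's capacity is too small for val.
func copyReuse(buf, val []byte) []byte {
	return append(buf[:0], val...)
}

func main() {
	val := make([]byte, 64)
	buf := make([]byte, 0, 128) // large enough for val, so no growth is needed

	fresh := testing.AllocsPerRun(1000, func() { copyFresh(val) })
	reuse := testing.AllocsPerRun(1000, func() { buf = copyReuse(buf, val) })

	fmt.Printf("fresh copy: %.0f allocs/op, reused buffer: %.0f allocs/op\n", fresh, reuse)
}
```

The copy itself still costs CPU time in both variants; only the allocation goes away when the destination buffer is reused.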

@holiman holiman force-pushed the stacktrie_allocs_3 branch from eccb604 to 82269e4 Compare January 14, 2025 12:54
@rjl493456442 rjl493456442 changed the title from "trie: [wip] reduce allocations in derivesha" to "core/types, trie: reduce allocations in derivesha" Sep 2, 2025
@rjl493456442
Member


[[ Master ]]
goos: darwin
goarch: arm64
pkg: github.com/ethereum/go-ethereum/core/types
cpu: Apple M1 Pro
BenchmarkDeriveSha200
BenchmarkDeriveSha200/std_trie
BenchmarkDeriveSha200/std_trie-8                6921        167578 ns/op       77865 B/op       1731 allocs/op
BenchmarkDeriveSha200/stack_trie
BenchmarkDeriveSha200/stack_trie-8              7146        168260 ns/op       26098 B/op        232 allocs/op

[[ PR ]]
goos: darwin
goarch: arm64
pkg: github.com/ethereum/go-ethereum/core/types
cpu: Apple M1 Pro
BenchmarkDeriveSha200
BenchmarkDeriveSha200/std_trie
BenchmarkDeriveSha200/std_trie-8                7058        172988 ns/op       80054 B/op       1926 allocs/op
BenchmarkDeriveSha200/stack_trie
BenchmarkDeriveSha200/stack_trie-8              7122        166626 ns/op         745 B/op         19 allocs/op

@rjl493456442 rjl493456442 marked this pull request as ready for review September 2, 2025 06:46
@rjl493456442
Member

[[ Master ]]

(pprof) focus=DeriveSha
(pprof) top
Active filters:
   focus=DeriveSha
Showing nodes accounting for 75616.73MB, 1.69% of 4479364.35MB total
Dropped 36 nodes (cum <= 22396.82MB)
Showing top 10 nodes out of 49
      flat  flat%   sum%        cum   cum%
49936.95MB  1.11%  1.11% 68792.73MB  1.54%  github.com/ethereum/go-ethereum/core/types.encodeForDerive
   13106MB  0.29%  1.41% 17499.76MB  0.39%  github.com/ethereum/go-ethereum/core/types.Receipts.EncodeIndex
 6319.21MB  0.14%  1.55%  6372.15MB  0.14%  github.com/ethereum/go-ethereum/rlp.(*encBuffer).writeBytes
 4116.95MB 0.092%  1.64%  4116.95MB 0.092%  github.com/ethereum/go-ethereum/rlp.(*EncoderBuffer).AppendToBytes
     988MB 0.022%  1.66%      988MB 0.022%  bytes.growSlice
  560.10MB 0.013%  1.67%   560.10MB 0.013%  github.com/ethereum/go-ethereum/trie.init.func3
  464.51MB  0.01%  1.69%   464.51MB  0.01%  github.com/ethereum/go-ethereum/rlp.(*encBuffer).list (inline)
     113MB 0.0025%  1.69%  6600.56MB  0.15%  github.com/ethereum/go-ethereum/trie.(*StackTrie).hash
      11MB 0.00025%  1.69%  6989.98MB  0.16%  github.com/ethereum/go-ethereum/trie.(*StackTrie).Update
       1MB 2.2e-05%  1.69%      989MB 0.022%  bytes.(*Buffer).grow
(pprof)
[[ PR ]]

(pprof) top
Active filters:
   focus=DeriveSha
Showing nodes accounting for 41593.68MB, 0.93% of 4490562.58MB total
Dropped 37 nodes (cum <= 22452.81MB)
Showing top 10 nodes out of 50
      flat  flat%   sum%        cum   cum%
14087.04MB  0.31%  0.31% 23043.54MB  0.51%  github.com/ethereum/go-ethereum/trie.(*StackTrie).Update
12947.95MB  0.29%   0.6% 17368.17MB  0.39%  github.com/ethereum/go-ethereum/core/types.Receipts.EncodeIndex
 6340.27MB  0.14%  0.74%  6395.78MB  0.14%  github.com/ethereum/go-ethereum/rlp.(*encBuffer).writeBytes
 4156.57MB 0.093%  0.84%  4156.57MB 0.093%  github.com/ethereum/go-ethereum/rlp.(*EncoderBuffer).AppendToBytes
 1662.01MB 0.037%  0.87%  1662.01MB 0.037%  github.com/ethereum/go-ethereum/trie.(*unsafeBytesPool).get (inline)
 1003.13MB 0.022%   0.9%  1003.13MB 0.022%  bytes.growSlice
  836.15MB 0.019%  0.91%   836.15MB 0.019%  github.com/ethereum/go-ethereum/trie.init.func3
  448.04MB  0.01%  0.92%   448.04MB  0.01%  github.com/ethereum/go-ethereum/rlp.(*encBuffer).list (inline)
     111MB 0.0025%  0.93%  6641.22MB  0.15%  github.com/ethereum/go-ethereum/trie.(*StackTrie).hash
    1.50MB 3.3e-05%  0.93%  1004.63MB 0.022%  bytes.(*Buffer).grow
(pprof)

@fjl
Contributor

fjl commented Sep 2, 2025

I think you have 'master' and 'PR' reversed in your comment.

@holiman
Contributor Author

holiman commented Sep 5, 2025

> I think you have 'master' and 'PR' reversed in your comment.

I don't think he did. I'm not sure which comment you were referring to, but here is the first one:

Master
BenchmarkDeriveSha200/stack_trie-8              7146        168260 ns/op       26098 B/op        232 allocs/op
PR
BenchmarkDeriveSha200/stack_trie-8              7122        166626 ns/op         745 B/op         19 allocs/op

Second one

Master
49936.95MB  1.11%  1.11% 68792.73MB  1.54%  github.com/ethereum/go-ethereum/core/types.encodeForDerive
   13106MB  0.29%  1.41% 17499.76MB  0.39%  github.com/ethereum/go-ethereum/core/types.Receipts.EncodeIndex

PR
14087.04MB  0.31%  0.31% 23043.54MB  0.51%  github.com/ethereum/go-ethereum/trie.(*StackTrie).Update
12947.95MB  0.29%   0.6% 17368.17MB  0.39%  github.com/ethereum/go-ethereum/core/types.Receipts.EncodeIndex

So EncodeIndex is comparable in both, whereas the largest offender on the PR is Update at 14G, while on master it is encodeForDerive at 50G.

@fjl
Contributor

fjl commented Sep 8, 2025

Yeah I got confused with the std_trie vs stack_trie

@rjl493456442
Member

rjl493456442 commented Sep 8, 2025

Deployed on bench07 and 08 for snap sync

EDIT: Snap sync finished correctly, with a complete state assembled locally.

@fjl fjl force-pushed the stacktrie_allocs_3 branch from 1dfbc99 to c1930b3 Compare September 29, 2025 19:00
@fjl fjl added this to the 1.16.5 milestone Sep 29, 2025
@fjl fjl merged commit 0576671 into ethereum:master Oct 1, 2025
5 of 6 checks passed
Sahil-4555 pushed a commit to Sahil-4555/go-ethereum that referenced this pull request Oct 12, 2025
---------

Co-authored-by: Gary Rong <[email protected]>
Co-authored-by: Felix Lange <[email protected]>
atkinsonholly pushed a commit to atkinsonholly/ephemery-geth that referenced this pull request Nov 24, 2025
prestoalvarez pushed a commit to prestoalvarez/go-ethereum that referenced this pull request Nov 27, 2025
3 participants