Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

validation: In case of a continued reindex, only activate chain in the end #31439

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

mzumsande
Copy link
Contributor

@mzumsande mzumsande commented Dec 6, 2024

If a user interrupts a reindex while it is iterating over the block files, it will continue to reindex with the next node start (if the -reindex arg is dropped, otherwise it will start reindexing from scratch).
However, due to an early call to ActivateBestChainState() that only exists to connect the genesis block during
the original -reindex, it wil start connecting blocks immediately before having iterated through all block files.
Because later headers above the minchainwork threshold won't be loaded in this case, -assumevalid will not
be applied and the process is much slower due to script validation being done.

Fix this by only calling ActivateBestChainState() here if Genesis is not connected yet (equivalent to ActiveHeight() == -1).
Also simplify this spot by only doing this for the active chainstate instead of looping over all chainstates (first commit).

This issue was discussed in the thread below #31346 (comment), the impact on assumevalid was found by l0rinc.

The fix can be tested by manually aborting a -reindex e.g. on signet and observing in the debug log the order in which blockfiles are indexed / blocks are connected with this branch vs master.

@DrahtBot
Copy link
Contributor

DrahtBot commented Dec 6, 2024

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/31439.

Reviews

See the guideline for information on the review process.

Type Reviewers
ACK TheCharlatan
Concept ACK ryanofsky

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #29700 (kernel, refactor: return error status on all fatal errors by ryanofsky)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

This simplifies the code. The only reason to call ActivateBestChain()
here is to allow the main init thread to finish startup in a case of
-reindex. In this situation no second chainstate can exist anyway
because -reindex would have deleted any snapshot chainstate earlier.
@mzumsande mzumsande force-pushed the 202412_reindex_interrupt branch from b91a191 to 0ab9cbe Compare December 6, 2024 19:01
@DrahtBot
Copy link
Contributor

DrahtBot commented Dec 6, 2024

🚧 At least one of the CI tasks failed.
Debug: https://github.com/bitcoin/bitcoin/runs/34049612281

Hints

Try to run the tests locally, according to the documentation. However, a CI failure may still
happen due to a number of reasons, for example:

  • Possibly due to a silent merge conflict (the changes in this pull request being
    incompatible with the current code in the target branch). If so, make sure to rebase on the latest
    commit of the target branch.

  • A sanitizer issue, which can only be found by compiling with the sanitizer and running the
    affected test.

  • An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

If a reindex was interrupted while it was iterating
through the block files, genesis will already be connected
when the reindex resumes at the next startup.
In this case, a call to ActivateBestChainState() is not only unnecessary,
but it would connect multiple blocks without applying
-assumevalid, which is much slower.
This is because assumevalid requires us to have a header above
the minimum chainwork, but that header is unknown to us if it's in
a later blockfile not indexed yet.
@mzumsande mzumsande force-pushed the 202412_reindex_interrupt branch from 0ab9cbe to 570af72 Compare December 6, 2024 19:10
@l0rinc
Copy link
Contributor

l0rinc commented Dec 6, 2024

@mzumsande, do you happen to know why setting assumevalid to block 859999 works as expected (it is found in the 860001 blocks in m_blockman.m_block_index), so it's correctly disabling script verification, but setting it to block 860002 (we're currently at 873543) enables fScriptChecks from the very beginning for some reason (i.e. setting a later block results in having earlier validation).

Edit:

Details
2024-12-06T22:37:39Z Synchronizing blockheaders, height: 860000 (~98.48%)
@@@m_blockman.m_block_index.size()@@@ = 860001
@@@fScriptChecks@@@ = 1
2024-12-06T22:37:39Z UpdateTip: new best=00000000839a8e6886ab5951d76f411475428afc90947ee320161bbf18eb6048 height=1 version=0x00000001 log2_work=33.000022 tx=2 date='2009-01-09T02:54:25Z' progress=0.000000 cache=0.3MiB(1txo)
@@@m_blockman.m_block_index.size()@@@ = 860001
@@@fScriptChecks@@@ = 1
2024-12-06T22:37:39Z UpdateTip: new best=000000006a625f06636b8bb6ac7b960a8d03705d1ace08b1a19da3fdcc99ddbd height=2 version=0x00000001 log2_work=33.584985 tx=3 date='2009-01-09T02:55:44Z' progress=0.000000 cache=0.3MiB(2txo)
@@@m_blockman.m_block_index.size()@@@ = 860001
@@@fScriptChecks@@@ = 1
2024-12-06T22:37:39Z UpdateTip: new best=0000000082b5015589a3fdf2d4baff403e6f0be035a5d9742c1cae6295464449 height=3 version=0x00000001 log2_work=34.000022 tx=4 date='2009-01-09T03:02:53Z' progress=0.000000 cache=0.3MiB(3txo)
@@@m_blockman.m_block_index.size()@@@ = 860001
@@@fScriptChecks@@@ = 1
2024-12-06T22:37:39Z UpdateTip: new best=000000004ebadb55ee9096c9a2f8880e09da59c0d68b1c228da88e48844a1485 height=4 version=0x00000001 log2_work=34.321950 tx=5 date='2009-01-09T03:16:28Z' progress=0.000000 cache=0.3MiB(4txo)
@@@m_blockman.m_block_index.size()@@@ = 860001
@@@fScriptChecks@@@ = 1
2024-12-06T22:37:39Z UpdateTip: new best=000000009b7262315dbf071787ad3656097b892abffd1f95a1a022f896f533fc height=5 version=0x00000001 log2_work=34.584985 tx=6 date='2009-01-09T03:23:48Z' progress=0.000000 cache=0.3MiB(5txo)
@@@m_blockman.m_block_index.size()@@@ = 860001
@@@fScriptChecks@@@ = 1
2024-12-06T22:37:39Z UpdateTip: new best=000000003031a0e73735690c5a1ff2a4be82553b2a12b776fbd3a215dc8f778d height=6 version=0x00000001 log2_work=34.807377 tx=7 date='2009-01-09T03:29:49Z' progress=0.000000 cache=0.3MiB(6txo)
@@@m_blockman.m_block_index.size()@@@ = 860001
@@@fScriptChecks@@@ = 1
2024-12-06T22:37:39Z UpdateTip: new best=0000000071966c2b1d065fd446b1e485b2c9d9594acd2007ccbd5441cfc89444 height=7 version=0x00000001 log2_work=35.000022 tx=8 date='2009-01-09T03:39:29Z' progress=0.000000 cache=0.3MiB(7txo)
@@@m_blockman.m_block_index.size()@@@ = 860001
@@@fScriptChecks@@@ = 1
2024-12-06T22:37:39Z UpdateTip: new best=00000000408c48f847aa786c2268fc3e6ec2af68e8468a34a28c61b7f1de0dc6 height=8 version=0x00000001 log2_work=35.169947 tx=9 date='2009-01-09T03:45:43Z' progress=0.000000 cache=0.3MiB(8txo)
@@@m_blockman.m_block_index.size()@@@ = 860001
@@@fScriptChecks@@@ = 1
2024-12-06T22:37:39Z UpdateTip: new best=000000008d9dc510f23c2657fc4f67bea30078cc05a90eb89e84cc475c080805 height=9 version=0x00000001 log2_work=35.321950 tx=10 date='2009-01-09T03:54:39Z' progress=0.000000 cache=0.3MiB(9txo)
@@@m_blockman.m_block_index.size()@@@ = 860001
@@@fScriptChecks@@@ = 1
2024-12-06T22:37:39Z UpdateTip: new best=000000002c05cc2e78923c34df87fd108b22221ac6076c18f3ade378a4d915e9 height=10 version=0x00000001 log2_work=35.459454 tx=11 date='2009-01-09T04:05:52Z' progress=0.000000 cache=0.3MiB(10txo)
@@@m_blockman.m_block_index.size()@@@ = 860001
@@@fScriptChecks@@@ = 1
2024-12-06T22:37:39Z UpdateTip: new best=0000000097be56d606cdd9c54b04d4747e957d3608abe69198c661f2add73073 height=11 version=0x00000001 log2_work=35.584985 tx=12 date='2009-01-09T04:12:40Z' progress=0.000000 cache=0.3MiB(11txo)
@@@m_blockman.m_block_index.size()@@@ = 860001
@@@fScriptChecks@@@ = 1
2024-12-06T22:37:39Z UpdateTip: new best=0000000027c2488e2510d1acf4369787784fa20ee084c258b58d9fbd43802b5e height=12 version=0x00000001 log2_work=35.700462 tx=13 date='2009-01-09T04:21:28Z' progress=0.000000 cache=0.3MiB(12txo)
@@@m_blockman.m_block_index.size()@@@ = 860001
@@@fScriptChecks@@@ = 1
2024-12-06T22:37:40Z UpdateTip: new best=000000005c51de2031a895adc145ee2242e919a01c6d61fb222a54a54b4d3089 height=13 version=0x00000001 log2_work=35.807377 tx=14 date='2009-01-09T04:23:40Z' progress=0.000000 cache=0.3MiB(13txo)
@@@m_blockman.m_block_index.size()@@@ = 860001
@@@fScriptChecks@@@ = 1
2024-12-06T22:37:41Z UpdateTip: new best=0000000080f17a0c5a67f663a9bc9969eb37e81666d9321125f0e293656f8a37 height=14 version=0x00000001 log2_work=35.906913 tx=15 date='2009-01-09T04:33:09Z' progress=0.000000 cache=0.3MiB(14txo)
@@@m_blockman.m_block_index.size()@@@ = 860001
@@@fScriptChecks@@@ = 1
2024-12-06T22:37:42Z UpdateTip: new best=00000000b3322c8c3ef7d2cf6da009a776e6a99ee65ec5a32f3f345712238473 height=15 version=0x00000001 log2_work=36.000022 tx=16 date='2009-01-10T04:45:46Z' progress=0.000000 cache=0.3MiB(15txo)
@@@m_blockman.m_block_index.size()@@@ = 860001
@@@fScriptChecks@@@ = 1
2024-12-06T22:37:42Z UpdateTip: new best=00000000174a25bb399b009cc8deff1c4b3ea84df7e93affaaf60dc3416cc4f5 height=16 version=0x00000001 log2_work=36.087485 tx=17 date='2009-01-10T04:45:58Z' progress=0.000000 cache=0.3MiB(16txo)
2024-12-06T22:37:43Z Synchronizing blockheaders, height: 862000 (~98.71%)
@@@m_blockman.m_block_index.size()@@@ = 862001
@@@fScriptChecks@@@ = 0

i.e. during synchronization the UpdateTip already starts and fScriptChecks gets enabled until the correct block is loaded, which will disable it - weird

@l0rinc
Copy link
Contributor

l0rinc commented Dec 6, 2024

I have tested 570af72, did an IBD to a clean datadir, stopped the process at ~20k blocks, restarted it with a -reindex: m_blockman.m_block_index contains the previous 25921 blocks only, so m_chainman.AssumedValidBlock() is not found among them, so script validation gets turned on at height=1 (same behavior as on master).

@mzumsande
Copy link
Contributor Author

mzumsande commented Dec 6, 2024

@mzumsande, do you happen to know why setting assumevalid to block 859999 works as expected (it is found in the 860001 blocks in m_blockman.m_block_index), so it's correctly disabling script verification, but setting it to block 860002 (we're currently at 873543) enables fScriptChecks from the very beginning for some reason (i.e. setting a later block results in having earlier validation).

I can't reproduce this. Assuming it's not related to reindex, on an empty datadir after header sync, whether I start with
-assumevalid="0000000000000000000196a7111c7aa3368af5a803a81c7bfb32860100dfd9c7"
or
-assumevalid="000000000000000000020c0ad748d78767d986e881259ba6de0a6e62cf946cf1"
fScriptChecks is set to 0.

I have tested 570af72, did an IBD to a clean datadir, stopped the process at ~20k blocks, restarted it with a -reindex: m_blockman.m_block_index contains the previous 25921 blocks only, so m_chainman.AssumedValidBlock() is not found among them, so script validation gets turned on at height=1 (same behavior as on master).

Did you do this on mainnet? That would be because -reindex deletes the block tree db, but you don't have the full blocks for most of the headers yet. So after reindexing all of the block files you have, you don't have the headers for later blocks, so assumevalid is not applied.

This is a general issue if you do reindex with a datadir that doesn't have blocks above the minchainwork threshold in it - this PR doesn't solve this.

@DrahtBot DrahtBot removed the CI failed label Dec 6, 2024
Copy link
Contributor

@TheCharlatan TheCharlatan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concept ACK - I was somehow under the impression that this was done on purpose to guarantee the validity of the already done work before continuing, but could never come up with a really good reason of why this should be the case - and I guess there really isn't one!

Reading through this code again recently makes me wish we'd have better handling for the genesis block in general. It feels weird that it is only activated in the first place during a fresh startup once ImportBlocks is called. It is also inconvenient for kernel users, since they need to manually call ActivateBestChain on startup (like done in bitcoin-chainstate). I wonder if it would be better to move loading it into load chainstate, for example like I did here (though that patch is pretty hacky with hardcoding the flat file position and making the registered validation interfaces miss the first notification and has some other implications).

Copy link
Contributor

@TheCharlatan TheCharlatan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK 570af72

@ryanofsky
Copy link
Contributor

Concept ACK. These changes seem clearly better than current code, though the fix here seems like a workaround for a workaround for a workaround. IIUC, the original problem leading to this complexity was a crash in some code that assumed the genesis block existed, so a wait was added in #5243 to prevent that crash (first workaround instead of just making code handle the state and not crash). Then this wait apparently prevented gui from showing reindex progress so an ActivateBestChain call was added in #7917 to connect the genesis block and bypass the wait (second workaround, which can connect too many blocks after reindexing is interrupted or when -loadblock is used, leading to this bug). Now a third workaround is being implemented to prevent connecting too many blocks in the typical case when reindexing is interrupted and blocks are written in expected order with genesis first. But this is still a workaround, not a solid fix because other blocks besides the genesis block can still be connected in other cases.

It seems to me like things would be simpler and more reliable if we could delete the wait and delete the ActivateBestChain call, but as long as they need to exist, this fix seems strictly better than the current code.

Am wondering if there is some way to write a functional test for this PR. It'd be hard to write a test that interrupts reindexing at the right time, but maybe there could be a test that uses two -loadblock arguments to load the same blocks multiple times and makes sure all the blocks except the genesis blocks are only connected together at the very end.

@mzumsande
Copy link
Contributor Author

Thanks for digging up the history!

Am wondering if there is some way to write a functional test for this PR. It'd be hard to write a test that interrupts reindexing at the right time, but maybe there could be a test that uses two -loadblock arguments to load the same blocks multiple times and makes sure all the blocks except the genesis blocks are only connected together at the very end.

I don't completely understand this - do you mean -loadblock with specially crafted blockfiles in which block appear in a certain order?

But this is still a workaround, not a solid fix because other blocks besides the genesis block can still be connected in other cases.

Do you know any concrete examples for the latter? I was thinking that this was more of a solid fix because

  • ActivateBestChain() cannot connect blocks unless they are in the block tree db.
  • we cannot add blocks to the block tree db unless they connect to genesis (LoadExternalBlockFile detects out-of-order blocks and saves them for later).

So, if we call ABC directly after calling AcceptBlock() for Genesis, it should be impossible for it to connect any other blocks than genesis because no other headers can be in the block tree db - even in some theoretical scenario where the order within the block file is messed up such that Genesis is not at its beginning.

That being said, I agree it would be nice if we could remove this complexity, but I think it can only be done together with removing the wait-for-genesis in init - which means making sure that nothing bad happens when we don't have a genesis block after startup.

@mzumsande
Copy link
Contributor Author

mzumsande commented Dec 10, 2024

So it looks me like there are several ideas for further improvements in this general area:

  1. Consider enabling assumevalid during reindex always - in case of a crash during IBD when we haven't downloaded blocks above the assumevalid threshold, we currently don't apply assumevalid since we don't have headers above minchainwork but we also won't request headers from peers until the (slow) reindex has finished.
  2. Call ActivateBestChain during first startup with an empty datadir earlier, not only in ImportBlocks (main challenge: -reindex must still work).
  3. Remove the genesis-specific ActivateBestChain() call - possibly together with the wait-for-genesis condition in init.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants