Fix potential race condition in node-relay #1716
Merged
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
We previously relied on
context.child
to check whether we already had a relay handler for a givenpayment_hash
.Unfortunately this can return an actor that is currently stopping itself.
When that happens our relay command can end up in the dead letters and the payment will not be relayed, nor be failed upstream.
We fix that by maintaining the list of current relay handlers in the
NodeRelayer
and removing them from the list before stopping them. This is similar to what's done in theMultiPartPaymentFSM
.This bug has been seen in the wild by the Electrum team with an eclair trampoline node that they run. Phoenix is not affected by this, because this bug only happens when the sender uses trampoline with multiple incoming HTLCs, but without letting the trampoline node aggregate the incoming MPP.
In their case, the
NodeRelayer
receives for example 2 HTLCs with the samepayment_hash
, but each HTLC setstotal_amount_msat
to itsamount_msat
(so the trampoline node cannot aggregate them and must relay them separately).If the first HTLC fails downstream slightly before the second HTLC reaches our node, the
NodeRelayer
thinks it has a handler for thepayment_hash
of the second HTLC but in fact it's stopping itself, and when we send itNodeRelay.Relay
with the second HTLC the actor is gone and we don't correctly fail the HTLC upstream.