Ensure schema is consistent while bootstrapping #583

Open
wants to merge 16 commits into base: palantir-cassandra-2.2.18

Conversation

@rhuffy (Contributor) commented Nov 20, 2024

This PR addresses a potential edge case where the schema is updated while bootstrapping. The following changes are made:

  • Ensure that the local schema version matches the versions for all peers reported by the Gossiper before bootstrapping.
  • After creating bootstrap streams, ensure that the local schema version has not changed and is not in the process of changing.

This PR makes no change to the logic here: 4c831cc

Making that change would not work for the first node in a new cluster.

@@ -1048,6 +1060,18 @@ private void joinTokenRing(int delay, boolean autoBootstrap, Collection<String>
unsafeDisableNode();
throw new BootstrappingSafetyException("Bootstrap streaming failed.");
}

if(!localSchemaVersion.equals(Schema.instance.getVersion().toString()) || !isSchemaConsistent(localSchemaVersion))
Contributor

i wonder if we are being too restrictive by requiring this. as far as i can tell, it's actually okay for the schema to change after bootstrap has started, as long as the joining node is receiving forwarded writes + the new keyspace/columnfamily. this is already required for correctness in normal bootstraps (without schema changes), so i'm not sure we should be more strict here.

maybe the more "correct" check here is that other nodes see the joining node and are forwarding writes (presumably if the joining node is in their gossip, this is true). i think the current code already tries to do this via sleeping for 30 seconds to let gossip work and checking that gossip has seen the seeds

            Gossiper.instance.addLocalApplicationStates(states);
            setMode(Mode.JOINING, "sleeping " + RING_DELAY + " ms for pending range setup", true);
            Uninterruptibles.sleepUninterruptibly(RING_DELAY, TimeUnit.MILLISECONDS);
[...]

        if (!Gossiper.instance.seenAnySeed())
            throw new IllegalStateException("Unable to contact any seeds!");

but doesn't appear to guarantee that all writes are forwarded

@rhuffy (Contributor Author), Nov 21, 2024

I think you're right that this check is more restrictive than it needs to be.

@@ -83,6 +86,11 @@ public ListenableFuture<StreamState> bootstrap(StreamStateStore stateStore, bool
streamer.addRanges(keyspaceName, strategy.getPendingAddressRanges(tokenMetadata, tokens, address));
}

if (!initialLocalSchemaVersion.equals(Schema.instance.getVersion()) || !MigrationManager.isReadyForBootstrap())
Contributor

i think the order of this check is actually important: MigrationManager.isReadyForBootstrap() needs to be evaluated (and return true) before initialLocalSchemaVersion.equals(Schema.instance.getVersion()) is evaluated. the way it's written now, it's possible for a partially applied schema to complete in between the two checks and not throw here
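
in other words, something like this (just a sketch; the reordered condition is the point, and the exception type is borrowed from the joinTokenRing diff above purely for illustration):

    // Evaluate isReadyForBootstrap() first: otherwise a partially applied schema
    // could finish between the version comparison and the readiness check, and the
    // change would go unnoticed.
    if (!MigrationManager.isReadyForBootstrap()
        || !initialLocalSchemaVersion.equals(Schema.instance.getVersion()))
    {
        throw new BootstrappingSafetyException("Schema changed while creating bootstrap streams.");
    }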

Contributor

also what's the reason for putting these checks in Bootstrapper as opposed to StorageService#bootstrap?

Contributor Author

The original idea was to put the second check in the callback for handling a PREPARE, but I decided that it was not necessary.

Moved the checks to StorageService.

@@ -966,6 +962,15 @@ private void joinTokenRing(int delay, boolean autoBootstrap, Collection<String>
setMode(Mode.JOINING, "waiting for schema information to complete", true);
Uninterruptibles.sleepUninterruptibly(1, TimeUnit.SECONDS);
}

while(!isSchemaConsistent())
Contributor

i don't see a reason not to just combine this with the previous check and have it be

            while (!MigrationManager.isReadyForBootstrap() || !isSchemaConsistent())
            {
                setMode(Mode.JOINING, "waiting for schema information to complete... local schema version is " + Schema.instance.getVersion().toString(), true);
                Uninterruptibles.sleepUninterruptibly(1, TimeUnit.SECONDS);
            }

Contributor

this also means we can remove the above // first sleep the delay to make sure we see all our peers, right?

Contributor Author

I've decided to switch the first loop to just a wait with no short circuit condition

@d-guo (Contributor) commented Nov 26, 2024

nice! i think these checks are sufficient to guard us against the theoretical edge case we talked about. as discussed, let's hold off on merging this "fix" until we have a repro of the edge case to make sure we aren't missing something here

@d-guo (Contributor) commented Nov 26, 2024

does this affect spinning up zombie or new clusters?

}
Uninterruptibles.sleepUninterruptibly(1, TimeUnit.SECONDS);
}
Uninterruptibles.sleepUninterruptibly(delay, TimeUnit.MILLISECONDS);
Contributor

i think we might as well just remove this given the below

Contributor Author

I think having some delay here is still important, because we know from experience that you should wait at least some amount of time before proceeding. The intention of this change is to make the bootstrapping node wait:

  • at least 30s, even if it thinks the schema is already consistent
  • longer than that if the schema does not yet appear consistent

A bad scenario would be if we get the non-empty schema by gossiping with a node in a network partition, think we have gossiped with everyone because we haven't waited long enough, and then pass the subsequent checks.
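
Concretely, the intended waiting behavior looks roughly like this (a sketch only; it assumes delay carries the 30s minimum and isSchemaConsistent is the helper added in this PR):

    // Always wait out the minimum delay, even if the schema already looks consistent;
    // a peer on the wrong side of a partition could make us look consistent prematurely.
    Uninterruptibles.sleepUninterruptibly(delay, TimeUnit.MILLISECONDS);

    // Then keep waiting until migrations have settled and every peer in gossip
    // reports the same schema version as ours.
    while (!MigrationManager.isReadyForBootstrap() || !isSchemaConsistent())
    {
        setMode(Mode.JOINING, "waiting for schema information to complete", true);
        Uninterruptibles.sleepUninterruptibly(1, TimeUnit.SECONDS);
    }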

Contributor

does that risk exist even if we wait for isSchemaConsistent? in that case, we should make our check more strict instead of just relying on a 30 second wait to hope it's safe

private static boolean isSchemaConsistent()
{
    String localSchemaVersion = Schema.instance.getVersion().toString();
    Set<Entry<InetAddress, EndpointState>> endpointStates = Gossiper.instance.getEndpointStates();
Contributor

nit: return Gossiper.instance.getEndpointStates().stream().map(Entry::getValue).map(entry -> entry.getApplicationState(ApplicationState.SCHEMA).value).allMatch(Schema.instance.getVersion().toString()::equals);

Contributor Author

👍

String localSchemaVersion = Schema.instance.getVersion().toString();
return Gossiper.instance.getEndpointStates().stream()
        .map(entry -> entry.getValue().getApplicationState(ApplicationState.SCHEMA).value)
        .allMatch(localSchemaVersion::equals);
Contributor

ah wait, doesn't this check cause problems because the endpoint states can include LEFT nodes?

Contributor Author

I've made a change to exclude LEFT nodes, and also gracefully handle cases where ApplicationStates are null
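
The updated helper looks roughly like this (a sketch, not necessarily the exact code in the commit; the LEFT check via VersionedValue.STATUS_LEFT and treating a missing SCHEMA state as inconsistent are assumptions):

    private static boolean isSchemaConsistent()
    {
        String localSchemaVersion = Schema.instance.getVersion().toString();
        for (Map.Entry<InetAddress, EndpointState> entry : Gossiper.instance.getEndpointStates())
        {
            EndpointState state = entry.getValue();
            VersionedValue status = state.getApplicationState(ApplicationState.STATUS);
            // Skip nodes that have left the ring; their schema version no longer matters.
            if (status != null && status.value.startsWith(VersionedValue.STATUS_LEFT))
                continue;
            VersionedValue schema = state.getApplicationState(ApplicationState.SCHEMA);
            // A peer with no gossiped schema version is treated as not yet consistent (assumption).
            if (schema == null || !localSchemaVersion.equals(schema.value))
                return false;
        }
        return true;
    }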

@rhuffy (Contributor Author) commented Nov 27, 2024

does this affect spinning up zombie or new clusters?

Confirmed from testing that this does not impact spinning up new clusters. On the first startup of seed nodes, this code path is not exercised, because seed nodes do not bootstrap.
