Large changesets over slow connections result in uncontrollable usage-metered long-polling loops #2683
The issue you're encountering here is limbo resolution performed by the client. As you've noticed, the Firestore server does not remember deletes indefinitely, and a client that has been disconnected for long enough will simply see deleted documents omitted from the result set rather than receiving delete notifications. Limbo resolution is designed to counter this limitation: if the client has documents in its cache that match your query but the server isn't returning them, then the cache must be wrong and needs fixing.

The underlying problem is that when the client listens to a query, it first emits a snapshot based on the contents of its cache. Once the server has returned results for the query, the client emits another snapshot with those results, marked as in sync. Without limbo resolution, after some other client modifies a document so that it no longer matches the query, the first snapshot (from cache) will always include it and the second snapshot will not. Especially on slower connections this flicker is obvious, and it won't self-correct. The extra read performed by limbo resolution brings the cache up to date, so the local document no longer matches the query and won't end up in the first snapshot.

Note that in the general case it's not enough to just delete limbo documents, because even though a document might not match this query, it might match another one. There are additional complications involving concurrent remote modifications while a limbo resolution is in flight; these cause us to implement limbo resolutions as listens, and the client doesn't consider itself done with limbo resolution until all of them are in sync.

So what's happening in your case is that over slow connections, performing 15k document lookups takes a long time. It might take long enough that most clients never actually finish the process. If they quit and come back, the limbo resolution process starts over. I suspect this is why the number of reads is so large.

In your reproduction (by the way, thanks for this, it's super helpful), your client is listening to a whole collection. Is that how your application actually works? That is, you're not using any where clauses or order bys? If so, you're in luck: there's a pretty simple fix we can make. Currently the client performs limbo resolution for all queries, but full collection queries are a case where the client can delete limbo documents outright without asking.

If you are using where clauses or order bys, then there's no simple fix for this. We'll need to think about and test methods for limiting the number of active limbo resolutions, or otherwise guaranteeing that clients can make progress without restarting in these kinds of pathological cases. We might also look into how we might adjust the algorithm to not use listens. However, these aren't easy fixes. In the meantime a workaround might be to clear persistence for this particular client, though sending this kind of instruction remotely is tricky (since you only want to clear once).
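To make the two snapshots described above concrete, here is a minimal sketch (not from the original thread) of observing them with the web SDK's snapshot metadata; `someCollection` and `handleDocs` are placeholder names, not anything from this issue:

```js
// Illustrative sketch only: observe the cache-first snapshot followed by the
// server-confirmed snapshot described above.
const db = firebase.firestore();

db.collection("someCollection").onSnapshot(
  { includeMetadataChanges: true },
  (snapshot) => {
    if (snapshot.metadata.fromCache) {
      // First emission: served from the local cache. It may still contain
      // documents the server has since deleted -- the source of the flicker.
      handleDocs(snapshot.docs, /* fromCache= */ true);
    } else {
      // Second emission: the server's results, marked as in sync.
      handleDocs(snapshot.docs, /* fromCache= */ false);
    }
  }
);
```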
Hi @wilhuff. Thanks for the response -- I was expecting one by tomorrow at the earliest. Unfortunately, our actual application does use inequality queries, so the proposed fix won't work for us. Something I've been wondering about is whether it would be possible to communicate to the application that limbo resolution is failing.

This is something of a time-sensitive issue for us. We've already lost a non-trivial amount of money to this bug this year. The faster we can deploy a mitigation, the better.
Can you show an example of the query you're actually running against this collection? Are there any other queries you run against this collection? Another potential fix is to delete documents that are in limbo but have only ever matched a single query.

The simplest mitigation right now is to clear persistence for affected clients. Something like this:

```js
let db = ...;

// Clear persisted data once per affected client, and remember that we did so.
// clearPersistence() must run before the Firestore instance is otherwise used.
if (isAffectedClient() && window.localStorage.getItem("persistence.cleared") == null) {
  await db.clearPersistence();
  window.localStorage.setItem("persistence.cleared", "1");
}
```

I've also put together a possible way to get the client to delete all limbo documents for itself: https://github.com/firebase/firebase-js-sdk/compare/wilhuff/limbo-hack, essentially forcing a delete-everything-in-limbo policy. I'm not convinced this is the right way to implement that, though, and need to discuss it with the team. If you wanted, you could build it using the instructions in our readme and kick the tires on it.

I'm currently looking more closely at your repro to see what's going on here. The initial long interval makes sense to me, but the loop doesn't seem right.
OK, I've figured out what's going on with your 270 second delay and 70 second loops.

With the "Slow 3G" setting, the initial 270 seconds is what it takes to download the new 15k documents. The client writes these out and notifies the listener. The client then starts the limbo resolution process. Under the "Slow 3G" preset it takes about 10 seconds to send all 15k requests. The client then receives a resource-exhausted error from the backend, indicating that it's not going to let a single client listen to 15k new things at once. The client's behavior when it encounters a resource-exhausted error is to wait with maximum backoff, which comes out to about 60 seconds.

It turns out that slowing the connection down isn't even necessary to reproduce this loop, since the resource-exhausted error and the resulting backoff happen regardless of connection speed. Next steps are to look into how to add throttling to the limbo resolver.
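For illustration only (this is not the SDK's actual implementation), a sketch of the capped backoff behavior described above, where a resource-exhausted error forces a wait at the maximum delay of roughly 60 seconds; `sendAllLimboListens` is a hypothetical helper standing in for the ~10 seconds of limbo listen traffic:

```js
// Illustrative sketch, not SDK internals: a resource-exhausted error jumps
// straight to the maximum backoff (~60s), which plus ~10s of request traffic
// yields the ~70-second loop observed in the repro.
const BASE_DELAY_MS = 1_000;
const MAX_DELAY_MS = 60_000;

function nextDelayMs(attempt, resourceExhausted) {
  if (resourceExhausted) {
    return MAX_DELAY_MS; // skip straight to the maximum backoff
  }
  return Math.min(MAX_DELAY_MS, BASE_DELAY_MS * 2 ** attempt);
}

async function retryLoop(sendAllLimboListens) {
  for (let attempt = 0; ; attempt++) {
    const result = await sendAllLimboListens(); // hypothetical helper
    if (result.ok) return;
    const delay = nextDelayMs(attempt, result.code === "resource-exhausted");
    await new Promise((resolve) => setTimeout(resolve, delay));
  }
}
```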
One thing I just noticed here is that your reproduction doesn't have persistence enabled. Is your production app similarly configured? If so, one of the easier solutions might actually apply.
@djsweet In addition to the technical discussion here, you should open a regular support case so that we can credit you for the excessive reads. In it, please document when this large deletion happened, reference this GitHub issue, and note that you're working with me ([email protected]). From there I'll be able to escalate this to the right people.
Hi again @wilhuff. My apologies for the delay in responding; yesterday was quite hectic for us.

As some background on what our listeners are doing: we are listening to rows in virtual tables, specific per app, that developers create on our platform. The target collection names all take the form `/data-store/{app}/tables/{table}/rows`. We use a query 99.9% of the time in this listener, but sometimes have to fall back to a bare listen.

Our app data is also stored in Cloud Storage to improve the performance of initial app loading, and all changes to the data are tracked by a monotonically increasing counter. Every time a document is affected by a change, its own copy of this counter increases. So what we often do looks like:

```js
firebase
  .firestore()
  .collection("/data-store/{app}/tables/{table}/rows")
  .where("version", ">", blobVersion)
  .onSnapshot(handler)
```

This avoids the network overhead of downloading the entire table again when we've already realized it via a blob in Cloud Storage, while still letting us report updates as they happen. As you saw in the reproduction project, we do not use Firestore persistence.

We really appreciate the offer to help us out with GCP support on the financial side of this. Thanks!
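To make the blob-plus-counter pattern above concrete, here is a sketch under stated assumptions: `loadTableBlob`, `applyRows`, and `applyChange` are hypothetical helpers, not names from this issue.

```js
// Sketch of the pattern described above. The bulk of the table comes from a
// pre-built blob in Cloud Storage; Firestore only streams rows whose version
// counter is newer than the blob's.
async function watchTable(app, table) {
  // Hypothetical: downloads the pre-rendered table and its version counter.
  const { rows, version: blobVersion } = await loadTableBlob(app, table);
  applyRows(rows);

  return firebase
    .firestore()
    .collection(`/data-store/${app}/tables/${table}/rows`)
    .where("version", ">", blobVersion)
    .onSnapshot((snapshot) => {
      // Only rows changed since the blob was built arrive here.
      snapshot.docChanges().forEach((change) => applyChange(change));
    });
}
```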
Update: our plan to address this issue is to follow two tracks.

On the first track, we're currently testing a small change that disables limbo resolution. You should be able to use this version of the SDK temporarily and it should address the immediate concern.

The second track involves more fundamental changes to limbo resolution so that these kinds of events can't cause this kind of problem. We believe that this circumstance is one we could automatically avoid. Additionally, we'll be adding limits and throttling so that even if we somehow determine that a huge number of limbo documents do need to be resolved, the client won't fail.
While we've been working on coming up with a build that disables limbo resolution, it occurred to me that in your specific circumstances there's a potentially much easier workaround. The core of the idea is to keep track of the latest version seen (from the blob or Firestore snapshots). When the app loses network connectivity, stop your listener. Then, when there's a chance there's a network connection again, start the listener from the latest version you've seen.

This works because when persistence is disabled, Firestore only caches active queries and pending writes. If you stop listening, the client will forget about the query and any documents that matched it (that don't match some other active query). When you start the listener later, there's no chance for limbo resolution because there are no cached documents to get into limbo.

A potential way to implement this would be to subscribe to the window's online and offline events:

```js
let unsubscribe = null;

function setupListener() {
  const db = firebase.firestore();
  unsubscribe = db.collection(targetCollection).onSnapshot(...);
}

function networkEventListener(e) {
  if (navigator.onLine) {
    // Back online: re-attach the listener if it isn't already running.
    if (unsubscribe == null) {
      setupListener();
    }
  } else {
    // Offline: detach the listener so Firestore forgets the query and its
    // cached documents, leaving nothing to fall into limbo later.
    if (unsubscribe != null) {
      unsubscribe();
      unsubscribe = null;
    }
  }
}

window.addEventListener("online", networkEventListener);
window.addEventListener("offline", networkEventListener);
```

Instead of relying only on the online/offline events, another signal you might use is the page's visibility.

Let me know if this addresses the immediate problem. We'll continue working on making limbo resolution smarter, but making a build that disables limbo resolution has proven far more difficult than is really practical for a short-term mitigation like this.
@wilhuff My apologies for missing your replies up to this point. It's interesting that you suggested shutting off network connectivity based on visibility. We actually did develop a mechanism for disabling the Firestore network connection when our app isn't visible, thinking it would be a valid mitigation for the issue. We don't re-establish listeners yet, but that may not be too hard for us.

Today, we hit an unexpected read loop for approximately an hour. This is the first time the issue hasn't persisted, so I think it's safe to say that our partial mitigation is working. We'll look into re-establishing listeners on visibility and network changes as well.
OK, that sounds promising. Let us know how that goes. In the meantime we're getting closer to having a working throttler for limbo resolution and are making progress on design changes that mitigate specific circumstances like yours. I'm going to leave this open until we ship a throttler that avoids the loop you saw. If you see further issues related to this, don't hesitate to post more.
Limbo resolution throttling was implemented in all three client SDKs 6 months ago:
I think we just forgot to close this ticket. I'm closing it, but feel free to re-open or add a comment if the issue is not resolved by limbo resolution throttling.
[REQUIRED] Describe your environment
[REQUIRED] Describe the problem
We have encountered sporadic periods of quadrupled document read rates since early January 2020. We have narrowed the cause down to pathological behavior in the Firebase SDK when dealing with large changesets of 15,000 deleted documents and 15,000 added documents over slow Internet connections.
Clients with slow Internet connections will enter an uncontrollable, endless long-polling loop when these changes are encountered in snapshot listeners. The document additions seem to be reported to SDK consumers, but document removals are not. Each of these long-polling sessions counts the entire deleted document count against our read quotas, even though nothing is being passed to the SDK consumer.
Steps to reproduce:
I have prepared a sample project that simulates the exact conditions necessary to trigger this bug. A "server" script deletes and then creates 15,000 documents against a Firestore database, and a client attempts to react to these changes. The client's network connection is throttled using the Slow 3G preset in the Google Chrome Developer Console.
We see an initial HTTP connection that takes 270s to complete, transmitting 13.2 MB of data, which results in the client reporting all of the document additions but none of the document removals.
After this, we enter a state where follow-up HTTP connections taking 70 seconds each are constantly retried. Each connection transmits 3.3 MB of data, but the consumer of the SDK is not passed any information regarding changes.
After running this client for an hour with no actual database changes, we see that we will be charged for 800,000 document reads.
Because nothing has actually changed in the database during this period, and we are not actually able to observe changes from the client, we expect to not be charged for any reads.
Relevant Code:
See https://github.com/djsweet/firestore-read-spike-bug-report for an environment which reproduces this problem.
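For reference, a sketch of the kind of server-side churn the reproduction performs (delete 15,000 documents, then create 15,000 new ones). This is not the actual repro code, which lives in the repository linked above; it uses the Node.js Admin SDK with hypothetical names (`COLLECTION`, `churn`).

```js
// Sketch only: delete then recreate 15,000 documents in batches of 500,
// which is Firestore's per-batch write limit.
const admin = require("firebase-admin");
admin.initializeApp();
const db = admin.firestore();

const COLLECTION = "rows"; // hypothetical collection name
const DOC_COUNT = 15000;
const BATCH_SIZE = 500;

async function churn() {
  // Delete every existing document in the collection.
  const existing = await db.collection(COLLECTION).listDocuments();
  for (let i = 0; i < existing.length; i += BATCH_SIZE) {
    const batch = db.batch();
    existing.slice(i, i + BATCH_SIZE).forEach((ref) => batch.delete(ref));
    await batch.commit();
  }

  // Create a fresh set of documents.
  for (let i = 0; i < DOC_COUNT; i += BATCH_SIZE) {
    const batch = db.batch();
    for (let j = i; j < Math.min(i + BATCH_SIZE, DOC_COUNT); j++) {
      batch.set(db.collection(COLLECTION).doc(), { version: j });
    }
    await batch.commit();
  }
}

churn().catch(console.error);
```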