
Firestore Android gets stuck permanently after a large volume of mutations #5417

Closed
fanwgwg opened this issue Oct 13, 2023 · 11 comments · Fixed by #5585
@fanwgwg

fanwgwg commented Oct 13, 2023

[READ] Step 1: Are you in the right place?

Issues filed here should be about bugs in the code in this repository.
If you have a general question, need help debugging, or fall into some
other category use one of these other channels:

  • For general technical questions, post a question on StackOverflow
    with the firebase tag.
  • For general Firebase discussion, use the firebase-talk
    google group.
  • For help troubleshooting your application that does not fall under one
    of the above categories, reach out to the personalized
    Firebase support channel.

[REQUIRED] Step 2: Describe your environment

  • Android Studio version: Android Studio Giraffe | 2022.3.1 Patch 2
  • Firebase Component: Firestore
  • Component version: com.google.firebase:firebase-bom:32.3.1

[REQUIRED] Step 3: Describe the problem

During our development (our app has been in production with Firestore for nearly a year, so we're very familiar with the Firestore Android SDK), we spotted a rare issue we had never experienced before: Firestore gets stuck permanently after a large volume of mutations on the client side. Although this was our first time experiencing the issue, after days of debugging we couldn't find any reason why it was getting stuck. Here's what happened:

  1. We're using the same device that we've been using constantly for app development throughout the last year.
  2. We're certain that there was no internet connection issue. Our app is designed to work both offline and online, so we're very familiar with Firestore and know that its Task won't complete without an internet connection, because a write can only complete once it is committed on the server.
  3. After a large volume of mutations, in a rough sequence of inserting thousands of documents -> deleting thousands of documents -> inserting thousands of documents -> etc. (about ~10 times), Firestore got stuck. We waited for more than 1 day (24 hours), but Firestore still couldn't complete its operations. We verified that it was stuck by issuing a dummy call to Firestore (deleting a non-existent document); this Task never completed, failed, or was cancelled.
  4. While Firestore was stuck, we used the Profiler inside Android Studio and could see that FirestoreWorker was running at 100% of the CPU time of its own thread.
  5. We kept the device connected, charged, and idle for 3 days, but the dummy call (deletion of a non-existent document) still couldn't complete.
  6. We tried restarting the app many times, but it never recovered from the "stuck" state.
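The "dummy call" probe in step 3 generalizes to any single-threaded worker: submit a cheap no-op and bound the wait, so a healthy pipeline answers quickly while a wedged one surfaces a timeout instead of hanging the caller forever. Below is a minimal sketch using plain java.util.concurrent as a stand-in for the real Firestore Task (on Android you could similarly bound the wait on the actual Task, e.g. with Tasks.await and a timeout); StuckProbe and its methods are hypothetical names, not SDK API:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class StuckProbe {
    /** Runs a no-op probe on the worker, bounded by timeoutMs. */
    public static boolean isHealthy(ExecutorService worker, long timeoutMs) {
        // The no-op plays the role of "delete a non-existent document".
        Future<?> probe = worker.submit(() -> { /* cheap no-op */ });
        try {
            probe.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (Exception e) {
            return false; // timed out (or failed): worker is likely wedged
        }
    }

    /** An idle worker answers the probe well within the timeout. */
    public static boolean demoHealthy() {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        boolean healthy = isHealthy(worker, 5_000);
        worker.shutdownNow();
        return healthy;
    }

    /** A worker busy with a long task never answers, so the probe times out. */
    public static boolean demoStuck() {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        worker.submit(() -> {
            try { Thread.sleep(60_000); } catch (InterruptedException ignored) { }
        });
        boolean healthy = isHealthy(worker, 200);
        worker.shutdownNow(); // interrupts the sleeper so the JVM can exit
        return healthy;
    }

    public static void main(String[] args) {
        System.out.println(demoHealthy()); // true
        System.out.println(demoStuck());   // false
    }
}
```

The key point matches the report: a probe that neither completes nor fails within a generous timeout indicates the worker thread itself is stuck, not the individual operation.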

As of right now, we've already reinstalled our app (yesterday) to recover from this issue. However, while we were debugging, we dumped the Firestore database at that time, because I thought it might help Firestore engineers investigate. Here's the link to the DB. The DB itself is gigantic, over 1 GB in size; our app does not override any of Firestore's persistence settings, so everything about offline persistence should be the default.

We filed a support case with Firebase Support last week (case number 10250473); however, they said they were unable to spot any issue on the server side and suggested contacting the client SDK team directly.

Steps to reproduce:

We have never encountered the issue since. Similarly, during our ~1 year of development with the Firestore Android SDK, that was the only time we hit it. We believe it could be a rare case; however, when the issue does happen, it is SERIOUS, because the only way out is to uninstall and reinstall the app. Therefore, we kindly ask the Firestore team to take a look at this issue.

Relevant Code:

Explained above

@milaGGL
Contributor

milaGGL commented Oct 13, 2023

Hi @fanwgwg, thank you for reporting this issue. I am unable to download the DB via the link provided. Could you please check if there are any abnormal transactions that try to mutate the same document at the same time, or anything else that catches your eye?

I wonder if this was caused by your persistence settings plus the large volume of mutations. But without reproducible code, it is hard to tell what the root cause is.

@fanwgwg
Author

fanwgwg commented Oct 14, 2023

@milaGGL The link should be valid, as I'm able to download it from Incognito mode (unless you meant that it cannot be virus-scanned by Google Drive due to the file being too big, as shown in the screenshot).
[Screenshot: Google Drive warning that the file is too large to scan for viruses]

For persistence settings, our app does not override any Firestore's persistence settings, so everything about offline persistence should be the default.

For abnormal transactions, we did not identify any. That's also why I dumped the Firestore database when the "stuck" happened: to send it to Firestore engineers for further inspection in case there is anything mysterious. For our part, we never reproduced the issue again, so it's difficult for us to tell whether anything was abnormal.

Regarding this:

check if there are any abnormal transactions that trying to mutate a same document at the same time

I'm unable to tell whether this has happened, as we never encountered the same issue during our ~1 year of development with Firestore Android, so this must be a rare case or race condition. However, I recall that all mutations in the Firestore Android SDK are enqueued sequentially:

/** A helper class that allows to schedule/queue Runnables on a single threaded background queue. */

Therefore, even if there are two mutations to the same document at the same time, they should never run in parallel, meaning that the following code should not have any issues:

var bgExecutor = MoreExecutors.listeningDecorator(Executors.newFixedThreadPool(4));
var docRef = db.collection("test").document("a");
for (int i = 0; i < 1000; i++) {
   bgExecutor.submit(() -> docRef.set(ImmutableMap.of("field_1", "value_1")));
   bgExecutor.submit(() -> docRef.delete());
   bgExecutor.submit(() -> docRef.set(ImmutableMap.of("field_1", "value_2")));
}

Please correct me if I'm misunderstanding this.
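The serialization guarantee being relied on here can be demonstrated without Firestore at all: a single-threaded executor (analogous to the SDK's FirestoreWorker / AsyncQueue) runs every submitted Runnable in order, even when callers submit from many threads concurrently. A minimal sketch, with the plain executor as a stand-in for the SDK's queue (SerializedQueueDemo is a hypothetical name):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SerializedQueueDemo {
    /**
     * Submits `count` unsynchronized increments from a 4-thread pool onto a
     * single-threaded queue; returns the final counter value.
     */
    public static int submitConcurrently(int count) throws Exception {
        ExecutorService queue = Executors.newSingleThreadExecutor(); // like FirestoreWorker
        ExecutorService callers = Executors.newFixedThreadPool(4);
        int[] counter = {0}; // only ever touched on the queue thread
        for (int i = 0; i < count; i++) {
            callers.submit(() -> queue.submit(() -> counter[0]++));
        }
        callers.shutdown();
        callers.awaitTermination(30, TimeUnit.SECONDS); // all submissions done
        queue.shutdown();
        queue.awaitTermination(30, TimeUnit.SECONDS);   // all increments done
        return counter[0]; // no lost updates: tasks never ran in parallel
    }

    public static void main(String[] args) throws Exception {
        System.out.println(submitConcurrently(1000)); // prints 1000
    }
}
```

Because the increments all run on the one queue thread, the unsynchronized counter still ends at exactly 1000, which is the same reason concurrent set/delete calls on one document never interleave inside the SDK.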

@milaGGL
Contributor

milaGGL commented Oct 17, 2023

@fanwgwg, yes, that is correct. Quick question: have there been any recent updates to your app, like upgrading to a newer version of the SDK?

@fanwgwg
Author

fanwgwg commented Oct 18, 2023

@milaGGL We've been following the latest Firebase BOM releases; we typically update to the latest version within 1 week of its release. The latest Firebase Android BOM update was 32.3.1 on September 15, 2023, and we updated to it on September 16.

When this issue happened (around Sep 20-ish), we were already using the 32.3.1 release.

@fanwgwg
Author

fanwgwg commented Oct 26, 2023

@milaGGL I've just encountered another case of the "stuck" today. This time I enabled Firestore logging; here's the log output from app start to when Firestore got stuck: Google Drive Link

One thing I found is that each time the "stuck" happens, the log ends with these two lines. After these two lines, no matter how long you wait (hours or days), Firestore no longer prints any logs. Any further Task enqueued to Firestore will never complete, nor will it fail or be cancelled; it just stays in a waiting state.

(24.9.0) [Persistence]: Starting transaction: Collect garbage
(24.9.0) [LruGarbageCollector]: Capping sequence numbers to collect down to the maximum of 1000 from 1059

@fanwgwg
Author

fanwgwg commented Oct 26, 2023

@milaGGL I think I've found something that might be very close to the cause: there is an infinite loop running in this method: SQLiteLruReferenceDelegate#removeOrphanedDocuments.

I've attached a screenshot of debugging at breakpoints inside the method. As you can see, upperBound = 19923, but on each pass through the while loop, docsToRemove = 0, so the loop runs indefinitely.

[Screenshot: debugger paused inside removeOrphanedDocuments, showing upperBound = 19923 and docsToRemove = 0]
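To make the symptom concrete, here is a self-contained model of a non-resuming batch loop (a hypothetical illustration, not the SDK's actual code; buggyCollect, BATCH_SIZE, and the allPinned flag are stand-ins for the SDK's internals): when a full batch of documents is all pinned, the same query keeps returning the same batch, docsToRemove stays 0, and the loop never exits. A maxIters cap substitutes for the real infinite loop so the model can report non-termination.

```java
import java.util.ArrayList;
import java.util.List;

public class GcLoopModel {
    static final int BATCH_SIZE = 100; // stands in for REMOVE_ORPHANED_DOCUMENTS_BATCH_SIZE

    /**
     * Returns the number of documents removed, or -1 if the loop fails to
     * terminate within maxIters iterations (i.e. it would spin forever).
     */
    public static int buggyCollect(int totalDocs, boolean allPinned, int maxIters) {
        List<Integer> cache = new ArrayList<>();
        for (int i = 0; i < totalDocs; i++) cache.add(i);
        int removed = 0;
        for (int iter = 0; iter < maxIters; iter++) {
            // BUG: always queries from the start instead of resuming past
            // documents already examined in earlier iterations.
            List<Integer> batch =
                new ArrayList<>(cache.subList(0, Math.min(BATCH_SIZE, cache.size())));
            List<Integer> docsToRemove = new ArrayList<>();
            for (int key : batch) {
                if (!allPinned) docsToRemove.add(key); // models !isPinned(key)
            }
            cache.removeAll(docsToRemove);
            removed += docsToRemove.size();
            // The loop exits only when a batch comes back short of BATCH_SIZE...
            if (batch.size() < BATCH_SIZE) return removed;
            // ...but with >= BATCH_SIZE pinned docs, the same full batch repeats
            // forever and docsToRemove stays 0, exactly as in the screenshot.
        }
        return -1; // never terminated
    }

    public static void main(String[] args) {
        System.out.println(buggyCollect(50, false, 1000));   // prints 50: terminates
        System.out.println(buggyCollect(19923, true, 1000)); // prints -1: the hang
    }
}
```

With nothing pinned the loop drains the cache and stops; with everything pinned (as when the mutation queue contains every cached document) it loops forever.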

@milaGGL
Contributor

milaGGL commented Oct 26, 2023

@fanwgwg, this is amazing! I wonder if !isPinned(key) ever returns true, so that execution gets inside the if block that populates docsToRemove.
Would you be able to provide a minimal repro app?

@fanwgwg
Author

fanwgwg commented Oct 27, 2023

Yes, isPinned(key) always returns true, because mutationQueuesContainKey(key) always returns true.

Unfortunately I do not have a way to reproduce the issue consistently, except that it only happens when there is a large volume of mutations. Even though we often run such large-mutation tests on a daily basis, I've only encountered it twice so far.

In case anyone else encounters the same issue: our mitigation is to disable persistence, since our app doesn't use the persistence feature, i.e.,

    FirebaseFirestore.getInstance()
        .setFirestoreSettings(
            new FirebaseFirestoreSettings.Builder().setPersistenceEnabled(false).build());

@fanwgwg
Author

fanwgwg commented Nov 30, 2023

@milaGGL @ehsannas Friendly ping: is there any update on this issue?

While disabling persistence works for now, we don't like this mitigation, as it might prevent us from developing features that rely on offline persistence, one of Firestore's core features.

Just thinking out loud, based on my superficial understanding of the SDK, it seems that:

  1. Firestore runs garbage collection logic to remove orphaned documents from its persistent cache when the cache becomes too large.
  2. When removing orphaned documents, it looks up documents by sequence number (maybe looking for the oldest documents?) and tries to remove them. However, if a document is in the mutation queue, it is skipped.
  3. What if there is a case where the cache is small, but the mutation queue contains so many documents that all documents in the cache also happen to be in the mutation queue? Does this sound like what's happening here?

@wu-hui
Contributor

wu-hui commented Nov 30, 2023

@fanwgwg Thanks for the detailed investigation. I think you are right that there is a bug in removeOrphanedDocuments.

Your operations put the SDK's GC under pressure. It tries to go through all orphaned documents in batches (with batch size REMOVE_ORPHANED_DOCUMENTS_BATCH_SIZE), and it stops once it processes a batch whose size differs from REMOVE_ORPHANED_DOCUMENTS_BATCH_SIZE, since that means it has processed all the orphaned documents with sequence numbers in scope.

Typically, the number of orphaned documents does not exceed REMOVE_ORPHANED_DOCUMENTS_BATCH_SIZE, so the loop breaks after one iteration. The operations you performed pushed the loop into running multiple iterations.

Unfortunately, from the second iteration onwards, the SDK did not "resume" the query from iteration 1; instead it issued the same query as iteration 1, hence an infinite loop.

We will get this fixed ASAP.
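To illustrate the shape of such a fix (my own sketch, not the actual change that closed this issue; collect, BATCH_SIZE, and the isPinned predicate are hypothetical stand-ins for the SDK's internals): each iteration resumes the query after the last sequence number already examined, so even a batch made up entirely of pinned documents cannot repeat.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

public class ResumingGcModel {
    static final int BATCH_SIZE = 100; // stands in for REMOVE_ORPHANED_DOCUMENTS_BATCH_SIZE

    /**
     * Examines documents with sequence numbers in [0, upperBound) in batches,
     * resuming each query after the last sequence number already examined, so
     * the same batch can never repeat. Returns the count of documents removed.
     */
    public static int collect(int upperBound, IntPredicate isPinned) {
        int removed = 0;
        int resumeAfter = -1; // last sequence number already examined
        while (true) {
            List<Integer> batch = new ArrayList<>();
            for (int seq = resumeAfter + 1;
                 seq < upperBound && batch.size() < BATCH_SIZE; seq++) {
                batch.add(seq);
            }
            for (int seq : batch) {
                if (!isPinned.test(seq)) removed++; // remove the orphaned doc
            }
            if (batch.size() < BATCH_SIZE) return removed; // scope exhausted
            resumeAfter = batch.get(batch.size() - 1); // key difference: advance
        }
    }

    public static void main(String[] args) {
        // Every document pinned (all in the mutation queue): terminates, removes 0.
        System.out.println(collect(19923, seq -> true));  // prints 0
        // Nothing pinned: removes everything in scope.
        System.out.println(collect(19923, seq -> false)); // prints 19923
    }
}
```

With every document pinned, the loop now walks past the pinned batches and terminates with zero removals instead of spinning forever.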

@fanwgwg
Author

fanwgwg commented Nov 30, 2023

@wu-hui Thanks for the quick response! Looking forward to the fix!
