Firestore Android get stuck permanently after a large volume of mutations #5417

fanwgwg · 2023-10-13T15:29:07Z

[READ] Step 1: Are you in the right place?

Issues filed here should be about bugs in the code in this repository.
If you have a general question, need help debugging, or fall into some
other category use one of these other channels:

For general technical questions, post a question on StackOverflow
with the firebase tag.
For general Firebase discussion, use the firebase-talk
google group.
For help troubleshooting your application that does not fall under one
of the above categories, reach out to the personalized
Firebase support channel.

[REQUIRED] Step 2: Describe your environment

Android Studio version: Android Studio Giraffe | 2022.3.1 Patch 2
Firebase Component: Firestore
Component version: com.google.firebase:firebase-bom:32.3.1

[REQUIRED] Step 3: Describe the problem

During development (our app has been in production with Firestore for nearly a year, so we're very familiar with the Firestore Android SDK), we hit a rare issue that we had never experienced before: Firestore got stuck permanently after a large volume of mutations on the client side. After days of debugging we couldn't find any reason why it was getting stuck. Here's what we observed:

We're using the same device that we've been using constantly for app development throughout the last year.
We're certain that there was no internet connection issue. Our app is designed to work both offline and online, so we're very familiar with Firestore and know that a write Task won't complete without an internet connection, because it can only complete once the write is committed on the server.
After a large volume of mutations, roughly a sequence of inserting thousands of documents -> deleting thousands of documents -> inserting thousands of documents -> etc. (about 10 times), Firestore got stuck: we waited for more than 1 day (24 HOURS), and Firestore still couldn't complete its operations. We verified that it was stuck by issuing a dummy call to Firestore, namely deleting a non-existent document; this Task never completed, failed, or got cancelled.
While Firestore was stuck, we used the Profiler inside Android Studio and could see that FirestoreWorker was running 100% of the CPU time (of its own thread).
We kept the device connected, charged, and idle for 3 days, but the dummy call (deleting a non-existent document) still couldn't complete.
We tried restarting the app many times, but it never recovered from the "stuck" state.
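
A probe like the dummy delete described above can be bounded with a timeout, so that a hang is reported instead of waited on. The following is only a minimal sketch: it uses java.util.concurrent.CompletableFuture as a stand-in for the Firestore Task, and StuckProbeSketch, Outcome, and probe are hypothetical names that do not come from the SDK. Since a write also stays pending while the device is offline, a timeout by itself does not prove this particular bug.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: classifies a pending write as completed, failed, cancelled, or timed out. */
final class StuckProbeSketch {
  enum Outcome { COMPLETED, FAILED, CANCELLED, TIMED_OUT }

  /** Waits at most timeoutMillis for the write (a stand-in for a Firestore Task) to settle. */
  static Outcome probe(CompletableFuture<?> write, long timeoutMillis) {
    try {
      write.get(timeoutMillis, TimeUnit.MILLISECONDS);
      return Outcome.COMPLETED;
    } catch (TimeoutException e) {
      return Outcome.TIMED_OUT; // still pending: the state described in this report
    } catch (CancellationException e) {
      return Outcome.CANCELLED;
    } catch (ExecutionException e) {
      return Outcome.FAILED;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt for the caller
      return Outcome.TIMED_OUT;
    }
  }

  public static void main(String[] args) {
    // A future that nobody ever completes models the stuck dummy delete.
    System.out.println(probe(new CompletableFuture<Void>(), 100));
  }
}
```

With the real SDK, the same idea would apply to the Task returned by the dummy delete, waited on from a background thread.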

We re-installed our app yesterday to recover from this issue. However, while we were debugging, we dumped the Firestore database, because I thought it might help Firestore engineers investigate. Here's the link to the DB. The DB itself is gigantic, at more than 1 GB. Our app does not override any of Firestore's persistence settings, so everything about offline persistence should be the default.

We filed a support case with Firebase Support last week (case number 10250473); however, they said they were unable to spot any issue on the server side and suggested that I contact the client SDK team directly.

Steps to reproduce:

Since then, we have not encountered the issue again. During our ~1 year of development with the Firestore Android SDK, that was the only time we encountered it. We believe it is a rare case; however, when it does happen, it is SERIOUS because the only way out is to uninstall and reinstall the app. Therefore, we're kindly asking the Firestore team to take a look at this issue.

Relevant Code:

Explained above


milaGGL · 2023-10-13T18:29:00Z

Hi @fanwgwg, thank you for reporting this issue. I am unable to download the DB via the link provided. Could you please check if there are any abnormal transactions trying to mutate the same document at the same time, or anything else that catches your eye?

I wonder if this was caused by your persistence settings + large volume of mutations. But without reproducible code, it is hard to tell what the root cause is.

fanwgwg · 2023-10-14T07:22:29Z

@milaGGL The link should be valid, as I'm able to download it in Incognito mode (unless you meant that it cannot be virus-scanned by Google Drive because the file is too big, as shown in the screenshot).

For persistence settings, our app does not override any of Firestore's persistence settings, so everything about offline persistence should be the default.

As for abnormal transactions, we did not identify any. That's also why I dumped the Firestore database when the "stuck" happened: to send it to Firestore engineers for further inspection in case there is anything mysterious. We have never reproduced the issue since, so it's difficult for us to tell whether anything is abnormal.

Regarding your request to check whether any abnormal transactions were trying to mutate the same document at the same time:

I'm unable to tell if this has happened, as we never encountered the same issue during our 1 year of development with Firestore Android, so this must be a rare case or a race condition. However, I recall that all mutations in the Firestore Android SDK are enqueued sequentially:

firebase-android-sdk/firebase-firestore/src/main/java/com/google/firebase/firestore/util/AsyncQueue.java

Line 46 in 406c057

    /** A helper class that allows to schedule/queue Runnables on a single threaded background queue. */

Therefore, even if there are two mutations to the same document at the same time, they should never run in parallel, meaning that the following code should not have any issues:

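    import com.google.common.collect.ImmutableMap;
    import com.google.common.util.concurrent.MoreExecutors;
    import java.util.concurrent.Executors;

    // db is the app's FirebaseFirestore instance, e.g. FirebaseFirestore.getInstance().
    var bgExecutor = MoreExecutors.listeningDecorator(Executors.newFixedThreadPool(4));
    var docRef = db.collection("test").document("a");
    // Issue conflicting writes to the same document from four threads at once.
    for (int i = 0; i < 1000; i++) {
        bgExecutor.submit(() -> docRef.set(ImmutableMap.of("field_1", "value_1")));
        bgExecutor.submit(() -> docRef.delete());
        bgExecutor.submit(() -> docRef.set(ImmutableMap.of("field_1", "value_2")));
    }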

Please correct me if I'm not understanding it correctly
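
The serialization argument above can be illustrated without Firestore. The sketch below is an assumption-laden stand-in, not the SDK's AsyncQueue: it uses a plain single-threaded ExecutorService as the queue, and SerialQueueSketch and maxConcurrency are hypothetical names. It submits work from several caller threads and records how many tasks were ever running at the same moment.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical demo: tasks from many threads still run one at a time on a single-threaded queue. */
final class SerialQueueSketch {
  /** Returns the largest number of queue tasks observed running simultaneously. */
  static int maxConcurrency(int producers, int tasksPerProducer) {
    ExecutorService queue = Executors.newSingleThreadExecutor(); // stand-in for the worker queue
    ExecutorService callers = Executors.newFixedThreadPool(producers);
    AtomicInteger running = new AtomicInteger();
    AtomicInteger maxSeen = new AtomicInteger();
    for (int p = 0; p < producers; p++) {
      callers.submit(() -> {
        for (int i = 0; i < tasksPerProducer; i++) {
          queue.submit(() -> {
            int now = running.incrementAndGet();       // entering a task
            maxSeen.accumulateAndGet(now, Math::max);  // remember the peak overlap
            running.decrementAndGet();                 // leaving the task
          });
        }
      });
    }
    try {
      // Let all callers finish enqueueing before draining the queue.
      callers.shutdown();
      callers.awaitTermination(10, TimeUnit.SECONDS);
      queue.shutdown();
      queue.awaitTermination(10, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for the demo to finish", e);
    }
    return maxSeen.get();
  }

  public static void main(String[] args) {
    System.out.println(maxConcurrency(4, 250)); // peak overlap on the single-threaded queue
  }
}
```

The peak stays at 1 because only one thread ever drains the queue, no matter how many threads enqueue work.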

milaGGL · 2023-10-17T19:53:29Z

@fanwgwg, yes, that is correct. Quick question, have there been any updates to your app recently, like upgrading to a newer version of the SDK?

fanwgwg · 2023-10-18T11:24:55Z

@milaGGL We've been following the latest Firebase BOM releases; typically we update to the latest version within 1 week of its release. The latest Firebase Android BOM update was 32.3.1 on September 15, 2023, and we updated to this version on Sep 16th.

When this issue happened (around Sep 20-ish), we were already using the 32.3.1 release.

fanwgwg · 2023-10-26T10:31:15Z

@milaGGL I've just encountered another case of "stuck" today. This time I had Firestore logging enabled, and here's the log output from starting the app to when Firestore got stuck: Google Drive Link

One thing I found is that each time the "stuck" happens, the log ends with these two lines. After them, no matter how long you wait (hours or days), Firestore no longer prints any logs. Any Task enqueued to Firestore afterwards never completes or fails; it just stays in a waiting state.

(24.9.0) [Persistence]: Starting transaction: Collect garbage
(24.9.0) [LruGarbageCollector]: Capping sequence numbers to collect down to the maximum of 1000 from 1059

fanwgwg · 2023-10-26T10:46:11Z

@milaGGL I think I've found something that might be very close to the cause: there is an infinite loop running in this method, SQLiteLruReferenceDelegate#removeOrphanedDocuments:

firebase-android-sdk/firebase-firestore/src/main/java/com/google/firebase/firestore/local/SQLiteLruReferenceDelegate.java

Line 159 in f6b5ecb

public int removeOrphanedDocuments(long upperBound) {

I've attached a screenshot of debugging at the breakpoints inside the method. As you can see, upperBound = 19923, but on every pass through the while loop docsToRemove stays at 0, so the loop runs indefinitely.

milaGGL · 2023-10-26T17:41:57Z

@fanwgwg, this is amazing! I wonder if !isPinned(key) ever returns true and gets inside the if block to populate the docsToRemove.
Would you be able to provide a minimal repro app?

fanwgwg · 2023-10-27T07:40:06Z

Yes, isPinned(key) always returns true, because mutationQueuesContainKey(key) always returns true.

Unfortunately I don't have a way to reproduce the issue consistently; it only happens when there is a large volume of mutations. Even though we run such large-volume tests on a daily basis, I've only encountered it twice so far.

In case anyone else is encountering the same issue, our mitigation for the issue is to disable persistence because our app doesn't use the persistence feature, i.e.,

    FirebaseFirestore.getInstance()
        .setFirestoreSettings(
            new FirebaseFirestoreSettings.Builder().setPersistenceEnabled(false).build());

fanwgwg · 2023-11-30T08:38:56Z

@milaGGL @ehsannas Friendly ping, is there any update on this issue?

While disabling persistence works for now, we don't like it as a mitigation, because it might prevent us from developing features that rely on offline persistence, which is one of the core features of Firestore.

Just thinking out loud based on my superficial understanding of the SDK, it seems that:

Firestore runs garbage collection logic to remove orphaned documents from its persistent cache if the cache becomes too large.
When removing orphaned documents, it looks up documents by sequence number (maybe looking for the oldest documents?) and then tries to remove them. However, if a document is in the mutation queue, it is skipped.
What if the cache is small, but the mutation queue contains so many documents that every document in the cache also happens to be in the mutation queue? Does this sound like what's happening here?

wu-hui · 2023-11-30T16:14:06Z

@fanwgwg Thanks for the detailed investigation, I think you are right that there is a bug in removeOrphanedDocuments.

Your operation put the SDK's GC under pressure. It tries to go through all orphaned documents in batches (with batch size REMOVE_ORPHANED_DOCUMENTS_BATCH_SIZE), and it stops when it has processed a batch whose size differs from REMOVE_ORPHANED_DOCUMENTS_BATCH_SIZE, which means it has processed all the orphaned documents with sequence numbers in scope.

Typically, the number of orphaned documents does not exceed REMOVE_ORPHANED_DOCUMENTS_BATCH_SIZE, so the loop exits after one iteration. The operations you performed pushed the loop into running multiple iterations.

Unfortunately, from the second iteration onwards, the SDK did not "resume" the query where iteration 1 left off; instead it issued the same query as in iteration 1, hence the infinite loop.
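
The behavior described above can be modeled with a small in-memory sketch. This is a toy model under stated assumptions, not the SDK's code and not the actual patch: a sorted set of paths stands in for the SQL query, BATCH_SIZE is a made-up small value standing in for REMOVE_ORPHANED_DOCUMENTS_BATCH_SIZE, and OrphanGcSketch, queryBatch, and collect are hypothetical names. With resume = false the same first batch is fetched every time; if every document in it is pinned, nothing is removed and the loop never ends (capped here so the demo terminates). With resume = true the next query starts after the last path already seen, which is one possible way to "resume".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of batched orphan removal; all names and sizes are illustrative assumptions. */
final class OrphanGcSketch {
  static final int BATCH_SIZE = 3;        // stand-in for REMOVE_ORPHANED_DOCUMENTS_BATCH_SIZE
  static final int MAX_ITERATIONS = 1000; // safety cap so the buggy variant terminates here

  /** Returns up to BATCH_SIZE candidate paths strictly greater than cursor, in sorted order. */
  static List<String> queryBatch(TreeSet<String> candidates, String cursor) {
    List<String> batch = new ArrayList<>();
    for (String path : candidates.tailSet(cursor, false)) {
      if (batch.size() == BATCH_SIZE) break;
      batch.add(path);
    }
    return batch;
  }

  /** Returns the number of iterations needed, or -1 if the cap was hit (an "infinite" loop). */
  static int collect(TreeSet<String> candidates, Set<String> pinned, boolean resume) {
    String cursor = ""; // every non-empty path sorts after the empty string
    for (int iteration = 1; iteration <= MAX_ITERATIONS; iteration++) {
      List<String> batch = queryBatch(candidates, cursor);
      for (String path : batch) {
        if (!pinned.contains(path)) candidates.remove(path); // pinned documents are skipped
      }
      if (batch.size() < BATCH_SIZE) return iteration;          // short batch: nothing left to scan
      if (resume) cursor = batch.get(batch.size() - 1);         // continue after the last path seen
    }
    return -1;
  }

  public static void main(String[] args) {
    Set<String> pinned = Set.of("a", "b", "c"); // the first full batch is entirely pinned
    System.out.println(collect(new TreeSet<>(List.of("a", "b", "c", "d", "e")), pinned, false)); // -1
    System.out.println(collect(new TreeSet<>(List.of("a", "b", "c", "d", "e")), pinned, true));  // 2
  }
}
```

When nothing is pinned, the non-resuming variant also finishes, because each pass removes the rows it fetched; that matches the observation earlier in the thread that the loop only spins when isPinned(key) is true for every document in the batch.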

We will get this fixed ASAP.

fanwgwg · 2023-11-30T16:24:54Z

@wu-hui Thanks for the quick response! Looking forward to the fix!

google-oss-bot added the api: firestore label Oct 13, 2023

milaGGL self-assigned this Oct 13, 2023

wu-hui mentioned this issue Dec 1, 2023

Fixed an issue where GC runs into a infinite loop in a certain case. #5585

Merged

wu-hui closed this as completed in #5585 Dec 5, 2023

wu-hui added a commit that referenced this issue Dec 5, 2023

Fixed an issue where GC runs into a infinite loop in a certain case. (#…

42a6951

…5585) Fixes: #5417

firebase locked and limited conversation to collaborators Jan 5, 2024

Saeedsidj pushed a commit to Part-Packages/firebase-android-sdk that referenced this issue Oct 13, 2024

Fixed an issue where GC runs into a infinite loop in a certain case. (f…

e883d99

…irebase#5585) Fixes: firebase#5417
