Crash on reset for nextBatchID by assert in FSTLevelDBMutationQueue.start #2237

shosokawa · 2019-01-07T08:29:50Z

Describe your environment

Xcode version: Version 10.1 (10B61)
Firebase SDK version: 5.15.0
Firebase Component: Firestore
Component version: 0.16.1

Describe the problem

An app on a certain device was crashed by HARD_ASSERT in this code repeatedly. I couldn't find a way to handle or recover this error except uninstalling the app once to delete persistent data.

firebase-ios-sdk/Firestore/Source/Local/FSTLevelDBMutationQueue.mm

Lines 129 to 140 in d671776

    
             metadata.lastAcknowledgedBatchId = kFSTBatchIDUnknown; 
        
           } else { 
        
             BatchId lastAcked = metadata.lastAcknowledgedBatchId; 
        
             if (lastAcked >= nextBatchID) { 
        
               HARD_ASSERT([self isEmpty], "Reset nextBatchID is only possible when the queue is empty"); 
        
               lastAcked = kFSTBatchIDUnknown; 
        
               metadata.lastAcknowledgedBatchId = lastAcked; 
        
               _db.currentTransaction->Put([self keyForCurrentMutationQueue], metadata); 
        
             } 
        
           }

I'm not sure why data was stored like that. Although I suppose that there might be chances as requirement to reset nextBatchID with isEmpty == false.

Steps to reproduce:

I couldn't find a way to create same situation.

Stack trace for the crash:

Fatal Exception: NSInternalInconsistencyException
0  CoreFoundation                 0x196a90ec4 __exceptionPreprocess
1  libobjc.A.dylib                0x195c61a40 objc_exception_throw
2  CoreFoundation                 0x1969a6b3c +[_CFXNotificationTokenRegistration keyCallbacks]
3  Foundation                     0x1974e3980 -[NSAssertionHandler handleFailureInFunction:file:lineNumber:description:]
4  Alpaca                         0x104a548d0 firebase::firestore::util::internal::Fail(char const*, char const*, int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) (hard_assert_apple.mm:34)
5  Alpaca                         0x104a549e0 firebase::firestore::util::internal::Fail(char const*, char const*, int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, char const*) (hard_assert_apple.mm:54)
6  Alpaca                         0x1049f370c -[FSTLevelDBMutationQueue start] (FSTLevelDBMutationQueue.mm:133)
7  Alpaca                         0x104a01320 -[FSTLocalStore startMutationQueue] (FSTLocalStore.mm:124)
8  Alpaca                         0x104a011e4 -[FSTLocalStore start] (FSTLocalStore.mm:119)
9  Alpaca                         0x1049e7720 -[FSTFirestoreClient initializeWithUser:settings:] (FSTFirestoreClient.mm:245)
10 Alpaca                         0x1049e97ec std::__1::__function::__func<-[FSTFirestoreClient initWithDatabaseInfo:settings:credentialsProvider:userExecutor:workerQueue:]::$_1, std::__1::allocator<-[FSTFirestoreClient initWithDatabaseInfo:settings:credentialsProvider:userExecutor:workerQueue:]::$_1>, void ()>::operator()() (string:1255)
11 Alpaca                         0x1049b7850 firebase::firestore::util::AsyncQueue::ExecuteBlocking(std::__1::function<void ()> const&) (atomic:921)
12 Alpaca                         0x1049bef0c firebase::firestore::util::internal::DispatchAsync(NSObject<OS_dispatch_queue>*, std::__1::function<void ()>&&)::$_0::__invoke(void*) (executor_libdispatch.mm:58)
13 libdispatch.dylib              0x1964ca484 _dispatch_client_callout
14 libdispatch.dylib              0x196471bd0 _dispatch_lane_serial_drain$VARIANT$mp
15 libdispatch.dylib              0x196472718 _dispatch_lane_invoke$VARIANT$mp
16 libdispatch.dylib              0x19647aeb8 _dispatch_workloop_worker_thread
17 libsystem_pthread.dylib        0x1966ad0dc _pthread_wqthread
18 libsystem_pthread.dylib        0x1966afcec start_wqthread

The text was updated successfully, but these errors were encountered:

wilhuff · 2019-01-07T23:46:19Z

So unfortunately, the stack trace for this crash isn't informative: if this assertion were to fail it can only fail during startup exactly in the way it has.

Do you still have a device that's failing this way? If so we may be able to discover what's happening by examining the stored persistent data.

The failure here indicates that there's a mismatch between the mutation queue contents and the metadata about the queue. These should always be in sync (they're modified together transactionally) and if they're not in sync it points to a serious error somewhere.

Have you changed Firestore SDK versions recently?

shosokawa · 2019-01-08T00:29:15Z

Do you still have a device that's failing this way?

I'm sorry, I already have made the device recovered and I don't know a way to reproduce that.

Have you changed Firestore SDK versions recently?

We fixed version of Firestore SDK to 5.12.0 for a while in order to prevent a problem around #2138. It seems that #2138 had been resolved, we updated version to 5.15.0 recently. Before fixing version to 5.12.0, we used 5.13.0 once.

wilhuff · 2019-01-09T06:38:38Z

OK. Looking through the changes around those versions I think I've figured out you could get in this state.

In #2029 (released with Firebase 5.13.0) we changed the way we store writes and documents that have been acknowledged by the server but not yet observed via an active listen. Previously we would hold these in our outbound queue and keep track in memory of these held write acknowledgements. On restart these held writes would be dropped because the on-disk structures were insufficient to reconstruct the in-memory tracking structures. As a result, prior to Firebase 5.13.0, it was possible to observe flicker across restarts, where a value would fleetingly revert to a state prior to a local write if the app happened to stop in this in-between state.

#2029 fixed this flicker (and made a bunch of other internal improvements possible) by storing the SDK's estimate of the resulting document in the cache and removing acknowledged writes immediately. A one-time schema migration was included that performed a final delete of any existing held write acks on restart.

What I think is happening in your case is that while the schema migration deletes acknowledged writes it doesn't reset the highest acknowledged batch id if all writes were deleted. This in turn causes the violation of the invariant you're observing.

The good news is that this in-between state does not exist for very long and to get in this state a user would have to shut down your app just as the last write the queue was acknowledged but not yet observed. That should make this bug rare.

I need to get back with the team to evaluate possible fixes. Specifically I'm not yet convinced that it's safe to just remove the assertion that your device tripped over.

var-const · 2019-01-11T17:46:39Z

PR #2243, which will be part of the next release, fixes this.

google-oss-bot added the api: firestore label Jan 7, 2019

wilhuff self-assigned this Jan 9, 2019

wilhuff added the type: bug label Jan 9, 2019

wilhuff added this to the M41 milestone Jan 9, 2019

wilhuff assigned var-const and unassigned wilhuff Jan 9, 2019

var-const mentioned this issue Jan 11, 2019

Stop persisting last acknowledged batch ID #2243

Merged

paulb777 closed this as completed Jan 11, 2019

firebase locked and limited conversation to collaborators Oct 23, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Crash on reset for nextBatchID by assert in FSTLevelDBMutationQueue.start #2237

Crash on reset for nextBatchID by assert in FSTLevelDBMutationQueue.start #2237

Crash on reset for nextBatchID by assert in FSTLevelDBMutationQueue.start #2237

Crash on reset for nextBatchID by assert in FSTLevelDBMutationQueue.start #2237

Comments

Describe your environment

Describe the problem

Steps to reproduce: