Description
Describe the bug
There appear to be at least two scenarios in which the search/graph indexes can fall out of sync with the SQL database:
- If the MCL publish fails, entityService appears to increment a metric and move on without rolling back the SQL persist
- If an Elasticsearch update fails, the mae-consumer increments a metric and skips past the MCL event
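To make the first failure mode concrete, here is a minimal toy model of it. Everything in it is an assumption for illustration: `ToyPipeline`, its dictionaries, and the counter are hypothetical stand-ins, not DataHub classes or APIs. It only shows why "count the failure and move on" leaves the two stores diverged.

```python
class ToyPipeline:
    """Toy stand-in for the write path; none of these names are DataHub APIs."""

    def __init__(self, publish_ok=True):
        self.sql = {}               # stands in for the SQL database
        self.index = {}             # stands in for the search/graph index
        self.failed_publishes = 0   # stands in for the failure metric
        self.publish_ok = publish_ok

    def _publish(self, urn, aspect):
        if not self.publish_ok:
            raise ConnectionError("simulated transient Kafka error")
        # In this toy, the downstream consumer applies the event immediately.
        self.index[urn] = aspect

    def ingest(self, urn, aspect):
        self.sql[urn] = aspect      # the SQL persist succeeds
        try:
            self._publish(urn, aspect)
        except ConnectionError:
            # Behavior described in this report: count the failure, no rollback.
            self.failed_publishes += 1


pipeline = ToyPipeline(publish_ok=False)
pipeline.ingest("urn:li:dataset:example", {"description": "v2"})

# Keys present in "SQL" but missing from the "index" are out of sync.
out_of_sync = pipeline.sql.keys() - pipeline.index.keys()
print(out_of_sync)
```

The client sees a successful write, yet the entity never reaches the index until something re-emits the event.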
To Reproduce
These scenarios could occur whenever there is a transient or network-level error with the Elasticsearch or Kafka cluster, which is not uncommon
Expected behavior
- In the first case, I would expect entityService to roll back the SQL update and return a failure code to the client. If the client is the mce-consumer, I would expect the mce-consumer to keep retrying the failed message (without committing offsets) until the transient issue goes away and processing succeeds. Ideally, the entire persist + MCL publish is atomic (both succeed or both fail)
- In the second case, I would expect the mae-consumer to keep retrying the failed message (without committing offsets) until the transient issue goes away and processing succeeds
- Alternatively, an easier solution might be to automatically call the restoreIndices GMS endpoint when one of these failures is detected
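The "keep retrying without committing offsets" behavior could look roughly like the sketch below. It is not the mae-consumer or mce-consumer code and does not use the Kafka client; `consume`, `flaky_apply`, and the in-memory event list are hypothetical names that only illustrate the ordering: the offset advances only after the event has been applied successfully.

```python
import time


def consume(events, apply, start_offset=0, base_delay=0.0, max_delay=1.0,
            sleep=time.sleep):
    """Process events in order; advance the offset only after apply() succeeds.

    A failed event is retried with capped exponential backoff, so nothing is
    skipped while the downstream system is unavailable.
    """
    offset = start_offset
    while offset < len(events):
        attempt = 0
        while True:
            try:
                apply(events[offset])
                break
            except ConnectionError:
                attempt += 1
                sleep(min(max_delay, base_delay * 2 ** attempt))
        offset += 1  # the "commit": reached only after a successful apply
    return offset


index = []
failures_left = {"b": 2}  # event "b" fails twice before succeeding


def flaky_apply(event):
    if failures_left.get(event, 0) > 0:
        failures_left[event] -= 1
        raise ConnectionError("simulated transient Elasticsearch error")
    index.append(event)


committed = consume(["a", "b", "c"], flaky_apply)
print(index, committed)
```

One trade-off of this design is that a persistently failing event blocks everything behind it, so a real implementation would probably also need alerting or a retry limit with a dead-letter path. For this approach to be safe, applying the same event twice must also be harmless, since a crash between a successful apply and the commit replays the event.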
Screenshots
N/A
Desktop (please complete the following information):
N/A
Additional context
N/A