DBZ-2386 Notify errorHandler on Postgres errors #5971
        }
    }
    catch (SQLException e) {
        errorHandler.setProducerThrowable(e);
@brewneaux Is this necessary here? As the exception is rethrown it should be caught by the change in ChangeEventSourceCoordinator. The errorHandler object is a single one per connector so it will be set there.
Also IMHO there should be the same cleanup as is in the execute method so it should be extracted there to a separate method. But at the same time I believe it is possible that the calls are done from separate threads. So maybe there should be a flag field that is set in commitOffset and the execute method will fail when it is set. This might be the safest approach.
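To make the flag idea above concrete, here is a minimal sketch of one way it could look. It is not the code from this PR: `FlaggedStreamingSource`, `executeOnce`, `cleanupCount`, and the `simulateFailure` parameter are hypothetical names used only for illustration, and the real `commitOffset` and `execute` methods have different signatures.

```java
import java.sql.SQLException;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for a streaming source; not Debezium's actual class.
class FlaggedStreamingSource {

    // Written by the thread that commits offsets, read by the streaming thread.
    private final AtomicReference<SQLException> commitFailure = new AtomicReference<>();

    // Counts cleanups so the sketch can be checked; illustration only.
    private int cleanups = 0;

    // Called from the offset-committing thread. On failure it only records
    // the exception and leaves all cleanup to the streaming thread.
    void commitOffset(boolean simulateFailure) {
        try {
            if (simulateFailure) {
                throw new SQLException("simulated: connection lost while flushing");
            }
            // ... a real implementation would flush the offset here ...
        }
        catch (SQLException e) {
            // Keep the first failure; later ones do not overwrite it.
            commitFailure.compareAndSet(null, e);
        }
    }

    // One iteration of the streaming loop, called from the streaming thread.
    void executeOnce() throws SQLException {
        SQLException failure = commitFailure.get();
        if (failure != null) {
            // Cleanup runs on the thread that owns the streaming resources.
            cleanup();
            throw failure;
        }
        // ... normal streaming work would go here ...
    }

    private void cleanup() {
        cleanups++;
    }

    int cleanupCount() {
        return cleanups;
    }
}
```

The point of the `AtomicReference` is that the committing thread never touches the streaming resources; it only publishes the failure, and the streaming thread decides when to clean up and fail.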
In the testing I did on my initial PR, it appeared like it was. I just retested, though, in a few other scenarios with just the ChangeEventSourceCoordinator change and it continues to work as desired.
@jpechane - just updated this PR. I did my best to try to understand what you were asking for and make it work within this context.
0b4be9c to 04fc8ac
The failures are completely unrelated and I cannot reproduce them locally. I pulled the latest TimescaleDB image and it still works. Let's wait a couple of hours to see if it is a transient issue with GHA.
Hey @jpechane - anything we can do to move this forward? My team is currently using a homebuilt version of Debezium so we can handle this, but we'd like to be on a more official build if we could.
@brewneaux Hi, so I re-ran this locally on the latest main. The
@jpechane this should be working now, but I'm having a ton of problems getting the full test suite to run. I can run little sections of it - and the Timescale tests all passed. Can you run the suite via GHA?
Hm, that result from GitHub Actions is strange. All of the failures I saw there, I was able to run successfully locally. @jpechane - do you have any other suggestions? I'm still not able to get a full run locally the same way GHA would (I'm blaming Windows and my lack of Java experience), but when I run the failures individually I do get success.
Hi @brewneaux, the REST test failures are related to a fix coming in PR #6059; you can safely ignore those errors for now. But to add some explanation, those tests are executed when the
Looking in build_postgresql:
strategy:
  # Runs each combination concurrently
  matrix:
    profile: [ "assembly,postgres-12", "assembly,postgres-17,pgoutput-decoder" ]
So to execute locally in the same way with PG 12, you would set
@brewneaux The referred PR was merged so the tests should run correctly now.
32e4d77 to 6bd4a7f
Thanks everyone. I rebased this against debezium/debezium main. I'm still having some issues getting the full test suite to run. I'm (now) on an Ubuntu machine (after giving up on my Windows laptop doing stupid things with the MongoDbReplicaSetContainerIT), running I get through a large part of the test suite after quite a while. But I seem to get different results (failures) each time I run it. A lot of them look like they might just be flaky tests, but I can't tell. One of the big issues I'm running into, though, is that I eventually get to an error where it cannot build the Docker image for the REST tests: It is building the plugin as version 3.1.0-SNAPSHOT, and looking for 3.0.5-SNAPSHOT. I appreciate everyone's help on this. The Java world is so very different from the C# and Python I do on a daily basis.
This was one of the errors I was describing above, which I attributed to a local misconfiguration. Does anyone have any idea why this would be happening?
During certain scenarios, especially RDS reboots (planned or unplanned), Postgres becomes unavailable in such a way that Debezium is not aware that Postgres went away. In many cases, this can result in connectors that think they are healthy (with all statuses returning RUNNING) but are not actually reading from the replication slots. By notifying the errorHandler when these errors occur, we allow the built-in retry mechanism to reconnect after these failures. Because this happens inconsistently (not every reboot will cause a connector to get stuck), we are catching errors in both the PostgresStreamingChangeEventSource and the core ChangeEventSourceCoordinator. In my testing, I have seen only some types of events get caught by the error handler in the ChangeEventSourceCoordinator.
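As a rough illustration of the coordinator-level catch described above, the sketch below shows the general shape: a failure in the streaming work is handed to an error handler instead of being swallowed. Only the method name `setProducerThrowable` comes from the diff in this PR; `SketchErrorHandler`, `SketchCoordinator`, `StreamingWork`, and `getProducerThrowable` are hypothetical stand-ins, and the retry decision itself is assumed to live elsewhere.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for the connector-wide error handler.
class SketchErrorHandler {

    private final AtomicReference<Throwable> producerThrowable = new AtomicReference<>();

    // Records the first failure reported by a producer thread.
    void setProducerThrowable(Throwable t) {
        producerThrowable.compareAndSet(null, t);
    }

    // Something outside this sketch would poll this and decide whether to retry.
    Throwable getProducerThrowable() {
        return producerThrowable.get();
    }
}

// Hypothetical functional interface for a unit of streaming work.
interface StreamingWork {
    void run() throws Exception;
}

// Hypothetical stand-in for the coordinator that drives streaming.
class SketchCoordinator {

    private final SketchErrorHandler errorHandler;

    SketchCoordinator(SketchErrorHandler errorHandler) {
        this.errorHandler = errorHandler;
    }

    // Runs the streaming work and reports any failure to the error handler,
    // so the connector does not keep reporting RUNNING after the source died.
    void streamEvents(StreamingWork work) {
        try {
            work.run();
        }
        catch (Exception e) {
            errorHandler.setProducerThrowable(e);
        }
    }
}
```

Catching at the coordinator means any streaming source that throws reports through the same handler, rather than each connector having to notify it separately.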
In further testing, it doesn't appear that the Postgres-specific one was necessary
…e streaming gracefully (if we can)
6bd4a7f to d6ed1f4
Rebased on the latest main to see fresh CI results
@brewneaux Applied, thanks!