DBZ-8430 Use list of columns in incremental snapshot select stmt #6000
wukachn wants to merge 2 commits into debezium:main from
Conversation
Hi @wukachn, it seems the test failures are related. Could you take a look?
I'll take a look at the ITs at some point today.
Looks like the initial change has made MySQL schema changes not supported at all during an incremental snapshot. If that's true, then schema changes must have never been supported during a MySQL incremental snapshot if the user was using column filters. @Naros, was the team aware of this nuance? I haven't looked too deep into how to get it working again, but I imagine it'd take some time for me to gain the confidence to make a meaningful change. I'm happy to take suggestions from someone who has more knowledge of the codebase.
Hey @wukachn, I've been tracking this issue as it relates to a problem we had in prod for MySQL. I think you may be right and that incremental snapshots have possibly just never worked with a column filter applied for some of the sources. As far as I can tell, there isn't a test in place for incremental snapshots with column filters applied. I will ask about it in Zulip as well, as I guess it would be unrelated to the specific issue of handling invisible columns, but if I set column.include.list to include a subset of columns for my table and then run an incremental or blocking snapshot, I get this warning:

{"timestamp":"2024-11-24T13:36:04.785293758+02:00","sequence":991,"loggerClassName":"org.slf4j.impl.Slf4jLogger","loggerName":"io.debezium.config.CommonConnectorConfig","level":"WARN","message":"The signal event 'Struct{}' should have 3 fields but has 0"}

This seems to come from code in CommonConnectorConfig. It would be cool if you could sanity-check whether you see the same behaviour in 3.0.1 for SQL Server. I'm also not sure why that would cause the schema change to time out, though, which is what I saw when I ran tests using your code.

Edit: I believe your Oracle test passed. I wonder if this relates to sources that use db.schema.table for snapshots, like SQL Server and Oracle, vs sources that use just database.table, like MySQL.
I tried to set the column filters in the With your changes it works if you specify the column filters and doesn't work if you don't. The scenario is:

So during the incremental snapshot it can happen that, while it is processing some chunk, the last table definition was with the column This was not happening before, since the Or this can happen: a record will be emitted with the new column before the schema event has been processed. What I think is wrong here is that we are using the select statement for the chunk query also to read the table in
@mfvitale I asked what I guess is a related question here: https://debezium.zulipchat.com/#narrow/channel/348104-community-mysql-mariadb/topic/Expected.20behaviour.20of.20column.2Einclude.2Elist.20for.20multiple.20table Basically I'm just trying to understand the intended behaviour of the column filter, as it appears to apply globally, even to the debezium_signal table (i.e. in RelationalDatabaseSchema it excludes the signal table columns and you can't then read signals from it), and so if you specify include columns you need to make sure to include every column across all of your tables, which is not ideal.
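To make the pitfall concrete, here is a hypothetical configuration sketch (the table and column names are made up for illustration): an include list that only covers the business table also strips every column from the signal table, which leads to the "should have 3 fields but has 0" warning quoted earlier, since the signal table's id, type, and data columns are filtered out too.

```properties
# Hypothetical example: this include list covers inventory.customers but
# leaves the signal table with zero included columns, so signals become
# unreadable.
column.include.list=inventory\.customers\.(id|name)

# Workaround sketch: also include the signal table's columns explicitly.
# column.include.list=inventory\.customers\.(id|name),inventory\.debezium_signal\..*
```

Note that `column.include.list` entries are regular expressions matched against fully-qualified column names, hence the escaped dots.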
@mfvitale If you have a Using the From looking at the code this weekend, I came to the conclusion that a change was needed to
Hey @mfvitale, I just wanted to follow up on what you were saying above, as I've been tracking this issue. Is there any downside to always having the select statement for readSchema be SELECT * FROM table with conditions? I'm not sure what the least hacky way to do that would be, though. Maybe a new method, so that instead of calling buildChunkQuery in readSchema you call a new buildSchemaQuery method or something like that, so you can have something more similar to the old logic, where you select * unless there is an include/exclude column list?
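A minimal sketch of that idea, assuming a hypothetical buildSchemaQuery helper (the names are illustrative, not the actual Debezium API): fall back to SELECT * unless an explicit column list is configured, so a dropped column cannot make the schema-read query itself fail.

```java
import java.util.List;

public class SchemaQuerySketch {

    // Hypothetical helper: use an explicit projection only when the user has
    // configured column filters; otherwise fall back to SELECT *.
    static String buildSchemaQuery(String table, List<String> filteredColumns) {
        String projection = (filteredColumns == null || filteredColumns.isEmpty())
                ? "*"
                : String.join(", ", filteredColumns);
        // A real implementation would also append the chunk/window conditions.
        return "SELECT " + projection + " FROM " + table;
    }

    public static void main(String[] args) {
        System.out.println(buildSchemaQuery("inventory.customers", List.of()));
        System.out.println(buildSchemaQuery("inventory.customers", List.of("id", "name")));
    }
}
```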
There is a window between the schema change validation step and the execution of the query where I could imagine this error happening, and we should likely guard against this race condition. The question is whether or not we can efficiently guard against it 🤔.
I believe we need both; it's what we do for the initial snapshots, I believe, right?
In that case, why not use
I believe the idea, @nathan-smit-1, was to always apply a list of columns. For example, in the initial snapshot, if we find the user supplies a
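One way to read "always apply a list of columns" as a sketch (illustrative names, not Debezium code): derive the projection from the table's full column list and narrow it with the user's filter only when one is configured, so the generated SELECT always enumerates columns explicitly instead of using *.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class ProjectionSketch {

    // Sketch: start from the table's current column list and apply the
    // configured filter when present; with no filter, every column is kept,
    // but the query would still list columns explicitly.
    static List<String> projectedColumns(List<String> tableColumns, Predicate<String> includeFilter) {
        Predicate<String> filter = (includeFilter == null) ? c -> true : includeFilter;
        return tableColumns.stream().filter(filter).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> cols = List.of("id", "name", "invisible_col");
        System.out.println(projectedColumns(cols, c -> !c.equals("invisible_col")));
        System.out.println(projectedColumns(cols, null));
    }
}
```

The trade-off discussed in this thread is exactly that such an explicit list goes stale when the schema changes mid-snapshot.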
@Naros Apologies, I ended up deleting some responses for clarity. As I understand the issue, when a column is dropped, and we explicitly list out the columns, readSchema fails because it does `select a, b, c from table` to generate the schema but `c` no longer exists. So the process to verify whether the schema has changed is itself failing due to the changed schema, if that makes sense. I was just thinking now, though, that I recently added readSchemaForTable as part of DBZ-4903, and I'm wondering whether that can't be adapted to do the schema read instead. I'm busy running some tests at the moment.

EDIT: So the above doesn't work, because the schema generated by jdbcConnection.readSchema is subtly different from the schema from context.getSchema().
The main difference with the initial snapshot is that we lock tables while reading the schema.
@mfvitale @Naros I don't wanna monopolise discussion too much on this, as it's not my pull request, but FWIW, I did find that having a fallback like the below in the case of a SQLException did pass the MySQL tests at least, and does then allow snapshots on invisible columns. The below is also kinda what I was referring to as a "hacky" solution, though (i.e. checking the current schema in the event of a SQLException). Definitely a problem that is much trickier than it initially seemed.
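The fallback snippet itself did not survive here, but a minimal sketch of the idea as described (retry with SELECT * when the explicit-column query throws a SQLException, e.g. because a listed column was dropped) might look like the following, with a tiny query interface standing in for the JDBC connection so the sketch stays self-contained:

```java
import java.sql.SQLException;
import java.util.List;

public class FallbackSketch {

    // Stand-in for a JDBC query so the sketch is self-contained.
    interface Query {
        List<String> run(String sql) throws SQLException;
    }

    // "Hacky" fallback: try the explicit column list first and, if that fails
    // (e.g. a listed column no longer exists), retry with SELECT * so the
    // schema read can still complete.
    static List<String> readSchema(Query query, String table, String columnList) throws SQLException {
        try {
            return query.run("SELECT " + columnList + " FROM " + table);
        }
        catch (SQLException e) {
            return query.run("SELECT * FROM " + table);
        }
    }
}
```

As noted in the reply below this comment in the thread, the catch is that the fallback can hand back a schema that no longer matches the rows actually read.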
@nathan-smit-1 The problem with the hack is that you can effectively have a table schema without the hidden column and then read the data with the hidden column. I added
@mfvitale That makes sense. Just to put another idea out there: what if this were configurable, such that you could choose to have Debezium list the columns but accept the risk that you can't have schema changes during the snapshot? So the equivalent of setting snapshot.locking.mode to none.
@nathan-smit-1 I think this is not going to work, since it's what is happening in the tests. The schema must match, otherwise you will get
My apologies for circling back to this relatively late, @mfvitale and @nathan-smit-1. So I came across a similar situation with Oracle recently during streaming where any

This makes me wonder whether it makes more sense to consider whether hidden/generated columns should be part of the change event payload at all, particularly given that they're often not easily obtainable, or even recorded, in some vendor transaction logs. Perhaps once we introduce connector-specific relational model implementations, that might help here.
Well, this is interesting. It would be good if we could put all these findings down somewhere (a DDD?) so we can also better discuss the solution.
In the case of the generated columns, which are not in the transaction logs, it sounds good, but I have some doubts about the hidden ones. Think of use cases where you just want to do same-database replication.
Also, do you think there is any chance to somehow exclude, or optionally exclude, those columns while we get the table metadata?
@mfvitale, absolutely, I think this highlights the fact that it could be useful to introduce a visibility attribute or mechanism for tracking column visibility in the relational model, since I suspect other databases are similar to Oracle, where virtual/generated status is tracked independently of a column's visibility setting for generalized queries.
We could, but column position details would also need to change, with what I suspect may be a database-specific hook. In Oracle, for example, the driver by default provides generated columns first, but is this somehow influenced by NULL ordering, where in some databases you can set NULLs to be first or last? Whatever we do, we'd need to take it connector by connector. For MariaDB, MySQL, and Oracle, combining a code change in

One big open question is whether or not it would be possible to make such a change for existing connectors, particularly when we rehydrate the schema history, etc. Is all the information truly there or not?
Closing due to inactivity.
https://issues.redhat.com/browse/DBZ-8430
https://debezium.zulipchat.com/#narrow/channel/302529-community-general/topic/Incremental.20Snapshots.20-.20Hidden.20Columns