[SUPPORT]Spark reads from all partitions when using composite keys with a record index. #12152
Labels
data-skipping
index
metadata
metadata table
priority:critical
production down; pipelines stalled; Need help asap.
reader-core
Added two record keys (customer_id, name) and configured the record index as below:
hudi_options = {
    'hoodie.table.name': "hudi-table-with-rli-two-record-keys",
    'hoodie.datasource.write.recordkey.field': "customer_id,name",
    'hoodie.datasource.write.partitionpath.field': "state",
    'hoodie.datasource.write.precombine.field': "created_at",
    'hoodie.datasource.write.operation': "upsert",  # Use upsert operation
    'hoodie.index.type': "RECORD_INDEX",
    'hoodie.metadata.enable': "true",
    'hoodie.metadata.index.column.stats.enable': "true",
    'hoodie.metadata.record.index.enable': "true"
}

df.write.format("hudi").options(**hudi_options).mode("append").save("s3a://bucket/var/proj/hudipoc-proj/hudi-table-with-rli-two-record-key/")
Reading a record by its composite key:
spark.read.format("hudi") \
    .option("hoodie.enable.data.skipping", "true") \
    .option("hoodie.metadata.enable", "true") \
    .option("hoodie.metadata.record.index.enable", "true") \
    .option("hoodie.metadata.index.column.stats.enable", "true") \
    .load("s3a://bucket/var/proj/hudipoc-proj/hudi-table-with-rli-two-record-key/") \
    .createOrReplaceTempView("hudi_snapshot1")

spark.sql("select * from hudi_snapshot1 where customer_id='04da8419-fb9e-47f1-a44f-3cf2199ad20a' and name='Customer_43680'").show(truncate=False)
Observations:
Spark is reading from all the partitions, as shown in the attached image.
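For reference when debugging, here is a minimal sketch of how the two key fields are probably combined into a single record key string. It assumes the table ended up with Hudi's ComplexKeyGenerator, which is the usual choice when more than one record key field is configured, and that the key takes the form `field:value,field:value`. The helper name `composite_record_key` is hypothetical and only illustrates the format; it does not model how null or empty values are handled.

```python
def composite_record_key(fields):
    """Join (field_name, value) pairs as 'name:value,name:value'.

    Assumption: this mirrors the ComplexKeyGenerator layout for
    non-null, non-empty values; it is not a Hudi API.
    """
    return ",".join(f"{name}:{value}" for name, value in fields)


# Values taken from the query in this issue.
key = composite_record_key([
    ("customer_id", "04da8419-fb9e-47f1-a44f-3cf2199ad20a"),
    ("name", "Customer_43680"),
])
print(key)
```

The printed string can be compared against the `_hoodie_record_key` meta column of a matching row (for example, `select _hoodie_record_key from hudi_snapshot1 limit 5`) to confirm what the record index is actually keyed on. Whether a filter on the two separate columns is turned into a record index lookup may depend on the Hudi version in use.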