[SUPPORT]Spark reads from all partitions when using composite keys with a record index. #12152

RameshkumarChikoti123 · 2024-10-23T16:05:46Z

Added two record keys (customer_id, name) and configured the record index as below:

hudi_options = {
    'hoodie.table.name': "hudi-table-with-rli-two-record-keys",
    'hoodie.datasource.write.recordkey.field': "customer_id,name",
    'hoodie.datasource.write.partitionpath.field': "state",
    'hoodie.datasource.write.precombine.field': "created_at",
    'hoodie.datasource.write.operation': "upsert",  # Use upsert operation
    'hoodie.index.type': "RECORD_INDEX",
    'hoodie.metadata.enable': "true",
    'hoodie.metadata.index.column.stats.enable': "true",
    'hoodie.metadata.record.index.enable': "true"
}

df.write.format("hudi").options(**hudi_options).mode("append").save("s3a://bucket/var/proj/hudipoc-proj/hudi-table-with-rli-two-record-key/")

Reading a record with composite keys:

spark.read.format("hudi") \
    .option("hoodie.enable.data.skipping", "true") \
    .option("hoodie.metadata.enable", "true") \
    .option("hoodie.metadata.record.index.enable", "true") \
    .option("hoodie.metadata.index.column.stats.enable", "true") \
    .load("s3a://bucket/var/proj/hudipoc-proj/hudi-table-with-rli-two-record-key/") \
    .createOrReplaceTempView("hudi_snapshot1")

spark.sql("select * from hudi_snapshot1 where customer_id='04da8419-fb9e-47f1-a44f-3cf2199ad20a' and name='Customer_43680'").show(truncate=False)
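A minimal sketch of building the equality filter on every record key field, so that no key column is left out of the query. The helper `key_predicate` is hypothetical (it is not part of Hudi or Spark) and only assembles the SQL text. The view name and key values are the ones from the query above.

```python
from typing import Mapping


def key_predicate(key_values: Mapping[str, str]) -> str:
    """Build an AND-ed equality predicate over all record key columns.

    Hypothetical helper for illustration only. Single quotes inside
    values are doubled so the resulting SQL literal stays well-formed.
    """
    parts = []
    for column, value in key_values.items():
        escaped = value.replace("'", "''")
        parts.append(f"{column} = '{escaped}'")
    return " AND ".join(parts)


# Both record key fields from the write config: customer_id and name.
where = key_predicate({
    "customer_id": "04da8419-fb9e-47f1-a44f-3cf2199ad20a",
    "name": "Customer_43680",
})
query = f"SELECT * FROM hudi_snapshot1 WHERE {where}"
print(query)
```

The resulting string can be passed to `spark.sql(query)`. This only makes the filter explicit; it does not change how the reader prunes files.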

Observations:
Spark reads from all partitions, as shown in the attached image.


ad1happy2go · 2024-10-24T08:20:25Z

Thanks for raising this, @RameshkumarChikoti123. I checked with the master code too and confirmed the issue is still there.

Created a tracking Jira to fix it: https://issues.apache.org/jira/browse/HUDI-8432

mzheng-plaid · 2024-10-24T18:33:52Z

@ad1happy2go I'm seeing the same behavior on Hudi 0.14.1 with a partitioned table with a single record key (hoodie.datasource.write.recordkey.field) - is that expected?

ad1happy2go · 2024-10-30T12:40:20Z

@mzheng-plaid Yes, this issue has always been there with RLI, as we didn't have this support.

mzheng-plaid · 2024-10-30T16:19:48Z

Sorry, so is there currently any benefit to using RLI if you have a partitioned table? Or with any composite record key?

ad1happy2go · 2024-11-07T14:44:00Z

@mzheng-plaid RLI is still very useful for fast upserts, even with a partitioned table and a composite record key. We are working on adding support on the query (read) side for composite keys.
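To illustrate why a record-level index can prune reads only when the filter pins down the full record key, here is a toy model in plain Python. It is a sketch under stated assumptions, not Hudi's implementation: the `field:value,field:value` key layout is assumed to mirror how a complex key generator encodes composite keys, and the file names, rows, and helper functions are all hypothetical.

```python
from typing import Dict, List, Mapping, Set

KEY_FIELDS = ["customer_id", "name"]  # the two record key fields from this issue


def composite_record_key(fields: List[str], row: Mapping[str, str]) -> str:
    """Encode a composite key as 'field:value' pairs joined by commas (assumed layout)."""
    return ",".join(f"{f}:{row[f]}" for f in fields)


def build_index(fields: List[str],
                rows_by_file: Mapping[str, List[Dict[str, str]]]) -> Dict[str, str]:
    """Toy record index: encoded record key -> file that holds the record."""
    index: Dict[str, str] = {}
    for file_id, rows in rows_by_file.items():
        for row in rows:
            index[composite_record_key(fields, row)] = file_id
    return index


def files_to_scan(index: Mapping[str, str], fields: List[str],
                  predicate: Mapping[str, str], all_files: List[str]) -> Set[str]:
    """Return the files a reader must open for an equality predicate.

    A lookup key can be formed only if every key field is constrained;
    otherwise the toy reader falls back to scanning every file.
    """
    if not all(f in predicate for f in fields):
        return set(all_files)
    hit = index.get(composite_record_key(fields, predicate))
    return {hit} if hit is not None else set()


# Hypothetical table with one file per partition.
rows_by_file = {
    "CA/file-1": [{"customer_id": "c1", "name": "Customer_1"}],
    "NY/file-2": [{"customer_id": "c2", "name": "Customer_2"}],
}
index = build_index(KEY_FIELDS, rows_by_file)
all_files = list(rows_by_file)

full = files_to_scan(index, KEY_FIELDS,
                     {"customer_id": "c1", "name": "Customer_1"}, all_files)
partial = files_to_scan(index, KEY_FIELDS, {"customer_id": "c1"}, all_files)
print(full, partial)
```

In this model the full-key lookup touches a single file, while a filter on only one key field cannot form a lookup key and falls back to every file. The behavior reported in this issue resembles that fallback, even though the query filters on both key fields.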

ad1happy2go added index metadata metadata table reader-core data-skipping priority:critical production down; pipelines stalled; Need help asap. labels Oct 24, 2024

ad1happy2go added this to Hudi Issue Support Oct 24, 2024

ad1happy2go moved this to 🏁 Triaged in Hudi Issue Support Oct 24, 2024
