[SUPPORT]Spark reads from all partitions when using composite keys with a record index. #12152

Open
RameshkumarChikoti123 opened this issue Oct 23, 2024 · 5 comments
Labels
data-skipping index metadata metadata table priority:critical production down; pipelines stalled; Need help asap. reader-core

Comments

@RameshkumarChikoti123

Added two record keys (customer_id, name) and configured the record index as below:

hudi_options = {
    'hoodie.table.name': "hudi-table-with-rli-two-record-keys",
    'hoodie.datasource.write.recordkey.field': "customer_id,name",
    'hoodie.datasource.write.partitionpath.field': "state",
    'hoodie.datasource.write.precombine.field': "created_at",
    'hoodie.datasource.write.operation': "upsert",  # Use upsert operation
    'hoodie.index.type': "RECORD_INDEX",
    'hoodie.metadata.enable': "true",
    'hoodie.metadata.index.column.stats.enable': "true",
    'hoodie.metadata.record.index.enable': "true"
}

df.write.format("hudi").options(**hudi_options).mode("append").save("s3a://bucket/var/proj/hudipoc-proj/hudi-table-with-rli-two-record-key/")

Reading a record with the composite key:

spark.read.format("hudi") \
    .option("hoodie.enable.data.skipping", "true") \
    .option("hoodie.metadata.enable", "true") \
    .option("hoodie.metadata.record.index.enable", "true") \
    .option("hoodie.metadata.index.column.stats.enable", "true") \
    .load("s3a://bucket/var/proj/hudipoc-proj/hudi-table-with-rli-two-record-key/") \
    .createOrReplaceTempView("hudi_snapshot1")

spark.sql("select * from hudi_snapshot1 where customer_id='04da8419-fb9e-47f1-a44f-3cf2199ad20a' and name='Customer_43680' ").show(truncate=False)

Observations:
Spark is reading from all the partitions, as shown in the attached image.

image
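
For context on why a composite key is harder for the reader to use: as far as I understand, Hudi's complex key generator stores a multi-field record key as one string made of `field:value` pairs joined by commas. Presumably the two separate predicates on customer_id and name would need to be combined into that string before a record index lookup is possible. The sketch below only illustrates that layout in plain Python; `compose_record_key` is a hypothetical helper, not a Hudi API, and the exact format should be checked against the `_hoodie_record_key` column of the table.

```python
# Sketch only: mimics the "field:value,field:value" layout that Hudi's
# complex key generator is understood to use for composite record keys.
# compose_record_key is a hypothetical helper, not part of Hudi.

def compose_record_key(fields, row):
    """Join the key fields, in declaration order, as 'name:value' pairs."""
    return ",".join(f"{name}:{row[name]}" for name in fields)


# Same order as hoodie.datasource.write.recordkey.field in the write options.
key_fields = ["customer_id", "name"]
row = {
    "customer_id": "04da8419-fb9e-47f1-a44f-3cf2199ad20a",
    "name": "Customer_43680",
}

record_key = compose_record_key(key_fields, row)
print(record_key)
```

Comparing the printed value with `_hoodie_record_key` for the same row is a quick way to confirm whether the assumption about the layout holds for a given table.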

@ad1happy2go
Collaborator

Thanks for raising this, @RameshkumarChikoti123. I checked against the master code too and confirmed the issue is still there.

Created a tracking Jira to fix it: https://issues.apache.org/jira/browse/HUDI-8432

@ad1happy2go ad1happy2go added index metadata metadata table reader-core data-skipping priority:critical production down; pipelines stalled; Need help asap. labels Oct 24, 2024
@ad1happy2go ad1happy2go moved this to 🏁 Triaged in Hudi Issue Support Oct 24, 2024
@mzheng-plaid

@ad1happy2go I'm seeing the same behavior on Hudi 0.14.1 with a partitioned table with a single record key (hoodie.datasource.write.recordkey.field) - is that expected?

@ad1happy2go
Collaborator

@mzheng-plaid Yes, this issue has always been there with RLI, as we didn't have this support.

@mzheng-plaid

Sorry, so is there currently any benefit to using RLI if you have a partitioned table? Or a composite record key?

@ad1happy2go
Collaborator

@mzheng-plaid RLI is still very useful for fast upserts, even with a partitioned table and a composite record key. We are working on adding read-side query support for composite keys.
