[SUPPORT] Slow upsert performance for Flink upsert #12046

Open
dheemanthgowda opened this issue Oct 3, 2024 · 3 comments

dheemanthgowda commented Oct 3, 2024

Describe the problem you faced

We are experiencing slow upsert performance when using Hudi with Flink SQL on AWS S3. We tried enabling the metadata table, which improved upsert speed, but the cleaner does not trigger even after 3 commits.

To Reproduce

Steps to reproduce the behavior:

1. Configure Hudi with the following table options for upserting data via Flink SQL (a CREATE TABLE sketch follows these steps):

'connector' = 'hudi'
'write.operation' = 'upsert'
'write.tasks' = '800'
'table.type' = 'MERGE_ON_READ'
'index.type' = 'BUCKET'
'hoodie.bucket.index.num.buckets' = '10'
'hoodie.index.bucket.engine' = 'SIMPLE'
'hoodie.clean.automatic' = 'true'
'hoodie.cleaner.parallelism' = '200'
'clean.policy' = 'KEEP_LATEST_COMMITS'
'clean.async.enabled' = 'true'
'hoodie.keep.max.commits' = '20'
'hoodie.keep.min.commits' = '6'
'clean.retain_commits' = '3'
'hoodie.datasource.write.hive_style_partitioning' = 'true'
'hoodie.parquet.compression.codec' = 'snappy'
'compaction.max_memory' = '30000'
'hoodie.write.set.null.for.missing.columns' = 'true'
'hoodie.archive.automatic' = 'false'
'hoodie.archive.async' = 'false'
'hoodie.schema.on.read.enable' = 'true'
'hoodie.fs.atomic_creation.support' = 's3a'
'compaction.async.enabled' = 'false'
'compaction.delta_commits' = '1'
'compaction.schedule.enabled' = 'true'
'compaction.trigger.strategy' = 'num_commits'
'hoodie.cleaner.incremental.mode' = 'false'
'hoodie.compaction.logfile.size.threshold' = '1'
'metadata.enabled' = 'false'
'hoodie.compaction.strategy' = 'org.apache.hudi.table.action.compact.strategy.LogFileSizeBasedCompactionStrategy'

2. Run a batch job to perform upserts.
3. Monitor the logs for cleaning operations.
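
For context, the options listed in step 1 go in the WITH clause of a Flink SQL CREATE TABLE statement, comma-separated. A minimal sketch follows; the table name, columns, and S3 path are placeholders, and only a few of the options above are repeated for brevity:

CREATE TABLE hudi_sink (
  id STRING,
  val STRING,
  ts TIMESTAMP(3),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = 's3a://my-bucket/warehouse/hudi_sink',  -- placeholder path
  'write.operation' = 'upsert',
  'table.type' = 'MERGE_ON_READ',
  'index.type' = 'BUCKET',
  'hoodie.bucket.index.num.buckets' = '10'
  -- ...remaining options from the list above, comma-separated...
);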
Expected behavior

We expect the cleaner to trigger and remove file versions from older commits, as per the configuration above.

Environment Description

Hudi version: 0.14.1
Flink version: 1.17.1
Storage (HDFS/S3/GCS..): S3
Running on Docker? (yes/no): no (running on K8s)
Additional context
After setting metadata.enabled to true, we observed a notable improvement in upsert speed. However, the cleaner still does not run as expected.
Are we missing any configs?

2024-09-15 14:48:38,399 WARN  org.apache.hudi.config.HoodieWriteConfig                     [] - Increase hoodie.keep.min.commits=6 to be greater than hoodie.cleaner.commits.retained=20 (there is risk of incremental pull missing data from few instants based on the current configuration). The Hudi archiver will automatically adjust the configuration regardless.
2024-09-15 14:48:38,909 INFO  org.apache.hudi.metadata.HoodieBackedTableMetadataWriter     [] - Latest deltacommit time found is 20240915143952010, running clean operations.
2024-09-15 14:48:39,153 INFO  org.apache.hudi.client.BaseHoodieWriteClient                 [] - Scheduling cleaning at instant time :20240915143952010002
2024-09-15 14:48:39,160 INFO  org.apache.hudi.table.action.clean.CleanPlanner              [] - No earliest commit to retain. No need to scan partitions !!
2024-09-15 14:48:39,160 INFO  org.apache.hudi.table.action.clean.CleanPlanActionExecutor   [] - Nothing to clean here. It is already clean
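
As an aside, the WARN line above enforces the archiver constraint that hoodie.keep.min.commits must be greater than hoodie.cleaner.commits.retained. A consistent combination, taking the values from the configuration above (illustrative, not a recommendation), would look like:

'clean.retain_commits' = '3',      -- maps to hoodie.cleaner.commits.retained
'hoodie.keep.min.commits' = '6',   -- must stay greater than the retained commit count
'hoodie.keep.max.commits' = '20'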
@ad1happy2go (Collaborator) commented

Thanks for raising this, @dheemanthgowda. Can you also update the subject, please?

There is a previously raised issue, #11436, which also explains this behavior.

danny0405 changed the title from "[SUPPORT]" to "[SUPPORT] Slow upsert performance for Flink upsert" Oct 9, 2024
@danny0405 (Contributor) commented

@dheemanthgowda Thanks for the feedback. It looks like your table does not have partitioning fields, so each compaction triggers a whole-table rewrite, which is indeed costly for streaming ingestion. Did you try moving compaction out as a separate job?
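
For reference, Hudi ships a standalone Flink compactor (org.apache.hudi.sink.compact.HoodieFlinkCompactor) that can execute compaction as a separate job. A minimal sketch, assuming placeholder jar and table paths; flag availability can vary by Hudi version:

# Placeholder bundle jar name and table path; adjust to your Flink/Hudi versions.
./bin/flink run \
  -c org.apache.hudi.sink.compact.HoodieFlinkCompactor \
  lib/hudi-flink1.17-bundle-0.14.1.jar \
  --path s3a://my-bucket/warehouse/hudi_sink \
  --compaction-tasks 10 \
  --service   # keep running and execute new compaction plans as they appear

With this setup the streaming writer would keep 'compaction.schedule.enabled' = 'true' and 'compaction.async.enabled' = 'false', as in the original configuration, so plans are scheduled by the writer and executed by this job.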

danny0405 added the flink (Issues related to flink) label Oct 9, 2024
github-project-automation bot moved this to ⏳ Awaiting Triage in Hudi Issue Support Oct 9, 2024
@ad1happy2go (Collaborator) commented

@dheemanthgowda Were you able to check on this further by using async compaction?
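
For reference, in-job async compaction would be enabled through table options like the following; the values are illustrative, not recommendations:

'compaction.async.enabled' = 'true',   -- execute compaction inside the streaming job
'compaction.trigger.strategy' = 'num_commits',
'compaction.delta_commits' = '5',      -- illustrative: compact every 5 delta commits rather than 1
'compaction.tasks' = '10'              -- illustrative compaction parallelism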
