
[Bug] Writing to a Paimon table with the Hive engine via MR throws an exception #4537

Open
@FrommyMind

Description

Search before asking

  • I searched in the issues and found nothing similar.

Paimon version

0.9

Compute Engine

hive: 2.1-cdh-6.3-1

Minimal reproduce step

Create a Paimon table in Beeline:

SET hive.metastore.warehouse.dir=/user/hive/warehouse;

CREATE TABLE hive_test_table(
    a INT COMMENT 'The a field',
    b STRING COMMENT 'The b field'
)
STORED BY 'org.apache.paimon.hive.PaimonStorageHandler';

After that, try to insert data:

insert into hive_test_table values (3, 'paimon');

What doesn't meet your expectations?

The MapReduce job failed with the error below.

2024-11-17 22:11:46,830 INFO [main] org.apache.hadoop.hive.ql.exec.TableScanOperator: Initializing operator TS[0]
2024-11-17 22:11:46,830 INFO [main] org.apache.hadoop.hive.ql.exec.SelectOperator: Initializing operator SEL[1]
2024-11-17 22:11:47,196 INFO [main] org.apache.hadoop.hive.ql.exec.SelectOperator: SELECT struct<tmp_values_col1:string,tmp_values_col2:string>
2024-11-17 22:11:47,203 INFO [main] org.apache.hadoop.hive.ql.exec.FileSinkOperator: Initializing operator FS[3]
2024-11-17 22:11:47,204 INFO [main] org.apache.hadoop.conf.Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
2024-11-17 22:11:47,621 INFO [main] org.apache.hadoop.hive.ql.exec.FileSinkOperator: Using serializer : org.apache.paimon.hive.PaimonSerDe@6ede46f6 and formatter : org.apache.paimon.hive.mapred.PaimonOutputFormat@66273da0
2024-11-17 22:11:47,621 INFO [main] org.apache.hadoop.conf.Configuration.deprecation: mapred.healthChecker.script.timeout is deprecated. Instead, use mapreduce.tasktracker.healthchecker.script.timeout
2024-11-17 22:11:47,634 INFO [main] org.apache.hadoop.hive.ql.exec.FileSinkOperator: New Final Path: FS hdfs://cdh01.daniel.com:8020/user/hive/warehouse/default.db/_tmp.hive_test_table/000000_0
2024-11-17 22:11:47,781 INFO [main] org.apache.hadoop.hive.ql.exec.FileSinkOperator: FS[3]: records written - 1
2024-11-17 22:11:47,913 INFO [main] org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.zstd]
2024-11-17 22:11:48,249 INFO [main] org.apache.hadoop.hive.ql.exec.MapOperator: MAP[0]: records read - 1
2024-11-17 22:11:48,249 INFO [main] org.apache.hadoop.hive.ql.exec.MapOperator: MAP[0]: Total records read - 1. abort - false
2024-11-17 22:11:48,249 INFO [main] org.apache.hadoop.hive.ql.exec.MapOperator: DESERIALIZE_ERRORS:0, RECORDS_IN:1, 
2024-11-17 22:11:48,250 INFO [main] org.apache.hadoop.hive.ql.exec.FileSinkOperator: FS[3]: records written - 1
2024-11-17 22:11:48,250 INFO [main] org.apache.hadoop.hive.ql.exec.FileSinkOperator: TOTAL_TABLE_ROWS_WRITTEN:1, RECORDS_OUT_1_default.hive_test_table:1, 
2024-11-17 22:11:48,272 INFO [main] org.apache.hadoop.mapred.Task: Task:attempt_1731851152753_0005_m_000000_0 is done. And is in the process of committing
2024-11-17 22:11:48,276 INFO [main] org.apache.hadoop.mapred.Task: Task attempt_1731851152753_0005_m_000000_0 is allowed to commit now
2024-11-17 22:11:48,357 ERROR [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.UnsatisfiedLinkError: com.github.luben.zstd.ZstdOutputStreamNoFinalizer.recommendedCOutSize()J
	at com.github.luben.zstd.ZstdOutputStreamNoFinalizer.recommendedCOutSize(Native Method)
	at com.github.luben.zstd.ZstdOutputStreamNoFinalizer.<clinit>(ZstdOutputStreamNoFinalizer.java:30)
	at com.github.luben.zstd.RecyclingBufferPool.<clinit>(RecyclingBufferPool.java:17)
	at org.apache.paimon.shade.org.apache.parquet.hadoop.codec.ZstandardCodec.createOutputStream(ZstandardCodec.java:107)
	at org.apache.paimon.shade.org.apache.parquet.hadoop.codec.ZstandardCodec.createOutputStream(ZstandardCodec.java:100)
	at org.apache.paimon.shade.org.apache.parquet.hadoop.CodecFactory$HeapBytesCompressor.compress(CodecFactory.java:176)
	at org.apache.paimon.shade.org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:168)
	at org.apache.paimon.shade.org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:59)
	at org.apache.paimon.shade.org.apache.parquet.column.impl.ColumnWriterBase.writePage(ColumnWriterBase.java:389)
	at org.apache.paimon.shade.org.apache.parquet.column.impl.ColumnWriteStoreBase.flush(ColumnWriteStoreBase.java:186)
	at org.apache.paimon.shade.org.apache.parquet.column.impl.ColumnWriteStoreV1.flush(ColumnWriteStoreV1.java:29)
	at org.apache.paimon.shade.org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:185)
	at org.apache.paimon.shade.org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:124)
	at org.apache.paimon.shade.org.apache.parquet.hadoop.ParquetWriter.close(ParquetWriter.java:112)
	at org.apache.paimon.format.parquet.writer.ParquetBulkWriter.close(ParquetBulkWriter.java:52)
	at org.apache.paimon.io.SingleFileWriter.close(SingleFileWriter.java:170)
	at org.apache.paimon.io.RowDataFileWriter.close(RowDataFileWriter.java:104)
	at org.apache.paimon.io.RollingFileWriter.closeCurrentWriter(RollingFileWriter.java:131)
	at org.apache.paimon.io.RollingFileWriter.close(RollingFileWriter.java:168)
	at org.apache.paimon.append.AppendOnlyWriter$DirectSinkWriter.flush(AppendOnlyWriter.java:418)
	at org.apache.paimon.append.AppendOnlyWriter.flush(AppendOnlyWriter.java:219)
	at org.apache.paimon.append.AppendOnlyWriter.prepareCommit(AppendOnlyWriter.java:207)
	at org.apache.paimon.operation.AbstractFileStoreWrite.prepareCommit(AbstractFileStoreWrite.java:210)
	at org.apache.paimon.operation.MemoryFileStoreWrite.prepareCommit(MemoryFileStoreWrite.java:152)
	at org.apache.paimon.table.sink.TableWriteImpl.prepareCommit(TableWriteImpl.java:253)
	at org.apache.paimon.table.sink.TableWriteImpl.prepareCommit(TableWriteImpl.java:260)
	at org.apache.paimon.hive.mapred.PaimonOutputCommitter.commitTask(PaimonOutputCommitter.java:95)
	at org.apache.hadoop.mapred.OutputCommitter.commitTask(OutputCommitter.java:343)
	at org.apache.hadoop.mapred.Task.commit(Task.java:1341)
	at org.apache.hadoop.mapred.Task.done(Task.java:1185)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:351)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)

2024-11-17 22:11:48,460 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping MapTask metrics system...
2024-11-17 22:11:48,460 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system stopped.
2024-11-17 22:11:48,461 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system shutdown complete.
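
My reading of the failure: the UnsatisfiedLinkError is thrown from the static initializer of com.github.luben.zstd.ZstdOutputStreamNoFinalizer, so the class itself was found, but the zstd-jni native library that actually got loaded does not export recommendedCOutSize. That usually means an older zstd-jni jar somewhere on the task classpath is shadowing the version that Paimon's shaded Parquet expects. A minimal probe to confirm which jar wins (ZstdJniProbe is my own hypothetical name; run it with the same classpath as the failing MR task):

// A minimal sketch, assuming the failing task's classpath. Only the
// zstd-jni class it loads is real; the probe class itself is hypothetical.
public class ZstdJniProbe {
    public static void main(String[] args) throws Exception {
        // Class.forName runs the static initializer, which is exactly where
        // the UnsatisfiedLinkError above originates, so this either fails
        // the same way or reveals which jar the class came from.
        Class<?> c = Class.forName("com.github.luben.zstd.ZstdOutputStreamNoFinalizer");
        System.out.println("zstd-jni loaded from: "
                + c.getProtectionDomain().getCodeSource().getLocation());
    }
}

If the printed location is a CDH jar rather than the Paimon bundle, that mismatch would explain the error; swapping in a matching zstd-jni version (or, if I read Paimon's options correctly, setting the table's file.compression to a codec other than zstd) should work around it.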

Anything else?

I checked whether my cluster supports zstd compression. I used the following command to run an MR job:

hadoop jar /opt/cloudera/parcels/CDH/jars/hadoop-mapreduce-examples*.jar wordcount -Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.ZStandardCodec -Dmapreduce.map.output.compress=true -Dmapreduce.output.fileoutputformat.compress=true -Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.ZStandardCodec wcin wcout-zst

The job ran successfully.

Part of that job's log:

2024-11-17 22:07:43,023 INFO [main] org.apache.hadoop.conf.Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
2024-11-17 22:07:43,410 INFO [main] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: File Output Committer Algorithm version is 2
2024-11-17 22:07:43,410 INFO [main] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2024-11-17 22:07:43,421 INFO [main] org.apache.hadoop.mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
2024-11-17 22:07:43,565 INFO [main] org.apache.hadoop.mapred.MapTask: Processing split: hdfs://cdh01.daniel.com:8020/user/hive/wcin/yum.log:0+0
2024-11-17 22:07:43,672 INFO [main] org.apache.hadoop.mapred.MapTask: (EQUATOR) 0 kvi 67108860(268435440)
2024-11-17 22:07:43,672 INFO [main] org.apache.hadoop.mapred.MapTask: mapreduce.task.io.sort.mb: 256
2024-11-17 22:07:43,672 INFO [main] org.apache.hadoop.mapred.MapTask: soft limit at 214748368
2024-11-17 22:07:43,672 INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 0; bufvoid = 268435456
2024-11-17 22:07:43,672 INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 67108860; length = 16777216
2024-11-17 22:07:43,679 INFO [main] org.apache.hadoop.mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
2024-11-17 22:07:43,696 INFO [main] org.apache.hadoop.mapred.MapTask: Starting flush of map output
2024-11-17 22:07:43,715 INFO [main] org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.zst]
2024-11-17 22:07:43,739 INFO [main] org.apache.hadoop.mapred.Task: Task:attempt_1731851152753_0004_m_000000_0 is done. And is in the process of committing
2024-11-17 22:07:43,770 INFO [main] org.apache.hadoop.mapred.Task: Task 'attempt_1731851152753_0004_m_000000_0' done.
2024-11-17 22:07:43,778 INFO [main] org.apache.hadoop.mapred.Task: Final Counters for attempt_1731851152753_0004_m_000000_0: Counters: 29
	File System Counters
		FILE: Number of bytes read=0
		FILE: Number of bytes written=220970
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=116
		HDFS: Number of bytes written=0
		HDFS: Number of read operations=3
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=0
		HDFS: Number of bytes read erasure-coded=0
	Map-Reduce Framework
		Map input records=0
		Map output records=0
		Map output bytes=0
		Map output materialized bytes=90
		Input split bytes=116
		Combine input records=0
		Combine output records=0
		Spilled Records=0
		Failed Shuffles=0
		Merged Map outputs=0
		GC time elapsed (ms)=41
		CPU time spent (ms)=430
		Physical memory (bytes) snapshot=482156544
		Virtual memory (bytes) snapshot=2589884416
		Total committed heap usage (bytes)=480247808
		Peak Map Physical memory (bytes)=482156544
		Peak Map Virtual memory (bytes)=2589884416
	File Input Format Counters 
		Bytes Read=0
2024-11-17 22:07:43,879 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping MapTask metrics system...
2024-11-17 22:07:43,879 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system stopped.
2024-11-17 22:07:43,879 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system shutdown complete.
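
One caveat about this check, as far as I understand the two code paths: org.apache.hadoop.io.compress.ZStandardCodec goes through Hadoop's own native code (libhadoop linked against libzstd), while the failing write goes through zstd-jni (com.github.luben.zstd), which Paimon's shaded Parquet uses and which loads its own bundled native library. So a successful wordcount does not prove that zstd-jni can initialize inside the task JVM. A sketch that exercises both paths in one JVM (CodecProbe is a hypothetical name; the classes and calls are standard Hadoop / zstd-jni API):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.ZStandardCodec;

public class CodecProbe {
    public static void main(String[] args) throws Exception {
        // Path 1: Hadoop's codec, the one the wordcount test exercises.
        // Creating a compressor requires Hadoop's native zstd bindings.
        ZStandardCodec codec = new ZStandardCodec();
        codec.setConf(new Configuration());
        codec.createCompressor();
        System.out.println("Hadoop ZStandardCodec: OK");

        // Path 2: zstd-jni, the path Paimon's Parquet writer takes.
        // Loading this class runs the same static initializer that
        // failed in the stack trace above.
        Class.forName("com.github.luben.zstd.ZstdOutputStreamNoFinalizer");
        System.out.println("zstd-jni: OK");
    }
}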

Are you willing to submit a PR?

  • I'm willing to submit a PR!
