Skip to content

Conversation

@abhishekd0907
Copy link

Issue

toParquetMetadata method converts org.apache.parquet.hadoop.metadata.ParquetMetadata to org.apache.parquet.format.FileMetaData but this does not set the dictionary page offset bit in FileMetaData.

When a FileMetaData object is serialized while writing to the footer and then deserialized, the dictionary offset is lost as the dictionary page offset bit was never set.

PARQUET-1850  tried to fix this but it did only a partial fix.

It sets setDictionary_page_offset only if getEncodingStats are present

if (columnMetaData.getEncodingStats() != null
&& columnMetaData.getEncodingStats().hasDictionaryPages())
{ metaData.setDictionary_page_offset(columnMetaData.getDictionaryPageOffset()); } 

Fix

However, it should setDictionary_page_offset even when getEncodingStats are not present but encodings are present.

It should use the implementation in ColumnChunkMetatdata below:

public boolean hasDictionaryPage() {
EncodingStats stats = getEncodingStats();
if (stats != null) { 
return stats.hasDictionaryPages() && stats.hasDictionaryEncodedPages(); 
}

Set<Encoding> encodings = getEncodings();
return (encodings.contains(PLAIN_DICTIONARY) || encodings.contains(RLE_DICTIONARY));
} 

So new change in ParquetMetadataCOnvertor should be like:

if (columnMetaData.hasDictionaryPage()) { metaData.setDictionary_page_offset(columnMetaData.getDictionaryPageOffset()); }

Test

Added a new UT for this scenario. All existing UTs should also pass.

Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Thanks for the fix and adding tests!

@wgtmac
Copy link
Member

wgtmac commented May 6, 2024

Error:  Failed to execute goal com.diffplug.spotless:spotless-maven-plugin:2.30.0:check (default) on project parquet-hadoop: The following files had format violations:
Error:      src/test/java/org/apache/parquet/format/converter/TestParquetMetadataConverter.java
Error:          @@ -1277,9 +1277,8 @@
Error:           ····return·createParquetMetaData(dicEncoding,·dataEncoding,·true);
Error:           ··}
Error:           
Error:          -
Error:          -··private·static·ParquetMetadata·createParquetMetaData(Encoding·dicEncoding,·Encoding·dataEncoding,
Error:          -·······················································boolean·includeDicStats)·{
Error:          +··private·static·ParquetMetadata·createParquetMetaData(
Error:          +······Encoding·dicEncoding,·Encoding·dataEncoding,·boolean·includeDicStats)·{
Error:           ····MessageType·schema·=·parseMessageType("message·schema·{·optional·int32·col·(INT_32);·}");
Error:           ····org.apache.parquet.hadoop.metadata.FileMetaData·fileMetaData·=
Error:           ········new·org.apache.parquet.hadoop.metadata.FileMetaData(schema,·new·HashMap<String,·String>(),·null);
Error:  Run 'mvn spotless:apply' to fix these violations.
Error:  -> [Help 1]

Please make the CI happy.

Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems there are some test failures. Please fix them accordingly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants