
ARROW-3075: [C++] Merge parquet-cpp codebase into Arrow C++ codebase #2453

Merged
merged 319 commits into from
Sep 8, 2018

Conversation

wesm
Member

@wesm wesm commented Aug 21, 2018

Merges the apache/parquet-cpp code tree into the apache/arrow C++ codebase for the community's combined development process.

xhochy and others added 30 commits May 9, 2016 11:55
Author: Uwe L. Korn <[email protected]>

Closes apache#100 from xhochy/parquet-607 and squashes the following commits:

88fbfb5 [Uwe L. Korn] PARQUET-607: Public writer header

Change-Id: Iee0998f05d5a9625e17c3e9ae4b94a07ac5376a4
Also fixes a few warnings in the release-mode build

Author: Deepak Majeti <[email protected]>

Closes apache#102 from majetideepak/ReleaseBuildErrors and squashes the following commits:

b54f126 [Deepak Majeti] added api to read additional column metadata

Change-Id: Iddf193def450c092090a95758f27c1dcad87185a
The LZ4 algorithm is not used in the Parquet format. We will need to make dynamic calls to liblzo in a form that does not cause GPL complications when we do implement the LZO compression type.

Author: Wes McKinney <[email protected]>

Closes apache#103 from wesm/PARQUET-614 and squashes the following commits:

dd35249 [Wes McKinney] Remove unneeded LZ4-related code

Change-Id: I470a34b91930841b0cde20f5299ad2fd39827758
Author: Uwe L. Korn <[email protected]>

Closes apache#105 from xhochy/parquet-616 and squashes the following commits:

eb614c5 [Uwe L. Korn] PARQUET-616: WriteBatch should accept const arrays

Change-Id: I306278772f3a2f77d3a3e48941c615c13b3bbb99
Author: Uwe L. Korn <[email protected]>

Closes apache#95 from xhochy/parquet-600 and squashes the following commits:

87882fd [Uwe L. Korn] Use MaxBufferSize for size estimation
0fd2a34 [Uwe L. Korn] PARQUET-600: Add benchmarks for RLE-Level encoding

Change-Id: I9fa9bf6bb5f08ac760b5cf6c9f65b797c8bb7444
Author: Uwe L. Korn <[email protected]>

Closes apache#108 from xhochy/parquet-620 and squashes the following commits:

da122ad [Uwe L. Korn] Ensure metadata is written only once

Change-Id: I7653597fdf69c961545d6c978fdc1367267adee7
Author: Uwe L. Korn <[email protected]>

Closes apache#107 from xhochy/parquet-619 and squashes the following commits:

612a2f2 [Uwe L. Korn] Check that the correct data was written
66fb32a [Uwe L. Korn] Add unit test for LocalFileOutputStream
a1179d1 [Uwe L. Korn] Add missing header
6ecb940 [Uwe L. Korn] Add OutputStream for local files

Change-Id: Ie9403af419f8712982752b4f5256136bf9d7b26f
Author: Uwe L. Korn <[email protected]>

Closes apache#109 from xhochy/parquet-621 and squashes the following commits:

3850a0f [Uwe L. Korn] Add unit test
dc1552f [Uwe L. Korn] Add flag to indicate if decimalmetadata is set

Change-Id: I7cb412aa8165af4cae1b1aca7360a01129cfe456
Author: Uwe L. Korn <[email protected]>

Closes apache#113 from xhochy/parquet-598 and squashes the following commits:

a22d7b7 [Uwe L. Korn] Address review comments
a82159b [Uwe L. Korn] Move specialization of InitValues to separate header
e2b48ea [Uwe L. Korn] Move specialization of InitValues to separate header
c3a6790 [Uwe L. Korn] PARQUET-598: Test writing all primitive data types

Change-Id: I694f229427244a6be993587030623133caeb2483
Also fixes the RLE decoding benchmark, which did not pass the size to the buffer.

Author: Uwe L. Korn <[email protected]>

Closes apache#111 from xhochy/parquet-625 and squashes the following commits:

94d1e60 [Uwe L. Korn] Cast to signed int
4ccc149 [Uwe L. Korn] Formatting fixes
e117706 [Uwe L. Korn] Use int instead of uint32_t for the batch size
a4241ec [Uwe L. Korn] Remove obsolete TODOs
65091be [Uwe L. Korn] PARQUET-625: Improve RLE write performance

Change-Id: I8958473ed3392e1800d3144789ec60a394643306
Author: Uwe L. Korn <[email protected]>

Closes apache#117 from xhochy/parquet-629 and squashes the following commits:

8f8c9f1 [Uwe L. Korn] PARQUET-629: RowGroupSerializer should only close itself once

Change-Id: If37bb0ba4ba07bffca614afd41341316e7a04f0a
For now, this only exposes the property publicly so I can make use of it in Arrow.

Also used the interface design from parquet-mr to make the `WriterProperties` immutable so we can safely pass references of them around once they have been passed to a `ParquetFileWriter`.

Author: Uwe L. Korn <[email protected]>

Closes apache#119 from xhochy/parquet-633 and squashes the following commits:

d86436a [Uwe L. Korn] Make Writerproperties references const
2803aca [Uwe L. Korn] PARQUET-633: Add version to WriterProperties

Change-Id: I4634b846899993e9b126887cb32d95b572ac9153
Link all (static) third-party dependencies privately so as not to expose their APIs.

Also combine all static libs containing public interfaces into one. To correctly link against parquet_static from another lib, you still need all third-party static libs, but at least the main parquet static library is now a single one. I will address the "single static lib with all dependencies included" scenario later once we have visibility macros.

Author: Uwe L. Korn <[email protected]>

Closes apache#120 from xhochy/parquet-634 and squashes the following commits:

7059b74 [Uwe L. Korn] PARQUET-634: Consistent private linking of dependencies

Change-Id: I7d32dc4b129e15109fe5e7003b7401c485c1b8a8
Hello,

I'm working on compressed writes and would like some advice:

* What would be the preferred way to pass down per-column codec settings? `Builder::set_column_codec(column index (?), Compression::type)`?
* It appears that there's some duplication in `ParquetFileWriter::Open`: an allocator is already included in `WriterProperties`. I presume this should be cleaned up in a separate PR?

Author: Artem Tarasov <[email protected]>

Closes apache#121 from lomereiter/parquet-592 and squashes the following commits:

03d4ed8 [Artem Tarasov] address comments from reviewers
a1fe78e [Artem Tarasov] remove redundant 'allocator' constructor parameter
d2c0473 [Artem Tarasov] allow to reuse builder object
dfbee1f [Artem Tarasov] props -> properties
e687e81 [Artem Tarasov] serialization test cleanup
470e176 [Artem Tarasov] more convenient Builder methods
5d236df [Artem Tarasov] per-column compression support
a80c2f2 [Artem Tarasov] don't use shared_ptr where raw is enough;
7c745d9 [Artem Tarasov] run three separate tests
25a23e5 [Artem Tarasov] loop over codec types in file serialization test
4a0a044 [Artem Tarasov] PARQUET-592: Support compressed writes

Change-Id: I0ff33617ddcf833e4a06e7a67177879adceca39a
For now we still only support PLAIN encoding. I will implement the other encodings in separate PRs to avoid huge changesets.

Author: Uwe L. Korn <[email protected]>

Closes apache#122 from xhochy/parquet-636 and squashes the following commits:

b98a575 [Uwe L. Korn] Lint fixes
37f1b7b [Uwe L. Korn] Add comment to describe the column default/specific vars
7ef8b12 [Uwe L. Korn] PARQUET-636: Expose selection for different encodings

Change-Id: If021893dc679f92af5b2a19ebd43c5bf71b12f36
…eReader::NextPage

Small change but still some visible impact.

Author: Uwe L. Korn <[email protected]>

Closes apache#126 from xhochy/parquet-641 and squashes the following commits:

9902b0e [Uwe L. Korn] PARQUET-641: Instantiate stringstream only if needed in SerializedPageReader::NextPage

Change-Id: I6aee2cd288ca4aa00fc19fc2cadd8821f60fdc37
I added a test so that DCHECK does not leak into the public headers. I prefer this to renaming the macro.

Author: Wes McKinney <[email protected]>

Closes apache#127 from wesm/no-export-dcheck and squashes the following commits:

52a2d22 [Wes McKinney] Remove exposure of DCHECK macros from publicly-visible headers

Change-Id: Ib79d09fe31de928fe55ffa1f04d34a84092f5494
Author: Uwe L. Korn <[email protected]>

Closes apache#128 from xhochy/parquet-643 and squashes the following commits:

6f18fca [Uwe L. Korn] PARQUET-643: Add const modifier to schema pointer reference in ParquetFileWriter

Change-Id: I1e019444d6154785e062c195d9160e4dd8799a5a
…gcc easier

Also fixed some clang compiler warnings. See additions to `README.md` for more.

Author: Wes McKinney <[email protected]>

Closes apache#130 from wesm/PARQUET-646 and squashes the following commits:

78eb21d [Wes McKinney] Add missing CompilerInfo.cmake
f774fef [Wes McKinney] Add CMAKE_CLANG_OPTIONS to cmake file, README instructions, fix clang compiler warnings

Change-Id: Ic862896ee527039bf10ec464ef80dce9c73eb27e
This was a bit of a rabbit hole, because on Linux you must instruct gcc how to hide symbols from your static third-party libraries (see `src/parquet/symbols.map`). I learned about this from the Apache Kudu (incubating) client CMake files.
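For readers unfamiliar with the technique, a linker version script along the lines of `src/parquet/symbols.map` looks roughly like this (an illustrative fragment, not the actual file; it is passed to gcc via `-Wl,--version-script=symbols.map` and exports only the library's own namespace while hiding statically linked third-party symbols):

```
{
  global:
    extern "C++" {
      *parquet::*;
    };
  local:
    *;
};
```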

There is probably still a bit more to do here (I may have missed some things that should be exported, or exported things that should not be); I will make sure the Arrow-Parquet tests still pass, but wanted to get some early feedback.

Author: Wes McKinney <[email protected]>

Closes apache#131 from wesm/PARQUET-489 and squashes the following commits:

2cd834a [Wes McKinney] Link static libs to all executables
ed47d1e [Wes McKinney] Hide more symbols when statically-linking libstdc++
d052387 [Wes McKinney] Correctly export default_allocator
9f17ea3 [Wes McKinney] Install visibility.h
9d97985 [Wes McKinney] - Adapt symbol.map from Kudu to hide thirdparty static deps - Link tests to static lib - Add visibility header - First cut exports

Change-Id: I83de28c7483d2e3c4563ba9f6feb3af39e771744
…se builds

Author: Deepak Majeti <[email protected]>

Closes apache#132 from majetideepak/PARQUET-551 and squashes the following commits:

3b1212a [Deepak Majeti] addressed comments
6fdd3c4 [Deepak Majeti] PARQUET-551:Handle compiler warnings due to disabled DCHECKs in release builds

Change-Id: Ieb7d6c08c134c5361038b60812896d6a1b11ab4f
This can cause conflicts with other libraries that also use this common macro (which comes originally from Google's gutil library).

Author: Wes McKinney <[email protected]>

Closes apache#135 from wesm/PARQUET-657 and squashes the following commits:

bc4f09b [Wes McKinney] Do not define macro if has been defined elsewhere

Change-Id: Iffcb753b4a4ed129af80826d126de6c23e533728
Author: Korn, Uwe <[email protected]>

Closes apache#136 from xhochy/parquet-658 and squashes the following commits:

1e9ca77 [Korn, Uwe] PARQUET-658: Add virtual destructor to ColumnReader

Change-Id: I107371209072e611a755bd3a556b94ed96eff97d
…lasses

While we've already force-instantiated these template classes, they were not visible across the board. gcc on Linux has no problem with it, but I found that the Arrow OS X builds were core dumping (with a misleading error message) because of it. The test suites pass locally for me now.

Author: Wes McKinney <[email protected]>

Closes apache#137 from wesm/PARQUET-659 and squashes the following commits:

74907dc [Wes McKinney] Export extern templates for typed column reader/writer classes

Change-Id: I28598e3eff6239a4dc83465486afdfd6725ffdbc
…xport

This was causing downstream failure in ARROW-237. Now thirdparty libraries can safely throw `ParquetException` and be properly recognized on both ends.

Author: Wes McKinney <[email protected]>

Closes apache#139 from wesm/PARQUET-662 and squashes the following commits:

c1458d2 [Wes McKinney] Move ParquetException impl to object code, explicitly export. Add LINKAGE option to ADD_PARQUET_TEST, test throwing ParquetException via shared lib. Build shared library in gcc build.

Change-Id: I27160eb88f1b0c651323f30e4b709348112763e9
Testing on my own data shows an order-of-magnitude improvement.

I separated the commits for clarity; each one gives an incremental improvement.

The motivation for the last commit (allowing NULL for def_levels/rep_levels) is a workaround for Spark, which doesn't seem to be able to generate columns without def_levels, even when a column is specified as "not nullable".

Author: Eric Daniel <[email protected]>

Closes apache#140 from edani/decode-perf and squashes the following commits:

eec0855 [Eric Daniel] Ran "make format"
0568de6 [Eric Daniel] Only check num. of repetition levels when def_levels is set
5f54e1c [Eric Daniel] Added benchmarks for dictionary decoding
087945b [Eric Daniel] Style fixes from code review
906be73 [Eric Daniel] Allow the reader to skip rep/def decoding
04b7391 [Eric Daniel] Fast bit unpacking
bda5d84 [Eric Daniel] The bit reader can decode in batches
3f10378 [Eric Daniel] Improve decoding of repeated values in the dict encoding

Change-Id: I45421fc2ada5d06863ddd765470f79b45ec4991a
This is a working version, but there is still some minor stuff to clean up (I'd be happy for comments).

I had to change the Encoder interface to make it work with dictionary encoding. Also, dictionary-encoded RowGroups are currently limited to a single DataPage.

Author: Uwe L. Korn <[email protected]>

Closes apache#142 from xhochy/parquet-666 and squashes the following commits:

95e4753 [Uwe L. Korn] Fix PARQUET-591 and get make more use of WriterProperties
8f5b2bb [Uwe L. Korn] Comment out unused private variable
32adfd9 [Uwe L. Korn] Remove compression code duplication
6804e2a [Uwe L. Korn] Remove deprecated TODO
0d58ef2 [Uwe L. Korn] Pass DictionaryPage objects to the page writer
95fdccd [Uwe L. Korn] Get rid of temporarly created struct
cce17f9 [Uwe L. Korn] Store serialised pages
2841494 [Uwe L. Korn] Remove unsigned/signed comparison
7ce98ca [Uwe L. Korn] Store pages in RAM until the end of the column
17533bf [Uwe L. Korn] Remove code duplication
a3b4043 [Uwe L. Korn] PARQUET-666: Add support for writing dictionaries

Change-Id: I72a4dfcd9f17159d338ed78f9bf20ff6106dcbf2
Author: Uwe L. Korn <[email protected]>

Closes apache#146 from xhochy/parquet-694 and squashes the following commits:

a21638b [Uwe L. Korn] PARQUET-694: Revert default data page size back to 1M

Change-Id: I288ff6f8a630f607abc2792c0698ab6286c11136
This patch adds an API to read and write metadata, as well as improving the writer properties class.
I am planning to add some comments to the code next; meanwhile, any feedback will be helpful.

Author: Deepak Majeti <[email protected]>
Author: Uwe L. Korn <[email protected]>

Closes apache#143 from majetideepak/metadata and squashes the following commits:

2b8a546 [Deepak Majeti] comments and more testing for metadata api
59147c0 [Deepak Majeti] fix memory leak
34e8975 [Deepak Majeti] review comments and format
a977c6a [Deepak Majeti] added comment for file path
d4f0e82 [Deepak Majeti] friendship between reader and writer. implements PARQUET-692
1047507 [Uwe L. Korn] Better dictionary encoding user experience
7f37f85 [Deepak Majeti] review edits
9dab591 [Deepak Majeti] minor rename
a6b0646 [Deepak Majeti] added more dictionary fallback and enabled options to writer properties
3b9bad3 [Deepak Majeti] Metadata Reader writer

Change-Id: Id73817e20f4beb4810294067d02bb8acb9b5f033
…quet-format

Support logical types TIME_MICROS and TIMESTAMP_MICROS

Also, I think the current code was incorrect: Parquet reserved the logical types 8 and 10, but those were completely omitted from types.h, so types with greater indices were mapped incorrectly.

Author: Florian Scheibner <[email protected]>

Closes apache#147 from flode/master and squashes the following commits:

6f81adc [Florian Scheibner] PARQUET-699: Update parquet.thrift from https://github.com/apache/parquet-format

Change-Id: I04bb0f5fe6b1da5e4e8c5d61b8117e179cf10790
@wesm wesm force-pushed the parquet-cpp-merge branch 3 times, most recently from c3d8f60 to 354e05a Compare September 5, 2018 00:56
@wesm
Member Author

wesm commented Sep 5, 2018

I have access to a Windows machine finally today so I should be able to get the build passing...

@wesm
Member Author

wesm commented Sep 5, 2018

To explain what's going on for people watching: I ran into an issue where we need to use __declspec(dllimport) when linking Arrow into Parquet so that certain symbols (e.g. static members, like those in arrow::CpuInfo) are visible there. We weren't using dllimport in Parquet, which could result in linker issues for people using parquet.dll, so I went ahead and added that here.

@wesm
Member Author

wesm commented Sep 5, 2018

Oof. I'll resolve the rebase conflict with pyarrow/parquet.py right before we merge this to make things less painful for me

@wesm
Member Author

wesm commented Sep 5, 2018

I'm stuck on this linker issue:

   Creating library release\parquet-column_scanner-test.lib and object release\parquet-column_scanner-test.exp
column_scanner-test.cc.obj : error LNK2019: unresolved external symbol "private: static bool arrow::CpuInfo::initialized_" (?initialized_@CpuInfo@arrow@@0_NA) referenced in function "public: __cdecl parquet::DictEncoder<struct parquet::DataType<1> >::DictEncoder<struct parquet::DataType<1> >(class parquet::ColumnDescriptor const *,class parquet::ChunkedAllocator *,class arrow::MemoryPool *)" (??0?$DictEncoder@U?$DataType@$00@parquet@@@parquet@@QEAA@PEBVColumnDescriptor@1@PEAVChunkedAllocator@1@PEAVMemoryPool@arrow@@@Z)
parquet_static.lib(column_writer.cc.obj) : error LNK2001: unresolved external symbol "private: static bool arrow::CpuInfo::initialized_" (?initialized_@CpuInfo@arrow@@0_NA)
release\parquet-column_scanner-test.exe : fatal error LNK1120: 1 unresolved externals
[337/364] Building CXX object src\parquet\CMakeFiles\parquet-column_writer-test.dir\column_writer-test.cc.obj
ninja: build stopped: subcommand failed.
(arrow) C:\projects\arrow\cpp\build>set lastexitcode=1 

I haven't been able to figure out what is different about the linker configuration yet that is causing this to fail.

@wesm
Member Author

wesm commented Sep 6, 2018

I plan to have this rebased and ready to merge today

@wesm
Member Author

wesm commented Sep 6, 2018

OK, I have fixed the MSVC issues. I'm going to rebase and see if we get a clean build

@wesm wesm force-pushed the parquet-cpp-merge branch 3 times, most recently from 23093d7 to 31d5a8a Compare September 6, 2018 22:57
wesm referenced this pull request in wesm/arrow Sep 6, 2018
This also changes the compiler options to use only SSE3. We can explore benchmarking to optionally enable SSE4 on supported systems for performance gains in the future.

@nongli would it be OK if I remove the unused delta encodings (they depend on code that is not in Impala anymore, or never was)? Are these encodings used extensively, or is it primarily PLAIN / DICTIONARY encoding?

Author: Wes McKinney <[email protected]>

Closes apache#31 from wesm/PARQUET-438 and squashes the following commits:

df9da64 [Wes McKinney] Add back (commented-out) ZigZig encode/decode impls
65c49bc [Wes McKinney] Fix comment
73c1dc1 [Wes McKinney] Fix cpplint errors post PARQUET-496
553ca0e [Wes McKinney] Fix cpplint errors
9528c83 [Wes McKinney] Updates RLE encoding routines and dependencies from upstream changes in Apache Impala (incubating). Add rle-test and bit-util-test modules unchanged from Impala.

Change-Id: I6dd1ba027b316de945c56f70b390cf44dcf09f63
@wesm wesm force-pushed the parquet-cpp-merge branch from 31d5a8a to ce2746e Compare September 6, 2018 23:48
@wesm
Member Author

wesm commented Sep 7, 2018

pkgconfig for libparquet is currently broken so there are some issues with the Windows wheel build still. I will try to fix the pkgconfig later so we don't have to hack around it in the meantime.

@wesm
Member Author

wesm commented Sep 7, 2018

The build is passing except for the Meson issue in the GLib build ARROW-3186. Please give me a thumbs up so I can rebase and merge this cleanly to avoid any extra pain, thank you =)

@kou
Member

kou commented Sep 7, 2018

I think the GLib build failure with Meson isn't a Meson problem.
Maybe the Python environment that runs /home/travis/miniconda/bin/meson doesn't include the Meson installed by `pip install meson` (https://travis-ci.org/apache/arrow/jobs/425563611#L1493).

The following debug print may help us:

    which meson
    head $(which meson)
    $(head -n1 $(which meson) | sed -e 's/^#! *//') -c 'import os; import sys; import pprint; pprint.pprint([list(os.walk(path)) for path in sys.path])'

@wesm
Member Author

wesm commented Sep 7, 2018

@kou I pinned meson to the prior version 0.47.1 in #2507 and it fixed the build. I think it's a problem with meson version 0.47.2 released on August 25

system. Add unit test label granularity options, ability to add component group
targets like 'make parquet' that build libraries and tests

Change-Id: I250fd10da3a9505952115b3cec18cd7cb5589bdb
@wesm wesm force-pushed the parquet-cpp-merge branch from 1b81fed to fe5d435 Compare September 7, 2018 05:58
@kou
Member

kou commented Sep 7, 2018

Umm, it's strange... because https://travis-ci.org/apache/arrow/jobs/425583565#L3060 uses Meson 0.47.2 and there is no error.

Anyway, it's OK with me if we use 0.47.1 for now.

@wesm
Member Author

wesm commented Sep 7, 2018

Just rebased this again. In theory should have a fully green build now

@wesm
Member Author

wesm commented Sep 7, 2018

Just in case anyone gets an itchy trigger finger:

  • do NOT use git rebase with this branch -- I have been taking great care to preserve the $COMMITTER for each historical patch

  • do NOT use the merge script

This should be merged to apache master with git merge --ff-only

Member

@xhochy xhochy left a comment


+1, LGTM

also verified that the build works locally.

@wesm
Member Author

wesm commented Sep 8, 2018

Thanks. I'm just rebasing now; if you could hold off on merging any patches to master until I can FF-merge this that would be great, should be in 20-30 minutes

@wesm wesm force-pushed the parquet-cpp-merge branch from fe5d435 to 9b4cd9c Compare September 8, 2018 13:56
@wesm
Member Author

wesm commented Sep 8, 2018

Build is clean (see my appveyor builds: https://ci.appveyor.com/project/wesm/arrow/build/1.0.2049), merging
