-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-3075: [C++] Merge parquet-cpp codebase into Arrow C++ codebase #2453
Conversation
Author: Uwe L. Korn <[email protected]> Closes apache#100 from xhochy/parquet-607 and squashes the following commits: 88fbfb5 [Uwe L. Korn] PARQUET-607: Public writer header Change-Id: Iee0998f05d5a9625e17c3e9ae4b94a07ac5376a4
Also fixes few warnings in release mode build Author: Deepak Majeti <[email protected]> Closes apache#102 from majetideepak/ReleaseBuildErrors and squashes the following commits: b54f126 [Deepak Majeti] added api to read additional column metadata Change-Id: Iddf193def450c092090a95758f27c1dcad87185a
The LZ4 algorithm is not used in the Parquet format. We will need to make dynamic calls to liblzo in a form that does not cause GPL complications when we do implement the LZO compression type. Author: Wes McKinney <[email protected]> Closes apache#103 from wesm/PARQUET-614 and squashes the following commits: dd35249 [Wes McKinney] Remove unneeded LZ4-related code Change-Id: I470a34b91930841b0cde20f5299ad2fd39827758
Author: Uwe L. Korn <[email protected]> Closes apache#105 from xhochy/parquet-616 and squashes the following commits: eb614c5 [Uwe L. Korn] PARQUET-616: WriteBatch should accept const arrays Change-Id: I306278772f3a2f77d3a3e48941c615c13b3bbb99
Author: Uwe L. Korn <[email protected]> Closes apache#95 from xhochy/parquet-600 and squashes the following commits: 87882fd [Uwe L. Korn] Use MaxBufferSize for size estimation 0fd2a34 [Uwe L. Korn] PARQUET-600: Add benchmarks for RLE-Level encoding Change-Id: I9fa9bf6bb5f08ac760b5cf6c9f65b797c8bb7444
Author: Uwe L. Korn <[email protected]> Closes apache#108 from xhochy/parquet-620 and squashes the following commits: da122ad [Uwe L. Korn] Ensure metadata is written only once Change-Id: I7653597fdf69c961545d6c978fdc1367267adee7
Author: Uwe L. Korn <[email protected]> Closes apache#107 from xhochy/parquet-619 and squashes the following commits: 612a2f2 [Uwe L. Korn] Check that the correct data was written 66fb32a [Uwe L. Korn] Add unit test for LocalFileOutputStream a1179d1 [Uwe L. Korn] Add missing header 6ecb940 [Uwe L. Korn] Add OutputStream for local files Change-Id: Ie9403af419f8712982752b4f5256136bf9d7b26f
Author: Uwe L. Korn <[email protected]> Closes apache#109 from xhochy/parquet-621 and squashes the following commits: 3850a0f [Uwe L. Korn] Add unit test dc1552f [Uwe L. Korn] Add flag to indicate if decimalmetadata is set Change-Id: I7cb412aa8165af4cae1b1aca7360a01129cfe456
Author: Uwe L. Korn <[email protected]> Closes apache#113 from xhochy/parquet-598 and squashes the following commits: a22d7b7 [Uwe L. Korn] Address review comments a82159b [Uwe L. Korn] Move specialization of InitValues to separate header e2b48ea [Uwe L. Korn] Move specialization of InitValues to separate header c3a6790 [Uwe L. Korn] PARQUET-598: Test writing all primitive data types Change-Id: I694f229427244a6be993587030623133caeb2483
Also fixes the RLE decoding benchmark which did not pass the size to the buffer. Author: Uwe L. Korn <[email protected]> Closes apache#111 from xhochy/parquet-625 and squashes the following commits: 94d1e60 [Uwe L. Korn] Cast to signed int 4ccc149 [Uwe L. Korn] Formatting fixes e117706 [Uwe L. Korn] Use int instead of uint32_t for the batch size a4241ec [Uwe L. Korn] Remove obsolete TODOs 65091be [Uwe L. Korn] PARQUET-625: Improve RLE write performance Change-Id: I8958473ed3392e1800d3144789ec60a394643306
Author: Uwe L. Korn <[email protected]> Closes apache#117 from xhochy/parquet-629 and squashes the following commits: 8f8c9f1 [Uwe L. Korn] PARQUET-629: RowGroupSerializer should only close itself once Change-Id: If37bb0ba4ba07bffca614afd41341316e7a04f0a
Yet only exposing the property publicly so I can make use of it in Arrow. Also used the interface design from parquet-mr to make the `WriterProperties` immutable so we can safely pass references of them around once they have been passed to a `ParquetFileWriter`. Author: Uwe L. Korn <[email protected]> Closes apache#119 from xhochy/parquet-633 and squashes the following commits: d86436a [Uwe L. Korn] Make Writerproperties references const 2803aca [Uwe L. Korn] PARQUET-633: Add version to WriterProperties Change-Id: I4634b846899993e9b126887cb32d95b572ac9153
Link all (static) third-party dependencies privately to not expose their API. Also combine all static libs containing public interfaces into one. To correctly link against parquet_static from another lib, you still need all thirdparty static libs but at least the main parquet static library is only a single one. Will address the scenario "single static lib with all dependencies included" later once we have visibility macros. Author: Uwe L. Korn <[email protected]> Closes apache#120 from xhochy/parquet-634 and squashes the following commits: 7059b74 [Uwe L. Korn] PARQUET-634: Consistent private linking of dependencies Change-Id: I7d32dc4b129e15109fe5e7003b7401c485c1b8a8
Hello, I'm working on compressed writes and would like some advice: * What would be the preferred way to pass down per-column codec settings? `Builder::set_column_codec(column index (?), Compression::type)`? * It appears that there's some duplication in `ParquetFileWriter::Open`, namely, there is an allocator included into `WriterProperties` — I presume this should be cleaned up via a separate PR? Author: Artem Tarasov <[email protected]> Closes apache#121 from lomereiter/parquet-592 and squashes the following commits: 03d4ed8 [Artem Tarasov] address comments from reviewers a1fe78e [Artem Tarasov] remove redundant 'allocator' constructor parameter d2c0473 [Artem Tarasov] allow to reuse builder object dfbee1f [Artem Tarasov] props -> properties e687e81 [Artem Tarasov] serialization test cleanup 470e176 [Artem Tarasov] more convenient Builder methods 5d236df [Artem Tarasov] per-column compression support a80c2f2 [Artem Tarasov] don't use shared_ptr where raw is enough; 7c745d9 [Artem Tarasov] run three separate tests 25a23e5 [Artem Tarasov] loop over codec types in file serialization test 4a0a044 [Artem Tarasov] PARQUET-592: Support compressed writes Change-Id: I0ff33617ddcf833e4a06e7a67177879adceca39a
Yet we still only support PLAIN encoding. Will implement the other encodings in separate PRs to not have huge changesets. Author: Uwe L. Korn <[email protected]> Closes apache#122 from xhochy/parquet-636 and squashes the following commits: b98a575 [Uwe L. Korn] Lint fixes 37f1b7b [Uwe L. Korn] Add comment to describe the column default/specific vars 7ef8b12 [Uwe L. Korn] PARQUET-636: Expose selection for different encodings Change-Id: If021893dc679f92af5b2a19ebd43c5bf71b12f36
…eReader::NextPage Small change but still some visible impact. Author: Uwe L. Korn <[email protected]> Closes apache#126 from xhochy/parquet-641 and squashes the following commits: 9902b0e [Uwe L. Korn] PARQUET-641: Instantiate stringstream only if needed in SerializedPageReader::NextPage Change-Id: I6aee2cd288ca4aa00fc19fc2cadd8821f60fdc37
I added a test so that DCHECK does not leak in the public headers. I prefer this to renaming the macro Author: Wes McKinney <[email protected]> Closes apache#127 from wesm/no-export-dcheck and squashes the following commits: 52a2d22 [Wes McKinney] Remove exposure of DCHECK macros from publicly-visible headers Change-Id: Ib79d09fe31de928fe55ffa1f04d34a84092f5494
Author: Uwe L. Korn <[email protected]> Closes apache#128 from xhochy/parquet-643 and squashes the following commits: 6f18fca [Uwe L. Korn] PARQUET-643: Add const modifier to schema pointer reference in ParquetFileWriter Change-Id: I1e019444d6154785e062c195d9160e4dd8799a5a
…gcc easier Also fixed some clang compiler warnings. See additions to `README.md` for more. Author: Wes McKinney <[email protected]> Closes apache#130 from wesm/PARQUET-646 and squashes the following commits: 78eb21d [Wes McKinney] Add missing CompilerInfo.cmake f774fef [Wes McKinney] Add CMAKE_CLANG_OPTIONS to cmake file, README instructions, fix clang compiler warnings Change-Id: Ic862896ee527039bf10ec464ef80dce9c73eb27e
This was a bit of rabbit hole, because on Linux you must instruct gcc how to hide symbols from your static thirdparty libraries (see `src/parquet/symbols.map`). I learned about this from Apache Kudu (incubating) client cmake files. Probably still a bit more to do here (and I may have missed some things we should export, and exporting things that should not be exported); I will make sure the Arrow-Parquet tests still pass, but wanted to get some early feedback. Author: Wes McKinney <[email protected]> Closes apache#131 from wesm/PARQUET-489 and squashes the following commits: 2cd834a [Wes McKinney] Link static libs to all executables ed47d1e [Wes McKinney] Hide more symbols when statically-linking libstdc++ d052387 [Wes McKinney] Correctly export default_allocator 9f17ea3 [Wes McKinney] Install visibility.h 9d97985 [Wes McKinney] - Adapt symbol.map from Kudu to hide thirdparty static deps - Link tests to static lib - Add visibility header - First cut exports Change-Id: I83de28c7483d2e3c4563ba9f6feb3af39e771744
…se builds Author: Deepak Majeti <[email protected]> Closes apache#132 from majetideepak/PARQUET-551 and squashes the following commits: 3b1212a [Deepak Majeti] addressed comments 6fdd3c4 [Deepak Majeti] PARQUET-551:Handle compiler warnings due to disabled DCHECKs in release builds Change-Id: Ieb7d6c08c134c5361038b60812896d6a1b11ab4f
This can cause conflicts with other libraries that also use this common macro (from Google's gutil library originally) Author: Wes McKinney <[email protected]> Closes apache#135 from wesm/PARQUET-657 and squashes the following commits: bc4f09b [Wes McKinney] Do not define macro if has been defined elsewhere Change-Id: Iffcb753b4a4ed129af80826d126de6c23e533728
Author: Korn, Uwe <[email protected]> Closes apache#136 from xhochy/parquet-658 and squashes the following commits: 1e9ca77 [Korn, Uwe] PARQUET-658: Add virtual destructor to ColumnReader Change-Id: I107371209072e611a755bd3a556b94ed96eff97d
…lasses While we've already force-instantiated these template classes, they were not visible across the board. gcc on Linux has no problem with it, but found that the Arrow OS X builds were core dumping (with a misleading error message) because of it. The test suites pass locally for me now. Author: Wes McKinney <[email protected]> Closes apache#137 from wesm/PARQUET-659 and squashes the following commits: 74907dc [Wes McKinney] Export extern templates for typed column reader/writer classes Change-Id: I28598e3eff6239a4dc83465486afdfd6725ffdbc
…xport This was causing downstream failure in ARROW-237. Now thirdparty libraries can safely throw `ParquetException` and be properly recognized on both ends. Author: Wes McKinney <[email protected]> Closes apache#139 from wesm/PARQUET-662 and squashes the following commits: c1458d2 [Wes McKinney] Move ParquetException impl to object code, explicitly export. Add LINKAGE option to ADD_PARQUET_TEST, test throwing ParquetException via shared lib. Build shared library in gcc build. Change-Id: I27160eb88f1b0c651323f30e4b709348112763e9
Testing on my own data shows an order-of-magnitude improvement. I separated the commits for clarity, each one gives an imcremental improvement. The motivation for the last commit (allowing NULL for def_levels/rep_level) is a workaround for Spark which doesn't seem to be able to generate columns without def_level, even when a column is specified as "not nullable". Author: Eric Daniel <[email protected]> Closes apache#140 from edani/decode-perf and squashes the following commits: eec0855 [Eric Daniel] Ran "make format" 0568de6 [Eric Daniel] Only check num. of repetition levels when def_levels is set 5f54e1c [Eric Daniel] Added benchmarks for dictionary decoding 087945b [Eric Daniel] Style fixes from code review 906be73 [Eric Daniel] Allow the reader to skip rep/def decoding 04b7391 [Eric Daniel] Fast bit unpacking bda5d84 [Eric Daniel] The bit reader can decode in batches 3f10378 [Eric Daniel] Improve decoding of repeated values in the dict encoding Change-Id: I45421fc2ada5d06863ddd765470f79b45ec4991a
Working version but still some minor stuff to clean up (but I'd be happy for comments). Had to change the Encoder interface to make it work with dictionary encoding. Also dictionary encoded RowGroups are currently limited to a single DataPage. Author: Uwe L. Korn <[email protected]> Closes apache#142 from xhochy/parquet-666 and squashes the following commits: 95e4753 [Uwe L. Korn] Fix PARQUET-591 and get make more use of WriterProperties 8f5b2bb [Uwe L. Korn] Comment out unused private variable 32adfd9 [Uwe L. Korn] Remove compression code duplication 6804e2a [Uwe L. Korn] Remove deprecated TODO 0d58ef2 [Uwe L. Korn] Pass DictionaryPage objects to the page writer 95fdccd [Uwe L. Korn] Get rid of temporarly created struct cce17f9 [Uwe L. Korn] Store serialised pages 2841494 [Uwe L. Korn] Remove unsigned/signed comparison 7ce98ca [Uwe L. Korn] Store pages in RAM until the end of the column 17533bf [Uwe L. Korn] Remove code duplication a3b4043 [Uwe L. Korn] PARQUET-666: Add support for writing dictionaries Change-Id: I72a4dfcd9f17159d338ed78f9bf20ff6106dcbf2
Author: Uwe L. Korn <[email protected]> Closes apache#146 from xhochy/parquet-694 and squashes the following commits: a21638b [Uwe L. Korn] PARQUET-694: Revert default data page size back to 1M Change-Id: I288ff6f8a630f607abc2792c0698ab6286c11136
This patch adds API to read and write metadata as well as improves the writer properties class. I am planning to add some comments to the code next. Meanwhile, any feedback will be helpful. Author: Deepak Majeti <[email protected]> Author: Uwe L. Korn <[email protected]> Closes apache#143 from majetideepak/metadata and squashes the following commits: 2b8a546 [Deepak Majeti] comments and more testing for metadata api 59147c0 [Deepak Majeti] fix memory leak 34e8975 [Deepak Majeti] review comments and format a977c6a [Deepak Majeti] added comment for file path d4f0e82 [Deepak Majeti] friendship between reader and writer. implements PARQUET-692 1047507 [Uwe L. Korn] Better dictionary encoding user experience 7f37f85 [Deepak Majeti] review edits 9dab591 [Deepak Majeti] minor rename a6b0646 [Deepak Majeti] added more dictionary fallback and enabled options to writer properties 3b9bad3 [Deepak Majeti] Metadata Reader writer Change-Id: Id73817e20f4beb4810294067d02bb8acb9b5f033
…quet-format Support logical types TIME_MICROS and TIMESTAMP_MICROS Also I think the current code was incorrect. Parquet reserved the LogicalTypes 8 and 10, but those were completely omitted types.h. So types with greater indices were mapped incorrectly. Author: Florian Scheibner <[email protected]> Closes apache#147 from flode/master and squashes the following commits: 6f81adc [Florian Scheibner] PARQUET-699: Update parquet.thrift from https://github.com/apache/parquet-format Change-Id: I04bb0f5fe6b1da5e4e8c5d61b8117e179cf10790
c3d8f60
to
354e05a
Compare
I have access to a Windows machine finally today so I should be able to get the build passing... |
To explain what's going on for people watching -- I ran into an issue where we need to use |
Oof. I'll resolve the rebase conflict with pyarrow/parquet.py right before we merge this to make things less painful for me |
I'm stuck on this linker issue:
I haven't been able to figure out what is different about the linker configuration yet that is causing this to fail. |
I plan to have this rebased and ready to merge today |
OK, I have fixed the MSVC issues. I'm going to rebase and see if we get a clean build |
23093d7
to
31d5a8a
Compare
This also changes the compiler options to use only SSE3. We can explore benchmarking to optionally enable SSE4 on supported systems for performance gains in the future. @nongli would it be OK if I remove the unused delta encodings (they depend on code that is not in Impala anymore, or never way)? Are these encodings used extensively, or primarily PLAIN / DICTIONARY encoding? Author: Wes McKinney <[email protected]> Closes apache#31 from wesm/PARQUET-438 and squashes the following commits: df9da64 [Wes McKinney] Add back (commented-out) ZigZig encode/decode impls 65c49bc [Wes McKinney] Fix comment 73c1dc1 [Wes McKinney] Fix cpplint errors post PARQUET-496 553ca0e [Wes McKinney] Fix cpplint errors 9528c83 [Wes McKinney] Updates RLE encoding routines and dependencies from upstream changes in Apache Impala (incubating). Add rle-test and bit-util-test modules unchanged from Impala. Change-Id: I6dd1ba027b316de945c56f70b390cf44dcf09f63
31d5a8a
to
ce2746e
Compare
pkgconfig for libparquet is currently broken so there are some issues with the Windows wheel build still. I will try to fix the pkgconfig later so we don't have to hack around it in the meantime. |
The build is passing except for the Meson issue in the GLib build ARROW-3186. Please give me a thumbs up so I can rebase and merge this cleanly to avoid any extra pain, thank you =) |
I think that the GLib build with Meson failure isn't a Meson problem. The following debug print may help us: which meson
head $(which meson)
$(head -n1 $(which meson) | sed -e 's/^#! *//') -c 'import os; import sys; import pprint; pprint.pprint([list(os.walk(path)) for path in sys.path])'
|
system. Add unit test label granularity options, ability to add component group targets like 'make parquet' that build libraries and tests Change-Id: I250fd10da3a9505952115b3cec18cd7cb5589bdb
1b81fed
to
fe5d435
Compare
Umm. It's strange... Because https://travis-ci.org/apache/arrow/jobs/425583565#L3060 uses Meson 0.47.2 and there are no error. Anyway, it's OK to me that we use 0.47.1 for now. |
Just rebased this again. In theory should have a fully green build now |
Just in case anyone gets an itchy trigger finger
This should be merged to apache master with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM
also verified that the build works locally.
Thanks. I'm just rebasing now; if you could hold off on merging any patches to master until I can FF-merge this that would be great, should be in 20-30 minutes |
fe5d435
to
9b4cd9c
Compare
Build is clean (see my appveyor builds: https://ci.appveyor.com/project/wesm/arrow/build/1.0.2049), merging |
Merges the apache/parquet-cpp code tree into the apache/arrow C++ codebase for the community's combined development process.