ARROW-3075: [C++] Merge parquet-cpp codebase into Arrow C++ codebase #2453

Merged on Sep 8, 2018 (319 commits)

Changes from 1 commit

Commits
dd58e39
PARQUET-607: Public writer header
xhochy May 9, 2016
227f66f
PARQUET-610: Print additional ColumnMetaData for each RowGroup
May 12, 2016
af71bad
PARQUET-614: Remove unneeded LZ4-related code
wesm May 13, 2016
e1cba98
PARQUET-616: WriteBatch should accept const arrays
xhochy May 13, 2016
3ff3b58
PARQUET-600: Add benchmarks for RLE-Level encoding
xhochy May 17, 2016
e1e0d28
PARQUET-620: Ensure metadata is written only once
xhochy May 18, 2016
cd8906c
PARQUET-619: Add OutputStream for local files
xhochy May 18, 2016
634132e
PARQUET-621: Add flag to indicate if decimalmetadata is set
xhochy May 18, 2016
01d31db
PARQUET-598: Test writing all primitive data types
xhochy Jun 3, 2016
466aa3d
PARQUET-625: Improve RLE read performance
xhochy Jun 3, 2016
2e42359
PARQUET-629: RowGroupSerializer should only close itself once
xhochy Jun 8, 2016
b1a816d
PARQUET-633: Add version to WriterProperties
xhochy Jun 17, 2016
ba3012e
PARQUET-634: Consistent private linking of dependencies
xhochy Jun 17, 2016
ce0cdeb
PARQUET-592: Support compressed writes
Jun 20, 2016
c21b041
PARQUET-636: Expose selection for different encodings
xhochy Jun 22, 2016
427d0a2
PARQUET-641: Instantiate stringstream only if needed in SerializedPag…
xhochy Jun 25, 2016
a8ad3a4
PARQUET-639: Do not export DCHECK in public headers
wesm Jun 25, 2016
8e98b5c
PARQUET-643: Add const modifier to schema pointer reference
xhochy Jun 27, 2016
05c4c9d
PARQUET-646: Add options to make developing with clang and 3rd-party …
wesm Jun 28, 2016
b3d9a14
PARQUET-489: Shared library symbol visibility
wesm Jul 1, 2016
b6abc6d
PARQUET-551:Handle compiler warnings due to disabled DCHECKs in relea…
Jul 6, 2016
1e73c0a
PARQUET-657: Do not define DISALLOW_COPY_AND_ASSIGN if already defined
wesm Jul 10, 2016
3b88b05
PARQUET-658: Add virtual destructor to ColumnReader
xhochy Jul 10, 2016
e21857a
PARQUET-659: Export extern templates for typed column reader/writer c…
wesm Jul 10, 2016
b283264
PARQUET-662: Compile ParquetException implementation and explicitly e…
wesm Jul 12, 2016
a0bfd9d
PARQUET-671: performance improvements for rle/bit-packed decoding
Aug 2, 2016
602b9be
PARQUET-666: Add support for writing dictionaries
xhochy Aug 28, 2016
4078b87
PARQUET-694: Revert default data page size back to 1M
xhochy Sep 1, 2016
18aa1ac
PARQUET-573: Create a public API for reading and writing file metadata
Sep 1, 2016
f128d51
PARQUET-699: Update parquet.thrift from https://github.com/apache/par…
flode Sep 1, 2016
3976997
PARQUET-701: Ensure that Close can be called multiple times
xhochy Sep 1, 2016
08ce126
PARQUET-700: Disable dictionary encoding for boolean columns
xhochy Sep 2, 2016
ddf0297
PARQUET-676: Fix incorrect MaxBufferSize for small bit widths
wesm Sep 3, 2016
78b1de3
PARQUET-681: Add tool to scan a parquet file
Sep 5, 2016
cc1fdec
PARQUET-704: Install scan-all.h
wesm Sep 5, 2016
66e7299
PARQUET-703: Validate that ColumnChunk metadata counts nulls in num_v…
wesm Sep 6, 2016
8268107
PARQUET-708: account for "worst case scenario" in MaxBufferSize for b…
xhochy Sep 6, 2016
ce843c8
PARQUET-710: Remove unneeded private member variables from RowGroupRe…
Sep 6, 2016
10ebdbd
PARQUET-711: Use metadata builders in parquet writer
Sep 9, 2016
adcabc4
PARQUET-687: C++: Switch to PLAIN encoding if dictionary grows too large
Sep 15, 2016
20c2cb2
PARQUET-718: Fix I/O of non-dictionary encoded pages
xhochy Sep 15, 2016
87ff504
PARQUET-719: Fix WriterBatch API to handle NULL values
Sep 15, 2016
9e46b37
PARQUET-689: C++: Compress DataPages eagerly
Sep 16, 2016
b8de4d0
PARQUET-720: Mark ScanAllValues as inline to prevent link error
xhochy Sep 18, 2016
13980f9
PARQUET-712: Add library to read into Arrow memory
xhochy Sep 18, 2016
9dfa948
PARQUET-724: Test more advanced properties setting
xhochy Sep 22, 2016
287cf01
PARQUET-728: Incorporate upstream Arrow API changes
wesm Sep 25, 2016
c9c7f4a
PARQUET-721: benchmarks for reading into Arrow
xhochy Sep 26, 2016
9aae125
PARQUET-731: API to return metadata size and Skip reading values
Oct 2, 2016
43c7154
PARQUET-593: Add API for writing Page statistics
Oct 3, 2016
53958b1
PARQUET-737: Use absolute namespace in macros
xhochy Oct 6, 2016
20d4a17
PARQUET-741: Always allocate fresh buffers while compressing
xhochy Oct 7, 2016
408d788
PARQUET-739: Don't use a static buffer for data accessed by multiple …
flode Oct 10, 2016
c81a26e
PARQUET-739: Don't use a static buffer for data accessed by multiple …
Oct 11, 2016
350e520
PARQUET-747: Better hide TypedRowGroupStatistics in public API
wesm Oct 11, 2016
73eb456
PARQUET-742: Add missing license headers
xhochy Oct 12, 2016
67ae6e5
PARQUET-752: Account for upstream Arrow API changes
wesm Oct 18, 2016
05a168c
PARQUET-759: Fix handling of columns of empty strings
xhochy Nov 1, 2016
9c527b7
PARQUET-760: Store correct encoding in fallback data pages
xhochy Nov 1, 2016
96a6dd4
PARQUET-745: TypedRowGroupStatistics fails to PlainDecode min and max…
flode Nov 1, 2016
4409707
PARQUET-763: C++: Expose ParquetFileReader through Arrow reader
xhochy Nov 4, 2016
676d61c
PARQUET-766: Expose ParquetFileReader through Arrow reader as const
xhochy Nov 6, 2016
6da9e8a
PARQUET-764: Support batches for PLAIN boolean writes that aren't a m…
xhochy Nov 6, 2016
163b2ac
PARQUET-762: C++: Use optimistic allocation instead of Arrow Builders
xhochy Nov 6, 2016
5abf985
PARQUET-775: Make TrackingAllocator thread-safe
xhochy Nov 9, 2016
246ec91
PARQUET-702: Add a writer + reader example with detailed comments
Nov 15, 2016
a414be7
PARQUET-778: Standardize the schema output to match the parquet-mr f…
trink Nov 18, 2016
e0f9806
PARQUET-769: Add support for Brotli compression
xhochy Nov 26, 2016
fbdba4f
PARQUET-779: Export TypedRowGroupStatistics in libparquet
Nov 26, 2016
7a5f274
PARQUET-780: WriterBatch API does not properly handle NULL values for…
Nov 26, 2016
38416c4
PARQUET-782: Support writing to Arrow sinks
xhochy Nov 27, 2016
8bbb5d7
PARQUET-789: Catch/translate ParquetExceptions in parquet::arrow::Fil…
wesm Dec 6, 2016
b801505
PARQUET-785: LIST schema conversion for Arrow lists
xhochy Dec 12, 2016
912d7af
PARQUET-797: Updates for ARROW-418 header API changes
wesm Dec 13, 2016
7752273
PARQUET-799: Fix bug in MemoryMapSource::CloseFile
Dec 19, 2016
b50e626
PARQUET-805: Read Int96 into Arrow Timestamp(ns)
xhochy Dec 20, 2016
7790183
PARQUET-812: Read BYTE_ARRAY with no logical type as arrow::BinaryArray
wesm Dec 21, 2016
ffb7f06
PARQUET-816: Workaround for incorrect column chunk metadata in parque…
wesm Dec 25, 2016
e348a6d
PARQUET-813: Build thirdparty dependencies using ExternalProject
xhochy Dec 29, 2016
deb5680
PARQUET-818: Refactoring to utilize common IO, buffer, memory managem…
wesm Dec 30, 2016
bfb24f6
PARQUET-819: Don't try to install no longer existing arrow/utils.h
xhochy Dec 30, 2016
d36dc11
PARQUET-807: Allow user to retain ownership of parquet::FileMetaData.
wesm Jan 5, 2017
1867981
PARQUET-809: Add SchemaDescriptor::Equals method
wesm Jan 5, 2017
6d354a1
PARQUET-827: Account for arrow::MemoryPool API change and fix bug in …
wesm Jan 7, 2017
ea9c4d3
PARQUET-828: Do not implicitly cast ParquetVersion enum to int
wesm Jan 10, 2017
2cbd797
PARQUET-829: Make use of ARROW-469
xhochy Jan 11, 2017
6312724
PARQUET-830: Add parquet::arrow::OpenFile with additional properties …
wesm Jan 12, 2017
97e69b4
PARQUET-820: Decoders should directly emit arrays with spacing for nu…
xhochy Jan 18, 2017
f3a3c69
PARQUET-833: C++: Provide API to write spaced arrays
xhochy Jan 23, 2017
4e52f61
PARQUET-837: Remove RandomAccessSource::Seek method which can be a so…
wesm Jan 23, 2017
18caeab
PARQUET-835: Read Arrow columns in parallel with thread pool
wesm Jan 23, 2017
c195976
PARQUET-836: Bugfix + testcase for column subsetting in arrow::FileRe…
wesm Jan 24, 2017
d0446e1
PARQUET-691: Write ColumnChunk metadata after chunk is complete
wesm Jan 24, 2017
38a6a98
PARQUET-841: Version number being incorrectly written for v1 files
wesm Jan 25, 2017
493603d
PARQUET-842: Do not set unnecessary fields in the parquet::SchemaElement
wesm Jan 25, 2017
5a21610
PARQUET-843: Impala is thrown off by a REPEATED root schema node
wesm Jan 26, 2017
c016b72
PARQUET-844: Schema, compression consolidation / flattening
wesm Jan 26, 2017
61b7b12
PARQUET-846: CpuInfo::Init() is not thread safe
Jan 26, 2017
270bda0
PARQUET-848: Build Thrift bits as part of main parquet_objlib component
wesm Jan 29, 2017
8fda954
PARQUET-834: Support I/O of arrow::ListArray
xhochy Feb 2, 2017
7f305a6
PARQUET-857: Flatten parquet/encodings directory, consolidate code
wesm Feb 3, 2017
7a65d43
PARQUET-862: Provide defaut cache size values
mvertes Feb 4, 2017
d53bb1a
PARQUET-866: API fixes for ARROW-33 patch
wesm Feb 6, 2017
3eda0d2
PARQUET-867: Support writing sliced Arrow arrays
xhochy Feb 9, 2017
ee62a34
PARQUET-874: Use default memory allocator from Arrow
xhochy Feb 11, 2017
6a9631a
PARQUET-793: Do not return incorrect statistics
Feb 12, 2017
72cb04b
PARQUET-877: Update Arrow Hash, update Version in metadata.
xhochy Feb 14, 2017
cff54fa
PARQUET-880: Prevent destructors from throwing
Feb 16, 2017
74db8d1
PARQUET-882: Improve Application Version parsing
Feb 21, 2017
cb8eab9
PARQUET-888: Add missing virtual dtor.
daedric Feb 21, 2017
0d2b951
PARQUET-889: Fix compilation when SSE is enabled
daedric Feb 22, 2017
220aa56
PARQUET-894: Fix compilation warning
daedric Feb 23, 2017
5ab15c6
PARQUET-895: Fix broken reading of nested repeated columns
mvertes Feb 23, 2017
48b70d0
PARQUET-894: Fix compilation warnings
daedric Feb 28, 2017
9ca26c7
PARQUET-903: Add option to set RPATH to origin
xhochy Mar 7, 2017
fb325c3
PARQUET-890: Support I/O of DATE columns in parquet_arrow
xhochy Mar 7, 2017
6060d83
PARQUET-908: Fix shared library visibility of some symbols in types.h
wesm Mar 9, 2017
b6b5aac
PARQUET-909: Reduce buffer allocations (mallocs) on critical path
Mar 17, 2017
aaf4ffd
PARQUET-897: Only use designated public headers from libarrow
xhochy Mar 20, 2017
9d27375
PARQUET-919: Account for ARROW-683 changes, but make no functional ch…
wesm Mar 23, 2017
f0d1456
PARQUET-923: Account for Time type changes in Arrow
wesm Mar 25, 2017
22d95d2
PARQUET-928: Support pkg-config
kou Mar 29, 2017
0f93007
PARQUET-933: Account for API changes in ARROW-728
wesm Mar 30, 2017
22279eb
PARQUET-934: Support multiarch on Debian
kou Mar 31, 2017
7bf8f04
PARQUET-935: Set version to shared library
kou Apr 1, 2017
81c2696
PARQUET-943: Fix build error on x86
kou Apr 2, 2017
4b53921
PARQUET-946: Add ReadRowGroup and num_row_group methods to arrow::Fil…
wesm Apr 6, 2017
d2c347d
PARQUET-947: Account for Arrow library consolidation in ARROW-795, AP…
wesm Apr 10, 2017
2ea0d60
PARQUET-918: FromParquetSchema API crashes on nested schemas
itaiin Apr 10, 2017
b3dedf4
PARQUET-953: Add static constructors to arrow::FileWriter for initial…
wesm Apr 13, 2017
ada05fa
PARQUET-918: Keep ordering in column indices when converting Parquet …
advancedxy Apr 14, 2017
c3dc8a1
PARQUET-898: Upgrade to googletest 1.8.0, move back to Xcode 6.4 in T…
wesm Apr 16, 2017
4ea7124
PARQUET-508: Add ParquetFilePrinter
Apr 21, 2017
dac6505
PARQUET-958: [C++] Print Parquet metadata in JSON format
Apr 25, 2017
61f3b1d
PARQUET-915: Support additional Arrow date/time types and metadata
wesm Apr 25, 2017
6a27975
PARQUET-963: Return NotImplemented when attempting to read a struct f…
wesm Apr 25, 2017
35d09d4
PARQUET-595: API for KeyValue metadata
wesm Apr 29, 2017
5e60bfc
PARQUET-965: Add FIXED_LEN_BYTE_ARRAY read and write support in parq…
advancedxy May 2, 2017
4e96056
PARQUET-679: Local Windows build and Appveyor support
maxhora May 2, 2017
f444dfe
PARQUET-936: Return Invalid Status if chunk_size <= 0 when WriteTable…
advancedxy May 4, 2017
7242b1c
PARQUET-914: Rewording exception message in column writer.
advancedxy May 5, 2017
8255ccc
PARQUET-679: [C++] Resolve unit tests issues on Windows; Run unit tes…
maxhora May 9, 2017
bd02cca
PARQUET-930: Add timestamp[us] to schema test
xhochy May 12, 2017
8bc6ec5
PARQUET-679: Fix debug asserts in tests (msvc/debug build)
May 12, 2017
2fab6a2
PARQUET-984: Add abi and so version to pkg-config
xhochy May 15, 2017
0e4c4a1
PARQUET-992: Do not transitively include zlib.h in public API
wesm May 17, 2017
7638af1
PARQUET-995: Use sizeof(Int96) instead of Int96Type
wesm May 18, 2017
a821f09
PARQUET-997: Fix override compiler warnings
cpcloud May 22, 2017
0e1f467
PARQUET-978: [C++] Minimizing footer reads for small(ish) metadata
May 27, 2017
7d476b2
PARQUET-991: Resolve msvc warnings; Appveyor treats msvc warnings as …
maxhora May 29, 2017
a8d8d22
PARQUET-967: Combine libparquet, libparquet_arrow libraries
wesm May 31, 2017
5aa2339
PARQUET-999: Improve MSVC build - Enable PARQUET_BUILD_BENCHMARKS
rip-nsk May 31, 2017
5f42afa
PARQUET-1008: [C++] TypedColumnReader::ReadBatch method updated to ac…
maxhora Jun 8, 2017
9dcb12d
PARQUET-1003: Modify DEFAULT_CREATED_BY value for every new release v…
Jun 11, 2017
94e351c
PARQUET-1029: [C++] Some extern template symbols not being exported i…
wesm Jun 14, 2017
8f7282b
PARQUET-1007: Update parquet.thrift
Jun 17, 2017
13f3fde
PARQUET-991: Fix msvc warning C4100: '<id>': unreferenced formal para…
rip-nsk Jun 19, 2017
514b74c
PARQUET-1033: Improve documentation about WriteBatchSpaced
xhochy Jun 19, 2017
2d98407
PARQUET-911: [C++] Support nested structs in parquet_arrow
itaiin Jun 22, 2017
cc46aff
PARQUET-1038: Key value metadata should be nullptr if not set
cpcloud Jun 23, 2017
1fdd816
PARQUET-1042: Fix Compilation breaks on GCC 4.8
xhochy Jun 23, 2017
61da26c
PARQUET-1041: Support Arrow's NullArray
xhochy Jun 23, 2017
0a32c6b
PARQUET-1043: Raise minimum CMake version to 3.2, delete cruft.
wesm Jun 23, 2017
b0414cc
PARQUET-1044: Use compression libraries from Apache Arrow
wesm Jun 25, 2017
81db371
PARQUET-858: Flatten column directory, minor code consolidation
wesm Jun 26, 2017
40527c3
PARQUET-1045: Remove code that's being moved to Apache Arrow in ARROW…
wesm Jun 27, 2017
5374737
PARQUET-1040: Add missing writer methods
xhochy Jul 3, 2017
3e34c37
PARQUET-1048: Apache Arrow static transitive dependencies
Jul 10, 2017
658c7fb
PARQUET-1053: Fix unused result warnings due to unchecked Statuses
cpcloud Jul 11, 2017
68315b8
PARQUET-1054: Fixes for Arrow API changes in ARROW-1199
wesm Jul 11, 2017
2395770
PARQUET-1035: Write Int96 from Arrow timestamp(ns)
c-nichols Jul 16, 2017
6c97fe6
PARQUET-1068: Modify .clang-format to use straight Google format with…
wesm Jul 31, 2017
facce86
PARQUET-1072: Build with ARROW_NO_DEPRECATED_API in Travis CI
wesm Aug 6, 2017
82d516e
PARQUET-1078: Add option to coerce Arrow timestamps to a particular unit
wesm Aug 7, 2017
7fd1519
PARQUET-1079: Remove Arrow offset shift unneeded after ARROW-1335
wesm Aug 8, 2017
38a4e9f
PARQUET-1083: Factor logic in parquet-scan.cc into a library function…
wesm Aug 30, 2017
eadc62e
PARQUET-1085: [C++] Use namespaced macros from arrow/util/macros.h, w…
wesm Sep 3, 2017
5f54be7
PARQUET-1087: Add ScanContents function to arrow::FileReader that cat…
wesm Sep 5, 2017
4845e76
PARQUET-1088: Remove parquet_version.h from version control since it …
Sep 6, 2017
751eb00
PARQUET-1090: Add max row group length option, fix int32 overflow
wesm Sep 6, 2017
200774e
PARQUET-1093: Improve Arrow level generation error message
xhochy Sep 11, 2017
dcf96ed
PARQUET-1002: Compute statistics based on Sort Order
Sep 11, 2017
92e7dae
PARQUET-1098: Install util/comparison.h
wesm Sep 11, 2017
d29d4a9
PARQUET-1104: Upgrade to Apache Arrow 0.7.0 RC0
wesm Sep 13, 2017
75cf66a
PARQUET-929: Handle arrow::DictionaryArray when writing Arrow data
xhochy Sep 17, 2017
cd1c622
PARQUET-1094: Add benchmark for boolean Arrow column I/O
xhochy Sep 17, 2017
d7003c0
PARQUET-1100: Introduce RecordReader interface to better support nest…
wesm Sep 20, 2017
468e737
PARQUET-1037: allow arbitrary size row-groups
Sep 21, 2017
9809754
PARQUET-1108: Fix Int96 comparators
Sep 21, 2017
ac1a5d3
PARQUET-1114 Apply changes for ARROW-1601 ARROW-1611, change shared l…
renesugar Sep 27, 2017
f5c7aee
PARQUET-1123: [C++] Update parquet-cpp to use Arrow's AssertArraysEqual
cpcloud Oct 7, 2017
f36231d
PARQUET-1121: Handle Dictionary[Null] arrays on writing Arrow tables
xhochy Oct 7, 2017
f1dabe9
PARQUET-1138: Fix Arrow 0.7.1 build
wesm Oct 16, 2017
dcea0ab
PARQUET-1150: Hide statically linked boost symbols
xhochy Oct 30, 2017
da29595
PARQUET-1095: [C++] Read and write Arrow decimal values
cpcloud Nov 20, 2017
d619050
PARQUET-1164: [C++] Account for API changes in ARROW-1808
wesm Nov 22, 2017
1124a79
PARQUET-970: Add Lz4 and Zstd compression codecs
advancedxy Nov 23, 2017
adc569a
PARQUET-1167: [C++] FieldToNode function should return a status when …
cpcloud Dec 3, 2017
4acd139
PARQUET-1175: Fix arrow::ArrayData method rename from ShallowCopy to …
wesm Dec 11, 2017
2b37b1f
PARQUET-1165: Pin clang-format version to 4.0
xhochy Dec 11, 2017
5324ee9
PARQUET-859: Flatten parquet/file directory, consolidate file reader,…
wesm Dec 12, 2017
bcc1f88
PARQUET-1177: Add PARQUET_BUILD_WARNING_LEVEL option and more rigorou…
wesm Dec 13, 2017
46e1d4e
PARQUET-1092: Support writing chunked arrow::Table columns
wesm Dec 17, 2017
7dbe374
PARQUET-1180: Fix behaviour of num_children element of primitive nodes
Posnet Dec 29, 2017
4538a2e
PARQUET-1086: [C++] Remove usage of arrow/util/compiler-util.h
xhochy Jan 16, 2018
d257a88
PARQUET-1193: [CPP] Implement ColumnOrder to support min_value and ma…
Jan 24, 2018
39c0b7b
PARQUET-1179: Upgrade to Thrift 0.11, use std::shared_ptr instead of …
xhochy Jan 28, 2018
5ebb78c
PARQUET-1200: Support reading a single Arrow column from a Parquet file
xhochy Feb 13, 2018
7f1b0c0
PARQUET-1226: Fixes for CHECKIN compiler warning level with clang 5.0
wesm Feb 20, 2018
cae28c0
PARQUET-1218: More informative error message on too short pages
xhochy Feb 21, 2018
96a0265
PARQUET-1233: Enable option to switch between stl classes and boost c…
Feb 21, 2018
15e8661
PARQUET-1225: NaN values may lead to incorrect filtering under certai…
Feb 24, 2018
b3f3c09
PARQUET-1245: Fix creating Arrow table with duplicate column names
pitrou Mar 22, 2018
102d951
PARQUET-1166: Add GetRecordBatchReader in parquet/arrow/reader
advancedxy Mar 23, 2018
f28563d
PARQUET-1071: Check that arrow::FileWriter::Close() is idempotent
pitrou Mar 28, 2018
de865da
PARQUET-1255: Fix error message when PARQUET_TEST_DATA isn't defined
pitrou Mar 28, 2018
b73771b
PARQUET-1265: Segfault on static ApplicationVersion initialization
Apr 12, 2018
828783d
PARQUET-1267: [C++] replace "unsafe" std::equal by std::memcmp
rip-nsk Apr 17, 2018
a251714
PARQUET-1268: Fix conversion of null list Arrow arrays
pitrou Apr 17, 2018
9d99820
PARQUET-1273: Properly write dictionary values when writing in chunks
Apr 18, 2018
42f287c
PARQUET-1274: Prevent segfault that was occurring when writing a nano…
Apr 18, 2018
2d0a904
PARQUET-1272: Return correct row count for nested columns in ScanFile…
xhochy Apr 18, 2018
fa53ea7
PARQUET-1279: [C++] Adding use of ASSERT_NO_FATAL_FAILURE in unit tes…
Apr 23, 2018
b9e80c8
PARQUET-1283: [C++] Remove trailing space for string and int96 statis…
May 1, 2018
076fbc6
PARQUET-979: Limit size of min, max or disable stats for long binary …
May 21, 2018
f20fe7e
PARQUET-1307: Fix memory-test for newer Arrow
pitrou May 26, 2018
129d845
PARQUET-1315: ColumnChunkMetaData.has_dictionary_page() should return…
May 31, 2018
f38245b
PARQUET-1340: Fix Travis Ci valgrind errors related to std::random_de…
Jun 28, 2018
ea8798d
PARQUET-1334: [C++] memory_map parameter seems missleading in parquet…
hochphil Jun 28, 2018
08ca177
PARQUET-1333: [C++] Reading of files with dictionary size 0 fails on …
hochphil Jun 28, 2018
079ae70
PARQUET-1346: [C++] Protect against empty Arrow arrays with null values
pitrou Jul 12, 2018
630cf0a
PARQUET-1350: [C++] Use abstract ResizableBuffer instead of concrete …
pitrou Jul 23, 2018
fee8d70
PARQUET-1323: Fix compiler warnings on clang-6
wesm Jul 24, 2018
bd5243e
PARQUET-1358: index_page_offset should be unset as it is not supported
xhochy Jul 26, 2018
0ccf832
PARQUET-1348: Add ability to write FileMetaData in arrow FileWriter
Jul 28, 2018
673ccfa
PARQUET-1360: Use conforming API style, variable names in WriteFileMe…
wesm Jul 29, 2018
0e0f838
PARQUET-1227: Thrift crypto metadata structures
Jul 31, 2018
40b21b3
PARQUET-1357: FormatStatValue truncates binary statistics on zero cha…
xhochy Aug 1, 2018
b6ad261
PARQUET-1366: [C++] Streamline use of Arrow's bit-util.h APIs
pitrou Aug 1, 2018
ed7242e
PARQUET-1301: [C++] Crypto package in parquet-cpp
Aug 4, 2018
bdeed71
PARQUET-1332: Add bloom filter for parquet
Aug 15, 2018
e26afc2
PARQUET-1378: Allow RowGroups with zero rows to be written
Aug 15, 2018
72795ef
PARQUET-1308: [C++] Use Arrow thread pool, not Arrow ParallelFor, fix…
wesm Aug 17, 2018
d146452
PARQUET-1382: [C++] Prepare for arrow::test namespace removal
pitrou Aug 17, 2018
aa166ed
PARQUET-1384: fix clang build error for bloom_filter-test.cc
Aug 17, 2018
41ae86d
PARQUET-1256: Add --print-key-value-metadata option to parquet_reader…
JacekPliszka Aug 17, 2018
cdf2e3f
PARQUET-1276: [C++] Reduce the amount of memory used for writing null…
pitrou Aug 20, 2018
1dffe22
PARQUET-1392: Read multiple RowGroups at once into an Arrow table
xhochy Aug 23, 2018
1463276
PARQUET-1372: Add an API to allow writing RowGroups based on size
Aug 25, 2018
9b4cd9c
ARROW-3075: [C++] Incorporate parquet-cpp codebase into Arrow C++ build
wesm Sep 6, 2018
PARQUET-671: performance improvements for rle/bit-packed decoding
Testing on my own data shows an order-of-magnitude improvement.

I separated the commits for clarity; each one gives an incremental improvement.

The motivation for the last commit (allowing NULL for def_levels/rep_levels) is a workaround for Spark, which doesn't seem to be able to generate columns without def_levels, even when a column is specified as "not nullable".

Author: Eric Daniel

Closes #140 from edani/decode-perf and squashes the following commits:

eec0855 [Eric Daniel] Ran "make format"
0568de6 [Eric Daniel] Only check num. of repetition levels when def_levels is set
5f54e1c [Eric Daniel] Added benchmarks for dictionary decoding
087945b [Eric Daniel] Style fixes from code review
906be73 [Eric Daniel] Allow the reader to skip rep/def decoding
04b7391 [Eric Daniel] Fast bit unpacking
bda5d84 [Eric Daniel] The bit reader can decode in batches
3f10378 [Eric Daniel] Improve decoding of repeated values in the dict encoding

Change-Id: I45421fc2ada5d06863ddd765470f79b45ec4991a
Eric Daniel authored and wesm committed Aug 2, 2016
commit a0bfd9d83f64b0842bf00828b4f4731598caf2b7
5 changes: 1 addition & 4 deletions cpp/src/parquet/column/levels.cc
@@ -133,10 +133,7 @@ int LevelDecoder::Decode(int batch_size, int16_t* levels) {
   if (encoding_ == Encoding::RLE) {
     num_decoded = rle_decoder_->GetBatch(levels, num_values);
   } else {
-    for (int i = 0; i < num_values; ++i) {
-      if (!bit_packed_decoder_->GetValue(bit_width_, levels + i)) { break; }
-      ++num_decoded;
-    }
+    num_decoded = bit_packed_decoder_->GetBatch(bit_width_, levels, num_values);
   }
   num_values_remaining_ -= num_decoded;
   return num_decoded;
10 changes: 7 additions & 3 deletions cpp/src/parquet/column/reader.h
@@ -115,6 +115,10 @@ class PARQUET_EXPORT TypedColumnReader : public ColumnReader {
   // may be less than the number of repetition and definition levels. With
   // nested data this is almost certainly true.
   //
+  // Set def_levels or rep_levels to nullptr if you want to skip reading them.
+  // This is only safe if you know through some other source that there are no
+  // undefined values.
+  //
   // To fully exhaust a row group, you must read batches until the number of
   // values read reaches the number of stored values according to the metadata.
   //
@@ -171,7 +175,7 @@ inline int64_t TypedColumnReader<DType>::ReadBatch(int batch_size, int16_t* def_
   int64_t values_to_read = 0;
 
   // If the field is required and non-repeated, there are no definition levels
-  if (descr_->max_definition_level() > 0) {
+  if (descr_->max_definition_level() > 0 && def_levels) {
     num_def_levels = ReadDefinitionLevels(batch_size, def_levels);
     // TODO(wesm): this tallying of values-to-decode can be performed with better
     // cache-efficiency if fused with the level decoding.
@@ -184,9 +188,9 @@ inline int64_t TypedColumnReader<DType>::ReadBatch(int batch_size, int16_t* def_
   }
 
   // Not present for non-repeated fields
-  if (descr_->max_repetition_level() > 0) {
+  if (descr_->max_repetition_level() > 0 && rep_levels) {
     num_rep_levels = ReadRepetitionLevels(batch_size, rep_levels);
-    if (num_def_levels != num_rep_levels) {
+    if (def_levels && num_def_levels != num_rep_levels) {
       throw ParquetException("Number of decoded rep / def levels did not match");
     }
   }
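The reader.h change lets a caller pass nullptr for the level arrays to skip level decoding. A minimal sketch of that contract follows, assuming a toy column where a definition level equal to the maximum means "value present". ToyColumn and this free-standing ReadBatch are hypothetical illustrations, not the parquet-cpp classes, and the sketch ignores repetition levels.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for a column chunk: one definition level per slot
// and the non-null values stored densely.
struct ToyColumn {
  ToyColumn(std::vector<int16_t> levels, std::vector<int64_t> vals, int16_t max_def)
      : def_levels(std::move(levels)),
        values(std::move(vals)),
        max_definition_level(max_def) {}

  std::vector<int16_t> def_levels;
  std::vector<int64_t> values;
  int16_t max_definition_level;  // 0 would mean the field is required
  size_t level_pos = 0;
  size_t value_pos = 0;
};

// Returns the number of values written to `out`.
// Passing def_levels == nullptr skips level decoding and assumes every slot
// holds a value, which is only correct if the caller knows there are no nulls.
int64_t ReadBatch(ToyColumn& col, int batch_size, int16_t* def_levels, int64_t* out) {
  assert(batch_size >= 0 && out != nullptr);
  int64_t values_to_read = batch_size;

  if (col.max_definition_level > 0 && def_levels != nullptr) {
    const int64_t levels_left =
        static_cast<int64_t>(col.def_levels.size() - col.level_pos);
    const int64_t num_levels = std::min<int64_t>(batch_size, levels_left);
    values_to_read = 0;
    for (int64_t i = 0; i < num_levels; ++i) {
      def_levels[i] = col.def_levels[col.level_pos++];
      if (def_levels[i] == col.max_definition_level) { ++values_to_read; }
    }
  }

  // Never read past the values that are actually stored.
  const int64_t values_left =
      static_cast<int64_t>(col.values.size() - col.value_pos);
  values_to_read = std::min(values_to_read, values_left);
  for (int64_t i = 0; i < values_to_read; ++i) {
    out[i] = col.values[col.value_pos++];
  }
  return values_to_read;
}
```

With levels {1, 0, 1, 1} and three stored values, a call with a level buffer returns 3 and fills the levels; a call with nullptr on a column without nulls returns the full batch without touching any levels.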
20 changes: 9 additions & 11 deletions cpp/src/parquet/encodings/dictionary-encoding.h
@@ -64,22 +64,15 @@ class DictionaryDecoder : public Decoder<Type> {
 
   virtual int Decode(T* buffer, int max_values) {
     max_values = std::min(max_values, num_values_);
-    for (int i = 0; i < max_values; ++i) {
-      buffer[i] = dictionary_[index()];
-    }
+    int decoded_values = idx_decoder_.GetBatchWithDict(dictionary_, buffer, max_values);
+    if (decoded_values != max_values) { ParquetException::EofException(); }
+    num_values_ -= max_values;
     return max_values;
   }
 
  private:
   using Decoder<Type>::num_values_;
 
-  int index() {
-    int idx = 0;
-    if (!idx_decoder_.Get(&idx)) ParquetException::EofException();
-    --num_values_;
-    return idx;
-  }
-
   // Only one is set.
   Vector<T> dictionary_;
 
@@ -177,7 +170,12 @@ class DictEncoderBase {
   /// Returns a conservative estimate of the number of bytes needed to encode the buffered
   /// indices. Used to size the buffer passed to WriteIndices().
   int EstimatedDataEncodedSize() {
-    return 1 + RleEncoder::MaxBufferSize(bit_width(), buffered_indices_.size());
+    // Note: because of the way RleEncoder::CheckBufferFull() is called, we have to
+    // reserve
+    // an extra "RleEncoder::MinBufferSize" bytes. These extra bytes won't be used
+    // but not reserving them would cause the encoder to fail.
+    return 1 + RleEncoder::MaxBufferSize(bit_width(), buffered_indices_.size()) +
+           RleEncoder::MinBufferSize(bit_width());
   }
 
   /// The minimum bit width required to encode the currently buffered indices.
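The Decode change replaces a per-value index lookup with a single GetBatchWithDict call. The diff does not show that function's body, so the following is only a sketch of why batching can help with repeated values: a repeated run needs one dictionary lookup followed by a fill. IndexRun and DecodeWithDict are hypothetical names, and the runs are assumed to be already parsed from the encoded index stream.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, already-parsed form of an RLE / bit-packed index stream.
struct IndexRun {
  int repeat_count;               // > 0 for a repeated run, 0 for a literal run
  int32_t repeated_index;         // dictionary index used by a repeated run
  std::vector<int32_t> literals;  // dictionary indices of a literal run
};

// Writes at most max_values dictionary values to `out` and returns the count.
template <typename T>
int DecodeWithDict(const std::vector<T>& dictionary, const std::vector<IndexRun>& runs,
                   T* out, int max_values) {
  int written = 0;
  for (const IndexRun& run : runs) {
    if (written == max_values) { break; }
    const int room = max_values - written;
    if (run.repeat_count > 0) {
      // Repeated run: look the value up once, then fill.
      assert(run.repeated_index >= 0 &&
             static_cast<size_t>(run.repeated_index) < dictionary.size());
      const int n = std::min(run.repeat_count, room);
      std::fill(out + written, out + written + n, dictionary[run.repeated_index]);
      written += n;
    } else {
      // Literal run: one lookup per index.
      const int n = std::min(static_cast<int>(run.literals.size()), room);
      for (int i = 0; i < n; ++i) {
        out[written + i] = dictionary[run.literals[i]];
      }
      written += n;
    }
  }
  return written;
}
```

A caller can compare the return value with the requested count, as the new Decode does before raising an EOF exception.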
72 changes: 71 additions & 1 deletion cpp/src/parquet/encodings/encoding-benchmark.cc
@@ -17,12 +17,23 @@
 
 #include "benchmark/benchmark.h"
 
-#include "parquet/encodings/plain-encoding.h"
+#include "parquet/encodings/dictionary-encoding.h"
+#include "parquet/file/reader-internal.h"
+#include "parquet/util/mem-pool.h"
 
 namespace parquet {
 
+using format::ColumnChunk;
+using schema::PrimitiveNode;
+
 namespace benchmark {
 
+std::shared_ptr<ColumnDescriptor> Int64Schema(Repetition::type repetition) {
+  auto node = PrimitiveNode::Make("int64", repetition, Type::INT64);
+  return std::make_shared<ColumnDescriptor>(
+      node, repetition != Repetition::REQUIRED, repetition == Repetition::REPEATED);
+}
+
 static void BM_PlainEncodingBoolean(::benchmark::State& state) {
   std::vector<bool> values(state.range_x(), 64);
   PlainEncoder<BooleanType> encoder(nullptr);
@@ -86,6 +97,65 @@ static void BM_PlainDecodingInt64(::benchmark::State& state) {
 
 BENCHMARK(BM_PlainDecodingInt64)->Range(1024, 65536);
 
+template <typename Type>
+static void DecodeDict(
+    std::vector<typename Type::c_type>& values, ::benchmark::State& state) {
+  typedef typename Type::c_type T;
+  int num_values = values.size();
+
+  MemPool pool;
+  MemoryAllocator* allocator = default_allocator();
+  std::shared_ptr<ColumnDescriptor> descr = Int64Schema(Repetition::REQUIRED);
+  std::shared_ptr<OwnedMutableBuffer> dict_buffer =
+      std::make_shared<OwnedMutableBuffer>();
+  auto indices = std::make_shared<OwnedMutableBuffer>();
+
+  DictEncoder<T> encoder(&pool, allocator, descr->type_length());
+  for (int i = 0; i < num_values; ++i) {
+    encoder.Put(values[i]);
+  }
+
+  dict_buffer->Resize(encoder.dict_encoded_size());
+  encoder.WriteDict(dict_buffer->mutable_data());
+  indices->Resize(encoder.EstimatedDataEncodedSize());
+  int actual_bytes = encoder.WriteIndices(indices->mutable_data(), indices->size());
+  indices->Resize(actual_bytes);
+
+  while (state.KeepRunning()) {
+    PlainDecoder<Type> dict_decoder(descr.get());
+    dict_decoder.SetData(encoder.num_entries(), dict_buffer->data(), dict_buffer->size());
+    DictionaryDecoder<Type> decoder(descr.get());
+    decoder.SetDict(&dict_decoder);
+    decoder.SetData(num_values, indices->data(), indices->size());
+    decoder.Decode(values.data(), num_values);
+  }
+
+  state.SetBytesProcessed(state.iterations() * state.range_x() * sizeof(T));
+}
+
+static void BM_DictDecodingInt64_repeats(::benchmark::State& state) {
+  typedef Int64Type Type;
+  typedef typename Type::c_type T;
+
+  std::vector<T> values(state.range_x(), 64);
+  DecodeDict<Type>(values, state);
+}
+
+BENCHMARK(BM_DictDecodingInt64_repeats)->Range(1024, 65536);
+
+static void BM_DictDecodingInt64_literals(::benchmark::State& state) {
+  typedef Int64Type Type;
+  typedef typename Type::c_type T;
+
+  std::vector<T> values(state.range_x());
+  for (size_t i = 0; i < values.size(); ++i) {
+    values[i] = i;
+  }
+  DecodeDict<Type>(values, state);
+}
+
+BENCHMARK(BM_DictDecodingInt64_literals)->Range(1024, 65536);
+
 }  // namespace benchmark
 
 }  // namespace parquet
6 changes: 2 additions & 4 deletions cpp/src/parquet/encodings/plain-encoding.h
@@ -142,10 +142,8 @@ class PlainDecoder<BooleanType> : public Decoder<BooleanType> {
 
   virtual int Decode(bool* buffer, int max_values) {
     max_values = std::min(max_values, num_values_);
-    bool val;
-    for (int i = 0; i < max_values; ++i) {
-      if (!bit_reader_.GetValue(1, &val)) { ParquetException::EofException(); }
-      buffer[i] = val;
+    if (bit_reader_.GetBatch(1, buffer, max_values) != max_values) {
+      ParquetException::EofException();
     }
     num_values_ -= max_values;
     return max_values;
4 changes: 4 additions & 0 deletions cpp/src/parquet/util/bit-stream-utils.h
@@ -122,6 +122,10 @@ class BitReader {
   template <typename T>
   bool GetValue(int num_bits, T* v);
 
+  /// Get a number of values from the buffer. Return the number of values actually read.
+  template <typename T>
+  int GetBatch(int num_bits, T* v, int batch_size);
+
   /// Reads a 'num_bytes'-sized value from the buffer and stores it in 'v'. T
   /// needs to be a little-endian native type and big enough to store
   /// 'num_bytes'. The value is assumed to be byte-aligned so the stream will
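The new GetBatch declaration promises to return the number of values actually read, which can be smaller than the requested batch when the buffer runs out. A slow but simple reference for that contract is sketched below, assuming values are packed back to back starting at the least significant bit of each byte. GetBatchSimple is a hypothetical name; it is stateless and always reads from the start of the buffer, unlike BitReader.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Reads up to batch_size values of num_bits each, bit by bit.
// Returns how many values were produced; this is clamped to the number of
// complete values that fit in the buffer.
int GetBatchSimple(const uint8_t* buffer, size_t num_bytes, int num_bits,
                   uint32_t* out, int batch_size) {
  assert(buffer != nullptr && out != nullptr);
  assert(num_bits > 0 && num_bits <= 32);
  assert(batch_size >= 0);

  const uint64_t total_bits = static_cast<uint64_t>(num_bytes) * 8;
  const uint64_t available = total_bits / static_cast<uint64_t>(num_bits);
  const int n = available < static_cast<uint64_t>(batch_size)
                    ? static_cast<int>(available)
                    : batch_size;

  uint64_t bit_pos = 0;
  for (int i = 0; i < n; ++i) {
    uint32_t value = 0;
    for (int b = 0; b < num_bits; ++b, ++bit_pos) {
      const uint32_t bit = (buffer[bit_pos / 8] >> (bit_pos % 8)) & 1u;
      value |= bit << b;
    }
    out[i] = value;
  }
  return n;
}
```

For the three bytes 0x88, 0xC6, 0xFA at a bit width of 3, this yields the values 0 through 7; asking for ten values returns 8.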
102 changes: 84 additions & 18 deletions cpp/src/parquet/util/bit-stream-utils.inline.h
@@ -20,7 +20,10 @@
 #ifndef PARQUET_UTIL_BIT_STREAM_UTILS_INLINE_H
 #define PARQUET_UTIL_BIT_STREAM_UTILS_INLINE_H
 
+#include <algorithm>
+
 #include "parquet/util/bit-stream-utils.h"
+#include "parquet/util/bpacking.h"
 
 namespace parquet {
 
@@ -85,35 +88,98 @@ inline bool BitWriter::PutVlqInt(uint32_t v) {
   return result;
 }
 
+template <typename T>
+inline void GetValue_(int num_bits, T* v, int max_bytes, const uint8_t* buffer,
+    int* bit_offset, int* byte_offset, uint64_t* buffered_values) {
+  *v = BitUtil::TrailingBits(*buffered_values, *bit_offset + num_bits) >> *bit_offset;
+
+  *bit_offset += num_bits;
+  if (*bit_offset >= 64) {
+    *byte_offset += 8;
+    *bit_offset -= 64;
+
+    int bytes_remaining = max_bytes - *byte_offset;
+    if (LIKELY(bytes_remaining >= 8)) {
+      memcpy(buffered_values, buffer + *byte_offset, 8);
+    } else {
+      memcpy(buffered_values, buffer + *byte_offset, bytes_remaining);
+    }
+
+    // Read bits of v that crossed into new buffered_values_
+    *v |= BitUtil::TrailingBits(*buffered_values, *bit_offset)
+          << (num_bits - *bit_offset);
+    DCHECK_LE(*bit_offset, 64);
+  }
+}
+
 template <typename T>
 inline bool BitReader::GetValue(int num_bits, T* v) {
+  return GetBatch(num_bits, v, 1) == 1;
+}
+
+template <typename T>
+inline int BitReader::GetBatch(int num_bits, T* v, int batch_size) {
   DCHECK(buffer_ != NULL);
   // TODO: revisit this limit if necessary
   DCHECK_LE(num_bits, 32);
   DCHECK_LE(num_bits, static_cast<int>(sizeof(T) * 8));
 
-  if (UNLIKELY(byte_offset_ * 8 + bit_offset_ + num_bits > max_bytes_ * 8)) return false;
-
-  *v = BitUtil::TrailingBits(buffered_values_, bit_offset_ + num_bits) >> bit_offset_;
-
-  bit_offset_ += num_bits;
-  if (bit_offset_ >= 64) {
-    byte_offset_ += 8;
-    bit_offset_ -= 64;
+  int bit_offset = bit_offset_;
+  int byte_offset = byte_offset_;
+  uint64_t buffered_values = buffered_values_;
+  int max_bytes = max_bytes_;
+  const uint8_t* buffer = buffer_;
+
+  uint64_t needed_bits = num_bits * batch_size;
+  uint64_t remaining_bits = (max_bytes - byte_offset) * 8 - bit_offset;
+  if (remaining_bits < needed_bits) { batch_size = remaining_bits / num_bits; }
+
+  int i = 0;
+  if (UNLIKELY(bit_offset != 0)) {
+    for (; i < batch_size && bit_offset != 0; ++i) {
+      GetValue_(num_bits, &v[i], max_bytes, buffer, &bit_offset, &byte_offset,
+          &buffered_values);
+    }
+  }
 
-    int bytes_remaining = max_bytes_ - byte_offset_;
-    if (LIKELY(bytes_remaining >= 8)) {
-      memcpy(&buffered_values_, buffer_ + byte_offset_, 8);
-    } else {
-      memcpy(&buffered_values_, buffer_ + byte_offset_, bytes_remaining);
+  if (sizeof(T) == 4) {
+    int num_unpacked = unpack32(reinterpret_cast<const uint32_t*>(buffer + byte_offset),
+        reinterpret_cast<uint32_t*>(v + i), batch_size - i, num_bits);
+    i += num_unpacked;
+    byte_offset += num_unpacked * num_bits / 8;
+  } else {
+    const int buffer_size = 1024;
+    static uint32_t unpack_buffer[buffer_size];
+    while (i < batch_size) {
+      int unpack_size = std::min(buffer_size, batch_size - i);
+      int num_unpacked = unpack32(reinterpret_cast<const uint32_t*>(buffer + byte_offset),
+          unpack_buffer, unpack_size, num_bits);
+      if (num_unpacked == 0) { break; }
+      for (int k = 0; k < num_unpacked; ++k) {
+        v[i + k] = unpack_buffer[k];
+      }
+      i += num_unpacked;
+      byte_offset += num_unpacked * num_bits / 8;
     }
+  }
 
-    // Read bits of v that crossed into new buffered_values_
-    *v |= BitUtil::TrailingBits(buffered_values_, bit_offset_)
-          << (num_bits - bit_offset_);
+  int bytes_remaining = max_bytes - byte_offset;
+  if (bytes_remaining >= 8) {
+    memcpy(&buffered_values, buffer + byte_offset, 8);
+  } else {
+    memcpy(&buffered_values, buffer + byte_offset, bytes_remaining);
   }
-  DCHECK_LE(bit_offset_, 64);
-  return true;
+
+  for (; i < batch_size; ++i) {
+    GetValue_(
+        num_bits, &v[i], max_bytes, buffer, &bit_offset, &byte_offset, &buffered_values);
+  }
+
+  bit_offset_ = bit_offset;
+  byte_offset_ = byte_offset;
+  buffered_values_ = buffered_values;
+
+  return batch_size;
 }
 
 template <typename T>
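In GetBatch, the branch for element types that are not 4 bytes wide unpacks into a function-level static scratch buffer and then widens each value into the output. A static buffer is shared by every caller and thread; the commit list on this page has a later entry, PARQUET-739 ("Don't use a static buffer for data accessed by multiple …"), which suggests this kind of sharing was revisited. The sketch below shows the same chunked widening with a stack-local buffer instead. It is an illustration under assumptions, not the fix that was committed: FakeUnpack32 and WidenInChunks are hypothetical names, and the fake unpacker only copies already-decoded values.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a 32-bit bulk unpacker: copies up to `count`
// values from `src` starting at `*pos` and returns how many it produced.
int FakeUnpack32(const std::vector<uint32_t>& src, size_t* pos, uint32_t* out,
                 int count) {
  assert(count >= 0 && *pos <= src.size());
  const size_t left = src.size() - *pos;
  const int n = static_cast<int>(std::min<size_t>(left, static_cast<size_t>(count)));
  std::copy(src.begin() + *pos, src.begin() + *pos + n, out);
  *pos += static_cast<size_t>(n);
  return n;
}

// Decodes up to batch_size values into `out`, widening (or narrowing) from
// uint32_t to T in chunks. Returns the number of values written.
template <typename T>
int WidenInChunks(const std::vector<uint32_t>& src, T* out, int batch_size) {
  constexpr int kChunk = 1024;
  uint32_t scratch[kChunk];  // automatic storage: private to this call
  size_t pos = 0;
  int i = 0;
  while (i < batch_size) {
    const int want = std::min(kChunk, batch_size - i);
    const int got = FakeUnpack32(src, &pos, scratch, want);
    if (got == 0) { break; }  // input exhausted
    for (int k = 0; k < got; ++k) {
      out[i + k] = static_cast<T>(scratch[k]);
    }
    i += got;
  }
  return i;
}
```

The 4 KiB scratch array lives on the stack, so concurrent decoders cannot overwrite each other's intermediate values; the trade-off is a slightly larger stack frame per call.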