-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-242: Support Timestamp Data Type #107
Conversation
def test_timestamps_notimezone(self): | ||
data = {} | ||
|
||
df = pd.DataFrame({'datetime64[ns]': np.array(['2007-07-13', '2006-01-13', '2010-08-13'], dtype='datetime64')}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you indicate the unit in this dtype=
explicitly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(also: PEP8 here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add time + milliseconds precision to these sample values (otherwise the time bit is all zero)?
To make life more interesting, Impala stores timestamps in Parquet as Int96 https://github.com/apache/incubator-impala/blob/master/be/src/runtime/timestamp-value.h#L260 JIRA indicates that that this may be the only real use for INT96: https://issues.apache.org/jira/browse/PARQUET-323 I think it would be safe for us to insert logic that converts all INT96 values into arrow Timestamps, what do you think? @nongli @julienledem if you have any thoughts on this let us know |
@@ -45,6 +45,7 @@ const auto INT64 = std::make_shared<Int64Type>(); | |||
const auto FLOAT = std::make_shared<FloatType>(); | |||
const auto DOUBLE = std::make_shared<DoubleType>(); | |||
const auto UTF8 = std::make_shared<StringType>(); | |||
const auto TIMESTAMP = std::make_shared<TimestampType>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TimestampType
defaults to TimestampType::Unit::MILLI
-- not sure if you wanted to make this explicit here or add tests for the other units (otherwise create follow-up JIRA)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, I actually did not see that. ups, will have to improve a bit on my code then.
I think we might as well set the I'd be perfectly fine deferring the Int96 auto-conversion to a separate patch. Sorry this took me almost a week to get to. |
@wesm |
Change-Id: I8e5551cddec30888bbe37f1fc010774eba43f5dd
Change-Id: I0c33c43e4fe099b3c9261eac1544cb3343ea5834
@wesm Any chance to get a review soon? |
Sorry for the delay — while there wasn't complete consensus on the ML I think we should definitely get something merged that works for us for now. Taking a look |
+1 -- some items to address in follow up JIRAs
|
Author: Uwe L. Korn <[email protected]> Closes apache#107 from xhochy/parquet-619 and squashes the following commits: 612a2f2 [Uwe L. Korn] Check that the correct data was written 66fb32a [Uwe L. Korn] Add unit test for LocalFileOutputStream a1179d1 [Uwe L. Korn] Add missing header 6ecb940 [Uwe L. Korn] Add OutputStream for local files
Author: Uwe L. Korn <[email protected]> Closes apache#107 from xhochy/parquet-619 and squashes the following commits: 612a2f2 [Uwe L. Korn] Check that the correct data was written 66fb32a [Uwe L. Korn] Add unit test for LocalFileOutputStream a1179d1 [Uwe L. Korn] Add missing header 6ecb940 [Uwe L. Korn] Add OutputStream for local files Change-Id: Ie9403af419f8712982752b4f5256136bf9d7b26f
Author: Uwe L. Korn <[email protected]> Closes apache#107 from xhochy/parquet-619 and squashes the following commits: 612a2f2 [Uwe L. Korn] Check that the correct data was written 66fb32a [Uwe L. Korn] Add unit test for LocalFileOutputStream a1179d1 [Uwe L. Korn] Add missing header 6ecb940 [Uwe L. Korn] Add OutputStream for local files Change-Id: Ie9403af419f8712982752b4f5256136bf9d7b26f
Author: Uwe L. Korn <[email protected]> Closes apache#107 from xhochy/parquet-619 and squashes the following commits: 612a2f2 [Uwe L. Korn] Check that the correct data was written 66fb32a [Uwe L. Korn] Add unit test for LocalFileOutputStream a1179d1 [Uwe L. Korn] Add missing header 6ecb940 [Uwe L. Korn] Add OutputStream for local files Change-Id: Ie9403af419f8712982752b4f5256136bf9d7b26f
Author: Uwe L. Korn <[email protected]> Closes apache#107 from xhochy/parquet-619 and squashes the following commits: 612a2f2 [Uwe L. Korn] Check that the correct data was written 66fb32a [Uwe L. Korn] Add unit test for LocalFileOutputStream a1179d1 [Uwe L. Korn] Add missing header 6ecb940 [Uwe L. Korn] Add OutputStream for local files Change-Id: Ie9403af419f8712982752b4f5256136bf9d7b26f
* Let the index start from 0 for split_part * Correct the expected error message in unit test
For the Pandas<->Parquet bridge this is a lossy conversion but must be explicitly activated by the user.
Regarding Parquet 1.0: Yes, the logical type is not supported but should be simply ignored by the reader. Implementation for INT96 timestamps is not in the scope of this PR.