Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-242: Support Timestamp Data Type #107

Closed
wants to merge 2 commits into from
Closed

Conversation

xhochy
Copy link
Member

@xhochy xhochy commented Jul 26, 2016

For the Pandas<->Parquet bridge this is a lossy conversion but must be explicitly activated by the user.

Regarding Parquet 1.0: Yes, the logical type is not supported but should be simply ignored by the reader. Implementation for INT96 timestamps is not in the scope of this PR.

def test_timestamps_notimezone(self):
data = {}

df = pd.DataFrame({'datetime64[ns]': np.array(['2007-07-13', '2006-01-13', '2010-08-13'], dtype='datetime64')})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you indicate the unit in this dtype= explicitly?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(also: PEP8 here)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add time + milliseconds precision to these sample values (otherwise the time bit is all zero)?

@wesm
Copy link
Member

wesm commented Aug 1, 2016

To make life more interesting, Impala stores timestamps in Parquet as Int96

https://github.com/apache/incubator-impala/blob/master/be/src/runtime/timestamp-value.h#L260

JIRA indicates that that this may be the only real use for INT96:

https://issues.apache.org/jira/browse/PARQUET-323

I think it would be safe for us to insert logic that converts all INT96 values into arrow Timestamps, what do you think? @nongli @julienledem if you have any thoughts on this let us know

@@ -45,6 +45,7 @@ const auto INT64 = std::make_shared<Int64Type>();
const auto FLOAT = std::make_shared<FloatType>();
const auto DOUBLE = std::make_shared<DoubleType>();
const auto UTF8 = std::make_shared<StringType>();
const auto TIMESTAMP = std::make_shared<TimestampType>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TimestampType defaults to TimestampType::Unit::MILLI -- not sure if you wanted to make this explicit here or add tests for the other units (otherwise create follow-up JIRA)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, I actually did not see that. ups, will have to improve a bit on my code then.

@wesm
Copy link
Member

wesm commented Aug 1, 2016

I think we might as well set the TimestampType unit metadata and then separately convert automatically to milliseconds (at some point we can add an option to raise exceptions if conversion would discard some nanoseconds, for example).

I'd be perfectly fine deferring the Int96 auto-conversion to a separate patch.

Sorry this took me almost a week to get to.

@nongli
Copy link
Contributor

nongli commented Aug 1, 2016

@wesm
Having int96 treated as timestamp seems reasonable to me. I don't know of any other use for it.

@julienledem
Copy link
Member

@wesm @nongli yes, int96 was just because of (nanosec) timestamps. So it is safe to do that.

xhochy added 2 commits August 28, 2016 12:16
Change-Id: I8e5551cddec30888bbe37f1fc010774eba43f5dd
Change-Id: I0c33c43e4fe099b3c9261eac1544cb3343ea5834
@xhochy
Copy link
Member Author

xhochy commented Aug 28, 2016

@wesm Any chance to get a review soon?

@wesm
Copy link
Member

wesm commented Aug 28, 2016

Sorry for the delay — while there wasn't complete consensus on the ML I think we should definitely get something merged that works for us for now. Taking a look

@wesm
Copy link
Member

wesm commented Aug 28, 2016

+1 -- some items to address in follow up JIRAs

  • Auto-convert INT96 nanosecond timestamps to datetime64[ns] (may require using boost)
  • More general parameter for configuring type casting behavior (e.g. unit conversion) on serialization of pandas.DataFrame

@asfgit asfgit closed this in 0a411fd Aug 28, 2016
wesm pushed a commit to wesm/arrow that referenced this pull request Sep 2, 2018
Author: Uwe L. Korn <[email protected]>

Closes apache#107 from xhochy/parquet-619 and squashes the following commits:

612a2f2 [Uwe L. Korn] Check that the correct data was written
66fb32a [Uwe L. Korn] Add unit test for LocalFileOutputStream
a1179d1 [Uwe L. Korn] Add missing header
6ecb940 [Uwe L. Korn] Add OutputStream for local files
wesm pushed a commit to wesm/arrow that referenced this pull request Sep 4, 2018
Author: Uwe L. Korn <[email protected]>

Closes apache#107 from xhochy/parquet-619 and squashes the following commits:

612a2f2 [Uwe L. Korn] Check that the correct data was written
66fb32a [Uwe L. Korn] Add unit test for LocalFileOutputStream
a1179d1 [Uwe L. Korn] Add missing header
6ecb940 [Uwe L. Korn] Add OutputStream for local files

Change-Id: Ie9403af419f8712982752b4f5256136bf9d7b26f
wesm pushed a commit to wesm/arrow that referenced this pull request Sep 6, 2018
Author: Uwe L. Korn <[email protected]>

Closes apache#107 from xhochy/parquet-619 and squashes the following commits:

612a2f2 [Uwe L. Korn] Check that the correct data was written
66fb32a [Uwe L. Korn] Add unit test for LocalFileOutputStream
a1179d1 [Uwe L. Korn] Add missing header
6ecb940 [Uwe L. Korn] Add OutputStream for local files

Change-Id: Ie9403af419f8712982752b4f5256136bf9d7b26f
wesm pushed a commit to wesm/arrow that referenced this pull request Sep 7, 2018
Author: Uwe L. Korn <[email protected]>

Closes apache#107 from xhochy/parquet-619 and squashes the following commits:

612a2f2 [Uwe L. Korn] Check that the correct data was written
66fb32a [Uwe L. Korn] Add unit test for LocalFileOutputStream
a1179d1 [Uwe L. Korn] Add missing header
6ecb940 [Uwe L. Korn] Add OutputStream for local files

Change-Id: Ie9403af419f8712982752b4f5256136bf9d7b26f
wesm pushed a commit to wesm/arrow that referenced this pull request Sep 8, 2018
Author: Uwe L. Korn <[email protected]>

Closes apache#107 from xhochy/parquet-619 and squashes the following commits:

612a2f2 [Uwe L. Korn] Check that the correct data was written
66fb32a [Uwe L. Korn] Add unit test for LocalFileOutputStream
a1179d1 [Uwe L. Korn] Add missing header
6ecb940 [Uwe L. Korn] Add OutputStream for local files

Change-Id: Ie9403af419f8712982752b4f5256136bf9d7b26f
zhouyuan pushed a commit to zhouyuan/arrow that referenced this pull request Jun 2, 2022
* Let the index start from 0 for split_part

* Correct the expected error message in unit test
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants