Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] pandas migration #1347

Closed
wants to merge 140 commits into from
Closed

[WIP] pandas migration #1347

wants to merge 140 commits into from

Conversation

sstanovnik
Copy link
Contributor

@sstanovnik sstanovnik commented Jun 17, 2016

This is the giant (well, WIP) pull request for converting the existing architecture to pandas.
See this wiki page and this Google Sheet for comments on more general architectural decisions.

So far, the majority of the additional new API is done, but few functions actually work and none have been tested. This will break until I start writing tests and modifying existing code. This currently still does not pass tests, but the Table by itself is completely functional, as well as some other learners. Now in progress: making sure everything works.

Over 95 % of all base widgets now work! Most unit tests pass, the ones that don't are tied to sparse matrices and SQL support. Up next: (re)adding sparse Table and SQL Table support. All unit tests now pass (not including widget tests)!

Multi-type sparse support depends on pandas 0.19.0 (unreleased). That release will fix a lot of sparse issues (some were even fixed by me =D).

Todos and progress, along with code examples, is monitored in the Gsheet.

To check-out and properly use this right now, you have to build pandas from their git master so you have the necessary sparse changes. If you don't, some sparse features may not correctly.

Linked pull requests: biolab/orange3-text#97

@property
def columns_X(self):
"""A read-only list of X column variables."""
return [c for c in self.columns if c in self._columns_X]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is ok. If I only need "X", i should get X as a subset dataframe and then query its columns. Like:

only_data = table.columns_subset(Role.DATA)
columns_X = only_data.columns

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never mind, this is ok. I, too, would rather have:

columns_X = data.columns_X
subset_X = data[columns_X]

than

subset_X = data.subset_X
columns_X = subset_X.columns

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or not, I don't know.

What are some pros and cons for either approach?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we'll use a variant of the first approach. Getting the columns through Domain (@astaric and I had a talk today about Domain, I'll fill you in tomorrow), then selecting the data through those columns.

@xiaoerlaigeid
Copy link

Hi, I am a university student from China. My major is computer science. I want to participate Gsoc 2017. I keen on machine learning and python. Could you tell me something about Orange program? I want to contribute this program.
Thank you!

@AlexS12
Copy link

AlexS12 commented May 19, 2017

I have been having a look at Orange3 lately and find this approach (of using pandas dataframes internally) really interesting. However, I see that there has been little activity lately. Is this something that you still want to merge or the development has dead due to any major incompatibility? Are there any specific things on which I could help to accomplish this pull request?

@kernc
Copy link
Contributor

kernc commented May 19, 2017

I've continued development on a local, not yet pushed, branch. It's stalled for the moment, but it was progressing quite nicely, and I expect to have something to show quite soon!

@AlexS12
Copy link

AlexS12 commented May 19, 2017

Oh! That's great! I am looking forward to having a look at it =) Thanks for such a quick response! I will subscribe to the pull req!

@astaric
Copy link
Member

astaric commented Dec 20, 2017

As this PR touches almost every file in the repository and has not been updated for more than a year, it is highly unlikely that it will be merged. We are still interested in porting the Table to pandas, but it will have to be done in a more gradual way.

If anyone is interested in working on this, I would suggest the following steps:

  1. Add new methods to Table object that match pandas interface
    (empty(), .iterrows(), ...)
  2. Modify the code to call the new methods while deprecating the old ones
    (code calls empty(), bool becomes deprecated, ...)
  3. Figure out if Table can be ported to pandas without modifying more than 10 files, if not, return to 1. :)

Each "pandas compatible" method should be added in a separate pull request, as that eases reviewing and raises the chances of PR actually being merged. I would also split 1. and 2. into two PRs as 1 should only touch a single file and rebasing it should be much easier than the modifications of all places the code is used in 2.

@astaric astaric closed this Dec 20, 2017
@vanatteveldt
Copy link

I would also love to see pandas integration!

I think @astaric's roadmap makes sense, but I suspect that @kernc and @sstanovnik have most experience with trying to integrate pandas and orange. I would love to contribute if there is a feasible path to integration with official support from the biolab team!

@ZoraizQ
Copy link

ZoraizQ commented Jul 15, 2020

Updates?

@janezd
Copy link
Contributor

janezd commented Jul 16, 2020

It's still on a would-be-really-nice-to-have list, but nobody is actively working on it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants