-
-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] pandas migration #1347
[WIP] pandas migration #1347
Conversation
Orange/data/table.py
Outdated
@property | ||
def columns_X(self): | ||
"""A read-only list of X column variables.""" | ||
return [c for c in self.columns if c in self._columns_X] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is ok. If I only need "X", i should get X as a subset dataframe and then query its columns. Like:
only_data = table.columns_subset(Role.DATA)
columns_X = only_data.columns
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Never mind, this is ok. I, too, would rather have:
columns_X = data.columns_X
subset_X = data[columns_X]
than
subset_X = data.subset_X
columns_X = subset_X.columns
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or not, I don't know.
What are some pros and cons for either approach?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we'll use a variant of the first approach. Getting the columns through Domain (@astaric and I had a talk today about Domain, I'll fill you in tomorrow), then selecting the data through those columns.
Fix domain editor in the file widget when non-string discrete values. Also file reader hardening. Xref #1471.
Hi, I am a university student from China. My major is computer science. I want to participate Gsoc 2017. I keen on machine learning and python. Could you tell me something about Orange program? I want to contribute this program. |
I have been having a look at Orange3 lately and find this approach (of using pandas dataframes internally) really interesting. However, I see that there has been little activity lately. Is this something that you still want to merge or the development has dead due to any major incompatibility? Are there any specific things on which I could help to accomplish this pull request? |
I've continued development on a local, not yet pushed, branch. It's stalled for the moment, but it was progressing quite nicely, and I expect to have something to show quite soon! |
Oh! That's great! I am looking forward to having a look at it =) Thanks for such a quick response! I will subscribe to the pull req! |
As this PR touches almost every file in the repository and has not been updated for more than a year, it is highly unlikely that it will be merged. We are still interested in porting the Table to pandas, but it will have to be done in a more gradual way. If anyone is interested in working on this, I would suggest the following steps:
Each "pandas compatible" method should be added in a separate pull request, as that eases reviewing and raises the chances of PR actually being merged. I would also split 1. and 2. into two PRs as 1 should only touch a single file and rebasing it should be much easier than the modifications of all places the code is used in 2. |
I would also love to see pandas integration! I think @astaric's roadmap makes sense, but I suspect that @kernc and @sstanovnik have most experience with trying to integrate pandas and orange. I would love to contribute if there is a feasible path to integration with official support from the biolab team! |
Updates? |
It's still on a would-be-really-nice-to-have list, but nobody is actively working on it. |
This is the giant
(well, WIP)pull request for converting the existing architecture topandas
.See this wiki page and this Google Sheet for comments on more general architectural decisions.
So far, the majority of the additional new API is done, but few functions actually work and none have been tested. This will break until I start writing tests and modifying existing code. This currently still does not pass tests, but the Table by itself is completely functional, as well as some other learners. Now in progress: making sure everything works.Over 95 % of all base widgets now work!
Most unit tests pass, the ones that don't are tied to sparse matrices and SQL support. Up next: (re)adding sparse Table and SQL Table support.All unit tests now pass(not including widget tests)!Multi-type sparse support depends on pandas 0.19.0 (unreleased). That release will fix a lot of sparse issues (some were even fixed by me =D).
Todos and progress, along with code examples, is monitored in the Gsheet.
To check-out and properly use this right now, you have to build
pandas
from their git master so you have the necessary sparse changes. If you don't, some sparse features may not correctly.Linked pull requests: biolab/orange3-text#97