[pandas] Data Table crash (test needed) #1518
Blocks #1347.
kernc pushed a commit to kernc/orange3 that referenced this issue on Nov 11, 2016
Note: Contains commit messages of the whole initial branch squashed, including those that belong into some of the commits following this one.
Extend str in Variable. This is needed so a Variable plays nicely with pandas' columns.
pandas migration: first huge, breaking, table update. Changes to other parts of the codebase will follow. This exposes the new API and is meant for review purposes. Nothing has been tested so far, no compatibility changes have been applied. Discussion in the issue and linked pages.
Enable strict read-only access on Table X/Y/meta views.
Further changes to Table, as per recent comments.
Add Table.attributes to the pandas persistence scheme.
Insert pandas into requirements-core.
Table constructors and other fixes. Constructors should now work, but no tests have been written or modified yet. This also deletes filter helpers from Table. Also, various miscellaneous fixes found through debugging. Major missing functionality: domains and reading/saving data from files.
OWSelectRows: transform usage of Filter into pandas syntax. There is now a new (internal) Filter class that joins display and filtering functionality, instead of relying on indexing magic and the archaic Filters. Includes tests for the new Filter class, but the widget itself has not been tested.
Completely remove Filter. Remove filter and transform the few usages into pandas syntax. SQL filter is left over and dummied out for when we'll tackle the whole SQL debacle.
Table Domain changes, Variable inference and miscellaneous fixes. Made Table.domain the authority on table roles. Raw data constructors now work. Moved Variable type and role inference to Table and improved it to handle the variable-column separation we now have. Table.append now has the same contract as pandas.
Make indexing and weights work from empty Tables. Empty tables need to have a new index set when they don't have one, and a set of weights if the table was empty (otherwise they'd be NA).
We also always select the weights when subsetting more than one column, so they're preserved. We don't transfer weights when explicitly selecting only one column (with t["colname"]) because that would break the contract that pandas returns a Series in that case. This does not apply for selecting one column in multiple column selection mode (with t[["colname"]]), because that returns a DataFrame and the user reasonably expects to have the weights transferred (because the return type is expected to be a DataFrame, not a Series).
Remove RowInstance completely.
Remove and transform infrequently-used old syntax. This includes, but is not limited to: RowInstance, Table.columns, Table.x|y|metas assignments, Table.extend|append|shuffle|ensure_copy, Table weights operations (old set_weights syntax, has_weights, total_weight), Domain.from_numpy (this is in Table._infer_from now).
Completely remove Instance. Instance is just a single-row slice of a table. Some usages will need to be adapted (in the base layer) and maybe some things in TableSeries.
Remove Storage. The dense/sparse constants are now in Table.
Some minor fixes and cleanup in Table.
A multitude of small fixes of bugs shown by tests. The tests have also been changed where appropriate to reflect the new API and fix old hacks that are not relevant any more.
Use pandas' reader for csv and tab. Pandas' reader is more performant and translates nicely into the pandas Table. Variable role, type and name inference was moved to Domain so both Table and IO can use the same code. Excel reading is broken in this commit. Parsing iris works.
Tab reader fixes for less common behaviour.
Port ExcelReader to use pandas' Excel reader. Pandas' reader is suboptimal and doesn't do the magic we would assume it does, so this code is slightly uglier than expected.
Improve DiscreteVariable discreteness determination and parsing.
Transform values we interpret as null with actual null values when creating Tables.
Handle NA weights when setting them.
Use pandas' categorical coltype for DiscreteVariable. This allows better col.describe output and improves semantic separation.
Convert TimeVariable functionality to pandas. Now using pandas' datetime columns. Variables keep track of timezones for display purposes only, storage is completely pandas. This adds pytz and dateutil as dependencies so we can properly determine and process timezones. Formats are mostly the same as before, but ambiguous ones (the ones that pandas doesn't parse) were removed.
Small fixes: sniffer size, NA weights handling.
Variable equality fix and TimeVariable test modification.
A lot of small fixes for issues found by tests. Important notes: from_table_rows now preserves domain, from_numpy sets weights in all cases, appending and concatenating now work properly.
Fix handling null values in TimeVariable columns.
Improve reading tab and csv files. This is now more robust for small datasets and works in edge cases with three-line headers and an empty third line. Some small issues still remain.
Remove an unneeded test.
Compatibility shims for SQL table.
Make Data Table work, transfer basic stats to pandas. Basic stats are no longer based on bottleneck (explicitly). Much hooray!
Multiple fixes: TableSeries retain attributes, constructor works properly, discrete value transformations. There is a proxy to Table._constructor_sliced that handles setting attributes in a new Series; but this is ugly. Constructors now forward things to pandas properly and from_table works with TableSeries. Discrete values given with their values (the .values indices) can now be transformed in from_numpy into their correct symbolic value.
AdaBoost tests now work.
Remove Value.
Fix recent TimeVariable changes. We can't parse string timestamps as well as some other all-integer timestring formats. Removed string timestamp from supported formats.
Some basic fixes for subscripting pandas.
Migrate distributions to pandas.
This removes the cython valuecount implementation.
Ported contingency to pandas. This removes the cython code and GREATLY simplifies computation and readability.
Remove statistics/util.py. This isn't needed: all uses in contingency, basic stats and distributions are now handled by pandas.
Adapt distances and tests to work with the new Table. Sparse support pending.
Adapt preprocessors (discretize, impute) to work with the new Table. In particular, the table now contains direct discretized labels (which get converted with .X).
Loads of test and compatibility fixes.
- Domain can now convert TableSeries (was RowInstance)
- Table methods now support being used in TableSeries (in the future, a different inheritance structure will be used, so this is okay)
- creating a table from a list now allows missing class columns (they get set to NA)
- Table.iterrows generates TableSeries instead of Series
- when no columns are specified for contingency, all attributes (instead of all variables) are used
- classification test modifications for compatibility
- miscellaneous fixes
k-Means compatibility fixes.
Transform Continuize for usage with pandas. Of note is the new Ordinalize transformation, which returns the numeric value (as it appears in e.g. table.X). Modifications have been made to account for the fact that we now store labels directly instead of indices.
Fix parsing files with discrete variables which specify values. Correctly parse the values into numbers (if we can do so for every one), so comparisons work in all cases. This avoids the problem of having NA values in the table because '1' is not in the categorical values, but 1 is.
Discretization pandas compatibility, also test fixes.
Evaluation - scoring test compatibility fix.
Interpret missing value markers when reading from file. Effect: when a column is numeric and has a missing value marked with '?', this commit makes reading values output numbers.
Previously, the whole column would be of strings because it would have to contain the '?' string.
Impute pandas adaptation, with tests.
Fix transforming discrete ordinal values into descriptor values. Also includes a fix for knn tests.
Distributions should use weights instead of counts.
Miscellaneous test adaptations.
Convert normalization, use groupby instead of value_count in distributions.
Add copying to table constructors, very important! Subtle and hard-to-debug errors may arise if data is not copied, because indices and data may become out-of-sync when modifying data in other tables.
Migrate randomization to pandas.
A small fix for caching table transformations.
Miscellaneous test compatibility fixes.
Migrate remover and its tests.
Simple tree and softmax adaptation. Of note is the forced value type to float: without it, integer arrays are interpreted as double pointers and their values are 0, which breaks weights.
Only allow one of specified delimiters when reading file. Example: TabReader would only allow tabs, even if a comma is sniffed. The fallback is the first specified delimiter.
Remove Value tests. These weren't removed with Value because it only uses Values implicitly and the usages weren't linked.
Miscellaneous table test compatibility fixes.
Use 0 instead of NA when values don't exist in distributions.
Feature scoring test compatibility.
A bucketload of fixes for widgets. Over 90 % of base widgets now work. I haven't checked every single button, but they produce the intended result. Merge data, feature constructor and Venn diagram need a bit more work, so those aren't functional yet.
Fix owcontinuize to use proper continuization behaviour.
Don't interpret None as a missing value when reading a table. The only drawback is that creating a table manually in the code, with None, doesn't work. Use "?" instead.
Use proper top-level imports. D'oh!
Use a more robust way of computing basic stats.
Discrete variables don't have some stats, so compute them separately.
A small fix for the new single-class test.
Fixes for some elusive tests.
Port SQLTable to a pandas backend. Some breaking changes. A major breaking change is that __len__ no longer returns the backing SQL table length, but instead gives the length of the downloaded data. Use SqlTable.exact_len for previous behaviour. This is needed because pandas requires __len__ to be the actual table length, otherwise nothing works. A new package has been introduced: Orange.data.sql.compat. It contains Filter, Value and Instance because porting those would require a radical change, at which point we could just use Spark as well. All unit tests pass.
Completely overhaul the Table class inheritance structure. Now with specialized classes for every use case. Most of the code is still in the base class, as there is no need for specialization. What is required, though, is creating a separate SparseTable class which extends from pd.SparseDataFrame (instead of pd.DataFrame) and does not extend Table to avoid inheritance problems. No actual explicit sparse support yet, coming soon.
Fix some broken Table imports.
Basic SparseTable functionality. Contingency and distributions don't work yet.
Distributions for sparse tables.
A snail-paced implementation of contingency computation for sparse tables. Will try to make it faster.
Improved the reading capabilities.
Fix elusive tests.
Add numexpr to requirements-core. This is used by pandas for speedups.
Use actual values instead of indices when constructing discretes.
Merge domain when not rowstacking concatenated tables.
Widget test adaptation and widget fixes.
REVIEWME: 'fixed' displaying SQL tables.
Test fixes, remove sql.compat.Value
Hopefully fix some strange failing tests. Regarding changing sys.modules when iterating through it in assertWarns.
Remove val_from_str_add.
Add to_var_col, a slightly optimized version of to_val.
Remove TableBase.DENSE and related indicators.
Remove PanelBase and SparseTablePanel. SparsePanel will be removed from pandas in 0.19.0.
Docstring bonanza! Orange.data only, though, and without sql.
Remove some old, unused, deprecated things from TableBase.
Remove variable.to_val_col. It was a weird idea anyway. Merged functionality into to_val.
Documentation slightly updated. This was a quick pass for broad-stroke removals and changes. Scripts generally work, but the textual documentation is surely outdated. Much more work needed.
Increase test coverage, some bugfixes.
Improve distributions, also coverage.
Increase contingency coverage.
Fix OWHeatmap and its recent tests.
Bump minimum version of pandas above 0.18.0. This is needed for sparse multi-type support.
Sparse fixes and improvements. Depends on pandas 0.19.0 because of sparse multitype.
Excel sheet naming.
A multitude of fixes.
A truckload of changes. Remove Table.append. From list with missing class fixes, indent, lesser __setitem__ breakage. Remove the many missing-value replaces. Prevent multiple calls to __init__. Proper finalization and domain filtering. Custom __str__ and __repr__, needs some work.
REVIEWME: custom __iter__, iterates over rows, breaks pandas contract. Only pandas' __str__ and __repr__ used iteration over rows, our tests pass either way. This is a very high-level decision which needs the consensus of the whole group.
Merge Data fix, new fun TableBase.merge method!
Fix venn diagram.
Fix Data Table.
Switch inputs from Table to TableBase.
Much better __str__, uses pandas magic.
Expect Orange instead of pandas behaviour in constructors.
Use pure numpy ops for transforming discretes into categoricals.
Change usages of checksum to hash.
Add a notification comment and test for the iterrows wrapper.
Remove shuffle in favour of .sample(frac=1).
Consolidate usages of the _transferer hack.
Comments and tests to setUpClass, other test fixes.
Add time component awareness to TimeVariable.
Fix a failing doctest.
Improve time column display with month and day.
Add a pandas git build to travis.
Some general fixes, report test fixes.
Requirements.txt requires a different requirement format.
Further improvements to the documentation.
Revert 68b18c5: overriding __iter__.
Simplify weight assignment.
Cherry-pick: sstanovnik/orange3:benches.
Weight setting robustness.
Properer sparse handling.
Always convert weights to floats on assignment.
Fix visualizing continuous variables in Data Table. Closes biolab#1518.
Significantly improve feature constructor performance. Mends biolab#1519.
Domain editor fix and file reader hardening. Fix domain editor in the file widget when non-string discrete values. Also file reader hardening. Xref biolab#1471.
Fix a failing owkmeans test.
Select "Visualize continuous values" in Data Table. It crashes.