[pandas] Data Table crash (test needed) #1518

Closed
astaric opened this issue Aug 24, 2016 · 1 comment
@astaric
Member

astaric commented Aug 24, 2016

Select "Visualize continuous values" in Data Table. It crashes.

--------------------------------------------------------------------------------
AttributeError                                Traceback (most recent call last):
  File "/Users/anze/dev/orange3/Orange/widgets/utils/itemmodels.py", line 1040, in data
    return instance.get_class()
  File "/Users/anze/miniconda3/envs/o3/lib/python3.5/site-packages/pandas/core/generic.py", line 2743, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'TableSeries' object has no attribute 'get_class'
--------------------------------------------------------------------------------
AttributeError                                Traceback (most recent call last):
  File "/Users/anze/dev/orange3/Orange/widgets/gui.py", line 2779, in paint
    if class_.variable.is_discrete and \
AttributeError: 'NoneType' object has no attribute 'variable'
--------------------------------------------------------------------------------
@astaric astaric added this to the pandas milestone Aug 24, 2016
@kernc
Contributor

kernc commented Aug 24, 2016

Blocks #1347.

kernc pushed a commit to kernc/orange3 that referenced this issue Nov 11, 2016
Note: Contains commit messages of the whole initial branch squashed,
including those that belong into some of the commits following this one.

Extend str in Variable.

This is needed so a Variable plays nicely with pandas' columns.

pandas migration: first huge, breaking, table update.

Changes to other parts of the codebase will follow. This exposes the new
API and is meant for review purposes. Nothing has been tested so far, no
compatibility changes have been applied. Discussion in the issue and
linked pages.

Enable strict read-only access on Table X/Y/meta views.

Further changes to Table, as per recent comments.

Add Table.attributes to the pandas persistence scheme.

Insert pandas into requirements-core.

Table constructors and other fixes.

Constructors should now work, but no tests have been written or modified
yet. This also deletes filter helpers from Table. Also, various
miscellaneous fixes found through debugging. Major missing
functionality: domains and reading/saving data from files.

OWSelectRows: transform usage of Filter into pandas syntax.

There is now a new (internal) Filter class that joins display and
filtering functionality, instead of relying on indexing magic and the
archaic Filters. Includes tests for the new Filter class, but the widget
itself has not been tested.

Completely remove Filter.

Remove filter and transform the few usages into pandas syntax. SQL
filter is left over and dummied out for when we'll tackle the whole SQL
debacle.

Table Domain changes, Variable inference and miscellaneous fixes.

Made Table.domain the authority on table roles. Raw data constructors
now work. Moved Variable type and role inference to Table and improved
it to handle the variable-column separation we now have. Table.append
now has the same contract as pandas.

Make indexing and weights work from empty Tables.

Empty tables need to have a new index set when they don't have one, and
a set of weights if the table was empty (otherwise they'd be NA). We
also always select the weights when subsetting more than one column, so they're
preserved.

We don't transfer weights when explicitly selecting only one column
(with t["colname"]) because that would break the contract that pandas
returns a Series in that case. This does not apply for selecting one
column in multiple column selection mode (with t[["colname"]]), because
that returns a DataFrame and the user reasonably expects to have the
weights transferred (because the return type is expected to be a
DataFrame, not a Series).
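
The Series/DataFrame contract referred to above can be seen with plain pandas. This is only a minimal sketch on a toy frame: the column names `colname` and `weights` are made up for illustration and say nothing about how Table stores its weights internally.

```python
import pandas as pd

# Hypothetical toy data; "weights" merely stands in for a weight column.
df = pd.DataFrame({"colname": [1.0, 2.0, 3.0], "weights": [1.0, 0.5, 2.0]})

single = df["colname"]    # a single label returns a Series
multi = df[["colname"]]   # a list of labels returns a DataFrame

print(type(single).__name__, type(multi).__name__)
```

Because the second form is already a DataFrame, carrying an extra weights column along with it does not change the return type, which is presumably why the two cases are treated differently.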

Remove RowInstance completely.

Remove and transform infrequently-used old syntax.

This includes, but is not limited to: RowInstance, Table.columns,
Table.x|y|metas assignments, Table.extend|append|shuffle|ensure_copy,
Table weights operations (old set_weights syntax, has_weights,
total_weight), Domain.from_numpy (this is in Table._infer_from now).

Completely remove Instance.

Instance is just a single-row slice of a table. Some usages will need to
be adapted (in the base layer) and maybe some things in TableSeries.

Remove Storage.

The dense/sparse constants are now in Table.

Some minor fixes and cleanup in Table.

A multitude of small fixes of bugs shown by tests.

The tests have also been changed where appropriate to reflect the new
API and fix old hacks that are not relevant any more.

Use pandas' reader for csv and tab.

Pandas' reader is more performant and translates nicely into the pandas
Table. Variable role, type and name inference was moved to Domain so
both Table and IO can use the same code.

Excel reading is broken in this commit. Parsing iris works.

Tab reader fixes for less common behaviour.

Port ExcelReader to use pandas' Excel reader.

Pandas' reader is suboptimal and doesn't do the magic we would assume it
does, so this code is slightly uglier than expected.

Improve DiscreteVariable discreteness determination and parsing.

Transform values we interpret as null with actual null values when
creating Tables.

Handle NA weights when setting them.

Use pandas' categorical coltype for DiscreteVariable.

This allows better col.describe output and improves semantic separation.
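
As a rough illustration of the describe output mentioned here (plain pandas on made-up labels, not Orange code), a categorical column is summarised by count, number of unique values, most frequent value and its frequency:

```python
import pandas as pd

# Made-up labels, converted to pandas' categorical dtype.
labels = pd.Series(["a", "b", "a", "c", "a"]).astype("category")

summary = labels.describe()
print(summary)
```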

Convert TimeVariable functionality to pandas.

Now using pandas' datetime columns. Variables keep track of timezones
for display purposes only, storage is completely pandas. This adds pytz
and dateutil as dependencies so we can properly determine and process
timezones. Formats are mostly the same as before, but ambiguous ones
(the ones that pandas doesn't parse) were removed.

Small fixes: sniffer size, NA weights handling.

Variable equality fix and TimeVariable test modification.

A lot of small fixes for issues found by tests.

Important notes: from_table_rows now preserves domain, from_numpy sets
weights in all cases, appending and concatenating now work properly.

Fix handling null values in TimeVariable columns.

Improve reading tab and csv files.

This is now more robust for small datasets and works in edge cases with
three-line headers and an empty third line. Some small issues still
remain.

Remove an unneeded test.

Compatibility shims for SQL table.

Make Data Table work, transfer basic stats to pandas.

Basic stats are no longer based on bottleneck (explicitly). Much hooray!

Multiple fixes: TableSeries retain attributes, constructor works
properly, discrete value transformations.

There is a proxy to Table._constructor_sliced that handles setting
attributes in a new Series; but this is ugly. Constructors now forward
things to pandas properly and from_table works with TableSeries.
Discrete values given with their values (the .values indices) can now be
transformed in from_numpy into their correct symbolic value.

AdaBoost tests now work.

Remove Value.

Fix recent TimeVariable changes.

We can't parse string timestamps as well as some other all-integer
timestring formats. Removed string timestamp from supported formats.

Some basic fixes for subscripting pandas.

Migrate distributions to pandas.

This removes the cython valuecount implementation.

Ported contingency to pandas.

This removes the cython code and GREATLY simplifies computation and
readability.
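
For readers unfamiliar with the pandas side, a contingency table can be computed in one call with `pd.crosstab`. Whether the ported code uses `crosstab` or something else is not stated in this message, so treat the following as a sketch on invented data rather than the actual implementation:

```python
import pandas as pd

# Invented example data: one attribute column and one class column.
df = pd.DataFrame({
    "colour": ["red", "red", "blue", "blue", "red"],
    "cls": ["yes", "no", "yes", "yes", "yes"],
})

# Rows are attribute values, columns are class values, cells are counts.
contingency = pd.crosstab(df["colour"], df["cls"])
print(contingency)
```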

Remove statistics/util.py.

This isn't needed: all uses in contingency, basic stats and
distributions are now handled by pandas.

Adapt distances and tests to work with the new Table.

Sparse support pending.

Adapt preprocessors (discretize, impute) to work with the new Table.

In particular, the table now contains direct discretized labels (which
get converted with .X).

Loads of test and compatibility fixes.

 - Domain can now convert TableSeries (was RowInstance)
 - Table methods now support being used in TableSeries
   (in the future, a different inheritance structure will be used, so this is okay)
 - creating a table from a list now allows missing class columns
   (they get set to NA)
 - Table.iterrows generates TableSeries instead of Series
 - when no columns are specified for contingency, all attributes (instead
   of all variables) are used
 - classification test modifications for compatibility
 - miscellaneous fixes

k-Means compatibility fixes.

Transform Continuize for usage with pandas.

Of note is the new Ordinalize transformation, which returns the numeric
value (as it appears in e.g. table.X). Modifications have been made to
account for the fact that we now store labels directly--instead of
indices.
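
The labels-versus-indices distinction can be sketched with a plain pandas categorical. This is not the Ordinalize transformation itself, only an assumption-laden illustration of the idea using `Series.cat.codes` on invented values:

```python
import pandas as pd

# The column stores labels directly; category order is declared explicitly.
col = pd.Series(
    pd.Categorical(["low", "high", "mid", "low"],
                   categories=["low", "mid", "high"])
)

# The numeric view: position of each label in the category list.
codes = col.cat.codes
print(list(col), list(codes))
```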

Fix parsing files with discrete variables which specify values.

Correctly parse the values into numbers (if we can do so for every one),
so comparisons work in all cases. This avoids the problem of having NA
values in the table because '1' is not in the categorical values, but 1
is.
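
The mismatch described above is easy to reproduce in plain pandas. The sketch below uses invented values and a simple `int()` conversion as a stand-in for whatever parsing the reader actually performs:

```python
import pandas as pd

raw = ["1", "2", "1"]  # values as read from a file: strings

# Categories are numbers, values are strings: nothing matches, all become NA.
mismatched = pd.Categorical(raw, categories=[1, 2])

# Parsing the values to numbers first makes them match the categories.
parsed = pd.Categorical([int(v) for v in raw], categories=[1, 2])

print(mismatched.isna().all(), parsed.isna().any())
```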

Discretization pandas compatibility, also test fixes.

Evaluation - scoring test compatibility fix.

Interpret missing value markers when reading from file.

Effect: when a column is numeric and has a missing value marked with
'?', this commit makes reading values output numbers. Previously, the
whole column would be of strings because it would have to contain the
'?' string.
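
The effect can be illustrated with pandas' own `read_csv` and its `na_values` parameter. This is a sketch on an inline, made-up file; the commit may well handle the markers differently:

```python
import io
import pandas as pd

text = "x,y\n1.5,a\n?,b\n3.0,c\n"  # made-up file contents

# Without a missing-value marker, the "?" forces the column to hold strings.
plain = pd.read_csv(io.StringIO(text))

# Declaring "?" as missing lets the column be parsed as numbers.
marked = pd.read_csv(io.StringIO(text), na_values=["?"])

print(plain["x"].dtype, marked["x"].dtype)
```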

Impute pandas adaptation, with tests.

Fix transforming discrete ordinal values into descriptor values.

Also includes a fix for knn tests.

Distributions should use weights instead of counts.

Miscellaneous test adaptations.

Convert normalization, use groupby instead of value_count in distributions.
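
A weighted distribution via groupby might look like the following. The column names `cls` and `weight` are hypothetical and the real code is likely more involved; the point is only that summing weights per group replaces counting rows:

```python
import pandas as pd

# Hypothetical data: a class column and per-row weights.
df = pd.DataFrame({
    "cls": ["a", "b", "a", "c"],
    "weight": [1.0, 2.0, 0.5, 1.0],
})

# Sum of weights per value instead of a plain row count.
dist = df.groupby("cls")["weight"].sum()
print(dist)
```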

Add copying to table constructors, very important!

Subtle and hard-to-debug errors may arise if data is not copied, because
indices and data may become out-of-sync when modifying data in other
tables.
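
A small sketch of why the copy matters, using the `copy=True` argument of the plain `pd.DataFrame` constructor (the Table constructors themselves are not shown here):

```python
import numpy as np
import pandas as pd

source = np.array([[1.0, 2.0], [3.0, 4.0]])

# copy=True guarantees the frame owns its data.
table = pd.DataFrame(source, columns=["a", "b"], copy=True)

# Later changes to the source array no longer leak into the table.
source[0, 0] = 99.0
print(table.loc[0, "a"])
```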

Migrate randomization to pandas.

A small fix for caching table transformations.

Miscellaneous test compatibility fixes.

Migrate remover and its tests.

Simple tree and softmax adaptation.

Of note is the forced value type to float: without it, integer arrays
are interpreted as double pointers and their values are 0, which breaks
weights.

Only allow one of specified delimiters when reading file.

Example: TabReader would only allow tabs, even if a comma is sniffed.
The fallback is the first specified delimiter.
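
One way to express this rule is with the standard library's `csv.Sniffer`, which accepts a set of permitted delimiters. The helper name `pick_delimiter` is hypothetical and the actual reader code may be structured quite differently:

```python
import csv

def pick_delimiter(sample, allowed):
    """Return a sniffed delimiter from `allowed`, else the first allowed one."""
    try:
        dialect = csv.Sniffer().sniff(sample, delimiters="".join(allowed))
        if dialect.delimiter in allowed:
            return dialect.delimiter
    except csv.Error:
        # The sniffer could not settle on any of the allowed delimiters.
        pass
    return allowed[0]

# A tab-only reader keeps using tabs even for comma-separated text.
tab_only = pick_delimiter("a,b,c\n1,2,3\n", ["\t"])
tab_or_comma = pick_delimiter("a\tb\n1\t2\n", ["\t", ","])
```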

Remove Value tests.

These weren't removed with Value because it only uses Values implicitly
and the usages weren't linked.

Miscellaneous table test compatibility fixes.

Use 0 instead of NA when values don't exist in distributions.
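
In plain pandas, one way to get zeros for values that never occur is `reindex` with `fill_value=0`. This is a sketch with an invented value list, not necessarily what the commit does:

```python
import pandas as pd

all_values = ["a", "b", "c"]            # every value the variable can take
observed = pd.Series(["a", "a", "c"])   # "b" never occurs

counts = observed.value_counts()
# Without fill_value, the missing "b" entry would come out as NA.
full = counts.reindex(all_values, fill_value=0)
print(full.tolist())
```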

Feature scoring test compatibility.

A bucketload of fixes for widgets.

Over 90 % of base widgets now work. I haven't checked every single
button, but they produce the intended result. Merge data, feature
constructor and Venn diagram need a bit more work, so those aren't
functional yet.

Fix owcontinuize to use proper continuization behaviour.

Don't interpret None as a missing value when reading a table.

The only drawback is that creating a table manually in the code, with
None, doesn't work. Use "?" instead.

Use proper top-level imports. D'oh!

Use a more robust way of computing basic stats.

Discrete variables don't have some stats, so compute them separately.

A small fix for the new single-class test.

Fixes for some elusive tests.

Port SQLTable to a pandas backend. Some breaking changes.

A major breaking change is that __len__ no longer returns the backing
SQL table length, but instead gives the length of the downloaded data.
Use SqlTable.exact_len for previous behaviour. This is needed because
pandas requires __len__ to be the actual table length, otherwise
nothing works.

A new package has been introduced: Orange.data.sql.compat. It contains
Filter, Value and Instance because porting those would require a radical
change, at which point we could just use Spark as well.

All unit tests pass.

Completely overhaul the Table class inheritance structure.

Now with specialized classes for every use case. Most of the code is
still in the base class, as there is no need for specialization. What is
required, though, is creating a separate SparseTable class which extends
from pd.SparseDataFrame (instead of pd.DataFrame) and does not extend
Table to avoid inheritance problems.

No actual explicit sparse support yet, coming soon.

Fix some broken Table imports.

Basic SparseTable functionality.

Contingency and distributions don't work yet.

Distributions for sparse tables.

A snail-paced implementation of contingency computation for sparse
tables.

Will try to make it faster.

Improved the reading capabilities.

Fix elusive tests.

Add numexpr to requirements-core.

This is used by pandas for speedups.

Use actual values instead of indices when constructing discretes.

Merge domain when not rowstacking concatenated tables.

Widget test adaptation and widget fixes.

REVIEWME: 'fixed' displaying SQL tables.

Test fixes, remove sql.compat.Value

Hopefully fix some strange failing tests.

Regarding changing sys.modules when iterating through it in assertWarns.

Remove val_from_str_add.

Add to_var_col, a slightly optimized version of to_val.

Remove TableBase.DENSE and related indicators.

Remove PanelBase and SparseTablePanel.

SparsePanel will be removed from pandas in 0.19.0.

Docstring bonanza!

Orange.data only, though, and without sql.

Remove some old, unused, deprecated things from TableBase.

Remove variable.to_val_col.

It was a weird idea anyway. Merged functionality into to_val.

Documentation slightly updated.

This was a quick pass for broad-stroke removals and changes. Scripts
generally work, but the textual documentation is surely outdated. Much
more work needed.

Increase test coverage, some bugfixes.

Improve distributions, also coverage.

Increase contingency coverage.

Fix OWHeatmap and its recent tests.

Bump minimum version of pandas above 0.18.0.

This is needed for sparse multi-type support.

Sparse fixes and improvements.

Depends on pandas 0.19.0 because of sparse multitype.

Excel sheet naming.

A multitude of fixes.

A truckload of changes.

Remove Table.append.

From list with missing class fixes, indent, lesser __setitem__ breakage.

Remove the many missing-value replaces.

Prevent multiple calls to __init__.

Proper finalization and domain filtering.

Custom __str__ and __repr__, needs some work.

REVIEWME: custom __iter__, iterates over rows, breaks pandas contract.

Only pandas' __str__ and __repr__ used iteration over rows, our tests
pass either way. This is a very high-level decision which needs the
consensus of the whole group.

Merge Data fix, new fun TableBase.merge method!

Fix venn diagram.

Fix Data Table.

Switch inputs from Table to TableBase.

Much better __str__, uses pandas magic.

Expect Orange instead of pandas behaviour in constructors.

Use pure numpy ops for transforming discretes into categoricals.

Change usages of checksum to hash.

Add a notification comment and test for the iterrows wrapper.

Remove shuffle in favour of .sample(frac=1).
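
For reference, the pandas idiom looks like this on a toy frame; `random_state` is included only to make the sketch reproducible:

```python
import pandas as pd

df = pd.DataFrame({"x": range(5)})

# frac=1 samples every row exactly once, i.e. a shuffled copy of the frame.
shuffled = df.sample(frac=1, random_state=0)

print(sorted(shuffled.index) == list(df.index))
```

The original index labels travel with the rows, so chaining `.reset_index(drop=True)` is an option when a fresh 0..n-1 index is wanted.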

Consolidate usages of the _transferer hack.

Comments and tests to setUpClass, other test fixes.

Add time component awareness to TimeVariable.

Fix a failing doctest.

Improve time column display with month and day.

Add a pandas git build to travis.

Some general fixes, report test fixes.

Requirements.txt requires a different requirement format.

Further improvements to the documentation.

Revert 68b18c5: overriding __iter__.

Simplify weight assignment.

Cherry-pick: sstanovnik/orange3:benches.

Weight setting robustness.

Properer sparse handling.

Always convert weights to floats on assignment.

Fix visualizing continuous variables in Data Table.

Closes biolab#1518.

Significantly improve feature constructor performance.

Mends biolab#1519.

Domain editor fix and file reader hardening.

Fix domain editor in the file widget for non-string discrete values.
Also file reader hardening. Xref biolab#1471.

Fix a failing owkmeans test.