[WIP] pandas migration #1347

Closed
wants to merge 140 commits into from
bb960ae
Extend str in Variable.
sstanovnik Jun 10, 2016
5a8846c
pandas migration: first huge, breaking, table update.
sstanovnik Jun 17, 2016
97fab90
Enable strict read-only access on Table X/Y/meta views.
sstanovnik Jun 17, 2016
3e7a853
Further changes to Table, as per recent comments.
sstanovnik Jun 22, 2016
81ddf10
Add Table.attributes to the pandas persistence scheme.
sstanovnik Jun 23, 2016
89907af
Insert pandas into requirements-core.
sstanovnik Jun 23, 2016
b732ed7
Table constructors and other fixes.
sstanovnik Jun 23, 2016
7af5a67
OWSelectRows: transform usage of Filter into pandas syntax.
sstanovnik Jun 23, 2016
85f6920
Completely remove Filter.
sstanovnik Jun 23, 2016
b8ee3b5
Table Domain changes, Variable inference and miscellaneous fixes.
sstanovnik Jun 25, 2016
7d3e29b
Make indexing and weights work from empty Tables.
sstanovnik Jun 28, 2016
ebb5a1f
Remove RowInstance completely.
sstanovnik Jun 28, 2016
5853ec5
Remove and transform infrequently-used old syntax.
sstanovnik Jun 28, 2016
77bf036
Completely remove Instance.
sstanovnik Jun 28, 2016
43f1a7f
Remove Storage.
sstanovnik Jun 28, 2016
d7a159d
Some minor fixes and cleanup in Table.
sstanovnik Jun 28, 2016
8623abc
A multitude of small fixes of bugs shown by tests.
sstanovnik Jun 29, 2016
3f7369b
Use pandas' reader for csv and tab.
sstanovnik Jun 29, 2016
827d4e1
Tab reader fixes for less common behaviour.
sstanovnik Jun 30, 2016
36d5b94
Port ExcelReader to use pandas' Excel reader.
sstanovnik Jun 30, 2016
8535478
Improve DiscreteVariable discreteness determination and parsing.
sstanovnik Jun 30, 2016
ba08eaf
Transform values we interpret as null with actual null values when
sstanovnik Jun 30, 2016
ce5cf4f
Handle NA weights when setting them.
sstanovnik Jun 30, 2016
f3a064b
Use pandas' categorical coltype for DiscreteVariable.
sstanovnik Jul 1, 2016
c170ade
Convert TimeVariable functionality to pandas.
sstanovnik Jul 1, 2016
745ca03
Small fixes: sniffer size, NA weights handling.
sstanovnik Jul 1, 2016
4a656b1
Variable equality fix and TimeVariable test modification.
sstanovnik Jul 1, 2016
ac914f0
A lot of small fixes for issues found by tests.
sstanovnik Jul 4, 2016
eca5618
Fix handling null values in TimeVariable columns.
sstanovnik Jul 5, 2016
fae01ac
Improve reading tab and csv files.
sstanovnik Jul 5, 2016
601d82d
Remove an unneeded test.
sstanovnik Jul 5, 2016
99ecf47
Compatibility shims for SQL table.
sstanovnik Jul 5, 2016
135eb47
Make Data Table work, transfer basic stats to pandas.
sstanovnik Jul 5, 2016
78f0db1
Multiple fixes: TableSeries retain attributes, constructor works
sstanovnik Jul 11, 2016
699b240
Remove Value.
sstanovnik Jul 11, 2016
360d0df
Fix recent TimeVariable changes.
sstanovnik Jul 11, 2016
38b648c
Some basic fixes for subscripting pandas.
sstanovnik Jul 11, 2016
9a4ef13
Migrate distributions to pandas.
sstanovnik Jul 11, 2016
2711440
Ported contingency to pandas.
sstanovnik Jul 12, 2016
e6f0cd9
Remove statistics/util.py.
sstanovnik Jul 12, 2016
bef5741
Adapt distances and tests to work with the new Table.
sstanovnik Jul 13, 2016
81bad83
Adapt preprocessors (discretize, impute) to work with the new Table.
sstanovnik Jul 13, 2016
5020b89
Loads of test and compatibility fixes.
sstanovnik Jul 13, 2016
90c43ec
k-Means compatibility fixes.
sstanovnik Jul 13, 2016
2dab765
Transform Continuize for usage with pandas.
sstanovnik Jul 13, 2016
54677f0
Fix parsing files with discrete variables which specify values.
sstanovnik Jul 13, 2016
9146bfa
Discretization pandas compatibility, also test fixes.
sstanovnik Jul 14, 2016
494caf0
Evaluation - scoring test compatibility fix.
sstanovnik Jul 14, 2016
f43eb2c
Intepret missing value markers when reading from file.
sstanovnik Jul 14, 2016
d2297f8
Impute pandas adaptation, with tests.
sstanovnik Jul 14, 2016
fc5b426
Fix transforming discrete ordinal values into descriptor values.
sstanovnik Jul 14, 2016
7dde951
Distributions should use weights instead of counts.
sstanovnik Jul 14, 2016
238a838
Miscellaneous test adaptations.
sstanovnik Jul 14, 2016
9d5093d
Convert normalization, use groupby instead of value_count in distribu…
sstanovnik Jul 14, 2016
72e2eb0
Add copying to table constructors, very important!
sstanovnik Jul 14, 2016
dc8027d
Migrate randomization to pandas.
sstanovnik Jul 15, 2016
7317045
A small fix for caching table transformations.
sstanovnik Jul 15, 2016
8c10f2a
Miscellaneous test compatibility fixes.
sstanovnik Jul 15, 2016
9d3cf40
Migrate remover and its tests.
sstanovnik Jul 15, 2016
d019e8d
Simple tree and softmax adaptation.
sstanovnik Jul 15, 2016
db2a2a9
Only allow one of specified delimiters when reading file.
sstanovnik Jul 15, 2016
2b53079
Remove Value tests.
sstanovnik Jul 15, 2016
55d233a
Miscellaneous table test compatibility fixes.
sstanovnik Jul 15, 2016
5cd878f
Use 0 instead of NA when values don't exist in distributions.
sstanovnik Jul 18, 2016
1fde4f2
Feature scoring test compatibility.
sstanovnik Jul 18, 2016
aa1d4c6
A bucketload of fixes for widgets.
sstanovnik Jul 18, 2016
3efd943
Fix owcontinuize to use proper continuization behaviour.
sstanovnik Jul 19, 2016
f5ee17d
Don't intepret None as a missing value when reading a table.
sstanovnik Jul 19, 2016
a44e1e4
Use proper top-level imports. D'oh!
sstanovnik Jul 19, 2016
14efa24
Use a more robust way of computing basic stats.
sstanovnik Jul 19, 2016
74bc375
A small fix for the new single-class test.
sstanovnik Jul 19, 2016
bc64dbc
Fixes for some elusive tests.
sstanovnik Jul 19, 2016
d21a3e8
Port SQLTable to a pandas backend. Some breaking changes.
sstanovnik Jul 21, 2016
23d42c4
Completely overhaul the Table class inheritance structure.
sstanovnik Jul 21, 2016
5c23db1
Fix some broken Table imports.
sstanovnik Jul 22, 2016
6004bed
Basic SparseTable functionality.
sstanovnik Jul 27, 2016
bc84af8
Distributions for sparse tables.
sstanovnik Jul 27, 2016
c190a5a
A snail-paced implementationof contingency computation for sparse
sstanovnik Jul 27, 2016
5864dd1
Improved the reading capabilities.
sstanovnik Jul 27, 2016
30b7f13
Fix elusive tests.
sstanovnik Jul 27, 2016
7e5b420
Use add numexpr to requirements-core.
sstanovnik Jul 27, 2016
9b94cb1
Use actual values instead of indices when constructing discretes.
sstanovnik Jul 28, 2016
c672319
Merge domain when not rowstacking concatenated tables.
sstanovnik Jul 28, 2016
645e50b
Widget test adaptation and widget fixes.
sstanovnik Jul 28, 2016
ed500db
REVIEWME: 'fixed' displaying SQL tables.
sstanovnik Jul 29, 2016
cd93a3f
Test fixes, remove sql.compat.Value
sstanovnik Jul 29, 2016
0d7ee88
Hopefully fix some strange failing tests.
sstanovnik Jul 29, 2016
5d2f8a7
Remove val_from_str_add.
sstanovnik Jul 29, 2016
430be00
Add to_var_col, a slightly optimized version of to_val.
sstanovnik Jul 29, 2016
a9f63c2
Remove TableBase.DENSE and related indicators.
sstanovnik Jul 29, 2016
e1f708d
Remove PanelBase and SparseTablePanel.
sstanovnik Jul 29, 2016
d17088e
Docstring bonanza!
sstanovnik Aug 1, 2016
265b590
Remove some old, unused, deprecated things from TableBase.
sstanovnik Aug 1, 2016
21bcc56
Remove variable.to_val_col.
sstanovnik Aug 1, 2016
6f0f16e
Documentation slightly updated.
sstanovnik Aug 1, 2016
0bd7160
Increase test coverage, some bugfixes.
sstanovnik Aug 2, 2016
9c2c6ba
Improve distributions, also coverage.
sstanovnik Aug 2, 2016
4f6993b
Increase contingency coverage.
sstanovnik Aug 2, 2016
caa733d
Fix OWHeatmap and its recent tests.
sstanovnik Aug 2, 2016
4843efc
Bump minimum version of pandas above 0.18.0.
sstanovnik Aug 9, 2016
b3031c9
Sparse fixes and improvements.
sstanovnik Aug 9, 2016
411d027
Excel sheet naming.
sstanovnik Aug 9, 2016
7ad4909
A multitude of fixes.
sstanovnik Aug 10, 2016
25c5c2b
A truckload of changes.
sstanovnik Aug 11, 2016
496f808
Remove Table.append.
sstanovnik Aug 11, 2016
4f50571
From list with missing class fixes, indent, lesser __setitem__ breakage.
sstanovnik Aug 11, 2016
96755ab
Remove the many missing-value replaces.
sstanovnik Aug 11, 2016
811e3e9
Prevent multiple calls to __init__.
sstanovnik Aug 11, 2016
8517bf2
Proper finalization and domain filtering.
sstanovnik Aug 12, 2016
41ae304
Custom __str__ and __repr__, needs some work.
sstanovnik Aug 12, 2016
33afcf0
REVIEWME: custom __iter__, iterates over rows, breaks pandas contract.
sstanovnik Aug 12, 2016
dfc1aba
Merge Data fix, new fun TableBase.merge method!
sstanovnik Aug 12, 2016
2e156fa
Fix venn diagram.
sstanovnik Aug 12, 2016
2add840
Fix Data Table.
sstanovnik Aug 12, 2016
ac6539d
Switch inputs from Table to TableBase.
sstanovnik Aug 12, 2016
98c6353
Much better __str__, uses pandas magic.
sstanovnik Aug 16, 2016
2fd14db
Except Orange instead of pandas behaviour in constructors.
sstanovnik Aug 16, 2016
78e9450
Use pure nnumpy ops for transforming discretes into categoricals.
sstanovnik Aug 16, 2016
287b049
Change usages of checksum to hash.
sstanovnik Aug 16, 2016
f2ef9ba
Add a notificatoin comment and test for the iterrows wrapper.
sstanovnik Aug 16, 2016
fc9d867
Remove shuffle in favour of .sample(frac=1).
sstanovnik Aug 16, 2016
b576a7e
Consolidate usages of the _transferer hack.
sstanovnik Aug 16, 2016
d0a03dd
Comments and tests to setUpClass, other test fixes.
sstanovnik Aug 16, 2016
118a8d1
Add time component awareness to TimeVariable.
sstanovnik Aug 17, 2016
a70e2c5
Fix a failing doctest.
sstanovnik Aug 17, 2016
8dbc1c7
Improve time column display with month and day.
sstanovnik Aug 19, 2016
b22e0d8
Add a pandas git build to travis.
sstanovnik Aug 19, 2016
98b745f
Some general fixes, report test fixes.
sstanovnik Aug 19, 2016
50e0989
Requirements.txt requires a different requirement format.
sstanovnik Aug 19, 2016
b561d2f
Further improvements to the documentation.
sstanovnik Aug 19, 2016
27c596f
Revert 68b18c5: overriding __iter__.
sstanovnik Aug 19, 2016
6ea711a
Simplify weight assignment.
sstanovnik Aug 19, 2016
06bcc77
Cherry-pick: sstanovnik/orange3:benches.
sstanovnik Aug 19, 2016
2bfca35
Weight setting robustness.
sstanovnik Aug 21, 2016
6460ab0
Properer sparse handling.
sstanovnik Aug 21, 2016
259bfc1
Always convert weights to floats on assignment.
sstanovnik Aug 26, 2016
ac75022
Fix visualizing continuous variables in Data Table.
sstanovnik Aug 26, 2016
3317ff0
Significantly improve feature constructor performance.
sstanovnik Aug 26, 2016
fc48858
Domain editor fix and file reader hardening.
sstanovnik Aug 26, 2016
3e6030f
Fix a failing owkmeans test.
sstanovnik Aug 26, 2016
Further improvements to the documentation.
sstanovnik committed Aug 26, 2016
commit b561d2fd4e68b9480cbc1dc500cd2269c47e8d76
2 changes: 1 addition & 1 deletion doc/data-mining-library/source/reference/data.domain.rst
Original file line number Diff line number Diff line change
@@ -157,7 +157,7 @@ Domain conversion
In a typical scenario, we may want to discretize some continuous data before
inducing a model. Discretizers (:mod:`Orange.preprocess`)
construct a new data table with attribute descriptors
(:class:`Orange.data.variable`), that include the corresponding functions
(:class:`Orange.data.Variable`), that include the corresponding functions
for conversion from continuous to discrete values. The trained model stores
this domain descriptor and uses it to convert instances from the original
domain to the discretized one at prediction phase.
25 changes: 11 additions & 14 deletions doc/data-mining-library/source/reference/data.rst
@@ -24,26 +24,23 @@ variable's name, symbolic values, number of decimals in printouts and similar.

The data is divided into attributes (features, independent variables), class
variables (classes, targets, outcomes, dependent variables) and meta
attributes. This division applies to domain descriptions, data storages that
contain separate arrays for each of the three parts of the data and data
instances.
attributes. This division applies to domain descriptions, which logically separate
a :obj:`Orange.data.Table` into three parts, corresponding to the roles.

Attributes and classes are represented with numeric values and are used in
modelling. Meta attributes contain additional data which may be of any type.
(Currently, only string values are supported in addition to continuous and
numeric.)

In indexing, columns can be referred to by their names,
descriptors or an integer index. For example, if `inst` is a data instance
and `var` is a descriptor of type :obj:`~Orange.data.Continuous`, referring to
the first column in the data, which is also names "petal length", then
`inst[var]`, `inst[0]` and `inst["petal length"]` refer to the first value
of the instance. Negative indices are used for meta attributes, starting with
-1.

Continuous and discrete values can be represented by any numerical type; by
default, Orange uses double precision (64-bit) floats. Discrete values are
represented by whole numbers.
Indexing is inherited from :obj:`pandas`. This means using `.loc` and `.iloc`
to access rows and slice the table, the same as :obj:`pandas` does. The only
difference is that Orange uses globally unique indexing, where different instances
of :obj:`Orange.data.Table` have different :obj:`pandas` indices, used with `.loc`.
Columns are accessed either with domain variables or their names.
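
The difference between label-based `.loc` and position-based `.iloc` can be sketched with plain `pandas` frames. This is only an illustration: the two frames below stand in for two tables, and their non-overlapping index ranges are an assumption that imitates the globally unique row labels described above, not the way Orange assigns them.

```python
import pandas as pd

# Two plain DataFrames standing in for two tables. The non-overlapping
# index ranges are an assumption that imitates globally unique row labels.
first = pd.DataFrame({"petal length": [1.4, 1.3, 4.7]}, index=[0, 1, 2])
second = pd.DataFrame({"petal length": [5.1, 5.9]}, index=[3, 4])

# .iloc is positional: position 0 exists in every non-empty frame.
first_by_position = second.iloc[0, 0]

# .loc is label-based: the first row of `second` carries label 3, not 0.
first_by_label = second.loc[3, "petal length"]

# Label 0 belongs to `first` only, so looking it up in `second` fails.
try:
    second.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False

# Selecting a column by name yields a Series.
column = first["petal length"]
```

With unique labels across tables, a `.loc` lookup identifies a row regardless of which table it came from, while `.iloc` always counts from the start of the table at hand.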

Orange stores data directly in the :obj:`Orange.data.Table` exactly like `pandas` does,
and the `.X`, `.Y` and `.metas` descriptors convert the raw data into a float format,
suitable for learning. Columns with :obj:`Orange.data.StringVariable` remain as strings.
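
As a rough sketch of what such a conversion could look like, the example below splits a plain `pandas.DataFrame` into float arrays by role. The `roles` dictionary and the `to_float_block` helper are hypothetical names introduced for illustration; in Orange the roles come from the domain, and the actual conversion code may differ.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "petal length": [1.4, 4.7, 6.0],
    "iris": pd.Categorical(["setosa", "versicolor", "virginica"]),
    "note": ["a", "b", "c"],
})

# Hypothetical role assignment; in Orange this would come from the Domain.
roles = {"attributes": ["petal length"], "class_vars": ["iris"], "metas": ["note"]}


def to_float_block(df, columns):
    """Return a 2D float array; categorical columns become integer codes."""
    block = np.empty((len(df), len(columns)), dtype=float)
    for j, name in enumerate(columns):
        col = df[name]
        if isinstance(col.dtype, pd.CategoricalDtype):
            values = col.cat.codes.to_numpy(dtype=float)
            values[values < 0] = np.nan  # pandas marks missing categories with -1
        else:
            values = col.to_numpy(dtype=float)
        block[:, j] = values
    return block


X = to_float_block(frame, roles["attributes"])
Y = to_float_block(frame, roles["class_vars"])
# String columns are kept as Python objects rather than converted to floats.
metas = frame[roles["metas"]].to_numpy(dtype=object)
# A role without columns yields an array with zero columns.
empty = to_float_block(frame, [])
```

The raw frame keeps the original values, and the float arrays are derived from it only when a learner needs them.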

.. toctree::
:maxdepth: 2
78 changes: 33 additions & 45 deletions doc/data-mining-library/source/reference/data.table.rst
@@ -7,63 +7,34 @@ Data Table (``table``)
.. autoclass:: Orange.data.Table
:members: columns

Stores data instances as a set of 2d tables representing the independent
Stores data instances in a dense :obj:`pandas.DataFrame` representing the independent
variables (attributes, features) and dependent variables
(classes, targets), and the corresponding weights and meta attributes.

The data is stored in 2d numpy arrays :obj:`X`, :obj:`Y`, :obj:`W`,
:obj:`metas`. The arrays may be dense or sparse. All arrays have the same
2D numpy arrays :obj:`X`, :obj:`Y`, :obj:`W`, :obj:`metas` can be generated.
The arrays may be dense or sparse. All arrays have the same
number of rows. If certain data is missing, the corresponding array has
zero columns.

Arrays can be of any type; default is `float` (that is, double precision).
Values of discrete variables are stored as whole numbers.
Arrays for meta attributes usually contain instances of `object`.

The table also stores the associated information about the variables
as an instance of :obj:`Domain`. The number of columns must match the
corresponding number of variables in the description.

There are multiple ways to get values or entire rows of the table.
Indexing the table works the same as in `pandas`. In a nutshell:

- The index can be an int, e.g. `table[7]`; the corresponding row is
returned as an instance of :obj:`RowInstance`.
- Use `table.iloc[i]` for position-based indexing.
- Use `table.loc[i]` for index-based indexing.
- Both accept tuples of `(row_index, column_name)` or slices where appropriate.
- Use `table[colname]` to get columns.

- The index can be a slice or a sequence of ints (e.g. `table[7:10]` or
`table[[7, 42, 15]]`, indexing returns a new data table with the
selected rows.
One-dimensional alternatives are called `Series`.

- If there are two indices, where the first is an int (a row number) and
the second can be interpreted as columns, e.g. `table[3, 5]` or
`table[3, 'gender']` or `table[3, y]` (where `y` is an instance of
:obj:`~Orange.data.Variable`), a single value is returned as an instance
of :obj:`~Orange.data.Value`.

- In all other cases, the first index should be a row index, a slice or
a sequence, and the second index, which represent a set of columns,
should be an int, a slice, a sequence or a numpy array. The result is
a new table with a new domain.

Rules for setting the data are as follows.

- If there is a single index (an `int`, `slice`, or a sequence of row
indices) and the value being set is a single scalar, all
attributes (not including the classes) are set to that value. That
is, `table[r] = v` is equivalent to `table.X[r] = v`.

- If there is a single index and the value is a data instance
(:obj:`Orange.data.Instance`), it is converted into the table's domain
and set to the corresponding rows.

- Final option for a single index is that the value is a sequence whose
length equals the number of attributes and target variables. The
corresponding rows are set; meta attributes are set to unknowns.

- For two indices, the row can again be given as a single `int`, a
`slice` or a sequence of indices. Column indices can be a single
`int`, `str` or :obj:`Orange.data.Variable`, a sequence of them,
a `slice` or any iterable. The value can be a single value, or a
sequence of appropriate length.
Setting data works the same as in `pandas`. Take care not to chain indexers:
`table.iloc[0].iloc[0]` won't work and shouldn't be used;
use `table.iloc[0, 0]` instead.
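
A minimal sketch of getting and setting values through a single indexer call, using a plain `pandas.DataFrame` as a stand-in for the table (the column names and row labels are made up for the example):

```python
import pandas as pd

# A plain DataFrame standing in for the table; labels 10..12 are arbitrary.
table = pd.DataFrame(
    {"sepal length": [5.1, 4.9, 4.7], "sepal width": [3.5, 3.0, 3.2]},
    index=[10, 11, 12],
)

# Position-based access: row 0, column 0.
value_by_position = table.iloc[0, 0]
# Label-based access: row label 11, column "sepal width".
value_by_label = table.loc[11, "sepal width"]

# Setting through one indexer call writes into the table itself.
table.iloc[0, 0] = 6.0
table.loc[12, "sepal width"] = 2.8

# A label slice (inclusive on both ends) plus a column name gives a Series.
rows = table.loc[10:11, "sepal length"]
```

Passing the row and column together in one call avoids the intermediate object that a chained `table.iloc[0].iloc[0]` creates, which is why assignments through the chained form may never reach the table.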

.. attribute:: domain

@@ -79,23 +50,40 @@ The preferred way to construct a table is to invoke a named constructor.

.. automethod:: Table.from_domain
.. automethod:: Table.from_table
.. automethod:: Table.from_dataframe
.. automethod:: Table.from_numpy
.. automethod:: Table.from_list
.. automethod:: Table.from_file
.. automethod:: Table.from_url

Getting Data
------------
.. automethod:: Table.X
.. automethod:: Table.Y
.. automethod:: Table.metas
.. automethod:: Table.weights

Inspection
----------

.. automethod:: Table.has_weights
.. automethod:: Table.approx_len
.. automethod:: Table.exact_len
.. automethod:: Table.has_missing
.. automethod:: Table.has_missing_class
.. automethod:: Table.density
.. automethod:: Table.is_dense
.. automethod:: Table.is_sparse

Row manipulation
----------------
Manipulation
------------

.. automethod:: Table.clear
.. automethod:: Table.concatenate
.. automethod:: Table.merge

Weights
-------

.. automethod:: Table.weights
.. automethod:: Table.set_weights

Aggregators