Pandas Migration

Orange.data.Table -> pd.DataFrame migration

Part of Google Summer of Code 2016, referent: @sstanovnik.

This page aims to be the progress monitor and later API migration instructions for the pandas migration. See sstanovnik/orange3:koalas for the development branch. See the pull request for code comments.

Goals

  • Compatibility with existing API where sensible, deprecation notices where needed.
  • The new Table is a pd.DataFrame.
  • SQL data source support (not strictly pandas)
  • Nice code docs.

Progress, TODOs, Functionality mapping table

See this Google Sheet for something with more flexibility than this wiki's tables. It is basically a development timeline, including both research and TODOs (with detailed notes). Useful for exploring decisions in-depth.

Scripting-facing changes

  • Orange now uses the same indexing as pandas. I'll leave the explanation to their documentation.
  • If you try to change values in table.X, table.Y or table.metas, you'll get an error. Data is now set directly, the pandas way. See the above indexing link.
  • The table contains actual values directly, instead of e.g. integer indices of a discrete variable's values. This is much more intuitive, and the numeric data is transparently computed when using X, Y or metas.
  • Read up on pandas' Categorical here; it is used for discrete variables, so familiarize yourself with it if you will be doing anything in-depth with them.
  • table.W has been replaced with table.weights.
  • RowInstances have been removed. The new equivalent is TableSeries, which works similarly to Table for learning purposes.
  • Filtering rows is no longer done with Filters, but instead with pandas' filtering. Much better.
  • A Table does not have an implicit bool value, so if table does not work. Use explicit checks like if table is not None, if not len(table) and so on.
  • Weights are now set through Table.set_weights, which accepts a scalar, a sequence, or an existing column name.
  • Datetime operations on TimeVariable columns are now backed by pandas, which means they're much better than before.
  • Sparse support is much better! However, the whole SparseTable is sparse now, not just the X part. X, Y and metas return dense matrices if they contain StringVariables - this is a scipy limitation.
  • Variable.to_val now transforms a descriptor into its numeric value: e.g. "male" into 1 where variable.values == ["female", "male"].
  • Table.get_column_view is no more, replaced by pandas' own attribute column access. If the column name is a valid Python identifier, table.column_name is the same as table["column_name"].
  • Table.checksum was removed in favour of hash(Table).
  • Table.shuffle was removed in favour of pandas' t.sample(frac=1).
  • A new Table.merge method was added - a wrapper for pd.DataFrame.merge which handles internal columns and the domain.
  • No more Table.from_table_rows, use pandas indexing and slicing. (Several of these changes are pulled together in the sketch after this list.)
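
A hypothetical sketch combining several of these changes, using the bundled iris dataset; it follows the descriptions above, not a released API:

```python
from Orange.data import Table

t = Table("iris")

# the table holds actual values, so pandas-style filtering works directly
setosa = t[t["iris"] == "Iris-setosa"]

# weights are always present; set them with a scalar, sequence, or column name
t.set_weights(1.0)

# Table.shuffle and Table.checksum are gone
shuffled = t.sample(frac=1)
digest = hash(t)
```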

Migration and developer guide

When this project is "complete", the work will not be over. What follows is a general guide to porting existing code to the new API (because of several breaking changes), what to be mindful of when fixing bugs that could be related to pandas, as well as some TODOs that will likely have to be completed to make the product fully stable.

Porting existing code & general guide

New additions and benefits

As expected from a porting project, not much functionality was explicitly added. However, you get to use all of pandas' functionality for time series, DataFrame handling (e.g. merge is wrapped to handle Orange specifics), filtering, sparse features and other things. Expect a more stable codebase with new features and bugfixes being added by proxy by pandas.

Subclassing Table and SparseTable

Read the pandas subclassing instructions for an intro to subclassing pandas. This section only mentions Orange specifics.

Extend _metadata (do not overwrite it) if you want to specify any additional attributes. Any custom attribute, no matter how temporary, must be included, otherwise it won't work. If you define custom column names, include them in _INTERNAL_COLUMN_NAMES, which, again, you need to extend. Use (and maybe modify) _is_orange_construction_path to pass through constructions that weren't invoked explicitly - pandas uses the constructor every time we want a subset or anything similar.

Indexing

Indexing as it was pre-pandas doesn't exist any more. It is replaced by pure pandas indexing. Read up on it here, especially the different behaviours and selectors (like .ix) and the notion of integer position and the index. In short, t.iloc[i] gets the ith row of the table, whereas t.loc[i] returns the row where the index value (in pandas not necessarily an integer) is i. With unique indexing in Orange (explained later), the latter almost never works. Boolean indexing works with either loc, iloc or just t[...].
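
A minimal sketch of the selectors, assuming a table t with a column named sepal_length:

```python
first_row = t.iloc[0]                  # positional: the first row
row_by_label = t.loc[42]               # label-based: the row whose index value is 42
subset = t[t["sepal_length"] > 5.0]    # boolean indexing works with plain t[...]
```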

The default getitem (and setitem) work on columns. t[0] does not return the first row of the table, but the column with the name 0. See iterating for an elaboration on this.

The implementation before pandas had a notion of a unique index (think _init_ids). The pandas implementation formalizes that further and sets the pandas index (think .loc) to be globally unique. This is done transparently, no modification needed, and works with subsets and such. If tests fail when run in a group, but not by themselves, it's almost always an indexing issue, as that is an omnipresent global state.

The return type depends on the selector. Selecting a single row with t.iloc[0] yields a Series. Selecting a single row with t.iloc[[0]] or multiple rows with t.iloc[list_of_positions] yields a Table. This distinction is largely inconsequential (read the series section below), but it is important to keep in mind.
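
For illustration, the same positions selected both ways:

```python
row_series = t.iloc[0]           # a single position -> a (Table)Series
one_row = t.iloc[[0]]            # a one-element list -> a Table with one row
three_rows = t.iloc[[0, 2, 5]]   # a list of positions -> a Table with three rows
```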

Columns as properties

For columns with valid Python names, attribute indexing works. Example: for a column named sex, t.sex works and is equal to t["sex"] and t[t.domain["sex"]]. However, for a column named "column with space", there is no attribute access, not even with spaces transformed to underscores.
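
In code, assuming the sex column from the example above:

```python
col = t.sex                    # attribute access
same = t["sex"]                # equivalent
also = t[t.domain["sex"]]      # Variables extend str, so this works too
only = t["column with space"]  # bracket access is the only way for such names
```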

Copies and views (indexing cont'd)

While Tables should be immutable, sometimes one would like to change a value (or a subset of them). Setting the top-left-most element can be done in several ways: t.iloc[0, 0], t.loc[t.index[0], t.columns[0]], or with the explicit index and column if available. This can NOT be done with t.iloc[0]["col1"], t.iloc[0].loc["col1"], t.iloc[0].iloc[0] or similar, because in pandas, chained indexing like that most likely creates a copy of the object (but not always).
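
A sketch of the safe and unsafe spellings, with col1 as a stand-in column name:

```python
# safe: a single indexing operation
t.iloc[0, 0] = 42
t.loc[t.index[0], t.columns[0]] = 42

# NOT safe: chained indexing may write to a temporary copy
# t.iloc[0]["col1"] = 42
# t.iloc[0].iloc[0] = 42
```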

Implicit bool

A pd.DataFrame has no implicit boolean value. Do not write if data or if not data, instead use if data is not None or if data.isnull().any().any() or similar. Most widgets have this error.
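
The recommended spellings, sketched:

```python
if table is not None and len(table):
    ...   # table exists and is non-empty

if table.isnull().any().any():
    ...   # table contains at least one missing value
```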

Iterating

This topic is undecided; iteration may yet preserve the previous Orange behaviour if that is decided upon.

Iterating through a pd.DataFrame gives column names, not rows, consistent with t[...] selection. However, len(table) gives the number of rows in the table. Blame pandas. To iterate over rows, use t.iterrows() which yields tuples of (row_index, row_series). See iterrows, itertuples and iteritems for alternatives.
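
A short sketch of both behaviours:

```python
for name in t:                    # iterates over COLUMN names
    print(name)

for index, row in t.iterrows():   # iterates over rows
    print(index, row.X)           # row is a TableSeries, so .X etc. work
```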

Domain and columns

The domain hasn't changed much. What has changed is that Variable now extends str, which allows Variables to be used in table column names and for indexing with col = t[variable]. Column names are pure strings by default, but can be Variables; this is discouraged, because strings work just as well if not better.

TableSeries

In pandas, Series objects are one-dimensional (where DataFrame is 2D and Panel is 3D). They can be either rows or columns and behave the same. They're almost exactly like the previous RowInstance and provide .X, .Y, .metas and .weights for compatibility. The domain is transparently passed to the series when selected from a Table.

Class structure

The data class structure has changed significantly. See this picture for a general overview. Don't use from Orange.data.table import Table, but just from Orange.data import x, as everything is pulled to that level.

TableBase and SeriesBase are now the base classes, like the now-defunct Storage. This is important for widget connections. All Orange data storage should extend from those. Table is dense, SparseTable is sparse, but SparseTable does not extend Table. SqlTable extends Table. Corpus extends SparseTable.

Weights and other special columns

Some special columns have been introduced. To simplify just about everything, weights are now always present and default to 1. The column isn't included in the domain, but instead just exists in the table columns as TableBase._WEIGHTS_COLUMN_NAME, currently __weights__. This means weights are automatically transferred to subsets, and changing weights on a view also changes them in the parent.

Corpus has some other special columns, everything should be included in cls._INTERNAL_COLUMN_NAMES.

Weights are now accessible solely by .weights, no longer .W.
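
A sketch of the set_weights variants described earlier (the column name here is an assumption):

```python
t.set_weights(1.0)                # scalar: every row gets weight 1
t.set_weights(range(len(t)))      # sequence: one weight per row
t.set_weights("sepal_length")     # an existing column's values become the weights
w = t.weights                     # always a 1D array; .W is gone
```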

X/Y/metas/weights, values and about setting data

Previously, all data was held in np.ndarrays as X/Y/metas/weights. Now, these are computed properties with data stored in the table itself. An important distinction is that the table now holds actual values, not their indices and such. Example: for a discrete variable column, the table now contains male and female, not 0 and 1. This allows t[t.sex == 'male'].

X, Y, metas and weights are computed properties which do not give the variable descriptors, but rather the computed values, the same as they did before. Taking the above example, t.Y with a class_var of DiscreteVariable("sex") would return a column of zeros and ones. The returned np.ndarrays are marked as immutable, and attempting to mutate them raises an error.

As before, X and metas always return 2D matrices; Y returns a 1D array if there is only one class variable and a 2D matrix otherwise. weights always returns 1D, never None.

I feel like this is a very major change that needs to be emphasised, emphasised, emphasised: Table.X, Table.Y, Table.metas and Table.weights are read-only, computed properties, and they cannot be set. Set or modify individual columns the pandas way if you need to.
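
To make the distinction concrete (a sketch, reusing the sex example):

```python
y = t.Y                                # computed numeric values, e.g. zeros and ones
# y[0] = 1                             # would raise: the returned array is read-only
t.loc[t.index[0], "sex"] = "female"    # modify the underlying column instead
```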

Categoricals

Discrete variables are automatically converted into pandas Categorical columns. This is the equivalent of R's factor and tightly constrains a column to its predefined values. As such, you cannot set an element (or a row) to a value that doesn't exist in that column's registered values, at least not without modifying those first. Upon creation, DiscreteVariable.values is synced with the column's allowed values, with the proper ordering flag. Care must be taken when appending rows to tables, but this shouldn't pose a problem in most cases.
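
A sketch of the constraint, again with the sex column; cat.add_categories is plain pandas:

```python
t.loc[t.index[0], "sex"] = "male"      # fine: "male" is a registered value
# t.loc[t.index[0], "sex"] = "other"   # would fail: not among the categories

# extend the categories first to allow a new value
t["sex"] = t["sex"].cat.add_categories(["other"])
```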

Time datatypes and ops

pandas has native datetime columns, in the same way as the categorical column type is category. This allows a whole bunch of nice temporal processing functionality to be inherited from pandas. Also, the following works: t[t.timecol > '2015-06-07'].

Time variables are automatically parsed and their time zones registered upon creating/reading a table. Other functionality is inherited directly from pandas. The numeric value is Unix time in seconds; internally, in the table, time values are represented by a native pandas data type.
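
For example, with the timecol column from above, plain pandas datetime tooling applies:

```python
recent = t[t.timecol > '2015-06-07']   # comparison against a date string
years = t.timecol.dt.year              # the .dt accessor is inherited from pandas
```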

Filters

Filters have been removed; use pandas filtering now. For Select Rows, a filter shim required by the GUI was constructed.

Sparse

An important note right off the bat: proper sparse functionality requires (the as yet unreleased) pandas 0.19.0, which fixes a bucketload of bugs and adds proper multi-type containers.

Sparse now very likely works better than before. A big difference is that the entire table is sparse, not only X. This unifies behaviour for likely negligible performance loss. A good migration example is in biolab/orange3-text#97.

Sparsity and density

Tables now use .is_sparse and .is_dense instead of .density returning some not-quite-enum. The previous Table.DENSE|SPARSE|SPARSE_BOOL do not exist any more.

For both dense and sparse tables, density is defined through the number of undefined values in the table. For dense tables, this is computed with .isnull(); for sparse tables, the default pandas functionality is used, which takes fill_value (always np.nan for us) into account.

pandas compatibility notes, subclassing

Because Orange historically broke __new__ to enable a MaGiC signature, some compatibility shims had to be employed to conform to pandas. __new__ and __init__ check if they are being called from pandas internals in quite a roundabout way: checking for pandas-only args and kwargs signatures (detailed description in the comment block above the relevant code). If so, the Orange constructor is skipped entirely and only the pandas part is used.

Why do we need this complexity? Because pandas calls the constructor fairly regularly (it even has something like this in its own code called fastpath), and because each slicing or DataFrame op returns a new object (with possibly shared data) - and to do that, it calls the constructor. To transfer attributes and such, pandas has __finalize__ (which is not native Python).

When subclassing pandas objects, you have to override _constructor, _constructor_sliced for dimensionality reduction and _constructor_expanddim for dimensionality expansion. These normally just return the respective classes, but we included a domain transfer mechanism so SeriesBase objects retain TableBase domains and properties, enabling X/Y/metas/weights.

To transfer custom properties when slicing etc., all property names must be added to _metadata. After this, pandas will automagically transfer those properties, but not when using some pandas global functions, such as pd.concat. When subclassing Table or similar, remember to extend the parent's _metadata, not overwrite it.

The MRO of the new classes is Orange first, pandas second. This means you can override any 2D pandas functionality by redefining in TableBase.
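
A hypothetical subclassing sketch following these rules (MyTable, my_attribute and __my_column__ are made-up names):

```python
from Orange.data import Table

class MyTable(Table):
    # extend, never overwrite, the parent's lists
    _metadata = Table._metadata + ["my_attribute"]
    _INTERNAL_COLUMN_NAMES = Table._INTERNAL_COLUMN_NAMES + ["__my_column__"]

    @property
    def _constructor(self):
        # pandas calls this on every slice or DataFrame op
        return MyTable
```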

From pd.DataFrame to Table

Generally, Table(df) works via Table.from_dataframe. This infers the domain, but you can pass one to skip that.
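
A minimal sketch (the domain keyword is an assumption based on the sentence above):

```python
import pandas as pd
from Orange.data import Table

df = pd.DataFrame({"sepal_length": [5.1, 4.9],
                   "iris": ["Iris-setosa", "Iris-setosa"]})
t = Table(df)                         # goes through Table.from_dataframe, infers the domain
# t = Table(df, domain=some_domain)   # pass an existing domain to skip inference
```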

Where has the Value gone? Also Instance. And .columns.

Unneeded. Value was a remnant of C; Instance (RowInstance in particular) now has an analog in SeriesBase. .columns is not needed because pandas kind of already supports that with attribute access.

SQL things

This is the ugliest part of the entire project. I haven't refactored much except what was urgently needed, because there is no future with this approach, which falls apart when you poke it in the wrong way. len(SqlTable) now returns the number of downloaded and stored rows, not the length of the database. This was needed because the data is now stored in pandas and its internals require len to be sensible. We have approx_len and exact_len for database lengths.

A lot less cython

A lot of cython code was removed and replaced with pandas. This includes contingencies, value counts, distributions and statistics.

IO now uses pandas

Should be faster. Also used for Excel files. A slightly different approach is used to try to infer the Orange header: first, the first 3 rows are read, then the rest. Check out CsvReader in data/io.py if you want to know more.

What to look for when fixing bugs

  • When a test fails when run in a group, but not when run by itself, it's an indexing issue (see the global indexing section).
  • When there is a column mismatch, check for columns from _INTERNAL_COLUMN_NAMES.
  • Are you constructing matrices (ad-hoc) with numeric values of otherwise textual discrete attributes? Don't do that. Also keep in mind that mixing strings and numbers in numpy arrays forces the whole dtype to object; use an ad-hoc list instead.
  • Tables don't have an implicit boolean value; as a consequence, assertEqual(t1, t2) doesn't work because it takes the implicit bool of a boolean matrix.
  • Try not to use "?" for invalid values, use np.nan.
  • Using proper pandas indexing? See indexing above.
  • Iterating properly? See iterating above.
  • Comparison of Tables with .equals failing? Check if the indexes are different, likely due to global indexing. If so, override one of them.
  • Can't read a small, hand-written CSV file in a test? Use @mock.patch("Orange.data.io.CSVReader.DELIMITERS", your_delimiter) to force the sniffer to output a specific delimiter.
  • Use setUp instead of setUpClass to avoid funny inter-test state.
  • Does pandas maybe not return the correct data type? It could be a pandas bug, subclassing isn't completely there yet.

List of API breakages and incompatibilities

In rough order of prominence, descending.

  • Indexing. Use .loc, .iloc and similar, see the dev note for more.
  • Tables now hold actual values instead of integer/float descriptors.
  • Variable.to_val converts from a descriptor to a value used in X/Y/metas.
  • X/Y/metas aren't settable any more, are computed properties (not even views).
  • Iterating doesn't work the way you'd expect, see the dev note.
  • No more .W, always .weights.
  • No implicit boolean value for the table.
  • Were in-place, are not any more: Corpus.extend_corpus.
  • Removed Table.from_table_rows, use proper pandas indexing and slicing.
  • "Deleting" rows works by selecting an opposite subset, not by del t[i].
  • Inserting and extending does not exist any more, as the row order is not important - use Table.concatenate.
  • Due to the use of Categoricals, discrete variables are much more constrained.
  • No more Filters, Instances (except in sql/compat, ugly) and Values.
  • SeriesBase has .X, not .x like RowInstance had previously.
  • len(SqlTable) now returns the number of downloaded and stored rows, not the length of the database.
  • Table.checksum removed, use hash(Table).
  • Table.shuffle removed, use t.sample(frac=1).

Future TODOs and architectural wishlist

  • Go through the whole codebase and see where there could be very nice pandas functions used, instead of some weird workarounds using numpy and nulls. This is a big endeavour, but it would speed up Orange and ensure stability.
  • Remove the whole SqlTable shebang and create a new Apache Spark addon.
    • Completely separate widgets, with a transformation/sampler widget to transform the Spark structure into a Table.
    • Don't try to maintain compatibility with Table, it's too much work. Create a new table-like structure that encloses Spark's DataFrame (not compatible with pandas') and has a domain and other needed things.
    • The most work would likely be with widgets, so some plan on how to use existing visualization widgets (with subclassing) would be nice. Just changing setData would be ideal.
  • Fix basic stats, distributions, contingency. They have weird call patterns with even weirder structs; decide how to clean that up. pandas has .describe(), so use that.
    • Contingency may need to be directed back to the cython implementation for speed.
  • Domain inference for sparse matrices.
  • Performance improvements:
    • See how many copies of tables are made. Maybe too many, because of some old pipeline that now copies everything unnecessarily - likely due to the fact X, Y and metas aren't primary storage any more.
    • LOO has problems. Actually, all parallel things do. Does importing Orange take a while, and is that why things are slow?
  • Any can't-see-the-forest-for-the-trees issues that I missed.
  • See what has changed in pandas 0.19.0 and adapt as needed.
  • Transform .Y to return 2D.
  • Improve TimeVariable processing (timezone discovery could likely be faster).
  • Check widgets' inputs, some things may have broken because of the table class structure change.