- Fixes
- allow single columns or expressions in materialize #2249
- arrow data used in selection would ignore null values or fail #2196
- build expressions / filters with an arrow string scalar #2244
- selection-dropna did not work with non-identifier expressions #2208
- use `vaex.settings` for thread counts #2231
- raise an informative exception when `extract` can not run #2232
- Performance
- Import and version checking improvements #2226
- Features
- store arrow arrays in hdf5 using null bitmasks #2245
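
For readers unfamiliar with null bitmasks: Arrow marks missing values with a validity bitmap, one bit per element, least-significant bit first, where a 1 bit means the value is present. The sketch below only illustrates that layout in plain Python. The helper names `pack_validity` and `is_valid` are hypothetical, and this is not a description of how vaex writes the hdf5 file.

```python
def pack_validity(values):
    """Pack one validity bit per element (1 = present), LSB first within each byte."""
    bitmap = bytearray((len(values) + 7) // 8)
    for i, value in enumerate(values):
        if value is not None:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)


def is_valid(bitmap, i):
    """Return True when element i is marked as present in the bitmap."""
    return bool((bitmap[i // 8] >> (i % 8)) & 1)


values = [1, None, 3, 4, None, 6, 7, 8, None]
bitmap = pack_validity(values)  # 9 elements fit in 2 bytes
recovered = [is_valid(bitmap, i) for i in range(len(values))]
```

Nine values need only two bytes of mask, compared with nine bytes for a byte-per-element boolean mask.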
- Features
- Fixes
- Features
- Fixes
Requires vaex-core 4.13.0 for refactor of dataset
- Fixes
- correctly place the colorbar for matplotlib 3.6.0 #2215
- Fixes
- Typo in eq2gal #2206
- Features
- get_column_names accepts a dtypes argument #2160
- Fixes
- df.extract() was not thread safe #2182
- uuid4 function was not always restored properly #2181
- groupby could overflow due to wrong downcasting #2137
- support unique with selection=True #2164
- value_counts for strings was sometimes off #2147
- better arrow support for interchanging categorical columns #2135
- Fixes
- Improve selection behaviour for histogram and update docstrings #2143
- Fix
- Fix
- df.func.where relies more on pyarrow 5's if_else #2096
- correct $VAEX_PATH_HOME -> $VAEX_PATH #2101
- Various join fixes when missing values were present #808
- string join on large_list with large_strings. #2112
- Working arm wheel for osx (#) #2124
- Performance
- Do not let arrow validate the dict encoded data. d6242090a1f480abae669bc5281e803fe06c5d36
- Feature
- Add how to dropna, dropinf etc #2104
- Performance
- Do not let arrow validate the dict encoded data. d6242090a1f480abae669bc5281e803fe06c5d36
- Fix
- Join issue with missing values or nans #2077
- Feature
- Performance
- Value_counts uses a task to get caching support #2085.
- Features
- Enable selections in metrics #2073
- Fix
- Write to cached filesystem when metadata argument is needed #1993
- Multi-d sparse groupby would fail for arrow data (e.g. list agg) #2031
- Exporting arrow with large_string would result in schema conflict #2030
- expression engine did not roundtrip dict correctly, missing ", " #2039
- Changed deprecated numpy.float to numpy.float64 #2023
- Replace pylab with pyplot #2047
- isin should accept empty array or non-existing values #2064
- Ordinal_encode with values containing extra entries gave wrong results #2059
- Combining filters with arrow arrays failed converting (gave TypeError) #2038
- Wrong order of casting and subtracting offset caused overflow #2065
- Fix
- Do not keep a reference to numpy arrays on closing an hdf5 file #2066
- Fix
- Features
- Progress bar for percentile_approx and median_approx #1889
- Better casting of strings to datetime #1920
- We better support numpy scalars now, and more arrow time units. #1921
- Allow sorting by strings, multiple columns and multiple directions #1963
- Support JSON in df.export #1974
- New/better aggregators
- Pre-sort by the grouping columns in df.groupby (better performance) #1990
- Performance
- Fix
- Respect row_limit when the groupby is dense #1894
- Fingerprint collision possible if filter uses virtual column #1949
- Apply with filtered data could give wrong dtypes #1936
- Strings array growing failed when first string was zero length #1956
- Use fewer processes when using multiprocessing. #1979
- Support chunked arrays and empty chunks in value counts. #1958 #1975
- Allow renaming of functions, so joins work with functions without name collisions. #1966
- Join would fail if the rhs had no columns besides the join one #2010
- hdf5 export fails for concat df with missing columns #1493
- Allow `col` as column name #1992
- Features
- Multiple example datasets provided in `vaex.datasets` #1317
- We do not use asyncio for the default sync execute path #1783
- Executor works with asyncio with multiple tasks #1784
- Auto execute context manager makes vaex behave normal with await #1785
- Support exporting arrow and parquet to file like objects #1790
- Put lock files in $VAEX_HOME/lock #1797
- Show progress when converting the included datasets #1798
- Limit and limit_raise for unique and nunique #1801
- Lazy ordinal encode #1813
- Configure logging using settings system #1811
- Export to JSON #1789
- Progress bar can be configured using settings system #1815
- fillna and fillmissing should upcast integers when needed #1869
- Performance
- Fix
- Support empty parquet and arrow files #1791
- Keep virtual column order when renaming/dropping to not break state transfer #1788
- blake3 compatibility issues #1818 db527a6
- Avoid frozendict 2.2.0 which can segfault on Python 3.6 #1856
- Use label instead of expression for non-ident column names in binby #1842
- Development
- Features
- Support storing Arrow Dictionary encoded/categoricals in hdf5 #1814
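
As background for the dictionary-encoded (categorical) storage mentioned above, here is a minimal sketch of what dictionary encoding means: each distinct value is stored once, and the column itself becomes a list of integer indices into that dictionary. The function names are hypothetical and the code is plain Python, not vaex's or Arrow's implementation.

```python
def dictionary_encode(values):
    """Return (indices, dictionary) where dictionary holds each distinct value once."""
    dictionary = []
    positions = {}  # value -> index into dictionary
    indices = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return indices, dictionary


def dictionary_decode(indices, dictionary):
    """Rebuild the original values from indices and dictionary."""
    return [dictionary[i] for i in indices]


values = ["red", "green", "red", "blue", "green", "red"]
indices, dictionary = dictionary_encode(values)
roundtrip = dictionary_decode(indices, dictionary)
```

Storing the two parts separately is what lets a file format keep long repeated strings compact while preserving the category labels.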
Requires vaex-core 4.8.0 for the vaex.datasets.iris()
Made compatible with Python 3.6
- Features
- Allow casting integers to timedelta64 type #1741
- When a single task fails, others can continue #1762
- Improved rich progress bar support #1771
- vaex.from_records to build a dataframe from a list of dicts #1767
- Settings in Vaex can be configured in a uniform way #1743
- Unique for datetime64 and timedelta64 expressions #1016
- Copy argument for binby, similar to groupby 4e7fd8e
- Performance
- Improve performance for filtered dataframes #1685
- Fixes
- Features
- do not track times to have deterministic output (useful for lineage/hash output) #1772
Requires vaex-core 4.7 for uniform settings
Requires vaex-core 4.7 for uniform settings
Requires vaex-core 4.7 for uniform settings
- Features
- Editor widget for settings #1743
- Fixes
- Histogram method on expression to propagate kwargs #1757
- Features
- OSX Metal support for jitting expressions #584
- Improved progress support, including Rich progress bars #1738
- Control number of columns and rows being printed #1672
- Groupby with regular bins (similar to binby) #1589
- Groupby with a limited number of values, and 'OTHERS' #1641
- New aggregators: vaex.agg.any and vaex.agg.all #1630
- Better API for correlation and mutual information #536
- Materialize datasets columns for better performance of non-memory mapping files (e.g. parquet) #1625
- Avoid using nest_asyncio #1546
- Multi level cache support (e.g. memory and disk) #1580
- Do not mutate dataframe when comparing dates. #1584
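
The "limited number of values, and 'OTHERS'" groupby item above can be pictured with the following sketch. It assumes the most frequent values are the ones kept, which the changelog does not state; the function name `count_with_others` and its parameters are hypothetical, and the code is a plain-Python illustration rather than the vaex API.

```python
from collections import Counter


def count_with_others(values, limit, others_label="OTHERS"):
    """Count values, keeping the `limit` most frequent and pooling the rest."""
    counts = Counter(values)
    top = counts.most_common(limit)
    result = dict(top)
    # Everything that did not make the cut is pooled under one label.
    remainder = sum(counts.values()) - sum(n for _, n in top)
    if remainder:
        result[others_label] = remainder
    return result


values = ["a", "b", "a", "c", "a", "b", "d"]
grouped = count_with_others(values, limit=2)
```

Capping the number of groups keeps the output small for high-cardinality columns while the row total stays intact.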
- Performance
- Fingerprint for tasks are more stable when the dataframe changes, but not the task description, for more cache hits. #1627
- Faster conversion between Arrow and NumPy #1625
- Cache sparse-finding/combining of high-d groupby #1588
- Allow (lazy) math and computations with aggregators #1612
- Fewer passes over the data when multiple dataframes use the same dataset #1594
- Share evaluation of expressions of selections #1594
- Delay support for groupby #1594
- Fixes
Requires vaex-core 4.6
Requires vaex-core 4.6
- Performance
- Dot product with many columns does not use expressions, but dedicated function #1671
- Features
- Filelocks for multi process convert=True cooperation #1573
- Performance
- Features
- Performance
- Features
- Write higher dimensional arrays to hdf5 files #1563
Requires vaex 4.5.0 due to private API change.
- Fixes
- Missing imports (now checked in CI) #1516
- Features
- Import from and export to Google BigQuery #1470
- Performance
- Features
- Fixes
- Complete refactor, now using FastAPI by default #1300
- Tensorflow/keras support #1510
- Features
- Fixes
- File order close issue on Windows #1479
- Performance
- Reuse filter data when slicing a dataframe #1287
- Features
- Cache task results, with support for Redis and diskcache #1393
- df.func.stack for stacking columns into Nd arrays #1287
- Sliding windows / shift / diff / sum #1287
- Embed join/groupby/shift in dataset (opt in via df._future(), will be default in vaex v5) #1287
- df.fingerprint() - a cross runtime unique key for caching #1287
- limit rows in groupby using early stop #1391
- Compare date columns to string values formatted in ISO 8601 format 621a341b54f9b4112f24e2ffd86612753df19fef
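
The task-result cache and `df.fingerprint()` items above both depend on a key that is identical across processes. Python's built-in `hash()` is randomized per process for strings, so a content hash is the usual choice. The sketch below is a hypothetical illustration of that idea using only the standard library; the `fingerprint` and `cached_sum` helpers and the task dictionaries are invented for the example and do not reflect vaex internals.

```python
import hashlib
import json


def fingerprint(description):
    """Hash a canonical JSON form so key order and process do not matter."""
    canonical = json.dumps(description, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


cache = {}
computed = []  # records which keys were actually computed


def cached_sum(values, description):
    """Compute sum(values) once per distinct task description."""
    key = fingerprint(description)
    if key not in cache:
        computed.append(key)
        cache[key] = sum(values)
    return cache[key]


task_a = {"op": "sum", "expression": "x", "selection": None}
task_b = {"selection": None, "expression": "x", "op": "sum"}  # same task, other key order
first = cached_sum([1, 2, 3], task_a)
second = cached_sum([1, 2, 3], task_b)  # served from the cache
```

Because the key is a plain string, the same scheme would work with an external store such as Redis or a disk cache in place of the in-memory dict.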
- Fixes
- df.concat did not copy functions #1287
- Filters with column names equal to function names a159777e2dc13ec762914c51c8b5550efec5f845
- Performance
- Perform groupby in a sparse way for less memory usage/performance (up to 250x faster) #1381
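
To see why a sparse groupby can save memory, compare a dense grid, which reserves one cell for every combination of group values, with a mapping that only stores combinations that occur. The sketch below is a plain-Python illustration under that assumption; the helper names are hypothetical and it does not show vaex's actual algorithm.

```python
from collections import Counter


def dense_cell_count(cardinalities):
    """Number of cells a dense grid needs: the product of all cardinalities."""
    total = 1
    for cardinality in cardinalities:
        total *= cardinality
    return total


def sparse_group_counts(rows):
    """Count only the group combinations that are actually observed."""
    return Counter(rows)


rows = [("NL", "cat"), ("NL", "cat"), ("US", "dog"), ("JP", "fish")]
cardinalities = [len({r[0] for r in rows}), len({r[1] for r in rows})]
dense_cells = dense_cell_count(cardinalities)
sparse = sparse_group_counts(rows)
```

With more grouping columns the dense cell count grows multiplicatively, while the sparse mapping is bounded by the number of rows.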
- Features
- Sorted groupby #1339
- Fixes
- Features
- SSL support 5dc29edd5b15eb4e1fe9c6981c67edd477481484
- Features
- groupby datetime support #1265
- Fixes
- Improved fsspec support #1268
- Performance
- df.extract() uses mask instead of indices 398b682fe9042b3336120e9013e15bbd638620ed
- Breaking changes:
- Arrow is now a core dependency, vaex-arrow is deprecated. All methods that return strings will return Arrow arrays #517
- Opening an .arrow file will expose the arrays as Apache Arrow arrays, not numpy arrays. #984
- Columns (e.g. df.column['x']) may now return a ColumnProxy, instead of the original data, slice it [:] to get the underlying data (or call .to_numpy()/to_arrow() or try converting it with np.array(..) or pa.array(..)). #993
- All plot methods went into the df.viz accessor #923
This is now part of vaex-core.
- Requirement changed to vaex-core >=4,<5
- Fixes
- Features
- Refactor
- Performance
- concat (vaex.concat or df.concat) is about 100x faster. #994
This is now part of vaex-enterprise (was a proof of concept, never functional).
- Requirement changed to vaex-core >=4,<5
- Requirement changed to vaex-core >=4,<5
- Requirement changed to vaex-core >=4,<5
- Features
- Requirement changed to vaex-core >=4,<5
- Requirement changed to vaex-core >=4,<5
- Features
- Normalize histogram and change selection mode. #826
* Features
* Autogenerate the fast (or functional) API [#512](https://github.com/vaexio/vaex/pull/512)
- Performance
- isin uses hashmaps, leading to a 2x-4x performance increase for primitives, 200x for strings in some cases #822
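
The gain from hashmaps in `isin` comes from replacing a scan of the candidate list per element with a constant-time lookup on average. The following sketch contrasts the two approaches in plain Python; the function names are hypothetical and the code is not vaex's implementation, which operates on native arrays.

```python
def isin_scan(values, candidates):
    """Membership test by scanning the candidate list for every value."""
    candidates = list(candidates)
    return [value in candidates for value in values]


def isin_hashed(values, candidates):
    """Membership test against a hash set built once up front."""
    lookup = set(candidates)
    return [value in lookup for value in values]


values = ["apple", "kiwi", "pear", "fig"]
candidates = ["pear", "apple", "plum"]
scanned = isin_scan(values, candidates)
hashed = isin_hashed(values, candidates)
empty = isin_hashed(values, [])  # an empty candidate list matches nothing
```

Both functions return the same mask; only the cost per lookup differs, which matters most for long candidate lists and for strings.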
- Features
- Selection toggle list. #797
- Fixes
- Remote dataframe was still using dtype, not data_type. #797
- Features
- Implementation of `GroupbyTransformer` #479
- Fixes
- Various fixes for aliased columns (column names with invalid identifiers) #768
- Fixes
- Fixes
- Masked arrays supported in hdf5 files on s3 #781
- Expression.map always uses masked arrays to be state transferrable (a new dataset might have missing values) #479
- Support importing Pandas dataframes with version 0.23 #794
- Various fixes for aliased columns (column names with invalid identifiers) #768 #793
- Fixes
- Join could in rare cases point to row 0, when there were values in the left, not present in the right #765
- Tabulate 0.8.7 escaped html, undo this to print dataframes nicely.
- Breaking changes:
- Python 2 is not supported anymore
- Variables don't have access to pi and e anymore
- `df.rename_column` is now `df.rename` (and also renames variables)
- DataFrame uses a normal dict instead of OrderedDict, requiring Python >= 3.6
- Default limits (e.g. for plots) are minmax, so we don't miss outliers
- `df.get_column_names()` returns the aliased names (invalid identifiers), pass `alias=False` to get the internal column name
- Default value of `virtual` is True in methods `df.export`, `df.to_dict`, `df.to_items`, `df.to_arrays`.
- df.dtype is a property, to get data types for expressions, use df.data_type(), df.expr.dtype is still behaving the same
- df.categorize takes min_value and max_value, and no longer needs the check argument, also the labels do not have to be strings.
- vaex.open/from_csv etc does not copy the pandas index by default #756
- df.categorize takes an inplace argument, similar to most methods, and returns the dataframe affected.
- Performance
- Refactor
- Fixes
- Renaming columns fixes #571
- Joining with virtual columns but different data, and name collision fixes #570
- Variables are treated similarly as columns, and respected in join #573
- Arguments to lazy functions which are numpy arrays get put in the variables #573
- Executor does not block after failed/interrupted tasks. #571
- Default limits (e.g. for plots) are minmax, so we don't miss outliers #581
- Do not fail when printing out a dataframe with 0 rows #582
- Give proper NameError when using non-existing column names #299
- Several fixes for concatenated dataframes. #590
- dropna/nan/missing only dropped rows when all column values were missing, if no columns were specified. #600
- Flaky test for RobustScaler skipped for p36 #614
- Copying/printing sparse matrices #615
- Sparse columns names with invalid identifiers are not rewritten. #617
- Column names with invalid identifiers which are rewritten are shown when printing the dataframe. #617
- Column name rewriting for invalid identifiers also works on virtual columns. #617
- Fix the links to the example datasets. #609
- Expression.isin supports dtype=object #669
- Fix `column_count`, now only counts hidden columns if explicitly specified #593
- df.values respects masked arrays #640
- Rewriting a virtual column and doing a state transfer does not lead to `ValueError: list.remove(x): x not in list` #592
- `df.<stat>(limits=...)` will now respect the selection #651
- Using automatic names for aggregators led to many underscores in name #687
- Support Python3.8 #559
- Features
- New lazy numpy wrappers: np.digitize and np.searchsorted #573
- `df.to_arrow_table`/`to_pandas_df`/`to_items`/`df.to_dict`/`df.to_arrays` now take a chunk_size argument for chunked iterators #589 (#699)
- Filtered datasets can be concatenated. #590
- DataFrames/Executors are thread safe (meaning you can schedule/compute from any thread), which makes it work out of the box for Dash and Flask #670
- `df.count/mean/std` etc can output in xarray.DataArray array type, makes plotting easier #671
- Column names can have unicode, and we use str.isidentifier to test, also don't accidentally hide columns. #617
- Percentile approx can take a sequence of percentages #527
- Polygon testing, useful in combinations with geo/geojson data #685
- Added dt.quarter property and dt.strftime method to expression (by Juho Lauri) #682
- Refactored server, can return multiple binary blobs, execute multiple tasks, cancel tasks, encoding/serialization is more flexible (like returning masked arrays). #571
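
For the polygon testing feature listed above, a common way to decide whether a point lies inside a polygon is even-odd ray casting: count how many edges a horizontal ray from the point crosses. The sketch below shows that generic technique in plain Python. It is an assumption that this resembles what the feature does; the function name and signature are hypothetical and are not the vaex API.

```python
def point_in_polygon(x, y, polygon):
    """Even-odd ray casting: True when (x, y) is inside the polygon.

    `polygon` is a list of (x, y) vertices; the last vertex connects to the first.
    Points exactly on an edge are not handled specially.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
center_inside = point_in_polygon(0.5, 0.5, square)
right_outside = point_in_polygon(1.5, 0.5, square)
```

The straddle check guarantees `y2 - y1` is non-zero, so horizontal edges never cause a division by zero.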
- Requirement of vaex-core >=2,<3
- Requirement of vaex-core >=2,<3
- Requirement of vaex-core >=2,<3
- Requirement of vaex-core >=2,<3
- Requirement of vaex-core >=2,<3
- Requirement of vaex-core >=2,<3
- Fixes
- Booleans were negated, and didn't respect offsets.
- Requirement of vaex-core >=2,<3
- Breaking changes
- vaex-jupyter is refactored #654
- Features
- Fixes
- Slicing arrow string arrays with masked arrays is respected/working #530
- Performance
- IncrementalPredictor uses parallel chunked support (2x speedup possible) #515
- Fix
- Features
- Performance
- Dataframes are always true (implements `__bool__`) to avoid calling `__len__` #496
- Fixes
- Do not duplicate column when joining DataFrames on a column with the same name #480
- Better error messages/stack traces, and work better with debugger. #488
- Accept numpy scalars in expressions. #462
- Expression.astype can create datetime64 columns out of (arrow) strings arrays. #440
- Invalid mask access triggered when memory-mapped read only for strings. #459
- Features
- Features
- IncrementalPredictor for `scikit-learn` models that support the `.partial_fit` method #497
- Fixes
- Adding unique function names to dataframes to enable adding a predictor twice #492
* Compatibility with vaex-core 1.4.0
- Performance
- Parallel df.evaluate #474
- Avoid calling df.get_column_names (1000x for 1 billion rows per column use) #473
- Slicing e.g df[1:-1] goes much faster for filtered dataframes #471
- Dataframe copying and expression rewriting was slow #470
- Double indices columns were not using index cache since empty dict is falsy #439
- Features
- requires vaex-core >=1.3,<2 for parallel evaluate
- Fixes:
- bqplot 0.12 revealed a bug/inconsistency with heatmap #465
- Fixes
- Support for Apache Arrow >= 0.15
- Fixes
- Docstrings and minor improvements
- initial release 0.1
- feature: auto upcasting for sum #435
- fix: selection/filtering fix when using masked values #431
- fix: masked string array fixes #434
- fix: memory usage fix for joins #439
- fix: support for Apache Arrow >= 0.15
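
The "auto upcasting for sum" entry above addresses a general hazard: summing small integers in an accumulator of the same width can wrap around. The sketch below reproduces that effect with the standard library's `ctypes` fixed-width integers, which do not check for overflow. It illustrates the motivation only; the function names are hypothetical and the code is not taken from vaex.

```python
import ctypes


def sum_int8_accumulator(values):
    """Sum with an 8-bit signed accumulator, which wraps on overflow."""
    total = ctypes.c_int8(0)
    for value in values:
        total = ctypes.c_int8(total.value + value)
    return total.value


def sum_upcast(values):
    """Sum with a 64-bit signed accumulator, wide enough for this input."""
    total = ctypes.c_int64(0)
    for value in values:
        total = ctypes.c_int64(total.value + value)
    return total.value


values = [100, 100, 100]  # each value fits in int8, the total does not
narrow = sum_int8_accumulator(values)
wide = sum_upcast(values)
```

Each input fits comfortably in eight bits, yet only the wider accumulator returns the arithmetic total.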