Version 0.20
Breaking changes
Change default join behavior with regard to null values
Previously, null values in the join key were considered a value like any other value. This meant that null values in the left frame would be joined with null values in the right frame. This is expensive and does not match default behavior in SQL.
Default behavior has now been changed to ignore null values in the join key. The previous behavior can be retained by setting join_nulls=True.
Example
Before:
>>> df1 = pl.DataFrame({"a": [1, 2, None], "b": [4, 4, 4]})
>>> df2 = pl.DataFrame({"a": [None, 2, 3], "c": [5, 5, 5]})
>>> df1.join(df2, on="a", how="inner")
shape: (2, 3)
┌──────┬─────┬─────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞══════╪═════╪═════╡
│ null ┆ 4 ┆ 5 │
│ 2 ┆ 4 ┆ 5 │
└──────┴─────┴─────┘
After:
>>> df1.join(df2, on="a", how="inner")
shape: (1, 3)
┌─────┬─────┬─────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 2 ┆ 4 ┆ 5 │
└─────┴─────┴─────┘
>>> df1.join(df2, on="a", how="inner", join_nulls=True) # Keeps previous behavior
shape: (2, 3)
┌──────┬─────┬─────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞══════╪═════╪═════╡
│ null ┆ 4 ┆ 5 │
│ 2 ┆ 4 ┆ 5 │
└──────┴─────┴─────┘
Preserve left and right join keys in outer joins
Previously, the result of an outer join did not contain the join keys of the left and right frames. Rather, it contained a coalesced version of the left key and right key. This loses information and does not conform to default SQL behavior.
The behavior has been changed to include the original join keys. Name clashes are resolved by appending a suffix (_right by default) to the right join key name. The previous behavior can be retained by setting how="outer_coalesce".
Example
Before:
>>> df1 = pl.DataFrame({"L1": ["a", "b", "c"], "L2": [1, 2, 3]})
>>> df2 = pl.DataFrame({"L1": ["a", "c", "d"], "R2": [7, 8, 9]})
>>> df1.join(df2, on="L1", how="outer")
shape: (4, 3)
┌─────┬──────┬──────┐
│ L1 ┆ L2 ┆ R2 │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪══════╪══════╡
│ a ┆ 1 ┆ 7 │
│ c ┆ 3 ┆ 8 │
│ d ┆ null ┆ 9 │
│ b ┆ 2 ┆ null │
└─────┴──────┴──────┘
After:
>>> df1.join(df2, on="L1", how="outer")
shape: (4, 4)
┌──────┬──────┬──────────┬──────┐
│ L1 ┆ L2 ┆ L1_right ┆ R2 │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ i64 │
╞══════╪══════╪══════════╪══════╡
│ a ┆ 1 ┆ a ┆ 7 │
│ b ┆ 2 ┆ null ┆ null │
│ c ┆ 3 ┆ c ┆ 8 │
│ null ┆ null ┆ d ┆ 9 │
└──────┴──────┴──────────┴──────┘
>>> df1.join(df2, on="L1", how="outer_coalesce") # Keeps previous behavior
shape: (4, 3)
┌─────┬──────┬──────┐
│ L1 ┆ L2 ┆ R2 │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪══════╪══════╡
│ a ┆ 1 ┆ 7 │
│ c ┆ 3 ┆ 8 │
│ d ┆ null ┆ 9 │
│ b ┆ 2 ┆ null │
└─────┴──────┴──────┘
count now ignores null values
The count method for Expr and Series now ignores null values. Use len to get the count with null values included.
Note that pl.count() and group_by(...).count() are unchanged. These count the number of rows in the context, so nulls are not applicable in the same way.
This brings behavior more in line with the SQL standard, where COUNT(col) ignores null values but COUNT(*) counts rows regardless of null values.
Example
Before:
>>> df = pl.DataFrame({"a": [1, 2, None]})
>>> df.select(pl.col("a").count())
shape: (1, 1)
┌─────┐
│ a │
│ --- │
│ u32 │
╞═════╡
│ 3 │
└─────┘
After:
>>> df.select(pl.col("a").count())
shape: (1, 1)
┌─────┐
│ a │
│ --- │
│ u32 │
╞═════╡
│ 2 │
└─────┘
>>> df.select(pl.col("a").len()) # Mirrors previous behavior
shape: (1, 1)
┌─────┐
│ a │
│ --- │
│ u32 │
╞═════╡
│ 3 │
└─────┘
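As noted above, pl.count() is unchanged: it counts rows in the context, so rows where a is null are still counted. A minimal sketch using the same frame:
>>> df.select(pl.count())  # Counts rows in the context, not non-null values, so this still yields 3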
NaN values are now considered equal
Floating point NaN values were treated as unequal across Polars operations. This has been corrected to better match user expectations and existing standards.
While this is considered a bug fix, it is included in this guide in order to draw attention to possible impact on user workflows that may contain NaN values.
Example
Before:
>>> s = pl.Series([1.0, float("nan"), float("inf")])
>>> s == s
shape: (3,)
Series: '' [bool]
[
true
false
true
]
After:
>>> s == s
shape: (3,)
Series: '' [bool]
[
true
true
true
]
Assertion utils updates to exact checking and NaN equality
The assertion utility functions assert_frame_equal and assert_series_equal would use the tolerance parameters atol and rtol to do approximate checking, unless check_exact was set to True. This could lead to some surprising behavior, as integers are generally thought of as exact values. Integer values are now always checked exactly. To do inexact checking, convert to float first.
Additionally, the nans_compare_equal parameter has been removed and NaN values are now always considered equal, which was the previous default behavior. This parameter had previously been deprecated but has been removed before the end of the standard deprecation period to facilitate the change to NaN equality.
Example
Before:
>>> from polars.testing import assert_frame_equal
>>> df1 = pl.DataFrame({"id": [123456]})
>>> df2 = pl.DataFrame({"id": [123457]})
>>> assert_frame_equal(df1, df2) # Passes
After:
>>> assert_frame_equal(df1, df2)
...
AssertionError: DataFrames are different (value mismatch for column 'id')
[left]: [123456]
[right]: [123457]
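As mentioned above, inexact checking of integer data is still possible by converting to a float type first. A minimal sketch, assuming the default atol/rtol tolerances of assert_frame_equal:
>>> assert_frame_equal(
...     df1.with_columns(pl.col("id").cast(pl.Float64)),
...     df2.with_columns(pl.col("id").cast(pl.Float64)),
... )  # Approximate checking applies to floats, so this passes within the default tolerances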
Allow all DataType objects to be instantiated
Polars data types are subclasses of the DataType class. We had a 'hack' in place that automatically converted data types instantiated without any arguments to the class, rather than actually instantiating it. The idea was to allow specifying data types as Int64 rather than Int64(), which is more succinct. However, this caused some unexpected behavior when working directly with data type objects, especially as there was a discrepancy with data types like Datetime, which will be instantiated in many cases.
Going forward, instantiating a data type will always return an instance of that class. Both classes and instances are handled by Polars, so the previous short syntax is still available. Methods that return data types, like Series.dtype and DataFrame.schema, now always return instantiated data type objects.
You may have to update some of your data type checks if you were not already using the equality operator (==), as well as update some type hints.
Example
Before:
>>> s = pl.Series([1, 2, 3], dtype=pl.Int8)
>>> s.dtype == pl.Int8
True
>>> s.dtype is pl.Int8
True
>>> isinstance(s.dtype, pl.Int8)
False
After:
>>> s.dtype == pl.Int8
True
>>> s.dtype is pl.Int8
False
>>> isinstance(s.dtype, pl.Int8)
True
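Since both classes and instances are handled by Polars, the short syntax remains valid wherever a data type is specified. A minimal sketch:
>>> s1 = pl.Series([1, 2, 3], dtype=pl.Int8)    # passing the class still works
>>> s2 = pl.Series([1, 2, 3], dtype=pl.Int8())  # passing an instance is handled the same way
>>> s1.dtype == s2.dtype
True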
Update constructors for Decimal and Array data types
The data types Decimal and Array have had their parameters switched around. The new constructors should more closely match user expectations.
Example
Before:
>>> pl.Array(2, pl.Int16)
Array(Int16, 2)
>>> pl.Decimal(5, 10)
Decimal(precision=10, scale=5)
After:
>>> pl.Array(pl.Int16, 2)
Array(Int16, width=2)
>>> pl.Decimal(10, 5)
Decimal(precision=10, scale=5)
DataType.is_nested changed from a property to a class method
This is a minor change, but a very important one to update properly. Failure to update accordingly may result in faulty logic, as Python evaluates the method object itself to True. For example, if dtype.is_nested will now evaluate to True regardless of the data type, because the expression returns the method, which Python considers truthy.
Example
Before:
>>> pl.List(pl.Int8).is_nested
True
After:
>>> pl.List(pl.Int8).is_nested()
True
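To see why an un-updated check silently misbehaves rather than raising an error, a minimal sketch:
>>> dtype = pl.Int8()
>>> bool(dtype.is_nested)  # The bound method itself is truthy, so this check is always True
True
>>> dtype.is_nested()  # Calling the method gives the intended answer
False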
Smaller integer data types for datetime components such as dt.month and dt.week
Most datetime components, such as month and week, would previously return a UInt32 type. This has been updated to the smallest appropriate signed integer type. This should reduce memory consumption.
Method | Dtype old | Dtype new |
---|---|---|
year | i32 | i32 |
iso_year | i32 | i32 |
quarter | u32 | i8 |
month | u32 | i8 |
week | u32 | i8 |
day | u32 | i8 |
weekday | u32 | i8 |
ordinal_day | u32 | i16 |
hour | u32 | i8 |
minute | u32 | i8 |
second | u32 | i8 |
millisecond | u32 | i32* |
microsecond | u32 | i32 |
nanosecond | u32 | i32 |
*Technically, millisecond can be an i16. This may be updated in the future.
Example
Before:
>>> from datetime import date
>>> s = pl.Series([date(2023, 12, 31), date(2024, 1, 1)])
>>> s.dt.month()
shape: (2,)
Series: '' [u32]
[
12
1
]
After:
>>> s.dt.month()
shape: (2,)
Series: '' [i8]
[
12
1
]
Series now defaults to Null data type when no data is present
This replaces the previous behavior of initializing as a Float32 type.
Example
Before:
>>> pl.Series("a", [None])
shape: (1,)
Series: 'a' [f32]
[
null
]
After:
>>> pl.Series("a", [None])
shape: (1,)
Series: 'a' [null]
[
null
]
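If your code relied on the previous Float32 default, one option (a minimal sketch) is to pass the dtype explicitly:
>>> pl.Series("a", [None], dtype=pl.Float32)
shape: (1,)
Series: 'a' [f32]
[
null
]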
replace reimplemented with slightly different behavior
The new implementation is mostly backwards compatible. Please do note the following:
- The logic for determining the return data type has changed. You may want to specify return_dtype to override the inferred data type, or take advantage of the new function signature (separate old and new parameters) to influence the return type. See the sketch after the example below.
- The previous workaround for referencing other columns as default by using a struct column no longer works. It now simply works as expected, with no workaround needed.
Example
Before:
>>> df = pl.DataFrame({"a": [1, 2, 2, 3], "b": [1.5, 2.5, 5.0, 1.0]}, schema={"a": pl.Int8, "b": pl.Float64})
>>> df.select(pl.col("a").replace({2: 100}))
shape: (4, 1)
┌─────┐
│ a │
│ --- │
│ i8 │
╞═════╡
│ 1 │
│ 100 │
│ 100 │
│ 3 │
└─────┘
>>> df.select(pl.struct("a", "b").replace({2: 100}, default=pl.col("b")))
shape: (4, 1)
┌───────┐
│ a │
│ --- │
│ f64 │
╞═══════╡
│ 1.5 │
│ 100.0 │
│ 100.0 │
│ 1.0 │
└───────┘
After:
>>> df.select(pl.col("a").replace({2: 100}))
shape: (4, 1)
┌─────┐
│ a │
│ --- │
│ i64 │
╞═════╡
│ 1 │
│ 100 │
│ 100 │
│ 3 │
└─────┘
>>> df.select(pl.col("a").replace({2: 100}, default=pl.col("b"))) # No struct needed
shape: (4, 1)
┌───────┐
│ a │
│ --- │
│ f64 │
╞═══════╡
│ 1.5 │
│ 100.0 │
│ 100.0 │
│ 1.0 │
└───────┘
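As noted in the first bullet above, return_dtype can be passed to override the inferred return type. A minimal sketch that keeps the original Int8 type rather than the inferred i64:
>>> df.select(pl.col("a").replace({2: 100}, return_dtype=pl.Int8))
shape: (4, 1)
┌─────┐
│ a │
│ --- │
│ i8 │
╞═════╡
│ 1 │
│ 100 │
│ 100 │
│ 3 │
└─────┘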
value_counts resulting column renamed from counts to count
The resulting struct field for the value_counts method has been renamed from counts to count.
Example
Before:
>>> s = pl.Series("a", ["x", "x", "y"])
>>> s.value_counts()
shape: (2, 2)
┌─────┬────────┐
│ a ┆ counts │
│ --- ┆ --- │
│ str ┆ u32 │
╞═════╪════════╡
│ x ┆ 2 │
│ y ┆ 1 │
└─────┴────────┘
After:
>>> s.value_counts()
shape: (2, 2)
┌─────┬───────┐
│ a ┆ count │
│ --- ┆ --- │
│ str ┆ u32 │
╞═════╪═══════╡
│ x ┆ 2 │
│ y ┆ 1 │
└─────┴───────┘
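If downstream code still expects the old name, one option (a minimal sketch) is to rename the column back after the fact:
>>> s.value_counts().rename({"count": "counts"})
shape: (2, 2)
┌─────┬────────┐
│ a ┆ counts │
│ --- ┆ --- │
│ str ┆ u32 │
╞═════╪════════╡
│ x ┆ 2 │
│ y ┆ 1 │
└─────┴────────┘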
Update read_parquet to use Object Store rather than fsspec
If you were using read_parquet, installing fsspec as an optional dependency is no longer required. The new Object Store implementation was already in use for scan_parquet. It may have slightly different behavior in certain cases, such as how credentials are detected and how downloads are performed.
The resulting DataFrame should be identical between versions.
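For reference, a minimal sketch of reading from cloud storage without an fsspec install; the bucket path here is hypothetical:
>>> df = pl.read_parquet("s3://my-bucket/data.parquet")  # hypothetical path; handled by the Object Store implementation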
Deprecations
Cumulative functions renamed from cum* to cum_*
Technically, this deprecation was introduced in version 0.19.14, but many users will first encounter it when upgrading to 0.20. It's a relatively impactful change, which is why we mention it here.
Old name | New name |
---|---|
cumfold | cum_fold |
cumreduce | cum_reduce |
cumsum | cum_sum |
cumprod | cum_prod |
cummin | cum_min |
cummax | cum_max |
cumcount | cum_count |
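For example, migrating a cumulative sum; a minimal sketch (the old spelling still works in 0.20 but emits a deprecation warning):
>>> s = pl.Series([1, 2, 3])
>>> s.cum_sum()  # previously s.cumsum(), which now warns about the rename
shape: (3,)
Series: '' [i64]
[
1
3
6
]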