Woodwork Documentation
Release 0.3.1
Alteryx, Inc.
Woodwork is a library that helps with data typing of 2-dimensional tabular data structures.
It provides a special namespace on your DataFrame, ww, which contains the physical, logical, and semantic data types.
It can be used with Featuretools, EvalML, and general machine learning applications where logical and semantic
typing information is important.
Woodwork provides simple interfaces for adding and updating logical and semantic typing information, as well as
selecting data columns based on the types.
Below is an example of using Woodwork to automatically infer the Logical Types for a DataFrame and select columns
with specific types.
import woodwork as ww

df = ww.demo.load_retail(nrows=100, init_woodwork=False)
df.ww.init(name="retail")
df.ww
[1]: Physical Type Logical Type Semantic Tag(s)
Column
order_product_id int64 Integer ['numeric']
order_id int64 Integer ['numeric']
product_id category Categorical ['category']
description string NaturalLanguage []
quantity int64 Integer ['numeric']
order_date datetime64[ns] Datetime []
unit_price float64 Double ['numeric']
customer_name string NaturalLanguage []
country string NaturalLanguage []
total float64 Double ['numeric']
cancelled bool Boolean []
Install
Woodwork is available for Python 3.7, 3.8, and 3.9. It can be installed from PyPI, conda, or from source.
PyPI
Woodwork allows users to install add-ons individually or all at once. In order to install all add-ons, run:
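The install commands themselves did not survive extraction. A sketch of the typical commands follows; the extras name complete is an assumption based on later Woodwork releases and may differ in 0.3.1:

```shell
# basic install from PyPI
python -m pip install woodwork

# install all optional add-ons at once (extras name assumed)
python -m pip install "woodwork[complete]"
```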
Conda
Note: In order to use Woodwork with Dask or Koalas DataFrames, the following commands must be run for your
library of choice prior to installing Woodwork with conda: conda install dask for Dask or conda install
koalas and conda install pyspark for Koalas.
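The conda command itself was lost in extraction; Woodwork is distributed on the conda-forge channel, so the install typically looks like:

```shell
conda install -c conda-forge woodwork
```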
Source
To install Woodwork from source, clone the repository from Github, and install the dependencies.
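The source-install commands were lost in extraction; a typical sketch, assuming the standard Alteryx repository location:

```shell
git clone https://github.com/alteryx/woodwork.git
cd woodwork
python -m pip install -e .
```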
Dependencies
You can view a list of all Woodwork core dependencies in the requirements.txt file.
Optional Dependencies
Woodwork has several other dependencies that are used only for specific methods. Attempting to use one of these
methods without having the necessary library installed will result in an ImportError with instructions on how to
install the necessary dependency.
Development
Get Started
In this guide, you walk through examples where you initialize Woodwork on a DataFrame and on a Series. Along
the way, you learn how to update and remove logical types and semantic tags. You also learn how to use typing
information to select subsets of data.
Woodwork relies heavily on the concepts of physical types, logical types and semantic tags. These concepts are
covered in detail in Understanding Types and Tags, but we provide brief definitions here for reference:
• Physical Type: defines how the data is stored on disk or in memory.
• Logical Type: defines how the data should be parsed or interpreted.
• Semantic Tag(s): provides additional data about the meaning of the data or how it should be used.
Start learning how to use Woodwork by reading in a dataframe that contains retail sales data.
import pandas as pd
import woodwork as ww

df = pd.read_csv("https://api.featurelabs.com/datasets/online-retail-logs-2018-08-28.csv")
df.head(5)
[1]: order_id product_id description quantity \
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6
1 536365 71053 WHITE METAL LANTERN 6
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6
cancelled
0 False
1 False
2 False
3 False
4 False
As you can see, this is a dataframe containing several different data types, including dates, categorical values, numeric
values, and natural language descriptions. Next, initialize Woodwork on this DataFrame.
Importing Woodwork creates a special namespace on your DataFrames, DataFrame.ww, that can be used to set or
update the typing information for the DataFrame. As long as Woodwork has been imported, initializing Woodwork on
a DataFrame is as simple as calling .ww.init() on the DataFrame of interest. An optional name parameter can be
specified to label the data.
Using just this simple call, Woodwork was able to infer the logical types present in the data by analyzing the DataFrame
dtypes as well as the information contained in the columns. In addition, Woodwork also added semantic tags to some
of the columns based on the logical types that were inferred. Because the original data did not contain an index
column, Woodwork’s make_index parameter was used to create a new index column in the DataFrame.
Warning: Woodwork uses a weak reference for maintaining a reference from the accessor to the DataFrame.
Because of this, chaining a Woodwork call onto another call that creates a new DataFrame or Series object can be
problematic.
All Woodwork methods and properties can be accessed through the ww namespace on the DataFrame. DataFrame
methods called from the Woodwork namespace will be passed to the DataFrame, and whenever possible, Woodwork
will be initialized on the returned object, assuming it is a Series or a DataFrame.
As an example, use the head method to create a new DataFrame containing the first 5 rows of the original data, with
Woodwork typing information retained.
[4]: head_df = df.ww.head(5)
     head_df
[4]: order_product_id order_id product_id description \
0 0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER
1 1 536365 71053 WHITE METAL LANTERN
2 2 536365 84406B CREAM CUPID HEARTS COAT HANGER
3 3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE
4 4 536365 84029E RED WOOLLY HOTTIE WHITE HEART.
total cancelled
0 25.245 False
1 33.561 False
2 36.300 False
3 33.561 False
4 33.561 False
Note: Once Woodwork is initialized on a DataFrame, it is recommended to go through the ww namespace when
performing DataFrame operations to avoid invalidating Woodwork’s typing information.
If the initial inference was not to your liking, the logical type can be changed to a more appropriate value. Let's change some of the columns to a different logical type to illustrate this process. In this case, set the logical type for the order_id and country columns to be Categorical, and set customer_name to have a logical type of PersonFullName.
[5]: df.ww.set_types(logical_types={
'customer_name': 'PersonFullName',
'country': 'Categorical',
'order_id': 'Categorical'
})
df.ww.types
[5]: Physical Type Logical Type Semantic Tag(s)
Column
order_product_id int64 Integer ['index']
order_id category Categorical ['category']
product_id category Categorical ['category']
description string NaturalLanguage []
quantity int64 Integer ['numeric']
order_date datetime64[ns] Datetime []
unit_price float64 Double ['numeric']
customer_name string PersonFullName []
country category Categorical ['category']
total float64 Double ['numeric']
cancelled bool Boolean []
Inspect the information in the types output. There, you can see that the logical types for the three columns have been updated with the values you specified.
Selecting Columns
Now that you’ve prepared logical types, you can select a subset of the columns based on their logical types. Select
only the columns that have a logical type of Integer or Double.
[6]: numeric_df = df.ww.select(['Integer', 'Double'])
numeric_df.ww
[6]: Physical Type Logical Type Semantic Tag(s)
Column
order_product_id int64 Integer ['index']
quantity int64 Integer ['numeric']
unit_price float64 Double ['numeric']
total float64 Double ['numeric']
This selection process has returned a new Woodwork DataFrame containing only the columns that match the logical
types you specified. After you have selected the columns you want, you can use the DataFrame containing just those
columns as you normally would for any additional analysis.
[7]: numeric_df
[7]: order_product_id quantity unit_price total
0 0 6 4.2075 25.2450
1 1 6 5.5935 33.5610
2 2 8 4.5375 36.3000
3 3 6 5.5935 33.5610
Next, let’s add semantic tags to some of the columns. Add the tag of product_details to the description
column, and tag the total column with currency.
Select columns based on a semantic tag. Only select the columns tagged with category.
Select columns using multiple semantic tags or a mixture of semantic tags and logical types.
To select an individual column, specify the column name. Woodwork will be initialized on the returned Series and
you can use the Series for additional analysis as needed.
[13]: total = df.ww['total']
      total
[13]: 0 25.2450
1 33.5610
2 36.3000
3 33.5610
4 33.5610
...
401599 16.8300
401600 20.7900
401601 27.3900
401602 27.3900
401603 24.5025
Name: total, Length: 401604, dtype: float64
Remove specific semantic tags from a column if they are no longer needed. In this example, remove the
product_details tag from the description column.
[15]: df.ww.remove_semantic_tags({'description':'product_details'})
df.ww
[15]: Physical Type Logical Type Semantic Tag(s)
Column
order_product_id int64 Integer ['index']
order_id category Categorical ['category']
product_id category Categorical ['category']
description string NaturalLanguage []
Notice how the product_details tag has been removed from the description column. If you want to remove
all user-added semantic tags from all columns, you can do that, too.
[16]: df.ww.reset_semantic_tags()
df.ww
[16]: Physical Type Logical Type Semantic Tag(s)
Column
order_product_id int64 Integer ['numeric']
order_id category Categorical ['category']
product_id category Categorical ['category']
description string NaturalLanguage []
quantity int64 Integer ['numeric']
order_date datetime64[ns] Datetime []
unit_price float64 Double ['numeric']
customer_name string PersonFullName []
country category Categorical ['category']
total float64 Double ['numeric']
cancelled bool Boolean []
At any point, you can designate certain columns as the Woodwork index or time_index with the methods
set_index and set_time_index. These methods can be used to assign these columns for the first time or to change
the column being used as the index or time index.
Index and time index columns contain index and time_index semantic tags, respectively.
[17]: df.ww.set_index('order_product_id')
df.ww.index
[17]: 'order_product_id'
[18]: df.ww.set_time_index('order_date')
df.ww.time_index
[18]: 'order_date'
[19]: df.ww
[19]: Physical Type Logical Type Semantic Tag(s)
Column
order_product_id int64 Integer ['index']
order_id category Categorical ['category']
product_id category Categorical ['category']
description string NaturalLanguage []
quantity int64 Integer ['numeric']
order_date datetime64[ns] Datetime ['time_index']
Woodwork also can be used to store typing information on a Series. There are two approaches for initializing Woodwork on a Series, depending on whether or not the Series dtype is the same as the physical type associated with the LogicalType. For more information on logical types and physical types, refer to Understanding Types and Tags.
If your Series dtype matches the physical type associated with the specified or inferred LogicalType, Woodwork can
be initialized through the ww namespace, just as with DataFrames.
[20]: series = pd.Series([1, 2, 3], dtype='int64')
series.ww.init(logical_type='Integer')
series.ww
[20]: <Series: None (Physical Type = int64) (Logical Type = Integer) (Semantic Tags = {'numeric'})>
In the example above, we specified the Integer LogicalType for the Series. Because Integer has a physical
type of int64 and this matches the dtype used to create the Series, no Series dtype conversion was needed and the
initialization succeeds.
In cases where the LogicalType requires the Series dtype to change, a helper function ww.init_series must be
used. This function will return a new Series object with Woodwork initialized and the dtype of the series changed to
match the physical type of the LogicalType.
To demonstrate this case, first create a Series, with a string dtype. Then, initialize a Woodwork Series with a
Categorical logical type using the init_series function. Because Categorical uses a physical type of
category, the dtype of the Series must be changed, and that is why we must use the init_series function here.
The series that is returned will have Woodwork initialized with the LogicalType set to Categorical as expected,
with the expected dtype of category.
[21]: string_series = pd.Series(['a', 'b', 'a'], dtype='string')
ww_series = ww.init_series(string_series, logical_type='Categorical')
ww_series.ww
[21]: <Series: None (Physical Type = category) (Logical Type = Categorical) (Semantic Tags = {'category'})>
As with DataFrames, Woodwork provides several methods that can be used to update or change the typing information
associated with the series. As an example, add a new semantic tag to the series.
[22]: series.ww.add_semantic_tags('new_tag')
series.ww
[22]: <Series: None (Physical Type = int64) (Logical Type = Integer) (Semantic Tags = {'new_tag', 'numeric'})>
As you can see from the output above, the specified tag has been added to the semantic tags for the series.
You can also access Series properties and methods through the Woodwork namespace. When possible, Woodwork typing information will be retained on the value returned. As an example, you can access the Series shape property through Woodwork.
[23]: series.ww.shape
[23]: (3,)
You can also call Series methods such as sample. In this case, Woodwork typing information is retained on the
Series returned by the sample method.
[24]: sample_series = series.ww.sample(2)
sample_series.ww
[24]: <Series: None (Physical Type = int64) (Logical Type = Integer) (Semantic Tags = {'new_tag', 'numeric'})>
[25]: sample_series
[25]: 2 3
0 1
dtype: int64
Retrieve all the Logical Types present in Woodwork. These can be useful for understanding the Logical Types, as well
as how they are interpreted.
[26]: from woodwork.type_sys.utils import list_logical_types
list_logical_types()
[26]: name type_string \
0 Address address
1 Age age
2 AgeNullable age_nullable
3 Boolean boolean
4 BooleanNullable boolean_nullable
5 Categorical categorical
6 CountryCode country_code
7 Datetime datetime
8 Double double
9 EmailAddress email_address
10 Filepath filepath
11 IPAddress ip_address
12 Integer integer
13 IntegerNullable integer_nullable
14 LatLong lat_long
15 NaturalLanguage natural_language
16 Ordinal ordinal
17 PersonFullName person_full_name
18 PhoneNumber phone_number
19 PostalCode postal_code
20 SubRegionCode sub_region_code
21 Timedelta timedelta
22 URL url
description physical_type \
0 Represents Logical Types that contain address ... string
1 Represents Logical Types that contain non-nega... int64
2 Represents Logical Types that contain non-nega... Int64
Guides
Using Woodwork effectively requires a good understanding of physical types, logical types, and semantic tags, all
concepts that are core to Woodwork. This guide provides a detailed overview of types and tags, as well as how to work
with them.
Woodwork has been designed to allow users to easily specify additional typing information for a DataFrame while
providing the ability to interface with the data based on the typing information. Because a single DataFrame might
store various types of data like numbers, text, or dates in different columns, the additional information is defined on a
per-column basis.
There are 3 main ways that Woodwork stores additional information about user data:
• Physical Type: defines how the data is stored on disk or in memory.
• Logical Type: defines how the data should be parsed or interpreted.
• Semantic Tag(s): provides additional data about the meaning of the data or how it should be used.
Physical Types
Physical types define how the data is stored on disk or in memory. You might also see the physical type for a column
referred to as the column’s dtype.
For example, commonly used pandas dtypes include object, int64, float64, and datetime64[ns], though there are many more. Woodwork uses 10 different physical types, each corresponding to a pandas dtype. When Woodwork is initialized on a DataFrame, the dtype of the underlying data is converted to one of these values, if it isn't already one of these types:
• bool
• boolean
• category
• datetime64[ns]
• float64
• int64
• Int64
• object
• string
• timedelta64[ns]
The physical type conversion is done based on the LogicalType that has been specified or inferred for a given
column.
When using Woodwork with a Koalas DataFrame, the physical types used may be different than those listed above.
For more information, refer to the guide Using Woodwork with Dask and Koalas DataFrames.
Logical Types
Logical types define how data should be interpreted or parsed. Logical types provide an additional level of detail beyond the physical type. Some columns might share the same physical type, but have different parsing requirements depending on the information that is stored in the column.
For example, email addresses and phone numbers would typically both be stored in a data column with a physical
type of string. However, when reading and validating these two types of information, different rules apply. For
email addresses, the presence of the @ symbol is important. For phone numbers, you might want to confirm that only a
certain number of digits are present, and special characters might be restricted to +, -, ( or ). In this particular example
Woodwork defines two different logical types to separate these parsing needs: EmailAddress and PhoneNumber.
There are many different logical types defined within Woodwork. To get a complete list of all the available logical
types, you can use the list_logical_types function.
description physical_type \
0 Represents Logical Types that contain address ... string
1 Represents Logical Types that contain non-nega... int64
2 Represents Logical Types that contain non-nega... Int64
3 Represents Logical Types that contain binary v... bool
4 Represents Logical Types that contain binary v... boolean
5 Represents Logical Types that contain unordere... category
6 Represents Logical Types that contain categori... category
7 Represents Logical Types that contain date and... datetime64[ns]
8 Represents Logical Types that contain positive... float64
9 Represents Logical Types that contain email ad... string
10 Represents Logical Types that specify location... string
11 Represents Logical Types that contain IP addre... string
12 Represents Logical Types that contain positive... int64
13 Represents Logical Types that contain positive... Int64
In the table, notice that each logical type has a specific physical_type value associated with it. Any time a
logical type is set for a column, the physical type of the underlying data is converted to the type shown in the
physical_type column. There is only one physical type associated with each logical type.
Semantic Tags
Semantic tags provide more context about the meaning of a data column. This could directly affect how the information
contained in the column is interpreted. Unlike physical types and logical types, semantic tags are much less restrictive.
A column might contain many semantic tags or none at all. Regardless, when assigning semantic tags, users should
take care to not assign tags that have conflicting meanings.
As an example of how semantic tags can be useful, consider a dataset with 2 date columns: a signup date and a user
birth date. Both of these columns have the same physical type (datetime64[ns]), and both have the same logical
type (Datetime). However, semantic tags can be used to differentiate these columns. For example, you might want
to add the date_of_birth semantic tag to the user birth date column to indicate this column has special meaning
and could be used to compute a user’s age. Computing an age from the signup date column would not make sense, so
the semantic tag can be used to differentiate between what the dates in these columns mean.
As you can see from the table generated with the list_logical_types function above, Woodwork has some
standard tags that are applied to certain columns by default. Woodwork adds a standard set of semantic tags to
columns with LogicalTypes that fall under certain predefined categories.
The standard tags are as follows:
• 'numeric' - The tag applied to numeric Logical Types.
– Integer
– IntegerNullable
– Double
• 'category' - The tag applied to Logical Types that represent categorical variables.
– Categorical
– CountryCode
– Ordinal
– PostalCode
– SubRegionCode
There are also 2 tags that get added to index columns. If no index columns have been specified, these tags are not
present:
• 'index' - on the index column, when specified
• 'time_index' - on the time index column, when specified
The application of standard tags, excluding the index and time_index tags, which have special meaning, can be
controlled by the user. This is discussed in more detail in the Working with Semantic Tags section. There are a few
different semantic tags defined within Woodwork. To get a list of the standard, index, and time index tags, you can use
the list_semantic_tags function.
valid_logical_types
0 [Age, AgeNullable, Double, Integer, IntegerNul...
1 [Categorical, CountryCode, Ordinal, PostalCode...
2 [Integer, Double, Categorical, Datetime]
3 [Datetime]
4 [Datetime]
When initializing Woodwork, users have the option to specify the logical types for all, some, or none of the columns
in the underlying DataFrame. If logical types are defined for all of the columns, these logical types are used directly,
provided the data is compatible with the specified logical type. You can’t, for example, use a logical type of Integer
on a column that contains text values that can’t be converted to integers.
If users don’t supply any logical type information during initialization, Woodwork infers the logical types based on
the physical type of the column and the information contained in the columns. If the user passes information for some
of the columns, the logical types are inferred for any columns not specified.
These scenarios are illustrated in this section. To start, create a simple DataFrame to use for this example.
[3]: import pandas as pd
import woodwork as ww
df = pd.DataFrame({
'integers': [-2, 30, 20],
'bools': [True, False, True],
'names': ["Jane Doe", "Bill Smith", "John Hancock"]
})
df
[3]: integers bools names
0 -2 True Jane Doe
1 30 False Bill Smith
2 20 True John Hancock
Importing Woodwork creates a special namespace on the DataFrame, called ww, that can be used to initialize and
modify Woodwork information for a DataFrame. Now that you’ve created the data to use for the example, you can
initialize Woodwork on this DataFrame, assigning logical type values to each of the columns. Then view the types
stored for each column by using the DataFrame.ww.types property.
[4]: logical_types = {
'integers': 'Integer',
'bools': 'Boolean',
'names': 'PersonFullName'
}
df.ww.init(logical_types=logical_types)
df.ww.types
[4]: Physical Type Logical Type Semantic Tag(s)
Column
integers int64 Integer ['numeric']
bools bool Boolean []
names string PersonFullName []
As you can see, the logical types that you specified have been assigned to each of the columns. Now assign only one
logical type value, and let Woodwork infer the types for the other columns.
[5]: logical_types = {
'names': 'PersonFullName'
}
df.ww.init(logical_types=logical_types)
df.ww
[5]: Physical Type Logical Type Semantic Tag(s)
Column
integers int64 Integer ['numeric']
bools bool Boolean []
names string PersonFullName []
With that input, you get the same results. Woodwork used the PersonFullName logical type you assigned to the
names column and then correctly inferred the logical types for the integers and bools columns.
Next, look at what happens if we do not specify any logical types.
[6]: df.ww.init()
df.ww
[6]: Physical Type Logical Type Semantic Tag(s)
Column
integers int64 Integer ['numeric']
bools bool Boolean []
names category Categorical ['category']
In this case, Woodwork correctly inferred the types for the integers and bools columns, but failed to recognize that the names column should have a logical type of PersonFullName. In situations like this, Woodwork provides users the ability to change the logical type.
Update the logical type of the names column to be PersonFullName.
If you look carefully at the output, you can see that several things happened to the names column. First, the correct PersonFullName logical type has been applied. Second, the physical type of the column has changed from category to string to match the standard physical type for the PersonFullName logical type. Finally, the standard tag of category that was previously set for the names column has been removed because it no longer applies.
When setting the LogicalType for a column, the type can be specified by passing a string representing the CamelCase name of the LogicalType class, as you have done in previous examples. Alternatively, you can pass the class itself, or the snake_case name of the type, instead of the CamelCase string. All of these are valid values for setting the PersonFullName logical type: PersonFullName, "PersonFullName", or "person_full_name".
Note that in order to use the class name, you first have to import the class.
Woodwork provides several methods for working with semantic types. You can add and remove specific tags, or you
can reset the tags to their default values. In this section, you learn how to use those methods.
Standard Tags
As mentioned above, Woodwork applies standard semantic tags to columns by default, based on the logical type that was specified or inferred. If this behavior is undesirable, it can be disabled by setting the parameter use_standard_tags to False when initializing Woodwork.
[8]: df.ww.init(use_standard_tags=False)
df.ww
[8]: Physical Type Logical Type Semantic Tag(s)
Column
integers int64 Integer []
bools bool Boolean []
names category Categorical []
As can be seen in the output above, when initializing Woodwork with use_standard_tags set to False, all
semantic tags are empty. The only exception to this is if the index or time index column were set. We discuss that in
more detail later on.
Create a new Woodwork DataFrame with the standard tags, and specify some additional user-defined semantic tags
during creation.
[9]: semantic_tags = {
'bools': 'user_status',
'names': 'legal_name'
}
df.ww.init(semantic_tags=semantic_tags)
df.ww
[9]: Physical Type Logical Type Semantic Tag(s)
Column
integers int64 Integer ['numeric']
bools bool Boolean ['user_status']
names category Categorical ['category', 'legal_name']
Woodwork has applied the tags you specified, along with any standard tags, to the columns in your DataFrame.
After initializing Woodwork, you have changed your mind and decided you don’t like the tag of user_status that
you applied to the bools column. Now you want to remove it. You can do that with the remove_semantic_tags
method.
[10]: df.ww.remove_semantic_tags({'bools':'user_status'})
df.ww
[10]: Physical Type Logical Type Semantic Tag(s)
Column
integers int64 Integer ['numeric']
bools bool Boolean []
names category Categorical ['category', 'legal_name']
All tags can be reset to their default values by using the reset_semantic_tags method. If use_standard_tags is True, the tags are reset to the standard tags. Otherwise, the tags are reset to empty sets.
[13]: df.ww.reset_semantic_tags()
df.ww
[13]: Physical Type Logical Type Semantic Tag(s)
Column
integers int64 Integer ['numeric']
bools bool Boolean []
names category Categorical ['category']
In this case, since you initialized Woodwork with the default behavior of using standard tags, calling reset_semantic_tags resulted in all of the semantic tags being reset to the standard tags for each column.
When initializing Woodwork, you have the option to specify which column represents the index and which column
represents the time index. If these columns are specified, semantic tags of index and time_index are applied to
the specified columns. Behind the scenes, Woodwork is performing additional validation checks on the columns to
make sure they are appropriate. For example, index columns must be unique, and time index columns must contain
datetime values or numeric values.
Because of the need for these validation checks, you can’t set the index or time_index tags directly on a column.
In order to designate a column as the index, the set_index method should be used. Similarly, in order to set the time
index column, the set_time_index method should be used. Optionally, these can be specified when initializing
Woodwork by using the index or time_index parameters.
Create a new sample DataFrame that contains columns that can be used as index and time index columns and initialize
Woodwork.
[14]: df = pd.DataFrame({
'index': [0, 1, 2],
'id': [1, 2, 3],
'times': pd.to_datetime(['2020-09-01', '2020-09-02', '2020-09-03']),
'numbers': [10, 20, 30]
})
df.ww.init()
df.ww
[14]: Physical Type Logical Type Semantic Tag(s)
Column
index int64 Integer ['numeric']
id int64 Integer ['numeric']
times datetime64[ns] Datetime []
numbers int64 Integer ['numeric']
Without specifying an index or time index column during initialization, Woodwork has inferred that the index and
id columns are integers and the numeric semantic tag has been applied. You can now set the index column with the
set_index method.
[15]: df.ww.set_index('index')
df.ww
[15]: Physical Type Logical Type Semantic Tag(s)
Column
index int64 Integer ['index']
id int64 Integer ['numeric']
times datetime64[ns] Datetime []
numbers int64 Integer ['numeric']
Inspecting the types now reveals that the index semantic tag has been added to the index column, and the numeric
standard tag has been removed. You can also check that the index has been set correctly by checking the value of the
DataFrame.ww.index attribute.
[16]: df.ww.index
[16]: 'index'
If you want to change the index column to be the id column instead, you can do that with another call to set_index.
[17]: df.ww.set_index('id')
df.ww
[17]: Physical Type Logical Type Semantic Tag(s)
Column
index int64 Integer ['numeric']
id int64 Integer ['index']
times datetime64[ns] Datetime []
numbers int64 Integer ['numeric']
The index tag has been removed from the index column and added to the id column. The numeric standard tag
that was originally present on the index column has been added back.
Setting the time index works similarly to setting the index. You can now set the time index with the
set_time_index method.
[18]: df.ww.set_time_index('times')
df.ww
[18]: Physical Type Logical Type Semantic Tag(s)
Column
index int64 Integer ['numeric']
id int64 Integer ['index']
times datetime64[ns] Datetime ['time_index']
numbers int64 Integer ['numeric']
After calling set_time_index, the time_index semantic tag has been added to the semantic tags for the times
column.
The logical types, physical types, and semantic tags described above make up a DataFrame’s typing information,
which will be referred to as its “schema”. For Woodwork to be useful, the schema must be valid with respect to its
DataFrame.
[19]: df.ww.schema
[19]: Logical Type Semantic Tag(s)
Column
index Integer ['numeric']
id Integer ['index']
times Datetime ['time_index']
numbers Integer ['numeric']
The Woodwork schema shown above can be seen reflected in the DataFrame below. Every column present in the
schema is present in the DataFrame, the dtypes all match the physical types defined by each column’s LogicalType,
and the Woodwork index column is both unique and matches the DataFrame’s underlying index.
[20]: df
[20]: index id times numbers
1 0 1 2020-09-01 10
2 1 2 2020-09-02 20
3 2 3 2020-09-03 30
[21]: df.dtypes
[21]: index int64
id int64
times datetime64[ns]
numbers int64
dtype: object
Woodwork defines the elements of a valid schema, and maintaining schema validity requires that the DataFrame follow
Woodwork’s type system. For this reason, it is not recommended to perform DataFrame operations directly on the
DataFrame; instead, you should go through the ww namespace. Woodwork will attempt to retain a valid schema for
any operations performed through the ww namespace. If a DataFrame operation called through the ww namespace
invalidates the Woodwork schema defined for that DataFrame, the typing information will be removed.
Therefore, when performing Woodwork operations, you can be sure that if the schema is present on df.ww.schema
then the schema is valid for that DataFrame.
Given a DataFrame and its Woodwork typing information, the schema will be considered valid if:
• All of the columns present in the schema are present on the DataFrame and vice versa
• The physical type used by each column’s Logical Type matches the corresponding series’ dtype
• If an index is present, the index column is unique [pandas only]
• If an index is present, the DataFrame’s underlying index matches the index column exactly [pandas only]
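The four conditions above can be sketched as a plain-pandas check. This is a simplification for illustration, not Woodwork's implementation; the function name and the `expected_dtypes` mapping are assumptions made for the sketch.

```python
import pandas as pd

def schema_is_valid(df, expected_dtypes, index_col=None):
    """Sketch of the four schema-validity conditions (pandas input only)."""
    # 1. Schema columns and DataFrame columns must match exactly
    if set(df.columns) != set(expected_dtypes):
        return False
    # 2. Each column's dtype must match the physical type in the schema
    if any(str(df[col].dtype) != dtype for col, dtype in expected_dtypes.items()):
        return False
    if index_col is not None:
        # 3. The index column must contain only unique values
        if not df[index_col].is_unique:
            return False
        # 4. The underlying index must match the index column exactly
        if not (df.index == df[index_col]).all():
            return False
    return True
```

A dtype change (for example via astype) flips check 2, which mirrors the schema-invalidation behavior Woodwork enforces.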
Calling sort_values on a DataFrame, for example, will not invalidate a DataFrame’s schema, as none of the above
properties get broken. In the example below, a new DataFrame is created with the columns sorted in descending order,
and it has Woodwork initialized. Looking at the schema, you will see that it’s exactly the same as the schema of the
original DataFrame.
[23]: sorted_df.ww
[23]: Physical Type Logical Type Semantic Tag(s)
Column
index int64 Integer ['numeric']
id int64 Integer ['index']
times datetime64[ns] Datetime ['time_index']
numbers int64 Integer ['numeric']
Conversely, changing a column’s dtype so that it does not match the corresponding physical type by calling astype
on a DataFrame will invalidate the schema, removing it from the DataFrame. The resulting DataFrame will not have
Woodwork initialized, and a warning will be raised explaining why the schema was invalidated.
dtype mismatch for column numbers between DataFrame dtype, float64, and Integer dtype, int64.
Woodwork provides two helper functions that allow you to check whether a schema is valid for a given DataFrame.
The ww.is_schema_valid function returns a boolean indicating whether or not the schema is valid for the
DataFrame.
Check whether the schema from df is valid for the sorted_df created above.
The function ww.get_invalid_schema_message can be used to obtain a string message indicating the reason
for an invalid schema. If the schema is valid, this function will return None.
Use the function to determine why the schema from df is invalid for the astype_df created above.
Woodwork contains global configuration options that you can use to control the behavior of certain aspects of
Woodwork. This guide provides an overview of working with those options, including viewing the current settings and
updating the config values.
The output of ww.config lists each of the available config variables followed by its current setting.
In the output above, the natural_language_threshold config variable has been set to 10 and the
numeric_categorical_threshold has been set to -1.
Updating a config variable is done simply with a call to the ww.config.set_option function. This function
requires two arguments: the name of the config variable to update and the new value to set.
As an example, update the natural_language_threshold config variable to have a value of 25 instead of the
default value of 10.
As you can see from the output above, the value for the natural_language_threshold config variable has
been updated to 25.
If you need access to the value that is set for a specific config variable, you can access it with the
ww.config.get_option function, passing in the name of the config variable for which you want the value.
[3]: ww.config.get_option('natural_language_threshold')
[3]: 25
Config variables can be reset to their default values using the ww.config.reset_option function, passing in the
name of the variable to reset.
As an example, reset the natural_language_threshold config variable to its default value.
[4]: ww.config.reset_option('natural_language_threshold')
ww.config
[4]: Woodwork Global Config Settings
-------------------------------
natural_language_threshold: 10
numeric_categorical_threshold: -1
This section provides an overview of the current config options that can be set within Woodwork.
The natural_language_threshold config variable helps control the distinction between Categorical and
NaturalLanguage logical types during type inference. More specifically, this threshold represents the average
string length that is used to distinguish between these two types. If the average string length in a column is greater than
this threshold, the column is inferred as a NaturalLanguage column; otherwise, it is inferred as a Categorical
column. The natural_language_threshold config variable defaults to 10.
Woodwork provides the option to infer numeric columns as the Categorical logical type if they have few enough
unique values. The numeric_categorical_threshold config variable allows users to set the threshold of
unique values below which numeric columns are inferred as categorical. The default threshold is -1, meaning that
numeric columns are never inferred as categorical by default, since a column's unique-value count is never less
than zero and therefore can never fall below -1.
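The two threshold rules described above can be sketched in plain Python. These helper functions are illustrations of the rules as stated, not Woodwork's actual inference code.

```python
def infer_string_column(values, natural_language_threshold=10):
    """Sketch: columns whose average string length exceeds the threshold
    are treated as NaturalLanguage, otherwise as Categorical."""
    avg_len = sum(len(v) for v in values) / len(values)
    return 'NaturalLanguage' if avg_len > natural_language_threshold else 'Categorical'

def numeric_is_categorical(values, numeric_categorical_threshold=-1):
    """Sketch: numeric columns are inferred as Categorical only when their
    unique-value count falls below the threshold; -1 means never."""
    return len(set(values)) < numeric_categorical_threshold
```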
Woodwork provides methods on your DataFrames to allow you to use the typing information stored by Woodwork to
better understand your data.
Follow along to learn how to use Woodwork’s statistical methods on a DataFrame of retail data while demonstrating
the full capabilities of the functions.
from woodwork.demo import load_retail

df = load_retail()
df.ww
[1]: Physical Type Logical Type Semantic Tag(s)
Column
order_product_id category Categorical ['index']
order_id category Categorical ['category']
product_id category Categorical ['category']
description string NaturalLanguage []
quantity int64 Integer ['numeric']
order_date datetime64[ns] Datetime ['time_index']
unit_price float64 Double ['numeric']
customer_name category Categorical ['category']
country category Categorical ['category']
total float64 Double ['numeric']
cancelled bool Boolean []
DataFrame.ww.describe
Use df.ww.describe() to calculate statistics for the columns in a DataFrame, returning the results in the format
of a pandas DataFrame with the relevant calculations done for each column.
[2]: df.ww.describe()
[2]: order_id product_id description \
physical_type category category string
logical_type Categorical Categorical NaturalLanguage
semantic_tags {category} {category} {}
count 401604 401604 401604
nunique 22190 3684 NaN
nan_count 0 0 0
mean NaN NaN NaN
mode 576339 85123A WHITE HANGING HEART T-LIGHT HOLDER
std NaN NaN NaN
min NaN NaN NaN
first_quartile NaN NaN NaN
second_quartile NaN NaN NaN
third_quartile NaN NaN NaN
max NaN NaN NaN
num_true NaN NaN NaN
num_false NaN NaN NaN
DataFrame.ww.value_counts
Use df.ww.value_counts() to calculate the most frequent values for each column that has category as a
standard tag. This returns a dictionary where each column is associated with a sorted list of dictionaries. Each
dictionary contains value and count.
[3]: df.ww.value_counts()
[3]: {'order_product_id': [{'value': 0, 'count': 1},
{'value': 267744, 'count': 1},
{'value': 267742, 'count': 1},
{'value': 267741, 'count': 1},
{'value': 267740, 'count': 1},
{'value': 267739, 'count': 1},
{'value': 267738, 'count': 1},
{'value': 267737, 'count': 1},
{'value': 267736, 'count': 1},
{'value': 267735, 'count': 1}],
'order_id': [{'value': '576339', 'count': 542},
{'value': '579196', 'count': 533},
{'value': '580727', 'count': 529},
{'value': '578270', 'count': 442},
{'value': '573576', 'count': 435},
{'value': '567656', 'count': 421},
{'value': '567183', 'count': 392},
{'value': '575607', 'count': 377},
{'value': '571441', 'count': 364},
{'value': '570488', 'count': 353}],
'product_id': [{'value': '85123A', 'count': 2065},
{'value': '22423', 'count': 1894},
{'value': '85099B', 'count': 1659},
{'value': '47566', 'count': 1409},
{'value': '84879', 'count': 1405},
{'value': '20725', 'count': 1346},
{'value': '22720', 'count': 1224},
{'value': 'POST', 'count': 1196},
{'value': '22197', 'count': 1110},
{'value': '23203', 'count': 1108}],
'customer_name': [{'value': 'Mary Dalton', 'count': 7812},
{'value': 'Dalton Grant', 'count': 5898},
{'value': 'Jeremy Woods', 'count': 5128},
{'value': 'Jasmine Salazar', 'count': 4459},
{'value': 'James Robinson', 'count': 2759},
{'value': 'Bryce Stewart', 'count': 2478},
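The per-column structure shown above can be reproduced with plain pandas. This is a sketch of the output format only, not Woodwork's implementation; the helper name is an assumption.

```python
import pandas as pd

def top_value_counts(series, top_n=10):
    """Sketch of the per-column structure df.ww.value_counts returns:
    a list of {'value': ..., 'count': ...} dicts sorted by frequency."""
    counts = series.value_counts().head(top_n)
    return [{'value': value, 'count': int(count)} for value, count in counts.items()]

top_value_counts(pd.Series(['a', 'b', 'a']))
# [{'value': 'a', 'count': 2}, {'value': 'b', 'count': 1}]
```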
DataFrame.ww.mutual_information
df.ww.mutual_information calculates the mutual information between all pairs of relevant columns. Certain
types, like strings, can’t have mutual information calculated.
The mutual information between columns A and B can be understood as the amount of knowledge you can have
about column A if you have the values of column B. The more mutual information there is between A and B, the less
uncertainty there is in A knowing B, and vice versa.
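The quantity described above can be sketched for discrete columns with a few lines of standard-library Python. Note this returns raw mutual information in nats; Woodwork additionally bins numeric data and may scale its results differently.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Sketch: discrete mutual information between two equal-length sequences,
    computed from joint and marginal frequencies (in nats)."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )
```

A column compared with itself yields its own entropy (maximal shared information), while two independent columns yield zero.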
[4]: df.ww.mutual_information()
[4]: column_1 column_2 mutual_info
0 order_id customer_name 0.886411
1 order_id product_id 0.475745
2 product_id unit_price 0.426383
3 order_id order_date 0.391906
4 product_id customer_name 0.361855
5 order_date customer_name 0.187982
6 quantity total 0.184497
7 customer_name country 0.155593
8 product_id total 0.152183
9 order_id total 0.129882
10 order_id country 0.126048
11 order_id quantity 0.114714
12 unit_price total 0.103210
13 customer_name total 0.099530
14 product_id quantity 0.088663
15 quantity customer_name 0.085515
16 quantity unit_price 0.082515
17 order_id unit_price 0.077681
18 product_id order_date 0.057175
19 total cancelled 0.044032
20 unit_price customer_name 0.041308
21 quantity cancelled 0.035528
22 product_id country 0.028569
23 country total 0.025071
24 order_id cancelled 0.022204
25 quantity country 0.021515
26 order_date country 0.010361
27 customer_name cancelled 0.006456
Available Parameters
df.ww.mutual_information provides various parameters for tuning the mutual information calculation.
• num_bins - In order to calculate mutual information on continuous data, Woodwork bins numeric data into
categories. This parameter allows you to choose the number of bins with which to categorize data.
– Defaults to using 10 bins
– The more bins there are, the more finely the numeric data is divided into categories. The number of bins
used should accurately portray the spread of the data.
• nrows - If nrows is set at a value below the number of rows in the DataFrame, that number of rows is randomly
sampled from the underlying data
– Defaults to using all the available rows.
– Decreasing the number of rows can speed up the mutual information calculation on a DataFrame with
many rows, but you should be careful that the number being sampled is large enough to accurately portray
the data.
• include_index - If set to True and an index is defined with a logical type that is valid for mutual
information, the index column will be included in the mutual information output.
– Defaults to False
Now that you understand the parameters, you can explore changing the number of bins. Note that this only affects
the numeric columns quantity and unit_price. Increase the number of bins from 10 to 50, only showing the
impacted columns.
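The cell making this call did not survive extraction; the Woodwork call would be `df.ww.mutual_information(num_bins=50)`. The binning step it tunes can be sketched with pandas (this uses pd.cut for illustration; it is not Woodwork's exact discretization code):

```python
import pandas as pd

# Continuous values are discretized into categories before mutual
# information is computed; more bins means finer-grained categories.
prices = pd.Series([1.0, 2.5, 2.6, 4.0, 9.9, 10.0])
coarse = pd.cut(prices, bins=10)
fine = pd.cut(prices, bins=50)
```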
In order to include the index column in the mutual information output, run the calculation with
include_index=True.
[7]: mi = df.ww.mutual_information(include_index=True)
mi[mi['column_1'].isin(['order_product_id']) | mi['column_2'].isin(['order_product_id'])]
Woodwork allows you to add custom typing information to Dask DataFrames or Koalas DataFrames when working
with datasets that are too large to easily fit in memory. Although initializing Woodwork on a Dask or Koalas DataFrame
follows the same process as you follow when initializing on a pandas DataFrame, there are a few limitations to be aware
of. This guide provides a brief overview of using Woodwork with a Dask or Koalas DataFrame. Along the way, the
guide highlights several key items to keep in mind when using a Dask or Koalas DataFrame as input.
Using Woodwork with Dask or Koalas requires the corresponding library to be installed. These libraries can be
installed directly with the following commands:
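The specific commands did not survive extraction. For pip, they are likely the optional add-on extras mentioned in the install section (the extras names below are an assumption):

```shell
# Dask support
python -m pip install "woodwork[dask]"

# Koalas support (also pulls in pyspark)
python -m pip install "woodwork[koalas]"
```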
Create a Dask DataFrame to use in this example. Normally you would create the DataFrame directly by reading in the
data from saved files, but here you will create it from a demo pandas DataFrame.
Now that you have a Dask DataFrame, you can use it to create a Woodwork DataFrame, just as you would with a
pandas DataFrame:
[2]: df_dask.ww.init(index='order_product_id')
df_dask.ww
[2]: Physical Type Logical Type Semantic Tag(s)
Column
order_product_id int64 Integer ['index']
order_id category Categorical ['category']
product_id category Categorical ['category']
description string NaturalLanguage []
quantity int64 Integer ['numeric']
order_date datetime64[ns] Datetime []
unit_price float64 Double ['numeric']
customer_name string NaturalLanguage []
country string NaturalLanguage []
total float64 Double ['numeric']
cancelled bool Boolean []
As you can see from the output above, Woodwork was initialized successfully, and logical type inference was
performed for all of the columns.
However, this illustrates one of the key issues in working with Dask DataFrames. In order to perform logical type
inference, Woodwork needs to bring the data into memory so it can be analyzed. Currently, Woodwork reads data
from the first partition only, and then uses it for type inference. Depending on the complexity of the
data, this could be a time-consuming operation. Additionally, if the first partition is not representative of the entire
dataset, the logical types for some columns may be inferred incorrectly.
If this process takes too much time, or if the logical types are not inferred correctly, you can manually specify the
logical types for each column. If the logical type for a column is specified, type inference for that column will
be skipped. If logical types are specified for all columns, logical type inference will be skipped completely and
Woodwork will not need to bring any of the data into memory during initialization.
To skip logical type inference completely or to correct type inference issues, define a logical types dictionary with the
correct logical type defined for each column in the DataFrame, then pass that dictionary to the initialization call.
[3]: logical_types = {
'order_product_id': 'Integer',
'order_id': 'Categorical',
'product_id': 'Categorical',
'description': 'NaturalLanguage',
'quantity': 'Integer',
'order_date': 'Datetime',
'unit_price': 'Double',
'customer_name': 'PersonFullName',
'country': 'Categorical',
'total': 'Double',
'cancelled': 'Boolean',
}
df_dask.ww.init(index='order_product_id', logical_types=logical_types)
df_dask.ww
[3]: Physical Type Logical Type Semantic Tag(s)
Column
order_product_id int64 Integer ['index']
order_id category Categorical ['category']
product_id category Categorical ['category']
description string NaturalLanguage []
quantity int64 Integer ['numeric']
order_date datetime64[ns] Datetime []
unit_price float64 Double ['numeric']
customer_name string PersonFullName []
country category Categorical ['category']
total float64 Double ['numeric']
cancelled bool Boolean []
DataFrame Statistics
There are some Woodwork methods that require bringing the underlying Dask DataFrame into memory: describe,
value_counts and mutual_information. When called, these methods will call a compute operation on the
DataFrame to calculate the desired information. This might be problematic for datasets that cannot fit in memory, so
exercise caution when using these methods.
[4]: df_dask.ww.describe(include=['numeric'])
[4]: quantity unit_price total
physical_type int64 float64 float64
[5]: df_dask.ww.value_counts()
[5]: {'order_id': [{'value': '536464', 'count': 81},
{'value': '536520', 'count': 71},
{'value': '536412', 'count': 68},
{'value': '536401', 'count': 64},
{'value': '536415', 'count': 59},
{'value': '536409', 'count': 54},
{'value': '536408', 'count': 48},
{'value': '536381', 'count': 35},
{'value': '536488', 'count': 34},
{'value': '536446', 'count': 31}],
'product_id': [{'value': '22632', 'count': 11},
{'value': '85123A', 'count': 10},
{'value': '22633', 'count': 10},
{'value': '22961', 'count': 9},
{'value': '84029E', 'count': 9},
{'value': '22866', 'count': 7},
{'value': '84879', 'count': 7},
{'value': '22960', 'count': 7},
{'value': '21212', 'count': 7},
{'value': '22197', 'count': 7}],
'country': [{'value': 'United Kingdom', 'count': 964},
{'value': 'France', 'count': 20},
{'value': 'Australia', 'count': 14},
{'value': 'Netherlands', 'count': 2}]}
[6]: df_dask.ww.mutual_information().head()
[6]: column_1 column_2 mutual_info
0 order_id order_date 0.777905
1 order_id product_id 0.595564
2 product_id unit_price 0.517738
3 product_id total 0.433166
4 product_id order_date 0.404885
As above, first create a Koalas DataFrame to use in our example. Normally you create the DataFrame directly by
reading in the data from saved files, but here you create it from a demo pandas DataFrame.
[7]: # The two lines below only need to be executed if you do not have Spark properly configured.
# However if you are running into config errors, this resource may be useful:
# https://stackoverflow.com/questions/52133731/how-to-solve-cant-assign-requested-address-service-sparkdriver-failed-after
df_koalas = ks.from_pandas(df_pandas)
df_koalas.head()
[8]: order_product_id order_id product_id description quantity order_date unit_price customer_name country total cancelled
Now that you have a Koalas DataFrame, you can initialize Woodwork, just as you would with a pandas DataFrame:
[9]: df_koalas.ww.init(index='order_product_id')
df_koalas.ww
[9]: Physical Type Logical Type Semantic Tag(s)
Column
order_product_id int64 Integer ['index']
order_id string Categorical ['category']
product_id string Categorical ['category']
description string NaturalLanguage []
quantity int64 Integer ['numeric']
order_date datetime64[ns] Datetime []
unit_price float64 Double ['numeric']
customer_name string NaturalLanguage []
country string NaturalLanguage []
total float64 Double ['numeric']
cancelled bool Boolean []
As you can see from the output above, Woodwork has been initialized successfully, and logical type inference was
performed for all of the columns.
In the types table above, one important thing to notice is that the physical types for the Koalas DataFrame are different
than the physical types for the Dask DataFrame. The reason for this is that Koalas does not support the category
dtype that is available with pandas and Dask.
When Woodwork is initialized, the dtypes of the DataFrame columns are converted to a set of standard dtypes, defined
by each LogicalType's primary_dtype property. By default, Woodwork uses the category dtype for any categorical
logical types, but this is not available with Koalas.
For LogicalTypes that have primary_dtype properties that are not compatible with Koalas, Woodwork will try to
convert the column dtype, but will be unsuccessful. At that point, Woodwork will use a backup dtype that is compatible
with Koalas. The implication of this is that using Woodwork with a Koalas DataFrame may result in dtype values that
are different than the values you would get when working with an otherwise identical pandas DataFrame.
Since Koalas does not support the category dtype, any column that is inferred or specified with a logical type
of Categorical will have its values converted to strings and stored with a dtype of string. This means that a
categorical column containing numeric values will be converted into the equivalent string values.
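The conversion described above can be sketched in plain pandas. This illustrates the fallback behavior only; it is not the code path Koalas or Woodwork actually executes.

```python
import pandas as pd

# Without a 'category' dtype available, categorical values (including
# numeric ones) end up stored as strings
numeric_categories = pd.Series([1, 2, 2])
as_strings = numeric_categories.astype('string')
```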
Finally, Koalas does not support the timedelta64[ns] dtype. Since there is no clean backup dtype for it, the
Timedelta LogicalType is not supported with Koalas DataFrames.
As with Dask, Woodwork must bring the data into memory so it can be analyzed for type inference. Currently,
Woodwork reads the first 100,000 rows of data to use for type inference when using a Koalas DataFrame as input. If
the first 100,000 rows are not representative of the entire dataset, the logical types for some columns might be inferred
incorrectly.
To skip logical type inference completely or to correct type inference issues, define a logical types dictionary with the
correct logical type defined for each column in the dataframe.
[10]: logical_types = {
'order_product_id': 'Integer',
'order_id': 'Categorical',
'product_id': 'Categorical',
'description': 'NaturalLanguage',
'quantity': 'Integer',
'order_date': 'Datetime',
'unit_price': 'Double',
'customer_name': 'PersonFullName',
'country': 'Categorical',
'total': 'Double',
'cancelled': 'Boolean',
}
df_koalas.ww.init(index='order_product_id', logical_types=logical_types)
df_koalas.ww
[10]: Physical Type Logical Type Semantic Tag(s)
Column
order_product_id int64 Integer ['index']
order_id string Categorical ['category']
product_id string Categorical ['category']
description string NaturalLanguage []
quantity int64 Integer ['numeric']
order_date datetime64[ns] Datetime []
DataFrame Statistics
As with Dask, running describe, value_counts or mutual_information requires bringing the data into
memory to perform the analysis. When called, these methods will call a to_pandas operation on the DataFrame
to calculate the desired information. This may be problematic for very large datasets, so exercise caution when using
these methods.
[11]: df_koalas.ww.describe(include=['numeric'])
[11]: quantity unit_price total
physical_type int64 float64 float64
logical_type Integer Double Double
semantic_tags {numeric} {numeric} {numeric}
count 1000.0 1000.0 1000.0
nunique 43.0 61.0 232.0
nan_count 0 0 0
mean 12.735 5.003658 40.390465
mode 1 2.0625 24.75
std 38.401634 9.73817 123.99357
min -24.0 0.165 -68.31
first_quartile 2.0 2.0625 5.709
second_quartile 4.0 3.34125 17.325
third_quartile 12.0 6.1875 33.165
max 600.0 272.25 2684.88
num_true NaN NaN NaN
num_false NaN NaN NaN
[12]: df_koalas.ww.value_counts()
[12]: {'order_id': [{'value': '536464', 'count': 81},
{'value': '536520', 'count': 71},
{'value': '536412', 'count': 68},
{'value': '536401', 'count': 64},
{'value': '536415', 'count': 59},
{'value': '536409', 'count': 54},
{'value': '536408', 'count': 48},
{'value': '536381', 'count': 35},
{'value': '536488', 'count': 34},
{'value': '536446', 'count': 31}],
'product_id': [{'value': '22632', 'count': 11},
{'value': '85123A', 'count': 10},
{'value': '22633', 'count': 10},
{'value': '84029E', 'count': 9},
{'value': '22961', 'count': 9},
{'value': '22197', 'count': 7},
{'value': '21212', 'count': 7},
{'value': '22960', 'count': 7},
{'value': '84879', 'count': 7},
{'value': '22866', 'count': 7}],
[13]: df_koalas.ww.mutual_information().head()
[13]: column_1 column_2 mutual_info
0 order_id order_date 0.777905
1 order_id product_id 0.595564
2 product_id unit_price 0.517738
3 product_id total 0.433166
4 product_id order_date 0.404885
Woodwork performs several validation checks to confirm that the data in the DataFrame is appropriate for the specified
parameters. Because some of these validation steps would require pulling the data into memory, they are skipped when
using Woodwork with a Dask or Koalas DataFrame. This section provides an overview of the validation checks that
are performed with pandas input but skipped with Dask or Koalas input.
Index Uniqueness
Normally a check is performed to verify that any column specified as the index contains no duplicate values. With
Dask or Koalas input, this check is skipped and you must manually verify that any column specified as an index
column contains unique values.
If you manually define the LogicalType for a column when initializing Woodwork, a check is performed to verify
that the data in that column is appropriate for the specified LogicalType. For example, with pandas input if you
specify a LogicalType of Double for a column that contains letters such as ['a', 'b', 'c'], an error is raised
because it is not possible to convert the letters into numeric values with the float dtype associated with the Double
LogicalType.
With Dask input, no such error appears at the time of initialization. Behind the scenes, Woodwork attempts to
convert the column's physical type to float, and this conversion is simply added to the Dask task graph without
raising an error. The error is raised only when a compute operation is called on the DataFrame and Dask attempts to
execute the conversion step. Take extra care with Dask input to make sure any specified logical types are
consistent with the data in the columns to avoid this type of error.
For the Ordinal LogicalType, a check is typically performed to make sure that the data column does not contain any
values that are not present in the defined order values. This check will not be performed with Dask or Koalas input.
Users should manually verify that the defined order values are complete to avoid unexpected results.
Other Limitations
Woodwork provides the ability to read data directly from a CSV file into a Woodwork DataFrame. The helper function
used for this, woodwork.read_file, currently only reads the data into a pandas DataFrame. At some point,
this limitation may be removed, allowing data to be read into a Dask or Koalas DataFrame. For now, only pandas
DataFrames can be created with this function.
When initializing with a time index, Woodwork, by default, will sort the input DataFrame first on the time index and
then on the index, if specified. Because sorting a distributed DataFrame is a computationally expensive operation, this
sorting is performed only when using a pandas DataFrame. If a sorted DataFrame is needed when using Dask or
Koalas, the user should manually sort the DataFrame as needed.
In order to avoid bringing a Dask DataFrame into memory, Woodwork does not consider the equality of the data when
checking whether a Woodwork DataFrame initialized from a Dask or Koalas DataFrame is equal to another Woodwork
DataFrame. This means that two DataFrames with identical names, columns, indices, semantic tags, and LogicalTypes
but different underlying data will be treated as equal if at least one of them uses Dask or Koalas.
LatLong Columns
When working with the LatLong logical type, Woodwork converts all LatLong columns to a standard format of a tuple
of floats for Dask DataFrames and a list of floats for Koalas DataFrames. In order to do this, the data is read into
memory, which may be problematic for large datasets.
Woodwork allows column names of any format that is supported by the DataFrame. However, Dask DataFrames do
not currently support integer column names.
When specifying a Woodwork index with a pandas DataFrame, the underlying index of the DataFrame will be updated
to match the column specified as the Woodwork index. When specifying a Woodwork index on a Dask or Koalas
DataFrame, however, the underlying index will remain unchanged.
Make Index
When using make_index during Woodwork initialization, a new index column is added in-place to the existing
DataFrame. Because this type of in-place operation is not currently possible with Koalas, Woodwork does not support
make_index when working with a Koalas DataFrame.
If a new index column is needed, this should be added by the user prior to initializing Woodwork. This can be done
easily with an operation such as this:
df = df.koalas.attach_id_column('distributed-sequence', 'index_col_name')
The default type system in Woodwork contains many built-in LogicalTypes that work for a wide variety of datasets. For
situations in which the built-in LogicalTypes are not sufficient, Woodwork allows you to create custom LogicalTypes.
Woodwork also has a set of standard type inference functions that can help automatically identify correct
LogicalTypes in the data. You can override these existing functions, or add new functions for inferring any custom
LogicalTypes that are added.
This guide provides an overview of how to create custom LogicalTypes as well as how to override and add new type
inference functions. If you need to learn more about types and tags in Woodwork, refer to the Understanding Types
and Tags guide for more detail.
To view all of the default LogicalTypes in Woodwork, use the list_logical_types function. If the existing
types are not sufficient for your needs, you can create and register new LogicalTypes for use with Woodwork initialized
DataFrames and Series.
ww.list_logical_types()
[1]: name type_string \
0 Address address
1 Age age
2 AgeNullable age_nullable
3 Boolean boolean
4 BooleanNullable boolean_nullable
5 Categorical categorical
6 CountryCode country_code
7 Datetime datetime
8 Double double
9 EmailAddress email_address
10 Filepath filepath
11 IPAddress ip_address
12 Integer integer
description physical_type \
0 Represents Logical Types that contain address ... string
1 Represents Logical Types that contain non-nega... int64
2 Represents Logical Types that contain non-nega... Int64
3 Represents Logical Types that contain binary v... bool
4 Represents Logical Types that contain binary v... boolean
5 Represents Logical Types that contain unordere... category
6 Represents Logical Types that contain categori... category
7 Represents Logical Types that contain date and... datetime64[ns]
8 Represents Logical Types that contain positive... float64
9 Represents Logical Types that contain email ad... string
10 Represents Logical Types that specify location... string
11 Represents Logical Types that contain IP addre... string
12 Represents Logical Types that contain positive... int64
13 Represents Logical Types that contain positive... Int64
14 Represents Logical Types that contain latitude... object
15 Represents Logical Types that contain text or ... string
16 Represents Logical Types that contain ordered ... category
17 Represents Logical Types that may contain firs... string
18 Represents Logical Types that contain numeric ... string
19 Represents Logical Types that contain a series... category
20 Represents Logical Types that contain codes re... category
21 Represents Logical Types that contain values s... timedelta64[ns]
22 Represents Logical Types that contain URLs, wh... string
The first step in registering a new LogicalType is to define the class for the new type. This is done by sub-classing the
built-in LogicalType class. There are a few class attributes that should be set when defining this new class. Each
is reviewed in more detail below.
As an example, consider a dataset that contains UPC Codes. First, create a new UPCCode LogicalType, treating the
UPC Code as a type of categorical variable.
class UPCCode(LogicalType):
"""Represents Logical Types that contain 12-digit UPC Codes."""
primary_dtype = 'category'
backup_dtype = 'string'
standard_tags = {'category', 'upc_code'}
When defining the UPCCode LogicalType class, three class attributes were set. All three of these attributes are
optional, and will default to the values defined on the LogicalType class if they are not set when defining the new
type.
• primary_dtype: This value specifies how the data will be stored. If the column of the dataframe is not
already of this type, Woodwork will convert the data to this dtype. This should be specified as a string that
represents a valid pandas dtype. If not specified, this will default to 'string'.
• backup_dtype: This is primarily useful when working with Koalas dataframes. backup_dtype specifies
the dtype to use if Woodwork is unable to convert to the dtype specified by primary_dtype. In our example,
we set this to 'string' since Koalas does not currently support the 'category' dtype.
• standard_tags: This is a set of semantic tags to apply to any column that is set with the specified Logical-
Type. If not specified, standard_tags will default to an empty set.
• docstring: Adding a docstring for the class is optional, but if specified, this text will be used for adding a
description of the type in the list of available types returned by ww.list_logical_types().
Note: Behind the scenes, Woodwork uses the category and numeric semantic tags to determine whether a
column is categorical or numeric, respectively. If the new LogicalType you define represents a categorical or
numeric type, you should include the appropriate tag in the set of tags specified for standard_tags.
Now that you have created the new LogicalType, you can register it with the Woodwork type system so you can use
it. All modifications to the type system are performed by calling the appropriate method on the ww.type_system
object.
If you once again list the available LogicalTypes, you will see the new type you created was added to the list, including
the values for description, physical_type and standard_tags specified when defining the UPCCode LogicalType.
[4]: ww.list_logical_types()
description physical_type \
0 Represents Logical Types that contain address ... string
1 Represents Logical Types that contain non-nega... int64
2 Represents Logical Types that contain non-nega... Int64
3 Represents Logical Types that contain binary v... bool
4 Represents Logical Types that contain binary v... boolean
5 Represents Logical Types that contain unordere... category
6 Represents Logical Types that contain categori... category
7 Represents Logical Types that contain date and... datetime64[ns]
8 Represents Logical Types that contain positive... float64
9 Represents Logical Types that contain email ad... string
10 Represents Logical Types that specify location... string
11 Represents Logical Types that contain IP addre... string
12 Represents Logical Types that contain positive... int64
13 Represents Logical Types that contain positive... Int64
14 Represents Logical Types that contain latitude... object
15 Represents Logical Types that contain text or ... string
16 Represents Logical Types that contain ordered ... category
17 Represents Logical Types that may contain firs... string
18 Represents Logical Types that contain numeric ... string
19 Represents Logical Types that contain a series... category
20 Represents Logical Types that contain codes re... category
21 Represents Logical Types that contain values s... timedelta64[ns]
22 Represents Logical Types that contain 12-digit... category
23 Represents Logical Types that contain URLs, wh... string
When adding a new type to the type system, you can specify an optional parent LogicalType as done above. When
performing type inference a given set of data might match multiple different LogicalTypes. Woodwork uses the
parent-child relationship defined when registering a type to determine which type to infer in this case.
When multiple matches are found, Woodwork will return the most specific type match found. By setting the parent
type to Categorical when registering the UPCCode LogicalType, you are telling Woodwork that if a data column
matches both Categorical and UPCCode during inference, the column should be considered as UPCCode as this
is more specific than Categorical. Woodwork always assumes that a child type is a more specific version of the
parent type.
Next, you will create a small sample DataFrame to demonstrate use of the new custom type. This sample DataFrame
includes an id column, a column with valid UPC Codes, and a column that should not be considered UPC Codes
because it contains non-numeric values.
Before using this dataframe, update Woodwork’s default threshold for differentiating between a NaturalLanguage
and Categorical column so that Woodwork will correctly recognize the code column as a Categorical
column. After setting the threshold, initialize Woodwork and verify that Woodwork has identified our column as
Categorical.
The reason Woodwork did not identify the code column as having a UPCCode LogicalType is that you have not yet
defined an inference function to use with this type. The inference function is what tells Woodwork how to match
columns to specific LogicalTypes.
Even without the inference function, you can manually tell Woodwork that the code column should be of type
UPCCode. This will set the physical type properly and apply the standard semantic tags you have defined.
Next, add a new inference function and allow Woodwork to automatically set the correct type for the code column.
The first step in adding an inference function for the UPCCode LogicalType is to define an appropriate function.
Inference functions always accept a single parameter, a pandas.Series. The function should return True if the
series is a match for the LogicalType for which the function is associated, or False if the series is not a match.
For the UPCCode LogicalType, define a function to check that all of the values in a column are 12-character strings
that contain only numbers. Note that this function is for demonstration purposes only and may not catch all cases that
need to be considered for properly identifying a UPC Code.
After defining the new UPC Code inference function, add it to the Woodwork type system so it can be used when
inferring column types.
After updating the inference function, you can reinitialize Woodwork on the DataFrame. Notice that Woodwork has
correctly identified the code column to have a LogicalType of UPCCode and has correctly set the physical type and
added the standard tags to the semantic tags for that column.
Also note that the not_upc column was identified as Categorical. Even though this column contains 12-digit
strings, some of the values contain letters, and our inference function correctly told Woodwork this was not valid for
the UPCCode LogicalType.
[10]: df.ww.init()
df.ww
[10]: Physical Type Logical Type Semantic Tag(s)
Column
id int64 Integer ['numeric']
code category UPCCode ['upc_code', 'category']
not_upc category Categorical ['category']
Overriding the default inference functions is done with the update_inference_function TypeSystem method.
Simply pass in the LogicalType for which you want to override the function, along with the new function to use.
For example, you can tell Woodwork to use the new infer_upc_code function for the built-in Categorical
LogicalType.
If you initialize Woodwork on a DataFrame after updating the Categorical function, you can see that
the not_upc column is no longer identified as a Categorical column, but is instead set to the default
NaturalLanguage LogicalType. This is because the letters in the first row of the not_upc column cause our
inference function to return False for this column, while the default Categorical function allows non-numeric
values to be present. After updating the inference function, this column is no longer considered a match
for the Categorical type, nor does the column match any other logical types. As a result, the LogicalType is set to
NaturalLanguage, the default type used when no type matches are found.
[12]: df.ww.init()
df.ww
[12]: Physical Type Logical Type Semantic Tag(s)
Column
id int64 Integer ['numeric']
code category UPCCode ['upc_code', 'category']
not_upc string NaturalLanguage []
If you need to change the parent for a registered LogicalType, you can do this with the update_relationship
method. Update the new UPCCode LogicalType to be a child of NaturalLanguage instead.
The parent for a logical type can also be set to None to indicate this is a root-level LogicalType that is not a child of
any other existing LogicalType.
Setting the proper parent-child relationships between logical types is important. Because Woodwork will return the
most specific LogicalType match found during inference, improper inference can occur if the relationships are not set
correctly.
As an example, if you initialize Woodwork after setting the UPCCode LogicalType to have a parent of None, you
will now see that the UPC Code column is inferred as Categorical instead of UPCCode. After setting the parent
to None, UPCCode and Categorical are now siblings in the relationship graph instead of having a parent-child
relationship as they did previously. When Woodwork finds multiple matches on the same level in the relationship
graph, the first match is returned, which in this case is Categorical. Without proper parent-child relationships set,
Woodwork is unable to determine which LogicalType is most specific.
[15]: df.ww.init()
df.ww
[15]: Physical Type Logical Type Semantic Tag(s)
Column
id int64 Integer ['numeric']
code category Categorical ['category']
not_upc string NaturalLanguage []
Removing a LogicalType
If a LogicalType is no longer needed, or is unwanted, it can be removed from the type system with the remove_type
method. When a LogicalType has been removed, a value of False will be present in the is_registered column
for the type. If a LogicalType that has children is removed, all of the children types will have their parent set to the
parent of the LogicalType that is being removed, assuming a parent was defined.
Remove the custom UPCCode type and confirm it has been removed from the type system by listing the available
LogicalTypes. You can confirm that the UPCCode type will no longer be used because it will have a value of False
listed in the is_registered column.
[16]: ww.type_system.remove_type('UPCCode')
ww.list_logical_types()
[16]: name type_string \
0 Address address
1 Age age
2 AgeNullable age_nullable
3 Boolean boolean
4 BooleanNullable boolean_nullable
5 Categorical categorical
6 CountryCode country_code
7 Datetime datetime
8 Double double
9 EmailAddress email_address
10 Filepath filepath
11 IPAddress ip_address
12 Integer integer
13 IntegerNullable integer_nullable
14 LatLong lat_long
15 NaturalLanguage natural_language
16 Ordinal ordinal
17 PersonFullName person_full_name
18 PhoneNumber phone_number
19 PostalCode postal_code
20 SubRegionCode sub_region_code
21 Timedelta timedelta
22 UPCCode upc_code
23 URL url
description physical_type \
0 Represents Logical Types that contain address ... string
1 Represents Logical Types that contain non-nega... int64
2 Represents Logical Types that contain non-nega... Int64
3 Represents Logical Types that contain binary v... bool
4 Represents Logical Types that contain binary v... boolean
5 Represents Logical Types that contain unordere... category
6 Represents Logical Types that contain categori... category
7 Represents Logical Types that contain date and... datetime64[ns]
8 Represents Logical Types that contain positive... float64
9 Represents Logical Types that contain email ad... string
10 Represents Logical Types that specify location... string
11 Represents Logical Types that contain IP addre... string
12 Represents Logical Types that contain positive... int64
13 Represents Logical Types that contain positive... Int64
14 Represents Logical Types that contain latitude... object
15 Represents Logical Types that contain text or ... string
16 Represents Logical Types that contain ordered ... category
17 Represents Logical Types that may contain firs... string
18 Represents Logical Types that contain numeric ... string
19 Represents Logical Types that contain a series... category
20 Represents Logical Types that contain codes re... category
21 Represents Logical Types that contain values s... timedelta64[ns]
22 Represents Logical Types that contain 12-digit... category
23 Represents Logical Types that contain URLs, wh... string
Finally, if you made multiple changes to the default Woodwork type system and would like to reset everything back
to the default state, you can use the reset_defaults method as shown below. This unregisters any new types
you have registered, resets all relationships to their default values and sets all inference functions back to their default
functions.
[17]: ww.type_system.reset_defaults()
There may be times when you would like to override Woodwork's default LogicalTypes. An example might be if you
wanted to use the nullable Int64 dtype for the Integer LogicalType instead of the default dtype of int64. In this
case, you want to stop Woodwork from inferring the default Integer LogicalType and have a compatible
LogicalType inferred instead. You may solve this issue in one of two ways.
First, you can create an entirely new LogicalType with its own name, MyInteger, and register it in the TypeSystem.
If you want to infer it in place of the normal Integer LogicalType, you would remove Integer from the type
system and use Integer's default inference function for MyInteger. Doing this means MyInteger
will get inferred any place that Integer would have previously. Note that because Integer has a parent
LogicalType of IntegerNullable, you also need to set the parent of MyInteger to be IntegerNullable when
registering it with the type system.
[18]: from woodwork.logical_types import LogicalType
class MyInteger(LogicalType):
primary_dtype = 'Int64'
standard_tags = {'numeric'}
int_inference_fn = ww.type_system.inference_functions[ww.logical_types.Integer]
ww.type_system.remove_type(ww.logical_types.Integer)
ww.type_system.add_type(MyInteger, int_inference_fn, parent='IntegerNullable')
df.ww.init()
df.ww
[18]: Physical Type Logical Type Semantic Tag(s)
Column
id Int64 MyInteger ['numeric']
code category Categorical ['category']
not_upc category Categorical ['category']
Above, you can see that the id column, which was previously inferred as Integer, is now inferred as MyInteger
with the Int64 physical type. In the full list of Logical Types at ww.list_logical_types(), Integer
and MyInteger will now both be present, but Integer's is_registered value will be False while the value for
MyInteger will be True.
The second option for overriding the default Logical Types allows you to create a new LogicalType with the same
name as an existing one. This might be desirable because it will allow Woodwork to interpret the string 'Integer'
as your new LogicalType, allowing previous code that might have selected 'Integer' to be used without updating
references to a new LogicalType like MyInteger.
Before adding a LogicalType whose name already exists into the TypeSystem, you must first unregister the default
LogicalType.
To avoid a naming collision between the two Integer LogicalTypes in your local namespace, it is recommended to
reference Woodwork's default LogicalType as ww.logical_types.Integer.
[19]: ww.type_system.reset_defaults()
class Integer(LogicalType):
primary_dtype = 'Int64'
standard_tags = {'numeric'}
int_inference_fn = ww.type_system.inference_functions[ww.logical_types.Integer]
ww.type_system.remove_type(ww.logical_types.Integer)
ww.type_system.add_type(Integer, int_inference_fn, parent='IntegerNullable')
df.ww.init()
display(df.ww)
ww.type_system.reset_defaults()
Physical Type Logical Type Semantic Tag(s)
Column
id Int64 Integer ['numeric']
code category Categorical ['category']
not_upc category Categorical ['category']
Notice how id now gets inferred as an Integer Logical Type that has Int64 as its Physical Type!
API Reference
WoodworkTableAccessor
WoodworkTableAccessor(dataframe)
WoodworkTableAccessor.add_semantic_tags(...)  Adds specified semantic tags to columns, updating the Woodwork typing information.
WoodworkTableAccessor.describe([include])  Calculates statistics for data contained in the DataFrame.
WoodworkTableAccessor.describe_dict([include])  Calculates statistics for data contained in the DataFrame.
WoodworkTableAccessor.drop(columns)  Drop specified columns from a DataFrame.
WoodworkTableAccessor.iloc  Integer-location based indexing for selection by position.
WoodworkTableAccessor.index  The index column for the table
WoodworkTableAccessor.init([index, ...])  Initializes Woodwork typing information for a DataFrame.
WoodworkTableAccessor.loc  Access a group of rows by label(s) or a boolean array.
WoodworkTableAccessor.logical_types  A dictionary containing logical types for each column
WoodworkTableAccessor.mutual_information([...])  Calculates mutual information between all pairs of columns in the DataFrame that support mutual information.
WoodworkTableAccessor.mutual_information_dict([...])  Calculates mutual information between all pairs of columns in the DataFrame that support mutual information.
WoodworkTableAccessor.physical_types  A dictionary containing physical types for each column
WoodworkTableAccessor.pop(column_name)  Return a Series with Woodwork typing information and remove it from the DataFrame.
WoodworkTableAccessor.remove_semantic_tags(...)  Remove the semantic tags for any column names in the provided semantic_tags dictionary, updating the Woodwork typing information.
woodwork.table_accessor.WoodworkTableAccessor
class woodwork.table_accessor.WoodworkTableAccessor(dataframe)
__init__(dataframe)
Initialize self. See help(type(self)) for accurate signature.
Methods
Attributes
woodwork.table_accessor.WoodworkTableAccessor.add_semantic_tags
WoodworkTableAccessor.add_semantic_tags(semantic_tags)
Adds specified semantic tags to columns, updating the Woodwork typing information. Will retain any previously
set values.
Parameters semantic_tags (dict[str -> str/list/set]) – A dictionary mapping the
columns in the DataFrame to the tags that should be added to the column’s semantic tags
woodwork.table_accessor.WoodworkTableAccessor.describe
WoodworkTableAccessor.describe(include=None)
Calculates statistics for data contained in the DataFrame.
Parameters include (list[str or LogicalType], optional) – filter for which
columns to include in the statistics returned. Can be a list of column names, semantic tags,
logical types, or a list combining any of the three. The broadest matching specification is
followed: logical types are favored, then semantic tags, then column names. If no matching
columns are found, an empty DataFrame will be returned.
Returns A Dataframe containing statistics for the data or the subset of the original DataFrame that
contains the logical types, semantic tags, or column names specified in include.
Return type pd.DataFrame
woodwork.table_accessor.WoodworkTableAccessor.describe_dict
WoodworkTableAccessor.describe_dict(include=None)
Calculates statistics for data contained in the DataFrame.
Parameters include (list[str or LogicalType], optional) – filter for which
columns to include in the statistics returned. Can be a list of column names, semantic tags,
logical types, or a list combining any of the three. The broadest matching specification is
followed: logical types are favored, then semantic tags, then column names. If no matching
columns are found, an empty DataFrame will be returned.
Returns A dictionary with a key for each column in the data or for each column matching the logical
types, semantic tags or column names specified in include, paired with a value containing a
dictionary containing relevant statistics for that column.
Return type dict[str -> dict]
woodwork.table_accessor.WoodworkTableAccessor.drop
WoodworkTableAccessor.drop(columns)
Drop specified columns from a DataFrame.
Parameters columns (str or list[str]) – Column name or names to drop. Must be
present in the DataFrame.
Returns DataFrame with the specified columns removed, maintaining Woodwork typing informa-
tion.
Return type DataFrame
Note: This method is used for removing columns only. To remove rows with drop, go through the DataFrame
directly and then reinitialize Woodwork with DataFrame.ww.init instead of calling DataFrame.ww.drop.
woodwork.table_accessor.WoodworkTableAccessor.iloc
property WoodworkTableAccessor.iloc
Integer-location based indexing for selection by position. .iloc[] is primarily integer position based (from 0
to length-1 of the axis), but may also be used with a boolean array.
If the selection result is a DataFrame or Series, Woodwork typing information will be initialized for the returned
object when possible.
Allowed inputs are:
• An integer, e.g. 5.
• A list or array of integers, e.g. [4, 3, 0].
• A slice object with ints, e.g. 1:7.
• A boolean array.
• A callable function with one argument (the calling Series, DataFrame or Panel) and that returns valid
output for indexing (one of the above). This is useful in method chains, when you don't have a reference to
the calling object, but would like to base your selection on some value.
woodwork.table_accessor.WoodworkTableAccessor.index
property WoodworkTableAccessor.index
The index column for the table
woodwork.table_accessor.WoodworkTableAccessor.init
• make_index (bool, optional) – If True, will create a new unique, numeric index
column with the name specified by index and will add the new index column to the sup-
plied DataFrame. If True, the name specified in index cannot match an existing column
name in dataframe. If False, the name specified in index must match a column
present in the dataframe. Defaults to False.
• already_sorted (bool, optional) – Indicates whether the input DataFrame is al-
ready sorted on the time index. If False, will sort the dataframe first on the time_index and
then on the index (pandas DataFrame only). Defaults to False.
• name (str, optional) – Name used to identify the DataFrame.
• semantic_tags (dict, optional) – Dictionary mapping column names in Wood-
work to the semantic tags for the column. The keys in the dictionary should be strings that
correspond to column names. There are two options for specifying the dictionary values:
(str): If only one semantic tag is being set, a single string can be used as a value. (list[str]
or set[str]): If multiple tags are being set, a list or set of strings can be used as the value.
Semantic tags will be set to an empty set for any column not included in the dictionary.
• table_metadata (dict[str -> json serializable], optional) – Dic-
tionary containing extra metadata for Woodwork.
• column_metadata (dict[str -> dict[str -> json serializable]],
optional) – Dictionary mapping column names to that column’s metadata dictionary.
• use_standard_tags (bool, dict[str -> bool], optional) – Determines
whether standard semantic tags will be added to columns based on the specified logical type
for the column. If a single boolean is supplied, will apply the same use_standard_tags value
to all columns. A dictionary can be used to specify use_standard_tags values for
individual columns. Unspecified columns will use the default value. Defaults to True.
• column_descriptions (dict[str -> str], optional) – Dictionary map-
ping column names to column descriptions.
• schema (Woodwork.TableSchema, optional) – Typing information to use for the
DataFrame instead of performing inference. Any other arguments provided will be ignored.
Note that any changes made to the schema object after initialization will propagate to the
DataFrame. Similarly, to avoid unintended typing information changes, the same schema
object should not be shared between DataFrames.
• validate (bool, optional) – Whether parameter and data validation should occur.
Defaults to True. Warning: Should be set to False only when parameters and data are
known to be valid. Any errors resulting from skipping validation with invalid inputs may
not be easily understood.
woodwork.table_accessor.WoodworkTableAccessor.loc
property WoodworkTableAccessor.loc
Access a group of rows by label(s) or a boolean array.
.loc[] is primarily label based, but may also be used with a boolean array.
If the selection result is a DataFrame or Series, Woodwork typing information will be initialized for the returned
object when possible.
Allowed inputs are:
• A single label, e.g. 5 or 'a' (note that 5 is interpreted as a label of the index, and never as an integer
position along the index).
• A list or array of labels, e.g. ['a', 'b', 'c'].
• A slice object with labels, e.g. 'a':'f'.
• A boolean array of the same length as the axis being sliced, e.g. [True, False, True].
• An alignable boolean Series. The index of the key will be aligned before masking.
• An alignable Index. The Index of the returned selection will be the input.
• A callable function with one argument (the calling Series or DataFrame) and that returns valid output for
indexing (one of the above).
woodwork.table_accessor.WoodworkTableAccessor.logical_types
property WoodworkTableAccessor.logical_types
A dictionary containing logical types for each column
woodwork.table_accessor.WoodworkTableAccessor.mutual_information
woodwork.table_accessor.WoodworkTableAccessor.mutual_information_dict
Returns A list containing dictionaries that have keys column_1, column_2, and mutual_info, sorted
in descending order by mutual info. Mutual information values are between 0 (no mutual
information) and 1 (perfect dependency).
Return type list(dict)
woodwork.table_accessor.WoodworkTableAccessor.physical_types
property WoodworkTableAccessor.physical_types
A dictionary containing physical types for each column
woodwork.table_accessor.WoodworkTableAccessor.pop
WoodworkTableAccessor.pop(column_name)
Return a Series with Woodwork typing information and remove it from the DataFrame.
Parameters column_name (str) – Name of the column to pop.
Returns Popped series with Woodwork initialized
Return type Series
woodwork.table_accessor.WoodworkTableAccessor.remove_semantic_tags
WoodworkTableAccessor.remove_semantic_tags(semantic_tags)
Remove the semantic tags for any column names in the provided semantic_tags dictionary, updating the Wood-
work typing information. Including index or time_index tags will set the Woodwork index or time index to None
for the DataFrame.
Parameters semantic_tags (dict[str -> str/list/set]) – A dictionary mapping the
columns in the DataFrame to the tags that should be removed from the column’s semantic tags
woodwork.table_accessor.WoodworkTableAccessor.rename
WoodworkTableAccessor.rename(columns)
Renames columns in a DataFrame, maintaining Woodwork typing information.
Parameters columns (dict[str -> str]) – A dictionary mapping current column names to
new column names.
Returns DataFrame with the specified columns renamed, maintaining Woodwork typing informa-
tion.
Return type DataFrame
woodwork.table_accessor.WoodworkTableAccessor.reset_semantic_tags
WoodworkTableAccessor.reset_semantic_tags(columns=None, retain_index_tags=False)
Reset the semantic tags for the specified columns to the default values. The default values will be either an
empty set or a set of the standard tags based on the column logical type, controlled by the use_standard_tags
property on each column. Column names can be provided as a single string, a list of strings or a set of strings.
If columns is not specified, tags will be reset for all columns.
Parameters
• columns (str/list/set, optional) – The columns for which the semantic tags
should be reset.
• retain_index_tags (bool, optional) – If True, will retain any index or
time_index semantic tags set on the column. If False, will clear all semantic tags. Defaults
to False.
woodwork.table_accessor.WoodworkTableAccessor.schema
property WoodworkTableAccessor.schema
A copy of the Woodwork typing information for the DataFrame.
woodwork.table_accessor.WoodworkTableAccessor.select
WoodworkTableAccessor.select(include=None, exclude=None)
Create a DataFrame with Woodwork typing information initialized that includes only columns whose Logical Type and semantic tags match conditions specified in the list of types and tags to include or exclude. Values for both include and exclude cannot be provided in a single call.
If no matching columns are found, an empty DataFrame will be returned.
Parameters
• include (str or LogicalType or list[str or LogicalType]) – Logical types, semantic tags to include in the DataFrame.
• exclude (str or LogicalType or list[str or LogicalType]) – Logical types, semantic tags to exclude from the DataFrame.
Returns The subset of the original DataFrame that matches the conditions specified by include or exclude. Has Woodwork typing information initialized.
Return type DataFrame
woodwork.table_accessor.WoodworkTableAccessor.semantic_tags
property WoodworkTableAccessor.semantic_tags
A dictionary containing semantic tags for each column
60 Chapter 1. Woodwork is a library that helps with data typing of 2-dimensional tabular data
structures.
Woodwork Documentation, Release 0.3.1
woodwork.table_accessor.WoodworkTableAccessor.set_index
WoodworkTableAccessor.set_index(new_index)
Sets the index column of the DataFrame. Adds the ‘index’ semantic tag to the column and clears the tag from
any previously set index column.
Setting a column as the index column will also cause any previously set standard tags for the column to be
removed.
Passing in None clears the DataFrame’s index.
Parameters new_index (str) – The name of the column to set as the index
woodwork.table_accessor.WoodworkTableAccessor.set_time_index
WoodworkTableAccessor.set_time_index(new_time_index)
Set the time index. Adds the ‘time_index’ semantic tag to the column and clears the tag from any previously set time index column.
Parameters new_time_index (str) – The name of the column to set as the time index. If None,
will remove the time_index.
woodwork.table_accessor.WoodworkTableAccessor.set_types
woodwork.table_accessor.WoodworkTableAccessor.time_index
property WoodworkTableAccessor.time_index
The time index column for the table
woodwork.table_accessor.WoodworkTableAccessor.to_disk
Note: Because the fastparquet engine cannot handle nullable pandas dtypes, pyarrow will be used for serialization to parquet.
Parameters
• path (str) – Location on disk to write to (will be created as a directory)
• format (str) – Format to use for writing Woodwork data. Defaults to csv. Possible values
are: {‘csv’, ‘pickle’, ‘parquet’}.
• compression (str) – Name of the compression to use. Possible values are: {‘gzip’,
‘bz2’, ‘zip’, ‘xz’, None}.
• profile_name (str) – Name of AWS profile to use, False to use an anonymous profile,
or None.
• kwargs (keywords) – Additional keyword arguments to pass to the underlying serialization method or to specify an AWS profile.
woodwork.table_accessor.WoodworkTableAccessor.to_dictionary
WoodworkTableAccessor.to_dictionary()
Get a dictionary representation of the Woodwork typing information.
Returns Description of the typing information.
Return type dict
woodwork.table_accessor.WoodworkTableAccessor.types
property WoodworkTableAccessor.types
DataFrame containing the physical dtypes, logical types and semantic tags for the schema.
woodwork.table_accessor.WoodworkTableAccessor.use_standard_tags
property WoodworkTableAccessor.use_standard_tags
A dictionary containing the use_standard_tags setting for each column in the table
woodwork.table_accessor.WoodworkTableAccessor.value_counts
Parameters
• ascending (bool) – Defines whether each list of values should be sorted most frequent
to least frequent value (False), or least frequent to most frequent value (True). Defaults to
False.
• top_n (int) – the number of top values to retrieve. Defaults to 10.
• dropna (bool) – determines whether to remove NaN values when finding frequency.
Defaults to False.
Returns a list of dictionaries for each categorical column with keys count and value.
Return type list(dict)
WoodworkColumnAccessor

WoodworkColumnAccessor(series)

WoodworkColumnAccessor.add_semantic_tags(...) – Add the specified semantic tags to the set of tags.
WoodworkColumnAccessor.description – The description of the series
WoodworkColumnAccessor.iloc – Integer-location based indexing for selection by position.
WoodworkColumnAccessor.init([logical_type, ...]) – Initializes Woodwork typing information for a Series.
WoodworkColumnAccessor.loc – Access a group of rows by label(s) or a boolean array.
WoodworkColumnAccessor.logical_type – The logical type of the series
WoodworkColumnAccessor.metadata – The metadata of the series
WoodworkColumnAccessor.remove_semantic_tags(...) – Removes specified semantic tags from the current tags.
WoodworkColumnAccessor.reset_semantic_tags() – Reset the semantic tags to the default values.
WoodworkColumnAccessor.semantic_tags – The semantic tags assigned to the series
WoodworkColumnAccessor.set_logical_type(...) – Update the logical type for the series, clearing any previously set semantic tags, and returning a new series with Woodwork initialized.
WoodworkColumnAccessor.set_semantic_tags(...) – Replace current semantic tags with new values.
WoodworkColumnAccessor.use_standard_tags
woodwork.column_accessor.WoodworkColumnAccessor
class woodwork.column_accessor.WoodworkColumnAccessor(series)
__init__(series)
Initialize self. See help(type(self)) for accurate signature.
Methods
Attributes
woodwork.column_accessor.WoodworkColumnAccessor.add_semantic_tags
WoodworkColumnAccessor.add_semantic_tags(semantic_tags)
Add the specified semantic tags to the set of tags.
Parameters semantic_tags (str/list/set) – New semantic tag(s) to add
woodwork.column_accessor.WoodworkColumnAccessor.description
property WoodworkColumnAccessor.description
The description of the series
woodwork.column_accessor.WoodworkColumnAccessor.iloc
property WoodworkColumnAccessor.iloc
Integer-location based indexing for selection by position. .iloc[] is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array.
If the selection result is a Series, Woodwork typing information will be initialized for the returned Series.
Allowed inputs are:
• An integer, e.g. 5.
• A list or array of integers, e.g. [4, 3, 0].
• A slice object with ints, e.g. 1:7.
• A boolean array.
• A callable function with one argument (the calling Series, DataFrame or Panel) that returns valid output for indexing (one of the above). This is useful in method chains, when you don’t have a reference to the calling object, but would like to base your selection on some value.
woodwork.column_accessor.WoodworkColumnAccessor.init
woodwork.column_accessor.WoodworkColumnAccessor.loc
property WoodworkColumnAccessor.loc
Access a group of rows by label(s) or a boolean array.
.loc[] is primarily label based, but may also be used with a boolean array.
If the selection result is a Series, Woodwork typing information will be initialized for the returned Series.
Allowed inputs are:
• A single label, e.g. 5 or 'a' (note that 5 is interpreted as a label of the index, and never as an integer position along the index).
• A list or array of labels, e.g. ['a', 'b', 'c'].
• A slice object with labels, e.g. 'a':'f'.
• A boolean array of the same length as the axis being sliced, e.g. [True, False, True].
• An alignable boolean Series. The index of the key will be aligned before masking.
• An alignable Index. The Index of the returned selection will be the input.
• A callable function with one argument (the calling Series or DataFrame) that returns valid output for indexing (one of the above).
woodwork.column_accessor.WoodworkColumnAccessor.logical_type
property WoodworkColumnAccessor.logical_type
The logical type of the series
woodwork.column_accessor.WoodworkColumnAccessor.metadata
property WoodworkColumnAccessor.metadata
The metadata of the series
woodwork.column_accessor.WoodworkColumnAccessor.remove_semantic_tags
WoodworkColumnAccessor.remove_semantic_tags(semantic_tags)
Removes specified semantic tags from the current tags.
Parameters semantic_tags (str/list/set) – Semantic tag(s) to remove.
woodwork.column_accessor.WoodworkColumnAccessor.reset_semantic_tags
WoodworkColumnAccessor.reset_semantic_tags()
Reset the semantic tags to the default values. The default values will be either an empty set or a set of the
standard tags based on the column logical type, controlled by the use_standard_tags property.
Parameters None –
woodwork.column_accessor.WoodworkColumnAccessor.semantic_tags
property WoodworkColumnAccessor.semantic_tags
The semantic tags assigned to the series
woodwork.column_accessor.WoodworkColumnAccessor.set_logical_type
WoodworkColumnAccessor.set_logical_type(logical_type)
Update the logical type for the series, clearing any previously set semantic tags, and returning a new series with Woodwork initialized.
Parameters logical_type (LogicalType, str) – The new logical type to set for the series.
Returns A new series with the updated logical type.
Return type Series
woodwork.column_accessor.WoodworkColumnAccessor.set_semantic_tags
WoodworkColumnAccessor.set_semantic_tags(semantic_tags)
Replace current semantic tags with new values. If use_standard_tags is set to True for the series, any standard
tags associated with the LogicalType of the series will be added as well.
Parameters semantic_tags (str/list/set) – New semantic tag(s) to set
woodwork.column_accessor.WoodworkColumnAccessor.use_standard_tags
property WoodworkColumnAccessor.use_standard_tags
TableSchema

TableSchema(column_names, logical_types[, ...])

TableSchema.add_semantic_tags(semantic_tags) – Adds specified semantic tags to columns, updating the Woodwork typing information.
TableSchema.index – The index column for the table
TableSchema.logical_types – A dictionary containing logical types for each column
TableSchema.rename(columns) – Renames columns in a TableSchema
TableSchema.remove_semantic_tags(semantic_tags) – Remove the semantic tags for any column names in the provided semantic_tags dictionary, updating the Woodwork typing information.
TableSchema.reset_semantic_tags([columns, ...]) – Reset the semantic tags for the specified columns to the default values.
TableSchema.semantic_tags – A dictionary containing semantic tags for each column
TableSchema.set_index(new_index[, validate]) – Sets the index.
TableSchema.set_time_index(new_time_index[, ...]) – Set the time index.
TableSchema.set_types([logical_types, ...]) – Update the logical type and semantic tags for any column names in the provided types dictionaries, updating the TableSchema at those columns.
TableSchema.time_index – The time index column for the table
TableSchema.types – DataFrame containing the physical dtypes, logical types and semantic tags for the TableSchema.
TableSchema.use_standard_tags
woodwork.table_schema.TableSchema
Methods
Attributes
woodwork.table_schema.TableSchema.add_semantic_tags
TableSchema.add_semantic_tags(semantic_tags)
Adds specified semantic tags to columns, updating the Woodwork typing information. Will retain any previously
set values.
Parameters semantic_tags (dict[str -> str/list/set]) – A dictionary mapping the
columns in the DataFrame to the tags that should be added to the column’s semantic tags
woodwork.table_schema.TableSchema.index
property TableSchema.index
The index column for the table
woodwork.table_schema.TableSchema.logical_types
property TableSchema.logical_types
A dictionary containing logical types for each column
woodwork.table_schema.TableSchema.rename
TableSchema.rename(columns)
Renames columns in a TableSchema
Parameters columns (dict[str -> str]) – A dictionary mapping current column names to
new column names.
Returns TableSchema with the specified columns renamed.
Return type woodwork.TableSchema
woodwork.table_schema.TableSchema.remove_semantic_tags
TableSchema.remove_semantic_tags(semantic_tags)
Remove the semantic tags for any column names in the provided semantic_tags dictionary, updating the Woodwork typing information. Including index or time_index tags will set the Woodwork index or time index to None for the DataFrame.
Parameters semantic_tags (dict[str -> str/list/set]) – A dictionary mapping the
columns in the DataFrame to the tags that should be removed from the column’s semantic tags
woodwork.table_schema.TableSchema.reset_semantic_tags
TableSchema.reset_semantic_tags(columns=None, retain_index_tags=False)
Reset the semantic tags for the specified columns to the default values. The default values will be either an
empty set or a set of the standard tags based on the column logical type, controlled by the use_standard_tags
property on the table. Column names can be provided as a single string, a list of strings or a set of strings. If
columns is not specified, tags will be reset for all columns.
Parameters
• columns (str/list/set, optional) – The columns for which the semantic tags
should be reset.
• retain_index_tags (bool, optional) – If True, will retain any index or
time_index semantic tags set on the column. If False, will clear all semantic tags. Defaults
to False.
woodwork.table_schema.TableSchema.semantic_tags
property TableSchema.semantic_tags
A dictionary containing semantic tags for each column
woodwork.table_schema.TableSchema.set_index
TableSchema.set_index(new_index, validate=True)
Sets the index. Handles setting a new index, updating the index, or removing the index.
Parameters new_index (str) – Name of the new index column. Must be present in the Ta-
bleSchema. If None, will remove the index.
woodwork.table_schema.TableSchema.set_time_index
TableSchema.set_time_index(new_time_index, validate=True)
Set the time index. Adds the ‘time_index’ semantic tag to the column and clears the tag from any previously set time index column.
Parameters new_time_index (str) – The name of the column to set as the time index. If None,
will remove the time_index.
woodwork.table_schema.TableSchema.set_types
woodwork.table_schema.TableSchema.time_index
property TableSchema.time_index
The time index column for the table
woodwork.table_schema.TableSchema.types
property TableSchema.types
DataFrame containing the physical dtypes, logical types and semantic tags for the TableSchema.
woodwork.table_schema.TableSchema.use_standard_tags
property TableSchema.use_standard_tags
ColumnSchema
ColumnSchema([logical_type, semantic_tags, . . . ])
ColumnSchema.is_boolean Whether the ColumnSchema is a Boolean column
ColumnSchema.is_categorical Whether the ColumnSchema is categorical in nature
ColumnSchema.is_datetime Whether the ColumnSchema is a Datetime column
ColumnSchema.is_numeric Whether the ColumnSchema is numeric in nature
woodwork.table_schema.ColumnSchema
Methods
Attributes
woodwork.table_schema.ColumnSchema.is_boolean
property ColumnSchema.is_boolean
Whether the ColumnSchema is a Boolean column
woodwork.table_schema.ColumnSchema.is_categorical
property ColumnSchema.is_categorical
Whether the ColumnSchema is categorical in nature
woodwork.table_schema.ColumnSchema.is_datetime
property ColumnSchema.is_datetime
Whether the ColumnSchema is a Datetime column
woodwork.table_schema.ColumnSchema.is_numeric
property ColumnSchema.is_numeric
Whether the ColumnSchema is numeric in nature
Serialization
woodwork.serialize.typing_info_to_dict
woodwork.serialize.typing_info_to_dict(dataframe)
Creates the description for a Woodwork table, including typing information for each column and loading information.
Parameters dataframe (pd.DataFrame, dd.Dataframe, ks.DataFrame) –
DataFrame with Woodwork typing information initialized.
Returns Dictionary containing Woodwork typing information
Return type dict
woodwork.serialize.write_dataframe
woodwork.serialize.write_typing_info
woodwork.serialize.write_typing_info(typing_info, path)
Writes Woodwork typing information to the specified path at woodwork_typing_info.json
Parameters typing_info (dict) – Dictionary containing Woodwork typing information.
woodwork.serialize.write_woodwork_table
Deserialization
woodwork.deserialize.read_table_typing_information
woodwork.deserialize.read_table_typing_information(path)
Read Woodwork typing information from disk, S3 path, or URL.
Parameters path (str) – Location on disk, S3 path, or URL to read woodwork_typing_info.json.
Returns Woodwork typing information dictionary
Return type dict
woodwork.deserialize.read_woodwork_table
Logical Types
woodwork.logical_types.Address
class woodwork.logical_types.Address
Represents Logical Types that contain address values.
Examples
['1 Miller Drive, New York, NY 12345', '1 Berkeley Street, Boston, MA 67891']
['26387 Russell Hill, Dallas, TX 34521', '54305 Oxford Street, Seattle, WA 95132']
__init__()
Initialize self. See help(type(self)) for accurate signature.
Methods
Attributes
backup_dtype
primary_dtype
standard_tags
type_string
woodwork.logical_types.Age
class woodwork.logical_types.Age
Represents Logical Types that contain non-negative numbers indicating a person’s age. Has ‘numeric’ as a
standard tag.
Examples
__init__()
Initialize self. See help(type(self)) for accurate signature.
Methods
Attributes
backup_dtype
primary_dtype
standard_tags
type_string
woodwork.logical_types.AgeNullable
class woodwork.logical_types.AgeNullable
Represents Logical Types that contain non-negative numbers indicating a person’s age. Has ‘numeric’ as a
standard tag. May also contain null values.
Examples
__init__()
Initialize self. See help(type(self)) for accurate signature.
Methods
Attributes
backup_dtype
primary_dtype
standard_tags
type_string
woodwork.logical_types.Boolean
class woodwork.logical_types.Boolean
Represents Logical Types that contain binary values indicating true/false.
Examples
__init__()
Initialize self. See help(type(self)) for accurate signature.
Methods
Attributes
backup_dtype
primary_dtype
standard_tags
type_string
woodwork.logical_types.Categorical
class woodwork.logical_types.Categorical(encoding=None)
Represents Logical Types that contain unordered discrete values that fall into one of a set of possible values.
Has ‘category’ as a standard tag.
Examples
__init__(encoding=None)
Initialize self. See help(type(self)) for accurate signature.
Methods
Attributes
backup_dtype
primary_dtype
standard_tags
type_string
woodwork.logical_types.CountryCode
class woodwork.logical_types.CountryCode
Represents Logical Types that contain categorical information specifically used to represent countries. Has
‘category’ as a standard tag.
Examples
__init__()
Initialize self. See help(type(self)) for accurate signature.
Methods
Attributes
backup_dtype
primary_dtype
standard_tags
type_string
woodwork.logical_types.Datetime
class woodwork.logical_types.Datetime(datetime_format=None)
Represents Logical Types that contain date and time information.
Parameters datetime_format (str) – Desired datetime format for data
Examples
["2020-09-10",
"2020-01-10 00:00:00",
"01/01/2000 08:30"]
__init__(datetime_format=None)
Initialize self. See help(type(self)) for accurate signature.
Methods
Attributes
backup_dtype
datetime_format
primary_dtype
standard_tags
type_string
woodwork.logical_types.Double
class woodwork.logical_types.Double
Represents Logical Types that contain positive and negative numbers, some of which include a fractional component. Includes zero (0). Has ‘numeric’ as a standard tag.
Examples
__init__()
Initialize self. See help(type(self)) for accurate signature.
Methods
Attributes
backup_dtype
primary_dtype
standard_tags
type_string
woodwork.logical_types.EmailAddress
class woodwork.logical_types.EmailAddress
Represents Logical Types that contain email address values.
Examples
["[email protected]",
"[email protected]",
"[email protected]"]
__init__()
Initialize self. See help(type(self)) for accurate signature.
Methods
Attributes
backup_dtype
primary_dtype
standard_tags
type_string
woodwork.logical_types.Filepath
class woodwork.logical_types.Filepath
Represents Logical Types that specify locations of directories and files in a file system.
Examples
["/usr/local/bin",
"/Users/john.smith/dev/index.html",
"/tmp"]
__init__()
Initialize self. See help(type(self)) for accurate signature.
Methods
Attributes
backup_dtype
primary_dtype
standard_tags
type_string
woodwork.logical_types.Integer
class woodwork.logical_types.Integer
Represents Logical Types that contain positive and negative numbers without a fractional component, including
zero (0). Has ‘numeric’ as a standard tag.
Examples
[100, 35, 0]
[-54, 73, 11]
__init__()
Initialize self. See help(type(self)) for accurate signature.
Methods
Attributes
backup_dtype
primary_dtype
standard_tags
type_string
woodwork.logical_types.IPAddress
class woodwork.logical_types.IPAddress
Represents Logical Types that contain IP addresses, including both IPv4 and IPv6 addresses.
Examples
["172.16.254.1",
"192.0.0.0",
"2001:0db8:0000:0000:0000:ff00:0042:8329"]
__init__()
Initialize self. See help(type(self)) for accurate signature.
Methods
Attributes
backup_dtype
primary_dtype
standard_tags
type_string
woodwork.logical_types.LatLong
class woodwork.logical_types.LatLong
Represents Logical Types that contain latitude and longitude values in decimal degrees.
Note: LatLong values will be stored with the object dtype as a tuple of floats (or a list of floats for Koalas
DataFrames) and must contain only two values.
Null latitude or longitude values will be stored as np.nan, and a fully null LatLong (np.nan, np.nan) will be
stored as just a single nan.
Examples
[(33.670914, -117.841501),
(40.423599, -86.921162),
(-45.031705, nan)]
__init__()
Initialize self. See help(type(self)) for accurate signature.
Methods
Attributes
backup_dtype
primary_dtype
standard_tags
type_string
woodwork.logical_types.NaturalLanguage
class woodwork.logical_types.NaturalLanguage
Represents Logical Types that contain text or characters representing natural human language
Examples
__init__()
Initialize self. See help(type(self)) for accurate signature.
Methods
Attributes
backup_dtype
primary_dtype
standard_tags
type_string
woodwork.logical_types.Ordinal
class woodwork.logical_types.Ordinal(order)
Represents Logical Types that contain ordered discrete values. Has ‘category’ as a standard tag.
Parameters order (list or tuple) – A list or tuple specifying the order of the ordinal values from low to high. The underlying series cannot contain values that are not present in the order values.
Examples
__init__(order)
Initialize self. See help(type(self)) for accurate signature.
Methods
Attributes
backup_dtype
primary_dtype
standard_tags
type_string
woodwork.logical_types.PersonFullName
class woodwork.logical_types.PersonFullName
Represents Logical Types that may contain first, middle and last names, including honorifics and suffixes.
Examples
__init__()
Initialize self. See help(type(self)) for accurate signature.
Methods
Attributes
backup_dtype
primary_dtype
standard_tags
type_string
woodwork.logical_types.PhoneNumber
class woodwork.logical_types.PhoneNumber
Represents Logical Types that contain numeric digits and characters representing a phone number
Examples
["1-(555)-123-5495",
"+1-555-123-5495",
"5551235495"]
__init__()
Initialize self. See help(type(self)) for accurate signature.
Methods
Attributes
backup_dtype
primary_dtype
standard_tags
type_string
woodwork.logical_types.PostalCode
class woodwork.logical_types.PostalCode
Represents Logical Types that contain a series of postal codes for representing a group of addresses. Has
‘category’ as a standard tag.
Examples
["90210",
"60018-0123",
"SW1A"]
__init__()
Initialize self. See help(type(self)) for accurate signature.
Methods
Attributes
backup_dtype
primary_dtype
standard_tags
type_string
woodwork.logical_types.SubRegionCode
class woodwork.logical_types.SubRegionCode
Represents Logical Types that contain codes representing a portion of a larger geographic region. Has ‘category’
as a standard tag.
Examples
__init__()
Initialize self. See help(type(self)) for accurate signature.
Methods
Attributes
backup_dtype
primary_dtype
standard_tags
type_string
woodwork.logical_types.Timedelta
class woodwork.logical_types.Timedelta
Represents Logical Types that contain values specifying a duration of time
Examples
__init__()
Initialize self. See help(type(self)) for accurate signature.
Methods
Attributes
backup_dtype
primary_dtype
standard_tags
type_string
woodwork.logical_types.URL
class woodwork.logical_types.URL
Represents Logical Types that contain URLs, which may include protocol, hostname and file name
Examples
["http://google.com",
"https://example.com/index.html",
"example.com"]
__init__()
Initialize self. See help(type(self)) for accurate signature.
Methods
Attributes
backup_dtype
primary_dtype
standard_tags
type_string
TypeSystem

TypeSystem([inference_functions, ...])

TypeSystem.add_type(logical_type[, ...]) – Add a new LogicalType to the TypeSystem, optionally specifying the corresponding inference function and a parent type.
TypeSystem.infer_logical_type(series) – Infer the logical type for the given series
TypeSystem.remove_type(logical_type) – Remove a logical type from the TypeSystem.
TypeSystem.reset_defaults() – Reset type system to the default settings that were specified at initialization.
TypeSystem.update_inference_function(...) – Update the inference function for the specified LogicalType.
TypeSystem.update_relationship(logical_type, ...) – Add or update a relationship.
woodwork.type_sys.type_system.TypeSystem
class woodwork.type_sys.type_system.TypeSystem(inference_functions=None, relationships=None, default_type=NaturalLanguage)
Methods
Attributes
woodwork.type_sys.type_system.TypeSystem.add_type
woodwork.type_sys.type_system.TypeSystem.infer_logical_type
TypeSystem.infer_logical_type(series)
Infer the logical type for the given series
Parameters series (pandas.Series) – The series for which to infer the LogicalType.
woodwork.type_sys.type_system.TypeSystem.remove_type
TypeSystem.remove_type(logical_type)
Remove a logical type from the TypeSystem. Any children of the removed type will have their parent set to the parent of the removed type.
Parameters logical_type (LogicalType) – The LogicalType to remove.
woodwork.type_sys.type_system.TypeSystem.reset_defaults
TypeSystem.reset_defaults()
Reset type system to the default settings that were specified at initialization.
Parameters None –
woodwork.type_sys.type_system.TypeSystem.update_inference_function
TypeSystem.update_inference_function(logical_type, inference_function)
Update the inference function for the specified LogicalType.
Parameters
• logical_type (LogicalType) – The LogicalType for which to update the inference
function.
• inference_function (func) – The new inference function to use. Can be set to None
to skip type inference for the specified LogicalType.
woodwork.type_sys.type_system.TypeSystem.update_relationship
TypeSystem.update_relationship(logical_type, parent)
Add or update a relationship. If the specified LogicalType exists in the relationship graph, its parent will be
updated. If the specified LogicalType does not exist in relationships, the relationship will be added.
Parameters
• logical_type (LogicalType) – The LogicalType for which to update the parent
value.
• parent (LogicalType) – The new parent to set for the specified LogicalType.
Utils
Type Utils
woodwork.type_sys.utils.list_logical_types
woodwork.type_sys.utils.list_logical_types()
Returns a dataframe describing all of the available Logical Types.
Parameters None –
Returns A dataframe containing details on each LogicalType, including the corresponding physical
type and any standard semantic tags.
Return type pd.DataFrame
woodwork.type_sys.utils.list_semantic_tags
woodwork.type_sys.utils.list_semantic_tags()
Returns a dataframe describing all of the common semantic tags.
Parameters None –
Returns A dataframe containing details on each Semantic Tag, including the corresponding logical
type(s).
Return type pd.DataFrame
General Utils
woodwork.utils.get_valid_mi_types
woodwork.utils.get_valid_mi_types()
Generate a list of LogicalTypes that are valid for calculating mutual information. Note that index columns are
not valid for calculating mutual information, but their types may be returned by this function.
Parameters None –
Returns A list of the LogicalTypes that can be used to calculate mutual information
Return type list(LogicalType)
woodwork.utils.read_file
Parameters
• filepath (str) – A valid string path to the file to read
• content_type (str) – Content type of file to read
• name (str, optional) – Name used to identify the DataFrame.
• index (str, optional) – Name of the index column.
• time_index (str, optional) – Name of the time index column.
• semantic_tags (dict, optional) – Dictionary mapping column names in the dataframe to the semantic tags for the column. The keys in the dictionary should be strings that correspond to columns in the underlying dataframe. There are two options for specifying the dictionary values: (str): If only one semantic tag is being set, a single string can be used as a value. (list[str] or set[str]): If multiple tags are being set, a list or set of strings can be used as the value. Semantic tags will be set to an empty set for any column not included in the dictionary.
• logical_types (dict[str -> LogicalType], optional) – Dictionary mapping column names in the dataframe to the LogicalType for the column. LogicalTypes will be inferred for any columns not present in the dictionary.
• use_standard_tags (bool, optional) – If True, will add standard semantic tags
to columns based on the inferred or specified logical type for the column. Defaults to True.
• validate (bool, optional) – Whether parameter and data validation should occur.
Defaults to True. Warning: Should be set to False only when parameters and data are
known to be valid. Any errors resulting from skipping validation with invalid inputs may
not be easily understood.
• **kwargs – Additional keyword arguments to pass to the underlying pandas read file
function. For more information on available keywords refer to the pandas documentation.
Returns DataFrame created from the specified file with Woodwork typing information initialized.
Return type pd.DataFrame
woodwork.accessor_utils.get_invalid_schema_message
woodwork.accessor_utils.get_invalid_schema_message(dataframe, schema)
Return a message indicating the reason that the provided schema cannot be used to initialize Woodwork on the
dataframe. If the schema is valid for the dataframe, None will be returned.
Parameters
• dataframe (DataFrame) – The dataframe against which to check the schema.
• schema (ww.TableSchema) – The schema to use in the validity check.
Returns The reason that the schema is invalid for the dataframe
Return type str or None
woodwork.accessor_utils.init_series
woodwork.accessor_utils.is_schema_valid
woodwork.accessor_utils.is_schema_valid(dataframe, schema)
Check if a schema is valid for initializing Woodwork on a dataframe
Parameters
• dataframe (DataFrame) – The dataframe against which to check the schema.
• schema (ww.TableSchema) – The schema to use in the validity check.
Returns Boolean indicating whether the schema is valid for the dataframe
Return type boolean
Demo Data
load_retail([id, nrows, init_woodwork]) – Load a demo retail dataset into a DataFrame, optionally initializing Woodwork’s typing information.
woodwork.demo.load_retail
Release Notes
Warning: This Woodwork release uses a weak reference for maintaining a reference from the accessor
to the DataFrame. Because of this, chaining a Woodwork call onto another call that creates a new
DataFrame or Series object can be problematic.
Instead of calling pd.DataFrame({'id':[1, 2, 3]}).ww.init(), first store the
DataFrame in a new variable and then initialize Woodwork:
df = pd.DataFrame({'id':[1, 2, 3]})
df.ww.init()
• Enhancements
– Add deep parameter to Woodwork Accessor and Schema equality checks (#889)
– Add support for reading from parquet files to woodwork.read_file (#909)
• Changes
– Remove command line functions for list logical and semantic tags (#891)
– Keep index and time index tags for single column when selecting from a table (#888)
– Update accessors to store weak reference to data (#894)
• Documentation Changes
– Update nbsphinx version to fix docs build issue (#911, #913)
• Testing Changes
– Use Minimum Dependency Generator GitHub Action and remove tools folder (#897)
– Move all latest and minimum dependencies into 1 folder (#912)
Thanks to the following people for contributing to this release: @gsheni, @jeff-hernandez, @tamargrey,
@thehomebrewnerd
Breaking Changes
• Enhancements
– Add is_schema_valid and get_invalid_schema_message functions for
checking schema validity (#834)
– Add logical type for Age and AgeNullable (#849)
– Add logical type for Address (#858)
– Add generic to_disk function to save Woodwork schema and data (#872)
Breaking Changes
• Woodwork tables can no longer be saved to disk with df.ww.to_csv, df.ww.to_pickle, or df.ww.to_parquet. Use df.ww.to_disk instead.
• The read_csv function has been replaced by read_file.
• Enhancements
– Add validation control to WoodworkTableAccessor (#736)
– Store make_index value on WoodworkTableAccessor (#780)
– Add optional exclude parameter to WoodworkTableAccessor select method (#783)
– Add validation control to deserialize.read_woodwork_table and ww.read_csv (#788)
Breaking Changes
• Enhancements
– Implement Schema and Accessor API (#497)
– Add Schema class that holds typing info (#499)
– Add WoodworkTableAccessor class that performs type inference and stores Schema (#514)
– Allow initializing Accessor schema with a valid Schema object (#522)
– Add ability to read in a csv and create a DataFrame with an initialized Woodwork Schema
(#534)
– Add ability to call pandas methods from Accessor (#538, #589)
– Add helpers for checking if a column is one of Boolean, Datetime, numeric, or categorical
(#553)
– Add ability to load demo retail dataset with a Woodwork Accessor (#556)
– Add select to WoodworkTableAccessor (#548)
– Add mutual_information to WoodworkTableAccessor (#571)
– Add WoodworkColumnAccessor class (#562)
– Add semantic tag update methods to column accessor (#573)
– Add describe and describe_dict to WoodworkTableAccessor (#579)
– Add init_series util function for initializing a series with dtype change (#581)
– Add set_logical_type method to WoodworkColumnAccessor (#590)
– Add semantic tag update methods to table schema (#591)
– Add warning if additional parameters are passed along with schema (#593)
– Better warning when accessing column properties before init (#596)
– Update column accessor to work with LatLong columns (#598)
– Add set_index to WoodworkTableAccessor (#603)
– Implement loc and iloc for WoodworkColumnAccessor (#613)
– Add set_time_index to WoodworkTableAccessor (#612)
– Implement loc and iloc for WoodworkTableAccessor (#618)
– Allow updating logical types with set_types and make relevant DataFrame changes
(#619)
– Allow serialization of WoodworkColumnAccessor to csv, pickle, and parquet (#624)
– Add DaskColumnAccessor (#625)
– Allow deserialization from csv, pickle, and parquet to Woodwork table (#626)
– Add value_counts to WoodworkTableAccessor (#632)
– Add KoalasColumnAccessor (#634)
– Add pop to WoodworkTableAccessor (#636)
– Add drop to WoodworkTableAccessor (#640)
– Add rename to WoodworkTableAccessor (#646)
– Add DaskTableAccessor (#648)
– Add Schema properties to WoodworkTableAccessor (#651)
– Add KoalasTableAccessor (#652)
– Add __getitem__ to WoodworkTableAccessor (#633)
– Update Koalas min version and add support for more new pandas dtypes with Koalas (#678)
– Add __setitem__ to WoodworkTableAccessor (#669)
• Fixes
– Create new Schema object when performing pandas operation on Accessors (#595)
– Fix bug in _reset_semantic_tags causing columns to share same semantic tags set
(#666)
– Maintain column order in DataFrame and Woodwork repr (#677)
• Changes
– Move mutual information logic to statistics utils file (#584)
– Bump min Koalas version to 1.4.0 (#638)
– Preserve pandas underlying index when not creating a Woodwork index (#664)
– Restrict Koalas version to <1.7.0 due to breaking changes (#674)
– Clean up dtype usage across Woodwork (#682)
– Improve error when calling accessor properties or methods before init (#683)
– Remove dtype from Schema dictionary (#685)
– Add include_index param and allow unique columns in Accessor mutual information
(#699)
– Include DataFrame equality and use_standard_tags in WoodworkTableAccessor
equality check (#700)
– Remove DataTable and DataColumn classes to migrate towards the accessor approach
(#713)
– Change sample_series dtype to not need conversion and remove convert_series
util (#720)
– Rename Accessor methods since DataTable has been removed (#723)
• Documentation Changes
– Update README.md and Get Started guide to use accessor (#655, #717)
– Update Understanding Types and Tags guide to use accessor (#657)
– Update docstrings and API Reference page (#660)
– Update statistical insights guide to use accessor (#693)
– Update Customizing Type Inference guide to use accessor (#696)
– Update Dask and Koalas guide to use accessor (#701)
– Update index notebook and install guide to use accessor (#715)
– Add section to documentation about schema validity (#729)
– Update README.md and Get Started guide to use pd.read_csv (#730)
– Make small fixes to documentation formatting (#731)
• Testing Changes
– Add tests to Accessor/Schema that weren’t previously covered (#712, #716)
– Update release branch name in notes update check (#719)
Thanks to the following people for contributing to this release: @gsheni, @jeff-hernandez, @johnbridstrup, @tamargrey, @thehomebrewnerd
Breaking Changes
• The DataTable and DataColumn classes have been removed and replaced by new
WoodworkTableAccessor and WoodworkColumnAccessor classes which are used through the
ww namespace available on DataFrames after importing Woodwork.
• Changes
– Restrict Koalas version to <1.7.0 due to breaking changes (#674)
– Include unique columns in mutual information calculations (#687)
– Add parameter to include index column in mutual information calculations (#692)
• Documentation Changes
– Update to remove warning message from statistical insights guide (#690)
• Testing Changes
– Update branch reference in tests to run on main (#641)
– Make release notes updated check separate from unit tests (#642)
– Update release branch naming instructions (#644)
Thanks to the following people for contributing to this release: @gsheni, @tamargrey, @thehomebrewnerd
• Changes
– Avoid calculating mutual info for non-unique columns (#563)
– Preserve underlying DataFrame index if index column is not specified (#588)
– Add blank issue template for creating issues (#630)
• Testing Changes
– Update branch reference in tests workflow (#552, #601)
– Fixed text on back arrow on install page (#564)
– Refactor test_datatable.py (#574)
Thanks to the following people for contributing to this release: @gsheni, @jeff-hernandez, @johnbridstrup, @tamargrey
• Enhancements
– Add Python 3.9 support without Koalas testing (#511)
– Add get_valid_mi_types function to list LogicalTypes valid for mutual information
calculation (#517)
• Fixes
– Handle missing values in Datetime columns when calculating mutual information (#516)
– Support numpy 1.20.0 by restricting version for koalas and changing serialization error
message (#532)
– Move Koalas option setting to DataTable init instead of import (#543)
• Documentation Changes
– Add Alteryx OSS Twitter link (#519)
– Update logo and add new favicon (#521)
– Multiple improvements to Getting Started page and guides (#527)
– Clean up API Reference and docstrings (#536)
– Added Open Graph for Twitter and Facebook (#544)
Thanks to the following people for contributing to this release: @gsheni, @tamargrey, @thehomebrewnerd
• Enhancements
– Add DataTable.df property for accessing the underlying DataFrame (#470)
– Set index of underlying DataFrame to match DataTable index (#464)
• Fixes
– Sort underlying series when sorting dataframe (#468)
– Allow setting indices to current index without side effects (#474)
• Changes
– Fix release document with Github Actions link for CI (#462)
– Don’t allow registered LogicalTypes with the same name (#477)
– Move str_to_logical_type to TypeSystem class (#482)
– Remove pyarrow from core dependencies (#508)
Thanks to the following people for contributing to this release: @gsheni, @tamargrey, @thehomebrewnerd
• Enhancements
– Allow for user-defined logical types and inference functions in TypeSystem object (#424)
– Add __repr__ to DataTable (#425)
– Allow initializing DataColumn with numpy array (#430)
– Add drop to DataTable (#434)
– Migrate CI tests to Github Actions (#417, #441, #451)
– Add metadata to DataColumn for user-defined metadata (#447)
• Fixes
– Update DataColumn name when using setitem on column with no name (#426)
– Don’t allow pickle serialization for Koalas DataFrames (#432)
– Check DataTable metadata in equality check (#449)
– Propagate all attributes of DataTable in _new_dt_including (#454)
• Changes
– Update links to use alteryx org Github URL (#423)
– Support column names of any type allowed by the underlying DataFrame (#442)
– Use object dtype for LatLong columns for easy access to latitude and longitude values
(#414)
– Restrict dask version to prevent 2020.12.0 release from being installed (#453)
– Lower minimum requirement for numpy to 1.15.4, and set pandas minimum requirement to 1.1.1 (#459)
• Testing Changes
– Fix missing test coverage (#436)
Thanks to the following people for contributing to this release: @gsheni, @jeff-hernandez, @tamargrey,
@thehomebrewnerd
• Enhancements
– Add support for creating DataTable from Koalas DataFrame (#327)
– Add ability to initialize DataTable with numpy array (#367)
– Add describe_dict method to DataTable (#405)
– Add mutual_information_dict method to DataTable (#404)
– Add metadata to DataTable for user-defined metadata (#392)
– Add update_dataframe method to DataTable to update underlying DataFrame (#407)
– Sort dataframe if time_index is specified, bypass sorting with already_sorted parameter (#410)
– Add description attribute to DataColumn (#416)
– Implement DataColumn.__len__ and DataTable.__len__ (#415)
• Fixes
– Rename data_column.py to datacolumn.py (#386)
– Rename data_table.py to datatable.py (#387)
– Rename get_mutual_information to mutual_information (#390)
• Changes
– Lower moto test requirement for serialization/deserialization (#376)
– Make Koalas an optional dependency installable with woodwork[koalas] (#378)
– Remove WholeNumber LogicalType from Woodwork (#380)
– Updates to LogicalTypes to support Koalas 1.4.0 (#393)
– Replace set_logical_types and set_semantic_tags with just set_types
(#379)
– Remove copy_dataframe parameter from DataTable initialization (#398)
– Implement DataTable.__sizeof__ to return size of the underlying dataframe (#401)
– Include Datetime columns in mutual info calculation (#399)
– Maintain column order on DataTable operations (#406)
• Testing Changes
– Add pyarrow, dask, and koalas to automated dependency checks (#388)
– Use new version of pull request Github Action (#394)
– Improve parameterization for test_datatable_equality (#409)
Thanks to the following people for contributing to this release: @ctduffy, @gsheni, @tamargrey, @thehomebrewnerd
Breaking Changes
• Enhancements
– Add __eq__ to DataTable and DataColumn and update LogicalType equality (#318)
– Add value_counts() method to DataTable (#342)
– Support serialization and deserialization of DataTables via csv, pickle, or parquet (#293)
– Add shape property to DataTable and DataColumn (#358)
– Add iloc method to DataTable and DataColumn (#365)
– Add numeric_categorical_threshold config value to allow inferring numeric
columns as Categorical (#363)
– Add rename method to DataTable (#367)
• Fixes
– Catch non numeric time index at validation (#332)
• Changes
– Support logical type inference from a Dask DataFrame (#248)
– Fix validation checks and make_index to work with Dask DataFrames (#260)
– Skip validation of Ordinal order values for Dask DataFrames (#270)
– Improve support for datetimes with Dask input (#286)
– Update DataTable.describe to work with Dask input (#296)
– Update DataTable.get_mutual_information to work with Dask input (#300)
– Modify to_pandas function to return DataFrame with correct index (#281)
– Rename DataColumn.to_pandas method to DataColumn.to_series (#311)
– Rename DataTable.to_pandas method to DataTable.to_dataframe (#319)
– Remove UserWarning when no matching columns found (#325)
Breaking Changes
• Enhancements
– Add optional include parameter for DataTable.describe() to filter results (#228)
– Add make_index parameter to DataTable.__init__ to enable optional creation of
a new index column (#238)
– Add support for setting ranking order on columns with Ordinal logical type (#240)
– Add list_semantic_tags function and CLI to get dataframe of woodwork semantic_tags (#244)
– Add support for numeric time index on DataTable (#267)
– Add pop method to DataTable (#289)
– Add entry point to setup.py to run CLI commands (#285)
• Fixes
– Allow numeric datetime time indices (#282)
• Enhancements
– Implement setitem on DataTable to create/overwrite an existing DataColumn (#165)
– Add to_pandas method to DataColumn to access the underlying series (#169)
– Add list_logical_types function and CLI to get dataframe of woodwork LogicalTypes
(#172)
– Add describe method to DataTable to generate statistics for the underlying data (#181)
– Add optional return_dataframe parameter to load_retail to return either
DataFrame or DataTable (#189)
– Add get_mutual_information method to DataTable to generate mutual information
between columns (#203)
– Add read_csv function to create DataTable directly from CSV file (#222)
• Fixes
– Fix bug causing incorrect values for quartiles in DataTable.describe method (#187)
– Fix bug in DataTable.describe that could cause an error if certain semantic tags
were applied improperly (#190)
– Fix bug with instantiated LogicalTypes breaking when used with issubclass (#231)
• Changes
– Remove unnecessary add_standard_tags attribute from DataTable (#171)
– Remove standard tags from index column and do not return stats for index column from
DataTable.describe (#196)
• Fixes
– Fix formatting issue when printing global config variables (#138)
• Changes
– Change add_standard_tags to use_standard_tags to better describe behavior (#149)
– Change access of underlying dataframe to be through to_pandas with ._dataframe field on class (#146)
– Remove replace_none parameter to DataTables (#146)
• Documentation Changes
– Add working code example to README and create Using Woodwork page (#103)
Thanks to the following people for contributing to this release: @gsheni, @tamargrey, @thehomebrewnerd
INDEX

Symbols
__init__() (woodwork.column_accessor.WoodworkColumnAccessor method)
__init__() (woodwork.logical_types.Address method)
__init__() (woodwork.logical_types.Age method)
__init__() (woodwork.logical_types.AgeNullable method)
__init__() (woodwork.logical_types.Boolean method)
__init__() (woodwork.logical_types.Categorical method)
__init__() (woodwork.logical_types.CountryCode method)
__init__() (woodwork.logical_types.Datetime method)
__init__() (woodwork.logical_types.Double method)
__init__() (woodwork.logical_types.EmailAddress method)
__init__() (woodwork.logical_types.Filepath method)
__init__() (woodwork.logical_types.Integer method)
__init__() (woodwork.logical_types.IPAddress method)
__init__() (woodwork.logical_types.LatLong method)
__init__() (woodwork.logical_types.NaturalLanguage method)
__init__() (woodwork.logical_types.Ordinal method)
__init__() (woodwork.logical_types.PersonFullName method)
__init__() (woodwork.logical_types.PhoneNumber method)
__init__() (woodwork.logical_types.PostalCode method)
__init__() (woodwork.logical_types.SubRegionCode method)
__init__() (woodwork.logical_types.Timedelta method)
__init__() (woodwork.logical_types.URL method)
__init__() (woodwork.table_accessor.WoodworkTableAccessor method)
__init__() (woodwork.table_schema.ColumnSchema method)
__init__() (woodwork.table_schema.TableSchema method)
__init__() (woodwork.type_sys.type_system.TypeSystem method)

A
add_semantic_tags() (woodwork.column_accessor.WoodworkColumnAccessor method)
add_semantic_tags() (woodwork.table_accessor.WoodworkTableAccessor method)
add_semantic_tags() (woodwork.table_schema.TableSchema method)
add_type() (woodwork.type_sys.type_system.TypeSystem method)
Address (class in woodwork.logical_types)
Age (class in woodwork.logical_types)
AgeNullable (class in woodwork.logical_types)

B
Boolean (class in woodwork.logical_types)

C
Categorical (class in woodwork.logical_types)
ColumnSchema (class in woodwork.table_schema)
CountryCode (class in woodwork.logical_types)

D
Datetime (class in woodwork.logical_types)
describe() (woodwork.table_accessor.WoodworkTableAccessor method)

U
use_standard_tags (woodwork.table_schema.TableSchema property)

V
value_counts() (woodwork.table_accessor.WoodworkTableAccessor method)

W
WoodworkColumnAccessor (class in woodwork.column_accessor)
WoodworkTableAccessor (class in woodwork.table_accessor)
write_dataframe() (in module woodwork.serialize)
write_typing_info() (in module woodwork.serialize)
write_woodwork_table() (in module woodwork.serialize)