Woodwork Alteryx Com en v0.3.1

Woodwork Documentation

Release 0.3.1

Alteryx, Inc.

May 12, 2021

CONTENTS
1 Woodwork is a library that helps with data typing of 2-dimensional tabular data structures. 3
Index 113
i
ii
Woodwork Documentation, Release 0.3.1
CONTENTS 1
2 CONTENTS
CHAPTER
ONE
WOODWORK IS A LIBRARY THAT HELPS WITH DATA TYPING OF

2-DIMENSIONAL TABULAR DATA STRUCTURES.
It provides a special namespace on your DataFrame, ww, which contains the physical, logical, and semantic data types.
It can be used with Featuretools, EvalML, and general machine learning applications where logical and semantic
typing information is important.
Woodwork provides simple interfaces for adding and updating logical and semantic typing information, as well as
selecting data columns based on the types.
1.1 Quick Start
Below is an example of using Woodwork to automatically infer the Logical Types for a DataFrame and select columns
with specific types.
[1]: import woodwork as ww
df = ww.demo.load_retail(nrows=100, init_woodwork=False)
df.ww.init(name="retail")
df.ww
[1]: Physical Type Logical Type Semantic Tag(s)
Column
order_product_id int64 Integer ['numeric']
order_id int64 Integer ['numeric']
product_id category Categorical ['category']
description string NaturalLanguage []
quantity int64 Integer ['numeric']
order_date datetime64[ns] Datetime []
unit_price float64 Double ['numeric']
customer_name string NaturalLanguage []
country string NaturalLanguage []
total float64 Double ['numeric']
cancelled bool Boolean []
[2]: filtered_df = df.ww.select(include=['numeric', 'Boolean'])

filtered_df.head(5)
[2]: order_product_id order_id quantity unit_price total cancelled
0 0 536365 6 4.2075 25.245 False
1 1 536365 6 5.5935 33.561 False
2 2 536365 8 4.5375 36.300 False
(continues on next page)
3
(continued from previous page)

3 3 536365 6 5.5935 33.561 False
4 4 536365 6 5.5935 33.561 False
1.1.1 Table of contents
Install
Woodwork is available for Python 3.7, 3.8, and 3.9. It can be installed from PyPI, conda, or from source.
PyPI
To install Woodwork from PyPI, run the following command:
python -m pip install woodwork
Woodwork allows users to install add-ons individually or all at once. In order to install all add-ons, run:
python -m pip install "woodwork[complete]"
You can use Woodwork with Dask DataFrames by running:
python -m pip install "woodwork[dask]"
You can use Woodwork with Koalas DataFrames by running:
python -m pip install "woodwork[koalas]"
Conda
To install Woodwork from conda run the following command:
conda install -c conda-forge woodwork
Note: In order to use Woodwork with Dask or Koalas DataFrames, the following commands must be run for your
library of choice prior to installing Woodwork with conda: conda install dask for Dask or conda install
koalas and conda install pyspark for Koalas.
Source
To install Woodwork from source, clone the repository from Github, and install the dependencies.
git clone https://github.com/alteryx/woodwork.git

cd woodwork
python -m pip install .
4 Chapter 1. Woodwork is a library that helps with data typing of 2-dimensional tabular data
structures.
Dependencies
You can view a list of all Woodwork core dependencies in the requirements.txt file.
Optional Dependencies
Woodwork has several other dependencies that are used only for specific methods. Attempting to use one of these
methods without having the necessary library installed will result in an ImportError with instructions on how to
install the necessary dependency.
Dependency Min Version Notes

boto3 1.10.45 Required to read/write to URLs and S3
smart_open 5.0.0 Required to read/write to URLs and S3
pyarrow 3.0.0 Required to serialize to parquet
dask[distributed] 2.30.0 Required to use with Dask DataFrames
koalas 1.8.0 Required to use with Koalas DataFrames
pyspark 3.0.0 Required to use with Koalas DataFrames
Development
To make contributions to the codebase, please follow the guidelines here.
Get Started
In this guide, you walk through examples where you initialize Woodwork on a DataFrame and on a Series. Along
the way, you learn how to update and remove logical types and semantic tags. You also learn how to use typing
information to select subsets of data.
Types and Tags
Woodwork relies heavily on the concepts of physical types, logical types and semantic tags. These concepts are
covered in detail in Understanding Types and Tags, but we provide brief definitions here for reference:
• Physical Type: defines how the data is stored on disk or in memory.
• Logical Type: defines how the data should be parsed or interpreted.
• Semantic Tag(s): provides additional data about the meaning of the data or how it should be used.
Start learning how to use Woodwork by reading in a dataframe that contains retail sales data.
[1]: import pandas as pd
df = pd.read_csv("https://api.featurelabs.com/datasets/online-retail-logs-2018-08-28.
˓→csv")
df.head(5)
[1]: order_id product_id description quantity \
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6
1 536365 71053 WHITE METAL LANTERN 6
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6
1.1. Quick Start 5


4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6
order_date unit_price customer_name country total \

0 2010-12-01 08:26:00 4.2075 Andrea Brown United Kingdom 25.245
cancelled
0 False
1 False
2 False
3 False
4 False
As you can see, this is a dataframe containing several different data types, including dates, categorical values, numeric
values, and natural language descriptions. Next, initialize Woodwork on this DataFrame.
Initializing Woodwork on a DataFrame
Importing Woodwork creates a special namespace on your DataFrames, DataFrame.ww, that can be used to set or
update the typing information for the DataFrame. As long as Woodwork has been imported, initializing Woodwork on
a DataFrame is as simple as calling .ww.init() on the DataFrame of interest. An optional name parameter can be
specified to label the data.
df.ww.init(name="retail", make_index=True, index="order_product_id")

df.ww
Column
order_product_id int64 Integer ['index']
order_id category Categorical ['category']
Using just this simple call, Woodwork was able to infer the logical types present in the data by analyzing the DataFrame
dtypes as well as the information contained in the columns. In addition, Woodwork also added semantic tags to some
of the columns based on the logical types that were inferred. Because the original data did not contain an index
column, Woodwork’s make_index parameter was used to create a new index column in the DataFrame.
Warning: Woodwork uses a weak reference for maintaining a reference from the accessor to the DataFrame.
Because of this, chaining a Woodwork call onto another call that creates a new DataFrame or Series object can be
problematic.
structures.
Instead of calling pd.DataFrame({'id':[1, 2, 3]}).ww.init(), first store the DataFrame in a new

variable and then initialize Woodwork:
df = pd.DataFrame({'id':[1, 2, 3]})
df.ww.init()
All Woodwork methods and properties can be accessed through the ww namespace on the DataFrame. DataFrame
methods called from the Woodwork namespace will be passed to the DataFrame, and whenever possible, Woodwork
will be initialized on the returned object, assuming it is a Series or a DataFrame.
As an example, use the head method to create a new DataFrame containing the first 5 rows of the original data, with
Woodwork typing information retained.
[3]: head_df = df.ww.head(5)

head_df.ww
Column
[4]: head_df
[4]: order_product_id order_id product_id description \
0 0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER
1 1 536365 71053 WHITE METAL LANTERN
2 2 536365 84406B CREAM CUPID HEARTS COAT HANGER
3 3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE
4 4 536365 84029E RED WOOLLY HOTTIE WHITE HEART.
quantity order_date unit_price customer_name country \

0 6 2010-12-01 08:26:00 4.2075 Andrea Brown United Kingdom
total cancelled
0 25.245 False
1 33.561 False
2 36.300 False
3 33.561 False
4 33.561 False
Note: Once Woodwork is initialized on a DataFrame, it is recommended to go through the ww namespace when
performing DataFrame operations to avoid invalidating Woodwork’s typing information.
1.1. Quick Start 7

Updating Logical Types
If the initial inference was not to our liking, the logical type can be changed to a more appropriate value. Let’s change
some of the columns to a different logical type to illustrate this process. In this case, set the logical type for the
order_product_id and country columns to be Categorical and set customer_name to have a logical
type of PersonFullName.
[5]: df.ww.set_types(logical_types={
'customer_name': 'PersonFullName',
'country': 'Categorical',
'order_id': 'Categorical'
})
df.ww.types
Column
customer_name string PersonFullName []
country category Categorical ['category']
Inspect the information in the types output. There, you can see that the Logical type for the three columns has been
updated with the logical types you specified.
Selecting Columns
Now that you’ve prepared logical types, you can select a subset of the columns based on their logical types. Select
only the columns that have a logical type of Integer or Double.
[6]: numeric_df = df.ww.select(['Integer', 'Double'])
numeric_df.ww
Column
This selection process has returned a new Woodwork DataFrame containing only the columns that match the logical
types you specified. After you have selected the columns you want, you can use the DataFrame containing just those
columns as you normally would for any additional analysis.
[7]: numeric_df
[7]: order_product_id quantity unit_price total
0 0 6 4.2075 25.2450
1 1 6 5.5935 33.5610
2 2 8 4.5375 36.3000
3 3 6 5.5935 33.5610
structures.

4 4 6 5.5935 33.5610
... ... ... ... ...
401599 401599 12 1.4025 16.8300
401600 401600 6 3.4650 20.7900
401601 401601 4 6.8475 27.3900
401602 401602 4 6.8475 27.3900
401603 401603 3 8.1675 24.5025
[401604 rows x 4 columns]
Adding Semantic Tags
Next, let’s add semantic tags to some of the columns. Add the tag of product_details to the description
column, and tag the total column with currency.
[8]: df.ww.set_types(semantic_tags={'description':'product_details', 'total': 'currency'})

df.ww
Column
description string NaturalLanguage ['product_details']
total float64 Double ['numeric', 'currency']
Select columns based on a semantic tag. Only select the columns tagged with category.
[9]: category_df = df.ww.select('category')

category_df.ww
Column
Select columns using multiple semantic tags or a mixture of semantic tags and logical types.
[10]: category_numeric_df = df.ww.select(['numeric', 'category'])

category_numeric_df.ww
Column
total float64 Double ['currency', 'numeric']
1.1. Quick Start 9

[11]: mixed_df = df.ww.select(['Boolean', 'product_details'])

mixed_df.ww
Column
description string NaturalLanguage ['product_details']
To select an individual column, specify the column name. Woodwork will be initialized on the returned Series and
you can use the Series for additional analysis as needed.
[12]: total = df.ww['total']

total.ww
[12]: <Series: total (Physical Type = float64) (Logical Type = Double) (Semantic Tags = {
˓→'currency', 'numeric'})>
[13]: total
[13]: 0 25.2450
1 33.5610
2 36.3000
3 33.5610
4 33.5610
...
401599 16.8300
401600 20.7900
401601 27.3900
401602 27.3900
401603 24.5025
Name: total, Length: 401604, dtype: float64
Select multiple columns by supplying a list of column names.
[14]: multiple_cols_df = df.ww[['product_id', 'total', 'unit_price']]

multiple_cols_df.ww
Column
total float64 Double ['currency', 'numeric']
Removing Semantic Tags
Remove specific semantic tags from a column if they are no longer needed. In this example, remove the
product_details tag from the description column.
[15]: df.ww.remove_semantic_tags({'description':'product_details'})
df.ww
Column
structures.

total float64 Double ['numeric', 'currency']
Notice how the product_details tag has been removed from the description column. If you want to remove
all user-added semantic tags from all columns, you can do that, too.
[16]: df.ww.reset_semantic_tags()
df.ww
Column
order_product_id int64 Integer ['numeric']
Set Index and Time Index
At any point, you can designate certain columns as the Woodwork index or time_index with the methods
set_index and set_time_index. These methods can be used to assign these columns for the first time or to change
the column being used as the index or time index.
Index and time index columns contain index and time_index semantic tags, respectively.
[17]: df.ww.set_index('order_product_id')
df.ww.index
[17]: 'order_product_id'
[18]: df.ww.set_time_index('order_date')
df.ww.time_index
[18]: 'order_date'
[19]: df.ww
Column
order_date datetime64[ns] Datetime ['time_index']
1.1. Quick Start 11


Using Woodwork with a Series
Woodwork also can be used to store typing information on a Series. There are two approaches for initializing Wood-
work on a Series, depending on whether or not the Series dtype is the same as the physical type associated with the
LogicalType. For more information on logical types and physical types, refer to Understanding Types and Tags.
If your Series dtype matches the physical type associated with the specified or inferred LogicalType, Woodwork can
be initialized through the ww namespace, just as with DataFrames.
[20]: series = pd.Series([1, 2, 3], dtype='int64')
series.ww.init(logical_type='Integer')
series.ww
[20]: <Series: None (Physical Type = int64) (Logical Type = Integer) (Semantic Tags = {
˓→'numeric'})>
In the example above, we specified the Integer LogicalType for the Series. Because Integer has a physical
type of int64 and this matches the dtype used to create the Series, no Series dtype conversion was needed and the
initialization succeeds.
In cases where the LogicalType requires the Series dtype to change, a helper function ww.init_series must be
used. This function will return a new Series object with Woodwork initialized and the dtype of the series changed to
match the physical type of the LogicalType.
To demonstrate this case, first create a Series, with a string dtype. Then, initialize a Woodwork Series with a
Categorical logical type using the init_series function. Because Categorical uses a physical type of
category, the dtype of the Series must be changed, and that is why we must use the init_series function here.
The series that is returned will have Woodwork initialized with the LogicalType set to Categorical as expected,
with the expected dtype of category.
[21]: string_series = pd.Series(['a', 'b', 'a'], dtype='string')
ww_series = ww.init_series(string_series, logical_type='Categorical')
ww_series.ww
[21]: <Series: None (Physical Type = category) (Logical Type = Categorical) (Semantic Tags
˓→= {'category'})>
As with DataFrames, Woodwork provides several methods that can be used to update or change the typing information
associated with the series. As an example, add a new semantic tag to the series.
[22]: series.ww.add_semantic_tags('new_tag')
series.ww
[22]: <Series: None (Physical Type = int64) (Logical Type = Integer) (Semantic Tags = {'new_
˓→tag', 'numeric'})>
As you can see from the output above, the specified tag has been added to the semantic tags for the series.
You can also access Series properties methods through the Woodwork namespace. When possible, Woodwork typing
information will be retained on the value returned. As an example, you can access the Series shape property through
Woodwork.
structures.
[23]: series.ww.shape
[23]: (3,)
You can also call Series methods such as sample. In this case, Woodwork typing information is retained on the
Series returned by the sample method.
[24]: sample_series = series.ww.sample(2)
sample_series.ww
[24]: <Series: None (Physical Type = int64) (Logical Type = Integer) (Semantic Tags = {'new_
˓→tag', 'numeric'})>
[25]: sample_series
[25]: 2 3
0 1
dtype: int64
List Logical Types
Retrieve all the Logical Types present in Woodwork. These can be useful for understanding the Logical Types, as well
as how they are interpreted.
[26]: from woodwork.type_sys.utils import list_logical_types
list_logical_types()
[26]: name type_string \
0 Address address
1 Age age
2 AgeNullable age_nullable
3 Boolean boolean
4 BooleanNullable boolean_nullable
5 Categorical categorical
6 CountryCode country_code
7 Datetime datetime
8 Double double
9 EmailAddress email_address
10 Filepath filepath
11 IPAddress ip_address
12 Integer integer
13 IntegerNullable integer_nullable
14 LatLong lat_long
15 NaturalLanguage natural_language
16 Ordinal ordinal
17 PersonFullName person_full_name
18 PhoneNumber phone_number
19 PostalCode postal_code
20 SubRegionCode sub_region_code
21 Timedelta timedelta
22 URL url
description physical_type \
0 Represents Logical Types that contain address ... string
1 Represents Logical Types that contain non-nega... int64
2 Represents Logical Types that contain non-nega... Int64
1.1. Quick Start 13


3 Represents Logical Types that contain binary v... bool
4 Represents Logical Types that contain binary v... boolean
5 Represents Logical Types that contain unordere... category
6 Represents Logical Types that contain categori... category
7 Represents Logical Types that contain date and... datetime64[ns]
8 Represents Logical Types that contain positive... float64
9 Represents Logical Types that contain email ad... string
10 Represents Logical Types that specify location... string
11 Represents Logical Types that contain IP addre... string
12 Represents Logical Types that contain positive... int64
13 Represents Logical Types that contain positive... Int64
14 Represents Logical Types that contain latitude... object
15 Represents Logical Types that contain text or ... string
16 Represents Logical Types that contain ordered ... category
17 Represents Logical Types that may contain firs... string
18 Represents Logical Types that contain numeric ... string
19 Represents Logical Types that contain a series... category
20 Represents Logical Types that contain codes re... category
21 Represents Logical Types that contain values s... timedelta64[ns]
22 Represents Logical Types that contain URLs, wh... string
standard_tags is_default_type is_registered parent_type

0 {} True True NaturalLanguage
1 {numeric} True True Integer
2 {numeric} True True IntegerNullable
3 {} True True BooleanNullable
4 {} True True None
5 {category} True True None
6 {category} True True Categorical
7 {} True True None
8 {numeric} True True None
14 {} True True None
structures.
Guides
The guides below provide more detail on the functionality of Woodwork.
Understanding Types and Tags
Using Woodwork effectively requires a good understanding of physical types, logical types, and semantic tags, all
concepts that are core to Woodwork. This guide provides a detailed overview of types and tags, as well as how to work
with them.
Definitions of Types and Tags
Woodwork has been designed to allow users to easily specify additional typing information for a DataFrame while
providing the ability to interface with the data based on the typing information. Because a single DataFrame might
store various types of data like numbers, text, or dates in different columns, the additional information is defined on a
per-column basis.
There are 3 main ways that Woodwork stores additional information about user data:
• Physical Type: defines how the data is stored on disk or in memory.
• Logical Type: defines how the data should be parsed or interpreted.
• Semantic Tag(s): provides additional data about the meaning of the data or how it should be used.
Physical Types
Physical types define how the data is stored on disk or in memory. You might also see the physical type for a column
referred to as the column’s dtype.
For example, typical Pandas dtypes often used include object, int64, float64 and datetime64[ns], though
there are many more. In Woodwork, there are 10 different physical types that are used, each corresponding to a Pandas
dtype. When Woodwork is initialized on a DataFrame, the dtype of the underlying data is converted to one of these
values, if it isn’t already one of these types:
• bool
• boolean
• category
• datetime64[ns]
• float64
• int64
• Int64
• object
• string
• timedelta64[ns]
The physical type conversion is done based on the LogicalType that has been specified or inferred for a given
column.
When using Woodwork with a Koalas DataFrame, the physical types used may be different than those listed above.
For more information, refer to the guide Using Woodwork with Dask and Koalas DataFrames.
1.1. Quick Start 15

Logical Types
Logical types define how data should be interpreted or parsed. Logical types provide an additional level of detail
beyond the physical type. Some columns might share the same physical type, but might have different parsing require-
ments depending on the information that is stored in the column.
For example, email addresses and phone numbers would typically both be stored in a data column with a physical
type of string. However, when reading and validating these two types of information, different rules apply. For
email addresses, the presence of the @ symbol is important. For phone numbers, you might want to confirm that only a
certain number of digits are present, and special characters might be restricted to +, -, ( or ). In this particular example
Woodwork defines two different logical types to separate these parsing needs: EmailAddress and PhoneNumber.
There are many different logical types defined within Woodwork. To get a complete list of all the available logical
types, you can use the list_logical_types function.
[1]: from woodwork import list_logical_types

list_logical_types()
0 Address address
1 Age age
3 Boolean boolean
7 Datetime datetime
8 Double double
12 Integer integer
14 LatLong lat_long
16 Ordinal ordinal
22 URL url
structures.


4 {} True True None
7 {} True True None
In the table, notice that each logical type has a specific physical_type value associated with it. Any time a
logical type is set for a column, the physical type of the underlying data is converted to the type shown in the
physical_type column. There is only one physical type associated with each logical type.
Semantic Tags
Semantic tags provide more context about the meaning of a data column. This could directly affect how the information
contained in the column is interpreted. Unlike physical types and logical types, semantic tags are much less restrictive.
A column might contain many semantic tags or none at all. Regardless, when assigning semantic tags, users should
take care to not assign tags that have conflicting meanings.
As an example of how semantic tags can be useful, consider a dataset with 2 date columns: a signup date and a user
birth date. Both of these columns have the same physical type (datetime64[ns]), and both have the same logical
type (Datetime). However, semantic tags can be used to differentiate these columns. For example, you might want
to add the date_of_birth semantic tag to the user birth date column to indicate this column has special meaning
and could be used to compute a user’s age. Computing an age from the signup date column would not make sense, so
the semantic tag can be used to differentiate between what the dates in these columns mean.
1.1. Quick Start 17

Standard Semantic Tags
As you can see from the table generated with the list_logical_types function above, Woodwork has some
standard tags that are applied to certain columns by default. Woodwork adds a standard set of semantic tags to
columns with LogicalTypes that fall under certain predefined categories.
The standard tags are as follows:
• 'numeric' - The tag applied to numeric Logical Types.
– Integer
– IntegerNullable
– Double
• 'category' - The tag applied to Logical Types that represent categorical variables.
– Categorical
– CountryCode
– Ordinal
– PostalCode
– SubRegionCode
There are also 2 tags that get added to index columns. If no index columns have been specified, these tags are not
present:
• 'index' - on the index column, when specified
• 'time_index' on the time index column, when specified
The application of standard tags, excluding the index and time_index tags, which have special meaning, can be
controlled by the user. This is discussed in more detail in the Working with Semantic Tags section. There are a few
different semantic tags defined within Woodwork. To get a list of the standard, index, and time index tags, you can use
the list_semantic_tags function.
[2]: from woodwork import list_semantic_tags

list_semantic_tags()
[2]: name is_standard_tag \
0 numeric True
1 category True
2 index False
3 time_index False
4 date_of_birth False
valid_logical_types
0 [Age, AgeNullable, Double, Integer, IntegerNul...
1 [Categorical, CountryCode, Ordinal, PostalCode...
2 [Integer, Double, Categorical, Datetime]
3 [Datetime]
4 [Datetime]
structures.
Working with Logical Types
When initializing Woodwork, users have the option to specify the logical types for all, some, or none of the columns
in the underlying DataFrame. If logical types are defined for all of the columns, these logical types are used directly,
provided the data is compatible with the specified logical type. You can’t, for example, use a logical type of Integer
on a column that contains text values that can’t be converted to integers.
If users don’t supply any logical type information during initialization, Woodwork infers the logical types based on
the physical type of the column and the information contained in the columns. If the user passes information for some
of the columns, the logical types are inferred for any columns not specified.
These scenarios are illustrated in this section. To start, create a simple DataFrame to use for this example.
import woodwork as ww
df = pd.DataFrame({
'integers': [-2, 30, 20],
'bools': [True, False, True],
'names': ["Jane Doe", "Bill Smith", "John Hancock"]
})
df
[3]: integers bools names
0 -2 True Jane Doe
1 30 False Bill Smith
2 20 True John Hancock
Importing Woodwork creates a special namespace on the DataFrame, called ww, that can be used to initialize and
modify Woodwork information for a DataFrame. Now that you’ve created the data to use for the example, you can
initialize Woodwork on this DataFrame, assigning logical type values to each of the columns. Then view the types
stored for each column by using the DataFrame.ww.types property.
[4]: logical_types = {
'integers': 'Integer',
'bools': 'Boolean',
'names': 'PersonFullName'
}
df.ww.init(logical_types=logical_types)
df.ww.types
Column
integers int64 Integer ['numeric']
bools bool Boolean []
names string PersonFullName []
As you can see, the logical types that you specified have been assigned to each of the columns. Now assign only one
logical type value, and let Woodwork infer the types for the other columns.
'names': 'PersonFullName'
}
df.ww.init(logical_types=logical_types)
df.ww
Column
1.1. Quick Start 19


With that input, you get the same results. Woodwork used the PersonFullName logical type you assigned to the
names column and then correctly inferred the logical types for the integers and bools columns.
Next, look at what happens if we do not specify any logical types.
[6]: df.ww.init()
df.ww
Column
names category Categorical ['category']
In this case, Woodwork correctly inferred type for the integers and bools columns, but failed to recognize the
names column should have a logical type of PersonFullName. In situations like this, Woodwork provides users
the ability to change the logical type.
Update the logical type of the names column to be PersonFullName.
[7]: df.ww.set_types(logical_types={'names': 'PersonFullName'})

df.ww
Column
If you look carefully at the output, you can see that several things happened to the names column. First, the cor-
rect PersonFullName logical type has been applied. Second, the physical type of the column has changed from
category to string to match the standard physical type for the PersonFullName logical type. Finally, the
standard tag of category that was previously set for the names column has been removed because it no longer
applies.
When setting the LogicalType for a column, the type can be specified by passing a string representing the camel-case
name of the LogicalType class as you have done in previous examples. Alternatively, you can pass the class directly
instead of a string or the snake-case name of the string. All of these would be valid values to use for setting the
PersonFullName Logical type: PersonFullName, "PersonFullName" or "person_full_name".
Note—in order to use the class name, first you have to import the class.
Working with Semantic Tags
Woodwork provides several methods for working with semantic types. You can add and remove specific tags, or you
can reset the tags to their default values. In this section, you learn how to use those methods.
structures.
Standard Tags
As mentioned above, Woodwork applies default semantic tags to columns by default, based on the logical type
that was specified or inferred. If this behavior is undesirable, it can be controlled by setting the parameter
use_standard_tags to False when initializing Woodwork.
[8]: df.ww.init(use_standard_tags=False)
df.ww
Column
integers int64 Integer []
names category Categorical []
As can be seen in the output above, when initializing Woodwork with use_standard_tags set to False, all
semantic tags are empty. The only exception to this is if the index or time index column were set. We discuss that in
more detail later on.
Create a new Woodwork DataFrame with the standard tags, and specify some additional user-defined semantic tags
during creation.
[9]: semantic_tags = {
'bools': 'user_status',
'names': 'legal_name'
}
df.ww.init(semantic_tags=semantic_tags)
df.ww
Column
bools bool Boolean ['user_status']
names category Categorical ['category', 'legal_name']
Woodwork has applied the tags we specified along with any standard tags to the columns in our DataFrame.
After initializing Woodwork, you have changed your mind and decided you don’t like the tag of user_status that
you applied to the bools column. Now you want to remove it. You can do that with the remove_semantic_tags
method.
[10]: df.ww.remove_semantic_tags({'bools':'user_status'})
df.ww
Column
The user_status tag has been removed.

You can also add multiple tags to a column at once by passing in a list of tags, rather of a single tag. Similarly, multiple
tags can be removed at once by passing a list of tags.
[11]: df.ww.add_semantic_tags({'bools':['tag1', 'tag2']})

df.ww
1.1. Quick Start 21


Column
bools bool Boolean ['tag2', 'tag1']
[12]: df.ww.remove_semantic_tags({'bools':['tag1', 'tag2']})

df.ww
Column
All tags can be reset to their default values by using the reset_semantic_tags methods. If
use_standard_tags is True, the tags are reset to the standard tags. Otherwise, the tags are reset to be empty
sets.
[13]: df.ww.reset_semantic_tags()
df.ww
Column
names category Categorical ['category']
In this case, since you initialized Woodwork with the default behavior of using standard tags, calling
reset_semantic_tags resulted in all of our semantic tags being reset to the standard tags for each column.
Index and Time Index Tags
When initializing Woodwork, you have the option to specify which column represents the index and which column
represents the time index. If these columns are specified, semantic tags of index and time_index are applied to
the specified columns. Behind the scenes, Woodwork is performing additional validation checks on the columns to
make sure they are appropriate. For example, index columns must be unique, and time index columns must contain
datetime values or numeric values.
Because of the need for these validation checks, you can’t set the index or time_index tags directly on a column.
In order to designate a column as the index, the set_index method should be used. Similarly, in order to set the time
index column, the set_time_index method should be used. Optionally, these can be specified when initializing
Woodwork by using the index or time_index parameters.
Setting the index
Create a new sample DataFrame that contains columns that can be used as index and time index columns and initialize
Woodwork.
[14]: df = pd.DataFrame({
'index': [0, 1, 2],
'id': [1, 2, 3],
'times': pd.to_datetime(['2020-09-01', '2020-09-02', '2020-09-03']),
'numbers': [10, 20, 30]
structures.

})
df.ww.init()
df.ww
Column
index int64 Integer ['numeric']
id int64 Integer ['numeric']
times datetime64[ns] Datetime []
numbers int64 Integer ['numeric']
Without specifying an index or time index column during initialization, Woodwork has inferred that the index and
id columns are integers and the numeric semantic tag has been applied. You can now set the index column with the
set_index method.
[15]: df.ww.set_index('index')
df.ww
Column
index int64 Integer ['index']
Inspecting the types now reveals that the index semantic tag has been added to the index column, and the numeric
standard tag has been removed. You can also check that the index has been set correctly by checking the value of the
DataFrame.ww.index attribute.
[16]: df.ww.index
[16]: 'index'
If you want to change the index column to be the id column instead, you can do that with another call to set_index.
[17]: df.ww.set_index('id')
df.ww
Column
id int64 Integer ['index']
The index tag has been removed from the index column and added to the id column. The numeric standard tag
that was originally present on the index column has been added back.
1.1. Quick Start 23

Setting the time index
Setting the time index works similarly to setting the index. You can now set the time index with the
set_time_index method.
[18]: df.ww.set_time_index('times')
df.ww
Column
times datetime64[ns] Datetime ['time_index']
After calling set_time_index, the time_index semantic tag has been added to the semantic tags for times
column.
Validating Woodwork’s Typing Information
The logical types, physical types, and semantic tags described above make up a DataFrame’s typing information,
which will be referred to as its “schema”. For Woodowork to be useful, the schema must be valid with respect to its
DataFrame.
[19]: df.ww.schema
[19]: Logical Type Semantic Tag(s)
Column
index Integer ['numeric']
id Integer ['index']
times Datetime ['time_index']
numbers Integer ['numeric']
The Woodwork schema shown above can be seen reflected in the DataFrame below. Every column present in the
schema is present in the DataFrame, the dtypes all match the physical types defined by each column’s LogicalType,
and the Woodwork index column is both unique and matches the DataFrame’s underlying index.
[20]: df
[20]: index id times numbers
1 0 1 2020-09-01 10
2 1 2 2020-09-02 20
3 2 3 2020-09-03 30
[21]: df.dtypes
[21]: index int64
id int64
times datetime64[ns]
numbers int64
dtype: object
Woodwork defines the elements of a valid schema, and maintaining schema validity requires that the DataFrame follow
Woodwork’s type system. For this reason, it is not recommended to perform DataFrame operations directly on the
DataFrame; instead, you should go through the ww namespace. Woodwork will attempt to retain a valid schema for
any operations performed through the ww namespace. If a DataFrame operation called through the ww namespace
invalidates the Woodwork schema defined for that DataFrame, the typing information will be removed.
structures.
Therefore, when performing Woodwork operations, you can be sure that if the schema is present on df.ww.schema
then the schema is valid for that DataFrame.
Defining a Valid Schema
Given a DataFrame and its Woodwork typing information, the schema will be considered valid if:
• All of the columns present in the schema are present on the DataFrame and vice versa
• The physical type used by each column’s Logical Type matches the corresponding series’ dtype
• If an index is present, the index column is unique [pandas only]
• If an index is present, the DataFrame’s underlying index matches the index column exactly [pandas only]
Calling sort_values on a DataFrame, for example, will not invalidate a DataFrame’s schema, as none of the above
properties get broken. In the example below, a new DataFrame is created with the columns sorted in descending order,
and it has Woodwork initialized. Looking at the schema, you will see that it’s exactly the same as the schema of the
original DataFrame.
[22]: sorted_df = df.ww.sort_values(['numbers'], ascending=False)

sorted_df
3 2 3 2020-09-03 30
2 1 2 2020-09-02 20
1 0 1 2020-09-01 10
[23]: sorted_df.ww
Column
times datetime64[ns] Datetime ['time_index']
Conversely, changing a column’s dtype so that it does not match the corresponding physical type by calling astype
on a DataFrame will invalidate the schema, removing it from the DataFrame. The resulting DataFrame will not have
Woodwork initialized, and a warning will be raised explaining why the schema was invalidated.
[24]: astype_df = df.ww.astype({'numbers':'float64'})

astype_df
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-datatables/envs/v0.
˓→3.1/lib/python3.7/site-packages/woodwork/table_accessor.py:588:
˓→TypingInfoMismatchWarning: Operation performed by astype has invalidated the
˓→Woodwork typing information:
dtype mismatch for column numbers between DataFrame dtype, float64, and Integer
˓→dtype, int64.
Please initialize Woodwork with DataFrame.ww.init

TypingInfoMismatchWarning)
1 0 1 2020-09-01 10.0
2 1 2 2020-09-02 20.0
3 2 3 2020-09-03 30.0
1.1. Quick Start 25

[25]: assert astype_df.ww.schema is None
Woodwork provides two helper functions that will allow you to check if a schema is valid for a given dataframe.
The ww.is_schema_valid function will return a boolean indicating whether or not the schema is valid for the
dataframe.
Check whether the schema from df is valid for the sorted_df created above.
[26]: ww.is_schema_valid(sorted_df, df.ww.schema)

[26]: True
The function ww.get_invalid_schema_message can be used to obtain a string message indicating the reason
for an invalid schema. If the schema is valid, this function will return None.
Use the function to determine why the schema from df is invalid for the astype_df created above.
[27]: ww.is_schema_valid(astype_df, df.ww.schema)

[27]: False
[28]: ww.get_invalid_schema_message(astype_df, df.ww.schema)

[28]: 'dtype mismatch for column numbers between DataFrame dtype, float64, and Integer
˓→dtype, int64'
Global Configuration Options
Woodwork contains global configuration options that you can use to control the behavior of certain aspects of Wood-
work. This guide provides an overview of working with those options, including viewing the current settings and
updating the config values.
Viewing Config Settings
To demonstrate how to display the current configuration options, follow along.

After you’ve imported Woodwork, you can view the options with ww.config as shown below.

ww.config
[1]: Woodwork Global Config Settings
-------------------------------
natural_language_threshold: 10
numeric_categorical_threshold: -1
The output of ww.config lists each of the available config variables followed by it’s current setting.
In the output above, the natural_language_threshold config variable has been set to 10 and the
numeric_categorical_threshold has been set to -1.
structures.
Updating Config Settings
Updating a config variable is done simply with a call to the ww.config.set_option function. This function
requires two arguments: the name of the config variable to update and the new value to set.
As an example, update the natural_language_threshold config variable to have a value of 25 instead of the
default value of 10.
[2]: ww.config.set_option('natural_language_threshold', 25)

ww.config
-------------------------------
As you can see from the output above, the value for the natural_language_threshold config variable has
been updated to 25.
Get Value for a Specific Config Variable
If you need access to the value that is set for a specific config variable you can access it with the ww.config.
get_option function, passing in the name of the config variable for which you want the value.
[3]: ww.config.get_option('natural_language_threshold')
[3]: 25
Resetting to Default Values
Config variables can be reset to their default values using the ww.config.reset_option function, passing in the
name of the variable to reset.
As an example, reset the natural_language_threshold config variable to its default value.
[4]: ww.config.reset_option('natural_language_threshold')
ww.config
-------------------------------
Available Config Settings
This section provides an overview of the current config options that can be set within Woodwork.
1.1. Quick Start 27

Natural Language Threshold
The natural_language_threshold config variable helps control the distinction between Categorical and
NaturalLanguage logical types during type inference. More specifically, this threshold represents the average
string length that is used to distinguish between these two types. If the average string length in a column is greater than
this threshold, the column is inferred as a NaturalLanguage column; otherwise, it is inferred as a Categorical
column. The natural_language_threshold config variable defaults to 10.
Numeric Categorical Threshold
Woodwork provides the option to infer numeric columns as the Categorical logical type if they have few enough
unique values. The numeric_categorical_threshold config variable allows users to set the threshold of
unique values below which numeric columns are inferred as categorical. The default threshold is -1, meaning that
numeric columns are not inferred to be categorical by default (because the fewest number of unique values a column
can have is zero).
Gain Statistical Insights into Your Data
Woodwork provides methods on your DataFrames to allow you to use the typing information stored by Woodwork to
better understand your data.
Follow along to learn how to use Woodwork’s statistical methods on a DataFrame of retail data while demonstrating
the full capabilities of the functions.

from woodwork.demo import load_retail
df = load_retail()
df.ww
Column
order_product_id category Categorical ['index']
order_date datetime64[ns] Datetime ['time_index']
customer_name category Categorical ['category']
structures.
DataFrame.ww.describe
Use df.ww.describe() to calculate statistics for the columns in a DataFrame, returning the results in the format
of a pandas DataFrame with the relevant calculations done for each column.
[2]: df.ww.describe()
[2]: order_id product_id description \
physical_type category category string
logical_type Categorical Categorical NaturalLanguage
semantic_tags {category} {category} {}
count 401604 401604 401604
nunique 22190 3684 NaN
nan_count 0 0 0
mean NaN NaN NaN
mode 576339 85123A WHITE HANGING HEART T-LIGHT HOLDER
std NaN NaN NaN
min NaN NaN NaN
first_quartile NaN NaN NaN
second_quartile NaN NaN NaN
third_quartile NaN NaN NaN
max NaN NaN NaN
num_true NaN NaN NaN
num_false NaN NaN NaN
quantity order_date unit_price \

physical_type int64 datetime64[ns] float64
logical_type Integer Datetime Double
semantic_tags {numeric} {time_index} {numeric}
count 401604.0 401604 401604.0
nunique 436.0 20460 620.0
nan_count 0 0 0
mean 12.183273 2011-07-10 12:08:23.848567552 5.732205
mode 1 2011-11-14 15:27:00 2.0625
std 250.283037 NaN 115.110658
min -80995.0 2010-12-01 08:26:00 0.0
first_quartile 2.0 NaN 2.0625
second_quartile 5.0 NaN 3.2175
third_quartile 12.0 NaN 6.1875
max 80995.0 2011-12-09 12:50:00 64300.5
customer_name country total cancelled

physical_type category category float64 bool
logical_type Categorical Categorical Double Boolean
semantic_tags {category} {category} {numeric} {}
count 401604 401604 401604.0 401604
nunique 4372 37 3946.0 NaN
nan_count 0 0 0 0
mean NaN NaN 34.012502 NaN
mode Mary Dalton United Kingdom 24.75 False
std NaN NaN 710.081161 NaN
min NaN NaN -277974.84 NaN
first_quartile NaN NaN 7.0125 NaN
second_quartile NaN NaN 19.305 NaN
third_quartile NaN NaN 32.67 NaN
max NaN NaN 277974.84 NaN
1.1. Quick Start 29


num_true NaN NaN NaN 8872
num_false NaN NaN NaN 392732
There are a couple things to note in the above dataframe:

• The Woodwork index, order_product_id, is not included
• We provide each column’s typing information according to Woodwork’s typing system
• Any statistics that can’t be calculated for a column, such as num_false on a Datetime are filled with NaN.
• Null values do not get counted in any of the calculations other than nunique
DataFrame.ww.value_counts
Use df.ww.value_counts() to calculate the most frequent values for each column that has category as a
standard tag. This returns a dictionary where each column is associated with a sorted list of dictionaries. Each
dictionary contains value and count.
[3]: df.ww.value_counts()
[3]: {'order_product_id': [{'value': 0, 'count': 1},
{'value': 267744, 'count': 1},
{'value': 267742, 'count': 1},
{'value': 267741, 'count': 1},
{'value': 267740, 'count': 1},
{'value': 267739, 'count': 1},
{'value': 267738, 'count': 1},
{'value': 267737, 'count': 1},
{'value': 267736, 'count': 1},
{'value': 267735, 'count': 1}],
'order_id': [{'value': '576339', 'count': 542},
{'value': '579196', 'count': 533},
{'value': '580727', 'count': 529},
{'value': '578270', 'count': 442},
{'value': '573576', 'count': 435},
{'value': '567656', 'count': 421},
{'value': '567183', 'count': 392},
{'value': '575607', 'count': 377},
{'value': '571441', 'count': 364},
{'value': '570488', 'count': 353}],
'product_id': [{'value': '85123A', 'count': 2065},
{'value': '22423', 'count': 1894},
{'value': '85099B', 'count': 1659},
{'value': '47566', 'count': 1409},
{'value': '84879', 'count': 1405},
{'value': '20725', 'count': 1346},
{'value': '22720', 'count': 1224},
{'value': 'POST', 'count': 1196},
{'value': '22197', 'count': 1110},
{'value': '23203', 'count': 1108}],
'customer_name': [{'value': 'Mary Dalton', 'count': 7812},
{'value': 'Dalton Grant', 'count': 5898},
{'value': 'Jeremy Woods', 'count': 5128},
{'value': 'Jasmine Salazar', 'count': 4459},
{'value': 'James Robinson', 'count': 2759},
{'value': 'Bryce Stewart', 'count': 2478},
structures.

{'value': 'Vanessa Sanchez', 'count': 2085},
{'value': 'Laura Church', 'count': 1853},
{'value': 'Kelly Alvarado', 'count': 1667},
{'value': 'Ashley Meyer', 'count': 1640}],
'country': [{'value': 'United Kingdom', 'count': 356728},
{'value': 'Germany', 'count': 9480},
{'value': 'France', 'count': 8475},
{'value': 'EIRE', 'count': 7475},
{'value': 'Spain', 'count': 2528},
{'value': 'Netherlands', 'count': 2371},
{'value': 'Belgium', 'count': 2069},
{'value': 'Switzerland', 'count': 1877},
{'value': 'Portugal', 'count': 1471},
{'value': 'Australia', 'count': 1258}]}
DataFrame.ww.mutual_information
df.ww.mutual_information calculates the mutual information between all pairs of relevant columns. Certain
types, like strings, can’t have mutual information calculated.
The mutual information between columns A and B can be understood as the amount of knowledge you can have
about column A if you have the values of column B. The more mutual information there is between A and B, the less
uncertainty there is in A knowing B, and vice versa.
[4]: df.ww.mutual_information()
[4]: column_1 column_2 mutual_info
0 order_id customer_name 0.886411
1 order_id product_id 0.475745
2 product_id unit_price 0.426383
3 order_id order_date 0.391906
4 product_id customer_name 0.361855
5 order_date customer_name 0.187982
6 quantity total 0.184497
7 customer_name country 0.155593
8 product_id total 0.152183
9 order_id total 0.129882
10 order_id country 0.126048
11 order_id quantity 0.114714
12 unit_price total 0.103210
13 customer_name total 0.099530
14 product_id quantity 0.088663
15 quantity customer_name 0.085515
16 quantity unit_price 0.082515
17 order_id unit_price 0.077681
18 product_id order_date 0.057175
19 total cancelled 0.044032
20 unit_price customer_name 0.041308
21 quantity cancelled 0.035528
22 product_id country 0.028569
23 country total 0.025071
24 order_id cancelled 0.022204
25 quantity country 0.021515
26 order_date country 0.010361
27 customer_name cancelled 0.006456
1.1. Quick Start 31


28 product_id cancelled 0.003769
29 country cancelled 0.003607
30 order_date unit_price 0.003180
31 order_date total 0.002625
32 unit_price country 0.002603
33 quantity order_date 0.002146
34 unit_price cancelled 0.001677
35 order_date cancelled 0.000199
Available Parameters
df.ww.mutual_information provides various parameters for tuning the mutual information calculation.
• num_bins - In order to calculate mutual information on continuous data, Woodwork bins numeric data into
categories. This parameter allows you to choose the number of bins with which to categorize data.
– Defaults to using 10 bins
– The more bins there are, the more variety a column will have. The number of bins used should accurately
portray the spread of the data.
• nrows - If nrows is set at a value below the number of rows in the DataFrame, that number of rows is randomly
sampled from the underlying data
– Defaults to using all the available rows.
– Decreasing the number of rows can speed up the mutual information calculation on a DataFrame with
many rows, but you should be careful that the number being sampled is large enough to accurately portray
the data.
• include_index - If set to True and an index is defined with a logical type that is valid for mutual informa-
tion, the index column will be included in the mutual information output.
– Defaults to False
Now that you understand the parameters, you can explore changing the number of bins. Note—this only affects
numeric columns quantity and unit_price. Increase the number of bins from 10 to 50, only showing the
impacted columns.
[5]: mi = df.ww. mutual_information()

mi[mi['column_1'].isin(['unit_price', 'quantity']) | mi['column_2'].isin(['unit_price
˓→', 'quantity'])]

structures.

[6]: mi = df.ww.mutual_information(num_bins = 50)

mi[mi['column_1'].isin(['unit_price', 'quantity']) | mi['column_2'].isin(['unit_price
˓→', 'quantity'])]

In order to include the index column in the mutual information output, run the calculation with
include_index=True.
[7]: mi = df.ww.mutual_information(include_index=True)
mi[mi['column_1'].isin(['order_product_id']) | mi['column_2'].isin(['order_product_id
˓→'])]

1 order_product_id order_id 0.845419
2 order_product_id customer_name 0.736457
3 order_product_id product_id 0.732680
8 order_product_id order_date 0.302856
9 order_product_id total 0.302435
10 order_product_id unit_price 0.299486
11 order_product_id quantity 0.266489
21 order_product_id country 0.093880
34 order_product_id cancelled 0.016307
Using Woodwork with Dask and Koalas DataFrames
Woodwork allows you to add custom typing information to Dask DataFrames or Koalas DataFrames when working
with datasets that are too large to easily fit in memory. Although initializing Woodwork on a Dask or Koalas DataFrame
follows the same process as you follow when initializing on a pandas DataFrame, there are a few limitations to be aware
of. This guide provides a brief overview of using Woodwork with a Dask or Koalas DataFrame. Along the way, the
guide highlights several key items to keep in mind when using a Dask or Koalas DataFrame as input.
Using Woodwork with either Dask or Koalas requires the installation of the Dask or Koalas libraries respectively.
These libraries can be installed directly with these commands:
python -m pip install "woodwork[dask]"
1.1. Quick Start 33

python -m pip install "woodwork[koalas]"
Dask DataFrame Example
Create a Dask DataFrame to use in our example. Normally you create the DataFrame directly by reading in the data
from saved files, but you will create it from a demo pandas DataFrame.
[1]: import dask.dataframe as dd

import woodwork as ww
df_pandas = ww.demo.load_retail(nrows=1000, init_woodwork=False)

df_dask = dd.from_pandas(df_pandas, npartitions=10)
df_dask
[1]: Dask DataFrame Structure:
order_product_id order_id product_id description quantity order_
˓→date unit_price customer_name country total cancelled
npartitions=10
0 int64 object object object int64
˓→datetime64[ns] float64 object object float64 bool
100 ... ... ... ... ...
˓→... ... ... ... ... ...
... ... ... ... ... ...
˓→... ... ... ... ... ...
900 ... ... ... ... ...
˓→... ... ... ... ... ...
999 ... ... ... ... ...
˓→... ... ... ... ... ...
Dask Name: from_pandas, 10 tasks
Now that you have a Dask DataFrame, you can use it to create a Woodwork DataFrame, just as you would with a
pandas DataFrame:
[2]: df_dask.ww.init(index='order_product_id')
df_dask.ww
Column
As you can see from the output above, Woodwork was initialized successfully, and logical type inference was per-
formed for all of the columns.
However, that illustrates one of the key issues in working with Dask DataFrames. In order to perform logical type
inference, Woodwork needs to bring the data into memory so it can be analyzed. Currently, Woodwork reads data
from the first partition of data only, and then uses this data for type inference. Depending on the complexity of the
structures.
data, this could be a time consuming operation. Additionally, if the first partition is not representative of the entire
dataset, the logical types for some columns may be inferred incorrectly.
Skipping or Overriding Type Inference
If this process takes too much time, or if the logical types are not inferred correctly, you can manually specify the
logical types for each column. If the logical type for a column is specified, type inference for that column will
be skipped. If logical types are specified for all columns, logical type inference will be skipped completely and
Woodwork will not need to bring any of the data into memory during initialization.
To skip logical type inference completely or to correct type inference issues, define a logical types dictionary with the
correct logical type defined for each column in the DataFrame, then pass that dictionary to the initialization call.
'order_product_id': 'Integer',
'order_id': 'Categorical',
'product_id': 'Categorical',
'description': 'NaturalLanguage',
'quantity': 'Integer',
'order_date': 'Datetime',
'unit_price': 'Double',
'total': 'Double',
'cancelled': 'Boolean',
}
df_dask.ww.init(index='order_product_id', logical_types=logical_types)
df_dask.ww
Column
DataFrame Statistics
There are some Woodwork methods that require bringing the underlying Dask DataFrame into memory: describe,
value_counts and mutual_information. When called, these methods will call a compute operation on the
DataFrame to calculate the desired information. This might be problematic for datasets that cannot fit in memory, so
exercise caution when using these methods.
[4]: df_dask.ww.describe(include=['numeric'])
[4]: quantity unit_price total
physical_type int64 float64 float64
1.1. Quick Start 35


logical_type Integer Double Double
semantic_tags {numeric} {numeric} {numeric}
count 1000.0 1000.0 1000.0
nunique 43.0 61.0 232.0
nan_count 0 0 0
mean 12.735 5.003658 40.390465
mode 1 2.0625 24.75
std 38.401634 9.73817 123.99357
min -24.0 0.165 -68.31
first_quartile 2.0 2.0625 5.709
second_quartile 4.0 3.34125 17.325
third_quartile 12.0 6.1875 33.165
max 600.0 272.25 2684.88
[5]: df_dask.ww.value_counts()
[5]: {'order_id': [{'value': '536464', 'count': 81},
{'value': '536520', 'count': 71},
{'value': '536412', 'count': 68},
{'value': '536401', 'count': 64},
{'value': '536415', 'count': 59},
{'value': '536409', 'count': 54},
{'value': '536408', 'count': 48},
{'value': '536381', 'count': 35},
{'value': '536488', 'count': 34},
{'value': '536446', 'count': 31}],
'product_id': [{'value': '22632', 'count': 11},
{'value': '85123A', 'count': 10},
{'value': '22633', 'count': 10},
{'value': '22961', 'count': 9},
{'value': '84029E', 'count': 9},
{'value': '22866', 'count': 7},
{'value': '84879', 'count': 7},
{'value': '22960', 'count': 7},
{'value': '21212', 'count': 7},
{'value': '22197', 'count': 7}],
{'value': 'Australia', 'count': 14},
{'value': 'Netherlands', 'count': 2}]}
[6]: df_dask.ww.mutual_information().head()
structures.
Koalas DataFrame Example
As above, first create a Koalas DataFrame to use in our example. Normally you create the DataFrame directly by
reading in the data from saved files, but here you create it from a demo pandas DataFrame.
[7]: # The two lines below only need to be executed if you do not have Spark properly
˓→configured.
# However if you are running into config errors, this resource may be useful:
# https://stackoverflow.com/questions/52133731/how-to-solve-cant-assign-requested-
˓→address-service-sparkdriver-failed-after
import pyspark.sql as sql

spark = sql.SparkSession.builder.master('local[2]').config("spark.driver.
˓→extraJavaOptions", "-Dio.netty.tryReflectionSetAccessible=True").config("spark.sql.
˓→shuffle.partitions", "2").config("spark.driver.bindAddress", "127.0.0.1").
˓→getOrCreate()
[8]: import databricks.koalas as ks
df_koalas = ks.from_pandas(df_pandas)
df_koalas.head()
[8]: order_product_id order_id product_id description
˓→ quantity order_date unit_price customer_name country total
˓→cancelled
0 0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER

˓→6 2010-12-01 08:26:00 4.2075 Andrea Brown United Kingdom 25.245 False
1 1 536365 71053 WHITE METAL LANTERN
2 2 536365 84406B CREAM CUPID HEARTS COAT HANGER
3 3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE
4 4 536365 84029E RED WOOLLY HOTTIE WHITE HEART.
Now that you have a Koalas DataFrame, you can initialize Woodwork, just as you would with a pandas DataFrame:
[9]: df_koalas.ww.init(index='order_product_id')
df_koalas.ww
Column
order_id string Categorical ['category']
product_id string Categorical ['category']
As you can see from the output above, Woodwork has been initialized successfully, and logical type inference was
performed for all of the columns.
1.1. Quick Start 37

Notes on Koalas Dtype Conversions
In the types table above, one important thing to notice is that the physical types for the Koalas DataFrame are different
than the physical types for the Dask DataFrame. The reason for this is that Koalas does not support the category
dtype that is available with pandas and Dask.
When Woodwork is initialized, the dtype of the DataFrame columns are converted to a set of standard dtypes, defined
by the LogicalType primary_dtype property. By default, Woodwork uses the category dtype for any categorical
logical types, but this is not available with Koalas.
For LogicalTypes that have primary_dtype properties that are not compatible with Koalas, Woodwork will try to
convert the column dtype, but will be unsuccessful. At that point, Woodwork will use a backup dtype that is compatible
with Koalas. The implication of this is that using Woodwork with a Koalas DataFrame may result in dtype values that
are different than the values you would get when working with an otherwise identical pandas DataFrame.
Since Koalas does not support the category dtype, any column that is inferred or specified with a logical type
of Categorical will have its values converted to strings and stored with a dtype of string. This means that a
categorical column containing numeric values, will be converted into the equivalent string values.
Finally, Koalas does not support the timedelta64[ns] dtype. For this, there is not a clean backup dtype, so the
use of Timedelta LogicalType is not supported with Koalas DataFrames.
Skipping or Overriding Type Inference
As with Dask, Woodwork must bring the data into memory so it can be analyzed for type inference. Currently,
Woodwork reads the first 100,000 rows of data to use for type inference when using a Koalas DataFrame as input. If
the first 100,000 rows are not representative of the entire dataset, the logical types for some columns might be inferred
incorrectly.
To skip logical type inference completely or to correct type inference issues, define a logical types dictionary with the
correct logical type defined for each column in the dataframe.
'order_product_id': 'Integer',
'order_id': 'Categorical',
'product_id': 'Categorical',
'description': 'NaturalLanguage',
'quantity': 'Integer',
'order_date': 'Datetime',
'unit_price': 'Double',
'total': 'Double',
'cancelled': 'Boolean',
}
df_koalas.ww.init(index='order_product_id', logical_types=logical_types)
df_koalas.ww
Column
order_id string Categorical ['category']
product_id string Categorical ['category']
structures.

country string Categorical ['category']
DataFrame Statistics
As with Dask, running describe, value_counts or mutual_information requires bringing the data into
memory to perform the analysis. When called, these methods will call a to_pandas operation on the DataFrame
to calculate the desired information. This may be problematic for very large datasets, so exercise caution when using
these methods.
[11]: df_koalas.ww.describe(include=['numeric'])
[11]: quantity unit_price total
physical_type int64 float64 float64
logical_type Integer Double Double
semantic_tags {numeric} {numeric} {numeric}
count 1000.0 1000.0 1000.0
nunique 43.0 61.0 232.0
nan_count 0 0 0
mean 12.735 5.003658 40.390465
mode 1 2.0625 24.75
std 38.401634 9.73817 123.99357
min -24.0 0.165 -68.31
first_quartile 2.0 2.0625 5.709
second_quartile 4.0 3.34125 17.325
third_quartile 12.0 6.1875 33.165
max 600.0 272.25 2684.88
[12]: df_koalas.ww.value_counts()
[12]: {'order_id': [{'value': '536464', 'count': 81},
{'value': '536520', 'count': 71},
{'value': '536412', 'count': 68},
{'value': '536401', 'count': 64},
{'value': '536415', 'count': 59},
{'value': '536409', 'count': 54},
{'value': '536408', 'count': 48},
{'value': '536381', 'count': 35},
{'value': '536488', 'count': 34},
{'value': '536446', 'count': 31}],
'product_id': [{'value': '22632', 'count': 11},
{'value': '85123A', 'count': 10},
{'value': '22633', 'count': 10},
{'value': '84029E', 'count': 9},
{'value': '22961', 'count': 9},
{'value': '22197', 'count': 7},
{'value': '21212', 'count': 7},
{'value': '22960', 'count': 7},
{'value': '84879', 'count': 7},
{'value': '22866', 'count': 7}],
1.1. Quick Start 39


{'value': 'Australia', 'count': 14},
{'value': 'Netherlands', 'count': 2}]}
[13]: df_koalas.ww.mutual_information().head()
Data Validation Limitations
Woodwork performs several validation checks to confirm that the data in the DataFrame is appropriate for the specified
parameters. Because some of these validation steps would require pulling the data into memory, they are skipped when
using Woodwork with a Dask or Koalas DataFrame. This section provides an overview of the validation checks that
are performed with pandas input but skipped with Dask or Koalas input.
Index Uniqueness
Normally a check is performed to verify that any column specified as the index contains no duplicate values. With
Dask or Koalas input, this check is skipped and you must manually verify that any column specified as an index
column contains unique values.
Data Consistency with LogicalType (Dask Only)
If you manually define the LogicalType for a column when initializing Woodwork, a check is performed to verify
that the data in that column is appropriate for the specified LogicalType. For example, with pandas input if you
specify a LogicalType of Double for a column that contains letters such as ['a', 'b', 'c'], an error is raised
because it is not possible to convert the letters into numeric values with the float dtype associated with the Double
LogicalType.
With Dask input, no such error appears at the time initialization. However, behind the scenes, Woodwork attempts to
convert the column physical type to float, and this conversion is added to the Dask task graph, without raising an
error. However, an error is raised if a compute operation is called on the DataFrame as Dask attempts to execute
the conversion step. Extra care should be taken when using Dask input to make sure any specified logical types are
consistent with the data in the columns to avoid this type of error.
structures.
Ordinal Order Values Check
For the Ordinal LogicalType, a check is typically performed to make sure that the data column does not contain any
values that are not present in the defined order values. This check will not be performed with Dask or Koalas input.
Users should manually verify that the defined order values are complete to avoid unexpected results.
Other Limitations
Reading from CSV Files
Woodwork provides the ability to read data directly from a CSV file into a Woodwork DataFrame. The helper function
used for this, woodwork.read_file, currently only reads the data into a pandas DataFrame. At some point,
this limitation may be removed, allowing data to be read into a Dask or Koalas DataFrame. For now, only pandas
DataFrames can be created with this function.
Sorting DataFrame on Time Index
When initializing with a time index, Woodwork, by default, will sort the input DataFrame first on the time index and
then on the index, if specified. Because sorting a distributed DataFrame is a computationally expensive operation, this
sorting is performed only when using a pandas DataFrame. If a sorted DataFrame is needed when using a Dask or
Koalas, the user should manually sort the DataFrame as needed.
Equality of Woodwork DataFrames
In order to avoid bringing a Dask DataFrame into memory, Woodwoork does not consider the equality of the data when
checking whether Woodwork Dataframe initialized from a Dask or Koalas DataFrame is equal to another Woodwork
DataFrame. This means that two DataFrames with identical names, columns, indices, semantic tags, and LogicalTypes
but different underlying data will be treated as equal if at least one of them uses Dask or Koalas.
LatLong Columns
When working with the LatLong logical type, Woodwork converts all LatLong columns to a standard format of a tuple
of floats for Dask DataFrames and a list of floats for Koalas DataFrames. In order to do this, the data is read into
memory, which may be problematic for large datatsets.
Integer Column Names
Woodwork allows column names of any format that is supported by the DataFrame. However, Dask DataFrames do
not currently support integer column names.
1.1. Quick Start 41

Setting DataFrame Index
When specifying a Woodwork index with a pandas DataFrame, the underlying index of the DataFrame will be updated
to match the column specified as the Woodwork index. When specifying a Woodwork index on a Dask or Koalas
DataFrame, however, the underlying index will remain unchanged.
Make Index
When using make_index during Woodwork initialization, a new index column is added in-place to the existing
DataFrame. Because this type of in-place operation is not currently possible with Koalas, Woodwork does not support
make_index when working with a Koalas DataFrame.
If a new index column is needed, this should be added by the user prior to initializing Woodwork. This can be done
easily with an operation such as this:
df = df.koalas.attach_id_column('distributed-sequence', 'index_col_name').
Customizing Logical Types and Type Inference
The default type system in Woodwork contains many built-in LogicalTypes that work for a wide variety of datasets. For
situations in which the built-in LogicalTypes are not sufficient, Woodwork allows you to create custom LogicalTypes.
Woodwork also has a set of standard type inference functions that can help in automatically identifying correct Logi-
calTypes in the data. Woodwork also allows you to override these existing functions, or add new functions for inferring
any custom LogicalTypes that are added.
This guide provides an overview of how to create custom LogicalTypes as well as how to override and add new type
inference functions. If you need to learn more about types and tags in Woodwork, refer to the Understanding Types
and Tags guide for more detail.
Viewing Built-In Logical Types
To view all of the default LogicalTypes in Woodwork, use the list_logical_types function. If the existing
types are not sufficient for your needs, you can create and register new LogicalTypes for use with Woodwork initialized
DataFrames and Series.
ww.list_logical_types()
0 Address address
1 Age age
3 Boolean boolean
7 Datetime datetime
8 Double double
12 Integer integer
structures.

14 LatLong lat_long
16 Ordinal ordinal
22 URL url

4 {} True True None
7 {} True True None
1.1. Quick Start 43


Registering a New LogicalType
The first step in registering a new LogicalType is to define the class for the new type. This is done by sub-classing the
built-in LogicalType class. There are a few class attributes that should be set when defining this new class. Each
is reviewed in more detail below.
For this example, you will work through an example for a dataset that contains UPC Codes. First create a new
UPCCode LogicalType. For this example, consider the UPC Code to be a type of categorical variable.
[2]: from woodwork.logical_types import LogicalType
class UPCCode(LogicalType):
"""Represents Logical Types that contain 12-digit UPC Codes."""
primary_dtype = 'category'
backup_dtype = 'string'
standard_tags = {'category', 'upc_code'}
When defining the UPCCode LogicalType class, three class attributes were set. All three of these attributes are
optional, and will default to the values defined on the LogicalType class if they are not set when defining the new
type.
• primary_dtype: This value specifies how the data will be stored. If the column of the dataframe is not
already of this type, Woodwork will convert the data to this dtype. This should be specified as a string that
represents a valid pandas dtype. If not specified, this will default to 'string'.
• backup_dtype: This is primarily useful when working with Koalas dataframes. backup_dtype specifies
the dtype to use if Woodwork is unable to convert to the dtype specified by primary_dtype. In our example,
we set this to 'string' since Koalas does not currently support the 'category' dtype.
• standard_tags: This is a set of semantic tags to apply to any column that is set with the specified Logical-
Type. If not specified, standard_tags will default to an empty set.
• docstring: Adding a docstring for the class is optional, but if specified, this text will be used for adding a
description of the type in the list of available types returned by ww.list_logical_types().
Note: Behind the scenes, Woodwork uses the category and numeric semantic tags to determine whether a
column is categorical or numeric column, respectively. If the new LogicalType you define represents a categorical or
numeric type, you should include the appropriate tag in the set of tags specified for standard_tags.
Now that you have created the new LogicalType, you can register it with the Woodwork type system so you can use
it. All modifications to the type system are performed by calling the appropriate method on the ww.type_system
object.
[3]: ww.type_system.add_type(UPCCode, parent='Categorical')
If you once again list the available LogicalTypes, you will see the new type you created was added to the list, including
the values for description, physical_type and standard_tags specified when defining the UPCCode LogicalType.
[4]: ww.list_logical_types()
structures.

0 Address address
1 Age age
3 Boolean boolean
7 Datetime datetime
8 Double double
12 Integer integer
14 LatLong lat_long
16 Ordinal ordinal
22 UPCCode upc_code
23 URL url
22 Represents Logical Types that contain 12-digit... category

1.1. Quick Start 45


4 {} True True None
7 {} True True None
22 {upc_code, category} False True Categorical
Logical Type Relationships
When adding a new type to the type system, you can specify an optional parent LogicalType as done above. When
performing type inference a given set of data might match multiple different LogicalTypes. Woodwork uses the
parent-child relationship defined when registering a type to determine which type to infer in this case.
When multiple matches are found, Woodwork will return the most specific type match found. By setting the parent
type to Categorical when registering the UPCCode LogicalType, you are telling Woodwork that if a data column
matches both Categorical and UPCCode during inference, the column should be considered as UPCCode as this
is more specific than Categorical. Woodwork always assumes that a child type is a more specific version of the
parent type.
Working with Custom LogicalTypes
Next, you will create a small sample DataFrame to demonstrate use of the new custom type. This sample DataFrame
includes an id column, a column with valid UPC Codes, and a column that should not be considered UPC Codes
because it contains non-numeric values.

df = pd.DataFrame({
'id': [0, 1, 2, 3],
'code': ['012345412359', '122345712358', '012345412359', '022323413459'],
'not_upc': ['abcdefghijkl', '122345712358', '012345412359', '022323413459']
})
Before using this dataframe, update Woodwork’s default threshold for differentiating between a NaturalLanguage
and Categorical column so that Woodwork will correctly recognize the code column as a Categorical
column. After setting the threshold, initialize Woodwork and verify that Woodwork has identified our column as
Categorical.
structures.
[6]: ww.config.set_option('natural_language_threshold', 12)

df.ww.init()
df.ww
Column
code category Categorical ['category']
not_upc category Categorical ['category']
The reason Woodwork did not identify the code column to have a UPCCode LogicalType, is that you have not yet
defined an inference function to use with this type. The inference function is what tells Woodwork how to match
columns to specific LogicalTypes.
Even without the inference function, you can manually tell Woodwork that the code column should be of type
UPCCode. This will set the physical type properly and apply the standard semantic tags you have defined
[7]: df.ww.init(logical_types = {'code': 'UPCCode'})

df.ww
Column
code category UPCCode ['upc_code', 'category']
Next, add a new inference function and allow Woodwork to automatically set the correct type for the code column.
Defining Custom Inference Functions
The first step in adding an inference function for the UPCCode LogicalType is to define an appropriate function.
Inference functions always accept a single parameter, a pandas.Series. The function should return True if the
series is a match for the LogicalType for which the function is associated, or False if the series is not a match.
For the UPCCode LogicalType, define a function to check that all of the values in a column are 12 character strings
that contain only numbers. Note, this function is for demonstration purposes only and may not catch all cases that
need to be considered for properly identifying a UPC Code.
[8]: def infer_upc_code(series):

# Make sure series contains only strings:
if not series.apply(type).eq(str).all():
return False
# Check that all items are 12 characters long
if all(series.str.len() == 12):
# Try to convert to a number
try:
series.astype('int')
return True
except:
return False
return False
After defining the new UPC Code inference function, add it to the Woodwork type system so it can be used when
inferring column types.
[9]: ww.type_system.update_inference_function('UPCCode', inference_function=infer_upc_code)
1.1. Quick Start 47

After updating the inference function, you can reinitialize Woodwork on the DataFarme. Notice that Woodwork has
correctly identified the code column to have a LogicalType of UPCCode and has correctly set the physical type and
added the standard tags to the semantic tags for that column.
Also note that the not_upc column was identified as Categorical. Even though this column contains 12-digit
strings, some of the values contain letters, and our inference function correctly told Woodwork this was not valid for
the UPCCode LogicalType.
[10]: df.ww.init()
df.ww
Column
Overriding Default Inference Functions
Overriding the default inference functions is done with the update_inference_function TypeSystem method.
Simply pass in the LogicalType for which you want to override the function, along with the new function to use.
For example you can tell Woodwork to use the new infer_upc_code function for the built in Categorical
LogicalType.
[11]: ww.type_system.update_inference_function('Categorical', inference_function=infer_upc_

˓→code)
If you initialize Woodwork on a DataFrame after updating the Categorical function, you can see that
the not_upc column is no longer identified as a Categorical column, but is rather set to the default
NaturalLanguage LogicalType. This is because the letters in the first row of the not_upc column cause our
inference function to return False for this column, while the default Categorical function will allow non-
numeric values to be present. After updating the inference function, this column is no longer considered a match
for the Categorical type, nor does the column match any other logical types. As a result, the LogicalType is set to
NaturalLanguage, the default type used when no type matches are found.
[12]: df.ww.init()
df.ww
Column
not_upc string NaturalLanguage []
Updating LogicalType Relationships
If you need to change the parent for a registered LogicalType, you can do this with the update_relationship
method. Update the new UPCCode LogicalType to be a child of NaturalLanguage instead.
[13]: ww.type_system.update_relationship('UPCCode', parent='NaturalLanguage')
The parent for a logical type can also be set to None to indicate this is a root-level LogicalType that is not a child of
any other existing LogicalType.
structures.
[14]: ww.type_system.update_relationship('UPCCode', parent=None)
Setting the proper parent-child relationships between logical types is important. Because Woodwork will return the
most specific LogicalType match found during inference, improper inference can occur if the relationships are not set
correctly.
As an example, if you initialize Woodwork after setting the UPCCode LogicalType to have a parent of None, you
will now see that the UPC Code column is inferred as Categorical instead of UPCCode. After setting the parent
to None, UPCCode and Categorical are now siblings in the relationship graph instead of having a parent-child
relationship as they did previously. When Woodwork finds multiple matches on the same level in the relationship
graph, the first match is returned, which in this case is Categorical. Without proper parent-child relationships set,
Woodwork is unable to determine which LogicalType is most specific.
[15]: df.ww.init()
df.ww
Column
not_upc string NaturalLanguage []
Removing a LogicalType
If a LogicalType is no longer needed, or is unwanted, it can be removed from the type system with the remove_type
method. When a LogicalType has been removed, a value of False will be present in the is_registered column
for the type. If a LogicalType that has children is removed, all of the children types will have their parent set to the
parent of the LogicalType that is being removed, assuming a parent was defined.
Remove the custom UPCCode type and confirm it has been removed from the type system by listing the available
LogicalTypes. You can confirm that the UPCCode type will no longer be used because it will have a value of False
listed in the is_registered column.
[16]: ww.type_system.remove_type('UPCCode')
ww.list_logical_types()
0 Address address
1 Age age
3 Boolean boolean
7 Datetime datetime
8 Double double
12 Integer integer
14 LatLong lat_long
16 Ordinal ordinal
1.1. Quick Start 49


22 UPCCode upc_code
23 URL url
22 Represents Logical Types that contain 12-digit... category

4 {} True True None
7 {} True True None
22 {upc_code, category} False False None
structures.
Resetting Type System to Defaults
Finally, if you made multiple changes to the default Woodwork type system and would like to reset everything back
to the default state, you can use the reset_defaults method as shown below. This unregisters any new types
you have registered, resets all relationships to their default values and sets all inference functions back to their default
functions.
[17]: ww.type_system.reset_defaults()
Overriding Default LogicalTypes
There may be times when you would like to override Woodwork’s default LogicalTypes. An example might be if you
wanted to use the nullable Int64 dtype for the Integer LogicalType instead of the default dtype of int64. In this
case, you want to stop Woodwork from inferring the default Integer LogicalType and have a compatible Logical
Type inferred instead. You may solve this issue in one of two ways.
First, you can create an entirely new LogicalType with its own name, MyInteger, and register it in the TypeSystem.
If you want to infer it in place of the normal Integer LogicalType, you would remove Integer from the type sys-
tem, and use Integer’s default inference function for MyInteger. Doing this will make it such that MyInteger
will get inferred any place that Integer would have previously. Note, that because Integer has a parent Logi-
calType of IntegerNullable, you also need to set the parent of MyInteger to be IntegerNullable when
registering with the type system.
[18]: from woodwork.logical_types import LogicalType
class MyInteger(LogicalType):
primary_dtype = 'Int64'
standard_tags = {'numeric'}
int_inference_fn = ww.type_system.inference_functions[ww.logical_types.Integer]
ww.type_system.remove_type(ww.logical_types.Integer)
ww.type_system.add_type(MyInteger, int_inference_fn, parent='IntegerNullable')
df.ww.init()
df.ww
Column
id Int64 MyInteger ['numeric']
Above, you can see that the id column, which was previously inferred as Integer is now inferred as MyInteger
with the Int64 physical type. In the full list of Logical Types at ww.list_logical_types(), Integer
and MyInteger will now both be present, but Integer’s is_registered will be False while the value for
is_registered for MyInteger will be set to True.
The second option for overriding the default Logical Types allows you to create a new LogicalType with the same
name as an existing one. This might be desirable because it will allow Woodwork to interpret the string 'Integer'
as your new LogicalType, allowing previous code that might have selected 'Integer' to be used without updating
references to a new LogicalType like MyInteger.
Before adding a LogicalType whose name already exists into the TypeSystem, you must first unregister the default
LogicalType.
In order to avoid using the same name space locally between Integer LogicalTypes, it is recommended to reference
Woodwork’s default LogicalType as ww.logical_types.Integer.
1.1. Quick Start 51

[19]: ww.type_system.reset_defaults()
class Integer(LogicalType):
primary_dtype = 'Int64'
standard_tags = {'numeric'}
int_inference_fn = ww.type_system.inference_functions[ww.logical_types.Integer]
ww.type_system.remove_type(ww.logical_types.Integer)
ww.type_system.add_type(Integer, int_inference_fn, parent='IntegerNullable')
df.ww.init()
display(df.ww)
ww.type_system.reset_defaults()
Physical Type Logical Type Semantic Tag(s)
Column
id Int64 Integer ['numeric']
Notice how id now gets inferred as an Integer Logical Type that has Int64 as its Physical Type!
API Reference
WoodworkTableAccessor
WoodworkTableAccessor(dataframe)
WoodworkTableAccessor. Adds specified semantic tags to columns, updating the
add_semantic_tags(. . . ) Woodwork typing information.
WoodworkTableAccessor. Calculates statistics for data contained in the
describe([include]) DataFrame.
WoodworkTableAccessor. Calculates statistics for data contained in the
describe_dict([include]) DataFrame.
WoodworkTableAccessor.drop(columns) Drop specified columns from a DataFrame.
WoodworkTableAccessor.iloc Integer-location based indexing for selection by posi-
tion.
WoodworkTableAccessor.index The index column for the table
WoodworkTableAccessor.init([index, . . . ]) Initializes Woodwork typing information for a
DataFrame.
WoodworkTableAccessor.loc Access a group of rows by label(s) or a boolean array.
WoodworkTableAccessor.logical_types A dictionary containing logical types for each column
WoodworkTableAccessor. Calculates mutual information between all pairs of
mutual_information([. . . ]) columns in the DataFrame that support mutual informa-
tion.
WoodworkTableAccessor. Calculates mutual information between all pairs of
mutual_information_dict([. . . ]) columns in the DataFrame that support mutual informa-
tion.
WoodworkTableAccessor.physical_types A dictionary containing physical types for each column
WoodworkTableAccessor.pop(column_name) Return a Series with Woodwork typing information and
remove it from the DataFrame.
WoodworkTableAccessor. Remove the semantic tags for any column names in the
remove_semantic_tags(. . . ) provided semantic_tags dictionary, updating the Wood-
work typing information.
continues on next page
structures.
Table 1 – continued from previous page

WoodworkTableAccessor.rename(columns) Renames columns in a DataFrame, maintaining Wood-
WoodworkTableAccessor. Reset the semantic tags for the specified columns to the
reset_semantic_tags([. . . ]) default values.
WoodworkTableAccessor.schema A copy of the Woodwork typing information for the
DataFrame.
WoodworkTableAccessor.select([include, ex- Create a DataFrame with Woodwork typing informa-
clude]) tion initialized that includes only columns whose Log-
ical Type and semantic tags match conditions specified
in the list of types and tags to include or exclude.
WoodworkTableAccessor.semantic_tags A dictionary containing semantic tags for each column
WoodworkTableAccessor. Sets the index column of the DataFrame.
set_index(new_index)
WoodworkTableAccessor. Set the time index.
set_time_index(. . . )
WoodworkTableAccessor.set_types([. . . ]) Update the logical type and semantic tags for any
columns names in the provided types dictionaries,
updating the Woodwork typing information for the
DataFrame.
WoodworkTableAccessor.time_index The time index column for the table
WoodworkTableAccessor.to_disk(path[, . . . ]) Write Woodwork table to disk in the format specified by
format, location specified by path.
WoodworkTableAccessor.to_dictionary() Get a dictionary representation of the Woodwork typing
information.
WoodworkTableAccessor.types DataFrame containing the physical dtypes, logical types
and semantic tags for the schema.
WoodworkTableAccessor. A dictionary containing the use_standard_tags setting
use_standard_tags for each column in the table
WoodworkTableAccessor. Returns a list of dictionaries with counts for the most
value_counts([. . . ]) frequent values in each column (only
woodwork.table_accessor.WoodworkTableAccessor
class woodwork.table_accessor.WoodworkTableAccessor(dataframe)
__init__(dataframe)
Initialize self. See help(type(self)) for accurate signature.
Methods
__init__(dataframe) Initialize self.

add_semantic_tags(semantic_tags) Adds specified semantic tags to columns, updating
the Woodwork typing information.
describe([include]) Calculates statistics for data contained in the
DataFrame.
describe_dict([include]) Calculates statistics for data contained in the
DataFrame.
drop(columns) Drop specified columns from a DataFrame.
1.1. Quick Start 53


init([index, time_index, logical_types, . . . ]) Initializes Woodwork typing information for a
DataFrame.
mutual_information([num_bins, nrows, . . . ]) Calculates mutual information between all pairs of
columns in the DataFrame that support mutual infor-
mation.
mutual_information_dict([num_bins, Calculates mutual information between all pairs of
nrows, . . . ]) columns in the DataFrame that support mutual infor-
mation.
pop(column_name) Return a Series with Woodwork typing information
and remove it from the DataFrame.
remove_semantic_tags(semantic_tags) Remove the semantic tags for any column names in
the provided semantic_tags dictionary, updating the
Woodwork typing information.
rename(columns) Renames columns in a DataFrame, maintaining
reset_semantic_tags([columns, re- Reset the semantic tags for the specified columns to
tain_index_tags]) the default values.
select([include, exclude]) Create a DataFrame with Woodwork typing infor-
mation initialized that includes only columns whose
Logical Type and semantic tags match conditions
specified in the list of types and tags to include or
exclude.
set_index(new_index) Sets the index column of the DataFrame.
set_time_index(new_time_index) Set the time index.
set_types([logical_types, semantic_tags, . . . ]) Update the logical type and semantic tags for any
updating the Woodwork typing information for the
DataFrame.
to_dictionary() Get a dictionary representation of the Woodwork
typing information.
to_disk(path[, format, compression, . . . ]) Write Woodwork table to disk in the format specified
by format, location specified by path.
value_counts([ascending, top_n, dropna]) Returns a list of dictionaries with counts for the most
frequent values in each column (only
Attributes
iloc Integer-location based indexing for selection by po-

sition.
index The index column for the table
loc Access a group of rows by label(s) or a boolean array.
logical_types A dictionary containing logical types for each col-
umn
physical_types A dictionary containing physical types for each col-
umn
schema A copy of the Woodwork typing information for the
DataFrame.
semantic_tags A dictionary containing semantic tags for each col-
umn
time_index The time index column for the table
structures.

types DataFrame containing the physical dtypes, logical
types and semantic tags for the schema.
use_standard_tags A dictionary containing the use_standard_tags set-
ting for each column in the table
woodwork.table_accessor.WoodworkTableAccessor.add_semantic_tags
WoodworkTableAccessor.add_semantic_tags(semantic_tags)
Adds specified semantic tags to columns, updating the Woodwork typing information. Will retain any previously
set values.
Parameters semantic_tags (dict[str -> str/list/set]) – A dictionary mapping the
columns in the DataFrame to the tags that should be added to the column’s semantic tags
woodwork.table_accessor.WoodworkTableAccessor.describe
WoodworkTableAccessor.describe(include=None)
Calculates statistics for data contained in the DataFrame.
Parameters include (list[str or LogicalType], optional) – filter for what
columns to include in the statistics returned. Can be a list of column names, semantic tags,
logical types, or a list combining any of the three. It follows the most broad specification. Fa-
vors logical types then semantic tag then column name. If no matching columns are found, an
empty DataFrame will be returned.
Returns A Dataframe containing statistics for the data or the subset of the original DataFrame that
contains the logical types, semantic tags, or column names specified in include.
Return type pd.DataFrame
woodwork.table_accessor.WoodworkTableAccessor.describe_dict
WoodworkTableAccessor.describe_dict(include=None)
Calculates statistics for data contained in the DataFrame.
Parameters include (list[str or LogicalType], optional) – filter for what
columns to include in the statistics returned. Can be a list of column names, semantic tags,
logical types, or a list combining any of the three. It follows the most broad specification. Fa-
vors logical types then semantic tag then column name. If no matching columns are found, an
empty DataFrame will be returned.
Returns A dictionary with a key for each column in the data or for each column matching the logical
types, semantic tags or column names specified in include, paired with a value containing a
dictionary containing relevant statistics for that column.
Return type dict[str -> dict]
1.1. Quick Start 55

woodwork.table_accessor.WoodworkTableAccessor.drop
WoodworkTableAccessor.drop(columns)
Drop specified columns from a DataFrame.
Parameters columns (str or list[str]) – Column name or names to drop. Must be
present in the DataFrame.
Returns DataFrame with the specified columns removed, maintaining Woodwork typing informa-
tion.
Return type DataFrame
Note: This method is used for removing columns only. To remove rows with drop, go through the DataFrame
directly and then reinitialize Woodwork with DataFrame.ww.init instead of calling DataFrame.ww.
drop.
woodwork.table_accessor.WoodworkTableAccessor.iloc
property WoodworkTableAccessor.iloc
Integer-location based indexing for selection by position. .iloc[] is primarily integer position based (from 0
to length-1 of the axis), but may also be used with a boolean array.
If the selection result is a DataFrame or Series, Woodwork typing information will be initialized for the returned
object when possible.
Allowed inputs are: An integer, e.g. 5. A list or array of integers, e.g. [4, 3, 0]. A slice object with ints,
e.g. 1:7. A boolean array. A callable function with one argument (the calling Series, DataFrame or
Panel) and that returns valid output for indexing (one of the above). This is useful in method chains, when
you don’t have a reference to the calling object, but would like to base your selection on some value.
woodwork.table_accessor.WoodworkTableAccessor.index
property WoodworkTableAccessor.index
The index column for the table
woodwork.table_accessor.WoodworkTableAccessor.init
WoodworkTableAccessor.init(index=None, time_index=None, logical_types=None,

make_index=False, already_sorted=False, schema=None, vali-
date=True, use_standard_tags=True, **kwargs)
Initializes Woodwork typing information for a DataFrame.
Parameters
• index (str, optional) – Name of the index column.
• time_index (str, optional) – Name of the time index column.
• logical_types (dict[str -> LogicalType]) – Dictionary mapping column
names in the DataFrame to the LogicalType for the column.
structures.
• make_index (bool, optional) – If True, will create a new unique, numeric index
column with the name specified by index and will add the new index column to the sup-
plied DataFrame. If True, the name specified in index cannot match an existing column
name in dataframe. If False, the name is specified in index must match a column
present in the dataframe. Defaults to False.
• already_sorted (bool, optional) – Indicates whether the input DataFrame is al-
ready sorted on the time index. If False, will sort the dataframe first on the time_index and
then on the index (pandas DataFrame only). Defaults to False.
• name (str, optional) – Name used to identify the DataFrame.
• semantic_tags (dict, optional) – Dictionary mapping column names in Wood-
work to the semantic tags for the column. The keys in the dictionary should be strings that
correspond to column names. There are two options for specifying the dictionary values:
(str): If only one semantic tag is being set, a single string can be used as a value. (list[str]
or set[str]): If multiple tags are being set, a list or set of strings can be used as the value.
Semantic tags will be set to an empty set for any column not included in the dictionary.
• table_metadata (dict[str -> json serializable], optional) – Dic-
tionary containing extra metadata for Woodwork.
• column_metadata (dict[str -> dict[str -> json serializable]],
optional) – Dictionary mapping column names to that column’s metadata dictionary.
• use_standard_tags (bool, dict[str -> bool], optional) – Determines
whether standard semantic tags will be added to columns based on the specified logical type
for the column. If a single boolean is supplied, will apply the same use_standard_tags value
to all columns. A dictionary can be used to specify use_standard_tags values for
individual columns. Unspecified columns will use the default value. Defaults to True.
• column_descriptions (dict[str -> str], optional) – Dictionary map-
ping column names to column descriptions.
• schema (Woodwork.TableSchema, optional) – Typing information to use for the
DataFrame instead of performing inference. Any other arguments provided will be ignored.
Note that any changes made to the schema object after initialization will propagate to the
DataFrame. Similarly, to avoid unintended typing information changes, the same schema
object should not be shared between DataFrames.
• validate (bool, optional) – Whether parameter and data validation should occur.
Defaults to True. Warning: Should be set to False only when parameters and data are
known to be valid. Any errors resulting from skipping validation with invalid inputs may
not be easily understood.
woodwork.table_accessor.WoodworkTableAccessor.loc
property WoodworkTableAccessor.loc
Access a group of rows by label(s) or a boolean array.
.loc[] is primarily label based, but may also be used with a boolean array.
If the selection result is a DataFrame or Series, Woodwork typing information will be initialized for the returned
object when possible.
Allowed inputs are: A single label, e.g. 5 or 'a', (note that 5 is interpreted as a label of the index, and never
as an integer position along the index). A list or array of labels, e.g. ['a', 'b', 'c']. A slice object
with labels, e.g. 'a':'f'. A boolean array of the same length as the axis being sliced, e.g. [True,
False, True]. An alignable boolean Series. The index of the key will be aligned before masking. An
1.1. Quick Start 57

alignable Index. The Index of the returned selection will be the input. A callable function with one
argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above)
woodwork.table_accessor.WoodworkTableAccessor.logical_types
property WoodworkTableAccessor.logical_types
A dictionary containing logical types for each column
woodwork.table_accessor.WoodworkTableAccessor.mutual_information
WoodworkTableAccessor.mutual_information(num_bins=10, nrows=None, in-

clude_index=False)
Calculates mutual information between all pairs of columns in the DataFrame that support mutual information.
Logical Types that support mutual information are as follows: Age, AgeNullable, Boolean, BooleanNullable,
Categorical, CountryCode, Datetime, Double, Integer, IntegerNullable, Ordinal, PostalCode, and SubRegion-
Code
Parameters
• num_bins (int) – Determines number of bins to use for converting numeric features into
categorical.
• nrows (int) – The number of rows to sample for when determining mutual info. If speci-
fied, samples the desired number of rows from the data. Defaults to using all rows.
• include_index (bool) – If True, the column specified as the index will be included
as long as its LogicalType is valid for mutual information calculations. If False, the index
column will not have mutual information calculated for it. Defaults to False.
Returns A DataFrame containing mutual information with columns column_1, column_2, and mu-
tual_info that is sorted in decending order by mutual info. Mutual information values are be-
tween 0 (no mutual information) and 1 (perfect dependency).
woodwork.table_accessor.WoodworkTableAccessor.mutual_information_dict
WoodworkTableAccessor.mutual_information_dict(num_bins=10, nrows=None, in-

clude_index=False)
Calculates mutual information between all pairs of columns in the DataFrame that support mutual information.
Logical Types that support mutual information are as follows: Age, AgeNullable, Boolean, BooleanNullable,
Categorical, CountryCode, Datetime, Double, Integer, IntegerNullable, Ordinal, PostalCode, and SubRegion-
Code
Parameters
• num_bins (int) – Determines number of bins to use for converting numeric features into
categorical.
• nrows (int) – The number of rows to sample for when determining mutual info. If speci-
fied, samples the desired number of rows from the data. Defaults to using all rows.
• include_index (bool) – If True, the column specified as the index will be included
as long as its LogicalType is valid for mutual information calculations. If False, the index
column will not have mutual information calculated for it. Defaults to False.
structures.
Returns A list containing dictionaries that have keys column_1, column_2, and mutual_info that is
sorted in decending order by mutual info. Mutual information values are between 0 (no mutual
information) and 1 (perfect dependency).
Return type list(dict)
woodwork.table_accessor.WoodworkTableAccessor.physical_types
property WoodworkTableAccessor.physical_types
A dictionary containing physical types for each column
woodwork.table_accessor.WoodworkTableAccessor.pop
WoodworkTableAccessor.pop(column_name)
Return a Series with Woodwork typing information and remove it from the DataFrame.
Parameters column (str) – Name of the column to pop.
Returns Popped series with Woodwork initialized
Return type Series
woodwork.table_accessor.WoodworkTableAccessor.remove_semantic_tags
WoodworkTableAccessor.remove_semantic_tags(semantic_tags)
Remove the semantic tags for any column names in the provided semantic_tags dictionary, updating the Wood-
work typing information. Including index or time_index tags will set the Woodwork index or time index to None
for the DataFrame.
columns in the DataFrame to the tags that should be removed from the column’s semantic tags
woodwork.table_accessor.WoodworkTableAccessor.rename
WoodworkTableAccessor.rename(columns)
Renames columns in a DataFrame, maintaining Woodwork typing information.
Parameters columns (dict[str -> str]) – A dictionary mapping current column names to
new column names.
Returns DataFrame with the specified columns renamed, maintaining Woodwork typing informa-
tion.
1.1. Quick Start 59

woodwork.table_accessor.WoodworkTableAccessor.reset_semantic_tags
WoodworkTableAccessor.reset_semantic_tags(columns=None, retain_index_tags=False)
Reset the semantic tags for the specified columns to the default values. The default values will be either an
empty set or a set of the standard tags based on the column logical type, controlled by the use_standard_tags
property on each column. Column names can be provided as a single string, a list of strings or a set of strings.
If columns is not specified, tags will be reset for all columns.
Parameters
• columns (str/list/set, optional) – The columns for which the semantic tags
should be reset.
• retain_index_tags (bool, optional) – If True, will retain any index or
time_index semantic tags set on the column. If False, will clear all semantic tags. Defaults
to False.
woodwork.table_accessor.WoodworkTableAccessor.schema
property WoodworkTableAccessor.schema
A copy of the Woodwork typing information for the DataFrame.
woodwork.table_accessor.WoodworkTableAccessor.select
WoodworkTableAccessor.select(include=None, exclude=None)
Create a DataFrame with Woodwork typing information initialized that includes only columns whose Logical
Type and semantic tags match conditions specified in the list of types and tags to include or exclude. Values for
both include and exclude cannot be provided in a single call.
If no matching columns are found, an empty DataFrame will be returned.
Parameters
• include (str or LogicalType or list[str or LogicalType]) – Logi-
cal types, semantic tags to include in the DataFrame.
• exclude (str or LogicalType or list[str or LogicalType]) – Logi-
cal types, semantic tags to exclude from the DataFrame.
Returns The subset of the original DataFrame that matches the conditions specified by include
or exclude. Has Woodwork typing information initialized.
woodwork.table_accessor.WoodworkTableAccessor.semantic_tags
property WoodworkTableAccessor.semantic_tags
A dictionary containing semantic tags for each column
structures.
woodwork.table_accessor.WoodworkTableAccessor.set_index
WoodworkTableAccessor.set_index(new_index)
Sets the index column of the DataFrame. Adds the ‘index’ semantic tag to the column and clears the tag from
any previously set index column.
Setting a column as the index column will also cause any previously set standard tags for the column to be
removed.
Clears the DataFrame’s index by passing in None.
Parameters new_index (str) – The name of the column to set as the index
woodwork.table_accessor.WoodworkTableAccessor.set_time_index
WoodworkTableAccessor.set_time_index(new_time_index)
Set the time index. Adds the ‘time_index’ semantic tag to the column and clears the tag from any previously set
index column
Parameters new_time_index (str) – The name of the column to set as the time index. If None,
will remove the time_index.
woodwork.table_accessor.WoodworkTableAccessor.set_types
WoodworkTableAccessor.set_types(logical_types=None, semantic_tags=None, re-

tain_index_tags=True)
Update the logical type and semantic tags for any columns names in the provided types dictionaries, updating
the Woodwork typing information for the DataFrame.
Parameters
• logical_types (dict[str -> str], optional) – A dictionary defining the
new logical types for the specified columns.
• semantic_tags (dict[str -> str/list/set], optional) – A dictionary
defining the new semantic_tags for the specified columns.
time_index semantic tags set on the column. If False, will replace all semantic tags any
time a column’s semantic tags or logical type changes. Defaults to True.
woodwork.table_accessor.WoodworkTableAccessor.time_index
property WoodworkTableAccessor.time_index
The time index column for the table
1.1. Quick Start 61

woodwork.table_accessor.WoodworkTableAccessor.to_disk
WoodworkTableAccessor.to_disk(path, format='csv', compression=None, profile_name=None,

**kwargs)
Write Woodwork table to disk in the format specified by format, location specified by path. Path could be a
local path or an S3 path. If writing to S3 a tar archive of files will be written.
Note: As the engine fastparquet cannot handle nullable pandas dtypes, pyarrow will be used for serialization
to parquet.
Parameters
• path (str) – Location on disk to write to (will be created as a directory)
• format (str) – Format to use for writing Woodwork data. Defaults to csv. Possible values
are: {‘csv’, ‘pickle’, ‘parquet’}.
• compression (str) – Name of the compression to use. Possible values are: {‘gzip’,
‘bz2’, ‘zip’, ‘xz’, None}.
• profile_name (str) – Name of AWS profile to use, False to use an anonymous profile,
or None.
• kwargs (keywords) – Additional keyword arguments to pass as keywords arguments to
the underlying serialization method or to specify AWS profile.
woodwork.table_accessor.WoodworkTableAccessor.to_dictionary
WoodworkTableAccessor.to_dictionary()
Get a dictionary representation of the Woodwork typing information.
Returns Description of the typing information.
Return type dict
woodwork.table_accessor.WoodworkTableAccessor.types
property WoodworkTableAccessor.types
DataFrame containing the physical dtypes, logical types and semantic tags for the schema.
woodwork.table_accessor.WoodworkTableAccessor.use_standard_tags
property WoodworkTableAccessor.use_standard_tags
A dictionary containing the use_standard_tags setting for each column in the table
structures.
woodwork.table_accessor.WoodworkTableAccessor.value_counts
WoodworkTableAccessor.value_counts(ascending=False, top_n=10, dropna=False)

Returns a list of dictionaries with counts for the most frequent values in each column (only for columns
with category as a standard tag).
Parameters
• ascending (bool) – Defines whether each list of values should be sorted most frequent
to least frequent value (False), or least frequent to most frequent value (True). Defaults to
False.
• top_n (int) – the number of top values to retrieve. Defaults to 10.
• dropna (bool) – determines whether to remove NaN values when finding frequency.
Defaults to False.
Returns a list of dictionaries for each categorical column with keys count and value.
Return type list(dict)
WoodworkColumnAccessor
WoodworkColumnAccessor(series)
WoodworkColumnAccessor. Add the specified semantic tags to the set of tags.
add_semantic_tags(. . . )
WoodworkColumnAccessor.description The description of the series
WoodworkColumnAccessor.iloc Integer-location based indexing for selection by posi-
tion.
WoodworkColumnAccessor.init([logical_type, Initializes Woodwork typing information for a Series.
. . . ])
WoodworkColumnAccessor.loc Access a group of rows by label(s) or a boolean array.
WoodworkColumnAccessor.logical_type The logical type of the series
WoodworkColumnAccessor.metadata The metadata of the series
WoodworkColumnAccessor. Removes specified semantic tags from the current tags.
remove_semantic_tags(. . . )
WoodworkColumnAccessor. Reset the semantic tags to the default values.
reset_semantic_tags()
WoodworkColumnAccessor.semantic_tags The semantic tags assigned to the series
WoodworkColumnAccessor. Update the logical type for the series, clearing any previ-
set_logical_type(. . . ) ously set semantic tags, and returning a new series with
Woodwork initialied.
WoodworkColumnAccessor. Replace current semantic tags with new values.
set_semantic_tags(. . . )
WoodworkColumnAccessor.
use_standard_tags
1.1. Quick Start 63

woodwork.column_accessor.WoodworkColumnAccessor
class woodwork.column_accessor.WoodworkColumnAccessor(series)
__init__(series)
Methods
__init__(series) Initialize self.

add_semantic_tags(semantic_tags) Add the specified semantic tags to the set of tags.
init([logical_type, semantic_tags, . . . ]) Initializes Woodwork typing information for a Se-
ries.
remove_semantic_tags(semantic_tags) Removes specified semantic tags from the current
tags.
reset_semantic_tags() Reset the semantic tags to the default values.
set_logical_type(logical_type) Update the logical type for the series, clearing any
previously set semantic tags, and returning a new se-
ries with Woodwork initialied.
set_semantic_tags(semantic_tags) Replace current semantic tags with new values.
Attributes
description The description of the series

iloc Integer-location based indexing for selection by po-
sition.
loc Access a group of rows by label(s) or a boolean array.
logical_type The logical type of the series
metadata The metadata of the series
schema
semantic_tags The semantic tags assigned to the series
use_standard_tags
woodwork.column_accessor.WoodworkColumnAccessor.add_semantic_tags
WoodworkColumnAccessor.add_semantic_tags(semantic_tags)
Add the specified semantic tags to the set of tags.
Parameters semantic_tags (str/list/set) – New semantic tag(s) to add
structures.
woodwork.column_accessor.WoodworkColumnAccessor.description
property WoodworkColumnAccessor.description
The description of the series
woodwork.column_accessor.WoodworkColumnAccessor.iloc
property WoodworkColumnAccessor.iloc
Integer-location based indexing for selection by position. .iloc[] is primarily integer position based (from 0
to length-1 of the axis), but may also be used with a boolean array.
If the selection result is a Series, Woodwork typing information will be initialized for the returned Series.
Allowed inputs are: An integer, e.g. 5. A list or array of integers, e.g. [4, 3, 0]. A slice object with ints,
e.g. 1:7. A boolean array. A callable function with one argument (the calling Series, DataFrame or
Panel) and that returns valid output for indexing (one of the above). This is useful in method chains, when
you don’t have a reference to the calling object, but would like to base your selection on some value.
woodwork.column_accessor.WoodworkColumnAccessor.init
WoodworkColumnAccessor.init(logical_type=None, semantic_tags=None, use_standard_tags=True,

description=None, metadata=None, schema=None, validate=True)
Initializes Woodwork typing information for a Series.
Parameters
• logical_type (LogicalType or str, optional) – The logical type that
should be assigned to the series. If no value is provided, the LogicalType for the series
will be inferred. If the LogicalType provided or inferred does not have a dtype that is com-
patible with the series dtype, an error will be raised.
• semantic_tags (str or list or set, optional) – Semantic tags to assign
to the series. Defaults to an empty set if not specified. There are two options for specifying
the semantic tags: (str) If only one semantic tag is being set, a single string can be passed.
(list or set) If multiple tags are being set, a list or set of strings can be passed.
• use_standard_tags (bool, optional) – If True, will add standard semantic tags
to the series based on the inferred or specified logical type of the series. Defaults to True.
• description (str, optional) – Optional text describing the contents of the series.
• metadata (dict[str -> json serializable], optional) – Metadata asso-
ciated with the series.
• schema (Woodwork.ColumnSchema, optional) – Typing information to use for
the Series instead of performing inference. Any other arguments provided will be ignored.
Note that any changes made to the schema object after initialization will propagate to the
Series. Similarly, to avoid unintended typing information changes, the same schema object
should not be shared between Series.
1.1. Quick Start 65

woodwork.column_accessor.WoodworkColumnAccessor.loc
property WoodworkColumnAccessor.loc
Access a group of rows by label(s) or a boolean array.
.loc[] is primarily label based, but may also be used with a boolean array.
If the selection result is a Series, Woodwork typing information will be initialized for the returned Series.
Allowed inputs are: A single label, e.g. 5 or 'a', (note that 5 is interpreted as a label of the index, and never
as an integer position along the index). A list or array of labels, e.g. ['a', 'b', 'c']. A slice object
with labels, e.g. 'a':'f'. A boolean array of the same length as the axis being sliced, e.g. [True,
False, True]. An alignable boolean Series. The index of the key will be aligned before masking. An
alignable Index. The Index of the returned selection will be the input. A callable function with one
argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above)
woodwork.column_accessor.WoodworkColumnAccessor.logical_type
property WoodworkColumnAccessor.logical_type
The logical type of the series
woodwork.column_accessor.WoodworkColumnAccessor.metadata
property WoodworkColumnAccessor.metadata
The metadata of the series
woodwork.column_accessor.WoodworkColumnAccessor.remove_semantic_tags
WoodworkColumnAccessor.remove_semantic_tags(semantic_tags)
Removes specified semantic tags from the current tags.
Parameters semantic_tags (str/list/set) – Semantic tag(s) to remove.
woodwork.column_accessor.WoodworkColumnAccessor.reset_semantic_tags
WoodworkColumnAccessor.reset_semantic_tags()
Reset the semantic tags to the default values. The default values will be either an empty set or a set of the
standard tags based on the column logical type, controlled by the use_standard_tags property.
Parameters None –
woodwork.column_accessor.WoodworkColumnAccessor.semantic_tags
property WoodworkColumnAccessor.semantic_tags
The semantic tags assigned to the series
structures.
woodwork.column_accessor.WoodworkColumnAccessor.set_logical_type
WoodworkColumnAccessor.set_logical_type(logical_type)
Update the logical type for the series, clearing any previously set semantic tags, and returning a new series with
Woodwork initialied.
Parameters logical_type (LogicalType, str) – The new logical type to set for the series.
Returns A new series with the updated logical type.
Return type Series
woodwork.column_accessor.WoodworkColumnAccessor.set_semantic_tags
WoodworkColumnAccessor.set_semantic_tags(semantic_tags)
Replace current semantic tags with new values. If use_standard_tags is set to True for the series, any standard
tags associated with the LogicalType of the series will be added as well.
Parameters semantic_tags (str/list/set) – New semantic tag(s) to set
woodwork.column_accessor.WoodworkColumnAccessor.use_standard_tags
property WoodworkColumnAccessor.use_standard_tags
TableSchema
TableSchema(column_names, logical_types[, . . . ])
TableSchema.add_semantic_tags(semantic_tags)Adds specified semantic tags to columns, updating the
TableSchema.index The index column for the table
TableSchema.logical_types A dictionary containing logical types for each column
TableSchema.rename(columns) Renames columns in a TableSchema
TableSchema.remove_semantic_tags(semantic_tags) Remove the semantic tags for any column names in the
provided semantic_tags dictionary, updating the Wood-
TableSchema.reset_semantic_tags([columns, Reset the semantic tags for the specified columns to the
. . . ]) default values.
TableSchema.semantic_tags A dictionary containing semantic tags for each column
TableSchema.set_index(new_index[, validate]) Sets the index.
TableSchema.set_time_index(new_time_index[, Set the time index.
. . . ])
TableSchema.set_types([logical_types, . . . ]) Update the logical type and semantic tags for any
columns names in the provided types dictionaries, up-
dating the TableSchema at those columns.
TableSchema.time_index The time index column for the table
TableSchema.types DataFrame containing the physical dtypes, logical types
and semantic tags for the TableSchema.
TableSchema.use_standard_tags
1.1. Quick Start 67

woodwork.table_schema.TableSchema
class woodwork.table_schema.TableSchema(column_names, logical_types, name=None,

index=None, time_index=None, seman-
tic_tags=None, table_metadata=None, col-
umn_metadata=None, use_standard_tags=False,
column_descriptions=None, validate=True)
__init__(column_names, logical_types, name=None, index=None, time_index=None, seman-

tic_tags=None, table_metadata=None, column_metadata=None, use_standard_tags=False,
column_descriptions=None, validate=True)
Create TableSchema
Parameters
• column_names (list, set) – The columns present in the TableSchema.
• logical_types (dict[str -> LogicalType]) – Dictionary mapping column
names in the TableSchema to the LogicalType for the column. All columns present in the
TableSchema must be present in the logical_types dictionary.
• name (str, optional) – Name used to identify the TableSchema.
• semantic_tags (dict, optional) – Dictionary mapping column names in the
TableSchema to the semantic tags for the column. The keys in the dictionary should be
strings that correspond to columns in the TableSchema. There are two options for specify-
ing the dictionary values: (str): If only one semantic tag is being set, a single string can be
used as a value. (list[str] or set[str]): If multiple tags are being set, a list or set of strings
can be used as the value. Semantic tags will be set to an empty set for any column not
included in the dictionary.
• table_metadata (dict[str -> json serializable], optional) –
Dictionary containing extra metadata for the TableSchema.
• column_metadata (dict[str -> dict[str -> json
serializable]], optional) – Dictionary mapping column names to that
column’s metadata dictionary.
• use_standard_tags (bool, dict[str -> bool], optional) – Deter-
mines whether standard semantic tags will be added to columns based on the spec-
ified logical type for the column. If a single boolean is supplied, will apply the
same use_standard_tags value to all columns. A dictionary can be used to specify
use_standard_tags values for individual columns. Unspecified columns will use
the default value. Defaults to False.
• column_descriptions (dict[str -> str], optional) – Dictionary map-
ping column names to column descriptions.
• validate (bool, optional) – Whether parameter validation should occur. De-
faults to True. Warning: Should be set to False only when parameters and data are known
to be valid. Any errors resulting from skipping validation with invalid inputs may not be
easily understood.
structures.
Methods
__init__(column_names, logical_types[, . . . ]) Create TableSchema

add_semantic_tags(semantic_tags) Adds specified semantic tags to columns, updating
the Woodwork typing information.
remove_semantic_tags(semantic_tags) Remove the semantic tags for any column names in
the provided semantic_tags dictionary, updating the
rename(columns) Renames columns in a TableSchema
reset_semantic_tags([columns, re- Reset the semantic tags for the specified columns to
tain_index_tags]) the default values.
set_index(new_index[, validate]) Sets the index.
set_time_index(new_time_index[, validate]) Set the time index.
set_types([logical_types, semantic_tags, . . . ]) Update the logical type and semantic tags for any
updating the TableSchema at those columns.
Attributes
index The index column for the table

logical_types A dictionary containing logical types for each col-
umn
semantic_tags A dictionary containing semantic tags for each col-
umn
time_index The time index column for the table
types DataFrame containing the physical dtypes, logical
types and semantic tags for the TableSchema.
use_standard_tags
woodwork.table_schema.TableSchema.add_semantic_tags
TableSchema.add_semantic_tags(semantic_tags)
Adds specified semantic tags to columns, updating the Woodwork typing information. Will retain any previously
set values.
columns in the DataFrame to the tags that should be added to the column’s semantic tags
1.1. Quick Start 69

woodwork.table_schema.TableSchema.index
property TableSchema.index
The index column for the table
woodwork.table_schema.TableSchema.logical_types
property TableSchema.logical_types
A dictionary containing logical types for each column
woodwork.table_schema.TableSchema.rename
TableSchema.rename(columns)
Renames columns in a TableSchema
Parameters columns (dict[str -> str]) – A dictionary mapping current column names to
new column names.
Returns TableSchema with the specified columns renamed.
Return type woodwork.TableSchema
woodwork.table_schema.TableSchema.remove_semantic_tags
TableSchema.remove_semantic_tags(semantic_tags)
Remove the semantic tags for any column names in the provided semantic_tags dictionary, updating the Wood-
work typing information. Including index or time_index tags will set the Woodwork index or time index to None
for the DataFrame.
columns in the DataFrame to the tags that should be removed from the column’s semantic tags
woodwork.table_schema.TableSchema.reset_semantic_tags
TableSchema.reset_semantic_tags(columns=None, retain_index_tags=False)
Reset the semantic tags for the specified columns to the default values. The default values will be either an
empty set or a set of the standard tags based on the column logical type, controlled by the use_standard_tags
property on the table. Column names can be provided as a single string, a list of strings or a set of strings. If
columns is not specified, tags will be reset for all columns.
Parameters
• columns (str/list/set, optional) – The columns for which the semantic tags
should be reset.
time_index semantic tags set on the column. If False, will clear all semantic tags. Defaults
to False.
structures.
woodwork.table_schema.TableSchema.semantic_tags
property TableSchema.semantic_tags
A dictionary containing semantic tags for each column
woodwork.table_schema.TableSchema.set_index
TableSchema.set_index(new_index, validate=True)
Sets the index. Handles setting a new index, updating the index, or removing the index.
Parameters new_index (str) – Name of the new index column. Must be present in the Ta-
bleSchema. If None, will remove the index.
woodwork.table_schema.TableSchema.set_time_index
TableSchema.set_time_index(new_time_index, validate=True)
Set the time index. Adds the ‘time_index’ semantic tag to the column and clears the tag from any previously set
index column
Parameters new_time_index (str) – The name of the column to set as the time index. If None,
will remove the time_index.
woodwork.table_schema.TableSchema.set_types
TableSchema.set_types(logical_types=None, semantic_tags=None, retain_index_tags=True)

Update the logical type and semantic tags for any columns names in the provided types dictionaries, updating
the TableSchema at those columns.
Parameters
• logical_types (dict[str -> LogicalType], optional) – A dictionary
defining the new logical types for the specified columns.
• semantic_tags (dict[str -> str/list/set], optional) – A dictionary
defining the new semantic_tags for the specified columns.
time_index semantic tags set on the column. If False, will replace all semantic tags any
time a column’s semantic tags or logical type changes. Defaults to True.
woodwork.table_schema.TableSchema.time_index
property TableSchema.time_index
The time index column for the table
1.1. Quick Start 71

woodwork.table_schema.TableSchema.types
property TableSchema.types
DataFrame containing the physical dtypes, logical types and semantic tags for the TableSchema.
woodwork.table_schema.TableSchema.use_standard_tags
property TableSchema.use_standard_tags
ColumnSchema
ColumnSchema([logical_type, semantic_tags, . . . ])
ColumnSchema.is_boolean Whether the ColumnSchema is a Boolean column
ColumnSchema.is_categorical Whether the ColumnSchema is categorical in nature
ColumnSchema.is_datetime Whether the ColumnSchema is a Datetime column
ColumnSchema.is_numeric Whether the ColumnSchema is numeric in nature
woodwork.table_schema.ColumnSchema
class woodwork.table_schema.ColumnSchema(logical_type=None, semantic_tags=None,

use_standard_tags=False, description=None,
metadata=None, validate=True)
__init__(logical_type=None, semantic_tags=None, use_standard_tags=False, description=None,

metadata=None, validate=True)
Create ColumnSchema
Parameters
• logical_type (LogicalType, optional) – The column’s LogicalType.
• semantic_tags (str, list, set, optional) – The semantic tag(s) specified
for the column.
• use_standard_tags (boolean, optional) – If True, will add standard semantic
tags to the column based on the specified logical type if a logical type is defined for the
column. Defaults to False.
• description (str, optional) – User description of the column.
• metadata (dict[str -> json serializable], optional) – Extra meta-
data provided by the user.
• validate (bool, optional) – Whether to perform parameter validation. Defaults
to True.
structures.
Methods
__init__([logical_type, semantic_tags, . . . ]) Create ColumnSchema
Attributes
is_boolean Whether the ColumnSchema is a Boolean column

is_categorical Whether the ColumnSchema is categorical in nature
is_datetime Whether the ColumnSchema is a Datetime column
is_numeric Whether the ColumnSchema is numeric in nature
woodwork.table_schema.ColumnSchema.is_boolean
property ColumnSchema.is_boolean
Whether the ColumnSchema is a Boolean column
woodwork.table_schema.ColumnSchema.is_categorical
property ColumnSchema.is_categorical
Whether the ColumnSchema is categorical in nature
woodwork.table_schema.ColumnSchema.is_datetime
property ColumnSchema.is_datetime
Whether the ColumnSchema is a Datetime column
woodwork.table_schema.ColumnSchema.is_numeric
property ColumnSchema.is_numeric
Whether the ColumnSchema is numeric in nature
Serialization
typing_info_to_dict(dataframe) Creates the description for a Woodwork table, including

typing information for each column and loading infor-
mation.
write_dataframe(dataframe, path[, format]) Write underlying DataFrame data to disk or S3 path.
write_typing_info(typing_info, path) Writes Woodwork typing information to the specified
path at woodwork_typing_info.json
write_woodwork_table(dataframe, path[, . . . ]) Serialize Woodwork table and write to disk or S3 path.
1.1. Quick Start 73

woodwork.serialize.typing_info_to_dict
woodwork.serialize.typing_info_to_dict(dataframe)
Creates the description for a Woodwork table, including typing information for each column and loading infor-
mation.
Parameters dataframe (pd.DataFrame, dd.Dataframe, ks.DataFrame) –
DataFrame with Woodwork typing information initialized.
Returns Dictionary containing Woodwork typing information
Return type dict
woodwork.serialize.write_dataframe
woodwork.serialize.write_dataframe(dataframe, path, format='csv', **kwargs)

Write underlying DataFrame data to disk or S3 path.
Parameters
• dataframe (pd.DataFrame, dd.DataFrame, ks.DataFrame) – DataFrame
with Woodwork typing information initialized.
• path (str) – Location on disk to write the Woodwork table.
• format (str) – Format to use for writing Woodwork data. Defaults to csv.
the underlying serialization method.
Returns Information on storage location and format of data.
Return type dict
woodwork.serialize.write_typing_info
woodwork.serialize.write_typing_info(typing_info, path)
Writes Woodwork typing information to the specified path at woodwork_typing_info.json
Parameters typing_info (dict) – Dictionary containing Woodwork typing information.
woodwork.serialize.write_woodwork_table
woodwork.serialize.write_woodwork_table(dataframe, path, profile_name=None, **kwargs)

Serialize Woodwork table and write to disk or S3 path.
Parameters
• dataframe (pd.DataFrame, dd.DataFrame, ks.DataFrame) – DataFrame
with Woodwork typing information initialized.
• path (str) – Location on disk to write the Woodwork table.
• profile_name (str, bool) – The AWS profile specified to write to S3. Will default
to None and search for AWS credentials. Set to False to use an anonymous profile.
the underlying serialization method or to specify AWS profile.
structures.
Deserialization
read_table_typing_information(path) Read Woodwork typing information from disk, S3 path,

or URL.
read_woodwork_table(path[, profile_name, . . . ]) Read Woodwork table from disk, S3 path, or URL.
woodwork.deserialize.read_table_typing_information
woodwork.deserialize.read_table_typing_information(path)
Read Woodwork typing information from disk, S3 path, or URL.
Parameters path (str) – Location on disk, S3 path, or URL to read woodwork_typing_info.json.
Returns Woodwork typing information dictionary
Return type dict
woodwork.deserialize.read_woodwork_table
woodwork.deserialize.read_woodwork_table(path, profile_name=None, validate=False,

**kwargs)
Read Woodwork table from disk, S3 path, or URL.
Parameters
• path (str) – Directory on disk, S3 path, or URL to read woodwork_typing_info.json.
• profile_name (str, bool) – The AWS profile specified to write to S3. Will default
to None and search for AWS credentials. Set to False to use an anonymous profile.
• validate (bool, optional) – Whether parameter and data validation should occur
when initializing Woodwork dataframe during deserialization. Defaults to False. Note: If
serialized data was modified outside of Woodwork and you are unsure of the validity of the
data or typing information, validate should be set to True.
• kwargs (keywords) – Additional keyword arguments to pass as keyword arguments to
the underlying deserialization method.
Returns DataFrame with Woodwork typing information initialized.
Logical Types
Address() Represents Logical Types that contain address values.

Age() Represents Logical Types that contain non-negative
numbers indicating a person’s age.
AgeNullable() Represents Logical Types that contain non-negative
numbers indicating a person’s age.
Boolean() Represents Logical Types that contain binary values in-
dicating true/false.
Categorical([encoding]) Represents Logical Types that contain unordered dis-
crete values that fall into one of a set of possible values.
1.1. Quick Start 75


CountryCode() Represents Logical Types that contain categorical infor-
mation specifically used to represent countries.
Datetime([datetime_format]) Represents Logical Types that contain date and time in-
formation.
Double() Represents Logical Types that contain positive and neg-
ative numbers, some of which include a fractional com-
ponent.
EmailAddress() Represents Logical Types that contain email address
values.
Filepath() Represents Logical Types that specify locations of di-
rectories and files in a file system.
Integer() Represents Logical Types that contain positive and neg-
ative numbers without a fractional component, includ-
ing zero (0).
IPAddress() Represents Logical Types that contain IP addresses, in-
cluding both IPv4 and IPv6 addresses.
LatLong() Represents Logical Types that contain latitude and lon-
gitude values in decimal degrees.
NaturalLanguage() Represents Logical Types that contain text or characters
representing natural human language
Ordinal(order) Represents Logical Types that contain ordered discrete
values.
PersonFullName() Represents Logical Types that may contain first, middle
and last names, including honorifics and suffixes.
PhoneNumber() Represents Logical Types that contain numeric digits
and characters representing a phone number
PostalCode() Represents Logical Types that contain a series of postal
codes for representing a group of addresses.
SubRegionCode() Represents Logical Types that contain codes represent-
ing a portion of a larger geographic region.
Timedelta() Represents Logical Types that contain values specifying
a duration of time
URL() Represents Logical Types that contain URLs, which
may include protocol, hostname and file name
woodwork.logical_types.Address
class woodwork.logical_types.Address
Represents Logical Types that contain address values.
Examples
['1 Miller Drive, New York, NY 12345', '1 Berkeley Street, Boston, MA 67891']
['26387 Russell Hill, Dallas, TX 34521', '54305 Oxford Street, Seattle, WA 95132']
__init__()
structures.
Methods
__init__() Initialize self.
Attributes
backup_dtype
primary_dtype
standard_tags
type_string
woodwork.logical_types.Age
class woodwork.logical_types.Age
Represents Logical Types that contain non-negative numbers indicating a person’s age. Has ‘numeric’ as a
standard tag.
Examples
[15, 22, 45]

[30, 62, 87]
__init__()
Methods
Attributes
backup_dtype
primary_dtype
standard_tags
type_string
1.1. Quick Start 77

woodwork.logical_types.AgeNullable
class woodwork.logical_types.AgeNullable
Represents Logical Types that contain non-negative numbers indicating a person’s age. Has ‘numeric’ as a
standard tag. May also contain null values.
Examples
[np.nan, 22, 45]

[30, 62, np.nan]
__init__()
Methods
Attributes
backup_dtype
primary_dtype
standard_tags
type_string
woodwork.logical_types.Boolean
class woodwork.logical_types.Boolean
Represents Logical Types that contain binary values indicating true/false.
Examples
[True, False, True]

[0, 1, 1]
__init__()
structures.
Methods
Attributes
backup_dtype
primary_dtype
standard_tags
type_string
woodwork.logical_types.Categorical
class woodwork.logical_types.Categorical(encoding=None)
Represents Logical Types that contain unordered discrete values that fall into one of a set of possible values.
Has ‘category’ as a standard tag.
Examples
["red", "green", "blue"]

["produce", "dairy", "bakery"]
[3, 1, 2]
__init__(encoding=None)
Methods
__init__([encoding]) Initialize self.
Attributes
backup_dtype
primary_dtype
standard_tags
type_string
1.1. Quick Start 79

woodwork.logical_types.CountryCode
class woodwork.logical_types.CountryCode
Represents Logical Types that contain categorical information specifically used to represent countries. Has
‘category’ as a standard tag.
Examples
["AUS", "USA", "UKR"]

["GB", "NZ", "DE"]
__init__()
Methods
Attributes
backup_dtype
primary_dtype
standard_tags
type_string
woodwork.logical_types.Datetime
class woodwork.logical_types.Datetime(datetime_format=None)
Represents Logical Types that contain date and time information.
Parameters datetime_format (str) – Desired datetime format for data
Examples
["2020-09-10",
"2020-01-10 00:00:00",
"01/01/2000 08:30"]
__init__(datetime_format=None)
structures.
Methods
__init__([datetime_format]) Initialize self.
Attributes
backup_dtype
datetime_format
primary_dtype
standard_tags
type_string
woodwork.logical_types.Double
class woodwork.logical_types.Double
Represents Logical Types that contain positive and negative numbers, some of which include a fractional com-
ponent. Includes zero (0). Has ‘numeric’ as a standard tag.
Examples
[1.2, 100.4, 3.5]

[-15.34, 100, 58.3]
__init__()
Methods
Attributes
backup_dtype
primary_dtype
standard_tags
type_string
1.1. Quick Start 81

woodwork.logical_types.EmailAddress
class woodwork.logical_types.EmailAddress
Represents Logical Types that contain email address values.
Examples
["[email protected]",
"[email protected]",
"[email protected]"]
__init__()
Methods
Attributes
backup_dtype
primary_dtype
standard_tags
type_string
woodwork.logical_types.Filepath
class woodwork.logical_types.Filepath
Represents Logical Types that specify locations of directories and files in a file system.
Examples
["/usr/local/bin",
"/Users/john.smith/dev/index.html",
"/tmp"]
__init__()
structures.
Methods
Attributes
backup_dtype
primary_dtype
standard_tags
type_string
woodwork.logical_types.Integer
class woodwork.logical_types.Integer
Represents Logical Types that contain positive and negative numbers without a fractional component, including
zero (0). Has ‘numeric’ as a standard tag.
Examples
[100, 35, 0]
[-54, 73, 11]
__init__()
Methods
Attributes
backup_dtype
primary_dtype
standard_tags
type_string
1.1. Quick Start 83

woodwork.logical_types.IPAddress
class woodwork.logical_types.IPAddress
Represents Logical Types that contain IP addresses, including both IPv4 and IPv6 addresses.
Examples
["172.16.254.1",
"192.0.0.0",
"2001:0db8:0000:0000:0000:ff00:0042:8329"]
__init__()
Methods
Attributes
backup_dtype
primary_dtype
standard_tags
type_string
woodwork.logical_types.LatLong
class woodwork.logical_types.LatLong
Represents Logical Types that contain latitude and longitude values in decimal degrees.
Note: LatLong values will be stored with the object dtype as a tuple of floats (or a list of floats for Koalas
DataFrames) and must contain only two values.
Null latitude or longitude values will be stored as np.nan, and a fully null LatLong (np.nan, np.nan) will be
stored as just a single nan.
Examples
[(33.670914, -117.841501),
(40.423599, -86.921162),
(-45.031705, nan)]
__init__()
structures.
Methods
Attributes
backup_dtype
primary_dtype
standard_tags
type_string
woodwork.logical_types.NaturalLanguage
class woodwork.logical_types.NaturalLanguage
Represents Logical Types that contain text or characters representing natural human language
Examples
["This is a short sentence.",

"I like to eat pizza!",
"When will humans go to mars?"]
__init__()
Methods
Attributes
backup_dtype
primary_dtype
standard_tags
type_string
1.1. Quick Start 85

woodwork.logical_types.Ordinal
class woodwork.logical_types.Ordinal(order)
Represents Logical Types that contain ordered discrete values. Has ‘category’ as a standard tag.
Parameters order (list or tuple) – An list or tuple specifying the order of the ordinal val-
ues from low to high. The underlying series cannot contain values that are not present in the
order values.
Examples
["first", "second", "third"]

["bronze", "silver", "gold"]
__init__(order)
Methods
__init__(order) Initialize self.
Attributes
backup_dtype
primary_dtype
standard_tags
type_string
woodwork.logical_types.PersonFullName
class woodwork.logical_types.PersonFullName
Represents Logical Types that may contain first, middle and last names, including honorifics and suffixes.
Examples
["Mr. John Doe, Jr.",

"Doe, Mrs. Jane",
"James Brown"]
__init__()
structures.
Methods
Attributes
backup_dtype
primary_dtype
standard_tags
type_string
woodwork.logical_types.PhoneNumber
class woodwork.logical_types.PhoneNumber
Represents Logical Types that contain numeric digits and characters representing a phone number
Examples
["1-(555)-123-5495",
"+1-555-123-5495",
"5551235495"]
__init__()
Methods
Attributes
backup_dtype
primary_dtype
standard_tags
type_string
1.1. Quick Start 87

woodwork.logical_types.PostalCode
class woodwork.logical_types.PostalCode
Represents Logical Types that contain a series of postal codes for representing a group of addresses. Has
‘category’ as a standard tag.
Examples
["90210"
"60018-0123",
"SW1A"]
__init__()
Methods
Attributes
backup_dtype
primary_dtype
standard_tags
type_string
woodwork.logical_types.SubRegionCode
class woodwork.logical_types.SubRegionCode
Represents Logical Types that contain codes representing a portion of a larger geographic region. Has ‘category’
as a standard tag.
Examples
["US-CO", "US-MA", "US-CA"]

["AU-NSW", "AU-TAS", "AU-QLD"]
__init__()
structures.
Methods
Attributes
backup_dtype
primary_dtype
standard_tags
type_string
woodwork.logical_types.Timedelta
class woodwork.logical_types.Timedelta
Represents Logical Types that contain values specifying a duration of time
Examples
[pd.Timedelta('1 days 00:00:00'),

pd.Timedelta('-1 days +23:40:00'),
pd.Timedelta('4 days 12:00:00')]
__init__()
Methods
Attributes
backup_dtype
primary_dtype
standard_tags
type_string
1.1. Quick Start 89

woodwork.logical_types.URL
class woodwork.logical_types.URL
Represents Logical Types that contain URLs, which may include protocol, hostname and file name
Examples
["http://google.com",
"https://example.com/index.html",
"example.com"]
__init__()
Methods
Attributes
backup_dtype
primary_dtype
standard_tags
type_string
TypeSystem
TypeSystem([inference_functions, . . . ])
TypeSystem.add_type(logical_type[, . . . ]) Add a new LogicalType to the TypeSystem, optionally
specifying the corresponding inference function and a
parent type.
TypeSystem.infer_logical_type(series) Infer the logical type for the given series
TypeSystem.remove_type(logical_type) Remove a logical type from the TypeSystem.
TypeSystem.reset_defaults() Reset type system to the default settings that were spec-
ified at initialization.
TypeSystem.update_inference_function(. . . )Update the inference function for the specified Logical-
Type.
TypeSystem.update_relationship(logical_type,Add or update a relationship.
...)
structures.
woodwork.type_sys.type_system.TypeSystem
class woodwork.type_sys.type_system.TypeSystem(inference_functions=None,
relationships=None, de-
fault_type=NaturalLanguage)
__init__(inference_functions=None, relationships=None, default_type=NaturalLanguage)

Create a new TypeSystem object. LogicalTypes that are present in the keys of the inference_functions
dictionary will be considered registered LogicalTypes.
Parameters
• inference_functions (dict[LogicalType->func], optional) – Dictio-
nary mapping LogicalTypes to their corresponding type inference functions. If None, only
the default LogicalType will be registered without an inference function.
• relationships (list, optional) – List of tuples, each with two elements, spec-
ifying parent-child relationships between logical types. The first element should be the
parent LogicalType. The second element should be the child LogicalType. If not speci-
fied, will default to an empty list indicating all types should be considered root types with
no children.
• default_type (LogicalType, optional) – The default LogicalType to use if no
inference matches are found. If not specified, will default to the built-in NaturalLanguage
LogicalType.
Methods
__init__([inference_functions, . . . ]) Create a new TypeSystem object.

add_type(logical_type[, inference_function, . . . ]) Add a new LogicalType to the TypeSystem, option-
ally specifying the corresponding inference function
and a parent type.
infer_logical_type(series) Infer the logical type for the given series
remove_type(logical_type) Remove a logical type from the TypeSystem.
reset_defaults() Reset type system to the default settings that were
specified at initialization.
str_to_logical_type(logical_str[, params, Helper function for converting a string value to the
. . . ]) corresponding logical type object.
update_inference_function(logical_type, Update the inference function for the specified Log-
...) icalType.
update_relationship(logical_type, parent) Add or update a relationship.
Attributes
registered_types Returns a list of all registered types

root_types Returns a list of all registered types that do not have
a parent type
1.1. Quick Start 91

woodwork.type_sys.type_system.TypeSystem.add_type
TypeSystem.add_type(logical_type, inference_function=None, parent=None)

Add a new LogicalType to the TypeSystem, optionally specifying the corresponding inference function and a
parent type.
Parameters
• logical_type (LogicalType) – The new LogicalType to add.
• inference_function (func, optional) – The inference function to use for infer-
ring the given LogicalType. Defaults to None. If not specified, this LogicalType will never
be inferred.
• parent (LogicalType, optional) – The parent LogicalType, if applicable. De-
faults to None. If not specified, this type will be considered a root type with no parent.
woodwork.type_sys.type_system.TypeSystem.infer_logical_type
TypeSystem.infer_logical_type(series)
Infer the logical type for the given series
Parameters series (pandas.Series) – The series for which to infer the LogicalType.
woodwork.type_sys.type_system.TypeSystem.remove_type
TypeSystem.remove_type(logical_type)
Remove a logical type from the TypeSystem. Any children of the remove type will have their parent set to the
parent of the removed type.
Parameters logical_type (LogicalType) – The LogicalType to remove.
woodwork.type_sys.type_system.TypeSystem.reset_defaults
TypeSystem.reset_defaults()
Reset type system to the default settings that were specified at initialization.
Parameters None –
woodwork.type_sys.type_system.TypeSystem.update_inference_function
TypeSystem.update_inference_function(logical_type, inference_function)
Update the inference function for the specified LogicalType.
Parameters
• logical_type (LogicalType) – The LogicalType for which to update the inference
function.
• inference_function (func) – The new inference function to use. Can be set to None
to skip type inference for the specified LogicalType.
structures.
woodwork.type_sys.type_system.TypeSystem.update_relationship
TypeSystem.update_relationship(logical_type, parent)
Add or update a relationship. If the specified LogicalType exists in the relationship graph, its parent will be
updated. If the specified LogicalType does not exist in relationships, the relationship will be added.
Parameters
• logical_type (LogicalType) – The LogicalType for which to update the parent
value.
• parent (LogicalType) – The new parent to set for the specified LogicalType.
Utils
Type Utils
list_logical_types Returns a dataframe describing all of the available Log-

ical Types.
list_semantic_tags Returns a dataframe describing all of the common se-
mantic tags.
woodwork.type_sys.utils.list_logical_types
woodwork.type_sys.utils.list_logical_types()
Returns a dataframe describing all of the available Logical Types.
Parameters None –
Returns A dataframe containing details on each LogicalType, including the corresponding physical
type and any standard semantic tags.
woodwork.type_sys.utils.list_semantic_tags
woodwork.type_sys.utils.list_semantic_tags()
Returns a dataframe describing all of the common semantic tags.
Parameters None –
Returns A dataframe containing details on each Semantic Tag, including the corresponding logical
type(s).
1.1. Quick Start 93

General Utils
get_valid_mi_types Generate a list of LogicalTypes that are valid for calcu-

lating mutual information.
read_file Read data from the specified file and return a DataFrame
with initialized Woodwork typing information.
woodwork.utils.get_valid_mi_types
woodwork.utils.get_valid_mi_types()
Generate a list of LogicalTypes that are valid for calculating mutual information. Note that index columns are
not valid for calculating mutual information, but their types may be returned by this function.
Parameters None –
Returns A list of the LogicalTypes that can be use to calculate mutual information
Return type list(LogicalType)
woodwork.utils.read_file
woodwork.utils.read_file(filepath=None, content_type=None, name=None, index=None,

time_index=None, semantic_tags=None, logical_types=None,
use_standard_tags=True, validate=True, **kwargs)
Read data from the specified file and return a DataFrame with initialized Woodwork typing information.
Note: As the engine fastparquet cannot handle nullable pandas dtypes, pyarrow will be used for
reading from parquet.
Parameters
• filepath (str) – A valid string path to the file to read
• content_type (str) – Content type of file to read
• name (str, optional) – Name used to identify the DataFrame.
• semantic_tags (dict, optional) – Dictionary mapping column names in the
dataframe to the semantic tags for the column. The keys in the dictionary should be strings
that correspond to columns in the underlying dataframe. There are two options for specify-
ing the dictionary values: (str): If only one semantic tag is being set, a single string can be
used as a value. (list[str] or set[str]): If multiple tags are being set, a list or set of strings can
be used as the value. Semantic tags will be set to an empty set for any column not included
in the dictionary.
• logical_types (dict[str -> LogicalType], optional) – Dictionary map-
ping column names in the dataframe to the LogicalType for the column. LogicalTypes will
be inferred for any columns not present in the dictionary.
to columns based on the inferred or specified logical type for the column. Defaults to True.
structures.
• **kwargs – Additional keyword arguments to pass to the underlying pandas read file
function. For more information on available keywords refer to the pandas documentation.
Returns DataFrame created from the specified file with Woodwork typing information initialized.
get_invalid_schema_message Return a message indicating the reason that the provided

schema cannot be used to initialize Woodwork on the
dataframe.
init_series Initializes Woodwork typing information for a Series,
returning a new Series.
is_schema_valid Check if a schema is valid for initializing Woodwork on
a dataframe
woodwork.accessor_utils.get_invalid_schema_message
woodwork.accessor_utils.get_invalid_schema_message(dataframe, schema)
Return a message indicating the reason that the provided schema cannot be used to initialize Woodwork on the
dataframe. If the schema is valid for the dataframe, None will be returned.
Parameters
• dataframe (DataFrame) – The dataframe against which to check the schema.
• schema (ww.TableSchema) – The schema to use in the validity check.
Returns The reason that the schema is invalid for the dataframe
Return type str or None
woodwork.accessor_utils.init_series
woodwork.accessor_utils.init_series(series, logical_type=None, semantic_tags=None,

use_standard_tags=True, description=None, meta-
data=None)
Initializes Woodwork typing information for a Series, returning a new Series. The dtype of the returned series
will be converted to match the dtype associated with the LogicalType.
Parameters
• series (pd.Series, dd.Series, or ks.Series) – The original series from
which to create the Woodwork initialized series.
• logical_type (LogicalType or str, optional) – The logical type that
should be assigned to the series. If no value is provided, the LogicalType for the series
will be inferred.
• semantic_tags (str or list or set, optional) – Semantic tags to assign
to the series. Defaults to an empty set if not specified. There are two options for specifying
the semantic tags: (str) If only one semantic tag is being set, a single string can be passed.
(list or set) If multiple tags are being set, a list or set of strings can be passed.
1.1. Quick Start 95


to the series based on the inferred or specified logical type of the series. Defaults to True.
• description (str, optional) – Optional text describing the contents of the series.
• metadata (dict[str -> json serializable], optional) – Metadata asso-
ciated with the series.
Returns A series with Woodwork typing information initialized
Return type Series
woodwork.accessor_utils.is_schema_valid
woodwork.accessor_utils.is_schema_valid(dataframe, schema)
Check if a schema is valid for initializing Woodwork on a dataframe
Parameters
• dataframe (DataFrame) – The dataframe against which to check the schema.
• schema (ww.TableSchema) – The schema to use in the validity check.
Returns Boolean indicating whether the schema is valid for the dataframe
Return type boolean
Demo Data
load_retail([id, nrows, init_woodwork]) Load a demo retail dataset into a DataFrame, optionally
initializing Woodwork’s typing information.
woodwork.demo.load_retail
woodwork.demo.load_retail(id='demo_retail_data', nrows=None, init_woodwork=True)

Load a demo retail dataset into a DataFrame, optionally initializing Woodwork’s typing information.
Parameters
• id (str, optional) – The name to assign to the DataFrame, if returning a DataFrame
with Woodwork typing information initialized. If not returning a DataFrame with Wood-
work initialized, this will be ignored. Defaults to demo_retail_data.
• nrows (int, optional) – The number of rows to return in the dataset. If None, will
return all possible rows. Defaults to None.
• init_woodwork (bool) – If True, will return a pandas DataFrame with Woodwork typ-
ing information initialized. If False, will return a DataFrame without Woodwork initialized.
Defaults to False.
Returns A DataFrame containing the demo data with Woodwork typing initialized. If
init_woodwork is False, will return an uninitialized DataFrame.
structures.
Release Notes
v0.3.1 May 12, 2021
Warning: This Woodwork release uses a weak reference for maintaining a reference from the acces-
sor to the DataFrame. Because of this, chaining a Woodwork call onto another call that creates a new
DataFrame or Series object can be problematic.
Instead of calling pd.DataFrame({'id':[1, 2, 3]}).ww.init(), first store the
DataFrame in a new variable and then initialize Woodwork:
df = pd.DataFrame({'id':[1, 2, 3]})
df.ww.init()
• Enhancements
– Add deep parameter to Woodwork Accessor and Schema equality checks (#889)
– Add support for reading from parquet files to woodwork.read_file (#909)
• Changes
– Remove command line functions for list logical and semantic tags (#891)
– Keep index and time index tags for single column when selecting from a table (#888)
– Update accessors to store weak reference to data (#894)
• Documentation Changes
– Update nbsphinx version to fix docs build issue (#911, #913)
• Testing Changes
– Use Minimum Dependency Generator GitHub Action and remove tools folder (#897)
– Move all latest and minimum dependencies into 1 folder (#912)
Thanks to the following people for contributing to this release: @gsheni, @jeff-hernandez, @tamargrey,
@thehomebrewnerd
Breaking Changes
• The command line functions python -m woodwork list-logical-types and python -m

woodwork list-semantic-tags no longer exist. Please call the underlying Python functions ww.
list_logical_types() and ww.list_semantic_tags().
v0.3.0 May 3, 2021
• Enhancements
– Add is_schema_valid and get_invalid_schema_message functions for
checking schema validity (#834)
– Add logical type for Age and AgeNullable (#849)
– Add logical type for Address (#858)
– Add generic to_disk function to save Woodwork schema and data (#872)
1.1. Quick Start 97

– Add generic read_file function to read file as Woodwork DataFrame (#878)

• Fixes
– Raise error when a column is set as the index and time index (#859)
– Allow NaNs in index for schema validation check (#862)
– Fix bug where invalid casting to Boolean would not raise error (#863)
• Changes
– Consistently use ColumnNotPresentError for mismatches between user input and
dataframe/schema columns (#837)
– Raise custom WoodworkNotInitError when accessing Woodwork attributes before
initialization (#838)
– Remove check requiring Ordinal instance for initializing a ColumnSchema object
(#870)
– Increase koalas min version to 1.8.0 (#885)
– Improve formatting of release notes (#874)
• Testing Changes
– Remove unnecessary argument in codecov upload job (#853)
– Change from GitHub Token to regenerated GitHub PAT dependency checkers (#855)
– Update README.md with non-nullable dtypes in code example (#856)
Thanks to the following people for contributing to this release: @frances-h, @gsheni, @jeff-hernandez,
@rwedge, @tamargrey, @thehomebrewnerd
Breaking Changes
• Woodwork tables can no longer be saved using to disk df.ww.to_csv, df.ww.to_pickle, or df.ww.
to_parquet. Use df.ww.to_disk instead.
• The read_csv function has been replaced by read_file.
v0.2.0 Apr 20, 2021
Warning: This Woodwork release does not support Python 3.6
• Enhancements
– Add validation control to WoodworkTableAccessor (#736)
– Store make_index value on WoodworkTableAccessor (#780)
– Add optional exclude parameter to WoodworkTableAccessor select method (#783)
– Add validation control to deserialize.read_woodwork_table and ww.
read_csv (#788)
structures.
– Add WoodworkColumnAccessor.schema and handle copying column schema

(#799)
– Allow initializing a WoodworkColumnAccessor with a ColumnSchema (#814)
– Add __repr__ to ColumnSchema (#817)
– Add BooleanNullable and IntegerNullable logical types (#830)
– Add validation control to WoodworkColumnAccessor (#833)
• Changes
– Rename FullName logical type to PersonFullName (#740)
– Rename ZIPCode logical type to PostalCode (#741)
– Fix issue with smart-open version 5.0.0 (#750, #758)
– Update minimum scikit-learn version to 0.22 (#763)
– Drop support for Python version 3.6 (#768)
– Remove ColumnNameMismatchWarning (#777)
– get_column_dict does not use standard tags by default (#782)
– Make logical_type and name params to _get_column_dict optional (#786)
– Rename Schema object and files to match new table-column schema structure (#789)
– Store column typing information in a ColumnSchema object instead of a dictionary
(#791)
– TableSchema does not use standard tags by default (#806)
– Store use_standard_tags on the ColumnSchema instead of the TableSchema
(#809)
– Move functions in column_schema.py to be methods on ColumnSchema (#829)
– Update Pygments version requirement (#751)
– Update spark config for docs build (#787, #801, #810)
• Testing Changes
– Add unit tests against minimum dependencies for python 3.6 on PRs and main (#743, #753,
#763)
– Update spark config for test fixtures (#787)
– Separate latest unit tests into pandas, dask, koalas (#813)
– Update latest dependency checker to generate separate core, koalas, and dask dependencies
(#815, #825)
– Ignore latest dependency branch when checking for updates to the release notes (#827)
– Change from GitHub PAT to auto generated GitHub Token for dependency checker (#831)
– Expand ColumnSchema semantic tag testing coverage and null logical_type testing
coverage (#832)
Thanks to the following people for contributing to this release: @gsheni, @jeff-hernandez, @rwedge,
@tamargrey, @thehomebrewnerd
1.1. Quick Start 99

Breaking Changes
• The ZIPCode logical type has been renamed to PostalCode

• The FullName logical type has been renamed to PersonFullName
• The Schema object has been renamed to TableSchema
• With the ColumnSchema object, typing information for a column can no longer be accessed with
df.ww.columns[col_name]['logical_type']. Instead use df.ww.columns[col_name].
logical_type.
• The Boolean and Integer logical types will no longer work with data that contains null values. The new
BooleanNullable and IntegerNullable logical types should be used if null values are present.
v0.1.0 Mar 22, 2021
• Enhancements
– Implement Schema and Accessor API (#497)
– Add Schema class that holds typing info (#499)
– Add WoodworkTableAccessor class that performs type inference and stores Schema (#514)
– Allow initializing Accessor schema with a valid Schema object (#522)
– Add ability to read in a csv and create a DataFrame with an initialized Woodwork Schema
(#534)
– Add ability to call pandas methods from Accessor (#538, #589)
– Add helpers for checking if a column is one of Boolean, Datetime, numeric, or categorical
(#553)
– Add ability to load demo retail dataset with a Woodwork Accessor (#556)
– Add select to WoodworkTableAccessor (#548)
– Add mutual_information to WoodworkTableAccessor (#571)
– Add WoodworkColumnAccessor class (#562)
– Add semantic tag update methods to column accessor (#573)
– Add describe and describe_dict to WoodworkTableAccessor (#579)
– Add init_series util function for initializing a series with dtype change (#581)
– Add set_logical_type method to WoodworkColumnAccessor (#590)
– Add semantic tag update methods to table schema (#591)
– Add warning if additional parameters are passed along with schema (#593)
– Better warning when accessing column properties before init (#596)
– Update column accessor to work with LatLong columns (#598)
– Add set_index to WoodworkTableAccessor (#603)
– Implement loc and iloc for WoodworkColumnAccessor (#613)
– Add set_time_index to WoodworkTableAccessor (#612)
– Implement loc and iloc for WoodworkTableAccessor (#618)
structures.
– Allow updating logical types with set_types and make relevant DataFrame changes
(#619)
– Allow serialization of WoodworkColumnAccessor to csv, pickle, and parquet (#624)
– Add DaskColumnAccessor (#625)
– Allow deserialization from csv, pickle, and parquet to Woodwork table (#626)
– Add value_counts to WoodworkTableAccessor (#632)
– Add KoalasColumnAccessor (#634)
– Add pop to WoodworkTableAccessor (#636)
– Add drop to WoodworkTableAccessor (#640)
– Add rename to WoodworkTableAccessor (#646)
– Add DaskTableAccessor (#648)
– Add Schema properties to WoodworkTableAccessor (#651)
– Add KoalasTableAccessor (#652)
– Adds __getitem__ to WoodworkTableAccessor (#633)
– Update Koalas min version and add support for more new pandas dtypes with Koalas (#678)
– Adds __setitem__ to WoodworkTableAccessor (#669)
• Fixes
– Create new Schema object when performing pandas operation on Accessors (#595)
– Fix bug in _reset_semantic_tags causing columns to share same semantic tags set
(#666)
– Maintain column order in DataFrame and Woodwork repr (#677)
• Changes
– Move mutual information logic to statistics utils file (#584)
– Bump min Koalas version to 1.4.0 (#638)
– Preserve pandas underlying index when not creating a Woodwork index (#664)
– Restrict Koalas version to <1.7.0 due to breaking changes (#674)
– Clean up dtype usage across Woodwork (#682)
– Improve error when calling accessor properties or methods before init (#683)
– Remove dtype from Schema dictionary (#685)
– Add include_index param and allow unique columns in Accessor mutual information
(#699)
– Include DataFrame equality and use_standard_tags in WoodworkTableAccessor
equality check (#700)
– Remove DataTable and DataColumn classes to migrate towards the accessor approach
(#713)
– Change sample_series dtype to not need conversion and remove convert_series
util (#720)
– Rename Accessor methods since DataTable has been removed (#723)
1.1. Quick Start 101

– Update README.md and Get Started guide to use accessor (#655, #717)
– Update Understanding Types and Tags guide to use accessor (#657)
– Update docstrings and API Reference page (#660)
– Update statistical insights guide to use accessor (#693)
– Update Customizing Type Inference guide to use accessor (#696)
– Update Dask and Koalas guide to use accessor (#701)
– Update index notebook and install guide to use accessor (#715)
– Add section to documentation about schema validity (#729)
– Update README.md and Get Started guide to use pd.read_csv (#730)
– Make small fixes to documentation formatting (#731)
• Testing Changes
– Add tests to Accessor/Schema that weren’t previously covered (#712, #716)
– Update release branch name in notes update check (#719)
Thanks to the following people for contributing to this release: @gsheni, @jeff-hernandez, @johnbrid-
strup, @tamargrey, @thehomebrewnerd
Breaking Changes
• The DataTable and DataColumn classes have been removed and replaced by new
WoodworkTableAccessor and WoodworkColumnAccessor classes which are used through the
ww namespace available on DataFrames after importing Woodwork.
v0.0.11 Mar 15, 2021
• Changes
– Restrict Koalas version to <1.7.0 due to breaking changes (#674)
– Include unique columns in mutual information calculations (#687)
– Add parameter to include index column in mutual information calculations (#692)
– Update to remove warning message from statistical insights guide (#690)
• Testing Changes
– Update branch reference in tests to run on main (#641)
– Make release notes updated check separate from unit tests (#642)
– Update release branch naming instructions (#644)
Thanks to the following people for contributing to this release: @gsheni, @tamargrey, @thehomebrewn-
erd
structures.
v0.0.10 Feb 25, 2021
• Changes
– Avoid calculating mutualinfo for non-unique columns (#563)
– Preserve underlying DataFrame index if index column is not specified (#588)
– Add blank issue template for creating issues (#630)
• Testing Changes
– Update branch reference in tests workflow (#552, #601)
– Fixed text on back arrow on install page (#564)
– Refactor test_datatable.py (#574)
Thanks to the following people for contributing to this release: @gsheni, @jeff-hernandez, @johnbrid-
strup, @tamargrey
v0.0.9 Feb 5, 2021
• Enhancements
– Add Python 3.9 support without Koalas testing (#511)
– Add get_valid_mi_types function to list LogicalTypes valid for mutual information
calculation (#517)
• Fixes
– Handle missing values in Datetime columns when calculating mutual information (#516)
– Support numpy 1.20.0 by restricting version for koalas and changing serialization error
message (#532)
– Move Koalas option setting to DataTable init instead of import (#543)
– Add Alteryx OSS Twitter link (#519)
– Update logo and add new favicon (#521)
– Multiple improvements to Getting Started page and guides (#527)
– Clean up API Reference and docstrings (#536)
– Added Open Graph for Twitter and Facebook (#544)
erd

v0.0.8 Jan 25, 2021
• Enhancements
– Add DataTable.df property for accessing the underling DataFrame (#470)
– Set index of underlying DataFrame to match DataTable index (#464)
• Fixes
– Sort underlying series when sorting dataframe (#468)
– Allow setting indices to current index without side effects (#474)
• Changes
– Fix release document with Github Actions link for CI (#462)
– Don’t allow registered LogicalTypes with the same name (#477)
– Move str_to_logical_type to TypeSystem class (#482)
– Remove pyarrow from core dependencies (#508)
erd
v0.0.7 Dec 14, 2020
• Enhancements
– Allow for user-defined logical types and inference functions in TypeSystem object (#424)
– Add __repr__ to DataTable (#425)
– Allow initializing DataColumn with numpy array (#430)
– Add drop to DataTable (#434)
– Migrate CI tests to Github Actions (#417, #441, #451)
– Add metadata to DataColumn for user-defined metadata (#447)
• Fixes
– Update DataColumn name when using setitem on column with no name (#426)
– Don’t allow pickle serialization for Koalas DataFrames (#432)
– Check DataTable metadata in equality check (#449)
– Propagate all attributes of DataTable in _new_dt_including (#454)
• Changes
– Update links to use alteryx org Github URL (#423)
– Support column names of any type allowed by the underlying DataFrame (#442)
– Use object dtype for LatLong columns for easy access to latitude and longitude values
(#414)
– Restrict dask version to prevent 2020.12.0 release from being installed (#453)
– Lower minimum requirement for numpy to 1.15.4, and set pandas minimum requirement
1.1.1 (#459)
structures.
• Testing Changes
– Fix missing test coverage (#436)
Thanks to the following people for contributing to this release: @gsheni, @jeff-hernandez, @tamargrey,
@thehomebrewnerd
v0.0.6 Nov 30, 2020
• Enhancements
– Add support for creating DataTable from Koalas DataFrame (#327)
– Add ability to initialize DataTable with numpy array (#367)
– Add describe_dict method to DataTable (#405)
– Add mutual_information_dict method to DataTable (#404)
– Add metadata to DataTable for user-defined metadata (#392)
– Add update_dataframe method to DataTable to update underlying DataFrame (#407)
– Sort dataframe if time_index is specified, bypass sorting with already_sorted pa-
rameter. (#410)
– Add description attribute to DataColumn (#416)
– Implement DataColumn.__len__ and DataTable.__len__ (#415)
• Fixes
– Rename data_column.py datacolumn.py (#386)
– Rename data_table.py datatable.py (#387)
– Rename get_mutual_information mutual_information (#390)
• Changes
– Lower moto test requirement for serialization/deserialization (#376)
– Make Koalas an optional dependency installable with woodwork[koalas] (#378)
– Remove WholeNumber LogicalType from Woodwork (#380)
– Updates to LogicalTypes to support Koalas 1.4.0 (#393)
– Replace set_logical_types and set_semantic_tags with just set_types
(#379)
– Remove copy_dataframe parameter from DataTable initialization (#398)
– Implement DataTable.__sizeof__ to return size of the underlying dataframe (#401)
– Include Datetime columns in mutual info calculation (#399)
– Maintain column order on DataTable operations (#406)
• Testing Changes
– Add pyarrow, dask, and koalas to automated dependency checks (#388)
– Use new version of pull request Github Action (#394)
– Improve parameterization for test_datatable_equality (#409)

Thanks to the following people for contributing to this release: @ctduffy, @gsheni, @tamargrey, @the-
homebrewnerd
Breaking Changes
• The DataTable.set_semantic_tags method was removed. DataTable.set_types can be used

instead.
• The DataTable.set_logical_types method was removed. DataTable.set_types can be used
instead.
• WholeNumber was removed from LogicalTypes. Columns that were previously inferred as WholeNumber
will now be inferred as Integer.
• The DataTable.get_mutual_information was renamed to DataTable.
mutual_information.
• The copy_dataframe parameter was removed from DataTable initialization.
v0.0.5 Nov 11, 2020
• Enhancements
– Add __eq__ to DataTable and DataColumn and update LogicalType equality (#318)
– Add value_counts() method to DataTable (#342)
– Support serialization and deserialization of DataTables via csv, pickle, or parquet (#293)
– Add shape property to DataTable and DataColumn (#358)
– Add iloc method to DataTable and DataColumn (#365)
– Add numeric_categorical_threshold config value to allow inferring numeric
columns as Categorical (#363)
– Add rename method to DataTable (#367)
• Fixes
– Catch non numeric time index at validation (#332)
• Changes
– Support logical type inference from a Dask DataFrame (#248)
– Fix validation checks and make_index to work with Dask DataFrames (#260)
– Skip validation of Ordinal order values for Dask DataFrames (#270)
– Improve support for datetimes with Dask input (#286)
– Update DataTable.describe to work with Dask input (#296)
– Update DataTable.get_mutual_information to work with Dask input (#300)
– Modify to_pandas function to return DataFrame with correct index (#281)
– Rename DataColumn.to_pandas method to DataColumn.to_series (#311)
– Rename DataTable.to_pandas method to DataTable.to_dataframe (#319)
– Remove UserWarning when no matching columns found (#325)
structures.
– Remove copy parameter from DataTable.to_dataframe and DataColumn.

to_series (#338)
– Allow pandas ExtensionArrays as inputs to DataColumn (#343)
– Move warnings to a separate exceptions file and call via UserWarning subclasses (#348)
– Make Dask an optional dependency installable with woodwork[dask] (#357)
– Create a guide for using Woodwork with Dask (#304)
– Add conda install instructions (#305, #309)
– Fix README.md badge with correct link (#314)
– Simplify issue templates to make them easier to use (#339)
– Remove extra output cell in Start notebook (#341)
• Testing Changes
– Parameterize numeric time index tests (#288)
– Add DockerHub credentials to CI testing environment (#326)
– Fix removing files for serialization test (#350)
homebrewnerd
Breaking Changes
• The DataColumn.to_pandas method was renamed to DataColumn.to_series.

• The DataTable.to_pandas method was renamed to DataTable.to_dataframe.
• copy is no longer a parameter of DataTable.to_dataframe or DataColumn.to_series.
v0.0.4 Oct 21, 2020
• Enhancements
– Add optional include parameter for DataTable.describe() to filter results (#228)
– Add make_index parameter to DataTable.__init__ to enable optional creation of
a new index column (#238)
– Add support for setting ranking order on columns with Ordinal logical type (#240)
– Add list_semantic_tags function and CLI to get dataframe of woodwork seman-
tic_tags (#244)
– Add support for numeric time index on DataTable (#267)
– Add pop method to DataTable (#289)
– Add entry point to setup.py to run CLI commands (#285)
• Fixes
– Allow numeric datetime time indices (#282)
• Changes

– Remove redundant methods DataTable.select_ltypes and DataTable.

select_semantic_tags (#239)
– Make results of get_mutual_information more clear by sorting and removing self
calculation (#247)
– Lower minimum scikit-learn version to 0.21.3 (#297)
– Add guide for dt.describe and dt.get_mutual_information (#245)
– Update README.md with documentation link (#261)
– Add footer to doc pages with Alteryx Open Source (#258)
– Add types and tags one-sentence definitions to Understanding Types and Tags guide (#271)
– Add issue and pull request templates (#280, #284)
• Testing Changes
– Add automated process to check latest dependencies. (#268)
– Add test for setting a time index with specified string logical type (#279)
homebrewnerd
v0.0.3 Oct 9, 2020
• Enhancements
– Implement setitem on DataTable to create/overwrite an existing DataColumn (#165)
– Add to_pandas method to DataColumn to access the underlying series (#169)
– Add list_logical_types function and CLI to get dataframe of woodwork LogicalTypes
(#172)
– Add describe method to DataTable to generate statistics for the underlying data (#181)
– Add optional return_dataframe parameter to load_retail to return either
DataFrame or DataTable (#189)
– Add get_mutual_information method to DataTable to generate mutual information
between columns (#203)
– Add read_csv function to create DataTable directly from CSV file (#222)
• Fixes
– Fix bug causing incorrect values for quartiles in DataTable.describe method (#187)
– Fix bug in DataTable.describe that could cause an error if certain semantic tags
were applied improperly (#190)
– Fix bug with instantiated LogicalTypes breaking when used with issubclass (#231)
• Changes
– Remove unnecessary add_standard_tags attribute from DataTable (#171)
– Remove standard tags from index column and do not return stats for index column from
DataTable.describe (#196)
structures.
– Update DataColumn.set_semantic_tags and DataColumn.

add_semantic_tags to return new objects (#205)
– Update various DataTable methods to return new objects rather than modifying in place
(#210)
– Move datetime_format to Datetime LogicalType (#216)
– Do not calculate mutual info with index column in DataTable.
get_mutual_information (#221)
– Move setting of underlying physical types from DataTable to DataColumn (#233)
– Remove unused code from sphinx conf.py, update with Github URL(#160, #163)
– Update README and docs with new Woodwork logo, with better code snippets (#161,
#159)
– Add DataTable and DataColumn to API Reference (#162)
– Add docstrings to LogicalType classes (#168)
– Add Woodwork image to index, clear outputs of Jupyter notebook in docs (#173)
– Update contributing.md, release.md with all instructions (#176)
– Add section for setting index and time index to start notebook (#179)
– Rename changelog to Release Notes (#193)
– Add section for standard tags to start notebook (#188)
– Add Understanding Types and Tags user guide (#201)
– Add missing docstring to list_logical_types (#202)
– Add Woodwork Global Configuration Options guide (#215)
• Testing Changes
– Add tests that confirm dtypes are as expected after DataTable init (#152)
– Remove unused none_df test fixture (#224)
– Add test for LogicalType.__str__ method (#225)
erd
v0.0.2 Sep 28, 2020
• Fixes
– Fix formatting issue when printing global config variables (#138)
• Changes
– Change add_standard_tags to use_standard_Tags to better describe behavior (#149)
– Change access of underlying dataframe to be through to_pandas with ._dataframe field
on class (#146)
– Remove replace_none parameter to DataTables (#146)

– Add working code example to README and create Using Woodwork page (#103)
erd
v0.1.0 Sep 24, 2020
• Add natural_language_threshold global config option used for Categori-

cal/NaturalLanguage type inference (#135)
• Add global config options and add datetime_format option for type inference (#134)
• Fix bug with Integer and WholeNumber inference in column with pd.NA values (#133)
• Add DataTable.ltypes property to return series of logical types (#131)
• Add ability to create new datatable from specified columns with dt[[columns]] (#127)
• Handle setting and tagging of index and time index columns (#125)
• Add combined tag and ltype selection (#124)
• Add changelog, and update changelog check to CI (#123)
• Implement reset_semantic_tags (#118)
• Implement DataTable getitem (#119)
• Add remove_semantic_tags method (#117)
• Add semantic tag selection (#106)
• Add github action, rename to woodwork (#113)
• Add license to setup.py (#112)
• Reset semantic tags on logical type change (#107)
• Add standard numeric and category tags (#100)
• Change semantic_types to semantic_tags, a set of strings (#100)
• Update dataframe dtypes based on logical types (#94)
• Add select_logical_types to DataTable (#96)
• Add pygments to dev-requirements.txt (#97)
• Add replacing None with np.nan in DataTable init (#87)
• Refactor DataColumn to make semantic_types and logical_type private (#86)
• Add pandas_dtype to each Logical Type, and remove dtype attribute on DataColumn (#85)
• Add set_semantic_types methods on both DataTable and DataColumn (#75)
• Support passing camel case or snake case strings for setting logical types (#74)
• Improve flexibility when setting semantic types (#72)
• Add Whole Number Inference of Logical Types (#66)
• Add dtypes property to DataTables and repr for DataColumn (#61)
• Allow specification of semantic types during DataTable creation (#69)
• Implements set_logical_types on DataTable (#65)
structures.
• Add init files to tests to fix code coverage (#60)

• Add AutoAssign bot (#59)
• Add logical types validation in DataTables (#49)
• Fix working_directory in CI (#57)
• Add infer_logical_types for DataColumn (#45)
• Fix ReadME library name, and code coverage badge (#56, #56)
• Add code coverage (#51)
• Improve and refactor the validation checks during initialization of a DataTable (#40)
• Add dataframe attribute to DataTable (#39)
• Update ReadME with minor usage details (#37)
• Add License (#34)
• Rename from datatables to datatables (#4)
• Add Logical Types, DataTable, DataColumn (#3)
• Add Makefile, setup.py, requirements.txt (#2)
• Initial Release (#1)
erd

structures.
INDEX
Symbols method), 89
__init__()
__init__() (woodwork.column_accessor.WoodworkColumnAccessor (woodwork.logical_types.URL method),
method), 64 90
__init__() (woodwork.logical_types.Address __init__() (woodwork.table_accessor.WoodworkTableAccessor
method), 76 method), 53
__init__() (woodwork.logical_types.Age method), __init__() (woodwork.table_schema.ColumnSchema
77 method), 72
__init__() (woodwork.logical_types.AgeNullable __init__() (woodwork.table_schema.TableSchema
__init__() (woodwork.logical_types.Boolean __init__() (woodwork.type_sys.type_system.TypeSystem
__init__() (woodwork.logical_types.Categorical
method), 79 A
__init__() (woodwork.logical_types.CountryCode add_semantic_tags() (wood-
method), 80 work.column_accessor.WoodworkColumnAccessor
__init__() (woodwork.logical_types.Datetime method), 64
method), 80 add_semantic_tags() (wood-
__init__() (woodwork.logical_types.Double work.table_accessor.WoodworkTableAccessor
__init__() (woodwork.logical_types.EmailAddress add_semantic_tags() (wood-
method), 82 work.table_schema.TableSchema method),
__init__() (woodwork.logical_types.Filepath 69
method), 82 add_type() (woodwork.type_sys.type_system.TypeSystem
__init__() (woodwork.logical_types.IPAddress method), 92
method), 84 Address (class in woodwork.logical_types), 76
__init__() (woodwork.logical_types.Integer Age (class in woodwork.logical_types), 77
method), 83 AgeNullable (class in woodwork.logical_types), 78
__init__() (woodwork.logical_types.LatLong
method), 84 B
__init__() (woodwork.logical_types.NaturalLanguage Boolean (class in woodwork.logical_types), 78
method), 85
__init__() (woodwork.logical_types.Ordinal C
method), 86 Categorical (class in woodwork.logical_types), 79
__init__() (woodwork.logical_types.PersonFullName ColumnSchema (class in woodwork.table_schema), 72
method), 86 CountryCode (class in woodwork.logical_types), 80
__init__() (woodwork.logical_types.PhoneNumber
method), 87 D
__init__() (woodwork.logical_types.PostalCode
Datetime (class in woodwork.logical_types), 80
method), 88
describe() (woodwork.table_accessor.WoodworkTableAccessor
__init__() (woodwork.logical_types.SubRegionCode
method), 55
method), 88
__init__() (woodwork.logical_types.Timedelta
113
describe_dict() (wood- is_schema_valid() (in module wood-

work.table_accessor.WoodworkTableAccessor work.accessor_utils), 96
method), 55
description() (wood- L
work.column_accessor.WoodworkColumnAccessorLatLong (class in woodwork.logical_types), 84
property), 65 list_logical_types() (in module wood-
Double (class in woodwork.logical_types), 81 work.type_sys.utils), 93
drop() (woodwork.table_accessor.WoodworkTableAccessorlist_semantic_tags() (in module wood-
method), 56 work.type_sys.utils), 93
load_retail() (in module woodwork.demo), 96
E loc() (woodwork.column_accessor.WoodworkColumnAccessor
EmailAddress (class in woodwork.logical_types), 82 property), 66
loc() (woodwork.table_accessor.WoodworkTableAccessor
F property), 57
Filepath (class in woodwork.logical_types), 82 logical_type() (wood-
work.column_accessor.WoodworkColumnAccessor
G property), 66
get_invalid_schema_message() (in module logical_types() (wood-
woodwork.accessor_utils), 95 work.table_accessor.WoodworkTableAccessor
get_valid_mi_types() (in module wood- property), 58
work.utils), 94 logical_types() (wood-
work.table_schema.TableSchema property),
I 70
iloc() (woodwork.column_accessor.WoodworkColumnAccessor
property), 65
M
iloc() (woodwork.table_accessor.WoodworkTableAccessormetadata() (woodwork.column_accessor.WoodworkColumnAccessor
property), 56 property), 66
mutual_information()
index() (woodwork.table_accessor.WoodworkTableAccessor (wood-
property), 56 work.table_accessor.WoodworkTableAccessor
index() (woodwork.table_schema.TableSchema prop- method), 58
erty), 70 mutual_information_dict() (wood-
infer_logical_type() (woodwork.table_accessor.WoodworkTableAccessor
work.type_sys.type_system.TypeSystem method), 58
method), 92
N
init() (woodwork.column_accessor.WoodworkColumnAccessor
method), 65 NaturalLanguage (class in wood-
init() (woodwork.table_accessor.WoodworkTableAccessor work.logical_types), 85
method), 56
init_series() (in module wood- O
work.accessor_utils), 95 Ordinal (class in woodwork.logical_types), 86
Integer (class in woodwork.logical_types), 83
IPAddress (class in woodwork.logical_types), 84 P
is_boolean() (wood- PersonFullName (class in woodwork.logical_types),
work.table_schema.ColumnSchema property), 86
73 PhoneNumber (class in woodwork.logical_types), 87
is_categorical() (wood- physical_types() (wood-
work.table_schema.ColumnSchema property), work.table_accessor.WoodworkTableAccessor
73 property), 59
is_datetime() (wood- pop() (woodwork.table_accessor.WoodworkTableAccessor
work.table_schema.ColumnSchema property), method), 59
73 PostalCode (class in woodwork.logical_types), 88
is_numeric() (wood-
work.table_schema.ColumnSchema property), R
73 read_file() (in module woodwork.utils), 94
114 Index
read_table_typing_information() (in mod- method), 67

ule woodwork.deserialize), 75 set_semantic_tags() (wood-
read_woodwork_table() (in module woodwork.column_accessor.WoodworkColumnAccessor
work.deserialize), 75 method), 67
remove_semantic_tags() (wood- set_time_index() (wood-
work.column_accessor.WoodworkColumnAccessor work.table_accessor.WoodworkTableAccessor
remove_semantic_tags() (wood- set_time_index() (wood-
work.table_accessor.WoodworkTableAccessor work.table_schema.TableSchema method),
method), 59 71
remove_semantic_tags() (wood- set_types() (wood-
work.table_schema.TableSchema method), work.table_accessor.WoodworkTableAccessor
70 method), 61
remove_type() (wood- set_types() (woodwork.table_schema.TableSchema
work.type_sys.type_system.TypeSystem method), 71
method), 92 SubRegionCode (class in woodwork.logical_types),
rename() (woodwork.table_accessor.WoodworkTableAccessor 88
method), 59
rename() (woodwork.table_schema.TableSchema T
method), 70 TableSchema (class in woodwork.table_schema), 68
reset_defaults() (wood- time_index() (wood-
work.type_sys.type_system.TypeSystem work.table_accessor.WoodworkTableAccessor
method), 92 property), 61
reset_semantic_tags() (wood- time_index() (wood-
work.column_accessor.WoodworkColumnAccessor work.table_schema.TableSchema property),
method), 66 71
reset_semantic_tags() (wood- Timedelta (class in woodwork.logical_types), 89
work.table_accessor.WoodworkTableAccessor to_dictionary() (wood-
method), 60 work.table_accessor.WoodworkTableAccessor
reset_semantic_tags() (wood- method), 62
work.table_schema.TableSchema method), to_disk() (woodwork.table_accessor.WoodworkTableAccessor
70 method), 62
types() (woodwork.table_accessor.WoodworkTableAccessor
S property), 62
schema() (woodwork.table_accessor.WoodworkTableAccessor
types() (woodwork.table_schema.TableSchema prop-
property), 60 erty), 72
select() (woodwork.table_accessor.WoodworkTableAccessor
TypeSystem (class in wood-
method), 60 work.type_sys.type_system), 91
semantic_tags() (wood- typing_info_to_dict() (in module wood-
work.column_accessor.WoodworkColumnAccessor work.serialize), 74
property), 66
semantic_tags() (wood- U
work.table_accessor.WoodworkTableAccessor update_inference_function() (wood-
property), 60 work.type_sys.type_system.TypeSystem
semantic_tags() (wood- method), 92
work.table_schema.TableSchema property), update_relationship() (wood-
71 work.type_sys.type_system.TypeSystem
set_index() (wood- method), 93
work.table_accessor.WoodworkTableAccessor URL (class in woodwork.logical_types), 90
method), 61 use_standard_tags() (wood-
set_index() (woodwork.table_schema.TableSchema work.column_accessor.WoodworkColumnAccessor
method), 71 property), 67
set_logical_type() (wood- use_standard_tags() (wood-
work.column_accessor.WoodworkColumnAccessor work.table_accessor.WoodworkTableAccessor
Index 115
property), 62
use_standard_tags() (wood-
work.table_schema.TableSchema property),
72
V
value_counts() (wood-
work.table_accessor.WoodworkTableAccessor
method), 63
W
WoodworkColumnAccessor (class in wood-
work.column_accessor), 64
WoodworkTableAccessor (class in wood-
work.table_accessor), 53
write_dataframe() (in module wood-
work.serialize), 74
write_typing_info() (in module wood-
work.serialize), 74
write_woodwork_table() (in module wood-
work.serialize), 74
116 Index

Woodwork Alteryx Com en v0.3.1

Uploaded by

Copyright:

Available Formats

Woodwork Alteryx Com en v0.3.1

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Woodwork Alteryx Com en v0.3.1

Uploaded by

Copyright:

Available Formats

Woodwork Documentation

May 12, 2021

WOODWORK IS A LIBRARY THAT HELPS WITH DATA TYPING OF

1.1 Quick Start

[1]: import woodwork as ww

[2]: filtered_df = df.ww.select(include=['numeric', 'Boolean'])

(continued from previous page)

1.1.1 Table of contents

To install Woodwork from PyPI, run the following command:

python -m pip install woodwork

python -m pip install "woodwork[complete]"

You can use Woodwork with Dask DataFrames by running:

python -m pip install "woodwork[dask]"

You can use Woodwork with Koalas DataFrames by running:

python -m pip install "woodwork[koalas]"

To install Woodwork from conda run the following command:

conda install -c conda-forge woodwork

git clone https://github.com/alteryx/woodwork.git

Dependency Min Version Notes

To make contributions to the codebase, please follow the guidelines here.

Types and Tags

[1]: import pandas as pd

1.1. Quick Start 5

(continued from previous page)

order_date unit_price customer_name country total \

Initializing Woodwork on a DataFrame

[2]: import woodwork as ww

df.ww.init(name="retail", make_index=True, index="order_product_id")

Instead of calling pd.DataFrame({'id':[1, 2, 3]}).ww.init(), first store the DataFrame in a new

[3]: head_df = df.ww.head(5)

quantity order_date unit_price customer_name country \

1.1. Quick Start 7

Updating Logical Types

(continued from previous page)

[401604 rows x 4 columns]

Adding Semantic Tags

[8]: df.ww.set_types(semantic_tags={'description':'product_details', 'total': 'currency'})

[9]: category_df = df.ww.select('category')

[10]: category_numeric_df = df.ww.select(['numeric', 'category'])

1.1. Quick Start 9

[11]: mixed_df = df.ww.select(['Boolean', 'product_details'])

[12]: total = df.ww['total']

Select multiple columns by supplying a list of column names.

[14]: multiple_cols_df = df.ww[['product_id', 'total', 'unit_price']]

Removing Semantic Tags

(continued from previous page)

Set Index and Time Index

1.1. Quick Start 11

(continued from previous page)

Using Woodwork with a Series

List Logical Types

1.1. Quick Start 13

(continued from previous page)

standard_tags is_default_type is_registered parent_type

The guides below provide more detail on the functionality of Woodwork.

Understanding Types and Tags

Definitions of Types and Tags

1.1. Quick Start 15