This document describes the feature data storage format for offline retrieval in Feast.
One of the design goals of Feast is to plug seamlessly into existing infrastructure and avoid adding operational overhead to your ML stack. So instead of being yet another database, Feast relies on existing data storage facilities to store offline feature data.
Feast provides first-class support for the following data warehouses (DWH) to store feature data offline out of the box:
- BigQuery
- Snowflake
- Redshift
The integration between Feast and the DWH is highly configurable, but at the same time there are some non-configurable implications and assumptions that Feast imposes on table schemas and on the mapping between database-native types and the Feast type system. Those implications and assumptions are the subject of this document.
For brevity, below we'll use just "DWH" for the data warehouse that is used as an offline storage engine for feature data.
For common Feast terms, such as "Feature Table" and "Entity", please refer to the Feast glossary.
Feature data is stored in tables in the DWH, with one DWH table per Feast Feature Table. Each table in the DWH is expected to have the following groups of columns:
- One or more Entity columns. Together they compose an Entity Key. Their types should match Entity type definitions in Feast metadata, according to the mapping for the specific DWH engine being used. The name of the column must match the entity name.
- One event timestamp column. The type is the DWH-specific timestamp type. The name of the column is set when you configure the offline data source.
- An optional "created timestamp" column. This is typically the wallclock time at which the feature value was computed. If two feature values share the same Entity Key and event timestamp, the one with the more recent created timestamp takes precedence. The type is the DWH-specific timestamp type. The name of the column is set when you configure the offline data source.
- One or more feature value columns. Their types should match the Feature types defined in Feast metadata, according to the mapping for the specific DWH engine being used. The names must match the feature names, but can optionally be remapped when configuring the offline data source.
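The table layout and the created-timestamp precedence rule can be sketched with a small pandas dataframe. This is an illustration only, not Feast's implementation; the `driver_id` entity, `conv_rate` feature, and column names are hypothetical.

```python
import pandas as pd

# A sketch of the expected offline table layout, with one column group per
# bullet above: entity column, event timestamp, created timestamp, feature.
df = pd.DataFrame(
    {
        # Entity column: name must match the entity name in Feast metadata.
        "driver_id": [1001, 1001, 1002],
        # Event timestamp column: name is set in the offline source config.
        "event_timestamp": pd.to_datetime(["2024-01-01"] * 3),
        # Optional created timestamp column: wallclock time of computation.
        "created": pd.to_datetime(
            ["2024-01-01 01:00", "2024-01-01 02:00", "2024-01-01 01:00"]
        ),
        # Feature value column: name must match the feature name (or be remapped).
        "conv_rate": [0.1, 0.2, 0.3],
    }
)

# For rows sharing the same entity key and event timestamp, the most recent
# created timestamp takes precedence:
deduped = (
    df.sort_values("created")
    .drop_duplicates(subset=["driver_id", "event_timestamp"], keep="last")
    .reset_index(drop=True)
)
```

Here driver 1001 has two rows for the same event timestamp; the one created at 02:00 wins.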
Here's how Feast types map to Pandas types for Feast APIs that take in or return a Pandas dataframe:
| Feast Type | Pandas Type |
|---|---|
| Event Timestamp | datetime64[ns] |
| BYTES | bytes |
| STRING | str, category |
| INT32 | int16, uint16, int32, uint32 |
| INT64 | int64, uint64 |
| UNIX_TIMESTAMP | datetime64[ns], datetime64[ns, tz] |
| DOUBLE | float64 |
| FLOAT | float32 |
| BOOL | bool |
| BYTES_LIST | list[bytes] |
| STRING_LIST | list[str] |
| INT32_LIST | list[int] |
| INT64_LIST | list[int] |
| UNIX_TIMESTAMP_LIST | list[unix_timestamp] |
| DOUBLE_LIST | list[float] |
| FLOAT_LIST | list[float] |
| BOOL_LIST | list[bool] |
| MAP | dict (Dict[str, Any]) |
| MAP_LIST | list[dict] (List[Dict[str, Any]]) |
| JSON | object (parsed Python dict/list/str) |
| JSON_LIST | list[object] |
| STRUCT | dict (Dict[str, Any]) |
| STRUCT_LIST | list[dict] (List[Dict[str, Any]]) |
Note that this mapping is non-injective: more than one Pandas type may correspond to the same Feast type (but not vice versa). In these cases, when converting Feast values to Pandas, the first Pandas type listed in the table above is used.
Feast array types are mapped to a Pandas column with `object` dtype that contains a Python list of values of the corresponding type.
Another thing to note is that Feast doesn't support the timestamp type for entity and feature columns. Values of datetime type found in entity and feature columns of a Pandas dataframe are converted to int64. To easily differentiate plain int64 features from timestamp features, Feast provides the UNIX_TIMESTAMP type, which is an int64 under the hood.
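The datetime-to-int64 conversion described above can be sketched in plain pandas. This is an illustration of the representation, not Feast's actual conversion code, and the nanosecond unit shown here is an assumption for the example:

```python
import pandas as pd

# A datetime column, as it might appear in a feature dataframe.
ts = pd.Series(pd.to_datetime(["2021-01-01", "2021-06-01"]))

# A UNIX_TIMESTAMP feature is an int64 under the hood; casting a
# datetime64[ns] series to int64 yields nanoseconds since the epoch.
as_int64 = ts.astype("int64")

# The conversion is lossless: the int64 values round-trip back to
# the original timestamps.
back = pd.to_datetime(as_int64)
```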
Here's how Feast types map to BigQuery types when using BigQuery for offline storage, e.g. when reading data from BigQuery into the online store:
| Feast Type | BigQuery Type |
|---|---|
| Event Timestamp | DATETIME |
| BYTES | BYTES |
| STRING | STRING |
| INT32 | INT64 / INTEGER |
| INT64 | INT64 / INTEGER |
| UNIX_TIMESTAMP | INT64 / INTEGER |
| DOUBLE | FLOAT64 / FLOAT |
| FLOAT | FLOAT64 / FLOAT |
| BOOL | BOOL |
| BYTES_LIST | ARRAY<BYTES> |
| STRING_LIST | ARRAY<STRING> |
| INT32_LIST | ARRAY<INT64> |
| INT64_LIST | ARRAY<INT64> |
| UNIX_TIMESTAMP_LIST | ARRAY<INT64> |
| DOUBLE_LIST | ARRAY<FLOAT64> |
| FLOAT_LIST | ARRAY<FLOAT64> |
| BOOL_LIST | ARRAY<BOOL> |
| MAP | JSON / STRUCT |
| MAP_LIST | ARRAY<JSON> / ARRAY<STRUCT> |
| JSON | JSON |
| JSON_LIST | ARRAY<JSON> |
| STRUCT | STRUCT / RECORD |
| STRUCT_LIST | ARRAY<STRUCT> |
Type combinations not listed in the table above will cause an error on conversion.
Here's how Feast types map to Snowflake types when using Snowflake for offline storage. See the Snowflake-to-Pandas mapping in the Snowflake documentation: https://docs.snowflake.com/en/user-guide/python-connector-pandas.html#snowflake-to-pandas-data-mapping
| Feast Type | Snowflake Python Type |
|---|---|
| Event Timestamp | datetime64[ns] |
| UNIX_TIMESTAMP | datetime64[ns] |
| STRING | str |
| INT32 | int8 / uint8 / int16 / uint16 / int32 / uint32 |
| INT64 | int64 / uint64 |
| DOUBLE | float64 |
| MAP | VARIANT / OBJECT |
| JSON | JSON / VARIANT |
Here's how Feast types map to Redshift types when using Redshift for offline storage:
| Feast Type | Redshift Type |
|---|---|
| Event Timestamp | TIMESTAMP / TIMESTAMPTZ |
| BYTES | VARBYTE |
| STRING | VARCHAR |
| INT32 | INT4 / SMALLINT |
| INT64 | INT8 / BIGINT |
| DOUBLE | FLOAT8 / DOUBLE PRECISION |
| FLOAT | FLOAT4 / REAL |
| BOOL | BOOL |
| MAP | SUPER |
| JSON | SUPER |
Note: Redshift's SUPER type stores semi-structured JSON data. During materialization, Feast automatically handles SUPER columns that are exported as JSON strings by parsing them back into Python dictionaries before converting to MAP proto values.
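The parsing step described in the note can be sketched as follows. The row contents and column names here are hypothetical, and this is an illustration of the idea rather than Feast's actual materialization code:

```python
import json

# A hypothetical row as exported from Redshift during materialization:
# the SUPER column "tags" arrives serialized as a JSON string.
exported_row = {"driver_id": 1001, "tags": '{"tier": "gold", "region": "us-east"}'}

value = exported_row["tags"]
if isinstance(value, str):
    # Parse the exported JSON string back into a Python dict, which can
    # then be converted to a MAP proto value.
    value = json.loads(value)
```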