
# Feast Offline Store Format

## Overview

This document describes the storage format for offline feature data in Feast.

One of the design goals of Feast is to plug seamlessly into existing infrastructure and avoid adding operational overhead to your ML stack. So instead of being yet another database, Feast relies on existing data storage facilities to store offline feature data.

Feast provides first-class, out-of-the-box support for the following data warehouses (DWH) for storing feature data offline:

- BigQuery
- Snowflake
- Redshift

The integration between Feast and the DWH is highly configurable, but at the same time Feast imposes some non-configurable assumptions on table schemas and on the mapping between database-native types and the Feast type system. That is what this document covers.

## Terminology

For brevity, below we'll use "DWH" for the data warehouse that serves as the offline storage engine for feature data.

For common Feast terms such as "Feature Table" and "Entity", please refer to the Feast glossary.

## Table schema

Feature data is stored in tables in the DWH. There is one DWH table per Feast Feature Table. Each table in the DWH is expected to have the following groups of columns:

- One or more Entity columns. Together they compose an Entity Key. Their types should match the Entity type definitions in Feast metadata, according to the mapping for the specific DWH engine being used. The name of each column must match the entity name.
- One entity timestamp column, also called the "event timestamp". Its type is the DWH-specific timestamp type. The name of the column is set when you configure the offline data source.
- An optional "created timestamp" column. This is typically the wall-clock time at which the feature value was computed. If there are two feature values with the same Entity Key and event timestamp, the one with the more recent created timestamp takes precedence. Its type is the DWH-specific timestamp type. The name of the column is set when you configure the offline data source.
- One or more feature value columns. Their types should match the Feature types defined in Feast metadata, according to the mapping for the specific DWH engine being used. The names must match feature names, but can optionally be remapped when configuring the offline data source.
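As a sketch, here is what such a table could look like when loaded into pandas. The `driver_id` entity, column names, and `avg_trips` feature are hypothetical, chosen only to illustrate the column groups and the created-timestamp precedence rule:

```python
import pandas as pd

# Illustrative offline table for a hypothetical "driver_hourly_stats" Feature Table.
# Column groups: entity key, event timestamp, created timestamp, feature values.
df = pd.DataFrame(
    {
        "driver_id": [1001, 1001, 1002],    # entity column (entity key)
        "event_timestamp": pd.to_datetime(  # event timestamp column
            ["2023-01-01 10:00", "2023-01-01 10:00", "2023-01-01 11:00"]
        ),
        "created": pd.to_datetime(          # optional created timestamp column
            ["2023-01-01 10:05", "2023-01-01 10:30", "2023-01-01 11:05"]
        ),
        "avg_trips": [14.2, 14.5, 9.0],     # feature value column
    }
)

# For duplicate (entity key, event timestamp) pairs, the row with the most
# recent created timestamp takes precedence:
latest = (
    df.sort_values("created")
      .drop_duplicates(["driver_id", "event_timestamp"], keep="last")
)
```

Here the two rows for driver `1001` share an event timestamp, so only the one created at 10:30 (with `avg_trips = 14.5`) survives deduplication.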

## Type mappings

### Pandas types

Here's how Feast types map to Pandas types for Feast APIs that take in or return a Pandas dataframe:

| Feast Type | Pandas Type |
| --- | --- |
| Event Timestamp | datetime64[ns] |
| BYTES | bytes |
| STRING | str, category |
| INT32 | int16, uint16, int32, uint32 |
| INT64 | int64, uint64 |
| UNIX_TIMESTAMP | datetime64[ns], datetime64[ns, tz] |
| DOUBLE | float64 |
| FLOAT | float32 |
| BOOL | bool |
| BYTES_LIST | list[bytes] |
| STRING_LIST | list[str] |
| INT32_LIST | list[int] |
| INT64_LIST | list[int] |
| UNIX_TIMESTAMP_LIST | list[unix_timestamp] |
| DOUBLE_LIST | list[float] |
| FLOAT_LIST | list[float] |
| BOOL_LIST | list[bool] |
| MAP | dict (Dict[str, Any]) |
| MAP_LIST | list[dict] (List[Dict[str, Any]]) |
| JSON | object (parsed Python dict/list/str) |
| JSON_LIST | list[object] |
| STRUCT | dict (Dict[str, Any]) |
| STRUCT_LIST | list[dict] (List[Dict[str, Any]]) |

Note that this mapping is non-injective: more than one Pandas type may correspond to a single Feast type (but not vice versa). In these cases, when converting Feast values to Pandas, the first Pandas type listed in the table above is used.

Feast array types are mapped to a pandas column with object dtype that contains a Python list of values of the corresponding type.

Another thing to note is that Feast doesn't support a timestamp type for entity and feature columns. Values of datetime type found in entity and feature columns of a pandas dataframe are converted to int64. To easily differentiate plain int64 features from timestamp features, there is a UNIX_TIMESTAMP type that is an int64 under the hood.
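A minimal sketch of that distinction: since a datetime feature column is represented as int64 anyway, converting timestamps to Unix epoch integers explicitly makes the UNIX_TIMESTAMP intent clear. The `last_login` column below is illustrative, not a Feast API:

```python
import pandas as pd

df = pd.DataFrame(
    {"last_login": pd.to_datetime(["2023-01-01 00:00:00", "2023-01-02 00:00:00"])}
)

# datetime64[ns] is nanoseconds since the Unix epoch as int64 under the hood;
# dividing by 10**9 yields epoch seconds, suitable for a UNIX_TIMESTAMP feature.
df["last_login_unix"] = df["last_login"].astype("int64") // 10**9
```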

### BigQuery types

Here's how Feast types map to BigQuery types when using BigQuery for offline storage, and when reading data from BigQuery into the online store:

| Feast Type | BigQuery Type |
| --- | --- |
| Event Timestamp | DATETIME |
| BYTES | BYTES |
| STRING | STRING |
| INT32 | INT64 / INTEGER |
| INT64 | INT64 / INTEGER |
| UNIX_TIMESTAMP | INT64 / INTEGER |
| DOUBLE | FLOAT64 / FLOAT |
| FLOAT | FLOAT64 / FLOAT |
| BOOL | BOOL |
| BYTES_LIST | ARRAY&lt;BYTES&gt; |
| STRING_LIST | ARRAY&lt;STRING&gt; |
| INT32_LIST | ARRAY&lt;INT64&gt; |
| INT64_LIST | ARRAY&lt;INT64&gt; |
| UNIX_TIMESTAMP_LIST | ARRAY&lt;INT64&gt; |
| DOUBLE_LIST | ARRAY&lt;FLOAT64&gt; |
| FLOAT_LIST | ARRAY&lt;FLOAT64&gt; |
| BOOL_LIST | ARRAY&lt;BOOL&gt; |
| MAP | JSON / STRUCT |
| MAP_LIST | ARRAY&lt;JSON&gt; / ARRAY&lt;STRUCT&gt; |
| JSON | JSON |
| JSON_LIST | ARRAY&lt;JSON&gt; |
| STRUCT | STRUCT / RECORD |
| STRUCT_LIST | ARRAY&lt;STRUCT&gt; |

Values of types not listed in the table above will cause an error on conversion.
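The scalar portion of this mapping, plus the ARRAY&lt;...&gt; rule for list types, can be sketched as a simple lookup. The table below is hand-transcribed from the mapping above (primary BigQuery type only) and is not taken from Feast source code:

```python
# Hand-transcribed scalar mappings (primary BigQuery type only).
FEAST_TO_BQ = {
    "BYTES": "BYTES",
    "STRING": "STRING",
    "INT32": "INT64",
    "INT64": "INT64",
    "UNIX_TIMESTAMP": "INT64",
    "DOUBLE": "FLOAT64",
    "FLOAT": "FLOAT64",
    "BOOL": "BOOL",
}

def bq_type(feast_type: str) -> str:
    """Resolve a Feast scalar type name to its BigQuery type.

    List types ("*_LIST") become ARRAY<...> of the element's BigQuery type.
    Raises KeyError for types outside the table, mirroring the conversion
    error described above.
    """
    if feast_type.endswith("_LIST"):
        return f"ARRAY<{FEAST_TO_BQ[feast_type[:-5]]}>"
    return FEAST_TO_BQ[feast_type]
```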

### Snowflake types

Here's how Feast types map to Snowflake types when using Snowflake for offline storage. See the source here: https://docs.snowflake.com/en/user-guide/python-connector-pandas.html#snowflake-to-pandas-data-mapping

| Feast Type | Snowflake Python Type |
| --- | --- |
| Event Timestamp | DATETIME64[NS] |
| UNIX_TIMESTAMP | DATETIME64[NS] |
| STRING | STR |
| INT32 | INT8 / UINT8 / INT16 / UINT16 / INT32 / UINT32 |
| INT64 | INT64 / UINT64 |
| DOUBLE | FLOAT64 |
| MAP | VARIANT / OBJECT |
| JSON | JSON / VARIANT |

### Redshift types

Here's how Feast types map to Redshift types when using Redshift for offline storage:

| Feast Type | Redshift Type |
| --- | --- |
| Event Timestamp | TIMESTAMP / TIMESTAMPTZ |
| BYTES | VARBYTE |
| STRING | VARCHAR |
| INT32 | INT4 / SMALLINT |
| INT64 | INT8 / BIGINT |
| DOUBLE | FLOAT8 / DOUBLE PRECISION |
| FLOAT | FLOAT4 / REAL |
| BOOL | BOOL |
| MAP | SUPER |
| JSON | SUPER |

Note: Redshift's SUPER type stores semi-structured JSON data. During materialization, Feast automatically handles SUPER columns that are exported as JSON strings by parsing them back into Python dictionaries before converting them to MAP proto values.
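A sketch of that parsing step, assuming SUPER values arrive as JSON-encoded strings after export. The row values here are illustrative and this is not the actual Feast internal code:

```python
import json

# SUPER values exported from Redshift often arrive as JSON-encoded strings.
raw_rows = ['{"city": "Berlin", "tier": 2}', None, '{"city": "Oslo"}']

# Parse each string back into a Python dict before building MAP proto values;
# NULL SUPER values stay None.
parsed = [json.loads(v) if v is not None else None for v in raw_rows]
```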