This documentation describes the spec files used by Feast. There are 4 kinds of specs:
- Entity Spec, which describes an entity definition.
- Feature Spec, which describes a feature definition.
- Import Spec, which describes how to ingest/populate data for one or more features.
- Storage Spec, which describes storage, either for warehouse or serving.
An entity is a type with an associated key which generally maps onto a known domain object, e.g. driver, customer, area, or merchant. An entity determines how a feature may be retrieved, e.g. for a driver entity, all driver features must be looked up with an associated driver id entity key.
An entity is described by an entity spec. The following is an example of an entity spec:
```yaml
name: driver
description: GO-JEK’s 2 wheeler driver
tags:
- two wheeler
- gojek
```
| Name | Type | Convention | Description |
|---|---|---|---|
| name | string | Lower snake case (e.g. driver or driver_area) | Entity name MUST be unique |
| description | string | N.A. | Description of the entity |
| tags | List of string | N.A. | Free form grouping |
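The conventions above can be checked mechanically. The following is a minimal sketch (not part of Feast; `validate_entity_spec` is a hypothetical helper) of validating an entity spec, parsed into a dict, against the table above:

```python
import re

# Lower snake case as described in the convention column above.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")

def validate_entity_spec(spec):
    """Return a list of convention violations (empty list means OK)."""
    errors = []
    name = spec.get("name")
    if not name:
        errors.append("name is required")
    elif not SNAKE_CASE.match(name):
        errors.append("name must be lower snake case, e.g. driver_area")
    if not isinstance(spec.get("tags", []), list):
        errors.append("tags must be a list of strings")
    return errors

# Example: the driver entity spec from above, as a dict.
spec = {
    "name": "driver",
    "description": "GO-JEK's 2 wheeler driver",
    "tags": ["two wheeler", "gojek"],
}
print(validate_entity_spec(spec))  # []
```

Uniqueness of the entity name can only be checked against the registry, so it is out of scope for a local check like this.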
A Feature is an individual measurable property or characteristic of an Entity. A feature has a many-to-one relationship with an entity, referencing the entity's name from its feature spec. In the context of Feast, a Feature has the following attributes:
- Entity - it must be associated with an Entity known to Feast (see Entity Spec)
- ValueType - the feature type must be defined, e.g. String, Bytes, Int64, Int32, Float, etc.
The following is an example of a feature spec:
```yaml
id: titanic_passenger.alone
name: alone
entity: titanic_passenger
owner: [email protected]
description: binary variable denoting whether the passenger was alone on the titanic.
valueType: INT64
uri: http://jupyter.s.ds.golabs.io/user/your_user/lab/tree/shared/zhiling.c/feast_titanic/titanic.ipynb
```
| Name | Type | Convention | Description |
|---|---|---|---|
| entity | string | lower snake case (e.g. driver, driver_area) | Entity related to this feature |
| id | string | lower snake case with format [entity].[featureName] (e.g. customer.age_prediction) | Unique identifier of a feature, used for feature retrieval from both serving and warehouse |
| name | string | lower snake case, e.g. age_prediction | Short name of a feature |
| owner | string | use email or any unique identifier | Mainly used to inform users who is responsible for maintaining the feature |
| description | string | Keep the description concise; more information about the feature can be provided as documentation linked by uri | Human readable description of a feature |
| valueType | Enum (one of: BYTES, STRING, INT32, INT64, DOUBLE, FLOAT, BOOL, TIMESTAMP) | N.A. | Value type of the feature |
| group | string | lower snake case | Feature group inherited by this feature |
| tags | List of string | N.A. | Free form grouping |
| options | Key-value string | N.A. | Used for extensibility of the feature spec |
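The id convention above ties a feature spec together: the id prefix must be the entity, and valueType must come from the enum. As a minimal sketch (not Feast code; `check_feature_spec` is a hypothetical helper):

```python
import re

SNAKE = r"[a-z][a-z0-9]*(?:_[a-z0-9]+)*"
# Feature id convention from the table: [entity].[featureName].
FEATURE_ID = re.compile(rf"^{SNAKE}\.{SNAKE}$")

VALUE_TYPES = {"BYTES", "STRING", "INT32", "INT64",
               "DOUBLE", "FLOAT", "BOOL", "TIMESTAMP"}

def check_feature_spec(spec):
    """Return a list of convention violations (empty list means OK)."""
    errors = []
    fid = spec.get("id", "")
    if not FEATURE_ID.match(fid):
        errors.append(f"id {fid!r} must look like entity.feature_name")
    elif fid.split(".", 1)[0] != spec.get("entity"):
        errors.append("id prefix must match the entity field")
    if spec.get("valueType") not in VALUE_TYPES:
        errors.append("valueType must be one of the supported enum values")
    return errors

# Example: the titanic_passenger.alone feature spec from above.
spec = {
    "id": "titanic_passenger.alone",
    "name": "alone",
    "entity": "titanic_passenger",
    "valueType": "INT64",
}
print(check_feature_spec(spec))  # []
```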
An import spec describes how data is ingested into Feast to populate one or more features. An import spec contains information about:
- The entities to be imported.
- The import source type.
- The import source options, as required by the type.
- The import source schema, if required by the type.
Feast supports ingesting features from 4 types of sources:
- File (either CSV or JSON)
- BigQuery Table
- PubSub Topic
- PubSub Subscription
The following sections describe the import spec that has to be created for each of these sources.
Example of a CSV file import spec:
```yaml
type: file.csv
sourceOptions:
  path: gs://my-bucket/customer_total_purchase.csv
entities:
- customer
schema:
  entityIdColumn: customer_id
  timestampValue: 2018-09-25T00:00:00.000Z
  fields:
  - name: timestamp
  - name: customer_id
  - name: total_purchase
    featureId: customer.total_purchase
```
Example of a JSON file import spec:
```yaml
type: file.json
sourceOptions:
  path: gs://my-bucket/customer_total_purchase.json
entities:
- customer
schema:
  entityIdColumn: customer_id
  timestampValue: 2018-09-25T00:00:00.000Z
  fields:
  - name: timestamp
  - name: customer_id
  - name: total_purchase
    featureId: customer.total_purchase
```
Notes on the import spec for file types:
- The `type` field must be `file.csv` or `file.json`.
- `sourceOptions.path` specifies the location of the file; it must be accessible by the ingestion job (e.g. GCS).
- The `entities` list must contain only one entity.
- `schema` must be specified.
- `schema.entityIdColumn` must be the same as one of the `fields` names. It is used to find out which column is used as the entity ID.
- `schema.timestampColumn` can be specified to give the name of the timestamp column. This field is mutually exclusive with `schema.timestampValue`. If `schema.timestampValue` is used instead, all features will have the same timestamp.
- `schema.fields` is the list of columns/fields in the file. You can mark a field as containing a feature by adding a `featureId` attribute with the feature's ID.
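The constraints above are easy to get wrong by hand. The following is a minimal sketch (not Feast code; `check_file_import_spec` is a hypothetical helper) of checking them on a file import spec parsed into a dict:

```python
def check_file_import_spec(spec):
    """Return a list of violations of the file import spec rules (empty means OK)."""
    errors = []
    if spec.get("type") not in ("file.csv", "file.json"):
        errors.append("type must be file.csv or file.json")
    if len(spec.get("entities", [])) != 1:
        errors.append("entities must contain only one entity")
    schema = spec.get("schema")
    if schema is None:
        return errors + ["schema must be specified"]
    names = [f.get("name") for f in schema.get("fields", [])]
    if schema.get("entityIdColumn") not in names:
        errors.append("entityIdColumn must match one of the field names")
    if "timestampColumn" in schema and "timestampValue" in schema:
        errors.append("timestampColumn and timestampValue are mutually exclusive")
    return errors

# Example: the CSV import spec from above, as a dict.
spec = {
    "type": "file.csv",
    "sourceOptions": {"path": "gs://my-bucket/customer_total_purchase.csv"},
    "entities": ["customer"],
    "schema": {
        "entityIdColumn": "customer_id",
        "timestampValue": "2018-09-25T00:00:00.000Z",
        "fields": [
            {"name": "timestamp"},
            {"name": "customer_id"},
            {"name": "total_purchase", "featureId": "customer.total_purchase"},
        ],
    },
}
print(check_file_import_spec(spec))  # []
```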
The import spec for BigQuery is almost identical to that for a file. The only difference is that every `schema.fields.name` has to match a column name of the source table.
For example, to import the following table from the BigQuery table `gcp-project.source-dataset.source-table` into a feature with id `customer.last_login`:
| customer_id | last_login |
|---|---|
| ... | ... |
| ... | ... |
You will need to specify an import spec which looks like this:
```yaml
type: bigquery
options:
  project: gcp-project
  dataset: source-dataset
  table: source-table
entities:
- customer
schema:
  entityIdColumn: customer_id
  timestampValue: 2018-10-25T00:00:00.000Z
  fields:
  - name: customer_id
  - name: last_login
    featureId: customer.last_login
```
Notes on the import spec for the bigquery type:
- The `type` field must be `bigquery`.
- `options.project` specifies the GCP project of the source table.
- `options.dataset` specifies the BigQuery dataset of the source table.
- `options.table` specifies the source table name.
- The `entities` list must contain only one entity.
- `schema.entityIdColumn` must be the same as one of the `fields` names. It is used to find out which column is used as the entity ID.
- `schema.timestampColumn` can be specified to give the name of the timestamp column. This field is mutually exclusive with `schema.timestampValue`. If `schema.timestampValue` is used instead, all features will have the same timestamp.
- `schema.fields` is the list of columns in the table, and every `schema.fields.name` must match a column name of the table. You can mark a column as containing a feature by adding a `featureId` attribute with the feature's ID.
Important note: Be careful when ingesting a partitioned table, since by default the ingestion job will read the whole table, which might cause unpredictable ingestion results. You can specify which partition to ingest by using a table decorator, as follows:
```yaml
type: bigquery
options:
  project: gcp-project
  dataset: source-dataset
  table: source-table$20181101 # yyyyMMdd format, it will only ingest the said partition.
entities:
- customer
schema:
  entityIdColumn: customer_id
  timestampValue: 2018-10-25T00:00:00.000Z
  fields:
  - name: customer_id
  - name: last_login
    featureId: customer.last_login
```
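The partition decorator is just the table name with `$yyyyMMdd` appended. A tiny sketch (a hypothetical helper, not part of Feast) of computing the `options.table` value for a given day:

```python
from datetime import date

def table_with_partition(table, day):
    """Append a BigQuery partition decorator ($yyyyMMdd) to a table name,
    so the ingestion job reads only that day's partition."""
    return f"{table}${day:%Y%m%d}"

print(table_with_partition("source-table", date(2018, 11, 1)))
# source-table$20181101
```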
You can ingest from either a PubSub topic or subscription.
For example, the following import spec will ingest data from the PubSub topic feast-test in the GCP project my-gcp-project:
```yaml
type: pubsub
options:
  topic: projects/my-gcp-project/topics/feast-test
entities:
- customer
schema:
  fields:
  - featureId: customer.last_login
```
Or, if the source data is a subscription named feast-test-subscription:
```yaml
type: pubsub
options:
  subscription: projects/my-gcp-project/subscriptions/feast-test-subscription
entities:
- customer
schema:
  fields:
  - featureId: customer.last_login
```
The PubSub source type expects that all messages that the ingestion job receives are binary encoded FeatureRow protobufs. See FeatureRow.proto.
Note that:
- You must be explicit about which features should be ingested using `schema.fields`.
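Since only fields tagged with a `featureId` are ingested, it can be useful to list which features a PubSub import spec will actually populate. A minimal sketch (a hypothetical helper, not Feast code):

```python
def ingested_feature_ids(import_spec):
    """Collect the feature IDs explicitly listed in schema.fields."""
    fields = import_spec.get("schema", {}).get("fields", [])
    return [f["featureId"] for f in fields if "featureId" in f]

# Example: the PubSub topic import spec from above, as a dict.
spec = {
    "type": "pubsub",
    "options": {"topic": "projects/my-gcp-project/topics/feast-test"},
    "entities": ["customer"],
    "schema": {"fields": [{"featureId": "customer.last_login"}]},
}
print(ingested_feature_ids(spec))  # ['customer.last_login']
```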
There are 2 kinds of storage in Feast:
- Serving storage (BigTable and Redis)
- Warehouse storage (BigQuery)
Serving storage is intended to support low latency feature retrieval, whereas warehouse storage is intended for training or data exploration. Each supported storage has slightly different options. The following are the storage spec descriptions for each storage.
BigQuery is used as warehouse storage in Feast.
```yaml
id: BIGQUERY1
type: bigquery
options:
  dataset: "test_feast"
  project: "the-big-data-staging-007"
  tempLocation: "gs://zl-test-bucket"
```
BigTable is used as serving storage. It is advisable to use BigTable for features which don't require very low latency access or which have a huge amount of data.
```yaml
id: BIGTABLE1
type: bigtable
options:
  instance: "ds-staging"
  project: "the-big-data-staging-007"
```
Redis is recommended for features which require very low latency access. However, it has limited storage capacity compared to BigTable.
```yaml
id: REDIS1
type: redis
options:
  host: "127.0.0.1"
  port: "6379"
```