Commit 1200dbf

feat: Add complex type support (Map, JSON, Struct) with schema validation (#5974)
* feat: Fix Map/Dict support and implement schema validation

  Signed-off-by: ntkathole <[email protected]>

* feat: Modified default example with different data types

  Signed-off-by: ntkathole <[email protected]>

---------

Signed-off-by: ntkathole <[email protected]>
Co-authored-by: Francisco Javier Arceo <[email protected]>
1 parent 7ab7642 commit 1200dbf

40 files changed, +2362 -172 lines changed

.pre-commit-config.yaml

Lines changed: 2 additions & 0 deletions
@@ -14,6 +14,7 @@ repos:
         stages: [commit]
         language: system
         types: [python]
+        exclude: '_pb2\.py$'
         entry: bash -c 'uv run ruff check --fix "$@" && uv run ruff format "$@"' --
         pass_filenames: true

@@ -24,6 +25,7 @@ repos:
         stages: [commit]
         language: system
         types: [python]
+        exclude: '_pb2\.py$'
         entry: bash -c 'uv run ruff check "$@" && uv run ruff format --check "$@"' --
         pass_filenames: true

docs/getting-started/concepts/feast-types.md

Lines changed: 36 additions & 2 deletions
@@ -5,10 +5,44 @@ To make this possible, Feast itself has a type system for all the types it is ab

 Feast's type system is built on top of [protobuf](https://github.com/protocolbuffers/protobuf). The messages that make up the type system can be found [here](https://github.com/feast-dev/feast/blob/master/protos/feast/types/Value.proto), and the corresponding python classes that wrap them can be found [here](https://github.com/feast-dev/feast/blob/master/sdk/python/feast/types.py).

-Feast supports primitive data types (numerical values, strings, bytes, booleans and timestamps). The only complex data type Feast supports is Arrays, and arrays cannot contain other arrays.
+Feast supports the following categories of data types:
+
+- **Primitive types**: numerical values (`Int32`, `Int64`, `Float32`, `Float64`), `String`, `Bytes`, `Bool`, and `UnixTimestamp`.
+- **Array types**: ordered lists of any primitive type, e.g. `Array(Int64)`, `Array(String)`.
+- **Set types**: unordered collections of unique values for any primitive type, e.g. `Set(String)`, `Set(Int64)`.
+- **Map types**: dictionary-like structures with string keys and values that can be any supported Feast type (including nested maps), e.g. `Map`, `Array(Map)`.
+- **JSON type**: opaque JSON data stored as a string at the proto level but semantically distinct from `String` — backends use native JSON types (`jsonb`, `VARIANT`, etc.), e.g. `Json`, `Array(Json)`.
+- **Struct type**: schema-aware structured type with named, typed fields. Unlike `Map` (which is schema-free), a `Struct` declares its field names and their types, enabling schema validation, e.g. `Struct({"name": String, "age": Int32})`.
+
+For a complete reference with examples, see [Type System](../../reference/type-system.md).

 Each feature or schema field in Feast is associated with a data type, which is stored in Feast's [registry](registry.md). These types are also used to ensure that Feast operates on values correctly (e.g. making sure that timestamp columns used for [point-in-time correct joins](point-in-time-joins.md) actually have the timestamp type).

-As a result, each system that feast interacts with needs a way to translate data types from the native platform, into a feast type. E.g., Snowflake SQL types are converted to Feast types [here](https://rtd.feast.dev/en/master/feast.html#feast.type_map.snowflake_python_type_to_feast_value_type). The onus is therefore on authors of offline or online store connectors to make sure that this type mapping happens correctly.
+As a result, each system that Feast interacts with needs a way to translate data types from the native platform into a Feast type. E.g., Snowflake SQL types are converted to Feast types [here](https://rtd.feast.dev/en/master/feast.html#feast.type_map.snowflake_python_type_to_feast_value_type). The onus is therefore on authors of offline or online store connectors to make sure that this type mapping happens correctly.
+
+### Backend Type Mapping for Complex Types
+
+Map, JSON, and Struct types are supported across all major Feast backends:
+
+| Backend | Native Type | Feast Type |
+|---------|-------------|------------|
+| PostgreSQL | `jsonb` | `Map`, `Json`, `Struct` |
+| PostgreSQL | `jsonb[]` | `Array(Map)` |
+| Snowflake | `VARIANT`, `OBJECT` | `Map` |
+| Snowflake | `JSON` | `Json` |
+| Redshift | `SUPER` | `Map` |
+| Redshift | `json` | `Json` |
+| BigQuery | `JSON` | `Json` |
+| BigQuery | `STRUCT`, `RECORD` | `Struct` |
+| Spark | `map<string,string>` | `Map` |
+| Spark | `array<map<string,string>>` | `Array(Map)` |
+| Spark | `struct<...>` | `Struct` |
+| Spark | `array<struct<...>>` | `Array(Struct(...))` |
+| MSSQL | `nvarchar(max)` | `Map`, `Json`, `Struct` |
+| DynamoDB | Proto bytes | `Map`, `Json`, `Struct` |
+| Redis | Proto bytes | `Map`, `Json`, `Struct` |
+| Milvus | `VARCHAR` (serialized) | `Map`, `Json`, `Struct` |
+
+**Note**: When the backend native type is ambiguous (e.g., `jsonb` could be `Map`, `Json`, or `Struct`), the **schema-declared Feast type takes precedence**. The backend-to-Feast type mappings above are only used for schema inference when no explicit type is provided.

 **Note**: Feast currently does *not* support a null type in its type system.
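
The precedence rule in the note on ambiguous backend types can be sketched in plain Python. This is only an illustration, not Feast's implementation: `INFERRED_TYPES` and `resolve_feast_type` are hypothetical names, Feast types are represented as strings, and the lookup copies just a few unambiguous rows from the backend table.

```python
from typing import Optional

# A few unambiguous rows copied from the backend table (hypothetical lookup).
# Keys are (backend, native type); values are Feast type names as strings.
INFERRED_TYPES = {
    ("redshift", "super"): "Map",
    ("bigquery", "json"): "Json",
    ("bigquery", "struct"): "Struct",
    ("spark", "map<string,string>"): "Map",
}


def resolve_feast_type(
    backend: str, native_type: str, declared: Optional[str] = None
) -> str:
    """Hypothetical helper: a schema-declared type wins; otherwise infer."""
    if declared is not None:
        return declared
    key = (backend.lower(), native_type.lower())
    if key not in INFERRED_TYPES:
        raise KeyError(f"no inference rule for {backend} type {native_type!r}")
    return INFERRED_TYPES[key]


# Inference is used only when the schema declares nothing.
print(resolve_feast_type("redshift", "SUPER"))
# An ambiguous native type is settled by the declared type.
print(resolve_feast_type("postgres", "jsonb", declared="Struct"))
```

The sketch deliberately leaves `jsonb` out of the inference table, since the documentation lists three possible Feast types for it and relies on the declared schema to choose.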

docs/getting-started/concepts/feature-view.md

Lines changed: 38 additions & 0 deletions
@@ -24,6 +24,7 @@ Feature views consist of:
 * (optional, but recommended) a schema specifying one or more [features](feature-view.md#field) (without this, Feast will infer the schema by reading from the data source)
 * (optional, but recommended) metadata (for example, description, or other free-form metadata via `tags`)
 * (optional) a TTL, which limits how far back Feast will look when generating historical datasets
+* (optional) `enable_validation=True`, which enables schema validation during materialization (see [Schema Validation](#schema-validation) below)

 Feature views allow Feast to model your existing feature data in a consistent way in both an offline (training) and online (serving) environment. Feature views generally contain features that are properties of a specific object, in which case that object is defined as an entity and included in the feature view.

@@ -159,6 +160,43 @@ Feature names must be unique within a [feature view](feature-view.md#feature-vie

 Each field can have additional metadata associated with it, specified as key-value [tags](https://rtd.feast.dev/en/master/feast.html#feast.field.Field).

+## Schema Validation
+
+Feature views support an optional `enable_validation` parameter that enables schema validation during materialization and historical feature retrieval. When enabled, Feast verifies that:
+
+- All declared feature columns are present in the input data.
+- Column data types match the expected Feast types (mismatches are logged as warnings).
+
+This is useful for catching data quality issues early in the pipeline. To enable it:
+
+```python
+from feast import FeatureView, Field
+from feast.types import Int32, Int64, Float32, Json, Map, String, Struct
+
+validated_fv = FeatureView(
+    name="validated_features",
+    entities=[driver],
+    schema=[
+        Field(name="trips_today", dtype=Int64),
+        Field(name="rating", dtype=Float32),
+        Field(name="preferences", dtype=Map),
+        Field(name="config", dtype=Json),  # opaque JSON data
+        Field(name="address", dtype=Struct({"street": String, "city": String, "zip": Int32})),  # typed struct
+    ],
+    source=my_source,
+    enable_validation=True,  # enables schema checks
+)
+```
+
+**JSON vs Map vs Struct**: These three complex types serve different purposes:
+- **`Map`**: Schema-free dictionary with string keys. Use when the keys and values are dynamic.
+- **`Json`**: Opaque JSON data stored as a string. Backends use native JSON types (`jsonb`, `VARIANT`). Use for configuration blobs or API responses where you don't need field-level typing.
+- **`Struct`**: Schema-aware structured type with named, typed fields. Persisted through the registry via Field tags. Use when you know the exact structure and want type safety.
+
+Validation is supported in all compute engines (Local, Spark, and Ray). When a required column is missing, a `ValueError` is raised. Type mismatches are logged as warnings but do not block execution, allowing for safe gradual adoption.
+
+The `enable_validation` parameter is also available on `BatchFeatureView` and `StreamFeatureView`, as well as their respective decorators (`@batch_feature_view` and `@stream_feature_view`).
+
 ## \[Alpha] On demand feature views

 On demand feature views allows data scientists to use existing features and request time data (features only available at request time) to transform and create new features. Users define python transformation logic which is executed in both the historical retrieval and online retrieval paths.
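
The Schema Validation section in this file's diff describes two outcomes: a missing declared column raises `ValueError`, while a type mismatch is only logged as a warning. A rough standard-library sketch of that behavior follows. It is not the code added in this commit: `validate_columns` is a hypothetical helper, and column types are compared as plain strings rather than Feast or Arrow types.

```python
import logging
from typing import Dict, List

logger = logging.getLogger("validation_sketch")


def validate_columns(expected: Dict[str, str], actual: Dict[str, str]) -> List[str]:
    """Hypothetical check; both arguments map column name -> type name.

    Raises ValueError when a declared column is missing. A type mismatch is
    logged as a warning and the mismatching column names are returned.
    """
    missing = sorted(set(expected) - set(actual))
    if missing:
        raise ValueError(f"missing feature columns: {missing}")
    mismatched = [name for name, dtype in expected.items() if actual[name] != dtype]
    for name in mismatched:
        logger.warning(
            "column %r: expected %s, got %s", name, expected[name], actual[name]
        )
    return mismatched


expected = {"trips_today": "Int64", "rating": "Float32"}
# The mismatch on "rating" is reported but does not stop execution.
print(validate_columns(expected, {"trips_today": "Int64", "rating": "Float64"}))
```

Extra columns in the input are ignored in this sketch; whether Feast treats them the same way is not stated in the documentation above.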

docs/how-to-guides/dbt-integration.md

Lines changed: 6 additions & 0 deletions
@@ -289,6 +289,12 @@ Feast automatically maps dbt/warehouse column types to Feast types:
 | `TIMESTAMP`, `DATETIME` | `UnixTimestamp` |
 | `BYTES`, `BINARY` | `Bytes` |
 | `ARRAY<type>` | `Array(type)` |
+| `JSON`, `JSONB` | `Map` (or `Json` if declared in schema) |
+| `VARIANT`, `OBJECT` | `Map` |
+| `SUPER` | `Map` |
+| `MAP<string,string>` | `Map` |
+| `STRUCT`, `RECORD` | `Struct` (BigQuery) |
+| `struct<...>` | `Struct` (Spark) |

 Snowflake `NUMBER(precision, scale)` types are handled specially:
 - Scale > 0: `Float64`

docs/specs/offline_store_format.md

Lines changed: 32 additions & 0 deletions
@@ -49,6 +49,12 @@ Here's how Feast types map to Pandas types for Feast APIs that take in or return
 | DOUBLE\_LIST | `list[float]`|
 | FLOAT\_LIST | `list[float]`|
 | BOOL\_LIST | `list[bool]`|
+| MAP | `dict` (`Dict[str, Any]`)|
+| MAP\_LIST | `list[dict]` (`List[Dict[str, Any]]`)|
+| JSON | `object` (parsed Python dict/list/str)|
+| JSON\_LIST | `list[object]`|
+| STRUCT | `dict` (`Dict[str, Any]`)|
+| STRUCT\_LIST | `list[dict]` (`List[Dict[str, Any]]`)|

 Note that this mapping is non-injective, that is more than one Pandas type may corresponds to one Feast type (but not vice versa). In these cases, when converting Feast values to Pandas, the **first** Pandas type in the table above is used.

@@ -78,6 +84,12 @@ Here's how Feast types map to BigQuery types when using BigQuery for offline sto
 | DOUBLE\_LIST | `ARRAY<FLOAT64>`|
 | FLOAT\_LIST | `ARRAY<FLOAT64>`|
 | BOOL\_LIST | `ARRAY<BOOL>`|
+| MAP | `JSON` / `STRUCT` |
+| MAP\_LIST | `ARRAY<JSON>` / `ARRAY<STRUCT>` |
+| JSON | `JSON` |
+| JSON\_LIST | `ARRAY<JSON>` |
+| STRUCT | `STRUCT` / `RECORD` |
+| STRUCT\_LIST | `ARRAY<STRUCT>` |

 Values that are not specified by the table above will cause an error on conversion.

@@ -94,3 +106,23 @@ https://docs.snowflake.com/en/user-guide/python-connector-pandas.html#snowflake-
 | INT32 | `INT8 / UINT8 / INT16 / UINT16 / INT32 / UINT32` |
 | INT64 | `INT64 / UINT64` |
 | DOUBLE | `FLOAT64` |
+| MAP | `VARIANT` / `OBJECT` |
+| JSON | `JSON` / `VARIANT` |
+
+#### Redshift Types
+Here's how Feast types map to Redshift types when using Redshift for offline storage:
+
+| Feast Type | Redshift Type |
+|-------------|--|
+| Event Timestamp | `TIMESTAMP` / `TIMESTAMPTZ` |
+| BYTES | `VARBYTE` |
+| STRING | `VARCHAR` |
+| INT32 | `INT4` / `SMALLINT` |
+| INT64 | `INT8` / `BIGINT` |
+| DOUBLE | `FLOAT8` / `DOUBLE PRECISION` |
+| FLOAT | `FLOAT4` / `REAL` |
+| BOOL | `BOOL` |
+| MAP | `SUPER` |
+| JSON | `json` / `SUPER` |
+
+Note: Redshift's `SUPER` type stores semi-structured JSON data. During materialization, Feast automatically handles `SUPER` columns that are exported as JSON strings by parsing them back into Python dictionaries before converting to `MAP` proto values.
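
The Redshift note describes parsing `SUPER` values that arrive as JSON strings back into dictionaries before the `MAP` conversion. A minimal sketch of that parsing step, using only the standard library, might look like this. `parse_super_value` is a hypothetical name, and the real conversion to proto values is not shown.

```python
import json
from typing import Any


def parse_super_value(value: Any) -> Any:
    """Hypothetical helper: decode a SUPER value exported as a JSON string."""
    if isinstance(value, str):
        parsed = json.loads(value)
        if not isinstance(parsed, dict):
            # A MAP needs key/value pairs, so reject JSON arrays and scalars.
            raise TypeError(f"expected a JSON object, got {type(parsed).__name__}")
        return parsed
    return value  # already a dict (or None): leave untouched


exported = '{"vehicle_type": "suv", "rating": "4.7"}'
print(parse_super_value(exported))
```

Values that are already dictionaries pass through unchanged, so the helper can be applied to a column regardless of how the export serialized it.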

protos/feast/core/FeatureView.proto

Lines changed: 4 additions & 1 deletion
@@ -36,7 +36,7 @@ message FeatureView {
     FeatureViewMeta meta = 2;
 }

-// Next available id: 17
+// Next available id: 18
 // TODO(adchia): refactor common fields from this and ODFV into separate metadata proto
 message FeatureViewSpec {
     // Name of the feature view. Must be unique. Not updated.
@@ -89,6 +89,9 @@ message FeatureViewSpec {

     // The transformation mode (e.g., "python", "pandas", "spark", "sql", "ray")
     string mode = 16;
+
+    // Whether schema validation is enabled during materialization
+    bool enable_validation = 17;
 }

 message FeatureViewMeta {

protos/feast/core/StreamFeatureView.proto

Lines changed: 4 additions & 1 deletion
@@ -37,7 +37,7 @@ message StreamFeatureView {
     FeatureViewMeta meta = 2;
 }

-// Next available id: 20
+// Next available id: 21
 message StreamFeatureViewSpec {
     // Name of the feature view. Must be unique. Not updated.
     string name = 1;
@@ -99,5 +99,8 @@ message StreamFeatureViewSpec {
     // Hop size for tiling (e.g., 5 minutes). Determines the granularity of pre-aggregated tiles.
     // If not specified, defaults to 5 minutes. Only used when enable_tiling is true.
     google.protobuf.Duration tiling_hop_size = 19;
+
+    // Whether schema validation is enabled during materialization
+    bool enable_validation = 20;
 }

protos/feast/types/Value.proto

Lines changed: 8 additions & 0 deletions
@@ -53,6 +53,10 @@ message ValueType {
         FLOAT_SET = 27;
         BOOL_SET = 28;
         UNIX_TIMESTAMP_SET = 29;
+        JSON = 32;
+        JSON_LIST = 33;
+        STRUCT = 34;
+        STRUCT_LIST = 35;
     }
 }

@@ -88,6 +92,10 @@ message Value {
         FloatSet float_set_val = 27;
         BoolSet bool_set_val = 28;
         Int64Set unix_timestamp_set_val = 29;
+        string json_val = 32;
+        StringList json_list_val = 33;
+        Map struct_val = 34;
+        MapList struct_list_val = 35;
     }
 }
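
The new `Value` fields carry JSON as a plain `string` (`json_val`) and structs as a `Map` (`struct_val`). The sketch below mimics that split in plain Python to show what each value looks like before it reaches the proto layer. It makes several assumptions: `to_json_val`, `check_struct`, and `PROFILE_FIELDS` are hypothetical names, Python types stand in for Feast types, and the strict field check is an illustration rather than Feast's actual struct handling.

```python
import json

# Stand-in for Struct({"name": String, "age": Int32}); Python types replace
# Feast types here purely for illustration.
PROFILE_FIELDS = {"name": str, "age": int}


def to_json_val(obj) -> str:
    """Json values travel as one serialized string (cf. `string json_val`)."""
    return json.dumps(obj, sort_keys=True)


def check_struct(value: dict, fields: dict) -> dict:
    """Struct values stay dict-shaped but are checked against declared fields."""
    if set(value) != set(fields):
        raise ValueError(f"expected fields {sorted(fields)}, got {sorted(value)}")
    for name, expected_type in fields.items():
        if not isinstance(value[name], expected_type):
            raise TypeError(f"field {name!r} is not {expected_type.__name__}")
    return value


print(to_json_val({"max_distance_km": 120, "preferred_zones": ["north", "east"]}))
print(check_struct({"name": "driver_1001", "age": 41}, PROFILE_FIELDS))
```

Keeping JSON opaque means nothing inside the string is checked, whereas the struct path can reject a value whose fields do not match the declaration.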

sdk/python/feast/batch_feature_view.py

Lines changed: 4 additions & 0 deletions
@@ -97,6 +97,7 @@ def __init__(
         feature_transformation: Optional[Transformation] = None,
         batch_engine: Optional[Dict[str, Any]] = None,
         aggregations: Optional[List[Aggregation]] = None,
+        enable_validation: bool = False,
     ):
         if not flags_helper.is_test():
             warnings.warn(
@@ -136,6 +137,7 @@ def __init__(
             source=source,  # type: ignore[arg-type]
             sink_source=sink_source,
             mode=mode,
+            enable_validation=enable_validation,
         )

     def get_feature_transformation(self) -> Optional[Transformation]:
@@ -169,6 +171,7 @@ def batch_feature_view(
     description: str = "",
     owner: str = "",
     schema: Optional[List[Field]] = None,
+    enable_validation: bool = False,
 ):
     """
     Creates a BatchFeatureView object with the given user-defined function (UDF) as the transformation.
@@ -199,6 +202,7 @@ def decorator(user_function):
             schema=schema,
             udf=user_function,
             udf_string=udf_string,
+            enable_validation=enable_validation,
         )
         functools.update_wrapper(wrapper=batch_feature_view_obj, wrapped=user_function)
         return batch_feature_view_obj

sdk/python/feast/driver_test_data.py

Lines changed: 29 additions & 1 deletion
@@ -136,10 +136,38 @@ def create_driver_hourly_stats_df(drivers, start_date, end_date) -> pd.DataFrame
     df_all_drivers["conv_rate"] = np.random.random(size=rows).astype(np.float32)
     df_all_drivers["acc_rate"] = np.random.random(size=rows).astype(np.float32)
     df_all_drivers["avg_daily_trips"] = np.random.randint(0, 1000, size=rows).astype(
-        np.int32
+        np.int64
     )
     df_all_drivers["created"] = pd.to_datetime(pd.Timestamp.now(tz=None).round("ms"))

+    # Complex type columns for Map, Json, and Struct examples
+    import json as _json
+
+    df_all_drivers["driver_metadata"] = [
+        {
+            "vehicle_type": np.random.choice(["sedan", "suv", "truck"]),
+            "rating": str(round(np.random.uniform(3.0, 5.0), 1)),
+        }
+        for _ in range(len(df_all_drivers))
+    ]
+    df_all_drivers["driver_config"] = [
+        _json.dumps(
+            {
+                "max_distance_km": int(np.random.randint(10, 200)),
+                "preferred_zones": list(
+                    np.random.choice(
+                        ["north", "south", "east", "west"], size=2, replace=False
+                    )
+                ),
+            }
+        )
+        for _ in range(len(df_all_drivers))
+    ]
+    df_all_drivers["driver_profile"] = [
+        {"name": f"driver_{driver_id}", "age": str(int(np.random.randint(25, 60)))}
+        for driver_id in df_all_drivers["driver_id"]
+    ]
+
     # Create duplicate rows that should be filtered by created timestamp
     # TODO: These duplicate rows area indirectly being filtered out by the point in time join already. We need to
     # inject a bad row at a timestamp where we know it will get joined to the entity dataframe, and then test that
