
#863 cat features support for onnx-exporter #2999

Open
SergeevVladislav wants to merge 1 commit into catboost:master from SergeevVladislav:onnx-category-features


SergeevVladislav commented Jan 9, 2026

splitFlatFeatureIdx = split.OneHotFeature.CatFeatureIdx;
nodeMode = TModeNode::BRANCH_EQ;
// In ONNX, categorical values are represented as integers
splitValue = static_cast<float>(split.OneHotFeature.Value);

split.OneHotFeature.Value is a hashed 32-bit value (see its assignment here). CatBoost converts input categorical values represented as strings (even integers are converted to strings for this) to hashed 32-bit values using this function.

So you cannot just convert it to float.

I think the proper way for ONNX export is to do this in a manner similar to how it is done for PMML export:

  • enumerate all possible values of each categorical feature,
  • map the source string values to these enumerated ids using LabelEncoder, except that the enumerated ids would be floats for ONNX, as ONNX tree operators only accept float tensors as input,
  • concatenate the numerical feature tensor and the converted categorical feature tensor.

Only after that you can use this BRANCH_EQ comparison on these converted values.
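
The three steps above can be sketched in plain C++ as follows. This is only an illustration of the data flow, not the exporter code: in the exported model the mapping would be done by a LabelEncoder node in the ONNX graph, and the names `EnumerateCategoryValues`, `BuildFlatRow` and the `unknownId` fallback of -1 are hypothetical choices made for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: assigns consecutive float ids (0, 1, 2, ...) to the
// distinct string values of one categorical feature, in order of first
// appearance. This plays the role that LabelEncoder would play in the graph.
std::map<std::string, float> EnumerateCategoryValues(const std::vector<std::string>& values) {
    std::map<std::string, float> valueToId;
    for (const auto& value : values) {
        // emplace does nothing if the value already has an id
        valueToId.emplace(value, static_cast<float>(valueToId.size()));
    }
    return valueToId;
}

// Hypothetical helper: builds the single float row a tree operator would
// consume, with numerical features first and encoded categorical features
// after them. Values missing from an encoder get unknownId (an assumption).
std::vector<float> BuildFlatRow(
    const std::vector<float>& numericalFeatures,
    const std::vector<std::string>& categoricalFeatures,
    const std::vector<std::map<std::string, float>>& encoders,
    float unknownId = -1.0f)
{
    assert(categoricalFeatures.size() == encoders.size());
    std::vector<float> row(numericalFeatures);
    for (std::size_t i = 0; i < categoricalFeatures.size(); ++i) {
        const auto it = encoders[i].find(categoricalFeatures[i]);
        row.push_back(it == encoders[i].end() ? unknownId : it->second);
    }
    return row;
}
```

With this layout a BRANCH_EQ node compares against the enumerated id of a category, not against its hash.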

PMML export code for one-hot features:

splitValue = split.FloatFeature.Split;
} else if (split.Type == ESplitType::OneHotFeature) {
// For categorical features, we use one-hot encoding with equality comparison
splitFlatFeatureIdx = split.OneHotFeature.CatFeatureIdx;

CatFeatureIdx is an index among categorical features only, whereas ONNX tree operators use a single index across all features (in CatBoost we usually call such indices flat).

If you concatenate all numerical and categorical features (with numerical features first) as I suggested below, then the flat index for a categorical feature will be countOfAllNumericalFeatures + split.OneHotFeature.CatFeatureIdx.
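
A minimal sketch of that index computation, assuming the input tensor is laid out as all numerical features followed by all categorical features. The function name `CatFeatureFlatIndex` is hypothetical and does not exist in CatBoost.

```cpp
#include <cassert>

// Flat index of a categorical feature when the input tensor is laid out as
// [numerical features..., categorical features...].
int CatFeatureFlatIndex(int numericalFeatureCount, int catFeatureIdx) {
    assert(numericalFeatureCount >= 0 && catFeatureIdx >= 0);
    return numericalFeatureCount + catFeatureIdx;
}
```

For example, with 5 numerical features the categorical feature with CatFeatureIdx 2 gets flat index 7.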

// For categorical features, missing values typically don't track true
missingValueTracksTrue = 0;
} else {
CB_ENSURE_INTERNAL(

This is an internal error only because the absence of categorical features was already checked earlier in model_exporter.cpp.
This PR removes that check, so the condition can now be triggered by user input. For such errors the CB_ENSURE macro should be used instead.

// For EQ/NEQ modes, we treat them as categorical (one-hot) feature splits
split.Type = ESplitType::OneHotFeature;
split.OneHotFeature.CatFeatureIdx = treesAttributes.nodes_featureids->ints(idx);
split.OneHotFeature.Value = static_cast<int>(treesAttributes.nodes_values->floats(idx));

split.OneHotFeature.Value is a hashed 32-bit value (see its assignment here). CatBoost converts input categorical values represented as strings (even integers are converted to strings for this) to hashed 32-bit values using this function.

If you want to treat float tensor input as categorical features, you will need to:

  • convert the float values to integers first (and check that they do represent integers, with no fractional part),
  • convert these integers to TString using the ToString(intValue) function call,
  • assign split.OneHotFeature.Value the hashed value computed from the string representation using CalcCatFeatureHash(stringValue).
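
The conversion steps above could look roughly like this in standalone C++. This is a sketch under assumptions: `PlaceholderCatFeatureHash` is a stand-in built on `std::hash` and does NOT produce the values of CatBoost's CalcCatFeatureHash, `std::to_string` stands in for ToString, and the helper names and the range limit are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <functional>
#include <string>

// Stand-in for CalcCatFeatureHash. NOT the real CatBoost hash function;
// it only shows where the hashing step belongs.
std::uint32_t PlaceholderCatFeatureHash(const std::string& value) {
    return static_cast<std::uint32_t>(std::hash<std::string>{}(value));
}

// Writes the decimal string of a float that holds an exact integer.
// Returns false for NaN, infinities, fractional values, and values too
// large to convert to long long safely.
bool FloatToCategoryString(float value, std::string* result) {
    assert(result != nullptr);
    if (!std::isfinite(value) || std::trunc(value) != value) {
        return false;
    }
    if (std::fabs(value) >= 9.0e18f) {
        return false;
    }
    *result = std::to_string(static_cast<long long>(value));
    return true;
}

// Full pipeline: float -> integer check -> string -> hashed 32-bit value.
bool HashedCategoryFromFloat(float value, std::uint32_t* hash) {
    assert(hash != nullptr);
    std::string stringValue;
    if (!FloatToCategoryString(value, &stringValue)) {
        return false;
    }
    *hash = PlaceholderCatFeatureHash(stringValue);
    return true;
}
```

Rejecting non-integer thresholds explicitly avoids silently truncating a value such as 2.5 into category "2".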

} else if (nodeMode == TModeNode::BRANCH_EQ || nodeMode == TModeNode::BRANCH_NEQ) {
// For EQ/NEQ modes, we treat them as categorical (one-hot) feature splits
split.Type = ESplitType::OneHotFeature;
split.OneHotFeature.CatFeatureIdx = treesAttributes.nodes_featureids->ints(idx);

nodes_featureids->ints will contain an index among all features (both numerical and categorical; such indices are called flat in CatBoost), but split.OneHotFeature.CatFeatureIdx must contain an index among categorical features only.

Some special logic is needed to distinguish which features can be considered numerical and which categorical. I am not sure what it should be. Perhaps check which features participate only in equality comparisons (BRANCH_EQ, BRANCH_NEQ) with integer values and consider them categorical, while features that are compared only using BRANCH_GTE, BRANCH_GT, BRANCH_LT, BRANCH_LTE should be considered numerical. Then a special mapping from a flat index to a per-type index has to be created.
Something similar to what is done in CoreML import with InputIndexToPerTypeIndex, see here
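
One way to sketch the heuristic suggested above, in standalone C++. Everything here is hypothetical (the `TNodeSplit` and `TPerTypeIndex` structs and `BuildFlatToPerTypeIndex` are not CatBoost types), and it encodes one possible reading of the rule: a feature counts as categorical only if every split on it is an equality comparison with an integer-valued threshold. Features that appear in no split are not covered.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical flattened view of one ONNX tree node.
struct TNodeSplit {
    int FlatFeatureIdx;
    bool IsEqualityMode;  // true for BRANCH_EQ / BRANCH_NEQ
    float Value;
};

// Hypothetical result entry: feature type and index among that type.
struct TPerTypeIndex {
    bool IsCategorical;
    int Index;
};

std::map<int, TPerTypeIndex> BuildFlatToPerTypeIndex(const std::vector<TNodeSplit>& splits) {
    // Pass 1: a feature stays categorical only while every split on it is
    // an equality comparison against an integer-valued threshold.
    std::map<int, bool> isCategorical;
    for (const auto& split : splits) {
        const bool looksCategorical =
            split.IsEqualityMode && std::trunc(split.Value) == split.Value;
        const auto it = isCategorical.find(split.FlatFeatureIdx);
        if (it == isCategorical.end()) {
            isCategorical[split.FlatFeatureIdx] = looksCategorical;
        } else {
            it->second = it->second && looksCategorical;
        }
    }
    // Pass 2: std::map iterates in increasing flat index order, so the
    // per-type indices preserve the relative order of the flat indices.
    std::map<int, TPerTypeIndex> result;
    int floatCount = 0;
    int catCount = 0;
    for (const auto& entry : isCategorical) {
        result[entry.first] = entry.second
            ? TPerTypeIndex{true, catCount++}
            : TPerTypeIndex{false, floatCount++};
    }
    assert(static_cast<std::size_t>(floatCount + catCount) == result.size());
    return result;
}
```

A feature that is used in both equality and ordering comparisons is treated as numerical here; whether that is the right call for imported models is an open design question.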

(*floatFeatures)[split.FloatFeature.FloatFeature].NanValueTreatment =
ENanValueTreatment::AsTrue;
}
floatFeatureBorders[split.FloatFeature.FloatFeature].insert(split.FloatFeature.Split);

Do not copy-paste this code. Leave the existing swapping logic and the single code path for numeric features as is, and only add new logic for BRANCH_EQ / BRANCH_NEQ.
