#863 cat features support for onnx-exporter #2999
SergeevVladislav wants to merge 1 commit into catboost:master
Conversation
| splitFlatFeatureIdx = split.OneHotFeature.CatFeatureIdx; | ||
| nodeMode = TModeNode::BRANCH_EQ; | ||
| // In ONNX, categorical values are represented as integers | ||
| splitValue = static_cast<float>(split.OneHotFeature.Value); |
split.OneHotFeature.Value is a hashed 32-bit value (see its assignment here). CatBoost converts input categorical values represented as strings (even integers are converted to strings for this) to hashed 32-bit values using this function.
So you cannot just convert it to float.
I think the proper way for ONNX export is to do this in a manner similar to how it is done for PMML export:
- enumerate all possible values,
- map from source string values to these enumerated ids using LabelEncoder, except that the enumerated ids would be float for ONNX, as ONNX tree operators only accept float tensors as input,
- concatenate the numerical features tensor and the converted categorical features tensor.
Only after that you can use this BRANCH_EQ comparison on these converted values.
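The steps above can be sketched roughly as follows. This is only an illustration of the idea, not the PMML or ONNX exporter code: `EnumerateCategories` and `ConcatFeatures` are hypothetical names, and the sketch assumes the list of source string values for a feature is available at export time. Note that a 32-bit hash generally cannot survive a cast to float (a float holds only 24 bits of mantissa), which is why small enumerated ids are used instead.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: assigns consecutive float ids 0, 1, 2, ... to the
// distinct category strings of one feature, in first-seen order. This is the
// mapping a LabelEncoder-style node would hold.
std::map<std::string, float> EnumerateCategories(const std::vector<std::string>& categories) {
    std::map<std::string, float> categoryToId;
    for (const std::string& category : categories) {
        // Ids stay exactly representable as float only below 2^24.
        assert(categoryToId.size() < (1u << 24));
        // emplace leaves the map unchanged if the category already has an id.
        categoryToId.emplace(category, static_cast<float>(categoryToId.size()));
    }
    return categoryToId;
}

// Hypothetical helper: builds one flat input row with numerical features
// first and the encoded categorical features after them.
std::vector<float> ConcatFeatures(
    const std::vector<float>& numerical,
    const std::vector<float>& encodedCategorical
) {
    std::vector<float> row(numerical);
    row.insert(row.end(), encodedCategorical.begin(), encodedCategorical.end());
    return row;
}
```

With this layout, a BRANCH_EQ node would compare against the enumerated id of the category rather than against the hash itself.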
PMML export code for one-hot features:
| splitValue = split.FloatFeature.Split; | ||
| } else if (split.Type == ESplitType::OneHotFeature) { | ||
| // For categorical features, we use one-hot encoding with equality comparison | ||
| splitFlatFeatureIdx = split.OneHotFeature.CatFeatureIdx; |
CatFeatureIdx is an index among categorical features only.
And ONNX tree operators use a single index for all features (in CatBoost we usually call such indices flat).
If you concatenate all numerical and categorical features (with numerical features first), as I suggested in my other comment, then the flat index for a categorical feature will be countOfAllNumericalFeatures + split.OneHotFeature.CatFeatureIdx.
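A minimal sketch of that index arithmetic, assuming the numerical-first layout described in the comment. `CatFeatureFlatIdx` and its parameter names are hypothetical and do not come from the CatBoost code base.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: converts an index among categorical features only
// into a flat index over all features, assuming numerical features are
// placed first in the concatenated input.
size_t CatFeatureFlatIdx(size_t numericalFeatureCount, size_t catFeatureCount, size_t catFeatureIdx) {
    assert(catFeatureIdx < catFeatureCount);
    return numericalFeatureCount + catFeatureIdx;
}
```

For example, with 3 numerical and 2 categorical features, categorical feature 1 would get flat index 4.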
| // For categorical features, missing values typically don't track true | ||
| missingValueTracksTrue = 0; | ||
| } else { | ||
| CB_ENSURE_INTERNAL( |
It is an internal error only because the absence of categorical features was checked earlier in model_exporter.cpp.
This PR removes that check, so this is now a valid user-facing error. For such errors the CB_ENSURE macro should be used instead.
| // For EQ/NEQ modes, we treat them as categorical (one-hot) feature splits | ||
| split.Type = ESplitType::OneHotFeature; | ||
| split.OneHotFeature.CatFeatureIdx = treesAttributes.nodes_featureids->ints(idx); | ||
| split.OneHotFeature.Value = static_cast<int>(treesAttributes.nodes_values->floats(idx)); |
split.OneHotFeature.Value is a hashed 32-bit value (see its assignment here). CatBoost converts input categorical values represented as strings (even integers are converted to strings for this) to hashed 32-bit values using this function.
If you want to treat float tensor input as categorical features, you will need to:
- convert float values to integers first (and check that they indeed represent integers without a fractional part),
- convert these integers to TString using the ToString(intValue) function call,
- assign split.OneHotFeature.Value to the hashed value computed from the string representation using CalcCatFeatureHash(stringValue).
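The three steps above could look roughly like the sketch below. It is written with the standard library only so that it stands on its own: `std::to_string` plays the role of ToString, and `PlaceholderCatFeatureHash` is a made-up stand-in for CalcCatFeatureHash (it is a plain FNV-1a hash, not the hash CatBoost uses). `TryHashIntegralFloat` is likewise a hypothetical name, and limiting the input to the int32 range is an assumption of this sketch.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <string>

// Placeholder hash (32-bit FNV-1a). It only stands in for CatBoost's
// CalcCatFeatureHash so that this sketch compiles on its own; it is NOT
// the hash CatBoost uses.
uint32_t PlaceholderCatFeatureHash(const std::string& value) {
    uint32_t hash = 2166136261u;
    for (unsigned char c : value) {
        hash ^= c;
        hash *= 16777619u;
    }
    return hash;
}

// Hypothetical helper: returns false if the float is not a whole number in
// the int32 range; otherwise stores the hash of its decimal string form.
bool TryHashIntegralFloat(float value, uint32_t* hashedValue) {
    assert(hashedValue != nullptr);
    // Step 1: reject NaN, infinities and values with a fractional part.
    if (!std::isfinite(value) || std::trunc(value) != value) {
        return false;
    }
    // Keep the cast below well defined (assumption: int32 range is enough).
    if (value < -2147483648.0f || value >= 2147483648.0f) {
        return false;
    }
    // Step 2: integer to string, e.g. 3.0f becomes "3".
    const std::string asString = std::to_string(static_cast<int32_t>(value));
    // Step 3: hash the string representation.
    *hashedValue = PlaceholderCatFeatureHash(asString);
    return true;
}
```

In the importer, a `false` result would presumably be reported as a user-facing error, since the ONNX node value cannot be interpreted as a category.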
| } else if (nodeMode == TModeNode::BRANCH_EQ || nodeMode == TModeNode::BRANCH_NEQ) { | ||
| // For EQ/NEQ modes, we treat them as categorical (one-hot) feature splits | ||
| split.Type = ESplitType::OneHotFeature; | ||
| split.OneHotFeature.CatFeatureIdx = treesAttributes.nodes_featureids->ints(idx); |
nodes_featureids->ints will contain an index among all features (both numerical and categorical; such indices are called flat in CatBoost), but split.OneHotFeature.CatFeatureIdx must contain an index among categorical features only.
Some special logic is needed to decide which features can be considered numerical and which categorical. I am not sure what it should be. Perhaps features that participate only in equality comparisons (BRANCH_EQ, BRANCH_NEQ) with integer values should be considered categorical, and features that are compared only using BRANCH_GTE, BRANCH_GT, BRANCH_LT, BRANCH_LTE should be considered numerical. Then a special mapping from a flat index to a per-type index has to be created.
Something similar to what is done in CoreML import with InputIndexToPerTypeIndex, see here
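One possible shape for that heuristic and mapping is sketched below. It is not the CoreML import code; all type and function names here (`ENodeMode`, `TNode`, `TPerTypeIndex`, `BuildFlatToPerTypeIndex`) are hypothetical. The sketch also assumes that a feature used in both equality and ordering comparisons, or never used at all, is treated as numerical; a real importer might prefer to reject such models.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

enum class ENodeMode { BranchLeq, BranchLt, BranchGte, BranchGt, BranchEq, BranchNeq, Leaf };

struct TNode {
    ENodeMode Mode;
    size_t FlatFeatureIdx;
    float Value;
};

struct TPerTypeIndex {
    bool IsCategorical;
    size_t Index; // index among features of the same type
};

std::vector<TPerTypeIndex> BuildFlatToPerTypeIndex(const std::vector<TNode>& nodes, size_t featureCount) {
    std::vector<bool> usedInEquality(featureCount, false);
    std::vector<bool> disqualified(featureCount, false);
    for (const TNode& node : nodes) {
        if (node.Mode == ENodeMode::Leaf) {
            continue;
        }
        assert(node.FlatFeatureIdx < featureCount);
        if (node.Mode == ENodeMode::BranchEq || node.Mode == ENodeMode::BranchNeq) {
            usedInEquality[node.FlatFeatureIdx] = true;
            // Equality against a non-integer value rules out "categorical".
            if (!std::isfinite(node.Value) || std::trunc(node.Value) != node.Value) {
                disqualified[node.FlatFeatureIdx] = true;
            }
        } else {
            // Any ordering comparison rules out "categorical".
            disqualified[node.FlatFeatureIdx] = true;
        }
    }
    std::vector<TPerTypeIndex> result(featureCount);
    size_t numericalCount = 0;
    size_t categoricalCount = 0;
    for (size_t flatIdx = 0; flatIdx < featureCount; ++flatIdx) {
        const bool isCategorical = usedInEquality[flatIdx] && !disqualified[flatIdx];
        result[flatIdx] = {isCategorical, isCategorical ? categoricalCount++ : numericalCount++};
    }
    return result;
}
```

The `Index` of a categorical entry is what would go into split.OneHotFeature.CatFeatureIdx, while the `Index` of a numerical entry would be used for float feature splits.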
| (*floatFeatures)[split.FloatFeature.FloatFeature].NanValueTreatment = | ||
| ENanValueTreatment::AsTrue; | ||
| } | ||
| floatFeatureBorders[split.FloatFeature.FloatFeature].insert(split.FloatFeature.Split); |
Do not add copy-pasted code. Leave the logic with swapping and the single code path for numeric features as is, and only add new logic for BRANCH_EQ / BRANCH_NEQ.