Handle Unknown Categories Using OneHotEncoder

Nov 7, 2020

How do you deal with unknown categories that were not part of your training set?

Answer: set handle_unknown='ignore' in OneHotEncoder.

Consider the following training data set:

Here 'Model' is a categorical variable that we want to encode using OneHotEncoder.

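from sklearn.preprocessing import OneHotEncoder

# sparse_output=False returns a dense array (on scikit-learn older than 1.2, use sparse=False)
enc = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
enc_fit = enc.fit(vw_train[['Model']])
enc_fit.transform(vw_train[['Model']])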
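
To make the fitting step concrete, here is a minimal sketch on a made-up training frame. The model names are illustrative assumptions, not the data set used in this post. It shows that the encoder learns one output column per category seen during fit and stores them in sorted order in categories_.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data; the values are made up for illustration.
train = pd.DataFrame({"Model": ["Golf", "Polo", "Passat", "Golf"]})

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train[["Model"]])

# Categories learned during fit: one output column per entry, in sorted order.
learned = list(enc.categories_[0])
print(learned)

# transform returns a sparse matrix by default; toarray() converts it to dense.
encoded = enc.transform(train[["Model"]]).toarray()
print(encoded)
```

Calling toarray() on the default sparse output sidesteps the sparse / sparse_output naming difference between scikit-learn versions.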

Now, if the testing set contains new categories, the fitted encoder handles them automatically and encodes each one as a row of all 0's.

In effect, all unknown categories are lumped into a single implicit category, let's say 'other', because every one of them gets the same encoding.

Consider the following testing data set:

Encode the testing data set with the encoder fitted on the training data:

enc_fit.transform(vw_test[['Model']])

All unknown categories are encoded in the same way, as a row of zeros, so the model cannot tell one unseen category from another. They effectively form one new category.
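
The behaviour described above can be checked with a small self-contained sketch. The frames below are made-up stand-ins for the training and testing sets (the model names are assumptions chosen for illustration): two models in the test frame never appear in training, and both come out as all-zero rows.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: "Tiguan" and "Arteon" do not appear in the training frame.
train = pd.DataFrame({"Model": ["Golf", "Polo", "Passat"]})
test = pd.DataFrame({"Model": ["Polo", "Tiguan", "Arteon"]})

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train[["Model"]])

# Columns follow enc.categories_[0], i.e. Golf, Passat, Polo.
encoded_test = enc.transform(test[["Model"]]).toarray()
print(encoded_test)

# A row that sums to zero belongs to a category not seen during fit.
unknown_rows = encoded_test.sum(axis=1) == 0
print(unknown_rows)
```

Note that the two unseen models are indistinguishable after encoding, which is exactly the implicit 'other' category.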

If we change handle_unknown to 'error' instead, transform raises an error as soon as it finds an unknown category.

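# sparse_output=False needs scikit-learn 1.2+ (on older versions, use sparse=False)
enc = OneHotEncoder(handle_unknown='error', sparse_output=False)
enc_fit = enc.fit(vw_train[['Model']])
# Raises ValueError because vw_test contains categories not seen during fit
enc_fit.transform(vw_test[['Model']])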
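
As a rough illustration of the failure mode, the sketch below catches the exception instead of letting it stop the script. The data is again made up for the example, and the exact wording of the error message may differ between scikit-learn versions.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: "Tiguan" is not present in the training frame.
train = pd.DataFrame({"Model": ["Golf", "Polo", "Passat"]})
test = pd.DataFrame({"Model": ["Polo", "Tiguan"]})

enc = OneHotEncoder(handle_unknown="error")
enc.fit(train[["Model"]])

try:
    enc.transform(test[["Model"]])
    raised = False
except ValueError as exc:
    # scikit-learn reports the unknown categories in the exception message.
    raised = True
    print(f"transform failed: {exc}")
```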

If you need to handle unknown categories frequently, handle_unknown='ignore' is a good option. Later on, you can add the unknown categories to the training set and re-train your model.
Another option is to keep handle_unknown='error' (the default) and not make a prediction at all when an unknown category is found.
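
Both options need a way to spot unknown categories before scoring. One possible approach, sketched here on made-up data, is to compare incoming values against the categories_ learned by the encoder. The frame and variable names (incoming, new_categories, scorable) are hypothetical, and this is only one way to do it.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data and a hypothetical batch of rows to score.
train = pd.DataFrame({"Model": ["Golf", "Polo", "Passat"]})
incoming = pd.DataFrame({"Model": ["Golf", "Tiguan", "Polo", "Arteon"]})

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train[["Model"]])

# Flag rows whose category was not seen during fit.
known = list(enc.categories_[0])
is_unknown = ~incoming["Model"].isin(known)

# Categories to review and possibly add to the training set before re-training.
new_categories = sorted(incoming.loc[is_unknown, "Model"].unique())
print(new_categories)

# Rows that are safe to score; the flagged rows can be skipped or sent to a fallback.
scorable = incoming.loc[~is_unknown]
print(scorable)
```

Logging new_categories over time gives a simple signal for when re-training is worthwhile.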

You can download the full source code from my GitHub repository.