Handle Unknown Categories Using OneHotEncoder

Priyanka Dave
Nov 7, 2020

How do you deal with unknown categories that were not part of your training set?

Answer: set handle_unknown='ignore' in OneHotEncoder.

Example:

Consider the following training data set:

vw_train

Here ‘Model’ is a categorical variable which we want to encode using OneHotEncoder.

Code snippet:

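from sklearn.preprocessing import OneHotEncoder

# sparse_output applies to scikit-learn 1.2+; on older versions use sparse=False
enc = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

enc_fit = enc.fit(vw_train[['Model']])

enc_fit.transform(vw_train[['Model']])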

Output:

vw_train_transformed

If the testing set contains categories that were not seen during training, the fitted encoder handles them automatically and encodes each of them as all 0's.

In effect, every unknown category is treated as one extra category, let's say 'other', because all of them get the same encoding.

Consider the following testing data set:

vw_test

Encode the testing data set with the encoder fitted on the training data.

Code snippet:

enc_fit.transform(vw_test[['Model']])

Output:

vw_test_transformed

Every unknown category is encoded in the same way, as a row of all 0's. That means the encoder effectively groups all unknown categories into one new category.
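The behaviour described above can be checked with a small self-contained sketch. The model names below are hypothetical stand-ins, not the article's actual data, and the encoder's default sparse output is converted with toarray() so the snippet does not depend on the sparse / sparse_output parameter name.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: these model names are illustrative only
train = pd.DataFrame({"Model": ["Golf", "Polo", "Passat", "Golf"]})
test = pd.DataFrame({"Model": ["Polo", "Tiguan", "Arteon"]})  # last two are unseen

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train[["Model"]])

# The default output is a sparse matrix; convert it to a dense array to inspect it
encoded = enc.transform(test[["Model"]]).toarray()

print(enc.categories_[0])  # categories learned from the training data
print(encoded)             # rows for unseen models contain only zeros
```

Both unseen models end up with identical all-zero rows, so a downstream model cannot tell them apart.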

If we change handle_unknown to 'error' (the default), the encoder raises an error when it finds an unknown category.

Code snippet:

# sparse_output applies to scikit-learn 1.2+; on older versions use sparse=False
enc = OneHotEncoder(handle_unknown='error', sparse_output=False)

enc_fit = enc.fit(vw_train[['Model']])

enc_fit.transform(vw_test[['Model']])

Output:

Error
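If you want to react to this error rather than let it stop the program, one option is to catch it. The following is a minimal sketch with hypothetical model names; it assumes the error raised by transform is a ValueError, which is what current scikit-learn versions raise for unknown categories.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: "Tiguan" does not appear in the training set
train = pd.DataFrame({"Model": ["Golf", "Polo", "Passat"]})
test = pd.DataFrame({"Model": ["Polo", "Tiguan"]})

enc = OneHotEncoder(handle_unknown="error")
enc.fit(train[["Model"]])

error_message = None
try:
    enc.transform(test[["Model"]])
except ValueError as exc:
    # Keep the message so the caller can log it or skip the prediction
    error_message = str(exc)
    print("Encoding failed:", error_message)
```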

Conclusion:

  • When unknown categories show up frequently, handle_unknown='ignore' is a good option. Later on, you can add the unknown categories to the training set and re-train your model.
  • Another option is to set handle_unknown='error' and not make a prediction at all when an unknown category is found.
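Either option is easier to manage if you know which categories were unseen. The sketch below is one possible way to do that, not part of the original source code: it compares incoming values against the fitted encoder's categories_ attribute. The data and the variable names (incoming, unseen_models, scorable) are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: "Tiguan" and "Arteon" were not in the training set
train = pd.DataFrame({"Model": ["Golf", "Polo", "Passat"]})
incoming = pd.DataFrame({"Model": ["Polo", "Tiguan", "Golf", "Arteon"]})

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train[["Model"]])

# categories_ holds one array of learned categories per encoded column
known = set(enc.categories_[0])
is_unknown = ~incoming["Model"].isin(known)

# Categories to review and possibly add to the training set before re-training
unseen_models = sorted(incoming.loc[is_unknown, "Model"].unique())
print("Unseen categories:", unseen_models)

# Rows whose category was seen during training and can be scored as usual
scorable = incoming.loc[~is_unknown]
print(scorable)
```

The unseen values can be logged for the next re-training run, while the remaining rows go through the normal prediction path.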

You can download the full source code from my GitHub repository.
