Encoding


Many ML models cannot handle categorical input directly because they expect numeric inputs. This notebook presents three methods to transform categorical data into numeric data through a process called “encoding”: label encoding, ordinal encoding, and one-hot encoding.

Warning: You should probably encode the full data set before doing a train-test split. Otherwise, you may accidentally assign two different encoding schemes to the training and test data, which will then, of course, cause trouble for the ML model.
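To see why, here is a minimal sketch of the pitfall on toy data (not the Wine set built below): an encoder fitted only on the training split does not know labels that appear only in the test split.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

train = pd.Series(["Red", "White", "Red"])
test = pd.Series(["Rose", "Red"])   # "Rose" never appears in train

enc = LabelEncoder().fit(train)
print(enc.transform(train))         # [0 1 0]
# enc.transform(test)               # raises ValueError: unseen label "Rose"

# Fitting on the full column first gives one consistent mapping:
enc_full = LabelEncoder().fit(pd.concat([train, test]))
print(enc_full.transform(test))     # [1 0]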

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
np.random.seed(42)
# Create some synthetic wine data
N = 1000

Wine = pd.DataFrame({"Age": np.random.randint(2, 20, N),
                     "Alc%": np.round(np.random.uniform(11, 13, N), 1),
                     "Geography": np.random.choice(["Italy", "France", "Germany"], N),
                     "Type": np.random.choice(["Red", "Rose", "White"], N)})

# The (hidden) target depends on all four features ...
Target = np.log(Wine["Age"] + Wine["Alc%"] + 5.*(Wine["Geography"] == "France") - 4.*(Wine["Type"] == "Rose"))
# ... plus a random offset
Target += np.random.uniform(-np.std(Target), np.std(Target))
Wine["Target"] = Target
# Binarize: a wine is "high quality" if its target is well above average
Wine["HighQuality"] = 0
Wine.loc[Wine["Target"] > np.mean(Target) + 0.25*np.std(Target), "HighQuality"] = 1

del Wine["Target"]


Categoricals = Wine.columns[Wine.dtypes==object]

print("Categorical features",Categoricals)

Wine.head()
Categorical features Index(['Geography', 'Type'], dtype='object')
   Age  Alc% Geography   Type  HighQuality
0    8  12.3    France  White            1
1   16  11.5   Germany  White            1
2   12  11.3     Italy    Red            1
3    9  12.7    France    Red            1
4    8  13.0   Germany  White            0

X_train, X_test, y_train, y_test = train_test_split(Wine.drop(["HighQuality"],axis =1), Wine["HighQuality"],
                                                    test_size=0.33, random_state=1)
print(X_train.head())
print(X_test.head())
     Age  Alc% Geography  Type
481   19  12.7   Germany  Rose
571   14  12.7   Germany  Rose
882   15  12.3    France   Red
294   17  11.9     Italy   Red
619   12  12.1    France  Rose
     Age  Alc% Geography   Type
507    2  11.1    France  White
818   16  11.0     Italy  White
452    7  11.3     Italy    Red
368   19  11.3     Italy    Red
242    7  12.2    France   Rose

So there are now two categorical features: the type and the geographic origin of the wine. How do we deal with that?

Model

def get_model(n):
    # Small binary classifier: one hidden ReLU layer, sigmoid output
    Model = Sequential()
    Model.add(Dense(10, input_dim=n, activation='relu', kernel_initializer='he_normal'))
    Model.add(Dense(1, activation='sigmoid'))
    return Model

Label Encoding

A simple solution is to enumerate the k distinct labels of each feature and replace them with the integers 0, 1, …, k−1.

Wine_le = Wine.copy()

for column in Categoricals:
    Wine_le[column] = LabelEncoder().fit_transform(Wine_le[column])
Wine_le.head()
   Age  Alc%  Geography  Type  HighQuality
0    8  12.3          0     2            1
1   16  11.5          1     2            1
2   12  11.3          2     0            1
3    9  12.7          0     0            1
4    8  13.0          1     2            0
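If you want to check which integer each label received, the fitted encoder exposes the (alphabetically sorted) labels via its classes_ attribute, where the position in classes_ is the assigned integer. A quick sketch:

enc = LabelEncoder().fit(Wine["Geography"])
print(dict(zip(enc.classes_, range(len(enc.classes_)))))
# {'France': 0, 'Germany': 1, 'Italy': 2}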

The original frame is unchanged, since we encoded a copy:

Wine.head()
   Age  Alc% Geography   Type  HighQuality
0    8  12.3    France  White            1
1   16  11.5   Germany  White            1
2   12  11.3     Italy    Red            1
3    9  12.7    France    Red            1
4    8  13.0   Germany  White            0

X_train, X_test, y_train, y_test = train_test_split(Wine_le.drop(["HighQuality"],axis =1), Wine_le["HighQuality"],
                                                    test_size=0.33, random_state=1)
X_train.head()
     Age  Alc%  Geography  Type
481   19  12.7          1     1
571   14  12.7          1     1
882   15  12.3          0     0
294   17  11.9          2     0
619   12  12.1          0     1

Model_le = get_model(n=X_train.shape[1])
Model_le.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
Model_le.fit(X_train, y_train,epochs = 50, verbose = False)
y_hat = (Model_le.predict(X_test) > 0.5).astype("int32")
11/11 [==============================] - 0s 3ms/step
print("Accuracy",np.mean(y_hat.reshape(1,-1)[0] == y_test))
Accuracy 0.8909090909090909

Ordinal Encoding

With the label encoding above, the geographies were encoded as France: 0, Germany: 1, Italy: 2 and the wine types as Red: 0, Rose: 1, White: 2. Some models will read this as “Italy is 2 distance units away from France, but only 1 from Germany”. Such a mathematical interpretation can make sense if there is an inherent order of the categories. In that case, we should use Ordinal Encoding.

For example, let’s suppose that we want to order the wine types by redness (White: 0, Rose: 1, Red: 2) and the geographies along the north-south axis (Germany: 0, France: 1, Italy: 2). Below is a short example of how to use ordinal encoding with a known order:

enc = OrdinalEncoder(categories=[['first', 'second', 'third', 'fourth']])
X = [['third'], ['second'], ['first']]
enc.fit(X)
print(enc.transform([['second'], ['first'], ['third'], ['fourth']]))
[[1.]
 [0.]
 [2.]
 [3.]]
Wine_oe = Wine.copy()

enc = OrdinalEncoder(categories=[["Germany", "France", "Italy"], ["White", "Rose", "Red"]])
Wine_oe[["Geography", "Type"]] =  enc.fit_transform(Wine.loc[:,["Geography", "Type"]])
 
Wine_oe.head(), Wine.head()
(   Age  Alc%  Geography  Type  HighQuality
 0    8  12.3        1.0   0.0            1
 1   16  11.5        0.0   0.0            1
 2   12  11.3        2.0   2.0            1
 3    9  12.7        1.0   2.0            1
 4    8  13.0        0.0   0.0            0,
    Age  Alc% Geography   Type  HighQuality
 0    8  12.3    France  White            1
 1   16  11.5   Germany  White            1
 2   12  11.3     Italy    Red            1
 3    9  12.7    France    Red            1
 4    8  13.0   Germany  White            0)
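A fitted OrdinalEncoder can also translate the numbers back via inverse_transform, which makes for a quick sanity check of the chosen order. A sketch using the encoder fitted above:

print(enc.inverse_transform([[0., 2.], [2., 0.]]))
# [['Germany' 'Red']
#  ['Italy' 'White']]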
X_train, X_test, y_train, y_test = train_test_split(Wine_oe.drop(["HighQuality"],axis =1), Wine_oe["HighQuality"],
                                                    test_size=0.33, random_state=1)
X_train.head()
     Age  Alc%  Geography  Type
481   19  12.7        0.0   1.0
571   14  12.7        0.0   1.0
882   15  12.3        1.0   2.0
294   17  11.9        2.0   2.0
619   12  12.1        1.0   1.0

Model_oe = get_model(n=X_train.shape[1])
Model_oe.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
Model_oe.fit(X_train, y_train,epochs = 50, verbose = False)
y_hat = (Model_oe.predict(X_test) > 0.5).astype("int32")
11/11 [==============================] - 0s 3ms/step
print("Accuracy",np.mean(y_hat.reshape(1,-1)[0] == y_test))
Accuracy 0.8909090909090909

One-Hot Encoding

Another alternative is to create binary dummy variables that encode whether a feature takes a certain value or not. So if you have a categorical feature $X_i$ with $k$ possible values $c_1, \dots, c_k$, you create $k$ dummy variables $[X_i = c_1], \dots, [X_i = c_k]$. The model can then no longer read a spurious meaning into the distances between encoded values. One-hot encoding is therefore usually the most accurate way to represent categorical data, but it comes at the cost of creating (potentially very many) new dummy features.

This can be done via OneHotEncoder from sklearn, but it’s often more convenient to use the built-in pandas function pd.get_dummies().
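For comparison, here is a minimal sketch of the sklearn route (note: the sparse_output argument assumes sklearn >= 1.2; older versions call it sparse):

ohe = OneHotEncoder(sparse_output=False)
dummies = pd.DataFrame(ohe.fit_transform(Wine[Categoricals]),
                       columns=ohe.get_feature_names_out(Categoricals),
                       index=Wine.index)
dummies.head()

The pandas version below produces the same dummy columns in one call.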

Wine_ohe = Wine.copy()

Wine_ohe = pd.concat([Wine_ohe,pd.get_dummies(Wine_ohe[Categoricals])], axis = 1)
Wine_ohe.drop(Categoricals,axis = 1, inplace = True)
Wine_ohe
     Age  Alc%  HighQuality  Geography_France  Geography_Germany  Geography_Italy  Type_Red  Type_Rose  Type_White
0      8  12.3            1                 1                  0                0         0          0           1
1     16  11.5            1                 0                  1                0         0          0           1
2     12  11.3            1                 0                  0                1         1          0           0
3      9  12.7            1                 1                  0                0         1          0           0
4      8  13.0            0                 0                  1                0         0          0           1
..   ...   ...          ...               ...                ...              ...       ...        ...         ...

[1000 rows x 9 columns]

X_train, X_test, y_train, y_test = train_test_split(Wine_ohe.drop(["HighQuality"],axis =1), Wine_ohe["HighQuality"],
                                                    test_size=0.33, random_state=1)
X_train.head()
     Age  Alc%  Geography_France  Geography_Germany  Geography_Italy  Type_Red  Type_Rose  Type_White
481   19  12.7                 0                  1                0         0          1           0
571   14  12.7                 0                  1                0         0          1           0
882   15  12.3                 1                  0                0         1          0           0
294   17  11.9                 0                  0                1         1          0           0
619   12  12.1                 1                  0                0         0          1           0

Model_ohe = get_model(n=X_train.shape[1])
Model_ohe.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
Model_ohe.fit(X_train, y_train,epochs = 50, verbose = False)
y_hat = (Model_ohe.predict(X_test) > 0.5).astype("int32")
11/11 [==============================] - 0s 3ms/step
print("Accuracy",np.mean(y_hat.reshape(1,-1)[0] == y_test))
Accuracy 0.9545454545454546

Ordinal encoding can decrease the model’s accuracy if the chosen order is not actually well-suited to the nature of the data. One-hot encoding often yields the most accurate model, but it can inflate the size of the data and therefore the model’s training time.
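The cost in width is easy to check on the frames built above:

print(Wine_le.shape)    # (1000, 5) - label/ordinal encoding keeps one column per feature
print(Wine_ohe.shape)   # (1000, 9) - one-hot replaces each categorical column by k dummies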