Encoding
Many ML models cannot work with categorical input directly because they expect numeric input. This notebook presents three methods for transforming categorical data into numeric data through a process called “encoding”: label encoding, ordinal encoding, and one-hot encoding.
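Warning: Use one and the same encoding scheme for the training and the test data. The simplest way to achieve this is to encode the full data set before doing a train-test split; alternatively, fit the encoder on the training data only and reuse that fitted encoder to transform the test data. Otherwise, you may accidentally assign two different encoding schemes to the training and test data, which will cause trouble for the ML model.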
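To make the warning concrete, here is a minimal sketch of how two separately fitted encoders can disagree. The tiny `train` and `test` lists are made up for illustration and are not part of the wine data used below.

```python
from sklearn.preprocessing import LabelEncoder

# Toy data (illustrative only): "France" happens to be missing from the test set
train = ["Italy", "France", "Germany", "France"]
test = ["Italy", "Germany"]

# Fitting one encoder per subset yields two different schemes
train_map = {c: i for i, c in enumerate(LabelEncoder().fit(train).classes_)}
test_map = {c: i for i, c in enumerate(LabelEncoder().fit(test).classes_)}
print(train_map)  # "Italy" is encoded as 2
print(test_map)   # "Italy" is encoded as 1

# Fitting once and reusing the same encoder keeps the scheme consistent
shared = LabelEncoder().fit(train)
test_codes = shared.transform(test)
print(test_codes)  # [2 1]
```

Encoding the full data set before the split, as done in this notebook, has the same effect as reusing one fitted encoder.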
import pandas as pd
import numpy as np
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
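from tensorflow import keras
# Import from tensorflow.keras so that all Keras objects come from the same package
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense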
np.random.seed(42)
# Create some data
N = 1000
Wine = pd.DataFrame({"Age": np.random.randint(2, 20, N),
                     "Alc%": np.round(np.random.uniform(11, 13, N), 1),
                     "Geography": np.random.choice(["Italy", "France", "Germany"], N),
                     "Type": np.random.choice(["Red", "Rose", "White"], N)})
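# The target depends on all four features; French wines get a bonus, Rose wines a penalty
Target = np.log(Wine["Age"] + Wine["Alc%"] + 5.*(Wine["Geography"] == "France") - 4.*(Wine["Type"] == "Rose"))
# Note: without a size argument, np.random.uniform draws a single number,
# so the same offset is added to every row
Target += np.random.uniform(-np.std(Target), np.std(Target))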
Wine["Target"] = Target
Wine["HighQuality"] = 0
Wine.loc[ Wine["Target"] > np.mean(Target) + 0.25* np.std(Target) , "HighQuality"] = 1
del Wine["Target"]
#Categoricals = Wine.drop(["HighQuality"], axis = 1).columns[Wine.drop(["HighQuality"], axis = 1).dtypes==object]
Categoricals = Wine.columns[Wine.dtypes==object]
print("Categorical features",Categoricals)
Wine.head()
Categorical features Index(['Geography', 'Type'], dtype='object')
X_train, X_test, y_train, y_test = train_test_split(Wine.drop(["HighQuality"], axis=1), Wine["HighQuality"],
                                                    test_size=0.33, random_state=1)
print(X_train.head())
print(X_test.head())
Age Alc% Geography Type
481 19 12.7 Germany Rose
571 14 12.7 Germany Rose
882 15 12.3 France Red
294 17 11.9 Italy Red
619 12 12.1 France Rose
Age Alc% Geography Type
507 2 11.1 France White
818 16 11.0 Italy White
452 7 11.3 Italy Red
368 19 11.3 Italy Red
242 7 12.2 France Rose
So there are now two categorical features: the type and the geographic origin of the wine. How do we deal with that?
Model
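def get_model(n):
    # Small feed-forward network: one hidden layer with 10 ReLU units,
    # followed by a sigmoid output unit for binary classification
    Model = Sequential()
    Model.add(Dense(10, input_dim=n, activation='relu', kernel_initializer='he_normal'))
    Model.add(Dense(1, activation='sigmoid'))
    return Model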
Label Encoding
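A simple solution is to enumerate the distinct labels of each feature and replace each label with one of the integers 0, 1, …, k−1, where k is the number of distinct labels. LabelEncoder assigns these integers in sorted (alphabetical) order of the labels.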
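As a rough illustration of the idea, the mapping can be sketched in plain Python. This is a simplified stand-in and not the sklearn implementation; the `values` list is made up for the example.

```python
# Toy feature column (illustrative only)
values = ["Red", "White", "Rose", "Red"]

# Sort the distinct labels and number them 0, 1, ..., k-1
classes = sorted(set(values))
mapping = {label: code for code, label in enumerate(classes)}

# Replace every label by its integer code
codes = [mapping[v] for v in values]
print(mapping)  # {'Red': 0, 'Rose': 1, 'White': 2}
print(codes)    # [0, 2, 1, 0]
```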
Wine_le = Wine.copy()
for column in Categoricals:
    Wine_le[column] = LabelEncoder().fit_transform(Wine_le[column])
Wine_le.head()
Wine.head()
X_train, X_test, y_train, y_test = train_test_split(Wine_le.drop(["HighQuality"], axis=1), Wine_le["HighQuality"],
                                                    test_size=0.33, random_state=1)
X_train.head()
Model_le = get_model(n=X_train.shape[1])
Model_le.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
Model_le.fit(X_train, y_train,epochs = 50, verbose = False)
y_hat = (Model_le.predict(X_test) > 0.5).astype("int32")
11/11 [==============================] - 0s 3ms/step
print("Accuracy",np.mean(y_hat.reshape(1,-1)[0] == y_test))
Accuracy 0.8909090909090909
Ordinal Encoding
In the label encoding above, the geographies have been encoded as follows: France: 0, Germany: 1, Italy: 2, and the wine types as follows: Red: 0, Rose: 1, White: 2. Some models will interpret this as “Italy is 2 distance units away from France, but only 1 from Germany”. Such a mathematical interpretation can make sense if the categories have an inherent order. In that case, we should use ordinal encoding and specify the order explicitly.
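To make the distance interpretation concrete, here is a small sketch in plain Python. The helper `code_distance` is a hypothetical name introduced only for this illustration.

```python
# Alphabetical codes, as produced by the label encoding above
label_codes = {"France": 0, "Germany": 1, "Italy": 2}

def code_distance(a, b, codes):
    """Absolute difference between the integer codes of two categories."""
    return abs(codes[a] - codes[b])

print(code_distance("France", "Italy", label_codes))   # 2
print(code_distance("Germany", "Italy", label_codes))  # 1

# With a deliberately chosen north-south order, the distances change
north_south = {"Germany": 0, "France": 1, "Italy": 2}
print(code_distance("Germany", "Italy", north_south))  # 2
```

The numbers themselves carry no meaning beyond the order we impose, which is why the order should be chosen on purpose rather than left to the alphabet.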
Suppose, for example, that we want to order the wine types by redness (White: 0, Rose: 1, Red: 2) and the geographies along the north-south axis (Germany: 0, France: 1, Italy: 2). Below is a short example of how to use ordinal encoding with a known order:
enc = OrdinalEncoder(categories=[['first','second','third','fourth']])
X = [['third'], ['second'], ['first']]
enc.fit(X)
print(enc.transform([['second'], ['first'], ['third'],['fourth']]))
[[1.]
[0.]
[2.]
[3.]]
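The same known-order mapping can also be sketched with pandas alone, assuming an ordered `pd.Categorical` is an acceptable substitute for `OrdinalEncoder` in your workflow. The `types` series is made up for illustration.

```python
import pandas as pd

# Toy column (illustrative only)
types = pd.Series(["Red", "White", "Rose", "Red"])

# The position in `categories` defines the integer code of each label
redness = pd.Categorical(types, categories=["White", "Rose", "Red"], ordered=True)
codes = pd.Series(redness.codes, index=types.index)
print(codes.tolist())  # [2, 0, 1, 2]

# Labels that are not listed in `categories` receive the code -1
unknown = pd.Categorical(["Orange"], categories=["White", "Rose", "Red"], ordered=True)
print(unknown.codes.tolist())  # [-1]
```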
Wine_oe = Wine.copy()
enc = OrdinalEncoder(categories=[["Germany", "France", "Italy"], ["White", "Rose", "Red"]])
Wine_oe[["Geography", "Type"]] = enc.fit_transform(Wine.loc[:,["Geography", "Type"]])
Wine_oe.head(), Wine.head()
( Age Alc% Geography Type HighQuality
0 8 12.3 1.0 0.0 1
1 16 11.5 0.0 0.0 1
2 12 11.3 2.0 2.0 1
3 9 12.7 1.0 2.0 1
4 8 13.0 0.0 0.0 0,
Age Alc% Geography Type HighQuality
0 8 12.3 France White 1
1 16 11.5 Germany White 1
2 12 11.3 Italy Red 1
3 9 12.7 France Red 1
4 8 13.0 Germany White 0)
X_train, X_test, y_train, y_test = train_test_split(Wine_oe.drop(["HighQuality"], axis=1), Wine_oe["HighQuality"],
                                                    test_size=0.33, random_state=1)
X_train.head()
Model_oe = get_model(n=X_train.shape[1])
Model_oe.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
Model_oe.fit(X_train, y_train,epochs = 50, verbose = False)
y_hat = (Model_oe.predict(X_test) > 0.5).astype("int32")
11/11 [==============================] - 0s 3ms/step
print("Accuracy",np.mean(y_hat.reshape(1,-1)[0] == y_test))
Accuracy 0.8909090909090909
One-Hot Encoding
Another alternative is to create binary dummy variables, each of which encodes whether a feature takes a certain value. If a categorical feature $X_i$ has $k$ possible values $c_1, \dots, c_k$, you create $k$ dummy variables $X_i == c_1, \dots, X_i == c_k$, exactly one of which equals 1 in each row. This avoids the issue of the model reading meaning into the distance between two encoded values. One-hot encoding is therefore usually the most faithful way to represent unordered categorical data, but it comes at the cost of creating (potentially very many) new dummy features.
This can be done via OneHotEncoder from sklearn, but it’s probably more convenient to use the pandas function pd.get_dummies().
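For comparison, here is a minimal sketch of the sklearn route on a made-up toy array (not the wine data). The choice of `handle_unknown="ignore"` is an assumption for the illustration; it encodes categories unseen during fitting as all zeros.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy data (illustrative only): columns are Geography and Type
X = np.array([["Italy", "Red"],
              ["France", "White"],
              ["Germany", "Rose"],
              ["France", "Red"]])

enc = OneHotEncoder(handle_unknown="ignore")
onehot = enc.fit_transform(X).toarray()  # the default output is sparse

print(enc.categories_)  # one sorted array of categories per input column
print(onehot[0])        # Italy, Red -> [0. 0. 1. 1. 0. 0.]

# A geography that was not seen during fitting is encoded as all zeros
unseen = enc.transform([["Spain", "Red"]]).toarray()
print(unseen[0])        # [0. 0. 0. 1. 0. 0.]
```

One advantage of a fitted encoder object is that it can be reused on new data with the same column layout, which is harder to guarantee when calling pd.get_dummies() separately on different data sets.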
Wine_ohe = Wine.copy()
Wine_ohe = pd.concat([Wine_ohe,pd.get_dummies(Wine_ohe[Categoricals])], axis = 1)
Wine_ohe.drop(Categoricals,axis = 1, inplace = True)
Wine_ohe
X_train, X_test, y_train, y_test = train_test_split(Wine_ohe.drop(["HighQuality"], axis=1), Wine_ohe["HighQuality"],
                                                    test_size=0.33, random_state=1)
X_train.head()
Model_ohe = get_model(n=X_train.shape[1])
Model_ohe.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
Model_ohe.fit(X_train, y_train,epochs = 50, verbose = False)
y_hat = (Model_ohe.predict(X_test) > 0.5).astype("int32")
11/11 [==============================] - 0s 3ms/step
print("Accuracy",np.mean(y_hat.reshape(1,-1)[0] == y_test))
Accuracy 0.9545454545454546
Ordinal encoding can decrease the model’s accuracy if the chosen order does not reflect the nature of the data. One-hot encoding often yields the most accurate model, as it did in this notebook (accuracy of about 0.95, compared with about 0.89 for label and ordinal encoding), but it can inflate the size of the data and therefore the model training time.
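How quickly one-hot encoding inflates the data can be sketched with a made-up high-cardinality feature. The `Vineyard` column and its 200 values are hypothetical and not part of the wine data above.

```python
import pandas as pd

n_rows = 1000
df = pd.DataFrame({
    # 3 distinct wine types
    "Type": [["Red", "Rose", "White"][i % 3] for i in range(n_rows)],
    # hypothetical feature with 200 distinct values
    "Vineyard": [f"V{i % 200}" for i in range(n_rows)],
})

# One dummy column is created per distinct value of each categorical column
wide = pd.get_dummies(df, dtype=float)
print(df.shape, "->", wide.shape)  # (1000, 2) -> (1000, 203)
```

The number of dummy columns equals the sum of the distinct values over all encoded features, so a single feature with many levels can dominate the width of the data set.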