DataSciencePlus

Encoding

Many ML models struggle with categorical input data because they expect numeric input. This notebook presents three methods for transforming categorical data into numeric data through a process called “encoding”: label encoding, ordinal encoding and one-hot encoding. Warning: You should probably encode the full data set before doing a train-test split. Otherwise, you may accidentally assign two different encoding schemes to the training and test data, which will then, of course, cause trouble for the ML model.
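The three encodings named above can be sketched with plain pandas. This is only a minimal illustration, not necessarily the approach taken in the notebook: the `size` column, its values and the chosen category order are made up for the example.

```python
import pandas as pd

# Hypothetical toy data; the column name and values are illustrative only.
df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# Label encoding: one arbitrary integer per category.
# pandas sorts the categories alphabetically (large, medium, small).
df["size_label"] = df["size"].astype("category").cat.codes

# Ordinal encoding: integers that follow an explicit, meaningful order.
order = ["small", "medium", "large"]  # assumed order for this example
df["size_ordinal"] = pd.Categorical(df["size"], categories=order, ordered=True).codes

# One-hot encoding: one 0/1 column per category.
one_hot = pd.get_dummies(df["size"], prefix="size", dtype=int)

print(df)
print(one_hot)
```

Note that the label codes depend on which categories are present in the data, which is why encoding the training and test data separately can produce two inconsistent schemes.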

How to Speed up Pandas?

Pandas is a frequently used Python package, and every aspiring Python data scientist should have some familiarity with it. However, Pandas has many quirks that are not obvious even to advanced users. One of them is the choice among different ways to apply a given function to every row of a data set. As we will see here, this choice can make an enormous difference in speed, by a factor of 1000.
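To make the comparison concrete, here is a rough sketch of three common ways to compute the same row-wise result. The data, the column names `a` and `b` and the helper `timed` are assumptions for this example, and the measured times will vary by machine and pandas version.

```python
import time

import numpy as np
import pandas as pd

# Hypothetical data: two random numeric columns.
rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.random(10_000), "b": rng.random(10_000)})


def timed(fn):
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def with_iterrows():
    # Explicit Python loop over rows.
    return pd.Series([row["a"] + row["b"] for _, row in df.iterrows()], index=df.index)


def with_apply():
    # Row-wise apply: still calls a Python function once per row.
    return df.apply(lambda row: row["a"] + row["b"], axis=1)


def vectorized():
    # Column arithmetic: the loop runs in compiled code.
    return df["a"] + df["b"]


loop_result, t_loop = timed(with_iterrows)
apply_result, t_apply = timed(with_apply)
vec_result, t_vec = timed(vectorized)

print(f"iterrows:   {t_loop:.4f} s")
print(f"apply:      {t_apply:.4f} s")
print(f"vectorized: {t_vec:.4f} s")
```

All three variants return the same values; only the time they take differs.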

Model Selection and Collinearity

This notebook uses a very simple dataset and model to show that problems can arise if the features in your dataset are collinear or correlated with each other. Although the setup here is deliberately simple, the same problems can also occur in much more complex, high-dimensional data and models and lead to badly wrong interpretations of the model coefficients. import pandas as pd import numpy as np import matplotlib.
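The effect can be sketched with synthetic data and ordinary least squares in NumPy. This is not the notebook's dataset or model: the variables `x1`, `x2` and `y`, the noise levels and the true coefficient of 3 are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)      # almost an exact copy of x1
y = 3.0 * x1 + 0.5 * rng.normal(size=n)  # y truly depends on x1 only

# Fit with both collinear features (plus an intercept column).
X_both = np.column_stack([np.ones(n), x1, x2])
coef_both, *_ = np.linalg.lstsq(X_both, y, rcond=None)

# Fit with x1 alone for comparison.
X_single = np.column_stack([np.ones(n), x1])
coef_single, *_ = np.linalg.lstsq(X_single, y, rcond=None)

print("correlation(x1, x2):", np.corrcoef(x1, x2)[0, 1])
print("coefficients with both features:", coef_both[1], coef_both[2])
print("their sum:", coef_both[1] + coef_both[2])
print("coefficient with x1 only:", coef_single[1])
```

With both features in the model, the data pin down only the sum of the two coefficients well; how that sum is split between `x1` and `x2` is driven largely by noise, so reading either coefficient on its own can be misleading.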

Pandas: Dos and Don'ts

This notebook shows some examples of pandas code. Both new and experienced users of this library often write unnecessarily complicated or slow code when pandas has built-in functionality for the same task. This notebook aims to make you aware of some of the functionality that pandas has to offer. import pandas as pd import numpy as np import datetime Create some Data We create sample DataFrames with the champions in German men’s and women’s football from 2007 to 2014 and their average points per match.

SMOTE-NC

Synthetic Minority Oversampling TEchnique for Nominal and Continuous data. What should we do when our minority class is an extreme minority? Such an imbalance could lead to the classifier basically always predicting the majority class, but often it is precisely the minority class that is actually interesting, e.g. in predicting cancer from image data or fraud from credit card data. We could just multiply instances of the minority class by a factor of, say, 100.
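The naive idea of simply repeating minority rows can be sketched as follows. Note that this is plain duplication, not SMOTE-NC itself, and that the toy data, the column names and the smaller factor of 4 are assumptions made for the example.

```python
import pandas as pd

# Hypothetical imbalanced data with one continuous and one nominal feature.
df = pd.DataFrame({
    "amount": [10.0, 12.5, 9.9, 11.2, 500.0],
    "channel": ["web", "web", "shop", "web", "phone"],
    "is_fraud": [0, 0, 0, 0, 1],
})

factor = 4  # each minority row should appear this many times in total

# Select the minority class and append (factor - 1) extra copies of it.
minority = df[df["is_fraud"] == 1]
oversampled = pd.concat([df] + [minority] * (factor - 1), ignore_index=True)

print(oversampled["is_fraud"].value_counts())
```

The classes are now balanced, but every added row is an exact copy of an existing one, so the model sees no new information about the minority class.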