Sklearn

Clustering basics

This ipynb is based on a workshop organised by the tech consulting company TNG. The notebook explains the theoretical background of three widely used clustering algorithms and applies them to some sample data. As a final example, clusters are used to “discretise” the colours of an image.

import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
#from plotting_utilities import *
from sklearn.datasets import make_moons, make_blobs
from sklearn.
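As a rough illustration of what clustering sample data can look like, here is a minimal sketch using `make_blobs` from the imports above. The summary does not name the three algorithms, so the use of k-means (`KMeans`) is an assumption, and all parameter values are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2D sample data drawn around three centres (illustrative parameters).
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

# Assumption: k-means stands in for the algorithms covered in the notebook.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels.shape)                   # one cluster label per sample
print(kmeans.cluster_centers_.shape)  # one 2D centre per cluster
```

The same idea carries over to the image example: if each pixel's colour is treated as a point, replacing it by its cluster centre reduces the image to a small number of colours.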

Encoding

Many ML models struggle with categorical input data because they expect numeric input. This notebook presents three methods for transforming categorical data into numeric data through a process called “encoding”: label encoding, ordinal encoding and one-hot encoding. Warning: You should probably encode the full data set before doing a train-test split. Otherwise, you may accidentally assign two different encoding schemes to the training and test data, which will, of course, cause trouble for the ML model.
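The three encodings can be compared on a tiny example. The following is a minimal sketch using scikit-learn's `LabelEncoder`, `OrdinalEncoder` and `OneHotEncoder`; the `sizes` column and its category order are made up for illustration and are not taken from the notebook.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

# Hypothetical categorical column, invented for this illustration.
sizes = np.array(["small", "large", "medium", "small"])

# Label encoding: integer codes assigned in alphabetical order of the classes.
label_codes = LabelEncoder().fit_transform(sizes)

# Ordinal encoding: integer codes that follow an explicitly given order.
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
ordinal_codes = ordinal.fit_transform(sizes.reshape(-1, 1))

# One-hot encoding: one binary column per category, with no implied order.
one_hot = OneHotEncoder().fit_transform(sizes.reshape(-1, 1)).toarray()

print(label_codes)            # [2 0 1 2]
print(ordinal_codes.ravel())  # [0. 2. 1. 0.]
print(one_hot.shape)          # (4, 3)
```

Note that label encoding orders the categories alphabetically (`large` < `medium` < `small`), which is why an ordinal encoder with an explicit order is usually the better fit for ranked features.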

K-Fold Cross Validation

You may already be familiar with the idea of splitting data into training and test data: you train your model only on the training data and then evaluate it on the unseen test data to see how well it deals with completely new data. Often, there is also a validation data set that is known to the machine learning engineer but is not seen by the model during the training process.
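K-fold cross validation repeats this split several times so that every sample is held out exactly once. Below is a minimal sketch with scikit-learn's `KFold`; the toy array and the choice of five folds are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# 10 toy samples with 2 features each (illustrative data).
X = np.arange(20).reshape(10, 2)

kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_sizes = []
held_out = []
for train_idx, test_idx in kf.split(X):
    # Each fold trains on 4/5 of the data and holds out the remaining 1/5.
    fold_sizes.append((len(train_idx), len(test_idx)))
    held_out.extend(test_idx.tolist())

print(fold_sizes)        # [(8, 2), (8, 2), (8, 2), (8, 2), (8, 2)]
print(sorted(held_out))  # every index from 0 to 9 appears exactly once
```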

Logistic Regression

A typical task solved using machine learning algorithms is the assignment of instances to different classes, a so-called classification. Despite its somewhat misleading name, logistic regression is a method for handling classification problems, in particular two-class classification problems. It classifies an instance by calculating the probability that the instance belongs to a certain class. Stepping through this notebook, you will become familiar with the fundamental concepts of logistic regression.
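The link between probability and class assignment can be sketched in a few lines. The example below uses scikit-learn's `LogisticRegression` on made-up one-dimensional data and recomputes the class probability by hand with the sigmoid function; the data and the 0.5 threshold are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy two-class data: small values belong to class 0, large values to class 1.
X = np.array([[0.5], [1.0], [1.5], [2.0], [3.0], [3.5], [4.0], [4.5]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(X, y)

# Probability of class 1 as reported by the model.
proba = model.predict_proba(X)[:, 1]

# The same probability by hand: sigmoid of the linear score w*x + b.
z = (X @ model.coef_.T + model.intercept_).ravel()
manual = 1.0 / (1.0 + np.exp(-z))

# Classification: assign class 1 where the probability reaches 0.5.
pred = (proba >= 0.5).astype(int)

print(np.allclose(proba, manual))
print(pred)
```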

Trees and Forests

This notebook gives an introduction to decision and regression trees and their aggregation as forests. We will implement a bagging (Bootstrap AGGregating) algorithm ourselves, construct a random forest ourselves and finally use a random forest from a standard package. Several data sets are used to compare the results of these algorithms.

from sklearn.datasets import load_iris, load_breast_cancer
from sklearn import tree
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import random

Decision Trees

In a classification of discrete classes, one intuitively might want to find some simple rules based on the instances’ features.
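To give an idea of what bagging involves, here is a minimal sketch, not the notebook's own implementation: each tree is trained on a bootstrap sample (drawn with replacement) of the training data, and the ensemble predicts by majority vote. The number of trees, the seeds and the use of the breast cancer data set are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # odd number, so a binary majority vote cannot tie

trees = []
for _ in range(n_trees):
    # Bootstrap: sample training indices with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    clf = DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx])
    trees.append(clf)

# Aggregate: each tree votes 0 or 1, and the majority decides.
votes = np.stack([clf.predict(X_test) for clf in trees])
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)

accuracy = (bagged_pred == y_test).mean()
print(f"bagged accuracy: {accuracy:.3f}")
```

A random forest goes one step further by also restricting each split to a random subset of features, which `DecisionTreeClassifier` supports through its `max_features` parameter.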

Using Linear Regression to Predict Bike Sharing Demand in Seoul

Over the past two decades, the sharing economy has grown strongly worldwide. In many business sectors, such as tourism and transportation, new sharing services like Airbnb or Uber provide customers with a convenient supplement to the conventional options. One sector that has been drastically affected by emerging sharing services in the past few years is short-distance transportation. Nowadays, in most cities, bicycles or electric scooters can be rented to quickly cover short distances.