Bootstrapping


There is a story about a man who got stuck in quicksand and would have suffered a horrible death, but he managed to pull himself out of the sand by his bootstraps. In German, Baron von Münchhausen tells a similar story about pulling himself out of a swamp by his own hair. Obviously, all of this is nonsense, but it gave its name to a statistical method - Bootstrapping - that sounds similarly ludicrous, yet actually works and is immensely useful in many otherwise difficult situations.

Bootstrapping means taking a sample of data and resampling it with replacement. As a result, some of the observations are included more than once in the resampled data, whereas others do not occur at all. We do this $n_B$ times and end up with $n_B$ resampled versions of our original data.
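To see the "with replacement" part concretely, here is a minimal sketch on a toy array (the setup is only illustrative; we will wrap the same idea into a reusable function below):

import numpy as np

np.random.seed(0)
data = np.arange(10)  # a toy "data set" of ten observations

# one bootstrap resample: same size as the original, drawn with replacement
resample = np.random.choice(data, size=data.shape[0], replace=True)
print("Resample:", resample)

# some observations appear several times, others are missing entirely
print("Missing from this resample:", np.setdiff1d(data, resample))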

If there is any function $f$ that we want to calculate on our data (e.g. the 75% quantile), we can now evaluate this function on all $n_B$ resamples: the values $f_1,\dots,f_{n_B}$ have an empirical mean that converges to the true value of this function, but, much more usefully, their empirical standard deviation allows us to quantify the uncertainty of this estimate.
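For example, here is a minimal sketch of this recipe for the 75% quantile mentioned above (on a toy normal sample; the data and sample sizes are assumptions made for illustration):

import numpy as np

np.random.seed(0)
data = np.random.normal(0, 1, 500)  # toy sample
nB = 1000

# evaluate the 75% quantile on each of nB resamples
f_values = [np.percentile(np.random.choice(data, size=data.shape[0], replace=True), 75)
            for _ in range(nB)]

# the mean estimates the quantile, the standard deviation quantifies its uncertainty
print("Bootstrap estimate:", np.round(np.mean(f_values), 3),
      "±", np.round(np.std(f_values), 3))
print("Plug-in estimate:", np.round(np.percentile(data, 75), 3))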

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression as LR

Get Data

np.random.seed(1234)
X = np.random.normal(0,1,1000)
Y = X + np.random.normal(0,0.5,1000)
Data = np.concatenate((X.reshape(-1,1),Y.reshape(-1,1)),axis = 1)
Data.shape
plt.scatter(Data[:,0],Data[:,1])
plt.xlabel("X")
plt.ylabel("Y")
plt.show()

[Figure: scatter plot of Y against X]

If $X$ and $Y$ were independent, we would have $\sigma(X+Y) = \sqrt{\sigma(X)^2 + \sigma(Y)^2}$. But this does not hold here, because $Y$ depends linearly on $X$:

print("Standard deviation of X",np.std(Data[:,0]))
print("Standard deviation of Y",np.std(Data[:,1]))
print("Standard deviation of X+Y according to independence",np.sqrt(np.std(Data[:,0])**2 + np.std(Data[:,1])**2))
print("True Standard deviation of X+Y", np.std(Data[:,0]+Data[:,1]))
Standard deviation of X 0.9730661993772295
Standard deviation of Y 1.0964287375625763
Standard deviation of X+Y according to independence 1.4659515015591447
True Standard deviation of X+Y 2.0130751916821232

So numpy allows us to estimate the standard deviation of $X+Y$, but remember: this is only an estimate computed from a finite sample, so it is not perfectly accurate! What if we need to know how accurate our estimate is - what do we do now?

Resampling

For resampling, we use the np.random.choice function, which samples from a given array with replacement, as demonstrated here:

np.random.choice(np.arange(10), 10)
array([7, 3, 7, 5, 2, 0, 9, 3, 6, 9])
def Bootstrap(Array, nB=100, Function=None):
    """
    Resample Array nB times with replacement and, if given, apply Function to each resample.

    :Array: original data (2D array) that we want to bootstrap
    :nB: number of bootstrap resamples
    :Function: if None, only return the nB resamples of Array; otherwise apply
               this function to each resample (e.g. np.mean)
    :return: list of resamples or list of function values
    """

    # collect one result per resampling
    output = list()

    for i in range(nB):

        # resample row indices from the original array, with replacement
        Resample = Array[np.random.choice(np.arange(Array.shape[0]),
                                          size=Array.shape[0], replace=True), :]

        if Function is None:
            # keep the full resample
            output.append(Resample)
        else:
            # apply the given function to the new sample
            output.append(Function(Resample))

    return output

Sanity Check

Let’s see if Bootstrapping reproduces the known result for the standard deviation of $X$:

STD_X_BT = np.array(Bootstrap(Data[:,0].reshape(-1,1),
                     1000, np.std))
print("According to Bootstrapping, the empirical standard deviation of X should be in",
      np.round( np.mean(STD_X_BT),3) ,"±", np.round( np.std(STD_X_BT),3))
print("The actual standard deviation is", np.round(np.std(X),3))
According to Bootstrapping, the empirical standard deviation of X should lie in 0.973 ± 0.022
The actual standard deviation is 0.973

And now for $X+Y$:

# a lambda expression is a concise way to define the function inline
STD_XY_BT = np.array(Bootstrap(Data, 1000, lambda d: np.std(d[:,0] + d[:,1])))

print("According to Bootstrapping, the empirical standard deviation of X+Y should be in",
      np.round( np.mean(STD_XY_BT),3) ,"±", np.round( np.std(STD_XY_BT),3))
print("The actual standard deviation is", np.round(np.std(X+Y),3))
According to Bootstrapping, the empirical standard deviation of X+Y should lie in 2.012 ± 0.044
The actual standard deviation is 2.013

Alternatively

BT_samples = np.array(Bootstrap(Data, 1000))
STD_XY_BT = list()
for b in BT_samples:
    STD_XY_BT += [np.std(b[:,0] + b[:,1])]
STD_XY_BT = np.array(STD_XY_BT)

print("According to Bootstrapping, the empirical standard deviation of X+Y should lie in",
      np.round( np.mean(STD_XY_BT),3) ,"±", np.round( np.std(STD_XY_BT),3))
print("The actual standard deviation is", np.round(np.std(X+Y),3))
According to Bootstrapping, the empirical standard deviation of X+Y should lie in 2.013 ± 0.046
The actual standard deviation is 2.013

So now we have an estimate for the empirical standard deviation of $X+Y$ together with an uncertainty interval. That’s much more useful than before!
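As a side note, mean ± standard deviation is not the only way to summarise the bootstrap values: a common alternative is the percentile interval, which reads the interval bounds directly off the empirical quantiles of the bootstrap sample. A minimal sketch, reusing the STD_XY_BT array from above:

# 95% percentile interval: take the 2.5% and 97.5% quantiles of the bootstrap values
lower, upper = np.percentile(STD_XY_BT, [2.5, 97.5])
print("95% percentile interval for the standard deviation of X+Y:",
      np.round(lower, 3), "to", np.round(upper, 3))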

A More Interesting Example

These examples were pretty basic, so let’s try something more interesting. Suppose we assume a linear relationship $Y = f(X) = aX + b$ and want to give an estimate for $f(4)$ together with an uncertainty interval. How do we do this?

With Bootstrapping, this is simple: we resample the data $(X, Y)$ $n_B$ times, fit one estimate $\hat{f}$ on each of these samples, and evaluate it at $x = 4$. This gives us a sample of estimates of $\hat{f}(4)$, to which we can easily apply np.mean and np.std, or even plot a histogram to visualise the uncertainty.

BT_samples = np.array(Bootstrap(Data, 1000))
BT_Prediction = list()

for sample in BT_samples:
    # fit a linear regression on the resample and evaluate it at x = 4
    reg = LR().fit(sample[:,0].reshape(-1,1), sample[:,1])
    BT_Prediction += [reg.predict(np.array([[4]]))]

BT_Prediction = np.array(BT_Prediction)
plt.hist(BT_Prediction.flatten(), 30)
plt.xlabel("Estimate of f(4)")
plt.ylabel("Frequency")
plt.show()

[Figure: histogram of the bootstrap estimates of f(4)]

print("According to Bootstrapping, the estimation of f(4) should be given by",
      np.round( np.mean(BT_Prediction),3) ,"±", np.round( np.std(BT_Prediction),3))
According to Bootstrapping, the estimation of f(4) should be given by 4.041 ± 0.064

This can, of course, be extended to much more complex problems. Maybe you have not one but $m$ features that predict $Y$, say in a polynomial model $Y = ax_1 + bx_1^2 + cx_2 + dx_2^2$, which you wish to evaluate at e.g. $(x_1 = 0.7, x_2 = -0.1)$. With Bootstrapping, it is straightforward to quantify the uncertainty of such a prediction, as sketched below. If you ever struggle to find an approach for estimating the statistical uncertainty of something, Bootstrapping might be just the right tool for you!
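For illustration, here is one way such a polynomial case could look. This is only a sketch under assumptions: the synthetic data, the "true" coefficients, and the helpers squared_features and predict_at_point are invented for this example; it reuses the Bootstrap function from above.

# synthetic data for the assumed model Y = a*x1 + b*x1^2 + c*x2 + d*x2^2
np.random.seed(1234)
X12 = np.random.normal(0, 1, (1000, 2))
a, b, c, d = 1.0, 0.5, -2.0, 0.3   # coefficients chosen only for this illustration
Y2 = a*X12[:,0] + b*X12[:,0]**2 + c*X12[:,1] + d*X12[:,1]**2 + np.random.normal(0, 0.5, 1000)
Data2 = np.column_stack([X12, Y2])

def squared_features(X):
    # feature matrix with columns x1, x1^2, x2, x2^2
    return np.column_stack([X[:,0], X[:,0]**2, X[:,1], X[:,1]**2])

x_new = np.array([[0.7, -0.1]])

def predict_at_point(sample):
    # fit the polynomial model on one resample and evaluate it at (x1 = 0.7, x2 = -0.1)
    reg = LR().fit(squared_features(sample[:, :2]), sample[:, 2])
    return reg.predict(squared_features(x_new))[0]

BT_Poly = np.array(Bootstrap(Data2, 1000, predict_at_point))
print("According to Bootstrapping, the prediction at (0.7, -0.1) should be",
      np.round(np.mean(BT_Poly), 3), "±", np.round(np.std(BT_Poly), 3))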