One of the biggest advances in statistical modeling in the last 30 years has been the use of the bootstrap. For those interested in learning about the bootstrap in more detail, a good place to start is an article by UCSD math professor Dimitris N. Politis, which I will summarize here. For a more thorough treatment, see An Introduction to the Bootstrap by Efron and Tibshirani.
Suppose we have N observations of a random variable X. We can group these into a vector X=(X1,…,XN), where the Xi are iid with distribution F. If we want to estimate a parameter θ(F) from the data, we can use a statistic T(X) as an estimate. If we assume that F is normal, we can use traditional statistics to calculate T(X) as well as a confidence interval around θ(F). If we do not know the distribution F (which a researcher probably does not in reality), then classical statistical theory may be less reliable and a bootstrap methodology may be more robust. The bootstrap approximates the sampling distribution of T(X) by using the observed data as a stand-in for F, which is especially helpful when F is significantly skewed.
The bootstrap procedure creates a new sample by randomly drawing observations from X with replacement until we have a new vector with N observations. We repeat this B times to create our bootstrap data set. Let’s look at an example.
Pretend we have data on how many push-ups I have completed each day over 15 days. I want to estimate the median number of push-ups I do each day. In this sample, N=15, and since we will create ten bootstrap samples, B=10.
The median of the actual data is 22, but we can also estimate the median using a bootstrap methodology. We first randomly choose one of the data points and make it the first observation of B1 (bootstrap sample number 1). We then draw again, with replacement, and make that value the second observation of B1, continuing until B1 has N observations. Because we sample with replacement, data points often repeat. For instance, in B1 the observation X10 appears twice. The median varies across the 10 bootstrap samples, but the average of the 10 bootstrap medians is 22.8.
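The resampling step described above can be sketched in a few lines of Python. This is only an illustration: the `pushups` list is made-up data (not the values from my example), and `bootstrap_medians` and its `seed` parameter are names I have introduced here for convenience.

```python
import random
import statistics


def bootstrap_medians(data, n_boot, seed=None):
    """Return the median of each of n_boot bootstrap samples drawn from data."""
    rng = random.Random(seed)  # a seed makes the resampling reproducible
    n = len(data)
    medians = []
    for _ in range(n_boot):
        # Draw n observations from the original data, with replacement,
        # so the same data point can appear more than once.
        sample = rng.choices(data, k=n)
        medians.append(statistics.median(sample))
    return medians


# Hypothetical push-up counts with N = 15 (illustrative only).
pushups = [18, 25, 20, 22, 30, 19, 24, 21, 27, 22, 23, 17, 26, 22, 28]

medians = bootstrap_medians(pushups, n_boot=10, seed=0)
print("sample median:", statistics.median(pushups))
print("bootstrap medians:", medians)
print("mean of bootstrap medians:", statistics.mean(medians))
```

With only B=10 samples the average of the bootstrap medians moves around quite a bit from run to run; in practice one would likely use a much larger B.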
We can also calculate the bootstrap variance (3.36) and standard deviation (1.83). These are calculated according to the formulas:
- Variance: $B^{-1}\sum_{i=1}^{B} T(X^*_i)^2 - \left[B^{-1}\sum_{i=1}^{B} T(X^*_i)\right]^2$
- S.D. = $(\mathrm{Var})^{1/2}$
Here, T(X*i) is the median of bootstrap sample i. Since there are 10 bootstrap samples, i=1,…,10. To calculate the variance, one averages the squared medians over the 10 bootstrap samples and then subtracts the square of the average median across the 10 samples.
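As a rough check on the arithmetic, here is a small Python sketch of the variance formula. The list of medians is hypothetical (I am not reproducing the ten medians from my example), and `bootstrap_variance` is just a helper name chosen for this illustration.

```python
import math


def bootstrap_variance(stats):
    """Mean of the squared statistics minus the square of their mean."""
    b = len(stats)
    mean_of_squares = sum(t * t for t in stats) / b
    square_of_mean = (sum(stats) / b) ** 2
    return mean_of_squares - square_of_mean


# Hypothetical medians from B = 10 bootstrap samples (illustrative only).
medians = [22, 24, 21, 25, 22, 23, 20, 26, 22, 23]

var = bootstrap_variance(medians)
sd = math.sqrt(var)
print("bootstrap variance:", round(var, 2))
print("bootstrap S.D.:", round(sd, 2))
```

Note that this formula divides by B rather than B-1, so it matches the "population" form of the variance.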
- Politis, Dimitris N. “Computer-intensive methods in statistical analysis.”