Standard Deviation vs. Sample Size

An intuitive relationship.

Standard deviation is a statistic that describes how much the values in a data set vary around the mean. Here is the formula, where x represents a data value, x̄ represents the mean, and n represents the number of data values in the set (the sample size):

$$ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} $$

Standard deviation is important because it lets you assess how stable the data are. If the standard deviation is relatively high, the data values are widely spread around the mean, so there is high variability. If it is relatively low, the values sit close to the mean, so there is low variability.
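To make the high-versus-low contrast concrete, here is a minimal Python sketch of the sample standard deviation, assuming the n - 1 form of the formula. The two data sets are made up for illustration; both have a mean of 10, and only their spread differs.

```python
import math
import statistics


def sample_std(values):
    """Sample standard deviation: sqrt(sum((x - mean)^2) / (n - 1))."""
    n = len(values)
    if n < 2:
        # With n = 1 the denominator n - 1 is zero, so the value is undefined.
        raise ValueError("need at least two data values")
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))


# Hypothetical data sets, both with a mean of 10.
tight = [9, 10, 10, 11, 10]   # values close to the mean
spread = [1, 5, 10, 15, 19]   # values far from the mean

print("low variability: ", sample_std(tight))
print("high variability:", sample_std(spread))

# Cross-check against the standard library's sample standard deviation.
print("matches statistics.stdev:",
      math.isclose(sample_std(spread), statistics.stdev(spread)))
```

The second data set gives a standard deviation roughly ten times larger than the first, even though the means are identical.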

For example, suppose you want to know the average number of plants per household in your country, and you consider samples of households of different sizes. Your first sample might be relatively small, say a few houses from your neighborhood. Let’s assume that your community is pretty enthusiastic about gardening and that the climate is close to ideal. The average number of plants per household in your community might then be drastically higher than the national average, which shows up as high variability. This points to sampling bias in your community sample, since you are assessing households with similar levels of gardening enthusiasm in the same climate.

If you increase your sample to cover households across your state or province, you add people with varying opinions on gardening and different climates, so there will be less variability when comparing the state average to the national average. The overall bias in your sample therefore shrinks as the sample size increases.
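One way to probe this intuition is a small simulation. The sketch below is only an illustration under stated assumptions: the population is invented (100,000 households with between 0 and 30 plants each, chosen uniformly), and the samples are drawn at random rather than from one neighborhood, so it shows how much a sample average wobbles at each sample size rather than modelling the neighborhood bias itself.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population: plants per household for 100,000 households.
population = [random.randint(0, 30) for _ in range(100_000)]


def spread_of_sample_means(sample_size, trials=200):
    """Draw `trials` random samples and return the standard deviation
    of their averages, i.e. how much the sample average varies."""
    means = [
        statistics.mean(random.sample(population, sample_size))
        for _ in range(trials)
    ]
    return statistics.stdev(means)


for size in (10, 100, 1000):
    print(f"sample size {size:>4}: spread of averages = "
          f"{spread_of_sample_means(size):.3f}")
```

In this setup, the averages from larger samples cluster more tightly around the population average than the averages from small samples do.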

To test this out mathematically, I decided to graph the standard deviation against the sample size. The problem is that the standard deviation formula contains two variables: x (a data value) and n (the sample size). To avoid the complexity of a three-dimensional plot, I combined x and n into a single variable, since their domains are similar: x ∈ (-∞, +∞), while n ∈ (1, +∞). One of the two domains has to be restricted to match the other. The obvious choice was to restrict the domain of x to that of n, since n must stay greater than 1: a sample size can’t be negative, and a sample size of 1 makes the n - 1 denominator zero, which leaves the expression undefined. My “modification” to the standard deviation function therefore has a limitation, as it leaves out negative x-values, but I think this works for most practical applications.

Here is my modified formula for the standard deviation, where n represents the sample size and S represents the standard deviation function:

$$ S(n) = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} $$

In order to graph this, I needed to tweak the formula further. I set the value of x̄ (the mean, which is a constant) to 1.25, and I replaced the x in the numerator with n.

Here is the graph-ready formula:

$$ S(n) = \sqrt{\frac{\sum_{n=1}^{100} (n - 1.25)^2}{n - 1}} $$

Here is the graph of the modified formula, where sample size is on the horizontal axis (n-axis) and standard deviation is on the vertical axis (S-axis):

[Figure: graph of S (vertical axis) against n (horizontal axis)]

(graphed with Desmos)
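For anyone who wants to reproduce the curve without Desmos, here is a rough Python sketch of the graph-ready formula. It rests on one assumption about how the formula is read: the summation from 1 to 100 runs over its own index (called k here, a name introduced only for this sketch), so the numerator is a fixed number and only the n - 1 in the denominator changes along the horizontal axis. The names MEAN, UPPER_LIMIT, numerator, and S are likewise just illustrative.

```python
import math

MEAN = 1.25        # the constant substituted for the mean
UPPER_LIMIT = 100  # the summation runs from 1 to 100


def numerator(upper_limit=UPPER_LIMIT, mean=MEAN):
    """Sum of (k - mean)^2 for k = 1 .. upper_limit (assumed separate index)."""
    return sum((k - mean) ** 2 for k in range(1, upper_limit + 1))


def S(n, upper_limit=UPPER_LIMIT, mean=MEAN):
    """Modified standard deviation function for a sample size n > 1."""
    if n <= 1:
        # n = 1 makes the denominator zero; smaller n makes it negative.
        raise ValueError("n must be greater than 1")
    return math.sqrt(numerator(upper_limit, mean) / (n - 1))


for n in (2, 5, 10, 100, 1000, 10000):
    print(f"n = {n:>5}: S(n) = {S(n):.3f}")
```

Under this reading the printed values fall as n grows and stay positive for every n, which is the behaviour the graph shows.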


I think this visual representation shows how standard deviation and sample size are related. With a relatively small sample size, the standard deviation is very high, due to lots of potential bias. As the sample size increases, the standard deviation drops steeply at first and then levels off, getting ever closer to 0 without reaching it. Although the overall bias is reduced when you increase the sample size, there will always be some instances where bias could affect the stability of your distribution. This can be expressed by the following limit:

$$ \lim_{n \to \infty} S(n) = 0 $$

I’d also like to mention that I used the summation of sample sizes from 1 to 100 in the formula and in the graph. Whatever range of sample sizes you sum over, the overall shape of the graph is retained: the curve is simply “scaled up” or “scaled down” as more or fewer (n - 1.25)² terms are summed in the numerator.
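The scaling can be written out explicitly. This is a sketch under one assumption: the summation runs over its own index, written k here, with an upper limit N (both symbols are introduced only for this illustration), so the numerator is a constant that does not depend on n.

$$
C_N = \sum_{k=1}^{N} (k - 1.25)^2 = \sum_{k=1}^{N} k^2 - 2.5 \sum_{k=1}^{N} k + 1.5625\,N,
\qquad
S_N(n) = \sqrt{\frac{C_N}{n - 1}} = \frac{\sqrt{C_N}}{\sqrt{n - 1}}
$$

$$
C_{100} = 338350 - 2.5 \cdot 5050 + 1.5625 \cdot 100 = 325881.25,
\qquad
\sqrt{C_{100}} \approx 570.86
$$

Read this way, changing the upper limit only changes the constant factor in front of 1/√(n - 1), so the curve stretches or shrinks vertically while keeping the same shape.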