Statistical Concepts Useful in Life
Is statistics a dry subject best forgotten once you graduate? Not at all — many of its concepts are useful in life:
Variance
Variance captures how much change is expected. Being an employee has low variance, since you get paid the same amount every month. Being a consultant has high variance: you might earn a lot in one month and nothing for months afterward. Founding a startup has extreme variance: you earn nothing for years, but if the startup hits the jackpot, you never have to work again.
You should opt for a low-variance strategy (e.g., a job) if you're feeling stressed or find life too risky, and a high-variance strategy (e.g., consulting) if you're ambitious and a job feels boring.
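As a quick illustration, here's a sketch with made-up monthly incomes (in thousands of rupees): both streams average out to the same amount, but their variances are worlds apart.

```python
import statistics

# Hypothetical monthly incomes over a year (illustrative numbers).
employee = [100] * 12                                       # same salary every month
consultant = [0, 0, 400, 0, 0, 0, 500, 0, 0, 0, 300, 0]     # feast or famine

for label, income in [("employee", employee), ("consultant", consultant)]:
    print(label,
          "mean:", statistics.mean(income),
          "variance:", round(statistics.pvariance(income)))
# Both have a mean of 100, but the employee's variance is 0
# while the consultant's is about 31667.
```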
Robust estimator
An estimator is some function like a mean or median that takes multiple inputs and gives one output that represents a typical value. For example, if you have the weights of 10 cars, an estimator gives you one number that tells you how heavy a typical car is.
A robust estimator is less likely to give a wrong conclusion when a few data points are wrong. Suppose you want to determine the average height of a person, so you survey 3 people (just 3, to keep this example simple). You've noted down their heights in feet as (5, 5, 6). Both the median and the mean work reasonably well here to determine a typical height: 5 and 5.3 feet respectively. Now suppose 6 was mistyped as 60, or one respondent deliberately entered an unreasonable number to mess with you. Now the data set is (5, 5, 60). The mean is now 23 feet, leading to the ridiculous conclusion that a typical person is 23 feet tall! By contrast, the median is still 5 feet, which is a reasonable conclusion. That's why we say the median is more robust than the mean: it's less affected by one bad data point. This is similar to how a robust server can't be brought down by one bad user; the more robust the server, the more bad users it can handle. Similarly, the more robust an estimator, the more bad inputs it can handle while still giving a reasonable output.
The max is even less robust: in this case, it returns 60, which is even further from the truth than the mean's 23!
So the median is more robust than the mean, which in turn is more robust than the min or max.
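Here's the height example above in code, comparing how each estimator reacts to the one bad data point:

```python
import statistics

clean = [5, 5, 6]        # heights in feet
corrupted = [5, 5, 60]   # 6 mistyped as 60

for label, data in [("clean", clean), ("corrupted", corrupted)]:
    print(label,
          "mean:", round(statistics.mean(data), 1),
          "median:", statistics.median(data),
          "max:", max(data))
# clean     mean: 5.3  median: 5  max: 6
# corrupted mean: 23.3 median: 5  max: 60
```

The median shrugs off the typo; the mean and max swallow it whole.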
The median is the most robust of these estimators: up to (just under) half of the values can be wrong before it gives an arbitrarily bad answer. Going back to the example of people's heights, take the data set (5, 5.1, 5.2, 5.8, 6, 6.1). Since it has an even number of values, the median is the average of the middle two values, 5.2 and 5.8, which is 5.5. Now if you changed the first two and last two values to ridiculous numbers, as in (0, 0.1, 5.2, 5.8, 90, 900), the median would still be 5.5.
The Nth percentile is more robust than the Mth percentile if N is closer to 50 than M is. So the 50th percentile is more robust than the 60th.
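One way to see this, using a made-up data set of 1 to 100 (so the Nth percentile is roughly N): corrupt the top 15 values, and the 90th percentile blows up while the 50th doesn't budge.

```python
import statistics

clean = list(range(1, 101))             # 1..100, so the Nth percentile is ~N
corrupted = clean[:85] + [9999] * 15    # top 15 values replaced with junk

for label, data in [("clean", clean), ("corrupted", corrupted)]:
    pct = statistics.quantiles(data, n=100)  # 99 cut points; pct[49] is the 50th percentile
    print(label, "-> 50th:", pct[49], "90th:", pct[89])
# clean     -> 50th: 50.5  90th: 90.9
# corrupted -> 50th: 50.5  90th: 9999.0
```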
The more the data points, the less the error
You calculate an estimate as f(input), where f is the estimator. To get an accurate estimate, you want both a good estimator and good input. And the more data points you feed in, the more their individual errors tend to cancel out, so the estimate gets closer to the truth.
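Here's a small simulation sketch of this, assuming we measure a "true" average height of 5.5 feet with random noise: the more measurements we average, the smaller the typical error.

```python
import random
import statistics

random.seed(0)
TRUE_HEIGHT = 5.5   # assumed "true" average height in feet
TRIALS = 200        # repeat each experiment many times to get a typical error

for n in [3, 30, 300]:
    errors = []
    for _ in range(TRIALS):
        # Each measurement is the true height plus random noise.
        sample = [TRUE_HEIGHT + random.gauss(0, 0.5) for _ in range(n)]
        errors.append(abs(statistics.mean(sample) - TRUE_HEIGHT))
    print(f"{n:>3} data points -> typical error: {statistics.mean(errors):.3f}")
# The error shrinks as n grows (roughly as 1/sqrt(n)).
```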
Coefficient of variation
is an indication of how much the measured values in a data set differ from the actual values. (Strictly speaking, the textbook coefficient of variation is the standard deviation divided by the mean; the formula below is a root-mean-square error divided by the mean, but let's stick with the name CV for this discussion.) Its formula is
sqrt(average((readings[i] - actual[i])^2)) / mean(actual)
The lower the CV, the more accurate your measurement. The lowest possible CV is 0, which means every reading is 100% accurate. Say you're trying to measure how accurate a thermometer is, and you take three readings: (24, 25, 27). If the actual temperatures at those times were indeed (24, 25, 27) respectively, as determined using an accurate reference thermometer, then you have a perfect thermometer whose CV is 0.
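Here's the formula above in code, applied to the thermometer example (I'm reading "mean" as the mean of the actual values):

```python
import math
import statistics

readings = [24, 25, 27]   # thermometer under test
actual = [24, 25, 27]     # from an accurate reference thermometer

# Root-mean-square error of the readings, divided by the mean of the
# actual values, per the formula above.
rms_error = math.sqrt(statistics.mean(
    (r - a) ** 2 for r, a in zip(readings, actual)))
cv = rms_error / statistics.mean(actual)
print(cv)   # 0.0 -- a perfect thermometer
```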
You can’t calculate CV unless you know what the actual values are. In the above example, if I gave you the readings of the thermometer (24, 25, 27) without telling you what the actual temperatures were, and asked you to calculate the CV, you wouldn’t be able to. You can’t tell how accurate the thermometer is without comparing it to a reference thermometer.
Standard deviation
is an indication of how different the values in a data set are from each other. (24, 25, 27) has a higher standard deviation than (25, 25, 25), whose standard deviation is 0, since its values don't differ from each other at all. This is the lowest possible standard deviation.
Keep in mind that the standard deviation doesn't tell you how accurate the data is. For example, a thermometer that returns (25, 25, 25) is less accurate than one that returns (25, 26, 27) if the actual temperatures were (25, 28, 29). The second thermometer's readings are closer to the real temperatures, while the first thermometer's readings are closer to each other. These are two different concepts: CV (as defined above) measures closeness to the truth, while standard deviation measures how close the readings are to each other.
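To make the contrast concrete, here's a sketch computing both numbers for the two thermometers, using the RMS-error part of the CV formula as the accuracy measure:

```python
import math
import statistics

actual = [25, 28, 29]
thermo_a = [25, 25, 25]   # consistent readings, but inaccurate
thermo_b = [25, 26, 27]   # readings vary more, but closer to the truth

for label, readings in [("A", thermo_a), ("B", thermo_b)]:
    sd = statistics.pstdev(readings)                      # spread among readings
    rms_error = math.sqrt(statistics.mean(
        (r - a) ** 2 for r, a in zip(readings, actual)))  # distance from truth
    print(f"thermometer {label}: std dev = {sd:.2f}, RMS error = {rms_error:.2f}")
# thermometer A: std dev = 0.00, RMS error = 2.89
# thermometer B: std dev = 0.82, RMS error = 1.63
```

Thermometer A has the lower standard deviation but the higher error: consistency and accuracy are different things.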
Monte Carlo simulation
… is a technique where you simulate thousands of scenarios. Say you're asked, "How much money do you need to retire if you're going to live for 30 more years and spend 1 lakh rupees per month?" This depends on the investment returns each year. A simplistic approach is to assume the return is the same every year. Then you can easily calculate a table of how much corpus is needed for each rate of return, such as "If you're going to get a 10% return, you need 10 crores, but if you're going to get a 6% return, you need 20 crores." But this is not accurate, since returns vary from year to year. You need to model various scenarios, like low investment returns immediately after retirement and high ones later, versus the other way around. This is where a Monte Carlo simulation comes in. You run thousands upon thousands of such scenarios and present a conclusion like "If you retire with 15 crores, you have a 99% chance that the money lasts 30 years", which is a more accurate conclusion than a blanket yes or no. Reality is probabilistic, not binary.
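Here's a minimal sketch of such a simulation, assuming normally distributed annual returns and ignoring inflation; the return numbers are made up for illustration, not a recommendation.

```python
import random

random.seed(1)

CORPUS = 15e7           # 15 crore rupees (assumed starting corpus)
ANNUAL_SPEND = 12e5     # 1 lakh/month = 12 lakh/year (ignoring inflation)
YEARS = 30
SIMULATIONS = 10_000

# Assumed return distribution: 10% mean annual return, 15% standard deviation.
MEAN_RETURN, STDEV_RETURN = 0.10, 0.15

def money_lasts() -> bool:
    """Simulate one 30-year retirement; True if the corpus never runs out."""
    corpus = CORPUS
    for _ in range(YEARS):
        corpus -= ANNUAL_SPEND                                 # spend for the year
        if corpus <= 0:
            return False
        corpus *= 1 + random.gauss(MEAN_RETURN, STDEV_RETURN)  # that year's return
    return True

successes = sum(money_lasts() for _ in range(SIMULATIONS))
print(f"Chance the money lasts {YEARS} years: {successes / SIMULATIONS:.1%}")
```

Each run draws a different sequence of yearly returns, so across 10,000 runs you get a probability, not a single yes or no.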