by Guangming Lang



Data are measurements on a set of items we’re interested in. The complete set is called the population because it includes every such item, for example, all males living in the US in 2014. A population can be small, large, or infinitely large. Most of the time it is large or infinitely large, so measuring every item is expensive or impossible. Statistics was invented to let us say something about the population with confidence by measuring only a small number of items, called a sample.

How a sample is selected directly affects the quality of our inference about the population. A good sampling method produces an unbiased sample that represents the population well. To get one, you have to know, before the sample is drawn, the probability with which each member of the population will be included in it. This probability doesn’t have to be the same for every member. You then draw the sample according to these probabilities, and the result is called a random sample.

It is easy to mistake a sample of convenience for a random sample. For example, the first 10 people who walk by you on a street are NOT a random sample: first, it’s hard to say precisely what the population is, and second, there’s no way to calculate the chance that each person in that population walks by you at that moment. Conclusions about the population drawn from a sample of convenience are likely to be biased. So always ask how the sample was taken before starting the analysis.
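To make the contrast concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the population is 10,000 simulated heights (not real data), stored in ascending order so that "the first items you come across" are systematically unusual, and the helper names `draw_random_sample` and `draw_convenience_sample` are hypothetical. The random sample is a simple random sample, where every item has the same known inclusion probability n/N.

```python
import random

def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

def draw_random_sample(population, n, rng):
    # Simple random sample without replacement: every item has the
    # same, known inclusion probability n / len(population).
    return rng.sample(population, n)

def draw_convenience_sample(population, n):
    # Sample of convenience: take whatever comes first. The inclusion
    # probabilities are not controlled by the person sampling.
    return population[:n]

# Hypothetical population: simulated heights in cm, stored in ascending
# order, so the first items are the shortest ones.
rng = random.Random(42)
population = sorted(rng.gauss(175, 7) for _ in range(10_000))

n = 100
random_sample = draw_random_sample(population, n, rng)
convenience_sample = draw_convenience_sample(population, n)

population_mean = mean(population)
random_error = abs(mean(random_sample) - population_mean)
convenience_error = abs(mean(convenience_sample) - population_mean)

print(f"population mean:         {population_mean:.1f}")
print(f"random sample mean:      {mean(random_sample):.1f}")
print(f"convenience sample mean: {mean(convenience_sample):.1f}")
```

In this setup the convenience sample contains only the shortest items, so its mean falls well below the population mean, while the random sample's mean lands close to it. Real convenience samples are rarely this extreme, but the mechanism is the same: when you don't control who gets into the sample, you can't tell how far off your estimate is.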