Suppose that a college offers 65 classes in a given semester, with the following distribution of sizes:

    size    count
     5- 9      8
    10-14      8
    15-19     14
    20-24      4
    25-29      6
    30-34     12
    35-39      8
    40-44      3
    45-49      2

If you ask the college for the average class size, it would report the mean of this distribution.
But if you survey a group of students, ask them how many students are in their classes, and compute the mean, you would think the average class size was bigger. For each class size, x, we multiply the probability by x, the number of students who observe that class size. The result is a new Pmf that represents the biased distribution.
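The computation just described can be sketched in plain Python, using the bin midpoints from the table above as representative class sizes (the dictionary-based Pmf here is an illustrative stand-in, not the book's Pmf class):

```python
# Class-size bins represented by their midpoints, with counts from the table.
sizes = {7: 8, 12: 8, 17: 14, 22: 4, 27: 6, 32: 12, 37: 8, 42: 3, 47: 2}

total = sum(sizes.values())                      # 65 classes
pmf = {x: count / total for x, count in sizes.items()}

# Bias: multiply each probability by x, the number of students who
# observe a class of size x, then renormalize.
biased = {x: p * x for x, p in pmf.items()}
norm = sum(biased.values())
biased = {x: p / norm for x, p in biased.items()}

actual_mean = sum(x * p for x, p in pmf.items())     # about 23.7
biased_mean = sum(x * p for x, p in biased.items())  # about 29.1
```

Dividing each biased probability by x and renormalizing inverts the operation and recovers the actual distribution.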
In the biased distribution there are fewer small classes and more large ones, and the mean of the biased distribution is substantially higher than the actual mean. (Figure: Distribution of class sizes, actual and as observed by students.) It is also possible to invert this operation: given the biased distribution observed by students, you can estimate the actual distribution of class sizes. An alternative is to choose a random sample of students and ask how many students are in their classes. You can also provide row names. The set of row names is called the index; the row names themselves are called labels.
If you know the integer position of a row, rather than its label, you can use the iloc attribute, which also returns a Series.
My advice: if your rows have labels that are not simple integers, use the labels consistently and avoid using integer positions. Exercises Solutions to these exercises are in chap03soln. Something like the class size paradox appears if you survey children and ask how many children are in their family.
Families with many children are more likely to appear in your sample, and families with no children have no chance to be in the sample.
Now compute the biased distribution we would see if we surveyed the children and asked them how many children under 18, including themselves, are in their household. Plot the actual and biased distributions, and compute their means. As a starting place, you can use chap03ex. To test these methods, check that they are consistent with the methods Mean and Var provided by Pmf. To address this version of the question, select respondents who have at least two babies and compute pairwise differences.
Does this formulation of the question yield a different result? Hint: use nsfg. In most foot races, everyone starts at the same time. If you are a fast runner, you usually pass a lot of people at the beginning of the race, but after a few miles everyone around you is going at the same speed.
When I ran a long-distance relay race for the first time, I noticed an odd phenomenon: when I overtook another runner, I was usually much faster, and when another runner overtook me, he was usually much faster. Then I realized that I was the victim of a bias similar to the effect of class size. The race was unusual in two ways: it used a staggered start, so teams started at different times; also, many teams included runners at different levels of ability. As a result, runners were spread out along the course with little relationship between speed and location.
When I joined the race, the runners near me were pretty much a random sample of the runners in the race. So where does the bias come from? I am more likely to catch a slow runner, and more likely to be caught by a fast runner. But runners at the same speed are unlikely to see each other.
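This bias can be sketched numerically: the chance of encountering another runner is roughly proportional to the difference between your speed and theirs. The speeds and counts below are hypothetical, not from the relay dataset:

```python
def observed_speeds(speed_counts, runner_speed):
    """Weight each speed by |speed - runner_speed|: the chance of
    overtaking or being overtaken is roughly proportional to the
    difference in speed, and runners at your own speed weigh zero."""
    weighted = {s: c * abs(s - runner_speed) for s, c in speed_counts.items()}
    total = sum(weighted.values())
    return {s: w / total for s, w in weighted.items()}

# Hypothetical field of runners: counts of runners at each speed in mph.
field = {5: 10, 6: 40, 7: 80, 8: 40, 9: 10}
biased = observed_speeds(field, runner_speed=7)  # biased[7] is 0.0
```

In the biased view, the most common speed in the field (your own) is the one you almost never observe.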
To test your function, you can use relay. Compute the distribution of speeds you would observe if you ran a relay race at 7. The code for this chapter is in cumulative. But as the number of values increases, the probability associated with each value gets smaller and the effect of random noise increases. For example, we might be interested in the distribution of birth weights. Figure shows the PMF of these values for first babies and others. Overall, these distributions resemble the bell shape of a normal distribution, with many values near the mean and a few values much higher and lower.
But parts of this figure are hard to interpret. There are many spikes and valleys, and some apparent differences between the distributions.
It is hard to tell which of these features are meaningful. Also, it is hard to see overall patterns; for example, which distribution do you think has the higher mean? These problems can be mitigated by binning the data; that is, dividing the range of values into non-overlapping intervals and counting the number of values in each bin. Binning can be useful, but it is tricky to get the size of the bins right.
If they are big enough to smooth out noise, they might also smooth out useful information. An alternative that avoids these problems is the cumulative distribution function (CDF), which is the subject of this chapter.
Percentiles If you have taken a standardized test, you probably got your results in the form of a raw score and a percentile rank. In this context, the percentile rank is the percentage of people who scored lower than you or the same. If you are given a value, it is easy to find its percentile rank; going the other way is slightly harder.
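Both directions can be sketched in a few lines of plain Python; the exam scores here are a hypothetical example, and this straightforward Percentile is deliberately simple rather than efficient:

```python
def percentile_rank(scores, your_score):
    """Percentage of scores less than or equal to your_score."""
    count = sum(1 for score in scores if score <= your_score)
    return 100.0 * count / len(scores)

def percentile(scores, rank):
    """Smallest score whose percentile rank is at least `rank`."""
    for score in sorted(scores):
        if percentile_rank(scores, score) >= rank:
            return score

scores = [55, 66, 77, 88, 99]  # hypothetical exam scores
percentile_rank(scores, 88)    # 80.0
percentile(scores, 50)         # 77
```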
The result of this calculation is a percentile. For example, the 50th percentile is the value with percentile rank 50. This implementation of Percentile is not efficient; a better approach is to use the percentile rank to compute the index of the corresponding value directly. To summarize, PercentileRank takes a value and computes its percentile rank in a set of values; Percentile takes a percentile rank and computes the corresponding value. CDFs Now that we understand percentiles and percentile ranks, we are ready to tackle the cumulative distribution function (CDF).
The CDF is the function that maps from a value to its percentile rank. The CDF is a function of x, where x is any value that might appear in the distribution. To evaluate CDF(x) for a particular value of x, we compute the fraction of values in the distribution less than or equal to x.
As an example, suppose we collect a sample with the values [1, 2, 2, 3, 5]. If x is less than the smallest value in the sample, CDF(x) is 0. If x is greater than the largest value, CDF(x) is 1.
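One direct way to implement this definition, applied to the example sample (a sketch, not the book's Cdf class):

```python
def eval_cdf(sample, x):
    """Fraction of values in sample that are <= x."""
    count = sum(1 for value in sample if value <= x)
    return count / len(sample)

sample = [1, 2, 2, 3, 5]
eval_cdf(sample, 0)  # 0.0: below the smallest value
eval_cdf(sample, 2)  # 0.6: three of five values are <= 2
eval_cdf(sample, 5)  # 1.0: at or above the largest value
```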
Figure is a graphical representation of this CDF. The CDF of a sample is a step function. The bracket operator is equivalent to Prob.
Value(p): Given a probability p, computes the corresponding value, x; that is, the inverse CDF of p. The Cdf constructor can take as an argument a list of values, a pandas Series, a Hist, a Pmf, or another Cdf. For example, this code makes and plots the Cdf of pregnancy length for live births:

    cdf = thinkstats2.Cdf(live.prglngth, label='prglngth')
    thinkplot.Cdf(cdf)

One way to read a CDF is to look up percentiles. The CDF also provides a visual representation of the shape of the distribution. Common values appear as steep or vertical sections of the CDF; in this example, the mode at 39 weeks is apparent.
There are few values below 30 weeks, so the CDF in this range is flat. For example, here is code that plots the CDFs of birth weight for first babies and others:

    first_cdf = thinkstats2.Cdf(firsts.totalwgt_lb, label='first')
    other_cdf = thinkstats2.Cdf(others.totalwgt_lb, label='other')
    thinkplot.Cdfs([first_cdf, other_cdf])

Compared to Figure , this figure makes the shape of the distributions, and the differences between them, much clearer. We can see that first babies are slightly lighter throughout the distribution, with a larger discrepancy above the mean.
The Cdf class provides these two methods: PercentileRank(x), which, given a value x, computes its percentile rank; and Percentile(p), which, given a percentile rank p, computes the corresponding value, x. Percentile can be used to compute percentile-based summary statistics. For example, the 50th percentile is the value that divides the distribution in half, also known as the median.
Like the mean, the median is a measure of the central tendency of a distribution, and Percentile(50) is a simple and efficient way to compute it.
Another percentile-based statistic is the interquartile range (IQR), which is a measure of the spread of a distribution. The IQR is the difference between the 75th and 25th percentiles. More generally, percentiles are often used to summarize the shape of a distribution. Random Numbers Suppose we choose a random sample from the population of live births and look up the percentile rank of their birth weights. What do you think the distribution will look like? Then we generate a sample and compute the percentile rank of each value in the sample.
    ranks = [cdf.PercentileRank(x) for x in sample]

Finally we make and plot the Cdf of the percentile ranks:

    rank_cdf = thinkstats2.Cdf(ranks)
    thinkplot.Cdf(rank_cdf)

The CDF is approximately a straight line, which means that the distribution is uniform. That outcome might not be obvious, but it is a consequence of the way the CDF is defined. So, regardless of the shape of the CDF, the distribution of percentile ranks is uniform.
This property is useful, because it is the basis of a simple and efficient algorithm for generating random numbers with a given CDF: choose a percentile rank uniformly from the range 0 to 100, then use Cdf.Percentile to find the value in the distribution that corresponds to the percentile rank you chose.
Cdf provides an implementation of this algorithm, called Random:

    class Cdf:
        def Random(self):
            return self.Percentile(random.uniform(0, 100))

Cdf also provides Sample, which takes an integer, n, and returns a list of n values chosen at random from the Cdf.
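The whole algorithm can be demonstrated end to end in plain Python, independent of the Cdf class (the helper and variable names here are illustrative):

```python
import random

def sample_from_cdf(values, n, seed=None):
    """Draw n values: pick a percentile rank uniformly at random,
    then map it back through the empirical inverse CDF
    (the sorted values)."""
    rng = random.Random(seed)
    sorted_vals = sorted(values)
    draws = []
    for _ in range(n):
        p = rng.uniform(0, 100)
        # Index of the first value whose percentile rank is at least p.
        idx = min(int(p / 100 * len(sorted_vals)), len(sorted_vals) - 1)
        draws.append(sorted_vals[idx])
    return draws

data = [1, 2, 2, 3, 5]
draws = sample_from_cdf(data, 1000, seed=0)
```

Because the percentile ranks are uniform, the draws reproduce the original distribution: the value 2, which makes up 40% of the data, dominates the sample.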
To compare people in different age groups, you can convert race times to percentile ranks. Assuming that my percentile rank in my division stays the same, how much slower should I expect to be as I move into the next age group? I can answer that question by converting my percentile rank in my current division to the corresponding position in the older division; given the number of runners in that division, I would have to come in between 17th and 18th place to have the same percentile rank. Exercises For the following exercises, you can start with chap04ex.
My solution is in chap04soln. How much did you weigh at birth? Using the NSFG data all live births , compute the distribution of birth weights and use it to find your percentile rank.
If you were a first baby, find your percentile rank in the distribution for first babies; otherwise use the distribution for others. If you are in the 90th percentile or higher, call your mother back and apologize. The numbers generated by random.random are supposed to be uniformly distributed between 0 and 1. Generate a sample of numbers from random.random and plot the PMF and CDF of the sample; is the distribution uniform? Glossary percentile rank The percentage of values in a distribution that are less than or equal to a given value.
CDF(x) is the fraction of the sample less than or equal to x. The distributions we have used so far are called empirical distributions because they are based on empirical observations, which are necessarily finite samples.
The alternative is an analytic distribution, which is characterized by a CDF that is a mathematical function. In this context, a model is a simplification that leaves out unneeded details. This chapter presents common analytic distributions and uses them to model data from a variety of sources. The code for this chapter is in analytic. In the real world, exponential distributions come up when we look at a series of events and measure the times between events, called interarrival times.
If the events are equally likely to occur at any time, the distribution of interarrival times tends to look like an exponential distribution. (Figure: CDFs of exponential distributions with various parameters.) As an example, we will look at the interarrival times of births. On December 18 of one year, 44 babies were born in a hospital in Brisbane, Australia.
Figure (left) shows the CDF. It seems to have the general shape of an exponential distribution, but how can we tell? One way is to plot the complementary CDF (CCDF) on a log-y scale: for data from an exponential distribution, the result is a straight line.
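The straight-line test works because of the form of the exponential CDF, stated here in standard notation for reference:

```latex
\mathrm{CDF}(x) = 1 - e^{-\lambda x}, \qquad
\mathrm{CCDF}(x) = e^{-\lambda x}, \qquad
\log \mathrm{CCDF}(x) = -\lambda x \quad (x \ge 0)
```

The last identity is why the CCDF of exponential data is a straight line on a log-y scale, with slope determined by the rate parameter; the mean of the distribution is 1/lambda.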
With the complement option, thinkplot.Cdf computes the complementary CDF before plotting. Figure (right) shows the result. It is not exactly straight, which indicates that the exponential distribution is not a perfect model for this data.
Most likely the underlying assumption—that a birth is equally likely at any time of day—is not exactly true. With that simplification, we can summarize the distribution with a single parameter. The Normal Distribution The normal distribution, also called Gaussian, is commonly used because it describes many phenomena, at least approximately. Its CDF is defined by an integral that does not have a closed form solution, but there are algorithms that evaluate it efficiently.
One of them is provided by SciPy: scipy.stats.norm represents a normal distribution, and its cdf method evaluates the normal CDF; for example, scipy.stats.norm.cdf(0) returns 0.5. This result is correct: the median of the standard normal distribution is 0 (the same as the mean), and half of the values fall below the median, so CDF(0) is 0.5. Figure shows CDFs for normal distributions with a range of parameters. In the previous chapter, we looked at the distribution of birth weights in the NSFG. Figure shows the empirical CDF of weights for all live births and the CDF of a normal distribution with the same mean and variance.
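As an aside, if SciPy is unavailable, the normal CDF can be evaluated with only the standard library via the error-function identity (a sketch; not code from the book):

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Normal CDF via the identity Phi(z) = (1 + erf(z / sqrt(2))) / 2."""
    z = (x - mu) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

normal_cdf(0)  # 0.5, the median of the standard normal
```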
(Figure: CDFs of normal distributions with a range of parameters.) Below the 10th percentile there is a discrepancy between the data and the model; there are more light babies than we would expect in a normal distribution. Normal Probability Plot For the exponential distribution, and a few others, there are simple transformations we can use to test whether an analytic distribution is a good model for a dataset.
For the normal distribution there is no such transformation, but there is an alternative called a normal probability plot. There are two ways to generate a normal probability plot: the hard way and the easy way. The hard way: sort the values in the sample; generate a random sample of the same size from the standard normal distribution and sort it; then plot the sorted values from the sample versus the sorted random values.
If the distribution of the sample is approximately normal, the result is a straight line with intercept mu and slope sigma. To test NormalProbability, I generated some fake samples that were actually drawn from normal distributions with various parameters. Figure shows the results. The lines are approximately straight, with values in the tails deviating more than values near the mean.
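The hard way can be sketched in plain Python (an illustrative stand-in for the book's NormalProbability, with optional seeding added for reproducibility):

```python
import random

def normal_probability(sample, seed=None):
    """Return (xs, ys): a sorted random sample from the standard
    normal, paired with the sorted data sample, ready to plot."""
    rng = random.Random(seed)
    xs = sorted(rng.gauss(0, 1) for _ in sample)
    ys = sorted(sample)
    return xs, ys

# Hypothetical sample of weights in pounds.
xs, ys = normal_probability([2.9, 3.3, 3.1, 3.0, 3.2], seed=1)
```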
It plots a gray line that represents the model and a blue line that represents the data:

    xs, ys = thinkstats2.NormalProbability(weights)
    thinkplot.Plot(xs, ys)

(Figure: Normal probability plot for random samples from normal distributions.) FitLine takes a sequence of xs, an intercept, and a slope; it returns xs and ys that represent a line with the given parameters, evaluated at the values in xs. NormalProbability returns xs and ys that contain values from the standard normal distribution and sorted values from weights.
If the distribution of weights is normal, the data should match the model. Figure shows the results for all live births, and also for full term births (pregnancy length greater than 36 weeks).
Both curves match the model near the mean and deviate in the tails. The heaviest babies are heavier than what the model expects, and the lightest babies are lighter. When we select only full term births, we remove some of the lightest weights, which reduces the discrepancy in the lower tail of the distribution.
This plot suggests that the normal model describes the distribution well within a few standard deviations from the mean, but not in the tails. Whether it is good enough for practical purposes depends on the purposes. (Figure: Normal probability plot of birth weights.) The CDF of the lognormal distribution is the same as the CDF of the normal distribution, with log x substituted for x. If a sample is approximately lognormal and you plot its CDF on a log-x scale, it will have the characteristic shape of a normal distribution.
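Written out explicitly, with Phi denoting the standard normal CDF (standard notation, added here for reference):

```latex
\mathrm{CDF}_{\text{lognormal}}(x; \mu, \sigma)
  = \Phi\!\left(\frac{\log x - \mu}{\sigma}\right), \qquad x > 0
```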
To test how well the sample fits a lognormal model, you can make a normal probability plot using the log of the values in the sample. I was tipped off to this possibility by a comment without citation. Among the data collected are the weights in kilograms of the respondents. Figure (left) shows the distribution of adult weights on a linear scale with a normal model.
Figure (right) shows the same distribution on a log scale with a lognormal model. The lognormal model is a better fit, but this representation of the data does not make the difference particularly dramatic.
Figure shows normal probability plots for adult weights, w, and for their logarithms, log10 w. Now it is apparent that the data deviate substantially from the normal model. The lognormal model is a good match for the data within a few standard deviations of the mean, but it deviates in the tails. I conclude that the lognormal distribution is a good model for this data.
The Pareto Distribution The Pareto distribution is named after the economist Vilfredo Pareto, who used it to describe the distribution of wealth. Since then, it has been used to describe phenomena in the natural and social sciences including sizes of cities and towns, sand particles and meteorites, and forest fires and earthquakes. The CDF of the Pareto distribution is CDF(x) = 1 - (x / xm)^(-alpha) for x >= xm, where xm is the minimum possible value and alpha is a shape parameter. (Figure: Normal probability plots for adult weight on a linear scale (left) and log scale (right).) There is a simple visual test that indicates whether an empirical distribution fits a Pareto distribution: on a log-log scale, the CCDF looks like a straight line.
If you plot the CCDF of a sample from a Pareto distribution on a linear scale, you expect to see a function like y ~ (x / xm)^(-alpha). Taking the log of both sides gives log y ~ -alpha (log x - log xm), which is a straight line with slope -alpha on a log-log scale. The repository also contains populations. (Figure: CDFs of Pareto distributions with different parameters.) Figure shows the CCDF of populations on a log-log scale. Figure shows the CDF of populations and a lognormal model (left), and a normal probability plot (right). Both plots show good agreement between the data and the model.
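The straight-line behavior can be checked numerically: draw a Pareto sample by inverse-CDF sampling and recover the shape parameter from the logs. This is a plain-Python sketch, and the parameter values are illustrative:

```python
import math
import random

def pareto_sample(xm, alpha, n, seed=0):
    """Inverse-CDF sampling: if p is uniform on [0, 1), then
    x = xm * (1 - p) ** (-1 / alpha) follows a Pareto(xm, alpha)."""
    rng = random.Random(seed)
    return [xm * (1 - rng.random()) ** (-1 / alpha) for _ in range(n)]

sample = pareto_sample(xm=0.5, alpha=1.7, n=20000)

# On a log-log scale the CCDF has slope -alpha; equivalently, the
# maximum-likelihood estimate of alpha is n / sum(log(x / xm)).
alpha_hat = len(sample) / sum(math.log(x / 0.5) for x in sample)
```

With 20,000 draws, the estimate of alpha lands close to the true value, which is another way of saying the empirical log-log CCDF is nearly straight with the expected slope.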
Neither model is perfect. Which model is appropriate depends on which part of the distribution is relevant. Two notes about this implementation: I called the parameter lam because lambda is a Python keyword.
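A minimal sketch of the generator these notes describe (the standard inverse-CDF method for the exponential distribution; not necessarily the exact library code):

```python
import math
import random

def expovariate(lam, rng=random):
    """Exponential random variate with rate lam, by inverting the
    CDF 1 - exp(-lam * x)."""
    # rng.random() returns values in [0, 1): p can be 0 but never 1,
    # so 1 - p lies in (0, 1] and log(1 - p) is always defined.
    p = rng.random()
    return -math.log(1 - p) / lam

rng = random.Random(42)
draws = [expovariate(2.0, rng) for _ in range(20000)]  # mean near 1/2
```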
Also, since log 0 is undefined, we have to be a little careful: random.random can return 0 but not 1, so 1 - p can be 1 but not 0, and log(1 - p) is always defined. Why Model? At the beginning of this chapter, I said that many real world phenomena can be modeled with analytic distributions.
For example, an observed distribution might have measurement errors or quirks that are specific to the sample; analytic models smooth out these idiosyncrasies. Analytic models are also a form of data compression.