Mean, Median and Mode
EE 5440

A CDF and a pdf are completely described by their equations. But we often don't know the equation of the CDF or pdf, or we want the equation summarized in a form that is easier to handle. These less informative summaries are called "measures". There are many different measures. The ones discussed on this page describe what can be called the center, or most likely value, of a random variable. In other words, if you have to replace a random variable with a single number, these measures are in some sense the best number to use. As with many types of measures, there is no single best way to do this. There are some options, and which one you pick often depends on what questions you hope to answer, or what form you want your answer to take. The three measures of center described below are the mean, median and mode.

Mean (Average, Expected Value, Center of Gravity, Center of Mass)

The mean is a very widely used measure, and is also known by the names average and expected value. If a discrete random variable can take on the values y_1, y_2, ..., y_M, and all of them are equally likely, then the mean is the sum of the possible values, divided by M. In this case, the probability of any particular value y_i is 1/M, so we could rewrite the mean as

\[ \text{mean} = \frac{1}{M} \sum_{i=1}^{M} y_i = \sum_{i=1}^{M} y_i \, \frac{1}{M} = \sum_{i=1}^{M} y_i \, P(y_i) \]

The expression on the far right is handy, because it can be used when the probabilities of the various values y_i are not all equal to each other. The more likely values will have a bigger impact on the mean than the less likely values.

The definition of the mean can easily be extended to cover continuous random variables. Simply replace the summation with an integral, and the probability with the pdf:

\[ \text{mean} = \int_{-\infty}^{\infty} y \, f_X(y) \, dy \]

It is easy to spot the mean in some pdf curves. For example, in a Gaussian curve the mean is the peak of the pdf.
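The two definitions of the mean given above (a probability-weighted sum for discrete random variables, an integral against the pdf for continuous ones) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the fair and loaded dice, the Gaussian parameters, and the helper names `discrete_mean` and `continuous_mean` are hypothetical and do not come from the notes, and the integral is approximated with a simple midpoint rule over a finite range.

```python
import math


def discrete_mean(values, probs):
    """Mean of a discrete random variable: the sum of y_i * P(y_i)."""
    if not math.isclose(sum(probs), 1.0):
        raise ValueError("probabilities must sum to 1")
    return sum(y * p for y, p in zip(values, probs))


def continuous_mean(pdf, lo, hi, steps=20_000):
    """Approximate the integral of y * pdf(y) over [lo, hi] (midpoint rule)."""
    dy = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        y = lo + (k + 0.5) * dy  # midpoint of the k-th slice
        total += y * pdf(y) * dy
    return total


def gaussian_pdf(y, mu=2.0, sigma=0.5):
    """Gaussian pdf; mu and sigma are illustrative values only."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


# Hypothetical discrete examples: a fair die, then a die loaded toward 6.
faces = [1, 2, 3, 4, 5, 6]
fair_mean = discrete_mean(faces, [1 / 6] * 6)
loaded_mean = discrete_mean(faces, [0.1, 0.1, 0.1, 0.1, 0.1, 0.5])

# Continuous example: the Gaussian mean should land at the peak, y = mu.
gauss_mean = continuous_mean(gaussian_pdf, -8.0, 12.0)

print(fair_mean, loaded_mean, gauss_mean)
```

With equal probabilities the weighted sum reduces to the plain average of the faces, and loading the die toward 6 pulls the mean upward, which is the "more likely values have a bigger impact" behavior described above.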
In other pdfs, such as the exponential, there is no clear distinguishing feature that identifies the mean. The same is true for the CDF. In a Gaussian CDF, the mean is where the CDF reaches 0.5, but this is not true of CDFs in general. It may not be easy to quickly identify the mean from a pdf or CDF curve.

The mean is sometimes called the center of mass or center of gravity. If the pdf describes the density of a rod along its length, then the mean is the point at which the rod will balance.

Means are easy to calculate, and are clearly defined for both discrete and continuous random variables. While they can be used for any pdf or CDF, they are typically most useful when the pdf is concentrated around one point, and falls off smoothly and monotonically from there.

Means are sometimes thought to be inappropriately influenced by outliers. These are possible outcomes of our experiment which are highly unlikely (so f_X(y) or P(y) is small) but which have very large, or otherwise unusual, values of y. The mean can also be an unreasonable summary of some discrete probability distributions, because it will end up being a value the random variable can never assume.

Outlier Example: I sell coffee (which costs me $0.01 per cup to make) in my shop. The shop can easily handle 1,000 people per day. I'd like to charge them as much as possible, but not so much that they quit coming. There are 10,000 people who pass by my shop each day. If I select one at random, their income is a uniformly distributed random variable between $45,000 / year and $55,000 / year. A business magazine suggests people will pay 1/10,000 of their annual salary for a cup of coffee. The mean income of my customers is $50,000, so I select a price of $5/cup, and business is good. A rich person who makes $1,000,000,000 / year moves to town. Nobody else's income changes.
The 10,001 people who pass by my shop each day now have a mean income near $150,000 / year. Does it make sense to raise the price of my coffee to $15/cup, since the mean income just went up by a factor of 3 when one person moved in?

Answer: Probably not. You may lose nearly all your business, because most people will consider your coffee three times too expensive for them. The rich person might buy one $15 cup per day, but that is not going to make up for the money you lose. The mean is probably a bad measure of the center when you have an outlier like this.

Meaningless Mean Example: If you pick a family at random from my neighborhood, there is a 45% chance they have no children, a 30% chance they have 1, a 15% chance they have 2, and a 10% chance they have 3. The mean number of children is 0(0.45) + 1(0.30) + 2(0.15) + 3(0.10) = 0.9. People will often find it odd if I say the typical family has 0.9 children. While 0.9 is the mean number of children, no family has that number of children, and it does not even make sense to talk about an individual family having a fractional number of children.

Median (Half-life)

There is a 50% chance a random variable will be above its median value, and a 50% chance it will be below it.

\[ P(X > x_{\text{median}}) = P(X < x_{\text{median}}) = 0.5 \]

The median is very easy to spot on a CDF, because

\[ F_X(x_{\text{median}}) = 0.5 \]

It is easy to think about on the pdf curve, but may be a little more difficult to calculate. The median is the point that has one half of the area of the pdf to its right, and one half of the area to its left.

The median is not well defined for some discrete random variables. Consider a binary random variable that has a 25% chance of being 10 and a 75% chance of being 20. Its CDF jumps from 0.25 directly to 1 at the value 20, so there is no point where the CDF equals 0.5.

Medians are useful when you want to limit how much influence a highly unlikely event, at a highly unusual value, has on the measure of center. Return to the outlier example above. When the very rich person moves to town, the median income barely changes.
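The contrast between mean and median in the outlier example can be checked numerically. The sketch below uses Python's standard `statistics` module. As an assumption made only for this illustration, the uniform income distribution is represented by 10,000 evenly spaced incomes between $45,000 and $55,000 rather than by random samples, so the result is repeatable.

```python
import statistics

# Assumption: stand in for the uniform distribution with 10,000 evenly
# spaced incomes between $45,000 and $55,000, one per passer-by.
incomes = [45_000.5 + i for i in range(10_000)]

mean_before = statistics.mean(incomes)
median_before = statistics.median(incomes)

# The rich person moves to town; nobody else's income changes.
incomes_after = incomes + [1_000_000_000.0]

mean_after = statistics.mean(incomes_after)
median_after = statistics.median(incomes_after)

# Pricing rule from the example: 1/10,000 of annual income per cup.
price_from_mean = mean_after / 10_000
price_from_median = median_after / 10_000

print(f"mean:   {mean_before:,.2f} -> {mean_after:,.2f}")
print(f"median: {median_before:,.2f} -> {median_after:,.2f}")
print(f"price from mean:   ${price_from_mean:.2f}")
print(f"price from median: ${price_from_median:.2f}")
```

Under these assumptions the mean roughly triples while the median shifts by well under a dollar, so a price based on the median stays at about $5 per cup.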
Mode

The mode is the most common value of the random variable X. It is easy to spot on the pdf, as it is the value of x where f_X(x) reaches its largest value. Note: we don't care what the largest value of f_X(x) is, we only care about where it occurs. You can also think of this as the location of a local maximum in the pdf. The mode is harder to spot in the CDF, as it is the point of maximum slope.

There may be more than one mode. Distributions with several modes are known as bimodal, trimodal, or in general multimodal distributions. In these cases, the pdf simply has to have a local maximum at a point for that point to qualify as a mode. The pdf may, but does not have to, reach the same value at each of the modes.

For a discrete random variable, the mode is the value that occurs more frequently than any other. There may be multiple values that share this property, in which case it is again a multimodal distribution. Modes are commonly used for discrete random variables, where one wants the answer to be one of the possible values. In these cases, the mode is the location where the CDF has the largest jump.

Return to the meaningless mean example above, where 45% of the families have 0 children, 30% have 1, 15% have 2 and 10% have 3. The mode is 0. That means that if you pick a family at random, all families being equally likely to be selected, you are more likely to select a family with zero children than a family with one child. The same holds if you compare the chances of selecting a family with zero children against a family with two children, or against a family with three. Despite the mode being 0, you are more likely to select a family with one or more children (55%) than a family with no children (45%).
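The statements about the children example can be verified with a short Python sketch. It finds the mode two ways, as the value with the largest probability and as the location of the largest jump in the CDF, and then compares the chance of zero children with the chance of one or more. The dictionary `pmf` and the other variable names are illustrative choices, not notation from the notes.

```python
from itertools import accumulate

# Probabilities from the example: number of children -> probability.
pmf = {0: 0.45, 1: 0.30, 2: 0.15, 3: 0.10}

# Mode: every value whose probability equals the largest probability.
top = max(pmf.values())
modes = [k for k, p in pmf.items() if p == top]

# CDF at each possible value, built by accumulating the probabilities.
values = sorted(pmf)
cdf = list(accumulate(pmf[k] for k in values))

# Jump of the CDF at each value (the first jump starts from zero).
jumps = [cdf[0]] + [b - a for a, b in zip(cdf, cdf[1:])]
largest_jump_at = values[jumps.index(max(jumps))]

# Mean and the chance of at least one child, for comparison with the mode.
mean = sum(k * p for k, p in pmf.items())
p_at_least_one = 1.0 - pmf[0]

print("modes:", modes)
print("largest CDF jump at:", largest_jump_at)
print("mean:", round(mean, 3))
print("P(one or more children):", round(p_at_least_one, 3))
```

Returning a list of modes rather than a single value keeps the sketch usable for multimodal distributions, where several values tie for the largest probability.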