CHAPTER 2. DESCRIPTIVE AND GRAPHICAL STATISTICS 15
A robust measure of location is one that is not affected by a few extremely large or extremely small values. Values of a numeric variable that lie a great distance from most of the other values are called outliers. Outliers might be the result of mistakes in measuring or recording data, perhaps from misplacing a decimal point. The mean is not a robust location measure: a single outlier can change it substantially if that value is extreme enough. Thus, if there is any doubt about the quality of the data, the median or a trimmed mean might be preferred to the mean as a reliable location measure. The median is very insensitive to outliers. A 5% trimmed mean is insensitive to outliers that make up no more than 5% of the data values.
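The effect of a single recording error on each location measure can be checked numerically. The following Python sketch uses made-up data and a hypothetical helper `trimmed_mean` that drops the stated fraction of observations from each end of the sorted data; that trimming convention is an assumption and may differ from the definition used elsewhere in this chapter.

```python
import statistics


def trimmed_mean(values, fraction):
    """Mean after dropping `fraction` of the observations from each end.

    Hypothetical helper; trimming each tail separately is an assumed convention.
    """
    ordered = sorted(values)
    k = int(len(ordered) * fraction)      # number of values dropped per tail
    kept = ordered[k:len(ordered) - k]
    return statistics.mean(kept)


# Made-up sample of 20 measurements (illustrative only).
clean = [2.1, 2.3, 2.4, 2.4, 2.5, 2.6, 2.6, 2.7, 2.8, 2.8,
         2.9, 3.0, 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.8]

# The same sample with the last value recorded as 380.0 instead of 3.8,
# as if the decimal point had been misplaced.
dirty = clean[:-1] + [380.0]

for label, data in (("clean", clean), ("dirty", dirty)):
    print(label,
          "mean:", round(statistics.mean(data), 3),
          "median:", round(statistics.median(data), 3),
          "5% trimmed mean:", round(trimmed_mean(data, 0.05), 3))
```

In this sketch the one corrupted value is 5% of the sample, so the 5% trimmed mean discards it, and the median is unchanged; only the mean moves far from the bulk of the data.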
2.1.7 The Five Number Summary
The five number summary is a convenient way of summarizing numeric data. The five numbers are the minimum value, the first quartile (25th percentile), the median, the third quartile (75th percentile), and the maximum value. Sometimes the mean is also included, which makes it a six number summary.
Example 2.2. The natural logarithms y of the data values x in Example 1 are, to two places:
-2.12 -1.20 -1.05 -0.99 -0.82 -0.56 -0.49 -0.48 -0.34 -0.22 -0.13 0.02 0.08 0.11 0.12 0.16 0.19 0.21 0.30 0.34 0.35 0.35 0.38 0.40 0.42 0.43 0.47 0.48 0.52 0.54 0.62 0.64 0.65 0.73 0.74 0.77 0.78 0.79 0.83 0.84 0.87 0.90 0.96 1.05 1.23 1.23 1.33 1.38 1.51 1.55
It is sometimes advantageous to transform data in some way, i.e., to define a new variable y as a function of the old variable x. In this case, we have transformed the reaction times x with the natural logarithm transformation. We might want to do this so that we can more easily apply certain statistical inference procedures you will learn about later. The six number summary of the transformed data y is:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-2.12000  0.08605  0.42520  0.33710  0.78500  1.55400
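A six number summary like the one above can be approximated from the rounded values printed in Example 2.2. The Python sketch below uses the standard library's `statistics.quantiles` with `method="inclusive"`, which interpolates linearly between order statistics; that quartile rule is an assumption here, and other accepted rules give slightly different quartiles.

```python
import statistics

# Log-transformed reaction times from Example 2.2, as printed (two decimals).
y = [-2.12, -1.20, -1.05, -0.99, -0.82, -0.56, -0.49, -0.48, -0.34, -0.22,
     -0.13, 0.02, 0.08, 0.11, 0.12, 0.16, 0.19, 0.21, 0.30, 0.34,
     0.35, 0.35, 0.38, 0.40, 0.42, 0.43, 0.47, 0.48, 0.52, 0.54,
     0.62, 0.64, 0.65, 0.73, 0.74, 0.77, 0.78, 0.79, 0.83, 0.84,
     0.87, 0.90, 0.96, 1.05, 1.23, 1.23, 1.33, 1.38, 1.51, 1.55]

# Quartiles by linear interpolation between order statistics (assumed rule).
q1, med, q3 = statistics.quantiles(y, n=4, method="inclusive")

summary = {
    "Min.": min(y),
    "1st Qu.": q1,
    "Median": med,
    "Mean": statistics.mean(y),
    "3rd Qu.": q3,
    "Max.": max(y),
}

for name, value in summary.items():
    print(f"{name:>8} {value:9.5f}")
```

Because the input here is rounded to two places, the quartiles and the mean agree with the summary above only approximately; the minimum, median, and maximum agree to the precision shown in the data.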
2.1.8 The Mode
The mode of a variable is its most frequently occurring value. With numeric variables the mode is less important than the mean and median for descriptive purposes or for statistical inference. For factor variables the mode is the most natural way of choosing a "most representative" value. We hear this frequently in the media, in statements such as "Financial problems are the most common cause of marital strife". For grouped numeric data the modal class interval is the class interval having the highest absolute or relative frequency. In Example 1, the modal class interval is the interval (1,2].
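As a small illustration, the Python sketch below finds the mode of a factor variable and the modal class interval of a grouped numeric variable. Both data sets are made up for this example; in particular, the frequency table is hypothetical and is not the table from Example 1.

```python
from collections import Counter

# Made-up survey responses for a factor variable (illustrative only).
causes = ["finances", "chores", "finances", "in-laws", "finances",
          "chores", "communication", "finances", "chores"]

counts = Counter(causes)
mode_value, mode_count = counts.most_common(1)[0]
print("mode:", mode_value, "count:", mode_count)

# Hypothetical frequency table for grouped numeric data.
# Each key (a, b) stands for the class interval (a, b].
frequencies = {(0, 1): 12, (1, 2): 20, (2, 3): 9, (3, 4): 6, (4, 5): 3}

# The modal class interval is the one with the highest frequency.
modal_interval = max(frequencies, key=frequencies.get)
print("modal class interval:", modal_interval)
```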
1. Find the mean and median of the reaction time data in Example 1.
2. Find the quartiles of the reaction time data. There is more than one acceptable answer.
3. The 40th value x40 of the reaction time data has a value of 2.32. Replace that with 232.0. Recalculate the mean and median. Comment.
4. Construct a frequency table like the one in Example 1 for the log-transformed reaction times of Example 2. Use 5 class intervals of equal length beginning at -3 and ending at 2. Draw an absolute frequency histogram.
5. Estimate the mean and median of the grouped log-transformed reaction times by using the techniques discussed in Example 1. Compare your answers to the summary in Example 2.
6. Repeat exercises 1, 2, and the histogram of exercise 4 by using R.
7. Let x be a numeric variable with values $x_1, \ldots, x_{n-1}, x_n$. Let $\bar{x}_n$ be the average of all n values and let $\bar{x}_{n-1}$ be the average of $x_1, \ldots, x_{n-1}$. Show that $\bar{x}_n = \left(1 - \frac{1}{n}\right)\bar{x}_{n-1} + \frac{1}{n}x_n$. What happens if $x_n \to \infty$ while all the other values of x are fixed?
2.2 Measures of Variability or Scale
2.2.1 The Variance and Standard Deviation
Let x be a population variable with values x1, x2, . . . , xn. Some of the values might be repeated. The variance of x is
$$\mathrm{var}(x) = \sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu(x))^2.$$
The standard deviation of x is $\mathrm{sd}(x) = \sigma = \sqrt{\mathrm{var}(x)}$.
When x1, x2, . . . , xn are values of x from a sample rather than the entire population, we modify the definition of the variance slightly, use a different notation, and call these objects the sample variance and standard deviation.
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad s = \sqrt{s^2}.$$
The reason for modifying the definition for the sample variance has to do with its properties as an estimate of the population variance.
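The two definitions differ only in the divisor, n for a population and n − 1 for a sample. The Python sketch below computes both directly from the definitions for a small made-up data set and compares them with the standard library functions `statistics.pvariance` and `statistics.variance`.

```python
import math
import statistics

x = [4.0, 7.0, 7.0, 9.0, 13.0]   # made-up values, illustrative only
n = len(x)
mean = sum(x) / n

# Sum of squared deviations from the mean, shared by both definitions.
ss = sum((xi - mean) ** 2 for xi in x)

pop_var = ss / n          # population variance: divide by n
samp_var = ss / (n - 1)   # sample variance: divide by n - 1

print("population variance:", pop_var, "sd:", math.sqrt(pop_var))
print("sample variance:    ", samp_var, "sd:", math.sqrt(samp_var))

# The standard library implements the same two definitions.
print(statistics.pvariance(x), statistics.variance(x))
```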
Alternate algebraically equivalent formulas for the variance and sample variance are
$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \mu(x)^2,$$
$$s^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right).$$
These are sometimes easier to use for hand computation.
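The equivalence of the definitional and shortcut formulas can be checked on a small example. The Python sketch below uses made-up values; the variable names are illustrative only.

```python
import math

x = [4.0, 7.0, 7.0, 9.0, 13.0]   # made-up values, illustrative only
n = len(x)
mean = sum(x) / n
sum_sq = sum(xi ** 2 for xi in x)   # sum of the squared values

# Population variance: definition versus shortcut formula.
pop_def = sum((xi - mean) ** 2 for xi in x) / n
pop_short = sum_sq / n - mean ** 2

# Sample variance: definition versus shortcut formula.
samp_def = sum((xi - mean) ** 2 for xi in x) / (n - 1)
samp_short = (sum_sq - n * mean ** 2) / (n - 1)

print(pop_def, pop_short)
print(samp_def, samp_short)
```

In floating-point arithmetic the shortcut formulas subtract two numbers of similar size, so they can lose accuracy when the mean is large relative to the standard deviation; for computer work the definitional form is usually the safer choice.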
The standard deviation σ is called a measure of scale because of the way it behaves under linear transformations of the data. If a new variable y is defined by y = a + bx, where a and b are constants, then sd(y) = |b| sd(x). For example, since Fahrenheit and Celsius temperatures are related by F = 32 + 1.8C, the standard deviation of Fahrenheit temperatures is 1.8 times the standard deviation of Celsius temperatures. The transformation y = a + bx can be thought of as a rescaling operation, or a choice of a different system of measurement units, and the standard deviation takes account of it in a natural way.
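The scaling rule sd(y) = |b| sd(x) can be verified numerically. The Python sketch below uses made-up Celsius temperatures and also includes a transformation with a negative b to show why the absolute value is needed.

```python
import math
import statistics

celsius = [12.0, 15.5, 19.0, 21.5, 25.0]        # made-up temperatures
fahrenheit = [32.0 + 1.8 * c for c in celsius]   # y = a + b x with a = 32, b = 1.8

sd_c = statistics.pstdev(celsius)
sd_f = statistics.pstdev(fahrenheit)
print("ratio of standard deviations:", sd_f / sd_c)

# With a negative slope the standard deviation is still scaled by |b|.
flipped = [10.0 - 2.0 * c for c in celsius]      # a = 10, b = -2
sd_flipped = statistics.pstdev(flipped)
print("sd(flipped) / sd(celsius):", sd_flipped / sd_c)
```

The additive constant a shifts every value by the same amount and so has no effect on the standard deviation.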
2.2.2 The Coefficient of Variation
For a variable that has only positive values, it may be more important to measure the relative variability than the absolute variability. That is, the amount of variation should be compared to the mean value of the variable. The coefficient of variation for a population variable is defined as
$$\mathrm{cv}(x) = \frac{\mathrm{sd}(x)}{\mu(x)}.$$
For a sample of values of x we substitute the sample standard deviation s and the sample average x̄.
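Because it is a ratio of two quantities in the same units, the coefficient of variation does not change when the measurement unit changes by a positive factor. The Python sketch below illustrates this with made-up weights; `coefficient_of_variation` is a hypothetical helper that uses the sample versions s and x̄.

```python
import math
import statistics


def coefficient_of_variation(values):
    """Sample coefficient of variation s / x-bar (hypothetical helper).

    Intended for variables that take only positive values.
    """
    return statistics.stdev(values) / statistics.mean(values)


weights_kg = [61.0, 68.5, 72.0, 75.5, 83.0]      # made-up data
weights_g = [1000.0 * w for w in weights_kg]     # same data in grams

print("sd in kg:", statistics.stdev(weights_kg))
print("sd in g: ", statistics.stdev(weights_g))
print("cv in kg:", coefficient_of_variation(weights_kg))
print("cv in g: ", coefficient_of_variation(weights_g))
```

The standard deviation changes by the factor of 1000, while the coefficient of variation is the same in both units.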
2.2.3 The Mean and Median Absolute Deviation
Suppose that you must choose a single number c to represent all the values of a variable x as accurately as possible. One measure of the overall error with which c represents the values of x is
$$\sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - c)^2}.$$
In the exercises, you are asked to show that this expression is minimized when c = x̄. In other words, the single number which most accurately represents all the values is, by this criterion, the mean of the variable. Furthermore, the minimum possible overall error, by this criterion, is the standard deviation. However, this is not the only reasonable criterion. Another is
$$h(c) = \frac{1}{n}\sum_{i=1}^{n} |x_i - c|.$$
It can be shown that this criterion is minimized when c = median(x). The minimum value of h(c) is called the mean absolute deviation from the median. It is a scale measure which is somewhat more robust (less affected by outliers) than the standard deviation, but still not very robust. A related very robust measure of scale is the median absolute deviation from the median, or mad:
mad(x) = median(|x−median(x)|).
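The three scale measures discussed in this section can be compared on a small made-up data set containing one outlier. In the Python sketch below, `mean_abs_dev` and `mad` are hypothetical helpers written directly from the definitions above, and the grid search is only a numerical check, not a proof, that h(c) is smallest at the median.

```python
import statistics


def mean_abs_dev(values, c):
    """h(c): the average absolute distance of the values from c."""
    return sum(abs(v - c) for v in values) / len(values)


def mad(values):
    """Median absolute deviation from the median (no rescaling constant)."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])


x = [1.0, 2.0, 3.0, 4.0, 100.0]   # made-up data with one outlier
med = statistics.median(x)

# Check on a grid of candidate values c that h(c) is never below h(median).
grid = [i / 2 for i in range(0, 201)]            # 0.0, 0.5, ..., 100.0
h_at_median = mean_abs_dev(x, med)
never_below = all(mean_abs_dev(x, c) >= h_at_median - 1e-12 for c in grid)

print("h(median):         ", h_at_median)
print("h(mean):           ", mean_abs_dev(x, statistics.mean(x)))
print("minimum at median? ", never_below)
print("standard deviation:", statistics.pstdev(x))
print("mad:               ", mad(x))
```

Here the outlier inflates both the standard deviation and the mean absolute deviation, while the mad reflects only the spread of the four typical values. Note that R's built-in `mad` function multiplies the result by a constant (1.4826 by default), so it matches the definition above only when that constant is set to 1.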