CHAPTER 2. DESCRIPTIVE AND GRAPHICAL STATISTICS 18
2.2.4 The Interquartile Range
The interquartile range of a variable x is the difference between its 75th and 25th percentiles.
IQR(x) = q(x, .75) − q(x, .25).
It is a robust measure of scale which is important in the construction and interpretation of boxplots, discussed below.
All of these measures of scale are valid for comparing the "spread" or variability of numeric variables about a central value. In general, the greater their values, the more spread out the values of the variable are. Of course, the standard deviation, median absolute deviation, and interquartile range of a variable will generally be different numbers, so one must be careful to compare like measures.
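As a small illustration of the definition of the interquartile range, the sketch below computes it from the quartiles and compares the result with R's built-in IQR function. The data vector is hypothetical and is not taken from the text. R's quantile function uses its default interpolation rule here, and other rules can give slightly different quartiles.

```r
# Hypothetical data, invented for illustration (not from the text)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

# 25th and 75th percentiles, using R's default quantile rule
q <- unname(quantile(x, c(0.25, 0.75)))

# Interquartile range from the definition: q(x, .75) - q(x, .25)
iqr_by_hand <- q[2] - q[1]

iqr_by_hand   # from the definition
IQR(x)        # built-in function, same default rule
sd(x)         # a different measure of scale, so a different number
```

The last two lines show why like measures must be compared with like: the IQR and the standard deviation of the same data are not the same number.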
Boxplots are also called box and whisker diagrams. Essentially, a boxplot is a graphical representation of the five number summary. The boxplot below depicts the sensory response data of the preceding section without the log transformation.
> boxplot(reacttimes$Times, horizontal=TRUE, xlab="Reaction Times")
[Figure: horizontal boxplot of the reaction times; horizontal axis "Reaction Times", scaled from 0 to 4.]
The central box in the diagram encloses the middle 50% of the numeric data. Its left and right boundaries mark the first and third quartiles. The boldface middle line in the box marks the median of the data. Thus, the interquartile range is the distance between the left and right boundaries of the central box. For construction of a boxplot, an outlier is defined as a data value whose distance from the nearest quartile is more than 1.5 times the interquartile range. Outliers are indicated by isolated points (tiny circles in this boxplot). The dashed lines extending outward from the quartiles are called the whiskers. They extend from the quartiles to the most extreme values in either direction that are not outliers.
This boxplot shows a number of interesting things about the response time data.
(a) The median is about 1.5. The interquartile range is slightly more than 1.
(b) The three largest values are outliers. They lie a long way from most of the data. They might call for special investigation or explanation.
(c) The distribution of values is not symmetric about the median. The values in the lower half of the data are more crowded together than those in the upper half. This is shown by comparing the distances from the median to the two quartiles, by the lengths of the whiskers and by the presence of outliers at the upper end.
The asymmetry of the distribution of values is also evident in the histogram of the preceding section.
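The outlier rule and the whisker endpoints described above can be checked numerically. The following is a minimal sketch on hypothetical data (not the reaction time data). It applies the 1.5 × IQR rule directly and then compares the result with boxplot.stats, the function R's boxplot relies on. Note that boxplot.stats takes its box boundaries from the hinges returned by fivenum rather than from quantile, so for some sample sizes the two approaches differ slightly. For this vector they agree.

```r
# Hypothetical data with one unusually large value (not from the text)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 30)

# First and third quartiles and the interquartile range
q <- unname(quantile(x, c(0.25, 0.75)))
iqr <- q[2] - q[1]

# A value is an outlier if it lies more than 1.5 * IQR beyond the nearest quartile
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr
outliers <- x[x < lower_fence | x > upper_fence]

# Whiskers reach the most extreme values that are not outliers
whiskers <- range(x[x >= lower_fence & x <= upper_fence])

outliers               # 30
whiskers               # 1 and 8
fivenum(x)             # minimum, lower hinge, median, upper hinge, maximum
boxplot.stats(x)$out   # the points boxplot() would draw as outliers
```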
1. Find the variance and standard deviation of the response time data. Treat it as a sample from a larger population.
2. Find the interquartile range and the median absolute deviation for the response time data.
3. In the response time data, replace the value x_40 = 2.32 by 232.0. Recalculate the standard deviation, the interquartile range and the median absolute deviation and compare with the answers from problems 1 and 2.
4. Make a boxplot of the log-transformed reaction time data. Is the transformed data more symmetrically distributed than the original data?
5. Show that the function g(c) in section 2.2.3 is minimized when c = µ(x). Hint: Minimize g(c)^2.
6. Find the variance, standard deviation, IQR, mean absolute deviation and median absolute deviation of the variable "Ozone" in the data set "airquality". Use R or RStudio. You can address the variable Ozone directly if you attach the airquality data frame to the search path as follows:
> attach(airquality)
The R functions you will need are "sd" for standard deviation, "var" for variance, "IQR" for the interquartile range, and "mad" for the median absolute deviation. There is no built-in function in R for the mean absolute deviation, but it is easy to obtain it.
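One possible way to obtain the mean absolute deviation is sketched below on a hypothetical vector, not on the airquality data. It assumes the mean absolute deviation is taken about the mean; adjust the center if section 2.2.3 defines it differently. Two practical points apply when working with real data: by default R's mad multiplies the median absolute deviation by 1.4826, so constant = 1 is needed to get the unscaled value, and Ozone contains missing values, so na.rm = TRUE is required there.

```r
# Hypothetical data with one missing value (not the airquality data)
x <- c(2, 4, 4, 5, 10, NA)

# Mean absolute deviation, assumed here to be taken about the mean
center <- mean(x, na.rm = TRUE)
mean_abs_dev <- mean(abs(x - center), na.rm = TRUE)

# Median absolute deviation without R's default scaling factor of 1.4826
med_abs_dev <- mad(x, constant = 1, na.rm = TRUE)

mean_abs_dev   # 2
med_abs_dev    # 1
```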
2.3 Jointly Distributed Variables
When two or more variables are jointly distributed, or jointly observed, it is important to understand how they are related and how closely they are related. We will first consider the case where one variable is numeric and the other is a factor.
2.3.1 Side by Side Boxplots
Boxplots are particularly useful in quickly comparing the values of two or more sets of numeric data with a common scale of measurement and in investigating the relationship between a factor variable and a numeric variable. The figure below compares placement test scores for each of the letter grades in a sample of 179 students who took a particular math course in the same semester under the same instructor. The two jointly observed population variables are the placement test score and the letter grade received. The figure separates test scores according to the letter grade and shows a boxplot for each group of students. One would expect to see a decrease in the median test score as the letter grade decreases and that is confirmed by the picture. However, the decrease in median test scores from a letter grade of B to a grade of F is not very dramatic, especially compared to the size of the IQRs. This suggests that the placement test is not especially good at predicting a student’s final grade in the course. Notice the two outliers. The outlier for the "W" group is clearly a mistake in recording data because the scale of scores only went to 100.
[Figure: side-by-side boxplots of placement test scores, one for each letter grade A, B, C, D, F and W.]
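A figure of this kind can be produced with R's formula interface to boxplot. The sketch below uses a small invented data frame, since the placement test data are not reproduced in the text. The names scores, score and grade are hypothetical.

```r
# Hypothetical scores and grades, invented for illustration
scores <- data.frame(
  score = c(88, 92, 75, 81, 70, 64, 78, 55, 60, 67),
  grade = factor(c("A", "A", "A", "B", "B", "B", "C", "C", "C", "C"))
)

# One boxplot per level of the factor, drawn on a common scale
boxplot(score ~ grade, data = scores,
        xlab = "Letter grade", ylab = "Placement test score")

# Group medians, the quantity compared across grades in the discussion above
group_medians <- tapply(scores$score, scores$grade, median)
group_medians
```

Comparing the differences between the group medians with the heights of the boxes gives the same kind of reading as in the text: a shift in medians matters only relative to the spread within each group.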
2.3.2 Scatterplots
Suppose x and y are two jointly distributed numeric variables. Whether we consider the entire population or a sample from the population, we have the same number n of observed values for each variable. If we plot the n points (x_1, y_1), (x_2, y_2), . . . , (x_n, y_n) in a Cartesian plane, we obtain a scatterplot or a scatter diagram of the two variables. Below are the first 6 rows of the "Payroll" data set. The column labeled "payroll" is the total monthly payroll in thousands of dollars for each company listed. The column "employees" is the number of employees in each company and "industry" indicates which of two related industries the company is in. A scatterplot of all 50 values of the two variables "payroll" and "employees" is also shown.
  payroll employees industry
1  190.67        85        A
2  233.58       109        A
3  244.04       130        B
4  351.41       166        A
5  298.60       154        B
6  241.43       124        B
[Figure: scatterplot of payroll against employees for all 50 companies.]
The scatterplot shows that in general the more employees a company has, the higher its monthly payroll. Of course this is expected. It also shows that the relationship between the number of employees and the payroll is quite strong. For any given number of employees, the variation in payrolls for that number is small compared to the overall variation in payrolls for all employment levels. In this plot, the data from industry A is in black and that from industry B is red. The plot shows that for employees ≥ 100, payrolls for industry A are generally greater than those for industry B at the same level of employment.
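A scatterplot colored by industry can be drawn along the following lines. This is only a sketch: it uses the six rows printed above rather than the full 50-row data set, the data frame name payroll_head is hypothetical, and placing employees on the horizontal axis is an assumption.

```r
# First six rows of the Payroll data as printed in the text
payroll_head <- data.frame(
  payroll   = c(190.67, 233.58, 244.04, 351.41, 298.60, 241.43),
  employees = c(85, 109, 130, 166, 154, 124),
  industry  = factor(c("A", "A", "B", "A", "B", "B"))
)

# Industry A in black, industry B in red, as described in the text
point_col <- ifelse(payroll_head$industry == "A", "black", "red")

# Employees on the horizontal axis, payroll on the vertical axis
plot(payroll_head$employees, payroll_head$payroll,
     col = point_col, pch = 19,
     xlab = "employees", ylab = "payroll")
```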
2.3.3 Covariance and Correlation
If x and y are jointly distributed numeric variables, we define their covariance as
\[ \mathrm{cov}(x, y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu(x))(y_i - \mu(y)). \]
If x and y come from samples of size n rather than the whole population, replace the denominator n by n − 1 and the population means µ(x), µ(y) by the sample means x̄, ȳ to obtain the sample covariance. The sign of the covariance reveals something about the relationship between x and y. If the covariance is negative, values of x greater than µ(x) tend to be accompanied by values of y less than µ(y). Values of x less than µ(x) tend to go with values of y greater than µ(y), so x and y tend to deviate from their means in opposite directions. If cov(x, y) > 0, they tend to deviate in the same direction. The strength of these tendencies is not expressed by the covariance because its magnitude depends on the variability of each of the variables about its mean. To correct this, we divide each deviation in the sum by the standard deviation of the variable. The resulting quantity is called the correlation between x and y:
\[ \mathrm{cor}(x, y) = \frac{\mathrm{cov}(x, y)}{\mathrm{sd}(x)\,\mathrm{sd}(y)}. \]
The correlation between payroll and employees in the example above is 0.9782 (97.82 %).
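The definitions of covariance and correlation can be traced step by step in R. The sketch below uses only the six rows of the Payroll data printed earlier, so its correlation will not reproduce the value 0.9782 quoted for all 50 companies. R's cov and sd functions use the sample form with divisor n − 1. In the correlation the divisors cancel, so the population and sample forms give the same value provided the same divisor is used throughout.

```r
# Six rows of the Payroll data as printed in the text
payroll   <- c(190.67, 233.58, 244.04, 351.41, 298.60, 241.43)
employees <- c(85, 109, 130, 166, 154, 124)
n <- length(payroll)

# Products of the deviations from the means
dev_prod <- (employees - mean(employees)) * (payroll - mean(payroll))

cov_pop    <- sum(dev_prod) / n         # population form, divisor n
cov_sample <- sum(dev_prod) / (n - 1)   # sample form, divisor n - 1

# Correlation: covariance divided by the product of the standard deviations
cor_by_hand <- cov_sample / (sd(employees) * sd(payroll))

cov_sample; cov(employees, payroll)     # these two agree
cor_by_hand; cor(employees, payroll)    # and so do these
```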
Theorem 2.1. The correlation between x and y satisfies −1 ≤ cor(x, y) ≤ 1. Moreover, cor(x, y) = 1 if and only if there are constants a and b > 0 such that y = a + bx, and cor(x, y) = −1 if and only if y = a + bx with b < 0.
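The statement of the theorem can be illustrated, though of course not proved, with a quick numerical check. The vector x and the constants 3 and 2 below are arbitrary choices made for this sketch.

```r
# Arbitrary illustrative values
x <- c(1, 2, 4, 7, 11)

y_up   <- 3 + 2 * x   # y = a + bx with b > 0
y_down <- 3 - 2 * x   # y = a + bx with b < 0

cor(x, y_up)     # 1, up to rounding error
cor(x, y_down)   # -1, up to rounding error
cor(x, x^2)      # increasing in x but not linear, so strictly less than 1
```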
A correlation close to 1 indicates a strong positive relationship between x and y (they tend to vary in the same direction from their means), while a correlation close to −1 indicates a strong negative relationship. A correlation close to 0 indicates that there is little or no linear relationship between x and y. In this case, x and y are said to be (nearly) uncorrelated. There might still be a relationship between x and y, but it would be nonlinear. The picture below shows a scatterplot of two variables that are clearly related but very nearly uncorrelated.
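A simple constructed example of the same phenomenon, separate from the data in the picture: if x is symmetric about 0 and y = x^2, then y is completely determined by x, yet the correlation is zero. The vectors below are hypothetical.

```r
# y is an exact function of x, but the relationship is not linear
x <- -3:3
y <- x^2

cor(x, y)   # 0: the positive and negative deviation products cancel
```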