1.4 Computing in Statistics
Even moderately large data sets cannot be managed effectively without a computer and computer software. Furthermore, much of applied statistics is exploratory in nature and cannot be carried out by hand, even with a calculator. Spreadsheet programs, such as Microsoft Excel, are designed to manipulate data in tabular form and have functions for performing the common tasks of statistics. In addition, many add-ins are available, some of them free, for enhancing the graphical and statistical capabilities of spreadsheet programs. Some of the exercises and examples in this text make use of Excel with its built-in data analysis package. Because it is so common in the business world, it is important for students to have some experience with Excel or a similar program.
The disadvantages of spreadsheet programs are their dependence on the spreadsheet data format with cell ranges as input for statistical functions, their lack of flexibility, and their relatively poor graphics. Many highly sophisticated packages for statistics and data analysis are available. Some of
CHAPTER 1. BACKGROUND 9
the best known commercial packages are Minitab, SAS, SPSS, S-PLUS, Stata, and Systat. The package used in this text is called R. It is an open source implementation of the same language used in S-PLUS and may be downloaded free of charge from the R Project's website.
After downloading and installing R, we recommend that you download and install another free package called RStudio, which can be obtained from its developer's website.
RStudio makes importing data into R much easier and makes it easier to integrate R output with other programs. Detailed instructions on using R and RStudio for the exercises will be provided.
Data files used in this course are from four sources. Some are local in origin and come from student or course data at the University of Houston. Others are simulated but made to look as realistic as possible. These and others are available at
http://www.math.uh.edu/~charles/data.
Many data sets are included with R in the datasets package and in other contributed packages. We will refer to them frequently. The main external sources of data are the data archives maintained by the Journal of Statistics Education and the Statistical Science Web.
1. Go to http://www.math.uh.edu/~charles/data. Examine the data set “Air Pollution Filter Noise”. Identify the variables and give their types.
2. Highlight the data in Air Pollution Filter Noise. Include the column headings but not the text preceding the column headings. Copy and paste the data into a plain text file, for example with Notepad in Windows. Import the text file into Excel or another spreadsheet program. Create a new folder or directory named “math3339” and save both files there.
3. Start R by double-clicking on the big blue R icon on your desktop. Click on the File menu at the top of the R GUI window. Select “Change dir...”. In the window that opens next, find the directory where you saved the text file and double-click on its name. Suppose that you named your file “apfilternoise”. (Name it anything you like.) Import the file into R with the command
and display it with the command
Click on the file menu at the top again and select “Exit”. At the prompt to save your workspace, click “Yes”. If you open the folder where your work was saved you will see another big blue R icon. If you double click on it, R will start again and your previously saved workspace will be restored.
If you use RStudio for this exercise, you can import apfilternoise into R by clicking on the “Import Dataset” tab. This will open a window on your file system and allow you to select the file you saved in Exercise 2. The dialog box allows you to rename the data and make other minor changes before importing the data as a data frame in R.
4. If you are using RStudio, click on the “Packages” tab and then the word “datasets”. Find the data set “airquality” and click on it. Read about it. If you are using R alone, type
at the command prompt > in the Console window.
to view the data. Could “Month” and “Day” be considered ordered factors rather than numeric variables?
5. A random experiment consists of throwing a standard 6-sided die and noting the number of spots on the upper face. Describe the sample space of this experiment.
6. An experiment consists of replicating the experiment in Exercise 5 four times. Describe the sample space of this experiment. How many possible outcomes does this experiment have?
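For Exercises 3 and 4, importing and inspecting data at the R prompt might look like the following sketch. It is only an illustration: the file written here is a temporary stand-in with made-up column names and values (not the Air Pollution Filter Noise data), and read.table with header = TRUE is assumed to suit a whitespace-delimited text file with column headings. The names tmp, mydata, and month_f are hypothetical.

```r
# Write a small stand-in text file (hypothetical columns and values)
# so that the example is self-contained.
tmp <- tempfile(fileext = ".txt")
writeLines(c("size type noise",
             "1 A 810",
             "2 B 820",
             "3 A 785"), tmp)

# Import the file as a data frame; header = TRUE treats the first
# line as column names.
mydata <- read.table(tmp, header = TRUE)
print(mydata)   # at the prompt, typing mydata alone does the same
str(mydata)     # variable names and their storage types

# The built-in airquality data frame stores Month and Day as integers.
str(airquality)
head(airquality)

# One way to treat Month as an ordered factor instead of a number.
month_f <- factor(airquality$Month, ordered = TRUE)
levels(month_f)
```

Whether an ordered factor is the better representation depends on how the variable will be used in later analyses.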
Descriptive and Graphical Statistics
A large part of a statistician’s job consists of summarizing and presenting important features of data. Simply looking at a spreadsheet with 1000 rows and 50 columns conveys very little information. Most likely, the user of the data would rather see numerical and graphical summaries of how the values of different variables are distributed and how the variables are related to each other. This chapter concerns some of the most important ways of summarizing data.
2.1 Location Measures
2.1.1 The Mean
Suppose that $x$ is the name of a numeric variable whose values are recorded either for the entire population or for a sample from that population. Let the $n$ recorded values of $x$ be denoted by $x_1, x_2, \ldots, x_n$. These are not necessarily distinct numbers. The mean or average of these values is
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i.$$
When the values of $x$ for the entire population are included, it is customary to denote this quantity by $\mu(x)$ and call it the population mean. The mean is called a location measure partly because it is taken as a representative or central value of $x$. More importantly, it behaves in a certain way if we change the scale of measurement for values of $x$. Imagine that $x$ is temperature recorded in degrees Celsius and we decide to change the unit of measurement to degrees Fahrenheit. If $y_i$ denotes the Fahrenheit temperature of the $i$th individual, then $y_i = 1.8x_i + 32$. In effect, we have defined a new variable $y$ by the equation $y = 1.8x + 32$. The means of the new and old variables have the same relationship as the individual measurements have:
$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i = \frac{1}{n}\sum_{i=1}^{n} (1.8x_i + 32) = 1.8\bar{x} + 32.$$
In general, if $a$ and $b > 0$ are constants and $y = a + bx$, then $\bar{y} = a + b\bar{x}$. Other location measures introduced below behave in the same way.
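The scale-change property of the mean can be checked numerically in R. The sketch below uses a few made-up Celsius temperatures (the values are purely illustrative) and compares the mean of the converted values with the converted mean.

```r
# Hypothetical Celsius temperatures, for illustration only.
x <- c(12.5, 18.0, 21.3, 25.1, 30.2)

# The same temperatures in degrees Fahrenheit.
y <- 1.8 * x + 32

mean(x)               # mean in Celsius
mean(y)               # mean in Fahrenheit
1.8 * mean(x) + 32    # converting the Celsius mean gives the same value

# all.equal() compares numbers up to a small numerical tolerance.
all.equal(mean(y), 1.8 * mean(x) + 32)
```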
CHAPTER 2. DESCRIPTIVE AND GRAPHICAL STATISTICS 12
When there are repeated values of $x$, there is an equivalent formula for the mean. Let the $m$ distinct values of $x$ be denoted by $v_1, \ldots, v_m$. Let $n_i$ be the number of times $v_i$ occurs and let $f_i = n_i/n$. Note that $\sum_{i=1}^{m} n_i = n$ and $\sum_{i=1}^{m} f_i = 1$. Then the average is given by
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{m} n_i v_i = \sum_{i=1}^{m} f_i v_i.$$
The number $n_i$ is the frequency of the value $v_i$ and $f_i$ is its relative frequency.
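As a quick check of the frequency form of the mean, the following sketch tabulates a small made-up vector with repeated values and compares the two calculations. The data and the variable names (tab, v, n_i, f_i) are illustrative assumptions.

```r
# Hypothetical data with repeated values, for illustration only.
x <- c(2, 3, 3, 5, 5, 5, 7, 7)
n <- length(x)

tab <- table(x)                 # counts of each distinct value
v   <- as.numeric(names(tab))   # distinct values v_i
n_i <- as.vector(tab)           # frequencies n_i
f_i <- n_i / n                  # relative frequencies f_i

sum(n_i)          # equals n
sum(f_i)          # equals 1
sum(f_i * v)      # mean computed from the relative frequencies
mean(x)           # agrees with the usual formula
```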
2.1.2 The Median and Other Quantiles
Let $x$ be a numeric variable with values $x_1, x_2, \ldots, x_n$. Arrange the values in increasing order $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$. The median of $x$ is a number $\mathrm{median}(x)$ such that at least half the values of $x$ are $\le \mathrm{median}(x)$ and at least half the values of $x$ are $\ge \mathrm{median}(x)$. This conveys the essential idea, but unfortunately it may define an interval of numbers rather than a single number. The ambiguity is usually resolved by taking the median to be the midpoint of that interval. Thus, if $n$ is odd, $n = 2k+1$, where $k$ is a nonnegative integer,
$$\mathrm{median}(x) = x_{(k+1)},$$
while if $n$ is even, $n = 2k$,
$$\mathrm{median}(x) = \frac{x_{(k)} + x_{(k+1)}}{2}.$$
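The two cases can be written out directly in R and compared with the built-in median function. This is a minimal sketch; the helper name median_by_hand and the sample values are hypothetical.

```r
# Median from the order statistics, following the two cases above.
median_by_hand <- function(x) {
  s <- sort(x)                 # x_(1) <= x_(2) <= ... <= x_(n)
  n <- length(s)
  if (n %% 2 == 1) {
    k <- (n - 1) / 2           # n = 2k + 1
    s[k + 1]
  } else {
    k <- n / 2                 # n = 2k
    (s[k] + s[k + 1]) / 2
  }
}

# Hypothetical values, for illustration only.
odd_x  <- c(7, 1, 4, 9, 3)
even_x <- c(7, 1, 4, 9, 3, 10)

median_by_hand(odd_x)
median(odd_x)
median_by_hand(even_x)
median(even_x)
```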
Let $p \in (0, 1)$ be a number between 0 and 1. The $p$th quantile of $x$ is more commonly known as the $100p$th percentile; e.g., the 0.8 quantile is the same as the 80th percentile. We define it as a number $q(x, p)$ such that the fraction of values of $x$ that are $\le q(x, p)$ is at least $p$ and the fraction of values of $x$ that are $\ge q(x, p)$ is at least $1 - p$. For example, at least 80 percent of the values of $x$ are $\le$ the 80th percentile of $x$ and at least 20 percent of the values of $x$ are $\ge$ its 80th percentile. Again, this may not define a unique number $q(x, p)$. Software packages have rules for resolving the ambiguity, but the details are usually not important.
The median is the 50th percentile, i.e., the 0.5 quantile. The 25th and 75th percentiles are called the first and third quartiles. The 10th, 20th, 30th, etc. percentiles are called the deciles. The median is a location measure as defined in the preceding section.
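In R, the quantile function offers several rules for resolving the ambiguity through its type argument. The sketch below, on made-up data, compares two of them and checks the defining property for one. Only the type 1 result is checked, because interpolating rules such as the default are not guaranteed to satisfy the definition exactly for every $p$.

```r
# Hypothetical values, for illustration only.
x <- c(2, 4, 4, 5, 7, 9, 10, 12, 15, 20)
p <- 0.8

# Two of the rules offered by quantile(); type = 7 is R's default.
q1 <- quantile(x, probs = p, type = 1, names = FALSE)
q7 <- quantile(x, probs = p, type = 7, names = FALSE)
q1
q7

# Check the defining property for the type 1 answer.
mean(x <= q1) >= p        # fraction of values <= q is at least p
mean(x >= q1) >= 1 - p    # fraction of values >= q is at least 1 - p

# Quartiles and the median are special cases.
quantile(x, probs = c(0.25, 0.5, 0.75))
```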
2.1.3 Trimmed Means
Trimmed means of a variable x are obtained by finding the mean of the values of x excluding a given percentage of the largest and smallest values. For example, the 5% trimmed mean is the mean of the values of x excluding the largest 5% of the values and the smallest 5% of the values. In other words, it is the mean of all the values between the 5th and 95th percentiles of x. A trimmed mean is a location measure.
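R's mean function accepts a trim argument giving the fraction of observations to drop from each end. The following sketch, on made-up data with one extreme value at each end, compares it with a calculation by hand. It assumes the number dropped from each end is the integer part of $n$ times the trimming fraction.

```r
# Hypothetical sample of 20 values with one extreme value at each end.
x <- c(-40, 11, 12, 12, 13, 14, 14, 15, 15, 16,
        16, 17, 17, 18, 18, 19, 19, 20, 21, 95)

mean(x)                 # ordinary mean, affected by the two extreme values
mean(x, trim = 0.05)    # 5% trimmed mean

# The same calculation by hand: drop floor(n * trim) values from each end.
n <- length(x)
k <- floor(n * 0.05)
s <- sort(x)
by_hand <- mean(s[(k + 1):(n - k)])
by_hand

# A trimmed mean is a location measure: it follows a change of scale.
y <- 1.8 * x + 32
all.equal(mean(y, trim = 0.05), 1.8 * mean(x, trim = 0.05) + 32)
```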
2.1.4 Grouped Data
Sometimes large data sets are summarized by grouping values. Let $x$ be a numeric variable with values $x_1, x_2, \ldots, x_n$. Let $c_0 < c_1 < \cdots < c_m$ be numbers such that all the values of $x$ are between $c_0$ and $c_m$. For each $i$, let $n_i$ be the number of values of $x$ (including repetitions) that are in the interval $(c_{i-1}, c_i]$, i.e., the number of indices $j$ such that $c_{i-1} < x_j \le c_i$. A frequency table of $x$ is a table showing the class intervals $(c_{i-1}, c_i]$ along with the frequencies $n_i$ with which the data values fall into each interval. Sometimes additional columns are included showing the relative frequencies $f_i = n_i/n$, the cumulative relative frequencies $F_i = \sum_{j \le i} f_j$, and the midpoints of the intervals.
Example 2.1. The data below are 50 measured reaction times in response to a sensory stimulus, arranged in increasing order. A frequency table is shown below the data.
0.12 0.30 0.35 0.37 0.44 0.57 0.61 0.62 0.71 0.80
0.88 1.02 1.08 1.12 1.13 1.17 1.21 1.23 1.35 1.41
1.42 1.42 1.46 1.50 1.52 1.54 1.60 1.61 1.68 1.72
1.86 1.90 1.91 2.07 2.09 2.16 2.17 2.20 2.29 2.32
2.39 2.47 2.60 2.86 3.43 3.43 3.77 3.97 4.54 4.73
Interval   Midpoint   n_i   f_i    F_i
(0,1]      0.5        11    0.22   0.22
(1,2]      1.5        22    0.44   0.66
(2,3]      2.5        11    0.22   0.88
(3,4]      3.5         4    0.08   0.96
(4,5]      4.5         2    0.04   1.00
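A frequency table like this one can be built in R with cut and table. The sketch below is one possible way to do it, not necessarily how the table above was produced; the vector name times and the other variable names are assumptions.

```r
# Reaction times from Example 2.1.
times <- c(0.12, 0.30, 0.35, 0.37, 0.44, 0.57, 0.61, 0.62, 0.71, 0.80,
           0.88, 1.02, 1.08, 1.12, 1.13, 1.17, 1.21, 1.23, 1.35, 1.41,
           1.42, 1.42, 1.46, 1.50, 1.52, 1.54, 1.60, 1.61, 1.68, 1.72,
           1.86, 1.90, 1.91, 2.07, 2.09, 2.16, 2.17, 2.20, 2.29, 2.32,
           2.39, 2.47, 2.60, 2.86, 3.43, 3.43, 3.77, 3.97, 4.54, 4.73)

breaks  <- 0:5                            # class boundaries c_0, ..., c_m
classes <- cut(times, breaks = breaks)    # intervals (c_{i-1}, c_i] by default

n_i <- as.vector(table(classes))          # frequencies
f_i <- n_i / length(times)                # relative frequencies
F_i <- cumsum(f_i)                        # cumulative relative frequencies
mid <- (head(breaks, -1) + tail(breaks, -1)) / 2   # interval midpoints

data.frame(Interval = levels(classes), Midpoint = mid, n_i, f_i, F_i)
```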
If only a frequency table like the one above is given, the mean and median cannot be calculated exactly. However, they can be estimated. If we take the midpoint of an interval as a stand-in for all the values in that interval, then we can use the formula in Section 2.1.1 for calculating a mean with repeated values. Thus, in the example above, we would estimate the mean as
$$0.22(0.5) + 0.44(1.5) + 0.22(2.5) + 0.08(3.5) + 0.04(4.5) = 1.78.$$
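The same estimate can be reproduced in R from the midpoints and frequencies alone. This is a small sketch; the variable names are illustrative.

```r
# Midpoints and frequencies from the frequency table in Example 2.1.
mid <- c(0.5, 1.5, 2.5, 3.5, 4.5)
n_i <- c(11, 22, 11, 4, 2)
f_i <- n_i / sum(n_i)

# Estimated mean: each midpoint stands in for the values in its interval.
est_mean <- sum(f_i * mid)
est_mean

# weighted.mean() gives the same result using the counts as weights.
weighted.mean(mid, n_i)
```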
Estimating the median is a bit more difficult. By examining the cumulative frequencies $F_i$, we see that 22% of the data are less than or equal to 1 and 66% of the data are less than or equal to 2. Therefore, the median lies between 1 and 2. That is, it is 1 plus a certain fraction of the distance from 1 to 2. A reasonable guess at that fraction is given by linear interpolation between the cumulative frequencies at 1 and 2. In other words, we estimate the median as
$$1 + \frac{0.50 - 0.22}{0.66 - 0.22}\,(2 - 1) \approx 1.636.$$
A cruder estimate of the median is just the midpoint of the interval that contains the median, in this case 1.5. We leave it as an exercise to calculate the mean and median from the data of Example 2.1 and to compare them to these estimates.
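The interpolation step can be sketched as a small R function. The name grouped_quantile is hypothetical, and allowing a general $p$ rather than only 0.5 is an assumption made here for convenience; with $p = 0.5$ it reproduces the median estimate above.

```r
# Class boundaries and cumulative relative frequencies from Example 2.1.
breaks <- 0:5
F_i <- c(0.22, 0.66, 0.88, 0.96, 1.00)

# Estimate a quantile of grouped data by linear interpolation within
# the class interval where the cumulative frequency first reaches p.
grouped_quantile <- function(p, breaks, F_i) {
  F_all <- c(0, F_i)            # cumulative frequency at each boundary
  i <- which(F_i >= p)[1]       # index of the interval containing the quantile
  lower <- breaks[i]
  upper <- breaks[i + 1]
  lower + (p - F_all[i]) / (F_all[i + 1] - F_all[i]) * (upper - lower)
}

est_median <- grouped_quantile(0.5, breaks, F_i)
est_median                      # interpolated estimate of the median
```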
The figure below is a histogram of the reaction times.
> hist(reacttimes$Times, breaks=0:5, xlab="Reaction Times", main=" ")
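[Figure: absolute frequency histogram of the reaction times; horizontal axis "Reaction Times", with class boundaries 0, 1, 2, 3, 4, 5.]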
The histogram is a graphical depiction of the grouped data. The end points $c_i$ of the class intervals are shown on the horizontal axis. This is an absolute frequency histogram because the heights of the vertical bars above the class intervals are the absolute frequencies $n_i$. A relative frequency histogram would show the relative frequencies $f_i$. A density histogram has bars whose heights are the relative frequencies divided by the lengths of the corresponding class intervals. Thus, in a density histogram the area of each bar is equal to the relative frequency. If all class intervals have the same length, these types of histograms all have the same shape and convey the same visual information.
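The three kinds of bar heights can be inspected in R without drawing anything by calling hist with plot = FALSE. The sketch below re-enters the reaction times as a plain vector named times (an assumption; the text stores them in a data frame) and checks that the bar areas of the density histogram sum to 1.

```r
# Reaction times from Example 2.1.
times <- c(0.12, 0.30, 0.35, 0.37, 0.44, 0.57, 0.61, 0.62, 0.71, 0.80,
           0.88, 1.02, 1.08, 1.12, 1.13, 1.17, 1.21, 1.23, 1.35, 1.41,
           1.42, 1.42, 1.46, 1.50, 1.52, 1.54, 1.60, 1.61, 1.68, 1.72,
           1.86, 1.90, 1.91, 2.07, 2.09, 2.16, 2.17, 2.20, 2.29, 2.32,
           2.39, 2.47, 2.60, 2.86, 3.43, 3.43, 3.77, 3.97, 4.54, 4.73)

# plot = FALSE returns the histogram calculations without drawing.
h <- hist(times, breaks = 0:5, plot = FALSE)

h$counts                       # absolute frequencies n_i
h$counts / sum(h$counts)       # relative frequencies f_i
h$density                      # relative frequency / interval length

# In a density histogram the bar areas are the relative frequencies,
# so they add up to 1.
total_area <- sum(h$density * diff(h$breaks))
total_area

# Drawing the density histogram itself.
hist(times, breaks = 0:5, freq = FALSE, xlab = "Reaction Times", main = " ")
```

Because every class interval here has length 1, the density values coincide with the relative frequencies.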