# Computer Science

Summary statistics give us some sense of the data:

Mean vs. Median.

Standard deviation.

Quartiles, Min/Max.

Correlations between variables.

summary(data)

           x                  y
     Min.   :-3.05439   Min.   :-3.50179
     1st Qu.:-0.61055   1st Qu.:-0.75968
     Median : 0.04666   Median : 0.07340
     Mean   :-0.01105   Mean   : 0.09383
     3rd Qu.: 0.56067   3rd Qu.: 0.88114
     Max.   : 2.60614   Max.   : 4.28693

Visualization gives us a more holistic sense

Module 3: Basic Data Analytic Methods Using R

In the previous lesson, we saw how to examine data in R, including how to generate the descriptive statistics: averages, data ranges, and quartiles (which are included in the summary() report).

We also saw how to compute correlations between pairs of variables of interest. These statistics give us a sense of the data: an idea of its magnitude and range, and of some obvious dirty data (missing values, values with obviously wrong magnitude or sign).
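As a reminder of what those statistics look like in practice, here is a minimal sketch. It assumes a data frame with two numeric columns x and y; the values are simulated with rnorm() purely for illustration and are not the data behind the summary() output shown earlier.

```r
# Simulated stand-in data (an assumption for illustration only)
set.seed(1)
data <- data.frame(x = rnorm(500),
                   y = rnorm(500, mean = 0.1, sd = 1.2))

summary(data)          # min, quartiles, median, mean and max of each column
sapply(data, sd)       # standard deviation of each column
cor(data$x, data$y)    # Pearson correlation between x and y

# Missing values are easy to overlook, so count them explicitly
colSums(is.na(data))
```

None of these numbers says anything about the shape of the data, which is where visualization comes in.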

Visualization, however, gives us a succinct, more holistic view of the data that we may not be able to get from the numbers and summaries alone. It is an important facet of the initial data exploration. Visualization helps you assess data cleanliness, and also gives you an idea of potentially important relationships in the data before going on to build your models.


Copyright © 2014 EMC Corporation. All rights reserved.


Anscombe’s Quartet

4 data sets, characterized by the following. Are they the same, or are they different?

| Property | Value |
|---|---|
| Mean of x in each case | 9 (exact) |
| Variance of x in each case | 11 (exact) |
| Mean of y in each case | 7.50 (to 2 d.p.) |
| Variance of y in each case | 4.125 (±0.003) |
| Correlation between x and y in each case | 0.816 |
| Linear regression line in each case | y = 3.00 + 0.500x (to 2 d.p. and 3 d.p. resp.) |

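| x (i) | y (i) | x (ii) | y (ii) | x (iii) | y (iii) | x (iv) | y (iv) |
|---|---|---|---|---|---|---|---|
| 10.00 | 8.04 | 10.00 | 9.14 | 10.00 | 7.46 | 8.00 | 6.58 |
| 8.00 | 6.95 | 8.00 | 8.14 | 8.00 | 6.77 | 8.00 | 5.76 |
| 13.00 | 7.58 | 13.00 | 8.74 | 13.00 | 12.74 | 8.00 | 7.71 |
| 9.00 | 8.81 | 9.00 | 8.77 | 9.00 | 7.11 | 8.00 | 8.84 |
| 11.00 | 8.33 | 11.00 | 9.26 | 11.00 | 7.81 | 8.00 | 8.47 |
| 14.00 | 9.96 | 14.00 | 8.10 | 14.00 | 8.84 | 8.00 | 7.04 |
| 6.00 | 7.24 | 6.00 | 6.13 | 6.00 | 6.08 | 8.00 | 5.25 |
| 4.00 | 4.26 | 4.00 | 3.10 | 4.00 | 5.39 | 19.00 | 12.50 |
| 12.00 | 10.84 | 12.00 | 9.13 | 12.00 | 8.15 | 8.00 | 5.56 |
| 7.00 | 4.82 | 7.00 | 7.26 | 7.00 | 6.42 | 8.00 | 7.91 |
| 5.00 | 5.68 | 5.00 | 4.74 | 5.00 | 5.73 | 8.00 | 6.89 |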


Anscombe’s Quartet is a synthesized example by the statistician F. J. Anscombe. Look at the properties and values of these four data sets. Based on standard statistical measures of mean, variance, and correlation (our descriptive statistics), these data sets are identical. Or are they?
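The properties can be checked directly. The sketch below assumes that the built-in R data frame anscombe (columns x1 to x4 and y1 to y4) holds the four data sets listed above; the helper name describe_set is made up for this example.

```r
data(anscombe)  # ships with R in the 'datasets' package

# Hypothetical helper: summary properties of data set i (1 to 4)
describe_set <- function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)                      # simple linear regression of y on x
  c(mean_x    = mean(x),
    var_x     = var(x),
    mean_y    = mean(y),
    var_y     = var(y),
    cor_xy    = cor(x, y),
    intercept = unname(coef(fit)[1]),
    slope     = unname(coef(fit)[2]))
}

props <- sapply(1:4, describe_set)      # one column per data set
colnames(props) <- c("i", "ii", "iii", "iv")
print(round(props, 2))
```

Rounded to two decimal places, the four columns are nearly indistinguishable, which is exactly the point of the example.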


Moral: Visualize Before Analyzing!


However, if we visualize each data set using a scatterplot with a regression line superimposed over each plot, the data sets appear quite different. Data set 1 is the best candidate for a regression line, although there is a lot of variation. Data set 2 is definitely non-linear. Data set 3 is a close match, but the line over-predicts at higher values of x and there is an extreme outlier. And data set 4 isn’t captured at all by a simple regression line.

Assuming we have datasets represented by data frames s1, s2, s3, and s4, we can generate these plots in R by using the following code:

R-Code

plot(s1) # scatterplot of the data frame's x and y columns

abline(lm(y ~ x, data = s1)) # superimpose the fitted regression line

…

(Yes, a loop is possible, but it requires more advanced data manipulation; if interested, consult the documentation for the R “eval” function.) We must also take care not to overwrite the preceding graph in each instance.

Code to produce these graphs is included in the script AnscombePlot.R. Note that the data set for these plots is included in the standard R distribution. Type data() for a list of the data sets included in the base distribution. data(name) will make that data set available in your workspace.
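The following is a sketch of one way to draw all four panels; it is not the contents of AnscombePlot.R. It assumes the built-in anscombe data frame is the data set referred to above, builds the column names with paste0() rather than eval, and uses par(mfrow = ...) so that each plot gets its own panel instead of overwriting the previous one.

```r
data(anscombe)

op <- par(mfrow = c(2, 2))              # 2 x 2 grid: one panel per data set
fits <- vector("list", 4)               # keep the fitted models for inspection

for (i in 1:4) {
  d <- data.frame(x = anscombe[[paste0("x", i)]],
                  y = anscombe[[paste0("y", i)]])
  fits[[i]] <- lm(y ~ x, data = d)
  # Same axis limits in every panel so the shapes are comparable
  plot(d$x, d$y, xlim = c(3, 20), ylim = c(2, 14),
       xlab = "x", ylab = "y", main = paste("Data set", i))
  abline(fits[[i]], col = "red")        # superimpose the regression line
}

par(op)                                 # restore the previous layout
```

The four red lines are practically the same line; only the points around them differ.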


Visualizing Your Data

• Examining the distribution of a single variable

• Analyzing the relationship between two variables

• Establishing multiple pairwise relationships between variables

• Analyzing a single variable over time

• Data exploration versus data presentation


In a previous lesson, we’ve looked at how you can characterize your data by using traditional statistics. But we also showed how datasets could appear identical when using descriptive statistics, and yet look completely different when visualizing the data via a plot.

Using visual representations of data is the hallmark of exploratory data analysis: letting the data speak to us rather than necessarily imposing an interpretation on the data a priori. In the rest of this lesson, we are going to examine ways of displaying data so that we can better understand the underlying distributions of a single variable or the relationships between two or more variables.

Although data visualization is a powerful tool, the plots we produce while exploring may not be suitable when it comes time for us to “tell a story” about the data. Our last slide will discuss what kinds of presentation are most effective.


Examining the Distribution of a Single Variable

Graphing a single variable

• plot(sort(.)) – for low volume data

• hist(.) – a histogram

• plot(density(.)) – density plot, a “continuous histogram”

• Example: frequency table of household income


R has multiple functions available to examine a single variable. Some of them are listed above. See the R documentation for each of these. Some other useful functions are barplot() and dotplot() (the latter is in the lattice package; base R offers dotchart()).

The example included is a frequency table of household income. We can certainly see a concentration of households in the leftmost portion of the graph.
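A minimal sketch of the three functions listed above, side by side. The income vector is simulated from a lognormal distribution as a stand-in; it is an assumption for illustration and not the household income data shown on the slide.

```r
# Hypothetical household incomes (simulated, right-skewed)
set.seed(42)
income <- rlnorm(1000, meanlog = log(50000), sdlog = 0.8)

op <- par(mfrow = c(1, 3))
plot(sort(income), ylab = "Household income",
     main = "plot(sort(.))")            # sorted values, best for small data
hist(income, breaks = 50, xlab = "Household income",
     main = "hist(.)")                  # counts per bin
plot(density(income), xlab = "Household income",
     main = "plot(density(.))")         # smoothed, 'continuous' histogram
par(op)
```

All three views show the same thing for skewed data like this: most values are crowded into the left of the plot, with a long tail to the right.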



rug() plot emphasizes distribution


R has multiple functions available to examine a single variable. Some of them are listed above. See the R documentation for each of these. Some other useful functions are barplot(), dotplot() (in the lattice package) and stem().

The example included is a frequency table of log10 of household income. We can certainly see a concentration of households in the rightmost portion of the graph. The rug() function adds a one-dimensional plot of the individual observations as tick marks along the axis: notice how the ticks are densest where the area under the curve is largest.
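A sketch of the log10 density plot with a rug underneath, again using simulated lognormal incomes as an assumed stand-in for the slide's data.

```r
# Hypothetical household incomes (simulated), plotted on a log10 scale
set.seed(42)
income <- rlnorm(1000, meanlog = log(50000), sdlog = 0.8)
log_income <- log10(income)

plot(density(log_income),
     xlab = "log10(household income)",
     main = "Density of log10 income with rug")
rug(log_income)   # one tick per observation along the x axis
```

Taking the log pulls in the long right tail, so the bulk of the distribution is no longer squeezed against the left edge of the plot.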


What are we looking for?

A sense of the data range

• If it’s very wide, or very skewed, try computing the log

Outliers, anomalies

• Possibly evidence of dirty data

Shape of the distribution

• Unimodal? Bimodal?

• Skewed to left or right?

• Approximately normal? Approximately lognormal?

Example – Distribution of purchase size ($)

• Range from 0 to > $10K, right skewed

• Typical of monetary data

• Plotting log of data gives better sense of distribution

• Two purchasing distributions: ~ $55 and ~ $2900


When viewing the variables during the data exploration phase, you are looking for a sense of the data range, and whether the values are strongly concentrated in a certain range. If the data is very skewed, viewing the log of the data (if it’s all positive) can help you detect structure that you might otherwise miss in a regularly scaled graph.

This is your chance to look for obvious signs of dirty data (outliers or unlikely looking values). See if the data is unimodal or multimodal: that gives you an idea of how many distinct populations (with distinct behavior patterns) might be mixed into your overall population. Knowing if the data is approximately normal (or can be transformed to approximately normal – for example, by taking the log) is important, since many modeling techniques assume that the data is approximately normal in distribution.

For our example, we can look at the density plot of purchase sizes (in $ US) of customers at our online retail site. The range here is extremely wide – from around $1 US to over $10,000 US. Extreme ranges like this are typical of monetary data, like income, customer value, tax liabilities, bank account sizes, etc. (In fact, this kind of data is often assumed to be distributed lognormally – that is, its log is normally distributed.)

The data range makes it really hard for us to see much detail, so we take the log of the data and then plot its density. Now we can see that there are (at least) two distinct populations in our customer base: one population that makes small to medium size purchases (median purchase size about $55 US) and one that makes larger purchases (median purchase size about $2900 US). Can you see those two populations in the top graph?

The plots shown were made using the lattice package. If the data is in the vector purchase_size, then the lattice plots are:

library(lattice)

densityplot(purchase_size) # top plot

# bottom plot: log10 is actually easier to read,

# but this plot uses the natural log

densityplot(log(purchase_size))

(The commands were actually more complicated than that, but these commands give the basic equivalent)
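To experiment with the effect described here, the sketch below simulates a purchase_size vector as a mixture of two lognormal populations with medians of about $55 and $2900. The mixture proportions and spreads are assumptions chosen for illustration, and base graphics are used instead of lattice.

```r
# Simulated purchases: two hypothetical customer populations
set.seed(123)
purchase_size <- c(rlnorm(1400, meanlog = log(55),   sdlog = 0.8),
                   rlnorm(600,  meanlog = log(2900), sdlog = 0.6))

op <- par(mfrow = c(2, 1))
# Raw scale: the long right tail hides almost all of the structure
plot(density(purchase_size),
     xlab = "purchase size ($)", main = "Purchase size")
# Log scale: the two populations show up as separate bumps
plot(density(log10(purchase_size)),
     xlab = "log10(purchase size)", main = "log10 of purchase size")
abline(v = log10(c(55, 2900)), lty = 2)   # medians of the two components
par(op)
```

The dashed lines mark the two component medians; on the raw scale the same two populations are very hard to tell apart.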


Evidence of Dirty Data


Missing values?

Mis-entered data?

Inherited accounts?

Here’s an example of how dirty data might manifest itself in your visualizations. We are looking at the age distribution of account holders at our bank. Mean age is about 40, approximately normally distributed with a standard deviation of about 15 years or so, which makes sense.

We see a few accounts with accountholder age < 10; unusual, but plausible. These could be custodial accounts, or college savings accounts set up by the parents of young children. We probably want to keep them for our analysis.

There is a huge spike of customers who are zero years old – evidence of missing data. We may need to eliminate these accounts from analysis (depending on how important we think age will be), or track down how to get the appropriate age data.

The customers with negative age are probably either missing data or mis-entered data. The customers who are older than 100 are possibly also mis-entered data, or these are accounts that have been passed down to the heirs of the original accountholders (and not updated). We may want to exclude them as well, or at least threshold the age that we will consider in the analysis.

If this data is in a vector called age, then the plot is made by:

hist(age, breaks=100, main="Accountholder age distribution",

xlab="age", col="gray")
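Once the histogram has raised suspicions, the questionable records can be counted and set aside. The sketch below uses a simulated age vector with a few deliberately planted problems; the counts and the cut-off of 100 are assumptions for illustration, and the right treatment depends on the analysis.

```r
# Simulated ages with planted problems (illustration only)
set.seed(99)
plausible <- pmin(pmax(round(rnorm(950, mean = 40, sd = 15)), 1), 95)
age <- c(plausible,
         rep(0, 30),            # zeros standing in for missing values
         c(-5, -1),             # negative ages: missing or mis-entered
         c(104, 110, 125))      # over 100: mis-entered or inherited accounts

# Label each record according to the checks discussed above
flag <- ifelse(age < 0, "negative",
        ifelse(age == 0, "zero",
        ifelse(age > 100, "over 100", "plausible")))
table(flag)

# Keep ages under 10 (plausible custodial accounts); blank out the rest
age_clean <- age
age_clean[flag != "plausible"] <- NA

hist(age_clean, breaks = 50, main = "Accountholder age (suspect values removed)",
     xlab = "age", col = "gray")
```

Setting the suspect values to NA rather than deleting the rows keeps the accounts available for analyses where age does not matter.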


“Saturated” Data


Do we really have no mortgages older than 10 years?

Or does the year 2004 in the origination field mean “2004 or prior”?

Here’s another example of dirty (or at least “incompletely documented”) data. We are looking at the age of mortgages in our bank’s home loan portfolio. The age is calculated by subtracting the origination date of the loan from “today” (2013).

The first thing we notice is that we don’t seem to have loans older than 10 years – and we also notice that we have a disproportionate number of ten-year-old loans, relative to the age distribution of the other loans.

One possible reason for this is that the date field for loan origination may have been “overloaded” so that “2004” is actually a beacon (sentinel) value that means “2004 or prior” rather than literally 2004. (This sometimes happens when data is ported from one system to another, or because someone, somewhere, decided that origination dates prior to 2004 are not relevant.)

What would we do about this? If we are analyzing probability of default, it is probably safe to eliminate the data (or keep the assumption that the loans are 10 years old), since 10-year-old mortgages default quite rarely (most defaults occur before about the 4th year). For different analyses, we may need to search for a source of valid origination dates (if that is possible).

If the data is in the vector mortgage, the plot is made by:

hist(mortgage, breaks=10, main="Portfolio Distribution, Years since origination",

xlab="Mortgage Age", col="grey")
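A sketch of how this kind of saturation arises and how it might be detected numerically. All values are simulated: the true origination years, the cap year and the reference year are assumptions, chosen so that the capped loans appear ten years old.

```r
# Simulate a portfolio whose origination year was capped at 2004
set.seed(7)
cap_year       <- 2004
reference_year <- cap_year + 10          # assumed "today" for this sketch
true_year      <- sample(1990:reference_year, 5000, replace = TRUE)
recorded_year  <- pmax(true_year, cap_year)   # "2004" now means "2004 or prior"
mortgage       <- reference_year - recorded_year

# Compare the count in the oldest age bucket with a typical bucket
counts    <- table(mortgage)
ages      <- as.integer(names(counts))
n_by_age  <- as.vector(counts)
oldest_n  <- n_by_age[ages == max(ages)]
typical_n <- median(n_by_age[ages != max(ages)])
ratio     <- oldest_n / typical_n
ratio                                     # far above 1 suggests saturation

hist(mortgage, breaks = seq(-0.5, 10.5, by = 1),
     main = "Portfolio Distribution, Years since origination",
     xlab = "Mortgage Age", col = "grey")
```

A ratio near 1 would be unremarkable; a value many times larger is a cue to ask how the origination field was populated.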


Analyzing the Relationship Between Two Variables


How?

• Two continuous variables (or two discrete variables)

Scatterplots

LOESS (fit smoothed line to the data)

Linear models: graph the correlation

Binplots, hexbin plots: more legible color-based plots for high volume data

• Continuous vs. discrete variable

Jitter, box-and-whisker plots, dotplot or barchart

Example:

• Household income by region (ZIP1)

• Scatterplot with jitter, with box-and-whisker overlaid

• New England (0) and West Coast (9) have highest median household income

Scatterplots are a good first visualization for the relationship between two variables, especially two continuous variables. Since you are looking for the relationship between the two variables, it can often be helpful to fit a smoothing curve through the data, for example loess or a linear regression. We’ll see an example of that a little later on.
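A minimal sketch of a scatterplot with both kinds of fitted line. The data is simulated with a known linear trend (an assumption for illustration), and lowess() from base R is used as the LOESS-style smoother.

```r
# Simulated data with a linear trend plus noise
set.seed(2024)
x <- runif(300, min = 0, max = 10)
y <- 2 + 0.5 * x + rnorm(300, sd = 1)

plot(x, y, pch = 16, col = "gray40",
     main = "Scatterplot with fitted lines")

fit <- lm(y ~ x)                          # linear regression
abline(fit, col = "blue", lwd = 2)
lines(lowess(x, y), col = "red", lwd = 2) # locally weighted smoother

legend("topleft", legend = c("linear fit", "lowess"),
       col = c("blue", "red"), lwd = 2)
```

When the two lines agree, as they should here, a linear model is a reasonable description; a smoother that bends away from the straight line hints at non-linearity.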

For very high volume data, scatterplots are problematic; with too much data on the page, the details can get lost. Sometimes the jitter() function can add enough (uniformly distributed) random variation to separate overlapping points and show the associations more clearly. Hexbin plots are a good alternative: you can think of hexbin plots as two-dimensional histograms that use color or grayscale to encode bin heights.
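The two-dimensional histogram idea can be sketched in base R with rectangular bins. This is not a hexbin plot (hexagonal bins need an add-on package); it is a simplified stand-in, on simulated data, that shows how counts per bin are encoded as shades of gray.

```r
# Simulated high-volume data (illustration only)
set.seed(5)
n <- 50000
x <- rnorm(n)
y <- 0.6 * x + rnorm(n, sd = 0.8)

# 40 x 40 rectangular bins; pad the range slightly so every point falls inside
n_bins <- 40
x_breaks <- seq(min(x) - 1e-6, max(x) + 1e-6, length.out = n_bins + 1)
y_breaks <- seq(min(y) - 1e-6, max(y) + 1e-6, length.out = n_bins + 1)
x_bin <- cut(x, breaks = x_breaks, include.lowest = TRUE)
y_bin <- cut(y, breaks = y_breaks, include.lowest = TRUE)

# Count points per bin and draw the counts as a grayscale image
z <- matrix(as.numeric(table(x_bin, y_bin)), nrow = n_bins)
image(x_breaks, y_breaks, z,
      col = gray.colors(20, start = 1, end = 0),  # white = empty, black = dense
      xlab = "x", ylab = "y", main = "2-D histogram (rectangular bins)")
```

With 50,000 points a plain scatterplot would be a solid blob; the binned version still shows where the data is densest.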

There are other alternatives for plotting continuous vs. discrete variables. Dotplots and barcharts plot the continuous value as a function of the discrete value when the relationship is one-to-one. Box and whisker plots show the distribution of the continuous variable for each value of the discrete variable.

The example here is of logged household incomes as a function of region (first digit of the ZIP code). (Logged in this case means data that uses the logarithm of the value instead of the value itself.) In this example, we have also plotted the scatterplot beneath the box-and-whisker, with some jittering so each line of points widens into a strip. The “box” of the box and whisker shows the range that contains the central 50% of the data; the line inside the box is the location of the median. The “whiskers” give you an idea of the entire range of the data. Usually, box and whiskers also show “outliers” that lie beyond the whiskers, but they are turned off in this graph.

This graph shows how household income varies by region. The highest median incomes are in New England (region 0) and on the West Coast (region 9). New England is slightly higher, but the boxes for the two regions overlap enough that the difference between the two regions probably is not significant. The lowest household incomes tend to be in region 7 (TX, OK, AR, LA).
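A sketch of this kind of plot in base graphics. The incomes are simulated, and the per-region shifts are assumptions picked to loosely mimic the pattern described; unlike the slide, the jittered points are drawn on top of the boxes rather than beneath them.

```r
# Simulated log10 household incomes for ten hypothetical regions (0 to 9)
set.seed(11)
region <- factor(rep(0:9, each = 300))
region_shift <- c(0.30, 0, -0.05, -0.05, 0, -0.05, 0, -0.15, 0.05, 0.25)
log_income <- 4.7 + region_shift[as.integer(region)] + rnorm(3000, sd = 0.3)

# Box-and-whisker per region, with outlier points turned off
boxplot(log_income ~ region, outline = FALSE,
        xlab = "region (first digit of ZIP)",
        ylab = "log10(household income)")

# Overlay the raw points, jittered so each column widens into a strip
stripchart(log_income ~ region, method = "jitter", jitter = 0.3,
           vertical = TRUE, add = TRUE, pch = ".", col = "gray30")

# Medians per region, i.e. the lines inside the boxes
med <- tapply(log_income, region, median)
round(med, 2)
```

Reading the medians next to the plot makes it easier to judge whether an apparent difference between two boxes is large compared with their overlap.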
