# Computer Science

Two Variables: What are we looking for?

• Is there a relationship between the two variables?
  Linear? Quadratic? Exponential? Try semi-log or log-log plots
  Is it a cloud? Round? Concentrated? Multiple clusters?

• How? Scatterplots

• Example
  Red line: linear fit
  Blue line: LOESS
  Fairly linear relationship, but with wide variance

Module 3: Basic Data Analytic Methods Using R 14

We are looking for a relationship between the two variables. If the functional relationship between the variables is somewhat pronounced, the data lies roughly along a curve: a straight line, a parabola, or an exponential curve. If y is related exponentially to x, then the plot of (x, log(y)) will be approximately linear. If the data is more like a cloud, the relationship is weaker.
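As a minimal sketch of the semi-log idea, the R code below simulates an exponential relationship and plots it twice: once on the raw scale and once with a logged y-axis. The data, the growth rate of 0.5, and the variable names are made up for illustration; they are not taken from the example on the slide.

```r
# Simulated data (an assumption for illustration): y grows exponentially in x,
# with multiplicative noise.
if (!interactive()) pdf(NULL)  # when run as a script, draw without writing a file
set.seed(1)
x <- seq(1, 10, length.out = 200)
y <- 2 * exp(0.5 * x) * exp(rnorm(length(x), sd = 0.2))

op <- par(mfrow = c(1, 2))
plot(x, y, main = "Raw scale")                  # points follow a curve
plot(x, y, log = "y", main = "Semi-log scale")  # points follow a rough line
par(op)

# A straight-line fit to (x, log(y)) estimates the growth rate as its slope.
fit <- lm(log(y) ~ x)
coef(fit)
```

If the semi-log plot still bends, a log-log plot (log = "xy") is the next thing to try, since a power-law relationship is linear on that scale.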

In the example here, the relationship seems approximately linear; we’ve plotted the regression line in red. There are times when a standard regression line just doesn’t capture the relationship. In this case, the loess() function in R (or the older lowess()) will fit a smooth, non-linear curve to the data. Here we’ve drawn that smoothed curve in blue.

R-Code

Assume a dataset named ds with variables cesd and mcs. The R code to generate the above plot is as follows.

with(ds, {
  plot(mcs ~ cesd)                          # scatterplot of mcs against cesd
  abline(lm(mcs ~ cesd), col = "red")       # linear regression line
  lines(lowess(mcs ~ cesd), col = "blue")   # locally weighted smooth
})
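The notes mention loess() as well as lowess(). As a sketch of the loess() route, the code below fits both a linear model and a loess model and draws the smoothed curve from predictions on a grid. Because the ds dataset is not included here, a stand-in with the same column names is simulated; its values and the shape of the relationship are assumptions, not the real data.

```r
# Stand-in for ds (hypothetical values; only the column names come from the text).
if (!interactive()) pdf(NULL)  # when run as a script, draw without writing a file
set.seed(42)
ds <- data.frame(cesd = runif(300, min = 0, max = 60))
ds$mcs <- 55 - 0.6 * ds$cesd + rnorm(300, sd = 8)

lin <- lm(mcs ~ cesd, data = ds)     # straight-line fit
lo  <- loess(mcs ~ cesd, data = ds)  # local regression fit

# loess() has no abline() shortcut, so predict on a grid inside the data range.
grid <- data.frame(cesd = seq(min(ds$cesd), max(ds$cesd), length.out = 100))
smooth <- predict(lo, newdata = grid)

plot(mcs ~ cesd, data = ds)
abline(lin, col = "red")
lines(grid$cesd, smooth, col = "blue")
```

When the blue curve tracks the red line closely, as it should for this simulated data, the linear fit is probably adequate; a clear departure suggests the relationship is not linear.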

Copyright © 2014 EMC Corporation. All rights reserved.

Two Variables: High Volume Data – Plotting

Scatterplot: Overplotting makes it difficult to see structure

Hexbinplot: Now we see where the data is concentrated.

When we have too much data, the structure becomes difficult to see in a scatterplot. Here, we are plotting logged household income against years of education. The “blob” that we get on the scatterplot on the left suggests a somewhat linear relationship (this suggests, by the way, that an extra year of education multiplies your expected income by 10^M, where M is the slope of the regression line). However, we can’t really see how the data is distributed within the blob.

On the right we have plotted the same data using a hexbinplot. Hexbinplots are a bit like 2-d histograms, where shading tells us how populated each bin is. Now we can see that the data is more densely clustered in a streak that runs through the center of the data cloud, roughly along the regression line. The biggest concentration is around 12 years of education, extending to about 15 years.

Notice also the outlier data at MeanEducation = 0. Missing data perhaps?
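To make the “2-d histogram” idea concrete, here is a minimal base-R sketch that counts points in rectangular bins and shades each bin by its count. It uses rectangles rather than hexagons, and the data is simulated: the variable names, sample size, and coefficients are assumptions for illustration only. If the hexbin package is installed, its hexbinplot() function is one way to draw the hexagonal version.

```r
# Simulated high-volume data (hypothetical; not the data behind the slide).
if (!interactive()) pdf(NULL)  # when run as a script, draw without writing a file
set.seed(7)
n <- 50000
education  <- pmin(pmax(round(rnorm(n, mean = 12.5, sd = 2.5)), 0), 20)
log_income <- 3.8 + 0.07 * education + rnorm(n, sd = 0.35)

# Bin edges: one bin per year of education, "round" edges for logged income.
xbreaks <- seq(-0.5, 20.5, by = 1)
ybreaks <- pretty(log_income, n = 40)

# Count the points that fall in each (education, income) cell.
counts <- table(cut(education, xbreaks),
                cut(log_income, ybreaks, include.lowest = TRUE))
z <- matrix(as.vector(counts), nrow = nrow(counts))

# Darker cells hold more points; empty cells are white.
image(x = xbreaks, y = ybreaks, z = z,
      col = gray(seq(1, 0, length.out = 20)),
      xlab = "Years of education", ylab = "log10(household income)")
```

A plain plot(education, log_income) on the same data shows only a solid blob, which is the overplotting problem described above.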

Establishing Multiple Pairwise Relationships Between Variables

• Why? Examine many two-way relationships quickly

• How? pairs(ds) can generate a plot of each pair of variables

• Example: Iris Characteristics
  Strong linear relationship between petal length and width
  Petal dimensions discriminate species more strongly than sepal dimensions

There are times when it’s useful to see several variables of a dataset side by side, so that the plot magnifies differences or shows patterns hidden within the data that summary statistics don’t reveal. In the graphic represented above, the variables sepal length, sepal width, petal length and petal width are compared across three species of irises (the key is not listed in the graphic). Colors are used to represent the different species, allowing us to compare differences across species for a particular combination of variables.

Consider the values encoded in the second square from the top right, where sepal length is compared with petal length. Values for petal length are encoded across the bottom; values for sepal length are encoded on the right-hand side of the graphic. We can observe that the green and blue species are well matched, although the blue species has longer petals in the main. The petal length for the red species, however, remains markedly the same, and varies only in the lower half of sepal length values. As an exercise, imagine fitting a regression line to each of these individual graphs. What would you make of the relationship between sepal length and sepal width?

The R code for generating the plot is:

pairs(iris[1:4], main = "Anderson's Iris Data -- 3 species",
      pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)])

and uses the iris dataset included with the R standard distribution. Here the colors encode the species, and the vectorized indexing that assigns them shows that the spirit of APL is alive and well.
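Visual impressions from a scatterplot matrix can be checked numerically. The sketch below computes the pairwise correlations for the four iris measurements, first pooled over all flowers and then within each species for the sepal dimensions. The object names cor_all and sepal_by_species are ours, chosen for illustration.

```r
# Pooled correlation matrix for the four measurements (all species together).
cor_all <- cor(iris[1:4])
round(cor_all, 2)

# Sepal length vs. sepal width, computed separately within each species.
sepal_by_species <- sapply(
  split(iris, iris$Species),
  function(d) cor(d$Sepal.Length, d$Sepal.Width)
)
round(sepal_by_species, 2)
```

Comparing the pooled value with the per-species values is a useful habit whenever color in the plot suggests that the groups behave differently.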

Analyzing a Single Variable over Time

What?

• Looking for …
  Data range
  Trends
  Seasonality

How?

• Use time series plot

Example

• International air travel (1949-1960)

• Upward trend: growth appears superlinear

• Seasonality: Peak air travel around Nov. with smaller peaks near Mar. and June

Visualizing a variable over time is the same as visualizing any pair of variables, but in this case we are looking for some specific patterns.

Data range, of course, tells us how much our y variable has increased or decreased over the period of time we are considering. We want to get a feeling for the growth rate, and whether or not we see any changes in that growth rate. We are also looking for seasonality: a regular pattern in the fluctuations over a fixed period of time. We can think of those patterns as marking “seasons”.

In the air travel data example that we show, we can see that air travel peaks regularly around Nov/Dec (the holiday season), with a smaller peak around the middle of the year (summer travel) and an even smaller one near the beginning of the year (spring break?).

We can also see that the number of air passengers increased steadily from 1949 to 1960, and that the growth appears to be faster than linear, at least during peak travel season.
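A time series plot like this one takes only a few lines of R. The sketch below assumes that the air travel series corresponds to the AirPassengers dataset that ships with R (monthly totals for 1949 to 1960); the slide does not name its data source, so treat that as an assumption. It plots the series on the raw and logged scales and averages each calendar month across years to summarize the seasonal pattern.

```r
# Assumption: the built-in AirPassengers series stands in for the slide's data.
if (!interactive()) pdf(NULL)  # when run as a script, draw without writing a file

op <- par(mfrow = c(2, 1))
plot(AirPassengers, ylab = "Passengers (thousands)")    # trend and seasonality
plot(log10(AirPassengers), ylab = "log10(passengers)")  # near-linear if growth is exponential
par(op)

# Average each calendar month over all years to expose the seasonal profile.
monthly_mean <- tapply(as.numeric(AirPassengers),
                       as.integer(cycle(AirPassengers)), mean)
names(monthly_mean) <- month.abb
round(monthly_mean)
```

The monthly averages make it possible to check which months carry the peaks, rather than reading them off a crowded time axis.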

Data Exploration vs. Presentation

Data Exploration:

This tells you what you need to know.

Presentation:

This tells the stakeholders what they need to know.

Finally, we want to touch on the difference between using visualization for data exploration, and for presenting results to stakeholders. The plots and tips that we’ve discussed try to make the details of the data as clear as possible for the data scientist to see structure and relationships. These technical graphs don’t always effectively convey the information that needs to be conveyed to non-technical stakeholders. For them, we want crisp graphics that focus on the message we want to convey.

We will touch more on this topic in Module 6, but for right now we’ll share a small example. The top graph shows the density plot of logged account values for our bank. This graph gives us, as data scientists, information that can be relevant to downstream analysis. The account values are distributed approximately lognormally, in the range from $100 to $10M. The median account value is in the area of $30,000 (10^4.5), with the bulk of the accounts between $1,000 and $1M.

It would be hard to explain this graph to stakeholders. For one thing, density plots are fairly technical, and for another, it is awkward to explain why you are logging the data before showing it. You can convey essentially the same information by partitioning the data into “log-like” bins, and presenting the histogram of those bins, as we do in the bottom plot. Here, we can see that the bulk of the accounts are in the 1,000-1M range, with the peak concentration in the 10-50K range, extending out to about 500K. This gives the stakeholders a better sense of the customer base than the top graphic would.

[Note: the lower graph isn’t symmetric like the upper graph because the bins are only “log-like”. They aren’t truly log10 scaled. Log10 scaled bins would be closer to: 1-3K, 3K-10K, 10K-30K, and so on. As an exercise, we could try splitting the bins that way, and we would see that the resulting bar chart would be symmetric. The bins we chose, however, might seem more “natural” to the stakeholders.]
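The two views can be sketched in R as follows. The account values are simulated from a lognormal distribution with a median near 10^4.5, and the bin edges and labels are our guess at “log-like” bins; neither comes from the bank data behind the slide.

```r
# Simulated account values (hypothetical): lognormal, median about 10^4.5.
if (!interactive()) pdf(NULL)  # when run as a script, draw without writing a file
set.seed(3)
account_value <- 10^rnorm(10000, mean = 4.5, sd = 0.8)

# Exploration view: density of the logged values.
plot(density(log10(account_value)), main = "log10(account value)")

# Presentation view: counts in "log-like" bins with familiar edges (assumed).
breaks <- c(0, 1e3, 1e4, 5e4, 1e5, 5e5, 1e6, Inf)
labels <- c("<1K", "1K-10K", "10K-50K", "50K-100K",
            "100K-500K", "500K-1M", ">1M")
bins <- cut(account_value, breaks = breaks, labels = labels, right = FALSE)
bin_counts <- table(bins)
barplot(bin_counts, las = 2, ylab = "Number of accounts")
```

Replacing breaks with powers of 10^0.5 (1K, about 3.2K, 10K, about 32K, and so on) gives the truly log-scaled bins described in the note above.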

Check Your Knowledge

• Do you think the regression line sufficiently captures the relationship between the two variables? What might you do differently?

• In the Iris slide example, how would you characterize the relationship between sepal width and sepal length?

• Did you notice the use of color in the Iris slide? Was it effective? Why or why not?

Please take a moment to answer these questions.

Review of Basic Data Analytic Methods Using R: Analysis

During this lesson the following topics were covered:

• Justifying why we visualize data

• Using plots and graphs to determine:
  • Shape of a single variable
  • “dirty” data or “saturated” data
  • Relationship between two or more variables
  • Relationship between multiple variables
  • A single variable over time

• Data exploration versus presentation

Summary

This slide captures the key topics from this lesson.
