# Computer Science

CSC334/424 Assignment #2 Dr. John McDonald

Note: For each of the analysis problems, include a copy of the full analysis in your report along with your conclusions.

1) (Due Sunday September 27) Post one of the following to the final project forum:

a) An introduction describing the kind of data you are interested in looking at. If you already have a dataset, give a short description of it, along with a description of its scope (# metric variables, # categorical variables, # samples, multiple related tables?).

b) A response to one or more other posts expressing interest in their project idea.

c) A post with a fully formed group (I know some of you have already formed a group). In this case, you should also give a description of your dataset along the lines of a).

In addition, as you are forming your groups, remember the following requirements for datasets and groups:

a. Your group should have 5 people in it. I will consider making exceptions if a group cannot find a 5th person, but you will have to contact me about that, because I may have to give you a 5th depending on how the group divisions come out.

b. Your group should have at least one in-class student and one on-line student. This helps me check in with each group if I have at least one in-class student in each, and also helps get remote students involved with those of you in Chicago. I will consider making exceptions here, but your group will need to contact me to discuss it.

c. Your dataset should be a real and rich dataset with at least 10 to 20 variables mixed between categorical and metric. It should have at least (10 * #var) samples (we will see that some techniques like PCA require this for significance/stability). So the more variables your dataset has the larger the sample size should be. See me if you have any doubts about this.

2) (Due Wednesday September 30) Do one of the following:

Finalize your choice of a group that is forming online. Your group should post its final composition to the “Group Finalization” forum. Create a thread and have each member of your group post to that thread. (This helps me with tracking who has not found a group). When each member posts here, they should include whether they are an online or in-class student.

Post your name, a list of three areas of data interest, and whether you are in-class or online, to the alternate group formation site (I will form groups out of the remaining people, either creating new groups or filling empty slots in existing groups).

We will finalize group formation by next Friday.

3) (Due Wednesday September 30) Answer each of the following by hand for the following matrices/vectors, and then verify your answers with R code:

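$$
M = \begin{pmatrix} 1 & 1 \\ 1 & 2.5 \end{pmatrix}, \quad
N = \begin{pmatrix} 0.4 & 0.88 & 0.28 \\ 0.88 & 1.1 & 0.98 \\ 0.28 & 0.98 & 2.26 \end{pmatrix}, \quad
v = \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix}
$$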

Compute the eigenvalues and eigenvectors of M

Verify that v is an eigenvector of N. What is the corresponding eigenvalue? (Note: you do not need to solve for it. Hint: what does it mean that v is an eigenvector?)

(extra credit) Find another eigenvector for N. Note: for this, you will need to look up the formula for a 3×3 determinant, and the resulting eigenvalue equation will be a 3rd degree polynomial. The helpful thing is that you already know what one solution is, so you can factor that out.
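For the verification step, a minimal R sketch might look like the following. It assumes the matrices read as M = [1 1; 1 2.5], N as the symmetric 3×3 matrix with rows (.4, .88, .28), (.88, 1.1, .98), (.28, .98, 2.26), and v = (1, 2, 3); adjust the entries if your copy of the assignment differs. If the printed entries of N are rounded, the ratios in the last step may agree only approximately rather than exactly.

```r
# Matrices as read from the assignment (an assumed reading of the layout).
M <- matrix(c(1, 1,
              1, 2.5), nrow = 2, byrow = TRUE)
N <- matrix(c(0.40, 0.88, 0.28,
              0.88, 1.10, 0.98,
              0.28, 0.98, 2.26), nrow = 3, byrow = TRUE)
v <- c(1, 2, 3)

# eigen() returns the eigenvalues in $values and unit-length
# eigenvectors as the columns of $vectors.
eM <- eigen(M)
print(eM)

# If v is an eigenvector of N, then N %*% v is a scalar multiple of v,
# so the elementwise ratio (N v) / v should be (nearly) constant.
Nv <- as.vector(N %*% v)
print(Nv)
print(Nv / v)
```

Hand-computed eigenvectors can differ from the columns of `eM$vectors` by a scale factor or sign; both describe the same direction.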

4) (Reflection) Post a comment on the lectures 3 & 4 forum regarding some topic covered during these lectures.

5) (Regression analysis) The Housing dataset (under the course documents for week 3) contains housing values in the suburbs of Boston. A detailed explanation of the input and output variables is available from the UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets/Housing. (Note that in R, you can load this file with simply read.table("housing.dat"). If you try to specify a separator, R will get confused by the multiple spaces between fields.) The variables are:

1. CRIM: per capita crime rate by town
2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS: proportion of non-retail business acres per town
4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. NOX: nitric oxides concentration (parts per 10 million)
6. RM: average number of rooms per dwelling
7. AGE: proportion of owner-occupied units built prior to 1940
8. DIS: weighted distances to five Boston employment centers
9. RAD: index of accessibility to radial highways
10. TAX: full-value property-tax rate per $10,000
11. PTRATIO: pupil-teacher ratio by town
12. B: 1000(Bk – 0.63)^2 where Bk is the proportion of African Americans by town
13. LSTAT: % lower status of the population
14. MEDV: Median value of owner-occupied homes in $1000’s (output variable)
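As a rough illustration of the loading step, the sketch below reads whitespace-separated text with read.table and attaches the variable names listed above. The two data rows are made up for the example (they are not taken from housing.dat), and the assumption that the real file has no header row should be checked against your copy.

```r
# Column names taken from the variable list in the assignment.
cols <- c("CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
          "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV")

# Two made-up rows (NOT real data) with irregular spacing,
# standing in for the contents of housing.dat.
demo_text <- "
 0.01  18.0   2.3  0  0.54  6.6  65.2  4.1   1  296.0  15.3  396.9   5.0  24.0
 0.03   0.0   7.1  0  0.47  6.4  78.9  5.0   2  242.0  17.8  396.9   9.1  21.6
"

# With the default sep = "", any run of spaces or tabs counts as one separator,
# which is why no explicit separator is needed.
housing <- read.table(text = demo_text, header = FALSE, col.names = cols)
str(housing)

# For the real file (assumed here to have no header row):
# housing <- read.table("housing.dat", header = FALSE, col.names = cols)
```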

Fit a linear regression model and report goodness of fit, the utility of the model, the estimated coefficients, their standard errors, and statistical significance. Interpret your results.
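One way to pull the requested quantities out of a fitted model in R is sketched below. To keep the example runnable without the course file, it uses the built-in mtcars data as a stand-in, with mpg playing the role of MEDV; with the housing data the call would presumably be lm(MEDV ~ ., data = housing), assuming the columns have been named as in the variable list.

```r
# Stand-in data: built-in mtcars, with mpg as the response.
fit <- lm(mpg ~ ., data = mtcars)
s <- summary(fit)

# Goodness of fit
cat("R-squared:", s$r.squared, " adjusted:", s$adj.r.squared, "\n")
cat("Residual standard error:", s$sigma, "on", s$df[2], "df\n")

# Utility of the model: overall F-test that all slopes are zero
f <- s$fstatistic
p_overall <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
cat("F =", f["value"], " p-value =", p_overall, "\n")

# Coefficient table: estimate, standard error, t value, p-value
print(round(s$coefficients, 4))
```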

Perform a feature selection on this data by using the forward selection method of the regression analysis. Analyze the output in terms of the order in which the variables are included in the regression model.

Compare the model selected by forward selection to backward selection.
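The two selection procedures can be sketched in R with step(). Note that step() adds or drops terms based on AIC rather than on the F-test entry and removal thresholds used by some other packages, so the selected models may not match those from other software. The built-in mtcars data is again only a stand-in for the housing data.

```r
# Stand-in data: built-in mtcars, with mpg as the response.
null_fit <- lm(mpg ~ 1, data = mtcars)   # intercept only
full_fit <- lm(mpg ~ ., data = mtcars)   # all predictors

# Forward selection: start from the null model, add one term at a time.
fwd <- step(null_fit,
            scope = list(lower = formula(null_fit), upper = formula(full_fit)),
            direction = "forward", trace = 0)

# Backward selection: start from the full model, drop one term at a time.
bwd <- step(full_fit, direction = "backward", trace = 0)

# The anova component records the order in which terms entered.
print(fwd$anova)

# Compare the two selected models.
print(formula(fwd))
print(formula(bwd))
print(c(forward = AIC(fwd), backward = AIC(bwd)))
```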

6) Problem 3 (Principal Component Analysis – 20 points): The data given in the file ‘problem3.txt’ (see footnote 1; under course documents for week 3) give the percentage employed in different industries in European countries during 1979. Techniques such as Principal Component Analysis (PCA) can be used to examine which countries have similar employment patterns. There are 26 countries in the file and 10 variables as follows:

Variable Names:

1. Country: Name of country
2. Agr: Percentage employed in agriculture
3. Min: Percentage employed in mining
4. Man: Percentage employed in manufacturing
5. PS: Percentage employed in power supply industries
6. Con: Percentage employed in construction
7. SI: Percentage employed in service industries
8. Fin: Percentage employed in finance
9. SPS: Percentage employed in social and personal services
10. TC: Percentage employed in transport and communications

Perform a principal component analysis using the covariance matrix:

a. How many principal components are required to explain 90% of the total variation for this data?

b. For the number of components in part a, give the formula for each component and a brief interpretation, without rotation of the components. How easy are they to separate in terms of meaning? Then try rotating the components (your function for computing PCA may be doing this already; if so, make sure that you know the difference and can get both out of your software). Give the formula for each rotated component and a brief interpretation. Has rotating improved the ability to interpret the components?

c. What countries have the highest and lowest values for each principal component (only include the number of components specified in part a)? For each of those countries, give the principal component scores (again only for the number of components specified in part a).

d. Analyze the significance of the entries in the correlation matrix for fields that are highly correlated or completely uncorrelated with the other fields (use a 90% confidence level, and consider a field highly correlated if it is correlated with over 75% of the other fields). If there are such fields, try removing them from the analysis. Does this help your interpretation of the analysis in b)?

1 http://lib.stat.cmu.edu/DASL/Datafiles/EuropeanJobs.html
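The steps in parts a through d can be sketched in R roughly as follows. The built-in swiss data is used purely as a stand-in for the employment data, and reading "correlated with over 75% of the other fields" as "significantly correlated at the 0.10 level with more than 75% of them" is an assumption about what the question intends. Note that prcomp returns unrotated components, even though it stores the loadings in an element named rotation.

```r
X <- swiss  # stand-in data; replace with the numeric employment columns

# a) Covariance-based PCA and the number of components reaching 90% of variance
pca <- prcomp(X, center = TRUE, scale. = FALSE)
var_share <- pca$sdev^2 / sum(pca$sdev^2)
k <- which(cumsum(var_share) >= 0.90)[1]

# b) Unrotated loadings: coefficients of each component on the centered variables
unrotated <- pca$rotation[, 1:k, drop = FALSE]
# Varimax rotation only makes sense with at least two components
rotated <- if (k >= 2) unclass(varimax(unrotated)$loadings) else unrotated

# c) Observations with the highest and lowest score on each retained component
scores <- pca$x[, 1:k, drop = FALSE]
extremes <- data.frame(
  component = colnames(scores),
  highest   = rownames(scores)[apply(scores, 2, which.max)],
  lowest    = rownames(scores)[apply(scores, 2, which.min)]
)
print(extremes)

# d) p-values for every pairwise Pearson correlation
p <- ncol(X)
pmat <- matrix(NA_real_, p, p, dimnames = list(names(X), names(X)))
for (i in 1:(p - 1)) {
  for (j in (i + 1):p) {
    pmat[i, j] <- pmat[j, i] <- cor.test(X[[i]], X[[j]])$p.value
  }
}
alpha <- 0.10  # 90% confidence level
share_sig <- rowSums(pmat < alpha, na.rm = TRUE) / (p - 1)
print(names(share_sig)[share_sig > 0.75])  # candidates for "highly correlated"
print(names(share_sig)[share_sig == 0])    # candidates for "uncorrelated"
```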

7) (Principal Component Analysis) Begin with the “census2.csv” datafile, which contains census data on various tracts in a district. The fields in the data are

Total Population (thousands)

Professional degree (percent)

Employed age over 16 (percent)

Government employed (percent)

Median home value (dollars)

a) Conduct a principal component analysis using the covariance matrix (the default for prcomp and many routines in other software), and interpret the results. How much of the variance is accounted for in the first component and why is this?

b) Try dividing the MedianHomeValue field by 100,000 so that the median home value in the dataset is measured in $100,000’s rather than in dollars. How does this change the analysis?

c) Compute the PCA with the correlation matrix instead. How does this change the result and how does your answer compare (if you did it) with your answer in b)?

d) Analyze the correlation matrix for this dataset for significance, and also look for variables that are extremely correlated or uncorrelated. Discuss the effect of this on the analysis.

e) (Extra Credit for Undergraduate Students) Discuss what using the correlation matrix does and why it may or may not be appropriate in this case.
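The effect of measurement units on covariance-based PCA can be seen in a small R sketch. It uses the built-in USArrests data as a stand-in, where Assault plays the role of the large-variance field; dividing it by 100 is an arbitrary choice for the illustration, analogous to rescaling median home value.

```r
X <- USArrests  # stand-in data; Assault has by far the largest variance

# Share of total variance captured by each component
share <- function(p) round(p$sdev^2 / sum(p$sdev^2), 3)

# Covariance-based PCA on the original units
pca_cov <- prcomp(X)

# Same analysis after rescaling the dominant variable
X2 <- transform(X, Assault = Assault / 100)
pca_cov_rescaled <- prcomp(X2)

# Correlation-based PCA (variables standardized first)
pca_cor <- prcomp(X, scale. = TRUE)
pca_cor_rescaled <- prcomp(X2, scale. = TRUE)

print(apply(X, 2, var))        # raw variances show why one field dominates
print(share(pca_cov))
print(share(pca_cov_rescaled))
print(share(pca_cor))
print(round(pca_cov$rotation[, 1], 3))  # first component loadings
```

Because standardizing removes the units, the correlation-based results are identical before and after rescaling, while the covariance-based results are not.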

8) (Principal Component Analysis, Extra Credit for Undergraduate Students) Download the “trackRecord.txt” dataset and perform a principal component analysis on the data. The data give track records for various countries in a series of events (100m, 200m, 400m, 800m, 1500m, 5000m, 10000m, Marathon). Note that the first three are measured in seconds and the last 4 in minutes. Choose your PCA method carefully and give a reason for your choice. Your method should account for the differences in scales of the fields. Try different ways of formulating the analysis until you get a small set of components that are easy to interpret. Finally, run a common factor analysis on the same data. What difference, if any, do you find? Does the factor analysis change your ability to interpret the results practically?
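For the comparison between PCA and common factor analysis, a hedged R sketch is given below. It uses the built-in ability.cov covariance list as a stand-in for the track record data, and the choice of two components and two factors is arbitrary for the illustration. With raw data, factanal would instead be called on the data frame itself.

```r
# Stand-in: ability.cov, a built-in covariance list for six ability tests.
R <- cov2cor(ability.cov$cov)

# PCA on the correlation matrix, which removes differences in scale
pc <- eigen(R)
pc_share <- pc$values / sum(pc$values)
# Loadings scaled as correlations between variables and components
pc_loadings <- pc$vectors[, 1:2] %*% diag(sqrt(pc$values[1:2]))
rownames(pc_loadings) <- rownames(R)
print(round(pc_share, 3))
print(round(pc_loadings, 3))

# Common factor analysis (maximum likelihood) with varimax rotation
fa <- factanal(factors = 2, covmat = ability.cov, rotation = "varimax")
print(fa$loadings, cutoff = 0.3)

# Uniquenesses: the part of each variable's variance not shared with the factors.
# PCA has no counterpart to this, which is one source of differing loadings.
print(round(fa$uniquenesses, 3))
```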

9) Problem #2 (Canonical Correlation Analysis – 20 points): Water, soil, and mosquito fish samples were collected at n = 165 sites/stations in the marshes of southern Florida. The following water variables were measured:

MEHGSWB Methyl Mercury in surface water, ng/L

TURB in situ surface water turbidity

DOCSWD Dissolved Organic Carbon in surface water, mg/L

SRPRSWFB Soluble Reactive Phosphorus in surface water, mg/L or ug/L

THGFSFC Total Mercury in mosquitofish (Gambusia affinis), average of 7 individuals, ug/kg

In addition, the following soil variables were measured:

THGSDFC Total Mercury in soil, ng/g

TCSDFB Total Carbon in soil, %

TPRSDFB Total Phosphorus in soil, ug/g

Perform a canonical correlation analysis, describing the relationships between the soil and water variables using the data (see footnote 2) found in data_marsh_cleaned_homework#2 (both xls and spss files under the course documents for week 3).

1. Answer the following questions regarding the canonical correlations.

a. Test the null hypothesis that the canonical correlations are all equal to zero. Give your test statistic, d.f., and p-value.

b. Test the null hypothesis that the second and third canonical correlations equal zero. Give your test statistic, d.f., and p-value.

c. Test the null hypothesis that the third canonical correlation equals zero. Give your test statistic, d.f., and p-value.

d. Present the three canonical correlations.

e. What can you conclude from the above analyses?

2. Answer the following questions regarding the canonical variates.

a. Give the formulae for the significant canonical variates for the soil and water variables.

b. Give the correlations between the significant canonical variates for soils and the soil variables, and the correlations between the significant canonical variates for water and the water variables.

c. What can you conclude from the above analyses?

For the top canonical correlation, the water variables that contribute most to the water canonical variate are DOCSWD and THGFSFC. The soil variables that contribute most to the soil canonical variate are TCSDFB and TPRSDFB. However, the correlation between the variates is only 0.149, which leads me to believe that the two variates, although statistically significant, may not be practically significant.
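The sequential tests and canonical variates asked for above can be sketched in R as follows. The example uses the built-in LifeCycleSavings data (the same split as in the cancor help page) as a stand-in, so it yields two canonical correlations rather than three. The tests use Bartlett's chi-square approximation to Wilks' lambda; other software may report an F approximation instead, and cancor may scale the variate coefficients differently from what other packages report.

```r
# Stand-in data; replace X and Y with the water and soil variable matrices.
X <- as.matrix(LifeCycleSavings[, c("pop15", "pop75")])
Y <- as.matrix(LifeCycleSavings[, c("sr", "dpi", "ddpi")])
n <- nrow(X); p <- ncol(X); q <- ncol(Y)

cc <- cancor(X, Y)
r <- cc$cor          # canonical correlations, largest first
s <- length(r)
print(r)

# Sequential tests: H0 is that canonical correlations k+1, ..., s are all zero.
tests <- do.call(rbind, lapply(0:(s - 1), function(k) {
  wilks <- prod(1 - r[(k + 1):s]^2)
  chisq <- -(n - 1 - (p + q + 1) / 2) * log(wilks)  # Bartlett's approximation
  df <- (p - k) * (q - k)
  data.frame(H0 = paste0("r", k + 1, "..r", s, " = 0"),
             wilks = wilks, chisq = chisq, df = df,
             p.value = pchisq(chisq, df, lower.tail = FALSE))
}))
print(tests)

# Coefficients defining the canonical variates (the "formulae")
print(cc$xcoef[, 1:s, drop = FALSE])
print(cc$ycoef[, 1:s, drop = FALSE])

# Canonical variate scores and their correlations with the original variables
U <- scale(X, center = cc$xcenter, scale = FALSE) %*% cc$xcoef[, 1:s, drop = FALSE]
V <- scale(Y, center = cc$ycenter, scale = FALSE) %*% cc$ycoef[, 1:s, drop = FALSE]
print(cor(X, U))
print(cor(Y, V))
```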

2 http://www.epa.gov/region4/sesd/reports/epa904r07001.html