**Subject Code & Title:** STAT6020 Predictive Analytics

This assignment is marked out of 27 and is worth 10% of your course grade for STAT6020.

The data set(s) are available from the Assessment section of the STAT6020 Blackboard site. If you need to clarify the wording of any of the questions, or if you have technical issues, you may post in the appropriate Discussion Forum, but your final submission must consist of your own work in accordance with the Academic Integrity Policy. It is assumed that you have read the relevant modules and worked through the examples and exercises. Read all instructions and questions carefully.

**STAT6020 Predictive Analytics Assignment 1 – New Castle University Australia.**

You are expected to use RStudio to prepare an R Markdown document, from which you will create a single PDF document to submit via the link in the Assessment section of the STAT6020 Blackboard site.

Make sure you clearly indicate your answers to each question/item. Make sure your final document includes all the R code and output required to answer each question. (This doesn’t necessarily mean that you will include all of the R code and all of the output that you may have explored during the process of creating your answer. Use your judgement.)

**Your code will be assessed on**

- Correctness.
- Suitable/sensible choices (explained or justified as necessary).
- Organisation, clarity and readability (e.g. appropriate comments in the code, meaningful object names, judicious use of indenting and other white space).

Your written answers will be assessed on whether the interpretations are correct and answer the questions clearly, with appropriate justification from your analysis. NB: A yes or no, a numerical answer, a plot, or some computer output on its own is rarely sufficient. A few sentences interpreting your results are usually expected. (Think of it as writing a report for your boss in your new data science job.)

**Grain Yield:**

The data set Yield.dat contains the yield of grain together with soil quality measurements at each of 215 sites in a portion of a field. Figure 1 shows the location of the measurement sites. The measurement sites are identified in the data set by a variable, x, indicating the measurement location along a particular east-west transect (a transect is a straight line across the earth’s surface along which measurements are taken). Measurement locations along a transect are approximately 12.2 m apart. The 8 transects, identified by a variable, y, are approximately 48.8 m apart. At harvest time, the harvesting machine was driven along each transect, stopping every 12.2 m to measure the yield of grain for that part of the field in bushels/acre. Measurements of 10 soil nutrients in parts per million (ppm) were made at each location: Boron (B), Calcium (Ca), Copper (Cu), Iron (Fe), Potassium (K), Magnesium (Mg), Manganese (Mn), Sodium (Na), Phosphorus (P), and Zinc (Zn). As can be seen in Figure 1, data were not available at a number of points in the site location grid.

We are interested in determining which soil nutrients are most important in determining grain yield.

**Preliminary exploration of the data:**

1. Use scatter plots to visually explore the relationship between yield and each of the 10 soil nutrients individually. Comment on the relationships and on which variables are likely to be important.
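As a rough illustration of the kind of plot grid this item asks for, the sketch below draws one panel per nutrient with a smooth trend line. It runs on simulated stand-in data, not on Yield.dat: the data frame `sim`, its values, and the way `yield` is generated are all assumptions made so the example is self-contained. Only the column names follow the data description above.

```r
# Simulated stand-in for the real data (an assumption, NOT Yield.dat):
# 215 rows, 10 positive "nutrient" columns and a yield column.
set.seed(1)
n <- 215
nutrients <- c("B", "Ca", "Cu", "Fe", "K", "Mg", "Mn", "Na", "P", "Zn")
sim <- as.data.frame(matrix(rlnorm(n * 10, meanlog = 1), nrow = n,
                            dimnames = list(NULL, nutrients)))
sim$yield <- 50 + 3 * log(sim$P) - 2 * sim$B + rnorm(n, sd = 2)

# One scatter plot per nutrient, yield on the vertical axis
op <- par(mfrow = c(2, 5), mar = c(4, 4, 1, 0.5))
for (v in nutrients) {
  plot(sim[[v]], sim$yield,
       xlab = paste(v, "(ppm)"), ylab = "Yield (bushels/acre)")
  lines(lowess(sim[[v]], sim$yield), col = "red")  # smooth trend to guide the eye
}
par(op)
```

A smoother such as `lowess()` is one possible way to make curvature visible; it is a design choice here, not a requirement of the question.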

**Multiple Linear Regression**

2. Use the lm() function to fit a multiple linear regression model for yield using all 10 of the soil nutrient variables as predictors. Which of the possible predictor variables appear to have a statistically significant relationship with the response? Use a 10% significance level.
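One possible way to pull out the predictors that are significant at the 10% level is sketched below. It uses simulated stand-in data (the data frame `sim` and its values are assumptions, not the contents of Yield.dat), and the object names `fit_all` and `signif_10` are illustrative only. The real data also contain the location variables x and y, which would need to be left out of the formula.

```r
# Simulated stand-in for the real data (an assumption, NOT Yield.dat)
set.seed(1)
n <- 215
nutrients <- c("B", "Ca", "Cu", "Fe", "K", "Mg", "Mn", "Na", "P", "Zn")
sim <- as.data.frame(matrix(rlnorm(n * 10, meanlog = 1), nrow = n,
                            dimnames = list(NULL, nutrients)))
sim$yield <- 50 + 3 * log(sim$P) - 2 * sim$B + rnorm(n, sd = 2)

# Full model: yield on all 10 nutrients (sim holds no other columns)
fit_all <- lm(yield ~ ., data = sim)
coefs <- summary(fit_all)$coefficients  # estimate, std. error, t value, p-value

# Predictors whose t-test p-value is below 0.10, intercept excluded
signif_10 <- setdiff(rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.10], "(Intercept)")
print(round(coefs, 4))
print(signif_10)
```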

3. A colleague has suggested refitting the model after filtering out observation 200 from the data. Does this improve the model? Justify your answer. Is the removal of this observation reasonable? Why or why not? Does it change your answer to the previous question?

Irrespective of your answer here, the remaining analyses should be carried out with observation 200 removed.
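The sketch below shows one way to compare fits with and without a single observation, together with a few standard influence diagnostics. It runs on simulated stand-in data in which an outlier has been deliberately planted at row 200, so it says nothing about observation 200 in Yield.dat. It also assumes that the observation number matches the row number of the data frame.

```r
# Simulated stand-in for the real data (an assumption, NOT Yield.dat)
set.seed(1)
n <- 215
nutrients <- c("B", "Ca", "Cu", "Fe", "K", "Mg", "Mn", "Na", "P", "Zn")
sim <- as.data.frame(matrix(rlnorm(n * 10, meanlog = 1), nrow = n,
                            dimnames = list(NULL, nutrients)))
sim$yield <- 50 + 1.5 * sim$P - 2 * sim$B + rnorm(n, sd = 2)
sim$yield[200] <- sim$yield[200] + 40  # planted outlier, purely for illustration

fit_all  <- lm(yield ~ ., data = sim)          # all rows
fit_drop <- lm(yield ~ ., data = sim[-200, ])  # row 200 excluded

# Diagnostics for row 200 in the full fit
diag_200 <- c(rstudent = unname(rstudent(fit_all)[200]),
              cooks_d  = unname(cooks.distance(fit_all)[200]),
              leverage = unname(hatvalues(fit_all)[200]))
print(diag_200)

# Side-by-side summary of the two fits
print(data.frame(model  = c("all rows", "row 200 removed"),
                 adj_r2 = c(summary(fit_all)$adj.r.squared,
                            summary(fit_drop)$adj.r.squared),
                 sigma  = c(summary(fit_all)$sigma, summary(fit_drop)$sigma)))
```

Numerical improvement alone does not settle whether removal is reasonable; that judgement still depends on what is known about the observation itself.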

4. Use the formula

yield ~ B + I(1/Ca) + log(Cu) + log(Fe) + I(1/K) + I(1/Mg) + log(Mn) + Na + log(P) + log(Zn)

in the lm() function to fit a second multiple linear regression model which uses transformations of some of the soil nutrient predictor variables. Does transforming the predictors provide an improved model? Justify your answer. Which of the possible predictor variables in this second model appear to have a statistically significant relationship with the response? Use a 10% significance level. How does this compare with your previous answer? (Note that these transformations were suggested by inspection of the individual bivariate relationships between each of the predictors and the yield. Each suggested transformation improved the linearity of the bivariate relationship.)
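A minimal sketch of how the untransformed and transformed models might be compared is given below. It uses simulated stand-in data (an assumption, not Yield.dat) in which yield really does depend on log(P), so the transformed model wins by construction; the real data may behave differently. Adjusted R², residual standard error and AIC are shown as possible comparison criteria, not as the required ones.

```r
# Simulated stand-in for the real data (an assumption, NOT Yield.dat)
set.seed(1)
n <- 215
nutrients <- c("B", "Ca", "Cu", "Fe", "K", "Mg", "Mn", "Na", "P", "Zn")
sim <- as.data.frame(matrix(rlnorm(n * 10, meanlog = 1), nrow = n,
                            dimnames = list(NULL, nutrients)))
sim$yield <- 50 + 3 * log(sim$P) - 2 * sim$B + rnorm(n, sd = 2)

# Model on the raw predictors
fit_raw <- lm(yield ~ B + Ca + Cu + Fe + K + Mg + Mn + Na + P + Zn, data = sim)

# Model on the transformed predictors; I() protects the arithmetic 1/Ca etc.
fit_trans <- lm(yield ~ B + I(1/Ca) + log(Cu) + log(Fe) + I(1/K) + I(1/Mg) +
                  log(Mn) + Na + log(P) + log(Zn), data = sim)

comparison <- data.frame(
  model  = c("raw", "transformed"),
  adj_r2 = c(summary(fit_raw)$adj.r.squared, summary(fit_trans)$adj.r.squared),
  sigma  = c(summary(fit_raw)$sigma, summary(fit_trans)$sigma),
  AIC    = c(AIC(fit_raw), AIC(fit_trans))
)
print(comparison)
```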

**Subset Selection:**

5. Using a forward stepwise selection approach, choose an appropriate subset from the set of predictor variables used in the previous question (i.e. with some of the original variables transformed). Explain and justify your choice. Provide the summary of the linear model using this subset of predictors. Interpret all of the fitted model parameters.

**Hint:** There are several approaches that could be used in this question. It is possible to perform the forward stepwise procedure manually, but there are also many functions available that automate parts of the procedure. There are also several different criteria that can be used to select the variable to be added in each step, and several different criteria that can indicate when it is appropriate to stop adding variables. You may use whichever tools and criteria you like as long as you explain clearly what you are doing and why. However, I suggest you use the regsubsets() function in the leaps package and use adjusted R² to determine when to stop adding variables. This is reasonably straightforward and is consistent with the approach used in the ISLR text. NB: If gy.fss is the object created from the output of your regsubsets() function, then plot(gy.fss, scale="adjr2") and/or plot(summary(gy.fss)$adjr2) may be useful.
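The approach suggested in the hint could look roughly like the sketch below, which assumes the leaps package is installed. It runs on simulated stand-in data (an assumption, not Yield.dat); `gy.fss` follows the name used in the hint, while `fss_summary` and `best_size` are illustrative names.

```r
library(leaps)

# Simulated stand-in for the real data (an assumption, NOT Yield.dat)
set.seed(1)
n <- 215
nutrients <- c("B", "Ca", "Cu", "Fe", "K", "Mg", "Mn", "Na", "P", "Zn")
sim <- as.data.frame(matrix(rlnorm(n * 10, meanlog = 1), nrow = n,
                            dimnames = list(NULL, nutrients)))
sim$yield <- 50 + 3 * log(sim$P) - 2 * sim$B + rnorm(n, sd = 2)

# Forward stepwise selection over the transformed predictors;
# nvmax = 10 lets the search run up to the full model
gy.fss <- regsubsets(yield ~ B + I(1/Ca) + log(Cu) + log(Fe) + I(1/K) +
                       I(1/Mg) + log(Mn) + Na + log(P) + log(Zn),
                     data = sim, nvmax = 10, method = "forward")
fss_summary <- summary(gy.fss)

# Model size with the largest adjusted R-squared, and its coefficients
best_size <- which.max(fss_summary$adjr2)
plot(fss_summary$adjr2, type = "b",
     xlab = "Number of predictors", ylab = "Adjusted R-squared")
points(best_size, fss_summary$adjr2[best_size], col = "red", pch = 19)
print(coef(gy.fss, id = best_size))
```

Picking the exact maximum of adjusted R² is only one stopping rule; a smaller model whose adjusted R² is nearly as high may be easier to justify and interpret.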

**Model Regularisation:**

In this section you should again be using the set of predictor variables with some of the original variables transformed.

6. Use glmnet() to fit LASSO models for a collection of λ values (as automatically selected by glmnet()). Plot the resulting object to display the model coefficients as a function of their L1 norm. Interpret this plot.

**Hint:** This means that you should be explaining what is plotted (axes, lines, points, labels) and what this plot therefore tells you about this example.
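A minimal sketch of the LASSO coefficient-path plot follows, assuming the glmnet package is installed and using simulated stand-in data (an assumption, not Yield.dat). glmnet() takes a numeric predictor matrix, so the transformed predictors are built here with model.matrix(); the names `x`, `y` and `lasso_fit` are illustrative.

```r
library(glmnet)

# Simulated stand-in for the real data (an assumption, NOT Yield.dat)
set.seed(1)
n <- 215
nutrients <- c("B", "Ca", "Cu", "Fe", "K", "Mg", "Mn", "Na", "P", "Zn")
sim <- as.data.frame(matrix(rlnorm(n * 10, meanlog = 1), nrow = n,
                            dimnames = list(NULL, nutrients)))
sim$yield <- 50 + 3 * log(sim$P) - 2 * sim$B + rnorm(n, sd = 2)

# Predictor matrix of transformed variables; [, -1] drops the intercept column
x <- model.matrix(yield ~ B + I(1/Ca) + log(Cu) + log(Fe) + I(1/K) + I(1/Mg) +
                    log(Mn) + Na + log(P) + log(Zn), data = sim)[, -1]
y <- sim$yield

# alpha = 1 selects the LASSO penalty; the lambda sequence is chosen by glmnet()
lasso_fit <- glmnet(x, y, alpha = 1)

# Coefficient paths against the L1 norm of the coefficient vector;
# label = TRUE tags each path with its column index in x
plot(lasso_fit, xvar = "norm", label = TRUE)
print(colnames(x))  # maps the path labels back to variable names
```

In this plot each line traces one coefficient, and the axis along the top gives the number of non-zero coefficients at that point on the path.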

7. Use cv.glmnet() to explore how the estimated Mean Squared Error changes as a function of the shrinkage parameter, λ. Plot the resulting object and interpret this plot. Select a λ value that corresponds to a model exhibiting an appropriate trade-off between performance and regularisation. What are the model coefficients for the selected model?
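The sketch below shows one way to run the cross-validation and read off candidate λ values, assuming the glmnet package is installed and using simulated stand-in data (an assumption, not Yield.dat). Using `lambda.1se` as the trade-off value is a common convention adopted here for illustration; it is not the only defensible choice.

```r
library(glmnet)

# Simulated stand-in for the real data (an assumption, NOT Yield.dat)
set.seed(1)
n <- 215
nutrients <- c("B", "Ca", "Cu", "Fe", "K", "Mg", "Mn", "Na", "P", "Zn")
sim <- as.data.frame(matrix(rlnorm(n * 10, meanlog = 1), nrow = n,
                            dimnames = list(NULL, nutrients)))
sim$yield <- 50 + 3 * log(sim$P) - 2 * sim$B + rnorm(n, sd = 2)

x <- model.matrix(yield ~ B + I(1/Ca) + log(Cu) + log(Fe) + I(1/K) + I(1/Mg) +
                    log(Mn) + Na + log(P) + log(Zn), data = sim)[, -1]
y <- sim$yield

# 10-fold cross-validation (the default); folds are random, so fix the seed
set.seed(2)
cv_fit <- cv.glmnet(x, y, alpha = 1)

# Cross-validated MSE with error bars across the lambda sequence
plot(cv_fit)

# lambda.min minimises the CV error; lambda.1se is the largest lambda whose
# error is within one standard error of that minimum (a sparser model)
print(c(lambda.min = cv_fit$lambda.min, lambda.1se = cv_fit$lambda.1se))
print(coef(cv_fit, s = "lambda.1se"))
```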

8. Fit a multiple regression model using only those variables suggested by the LASSO model you described in the previous item. Are these the same variables that were suggested by the forward stepwise procedure for a model with the same number of variables? Briefly explain how and why the model coefficients for this multiple regression model differ from the LASSO model coefficients.
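One possible way to refit by ordinary least squares on the LASSO-selected variables, and to lay the two sets of coefficients side by side, is sketched below. It assumes the glmnet package is installed, uses simulated stand-in data (an assumption, not Yield.dat), and takes the `lambda.1se` model as the selected LASSO fit purely for illustration.

```r
library(glmnet)

# Simulated stand-in for the real data (an assumption, NOT Yield.dat)
set.seed(1)
n <- 215
nutrients <- c("B", "Ca", "Cu", "Fe", "K", "Mg", "Mn", "Na", "P", "Zn")
sim <- as.data.frame(matrix(rlnorm(n * 10, meanlog = 1), nrow = n,
                            dimnames = list(NULL, nutrients)))
sim$yield <- 50 + 3 * log(sim$P) - 2 * sim$B + rnorm(n, sd = 2)

x <- model.matrix(yield ~ B + I(1/Ca) + log(Cu) + log(Fe) + I(1/K) + I(1/Mg) +
                    log(Mn) + Na + log(P) + log(Zn), data = sim)[, -1]
cv_fit <- cv.glmnet(x, sim$yield, alpha = 1)

# LASSO coefficients at the chosen lambda, as a named numeric vector
lasso_beta <- as.matrix(coef(cv_fit, s = "lambda.1se"))[, 1]
selected <- setdiff(names(lasso_beta)[lasso_beta != 0], "(Intercept)")

# Unpenalised least-squares refit on the selected terms only;
# reformulate() rebuilds a formula such as yield ~ B + log(P)
ols_fit <- lm(reformulate(selected, response = "yield"), data = sim)
ols_beta <- coef(ols_fit)

# LASSO estimates are shrunk by the penalty; the OLS refit is not
print(cbind(lasso = lasso_beta[names(ols_beta)], ols = ols_beta))
```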