**Subject Code & Title:** SIT743 Bayesian Learning and Graphical Models

**Weighting:** 25%

**INSTRUCTIONS:** For this assignment, you need to submit the following THREE files.

1. A written document (a single PDF only) covering all of the items described in the questions. All answers to the questions must be written in this document, i.e., not in the code files that you will also be submitting. All the relevant results (outputs, figures) obtained by executing your R code must be included in this document. For questions that involve mathematical formulas, you may handwrite the answers, scan them to PDF, and combine them with your answer document. Submit a single combined PDF of your answer document.

2. A separate “.R” or “.txt” file containing the R code (script) that you implemented to produce the results. Name the file “name-Student ID-Ass1-Code.R” (where ‘name’ is replaced with your surname or first name, and ‘Student ID’ with your student ID).

3. A data file named “name-Student ID-HI My Data.txt” (where ‘name’ is replaced with your surname or first name, and ‘Student ID’ with your student ID).

**SIT743 Bayesian Learning and Graphical Models Assignment-Deakin University Australia.**

• All the documents and files should be submitted (uploaded) via the SIT743 Cloud Deakin Assignment Dropbox by the due date and time.

• Zip files are NOT accepted. All three files should be uploaded separately to Cloud Deakin.

• E-mail or manual submissions are NOT allowed. Photos of the document are NOT allowed.

• The questions Q2 and Q3 (except Q3.2c) do not require any R programming.

Some of the questions in this assignment require you to use the “Heron Island” data set. This data set is given as a CSV file, named “AIMS Heron Island Data.csv”. You can download this from the Assignment folder in Cloud Deakin. Below is the description of this data set.

**Heron Island data set:**

This data set gives hourly weather measurements collected at Heron Island, an island in the Great Barrier Reef (North Queensland, Australia), during the period between January 2019 and January 2020.

The data set includes the following variables, in the same column order as they appear in the file:

• **Water temperature @1.6 m depth:** Water temperature in degrees Celsius at a depth of 1.6 m below the sea surface.

• **Wind Speed:** Wind speed in kilometres per hour.

• **Air Temperature:** Air temperature in degrees Celsius.

• **Air Pressure:** Pressure measurements expressed in hectopascals.

• **Humidity:** Humidity as a percentage.

Q1)

• Download the data file “AIMS Heron Island Data.csv” and save it to your R working directory.

• Assign the data to a matrix, e.g. using

the.data <- as.matrix(read.csv("AIMSHeronIslandData.csv", header = TRUE, sep = ","))

• Generate a sample of 200 data using the following:

my.data <- the.data[sample(1:366, 200), c(1:5)]

Save “my.data” to a text file titled “name-Student ID-HI My Data.txt” using the following R code (NOTE: you must upload this data text file and the R code along with your submission; otherwise ZERO marks will be given for this whole question).

write.table(my.data, "name-Student ID-HI My Data.txt")

Use the sampled data (“my.data”) to answer the following questions.

1.1) Draw a histogram and a box plot for the ‘Humidity’ variable. Provide a five-number summary of the humidity values. Use these to comment on the distribution of the humidity variable.

1.2) Which summary statistics would you choose to summarize the center and the spread for the ‘Humidity’ data? Why?
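As a rough illustration of the tools that 1.1 and 1.2 call for, the sketch below draws the two plots and prints a five-number summary. The `humidity` vector is synthetic (an assumption made only so the snippet runs on its own); in the assignment it would be replaced by the Humidity column of `my.data`.

```r
# Synthetic stand-in for the Humidity column (hypothetical values, not the real data)
set.seed(1)
humidity <- pmin(100, rnorm(200, mean = 75, sd = 10))

hist(humidity, main = "Histogram of Humidity", xlab = "Humidity (%)")
boxplot(humidity, horizontal = TRUE, main = "Box plot of Humidity",
        xlab = "Humidity (%)")

# Five-number summary: minimum, lower hinge, median, upper hinge, maximum
five_num <- fivenum(humidity)
names(five_num) <- c("min", "lower_hinge", "median", "upper_hinge", "max")
print(five_num)

# Candidate centre/spread pairs: robust (median, IQR) and moment-based (mean, sd)
centre_spread <- c(median = median(humidity), IQR = IQR(humidity),
                   mean = mean(humidity), sd = sd(humidity))
print(centre_spread)
```

Printing both pairs side by side makes it easier to argue which pair suits the shape seen in the histogram.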

1.3) Draw a scatter plot of ‘Air Temperature’ (as x) and ‘Water temperature @1.6 m depth’ (as y). Name the axes. Fit a linear regression model to the above two variables, and plot the (regression) line on the same scatter plot.

Write down the linear regression equation. Compute the correlation coefficient and the coefficient of determination. Explain what these results reveal.
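The steps in 1.3 could be sketched as follows. The two vectors are synthetic placeholders (assumed values, not the Heron Island measurements), and the variable names `air_temp` and `water_temp` are chosen here only for illustration.

```r
# Synthetic stand-ins for the two temperature columns (hypothetical values)
set.seed(2)
air_temp <- runif(200, min = 18, max = 30)
water_temp <- 12 + 0.5 * air_temp + rnorm(200, sd = 0.8)

# Least-squares fit of water temperature on air temperature
fit <- lm(water_temp ~ air_temp)

plot(air_temp, water_temp,
     xlab = "Air Temperature (degrees Celsius)",
     ylab = "Water temperature @1.6 m depth (degrees Celsius)")
abline(fit, col = "red", lwd = 2)   # regression line on the same scatter plot

coefs <- coef(fit)                  # intercept and slope
r <- cor(air_temp, water_temp)      # Pearson correlation coefficient
r_squared <- summary(fit)$r.squared # coefficient of determination

cat(sprintf("water_temp = %.3f + %.3f * air_temp\n", coefs[1], coefs[2]))
cat(sprintf("r = %.3f, R^2 = %.3f\n", r, r_squared))
```

For a simple linear regression with one predictor, the coefficient of determination equals the square of the correlation coefficient, which gives a quick consistency check on the two numbers.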

1.4) Create three new variables, namely ‘Water T Bucket’, ‘Wind S Bucket’, and ‘Air Pre Bucket’, each of which can take two values (‘High’ and ‘Low’), based on the criteria defined as follows for the three observed variables in the data: ‘Water temperature @1.6 m depth’, ‘Wind speed’ and ‘Air pressure’, respectively:

• ‘Water T Bucket’: This takes the value ‘High’ when the ‘Water temperature @1.6 m depth’ is above 25 degrees Celsius; otherwise it is ‘Low’.

• ‘Wind S Bucket’: This takes the value ‘High’ when the ‘Wind speed’ is above 30 km/h; otherwise it is ‘Low’.

• ‘Air Pre Bucket’: This takes the value ‘High’ when the ‘Air pressure’ is above 1019 hectopascals; otherwise it is ‘Low’.

a) Write an R program to construct a cross table (cross tabulation) using the above three new variables (‘Water T Bucket’, ‘Wind S Bucket’, and ‘Air Pre Bucket’). Show the obtained cross table.

b) Use the above obtained cross table to answer the following questions. Consider that a record (row) is selected at random,

i) What is the probability that the ‘Water T Bucket’ is high given that the ‘Wind S Bucket’ is low and the ‘Air Pre Bucket’ is low?

ii) What is the probability that the ‘Water T Bucket’ is low given that the ‘Wind S Bucket’ is low?

iii) Are ‘Water T Bucket’ and ‘Wind S Bucket’ independent? Give a reason.

iv) Are ‘Water T Bucket’ and ‘Wind S Bucket’ mutually exclusive? Explain.
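One way the bucketing and cross tabulation in 1.4 might be organised is sketched below. The three measurement vectors are synthetic (assumed values), and the helper `to_bucket` is a hypothetical name introduced for this illustration; only the thresholds 25, 30 and 1019 come from the question.

```r
# Synthetic stand-ins for the three measured columns (hypothetical values)
set.seed(3)
n <- 200
water_temp   <- rnorm(n, mean = 25, sd = 2)
wind_speed   <- rexp(n, rate = 1 / 25)
air_pressure <- rnorm(n, mean = 1018, sd = 4)

# Hypothetical helper: "High" above the threshold, "Low" otherwise
to_bucket <- function(x, threshold) {
  factor(ifelse(x > threshold, "High", "Low"), levels = c("High", "Low"))
}

cross_tab <- table(WaterT = to_bucket(water_temp, 25),
                   WindS  = to_bucket(wind_speed, 30),
                   AirPre = to_bucket(air_pressure, 1019))
print(ftable(cross_tab))

# Conditional probability P(WaterT = High | WindS = Low, AirPre = Low)
p_cond <- cross_tab["High", "Low", "Low"] / sum(cross_tab[, "Low", "Low"])
print(p_cond)

# Independence check for WaterT and WindS: joint versus product of marginals
two_way  <- margin.table(cross_tab, c(1, 2))
joint    <- prop.table(two_way)
expected <- outer(rowSums(joint), colSums(joint))
print(round(joint - expected, 4))   # values near zero suggest independence
```

Mutual exclusivity is a different property from independence: it would require the cell where both buckets take the stated values to be empty, which can be read directly off `two_way`.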

Q2)

2.1)

John has an urn kept in his room, which contains four red and six white balls. He performs two trials in a sequence as follows:

In the first trial, he picks a ball from the urn at random and marks its colour. If he gets a red ball, he returns two red balls (i.e., with one additional red ball) back into the urn. If he gets a white ball, he returns three white balls (i.e., with two additional white balls) back into the urn.

In the second trial, he randomly selects a ball from that urn again and marks its colour.

What is the probability that John picked two balls with different colours from the two trials?

2.2)

a) State two differences between the frequentist and the Bayesian ways of estimating a parameter.

b) Why are conjugate priors useful in Bayesian statistics?

c) Give an example of a conjugate pair.

Q3) Frequentist and Bayesian estimations

An autonomous electric vehicle provider, BiSLa Ltd., manufactures batteries to power its vehicles. BiSLa Ltd. assumes that the lifetime x of a battery follows an exponential distribution with an unknown average lifetime of θ, as given below.

$$p(x \mid \theta) = \frac{1}{\theta}\, e^{-x/\theta}, \quad x \ge 0,\ \theta > 0$$

Assume that n batteries are produced, and that their lifetimes are independently and identically distributed (iid).

3.1) BiSLa Ltd. first decided to use a frequentist approach to arrive at an estimate for θ. Answer the following questions.

a) Show that the joint distribution of the lifetimes of the n batteries can be given by the equation below (show the steps clearly to obtain this).

$$p(x_1, \dots, x_n \mid \theta) = \frac{1}{\theta^{n}} \exp\left(-\frac{1}{\theta} \sum_{i=1}^{n} x_i\right)$$

b) Find a simplified expression for the log-likelihood function.

d) Suppose that the lifetimes of five of their batteries are {6, 10, 12, 5, 9}. What is the maximum likelihood estimate of the parameter θ given this data ($\hat{\theta}_{\mathrm{MLE}}$)?
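A numerical check can be a useful companion to the analytical work in 3.1. The sketch below maximises an exponential log-likelihood, parameterised by its mean, with `optimize()`. The `lifetimes` vector is hypothetical and deliberately differs from the values in part d); the search interval is also an assumption.

```r
# Hypothetical lifetimes (not the values from the question)
lifetimes <- c(3.1, 7.4, 2.2, 9.8, 5.5, 4.0)

# Log-likelihood of an exponential model whose mean is theta (rate = 1 / theta)
log_lik <- function(theta, x) {
  sum(dexp(x, rate = 1 / theta, log = TRUE))
}

# One-dimensional numerical maximisation over an assumed search interval
opt <- optimize(log_lik, interval = c(0.01, 100), x = lifetimes, maximum = TRUE)
theta_hat <- opt$maximum

cat(sprintf("Numerical MLE: %.4f, sample mean: %.4f\n",
            theta_hat, mean(lifetimes)))

# Visual check: the log-likelihood curve with the numerical maximiser marked
theta_grid <- seq(1, 20, length.out = 400)
plot(theta_grid, sapply(theta_grid, log_lik, x = lifetimes), type = "l",
     xlab = "theta", ylab = "Log-likelihood")
abline(v = theta_hat, lty = 2, col = "red")
```

Comparing the numerical maximiser with a closed-form expression derived by hand is a quick way to catch algebra slips.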

3.2) Engineers of BiSLa Ltd. have now consulted a university research facility that has specialised in batteries and their chemical compositions for several years. They obtained some prior information about the lifetime of batteries of similar capacity to the one they manufacture. Researchers, who are experts in this field, mentioned that the average lifetime of a battery (θ) follows a pattern that can be described using an Inverse-Gamma distribution, Inverse Gamma (a, b), as given below, with hyperparameters a = 1.2 and b = 2.

$$p(\theta \mid a, b) = \frac{b^{a}}{\Gamma(a)}\, \theta^{-(a+1)}\, e^{-b/\theta}, \quad \theta > 0$$

a) BiSLa Ltd. has decided to use this prior information for their estimation.

Use the Inverse-Gamma prior (Inverse Gamma (a, b)) and obtain an expression for the posterior distribution (show all the steps).

Show that the posterior distribution is also an Inverse-Gamma distribution, Inverse Gamma (a′, b′), with different hyperparameters a′ and b′. Express a′ and b′ in terms of a, b, and the observed data.

b) Using the values suggested by the experts for the hyperparameters of the prior (i.e., the a and b values), and the battery lifetimes that have been observed from five batteries, i.e., {6, 10, 12, 5, 9}, find the values of a′ and b′.

What is the maximum a posteriori (MAP) estimate of θ?

Note that the mean and the mode of an Inverse Gamma (a, b) distribution are given by b/(a − 1) and b/(a + 1), respectively.

c) Write an R program and plot the obtained likelihood distribution, prior distribution and posterior distribution on the same graph. Use different colors to show the distributions on the plot.
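A plot of this kind could be produced along the following lines. This is a sketch under stated assumptions: the hyperparameters `a` and `b` and the `lifetimes` vector are hypothetical (not the values from the question), `dinvgamma` is a helper written here because base R has no built-in inverse-gamma density, and the posterior is obtained numerically on a grid rather than from a closed form.

```r
# Hypothetical helper: inverse-gamma density with shape a and scale b
dinvgamma <- function(theta, a, b) {
  exp(a * log(b) - lgamma(a) - (a + 1) * log(theta) - b / theta)
}

lifetimes <- c(3.1, 7.4, 2.2, 9.8, 5.5, 4.0)  # hypothetical data
a <- 3                                        # hypothetical hyperparameters
b <- 8

theta <- seq(0.05, 30, length.out = 2000)
step  <- theta[2] - theta[1]

prior <- dinvgamma(theta, a, b)
# Likelihood of an exponential model with mean theta, evaluated on the grid
lik <- sapply(theta, function(t) prod(dexp(lifetimes, rate = 1 / t)))
lik_scaled <- lik / (sum(lik) * step)   # rescaled to unit area for plotting

# Grid posterior: prior times likelihood, normalised numerically
post <- prior * lik
post <- post / (sum(post) * step)

plot(theta, post, type = "l", col = "red", lwd = 2,
     ylim = range(c(prior, lik_scaled, post)),
     xlab = "theta", ylab = "Density")
lines(theta, prior, col = "blue", lwd = 2)
lines(theta, lik_scaled, col = "darkgreen", lwd = 2)
legend("topright", legend = c("Prior", "Likelihood (scaled)", "Posterior"),
       col = c("blue", "darkgreen", "red"), lwd = 2)

theta_map <- theta[which.max(post)]     # grid approximation of the MAP
cat(sprintf("Grid MAP estimate: %.3f\n", theta_map))
```

Rescaling the likelihood to unit area is only a plotting convenience so that the three curves share one vertical axis; it does not change where the likelihood peaks.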

Q4) Bayesian inference for Gaussians (unknown mean and known variance)

A factory producing electric lamp holders performs torque tests to confirm the quality of their lamp holders. They are quality tested by measuring the rotational force required to turn, open or close, a bulb on the lamp holder. Tests on a random sample of n lamp holders show an average torque required to be 2.5 N-m (Newton meters). Assume that the torque measurements are normally distributed with an unknown mean θ and a known standard deviation of 0.2 N-m. Suppose your prior distribution for θ is normal with mean 3 N-m and standard deviation of 1 N-m.

a) Write an expression for the posterior distribution for θ in terms of n. (Do not derive the formulae.)

b) For n=100, find the mean and the standard deviation of the posterior distribution. Comment on the posterior variance

c) Assume that the prior distribution of torque is changed, and now the prior is distributed as defined below over the range between 1 and 5:

Write an R program to implement this prior, and compute the posterior distribution considering n = 1. Using the R program, find the posterior mean estimate of θ. Sketch, on a single set of coordinate axes, the obtained prior, likelihood and posterior distributions.
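When a prior is not conjugate, a grid approximation is one simple way to obtain the posterior. The sketch below assumes a triangular prior on [1, 5] peaking at 3; this shape is purely hypothetical and stands in for whatever prior the question defines. The sample mean 2.5, standard deviation 0.2 and n = 1 are taken from the question.

```r
theta <- seq(1, 5, length.out = 1001)
step  <- theta[2] - theta[1]

# Hypothetical triangular prior on [1, 5] with its peak at 3 (an assumption)
prior_unnorm <- pmax(0, 1 - abs(theta - 3) / 2)
prior <- prior_unnorm / (sum(prior_unnorm) * step)

xbar  <- 2.5   # observed average torque (from the question)
sigma <- 0.2   # known standard deviation (from the question)
n     <- 1     # sample size considered in part c)

# Likelihood of the sample mean as a function of theta
lik <- dnorm(xbar, mean = theta, sd = sigma / sqrt(n))
lik_scaled <- lik / (sum(lik) * step)   # unit area, for plotting only

# Grid posterior and its mean
post <- prior * lik
post <- post / (sum(post) * step)
post_mean <- sum(theta * post) * step
cat(sprintf("Posterior mean estimate: %.4f\n", post_mean))

plot(theta, post, type = "l", col = "red", lwd = 2,
     ylim = range(c(prior, lik_scaled, post)),
     xlab = "theta (N-m)", ylab = "Density")
lines(theta, prior, col = "blue", lwd = 2)
lines(theta, lik_scaled, col = "darkgreen", lwd = 2)
legend("topright", legend = c("Prior", "Likelihood (scaled)", "Posterior"),
       col = c("blue", "darkgreen", "red"), lwd = 2)
```

Swapping in a different prior only requires changing the `prior_unnorm` line; the normalisation and the posterior-mean sum stay the same.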

**Q5) Clustering:**

5.1) K-Means clustering: Use the data file “IT data.txt” provided in Cloud Deakin for this question. Load the file using the following:

zz <- read.table("ITdata.txt")

zz <- as.matrix(zz)

Use k=5 and perform k-means clustering on this data. Show the results using a scatter plot (show the different clusters with different colours). Comment on the clusters obtained.

5.2) Spectral Clustering: Use the same data set (zz) and perform a spectral clustering (use the number of clusters/centers as 5). Show the results on a scatter plot (with colour coding).

Compare these clusters with the clusters obtained using the k-means above and comment on the results.
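The contrast between the two methods can be seen on a toy example. The sketch below uses synthetic data (two concentric rings, an assumption chosen because it separates the methods clearly) with k = 2 rather than the 5 clusters asked for above. The spectral step is a hand-rolled variant using a Gaussian affinity and a symmetrically normalised affinity matrix; the bandwidth `sigma` is an assumed tuning value. Packages such as kernlab offer a ready-made `specc()` function as an alternative.

```r
set.seed(5)
n_each <- 150
truth  <- rep(1:2, each = n_each)

# Synthetic data: two concentric rings (hypothetical, not the assignment data)
angle   <- runif(2 * n_each, 0, 2 * pi)
radius  <- c(rnorm(n_each, 1, 0.05), rnorm(n_each, 3, 0.05))
zz_demo <- cbind(radius * cos(angle), radius * sin(angle))
k <- 2

# K-means directly on the coordinates
km <- kmeans(zz_demo, centers = k, nstart = 20)

# Spectral clustering sketch: Gaussian affinity, normalised, top-k eigenvectors
sigma <- 0.3                                   # assumed kernel bandwidth
d2 <- as.matrix(dist(zz_demo))^2
A  <- exp(-d2 / (2 * sigma^2))
diag(A) <- 0
deg <- rowSums(A)
norm_affinity <- A / sqrt(outer(deg, deg))     # D^(-1/2) A D^(-1/2)
eig <- eigen(norm_affinity, symmetric = TRUE)  # eigenvalues in decreasing order
U <- eig$vectors[, 1:k]
U <- U / sqrt(rowSums(U^2))                    # row-normalise the embedding
sc <- kmeans(U, centers = k, nstart = 20)

par(mfrow = c(1, 2))
plot(zz_demo, col = km$cluster, pch = 19, asp = 1,
     main = "K-means", xlab = "x", ylab = "y")
plot(zz_demo, col = sc$cluster, pch = 19, asp = 1,
     main = "Spectral clustering", xlab = "x", ylab = "y")
par(mfrow = c(1, 1))

# Agreement of the spectral labels with the ring membership (up to relabelling)
agreement <- max(mean(sc$cluster == truth), mean(sc$cluster != truth))
print(agreement)
```

K-means partitions by distance to centroids, so it tends to cut the rings into halves, whereas the spectral embedding groups points that are connected through chains of near neighbours.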

Q6)

For this question you will be using the “AIMS Heron Island Data” data set (as used in Q1). This data set is given as a CSV file, named “AIMS Heron Island Data.csv”. Consider only one variable, namely “Water temperature @1.6m depth” (WT), to answer the following questions.

6.1) Plot the histogram for WT data. Comment on the shape. How many modes can be observed in the data?

6.2) Fit a single Gaussian model N(μ, σ²) to the distribution of the data, where μ is the mean and σ is the standard deviation of the Gaussian distribution.

Find the maximum likelihood estimate (MLE) of the parameters, i.e., the mean μ and the standard deviation σ.

Plot the obtained (single Gaussian) density distribution along with the histogram on the same graph.
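A single-Gaussian fit of this kind could be sketched as below. The `wt` vector is synthetic (an assumption so that the snippet is self-contained); in the assignment it would hold the WT column.

```r
# Synthetic stand-in for the WT column (hypothetical values)
set.seed(8)
wt <- rnorm(500, mean = 25, sd = 1.5)

# Gaussian MLEs: the sample mean and the divide-by-n standard deviation
mu_hat    <- mean(wt)
sigma_hat <- sqrt(mean((wt - mu_hat)^2))  # note: sd() divides by n - 1 instead

cat(sprintf("mu_hat = %.4f, sigma_hat = %.4f\n", mu_hat, sigma_hat))

# Histogram on the density scale with the fitted Gaussian overlaid
hist(wt, freq = FALSE, breaks = 30,
     main = "WT with fitted single Gaussian",
     xlab = "Water temperature (degrees Celsius)")
curve(dnorm(x, mean = mu_hat, sd = sigma_hat), add = TRUE, col = "red", lwd = 2)
```

Using `freq = FALSE` puts the histogram on the density scale, which is what allows the fitted curve to be drawn on the same axes.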

6.3) Fit a mixture of Gaussians model to the distribution of the data, using a number of Gaussians equal to the number of modes found in the data (in Q6.1 above). Use R programming to perform this. Provide the mixing coefficients, mean and standard deviation for each of the Gaussians found. Plot these Gaussians on top of the histogram plot. Include a plot of the combined density distribution as well (use different colors for the density plots in the same graph).

6.4) Provide a plot of the log likelihood values obtained over the iterations and comment on them.

6.5) Comment on the distribution models obtained in Q6.2 and Q6.3. Which one is better?
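The mixture fit in 6.3 and the log-likelihood trace in 6.4 can be illustrated with a small expectation-maximisation (EM) loop written in base R. This is a sketch under assumptions: the `wt` data are synthetic and bimodal, two components are used, and the initial values and stopping tolerance are arbitrary choices. A dedicated mixture-modelling package could be used instead.

```r
# Synthetic bimodal stand-in for the WT data (hypothetical values)
set.seed(6)
wt <- c(rnorm(300, mean = 22, sd = 0.8), rnorm(500, mean = 27, sd = 1.0))
k  <- 2

# Initial guesses (arbitrary): equal weights, quartiles as means, pooled sd
mix   <- rep(1 / k, k)
mu    <- as.numeric(quantile(wt, c(0.25, 0.75)))
sigma <- rep(sd(wt), k)
log_lik <- numeric(0)

for (iter in 1:200) {
  # E-step: weighted component densities (n x k) and responsibilities
  dens <- sapply(1:k, function(j) mix[j] * dnorm(wt, mu[j], sigma[j]))
  log_lik[iter] <- sum(log(rowSums(dens)))
  resp <- dens / rowSums(dens)

  # M-step: update mixing coefficients, means and standard deviations
  nk    <- colSums(resp)
  mix   <- nk / length(wt)
  mu    <- colSums(resp * wt) / nk
  sigma <- sqrt(colSums(resp * outer(wt, mu, "-")^2) / nk)

  # Stop once the log-likelihood has effectively stopped increasing
  if (iter > 1 && abs(log_lik[iter] - log_lik[iter - 1]) < 1e-8) break
}

print(data.frame(component = 1:k, mixing = mix, mean = mu, sd = sigma))

# Components and combined density over the histogram
hist(wt, freq = FALSE, breaks = 30, main = "Mixture of Gaussians", xlab = "WT")
grid_x <- seq(min(wt), max(wt), length.out = 400)
comp <- sapply(1:k, function(j) mix[j] * dnorm(grid_x, mu[j], sigma[j]))
lines(grid_x, comp[, 1], col = "blue", lwd = 2)
lines(grid_x, comp[, 2], col = "darkgreen", lwd = 2)
lines(grid_x, rowSums(comp), col = "red", lwd = 2)

# Log-likelihood over the EM iterations
plot(log_lik, type = "b", xlab = "EM iteration", ylab = "Log-likelihood")
```

EM never decreases the log-likelihood from one iteration to the next, so a trace that rises and then flattens is the expected pattern to comment on.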

Q7*) Research based questions (Real world application – using COVID-19 data)

This is an HD (High Distinction) level question. Students who are targeting an HD grade should answer this question (in addition to all the questions above). For others, this question is optional. This question aims to demonstrate your expertise in the subject area and your ability to do your own research in the related area.

a) Download the following article from the link provided below. Read that article and answer the following questions. This article provides an application of Bayesian modelling on COVID-19 datasets.

i) What parameters are estimated in this work?

ii) Briefly describe the datasets used for their analysis.

iii) Describe the prior and likelihood used for the analysis.

iv) Table-1 in the paper shows the infection fatality rate in Iceland, and the obtained posterior and posterior median values (including 95% credible interval) for each age group. Perform a similar analysis using the COVID-19 data from Australia, available from the link mentioned below. Use the columns “Covid 19 cases” and “Deaths” in this data for this analysis.

Compute the posterior and the posterior median (including 95% credible interval) for the fatality rate for the two states in Australia (NSW, VIC). Use the same prior as used in Table-1 of the paper. Provide your answer in table form (including the values for “Covid 19 cases” and “Deaths”). Use an R program as appropriate to answer this question.

v) From the above analysis using the COVID-19 Australia data, plot the posterior probability densities obtained for the infection fatality rate for the VIC state and NSW state.
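The computations in iv) and v) might be organised as in the sketch below. It assumes a binomial likelihood for deaths given cases with a conjugate Beta prior, which may differ from the paper's choice; the Beta(1, 1) prior, the region names and all counts are invented placeholders, not Australian data.

```r
# Hypothetical prior: Beta(1, 1); replace with the prior used in the paper
prior_a <- 1
prior_b <- 1

# Invented counts for illustration only (not real COVID-19 figures)
counts <- data.frame(region = c("Region A", "Region B"),
                     cases  = c(4000, 20000),
                     deaths = c(50, 800))

# Conjugate update: Beta(prior_a + deaths, prior_b + cases - deaths)
post_a <- prior_a + counts$deaths
post_b <- prior_b + counts$cases - counts$deaths

counts$median <- qbeta(0.5,   post_a, post_b)
counts$lower  <- qbeta(0.025, post_a, post_b)   # 95% credible interval
counts$upper  <- qbeta(0.975, post_a, post_b)
print(counts)

# Posterior densities of the fatality rate for both regions on one plot
rate  <- seq(0, 0.06, length.out = 1000)
dens1 <- dbeta(rate, post_a[1], post_b[1])
dens2 <- dbeta(rate, post_a[2], post_b[2])
plot(rate, dens1, type = "l", col = "blue", lwd = 2,
     ylim = range(c(dens1, dens2)),
     xlab = "Fatality rate", ylab = "Posterior density")
lines(rate, dens2, col = "red", lwd = 2)
legend("topright", legend = counts$region, col = c("blue", "red"), lwd = 2)
```

The equal-tailed interval from `qbeta` is one common choice of 95% credible interval; a highest-density interval would be an alternative.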

b) In the paper above it is mentioned that “In reality, many Bayesian models do not have an analytical solution and thus require specialized software for MCMC sampling.”

**Do your own research (from books, papers/journals/conferences, online sources, etc.), and prepare a brief report covering the following details, written in your own words:**

i) What is MCMC (Markov chain Monte Carlo) sampling?

ii) How can an MCMC sampling-based method be used for Bayesian posterior estimation? Provide an example and briefly explain the process of obtaining the posterior estimate using an MCMC-based technique.

iii) Find a journal or conference paper that applies/uses MCMC sampling-based Bayesian analysis method for COVID-19 related work/research. Describe clearly what problem the paper is solving and how the Bayesian method is used for modelling/solving (need to discuss the technical details here). What software is used for evaluations?

Provide the references for the papers/materials used.

The report covering the above questions (Q7 b) must NOT exceed one and a half (1.5) A4 pages, including references.
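For the MCMC-based posterior estimation discussed in Q7 b) ii), a random-walk Metropolis sampler is one of the simplest examples. The sketch below targets the posterior of a binomial proportion under an assumed Beta(1, 1) prior; the counts, the starting value, the proposal standard deviation and the burn-in length are all invented or arbitrary. The model is chosen because its exact posterior is known, so the sampler can be checked against it.

```r
set.seed(7)
deaths <- 50     # invented counts, for illustration only
cases  <- 4000

# Log of the unnormalised posterior: binomial likelihood times Beta(1, 1) prior
log_post <- function(p) {
  if (p <= 0 || p >= 1) return(-Inf)
  dbinom(deaths, size = cases, prob = p, log = TRUE) +
    dbeta(p, 1, 1, log = TRUE)
}

n_iter   <- 20000
chain    <- numeric(n_iter)
chain[1] <- 0.02                       # arbitrary starting value

for (i in 2:n_iter) {
  # Symmetric random-walk proposal around the current state
  proposal  <- chain[i - 1] + rnorm(1, mean = 0, sd = 0.003)
  log_ratio <- log_post(proposal) - log_post(chain[i - 1])
  # Accept with probability min(1, posterior ratio); otherwise stay put
  if (log(runif(1)) < log_ratio) {
    chain[i] <- proposal
  } else {
    chain[i] <- chain[i - 1]
  }
}

samples <- chain[-(1:2000)]            # discard burn-in draws
mcmc_median  <- median(samples)
mcmc_ci      <- quantile(samples, c(0.025, 0.975))
exact_median <- qbeta(0.5, 1 + deaths, 1 + cases - deaths)

cat(sprintf("MCMC median: %.5f, exact median: %.5f\n",
            mcmc_median, exact_median))
print(mcmc_ci)

# Trace plot: a quick visual check that the chain has settled
plot(samples, type = "l", xlab = "Iteration (after burn-in)", ylab = "p")
```

Only ratios of the posterior enter the acceptance step, which is why the normalising constant never has to be computed; this is the property that makes MCMC usable for models without an analytical solution.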