Subject Code & Title: BAC213 Business Analytics Coding
Part 1 – Data Exploration and Manipulation
Tasks 1-4 focus on tweets about COVID-19, with each row representing a tweet and each column representing a piece of information recorded about the tweet. This data has been taken from Kaggle and can be accessed in the “vaccination_all_tweets.csv” file on iLearn in the Assignment 2 folder.
BAC213 Business Analytics Coding Assignment 2 – Australia.
Task 1
• Report the number of rows and the number of columns in the data set.
• Report the structure of the data and calculate descriptive statistics for all numeric columns. Descriptive statistics include the mean, standard deviation, maximum and minimum.
• In ascending order, print the number of unique values for all columns of type object.
• In descending order, print the proportion of True values for all columns of type bool.
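As a rough illustration of the Task 1 steps, the sketch below uses a small made-up DataFrame in place of the real file. The column names (user_name, source, retweets, user_followers, user_verified, is_retweet) are assumptions about the Kaggle data and should be checked against the actual CSV.

```python
import pandas as pd

# Tiny stand-in for vaccination_all_tweets.csv; column names are assumed.
tweets = pd.DataFrame({
    "user_name": ["ann", "bob", "ann", "cy"],
    "source": ["Twitter Web App", "Twitter for iPhone",
               "Twitter Web App", "Twitter Web App"],
    "retweets": [0, 3, 1, 10],
    "user_followers": [10, 200, 10, 5000],
    "user_verified": [False, False, False, True],
    "is_retweet": [False, False, False, False],
})

# Number of rows and columns.
n_rows, n_cols = tweets.shape
print(f"{n_rows} rows, {n_cols} columns")

# Structure of the data, then selected statistics for numeric columns.
tweets.info()
numeric_stats = tweets.describe().loc[["mean", "std", "min", "max"]]
print(numeric_stats)

# Unique values per object column, smallest count first.
object_unique = tweets.select_dtypes(include="object").nunique().sort_values()
print(object_unique)

# The mean of a bool column is its proportion of True values; largest first.
bool_prop = tweets.select_dtypes(include="bool").mean().sort_values(ascending=False)
print(bool_prop)
```

With the real data, the DataFrame would come from pd.read_csv("vaccination_all_tweets.csv") rather than being typed in by hand.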
Task 2
• Create a new column tweet_length equal to the number of characters in each tweet. Create a histogram showing the distribution of values in this new column using 25 bins with appropriate customisations. Comment on the plot.
• Create a new column retweet_prop equal to the number of retweets divided by the number of followers. Inspect this column. What problem has occurred?
• For tweets where retweet_prop is less than or equal to 1, create a histogram coloured by whether the user is verified. Add reasonable customisations to the plot and interpret it.
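When inspecting a ratio column such as retweet_prop, it helps to look at what pandas does when the denominator is zero. The sketch below is a minimal example on made-up rows, assuming the columns are named text, retweets and user_followers. It only shows the arithmetic behaviour, not the full answer to the task.

```python
import numpy as np
import pandas as pd

# Made-up rows; the column names are assumptions about the real file.
tweets = pd.DataFrame({
    "text": ["Got my jab", "Vaccines work!", "Booked in", ""],
    "retweets": [2, 0, 5, 1],
    "user_followers": [100, 0, 0, 4],
})

# Number of characters in each tweet.
tweets["tweet_length"] = tweets["text"].str.len()

# Element-wise division: pandas does not raise on a zero denominator.
tweets["retweet_prop"] = tweets["retweets"] / tweets["user_followers"]
print(tweets[["retweets", "user_followers", "retweet_prop"]])

# Count the non-finite results produced by zero followers.
n_inf = int(np.isinf(tweets["retweet_prop"]).sum())   # x / 0 with x > 0
n_nan = int(tweets["retweet_prop"].isna().sum())      # 0 / 0
print(n_inf, "infinite values,", n_nan, "missing values")

# Comparisons with NaN or inf are False, so this filter drops those rows too.
valid = tweets[tweets["retweet_prop"] <= 1]
print(len(valid), "rows kept")
```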
Task 3
• Create a new column called account_year reporting the year the account associated with each tweet was created. Create a stacked bar chart showing the number of verified and unverified accounts created each year. Include appropriate customisations.
• The source column takes a lot of different values, but only a few are common. Identify the 7 most common values of the source column, then modify the column to change all other values to be “Other”.
• Create a pie chart showing the breakdown of the categories in the newly modified source variable. Display the percentage of each category on the pie chart. Interpret the plot.
• Create a column that shows how many minutes of the day had passed when each tweet was made (e.g., a tweet at 17:52:03 becomes 1072.05). Create a histogram showing the distribution of the times at which tweets are made, with appropriate customisations.
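The column manipulations in Task 3 can be sketched as follows. This is an illustration on made-up rows: the column names date and user_created are assumptions, and N_KEEP is set to 2 only because the toy data is small (the task asks for 7).

```python
import pandas as pd

# Made-up rows; "date" and "user_created" are assumed column names.
tweets = pd.DataFrame({
    "source": ["Web", "iPhone", "Web", "Android", "Web", "iPhone", "Bot"],
    "date": ["2021-03-01 17:52:03", "2021-03-01 08:00:00",
             "2021-03-02 23:59:30", "2021-03-02 00:00:00",
             "2021-03-03 12:30:00", "2021-03-03 06:15:45",
             "2021-03-04 09:10:00"],
    "user_created": ["2009-05-01 10:00:00", "2015-01-20 11:30:00",
                     "2009-07-04 09:00:00", "2020-02-02 02:02:02",
                     "2011-11-11 11:11:11", "2015-06-30 18:45:00",
                     "2021-01-01 00:00:00"],
})

# Year in which each account was created.
tweets["account_year"] = pd.to_datetime(tweets["user_created"]).dt.year

# Keep the most common sources and relabel everything else as "Other".
N_KEEP = 2  # the task asks for 7; 2 suits this toy data
top_sources = tweets["source"].value_counts().index[:N_KEEP]
tweets["source"] = tweets["source"].where(tweets["source"].isin(top_sources), "Other")

# Minutes since midnight: 17:52:03 -> 17 * 60 + 52 + 3 / 60 = 1072.05.
when = pd.to_datetime(tweets["date"])
tweets["minute_of_day"] = when.dt.hour * 60 + when.dt.minute + when.dt.second / 60

print(tweets[["source", "account_year", "minute_of_day"]])
```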
Task 4
• What is the maximum number of characters associated with a single hashtag? For example, the ‘Pfizer’ hashtag has 6 characters.
• Create four columns corresponding to whether each tweet mentioned Pfizer, Moderna, AstraZeneca, or Sinovac. Create a line plot with four lines corresponding to the number of times Pfizer, Moderna, AstraZeneca, or Sinovac was mentioned. Your x-axis should be the date. You can choose whether you are considering daily tweet counts or weekly tweet counts.
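One possible way to approach Task 4 is sketched below on made-up rows. It assumes the hashtags column stores each list as text (for example "['Pfizer', 'BioNTech']") with missing values for tweets without hashtags, and that a mention means the vaccine name appears in the text column regardless of case. Both assumptions should be checked against the real data.

```python
import ast
import pandas as pd

# Made-up rows; column names and the hashtag storage format are assumptions.
tweets = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01 09:00", "2021-01-01 18:30",
                            "2021-01-02 07:15"]),
    "text": ["Pfizer dose one done", "Moderna or pfizer?",
             "AstraZeneca rollout begins"],
    "hashtags": ["['Pfizer', 'BioNTech']", None, "['AstraZeneca']"],
})

def parse_tags(cell):
    """Turn a stringified list into a real list; missing cells become []."""
    return ast.literal_eval(cell) if isinstance(cell, str) else []

# Longest single hashtag across all tweets.
tag_lists = tweets["hashtags"].apply(parse_tags)
longest = max(len(tag) for tags in tag_lists for tag in tags)
print("Longest hashtag:", longest, "characters")

# One bool column per vaccine, matched case-insensitively as plain text.
vaccines = ["Pfizer", "Moderna", "AstraZeneca", "Sinovac"]
lower_text = tweets["text"].str.lower()
for name in vaccines:
    tweets[name] = lower_text.str.contains(name.lower(), regex=False)

# Daily mention counts: summing bools counts the True values.
daily = tweets.groupby(tweets["date"].dt.date)[vaccines].sum()
print(daily)
```

Calling daily.plot() would then draw one line per vaccine with the date on the x-axis. Weekly counts could be obtained by grouping on a weekly period instead of the calendar date.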
Part 2 – AFL Modelling, Betting and Simulations
Jim is an avid Australian Football League (AFL) fan and a budding data analyst. He has collected information about AFL games and built a statistical model predicting whether the home team will win each match. He has done all the required statistical work of building the model but wants your help with checking whether the model is useful. For the following tasks, you will be using two datasets available on iLearn in the Assignment 2 folder:
• AFL Predictions.csv
• AFL Odds.csv
The following tasks require that you first load the “AFL Predictions.csv” file into a pred_data DataFrame and the “AFL Odds.csv” file into an odds_data DataFrame. It might help in the following tasks to make sure you treat start_dt in the AFL Predictions file as a date when loading it in. Task 1 requires you to prepare the data by combining it with odds data that has been extracted from a website, so that in Task 2 you can evaluate the value of a betting strategy using Jim’s predictive model.
Task 1 – Prepare the Data
1. Using the Pred Prob Win and Home Win columns in the pred_data object, calculate the accuracy of predictions. A win is predicted if Pred Prob Win ≥ 0.5, otherwise a loss is predicted. Round your accuracy to 4 decimal places and print it as a percentage (e.g., 0.81234 becomes 81.23%).
2. Using the Date and Start_Time columns in the odds_data object, create a start_dt column matching the format of the same column in the pred_data object. Print a random sample of 5 rows of the odds_data object with the new column.
3. Report the number of start_dt values in the pred_data object that are also in the start_dt column of the odds_data. Also report the number of unique values of the start_dt column of pred_data.
4. Create a margin column for the odds_data object representing the absolute difference of the home score and away score.
5. Use the start_dt and margin columns in both datasets to add the Home_Odds and Away_Odds columns to the pred_data object (Answer check: your final DataFrame object should have 532 rows).
6. Using the newly added Home_Odds column in the pred_data, create a new Odds_Prob column representing the probability of the home team winning implied by the betting odds. Calculate this column with the following formula:
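The formula is not reproduced in this copy of the brief. The sketch below walks through steps 1 to 6 on made-up rows and assumes the common convention that the implied probability is the reciprocal of the decimal odds (1 / Home_Odds). The score column names (Home_Score, Away_Score) and the date and time formats are also assumptions and may differ in the real files.

```python
import pandas as pd

# Made-up stand-ins for the two CSV files; names and formats are assumptions.
pred_data = pd.DataFrame({
    "start_dt": pd.to_datetime(["2021-03-18 19:25", "2021-03-19 19:50",
                                "2021-03-20 13:45"]),
    "margin": [25, 1, 52],
    "Pred Prob Win": [0.62, 0.48, 0.55],
    "Home Win": [1, 0, 0],
})
odds_data = pd.DataFrame({
    "Date": ["2021-03-18", "2021-03-19", "2021-03-21"],
    "Start_Time": ["19:25", "19:50", "15:20"],
    "Home_Score": [105, 80, 70],   # hypothetical column name
    "Away_Score": [80, 81, 90],    # hypothetical column name
    "Home_Odds": [1.54, 2.10, 1.80],
    "Away_Odds": [2.50, 1.75, 2.02],
})

# Step 1: a win is predicted when the probability is at least 0.5.
predicted_win = pred_data["Pred Prob Win"] >= 0.5
accuracy = (predicted_win == (pred_data["Home Win"] == 1)).mean()
accuracy_text = f"{round(accuracy, 4):.2%}"
print("Accuracy:", accuracy_text)

# Step 2: combine date and time text into a single datetime column.
odds_data["start_dt"] = pd.to_datetime(odds_data["Date"] + " " + odds_data["Start_Time"])

# Step 3: overlap between the two start_dt columns.
n_shared = int(pred_data["start_dt"].isin(odds_data["start_dt"]).sum())
n_unique = pred_data["start_dt"].nunique()
print(n_shared, "shared start times;", n_unique, "unique in pred_data")

# Step 4: absolute score difference.
odds_data["margin"] = (odds_data["Home_Score"] - odds_data["Away_Score"]).abs()

# Step 5: bring the odds across using both keys.
odds_cols = odds_data[["start_dt", "margin", "Home_Odds", "Away_Odds"]]
merged = pred_data.merge(odds_cols, on=["start_dt", "margin"], how="inner")

# Step 6 (assumed formula): implied probability = 1 / decimal odds.
merged["Odds_Prob"] = 1 / merged["Home_Odds"]
print(merged)
```

An inner merge keeps only the games found in both datasets, which is one way the row count could fall to the answer check value. Whether an inner merge is the intended choice is an assumption.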
Task 2 – Simulating a Simple Betting Strategy
In case you were unable to complete Part 2 Task 1, please use the “P2T2_substitute.csv” file on iLearn in the Assignment 2 folder. For this task, you are given more flexibility in how you structure your code to perform a larger task. I would suggest trying to break the task into several sub-tasks for which you create functions.
Jim now has his predictions as well as the odds provided by betting sites for a collection of AFL matches. You’ve reported the accuracy of his model, but it’s time to see whether it could be used in a practical scenario. Jim is asking you to take the data you have created and simulate a simple betting strategy to see if it would be profitable. Your simulation will look at a fictional person’s betting strategy over a sequence of games as follows:
A fictional person starts with capital of $1000. The fictional person will consider a sequence of 100 games. For each game, they will make a bet according to Table 1. Each of the 100 games will be randomly selected from the data produced in Part 2 Task 1 or using the substitute data on iLearn.
Table 1: Betting Rules
The result of a successful bet on the home team is determined by the Home_Odds column. For a bet of $100 and Home_Odds of 1.54, a win returns the original $100 and a bonus $54. The same process is followed for a bet on the away team but using the Away_Odds column.
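The payout arithmetic above can be written as a small helper. This is a minimal sketch: the function name settle_bet is hypothetical, and it assumes that a losing bet simply forfeits the stake, which the paragraph implies but does not state.

```python
def settle_bet(stake, odds, won):
    """Return the change in capital from one bet placed at decimal odds.

    A win pays a bonus of stake * (odds - 1) on top of the returned stake.
    A loss is assumed to forfeit the whole stake.
    """
    return stake * (odds - 1) if won else -stake

# Worked example from the text: $100 at odds of 1.54.
win_change = settle_bet(100, 1.54, won=True)
loss_change = settle_bet(100, 1.54, won=False)
print(round(win_change, 2), loss_change)
```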
• Simulate the above betting strategy over the 100 random games, 10 times. For each simulation, games should be generated again (i.e., don’t use the same 100 games all 10 times).
• The capital at the start of each simulation and after each game in each simulation should be saved. For example, for a single simulation, there should be 101 values saved (starting capital plus the capital after each of the 100 games).
• Report the average capital at the end of the 100 games.
• Create a line plot with one line for each simulation showing the capital across games.
• Based on the above, what do you notice? If you wanted to improve the strategy, what would you want to investigate?
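The overall shape of the simulation might look like the sketch below. Table 1 is not reproduced here, so placeholder_rule is a purely HYPOTHETICAL betting rule used only to make the code run and must be replaced with the rules from Table 1. The game records are made up, and sampling games with replacement is also an assumption.

```python
import random

def settle_bet(stake, odds, won):
    """Change in capital from one bet; a loss is assumed to forfeit the stake."""
    return stake * (odds - 1) if won else -stake

def placeholder_rule(game, capital):
    """HYPOTHETICAL stand-in for Table 1: bet up to $100 on the favoured side."""
    side = "home" if game["Pred Prob Win"] >= 0.5 else "away"
    return side, min(100.0, capital)

def simulate(games, rule, n_games=100, start_capital=1000.0, rng=random):
    """Return the capital path: starting capital plus one value per game."""
    capital = start_capital
    path = [capital]
    for game in rng.choices(games, k=n_games):  # with replacement (assumption)
        side, stake = rule(game, capital)
        home_won = game["Home Win"] == 1
        if side == "home":
            capital += settle_bet(stake, game["Home_Odds"], home_won)
        else:
            capital += settle_bet(stake, game["Away_Odds"], not home_won)
        path.append(capital)
    return path

# Made-up games standing in for the prepared data.
games = [
    {"Pred Prob Win": 0.62, "Home Win": 1, "Home_Odds": 1.54, "Away_Odds": 2.50},
    {"Pred Prob Win": 0.48, "Home Win": 0, "Home_Odds": 2.10, "Away_Odds": 1.75},
    {"Pred Prob Win": 0.55, "Home Win": 0, "Home_Odds": 1.80, "Away_Odds": 2.02},
]

random.seed(213)  # fixed seed so the run can be repeated
paths = [simulate(games, placeholder_rule) for _ in range(10)]
average_final = sum(path[-1] for path in paths) / len(paths)
print(f"Average capital after 100 games: ${average_final:.2f}")
```

Keeping the betting rule as a separate function means the same simulate function can be reused when comparing alternative strategies. Each entry of paths can be passed to a line plot to draw one line per simulation.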
Task 3 – Average Capital or a Martingale
For this task, you may choose between two options. I would suggest Option 1 if you are comfortable with your answer to Part 2 Task 2. Option 2 does not depend on you having already written code.
Option 1
• In many betting applications, there is a minimum bet size. Modify your code from Part 2 Task 2 to continue placing bets until either 50 games have been played or capital is less than $2 (reflecting a minimum bet size of $1).
• Simulate this process 100 times and produce a line plot with a single line showing the average capital across the 50 games. Include appropriate customisations and interpret the plot.
• Report the number of simulations that lost money.
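Because some simulations stop early, their capital paths have different lengths, which complicates the averaging step. The sketch below shows one possible treatment, assuming that a simulation that stops early keeps its last capital value for the remaining games. Averaging only the simulations still active at each game would be an alternative. The function names and toy paths are hypothetical, and n_games is set to 3 to keep the example short.

```python
def pad_path(path, n_games=50):
    """Extend an early-stopped path by repeating its final capital value."""
    return path + [path[-1]] * (n_games + 1 - len(path))

def average_path(paths, n_games=50):
    """Average capital at each game index across all simulations."""
    padded = [pad_path(path, n_games) for path in paths]
    return [sum(values) / len(values) for values in zip(*padded)]

def count_losers(paths):
    """Number of simulations that finished below their starting capital."""
    return sum(path[-1] < path[0] for path in paths)

# Toy paths: the second simulation stopped after two games.
toy_paths = [
    [1000.0, 900.0, 950.0, 1000.0],
    [1000.0, 500.0, 1.5],
]
toy_average = average_path(toy_paths, n_games=3)
print(toy_average)
print(count_losers(toy_paths), "simulation(s) lost money")
```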
Option 2
A well-known betting strategy is the Martingale, in which you double your bet size every time you lose a bet. Consider a person starting with $1275 in capital who starts with a bet of $5 and doubles their previous bet size every time they lose a bet. If the person wins the bet, they receive twice their bet size. For example, if they bet $10 and win, they receive their original $10 and $10 in winnings.
• Write a program simulating the Martingale betting strategy where the probability of winning each bet is 47.4% (a typical Roulette wheel). The person stops betting after their first win or when they have no capital left. The program should return their profit or loss.
• Simulate this process 20 times and report the average profit or loss. Simulate the process 500 times and report the average profit or loss.
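A minimal sketch of the Martingale simulation is shown below. The function names are hypothetical, and the code assumes the person never stakes more than their remaining capital. With the figures given, that cap never actually binds.

```python
import random

def martingale_round(capital=1275.0, first_bet=5.0, p_win=0.474, rng=random):
    """Play until the first win or until capital runs out; return profit or loss."""
    start = capital
    bet = first_bet
    while capital > 0:
        stake = min(bet, capital)   # assumption: cannot stake more than is left
        capital -= stake
        if rng.random() < p_win:
            capital += 2 * stake    # stake returned plus equal winnings
            break
        bet *= 2                    # double the bet after every loss
    return capital - start

def average_result(n_sims, seed=None):
    """Average profit or loss over n_sims independent runs."""
    rng = random.Random(seed)
    results = [martingale_round(rng=rng) for _ in range(n_sims)]
    return sum(results) / n_sims

for n_sims in (20, 500):
    print(n_sims, "simulations:", average_result(n_sims, seed=213))
```

The stakes $5, $10, ..., $640 sum to exactly $1275, so eight consecutive losses exhaust the capital. Every run therefore ends with either a $5 profit or a $1275 loss, which is worth keeping in mind when comparing the 20-run and 500-run averages.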