Assignment Detail: BUS5WB Data Warehousing and Big Data Assignment
The third assignment focuses on Big Data analytics on unstructured text data using Microsoft Azure. You are required to derive insights by applying big data distributed processing and machine learning techniques.
Dataset 1 - Amazon Reviews
The dataset contains ~10,000 reviews of Amazon products. The fields are:
What you are required to do
1. HDInsight to Analyse Reviews
Develop an aggregate of these reviews using your knowledge of Hadoop and MapReduce in Microsoft HDInsight.
a. Follow the same approach as the Big Data Analytics Workshop (using the wordcount method in HDInsight) to determine the contributory words for each level of rating.
b. Present the workflow of using HDInsight (you may use screen captures) along with a summary of findings and any insights for each level of rating. MapReduce documentation for HDInsight is available here.
You may either create your own Hadoop cluster or make use of the one provided to run your analysis. The details of the cluster will be provided on the LMS under the section for Assignment 3.
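The word-count-per-rating idea in part (a) can be sketched in plain Python as a map step followed by a reduce step. This is only an illustration of the logic, not the workshop's HDInsight job: the input shape (a list of `(rating, text)` pairs), the stop-word list and the function names are all assumptions, since the dataset's fields are not listed here.

```python
import re
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny stop-word list; a real analysis needs a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "is", "it", "this", "to", "of", "i", "on"}


def map_review(rating, text):
    """Map step: emit ((rating, word), 1) for every non-stop word in one review."""
    for word in re.findall(r"[a-z']+", text.lower()):
        if word not in STOP_WORDS:
            yield (rating, word), 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts for each (rating, word) key."""
    totals = defaultdict(Counter)
    for (rating, word), count in pairs:
        totals[rating][word] += count
    return totals


def top_words_by_rating(reviews, n=3):
    """Return the n most frequent words for each rating level."""
    pairs = (pair for rating, text in reviews for pair in map_review(rating, text))
    totals = reduce_counts(pairs)
    return {rating: counter.most_common(n) for rating, counter in totals.items()}


# Made-up sample reviews, assumed to be (rating, review text) pairs.
SAMPLE_REVIEWS = [
    (5, "Great product, great price"),
    (5, "Works great"),
    (1, "Broken on arrival, broken box"),
]

if __name__ == "__main__":
    print(top_words_by_rating(SAMPLE_REVIEWS, n=1))
```

Keying the map output on the pair (rating, word) rather than on the word alone is what lets a single word-count pass produce separate counts for each rating level.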
2. Azure Databricks for Big Data Processing
Use the period of data allocated to you (it will be a single year) from the New York City Taxi & Limousine Commission dataset on Azure Databricks to answer the questions below:
a. Plot a visual to show, by month, the total fare amount generated by taxi trips with 4 or fewer passengers that were paid for by credit card. (You will have 12 records.)
b. Plot a visual to show the average cost per mile of a taxi ride in each month of the year assigned to you, for rides that travelled more than 5 miles but less than 20 miles, grouped by whether the trip was to the airport. (You will have 24 records.)
c. Plot a visual to show, by day of the week, the average number of taxi trips with a single passenger. (You will have 7 records.)
d. What are the top 10 most profitable routes (in terms of source and destination) for a taxi? (You will have 10 records.)
For each of the questions above provide:
• A screenshot of the visual
• A table of the values
• The code that you used to generate it
You will make use of the Azure Databricks cluster which is allocated to you. The details of the cluster will be provided on the LMS under the section Assignment 3. The year allocated to you for analysis will also be shared with you on the LMS.
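The questions above all follow the same filter, group, aggregate pattern. As a rough illustration, question (b) can be written in plain Python as below. This is not Databricks code (on the cluster the same steps would normally be expressed as a Spark DataFrame filter, groupBy and aggregate), and the record keys `month`, `distance`, `fare` and `airport` are hypothetical names rather than the dataset's actual columns. The sketch also assumes "average cost per mile" means the mean of each trip's fare divided by its distance; total fare divided by total miles is another reasonable reading.

```python
from collections import defaultdict


def avg_cost_per_mile(trips, min_miles=5.0, max_miles=20.0):
    """Average fare per mile, grouped by (month, is_airport_trip).

    Each trip is a dict with hypothetical keys: 'month' (1-12),
    'distance' (miles), 'fare' (dollars) and 'airport' (bool).
    Only trips with min_miles < distance < max_miles are included.
    """
    sums = defaultdict(lambda: [0.0, 0])  # key -> [sum of per-trip cost per mile, trip count]
    for trip in trips:
        if min_miles < trip["distance"] < max_miles:
            key = (trip["month"], trip["airport"])
            sums[key][0] += trip["fare"] / trip["distance"]
            sums[key][1] += 1
    return {key: total / count for key, (total, count) in sorted(sums.items())}


# Made-up sample trips for illustration only.
SAMPLE_TRIPS = [
    {"month": 1, "distance": 10.0, "fare": 30.0, "airport": True},
    {"month": 1, "distance": 8.0, "fare": 32.0, "airport": True},
    {"month": 1, "distance": 6.0, "fare": 12.0, "airport": False},
    {"month": 1, "distance": 2.0, "fare": 9.0, "airport": False},   # too short, excluded
    {"month": 2, "distance": 25.0, "fare": 70.0, "airport": True},  # too long, excluded
]

if __name__ == "__main__":
    for (month, airport), value in avg_cost_per_mile(SAMPLE_TRIPS).items():
        print(month, airport, round(value, 2))
```

With a full year of data, grouping on the pair (month, airport flag) yields 12 x 2 = 24 groups, which matches the 24 records expected for question (b).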
3. Azure Machine Learning for Prediction
Based on the year assigned to you in the New York City Taxi Dataset (as given in question 2 above), use Azure ML Studio to build a model that predicts the total ride duration of taxi trips in New York City.
Provide the following:
a. A screen capture of the completed model diagram and any decisions you made in training the model. For example, the rationale for some of the components used, and how many records were used for training and how many for testing.
b. A set of metrics which presents how effective your model is.
c. Which features were most influential in driving your model?
d. Using your model, predict the total trip duration for the trips given below.
You will make use of the Azure Machine Learning Studio that has been allocated to you. Information regarding accessing the application can be found in the LMS under the section Assignment 3.
The datasets which are required for training and testing are available in Azure Machine Learning Studio; further information has been provided in the LMS under the section Assignment 3.
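For item (b), the following is a minimal sketch of three metrics commonly reported for a regression model such as a trip-duration predictor: mean absolute error, root mean squared error and the coefficient of determination. It is a plain-Python illustration of the definitions, not output from Azure ML Studio; the function name and the sample durations (in seconds) are made up.

```python
import math


def regression_metrics(actual, predicted):
    """Return MAE, RMSE and R^2 for paired actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and equal in length")
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all actual values are identical")
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "r2": r2}


# Made-up trip durations in seconds, for illustration only.
SAMPLE_ACTUAL = [600, 900, 1200, 1500]
SAMPLE_PREDICTED = [660, 840, 1260, 1440]

if __name__ == "__main__":
    print(regression_metrics(SAMPLE_ACTUAL, SAMPLE_PREDICTED))
```

MAE and RMSE are expressed in the same unit as the target (here seconds), which makes them easy to interpret alongside the predicted durations, while R^2 indicates how much of the variation in duration the model explains.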