Problem solving task - Using aggregation functions for data analysis
This task assesses the following Unit Learning Outcomes (ULO) and related Graduate Learning Outcomes (GLO):
ULO1 - assessed through the student's ability to apply knowledge of multivariate functions, data transformations and data distributions to summarise data sets.
ULO2 - assessed through the student's ability to analyse datasets by interpreting summary statistics, model and function parameters.
ULO4 - assessed through the student's ability to develop software code to solve computational problems for real-world analytics.
This assignment tests your knowledge and understanding of aggregation functions and their application to data summarisation and prediction. It also tests your R programming ability, including the use of specific R commands and R packages.
The work is individual. Solutions and answers must be explained concisely and presented carefully. Use of books, articles and/or online resources related to SIT718 Real World Analytics is allowed. Students are expected to refer to suitable literature where appropriate.
Forest Fires Data Set
To predict the burned area of forest fires ("UCI Machine Learning Repository: Forest Fires Data Set", 2017) in the northeast region of Portugal ("Montesinho.Com - Nature Tourism In Montesinho Natural Park", 2017), analysis of meteorological and other data is required (see details at "Forest Fires Dataset", 2017). Also consider the information given at http://cwfis.cfs.nrcan.gc.ca/background/summary/fwi. For this assignment you are provided with a modified dataset, "Forest718.txt".
X1: x-axis spatial coordinate within the Montesinho park map: 1 to 9 ("Montesinho.Com - Nature Tourism In Montesinho Natural Park", 2017)
X2: y-axis spatial coordinate within the Montesinho park map: 2 to 9 ("Montesinho.Com - Nature Tourism In Montesinho Natural Park", 2017)
X3: month - month of the year: 'jan=1' to 'dec=12'
X4: day - day of the week: 'mon=1' to 'sun=7'
X5: FFMC - FFMC index from the FWI system: 18.7 to 96.20 (Happe, 2017)
X6: DMC - DMC index from the FWI system: 1.1 to 291.3 (Happe, 2017)
X7: DC - DC index from the FWI system: 7.9 to 860.6 (Happe, 2017)
X8: ISI - ISI index from the FWI system: 0.0 to 56.10 (Happe, 2017)
X9: temp - temperature in Celsius degrees: 2.2 to 33.30
X10: RH - relative humidity in %: 15.0 to 100
X11: wind - wind speed in km/h: 0.40 to 9.40
X12: rain - outside rain in mm/m2: 0.0 to 6.4
X13=Y: area - the burned area of the forest (in ha): 0.00 to 1090.84
1. Understand the data
(i) Download the txt file (Forest718.txt) from FutureLearn and save it to your R working directory.
(ii) Assign the data to a matrix, e.g. using the.data <- as.matrix(read.table("Forest718.txt"))
Your variable of interest is X13=Y: area - the burned area of the forest (in ha): 0.00 to 1090.84 (the thirteenth column in the dataset). Generate a random subset of 200 rows, e.g. using:
my.data <- the.data[sample(1:517,200),c(1:13)]
(iii) Choose any FOUR variables from X5 to X11. Using scatter plots and histograms, report on the general relationship between each chosen variable and your variable of interest Y. Include 4 scatter plots, 5 histograms (one per chosen variable plus one for Y) and 1 or 2 sentences for each variable.
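The plotting in step (iii) could be sketched as follows. This is only an illustration: the helper name plot_relationships, the seed value, and the choice of X5, X6, X9 and X10 are assumptions made for the example, and the code assumes Forest718.txt has no header row.

```r
# Hypothetical helper: draws one scatter plot of Y against each chosen
# variable, then one histogram per chosen variable and one for Y.
plot_relationships <- function(d, chosen, y.col = 13) {
  old.par <- par(mfrow = c(2, 2))   # four panels per page
  on.exit(par(old.par))             # restore the previous layout on exit
  for (j in chosen) {
    plot(d[, j], d[, y.col],
         xlab = paste0("X", j), ylab = "Y (area, ha)",
         main = paste0("Y against X", j))
  }
  for (j in c(chosen, y.col)) {
    hist(d[, j], xlab = paste0("X", j), main = paste0("Histogram of X", j))
  }
  invisible(2 * length(chosen) + 1)  # number of plots drawn
}

# Only runs when the assignment data file is in the working directory.
if (file.exists("Forest718.txt")) {
  the.data <- as.matrix(read.table("Forest718.txt"))
  set.seed(718)  # assumption: a fixed seed makes the random subset reproducible
  my.data <- the.data[sample(1:517, 200), c(1:13)]
  plot_relationships(my.data, chosen = c(5, 6, 9, 10))  # illustrative choice
}
```

Fixing the seed is optional, but without it every run draws a different subset of 200 rows, so the plots and any fitted weights would change between runs.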
2. Transform the data
(i) For the chosen four variables and the variable of interest Y, make appropriate transformations so that the values can be aggregated in order to predict the variable of interest (the area).
(ii) Assign your transformed data along with your transformed variable of interest X13=Y to an array (it should be 200 rows and 5 columns). Save it to a txt file titled "name-transformed.txt".
(iii) Briefly explain the general relationship between each of your transformed variables and your variable of interest (the area). (2-3 sentences each)
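One possible set of transformations is sketched below, assuming min-max scaling to [0, 1] for the inputs, a standard negation (1 - x) for any input that appears inversely related to Y, and a log(y + 1) transform for the skewed area variable. The helper names minmax and transform_data, and the choice of which column to negate, are hypothetical; the appropriate transformations depend on what your own plots show.

```r
# Linear feature scaling of a numeric vector to the unit interval [0, 1].
minmax <- function(x) (x - min(x)) / (max(x) - min(x))

# Hypothetical helper: scales the chosen columns, negates those listed in
# `negate`, and puts the log-transformed, scaled Y in the last column.
transform_data <- function(d, chosen, y.col = 13, negate = integer(0)) {
  out <- matrix(0, nrow = nrow(d), ncol = length(chosen) + 1)
  for (k in seq_along(chosen)) {
    j <- chosen[k]
    out[, k] <- minmax(d[, j])
    if (j %in% negate) out[, k] <- 1 - out[, k]  # standard negation
  }
  # log(y + 1) reduces the skew of the area and is defined at y = 0.
  out[, length(chosen) + 1] <- minmax(log(d[, y.col] + 1))
  out
}

# Only runs when the 200-row subset my.data already exists in the workspace.
if (exists("my.data")) {
  data.transformed <- transform_data(my.data, chosen = c(5, 6, 9, 10),
                                     negate = 10)  # illustrative choice
  write.table(data.transformed, "name-transformed.txt")
}
```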
3. Build models and investigate the importance of each variable
(i) Download the AggWaFit.R file (from CloudDeakin) to your working directory and load it into the R workspace, e.g. using source("AggWaFit.R")
(ii) Use the fitting functions to learn the parameters for:
• A weighted arithmetic mean,
• Weighted power means with p = 0.5 and p = 2,
• An ordered weighted averaging function, and
• A Choquet integral. [10 marks]
(iii) Include two tables in your report - one on the error measures, and one summarising the weights/parameters that were learned for your data.
(iv) Compare and interpret the data in your tables. Be sure to comment on:
a. How good the model is.
b. The importance of each of the variables (the four variables that you have selected).
c. Any interaction between those variables (are they complementary or redundant?).
d. Whether the better models favour higher or lower inputs. (1-3 paragraphs)
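To make the models and error measures in this section concrete, the sketch below gives textbook definitions in base R. These are not the fitting routines from AggWaFit.R, which learn the weights from your data; they only show how each function combines inputs once weights are known. All function names are hypothetical, and the fuzzy-measure indexing used for the Choquet integral is one convention chosen for the example.

```r
# Weighted arithmetic mean: sum of w[i] * x[i], with w >= 0 and sum(w) == 1.
wam <- function(x, w) sum(w * x)

# Weighted power mean with exponent p (p != 0), e.g. p = 0.5 or p = 2.
wpm <- function(x, w, p) sum(w * x^p)^(1 / p)

# Ordered weighted averaging: weights attach to positions after sorting.
# Assumed convention: inputs sorted in increasing order, so weight placed
# on later positions favours higher inputs. Some texts sort decreasingly.
owa <- function(x, w) sum(w * sort(x))

# Discrete Choquet integral with respect to a fuzzy measure v.
# Assumed indexing: v has length 2^n - 1 and v[k] is the measure of the
# subset whose binary code is k (input i contributes 2^(i - 1)).
choquet <- function(x, v) {
  n <- length(x)
  ord <- order(x)                 # input indices, smallest value first
  total <- 0
  prev <- 0
  for (i in seq_len(n)) {
    remaining <- ord[i:n]         # inputs with value >= x[ord[i]]
    idx <- sum(2^(remaining - 1))
    total <- total + (x[ord[i]] - prev) * v[idx]
    prev <- x[ord[i]]
  }
  total
}

# Error measures between observed and predicted outputs.
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
av.abs.error <- function(obs, pred) mean(abs(obs - pred))

# Small example with two inputs.
x <- c(0.2, 0.8)
wam(x, c(0.4, 0.6))
choquet(x, c(0.4, 0.6, 1))  # additive measure: agrees with the line above
```

With an additive fuzzy measure the Choquet integral reduces to the weighted arithmetic mean, which is why the last two calls agree. For a pair of inputs, a measure of the pair larger than the sum of the two singleton measures is usually read as a complementary interaction, and a smaller one as redundancy.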
4. Use your model for prediction
(i) Using your best fitting model, predict the area for the following input: X5=91.6; X6=181.3; X7=613; X8=7.6; X9=24.6; X10=44; X11=4; X12=0.
(ii) Give your result and comment on whether you think it is reasonable. (1-2 sentences)
(iii) Comment generally on the ideal conditions (in terms of your chosen four variables) under which an area will result. (1-2 sentences)
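A prediction for a new input has to pass through the same transformations as the training data, and the model output has to be mapped back to hectares. The sketch below assumes min-max scaling of the inputs, a scaled log(y + 1) transform of the area, and a weighted arithmetic mean as the model. The helper names and the equal weights are placeholders, not fitted values; substitute your own transformations, best-fitting model and learned parameters.

```r
# Scale a new value using the minimum and maximum of the training column.
scale_new <- function(x.new, x.train) {
  (x.new - min(x.train)) / (max(x.train) - min(x.train))
}

# Invert y.t = minmax(log(y + 1)) to recover an area in hectares.
unscale_y <- function(y.t, y.train) {
  ly <- log(y.train + 1)
  exp(y.t * (max(ly) - min(ly)) + min(ly)) - 1
}

# Hypothetical helper: weighted arithmetic mean prediction for one input.
predict_area <- function(x.new, X.train, y.train, w) {
  x.t <- sapply(seq_along(x.new),
                function(k) scale_new(x.new[k], X.train[, k]))
  unscale_y(sum(w * x.t), y.train)
}

# Only runs when the 200-row subset my.data already exists in the workspace.
if (exists("my.data")) {
  chosen <- c(5, 6, 9, 10)             # illustrative choice of variables
  x.new <- c(91.6, 181.3, 24.6, 44)    # X5, X6, X9, X10 from the given input
  w <- rep(0.25, 4)                    # placeholder weights, NOT fitted values
  predict_area(x.new, my.data[, chosen], my.data[, 13], w)
}
```

If a new value lies outside the range seen in the training subset, its scaled value falls outside [0, 1], which is worth checking before interpreting the prediction.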
Your final submission, which should be submitted to the SIT718 CloudDeakin Dropbox, should include the following three files. Please follow the instructions below and do not compress your files.
1. A "name-report.pdf" report (created in any word processor), covering all of the items above (items coloured blue usually have explicit instructions about what should be included). With plots and tables it should only be 3 - 5 pages.
2. A data file named "name-transformed.txt" (where 'name' is replaced with your name - you can use your surname or first name - just to help me distinguish them!).
3. The R code file (that you have written to produce your results) named "name-code.R" (where 'name' is replaced with your name - you can use your surname or first name).