Big Data Technologies Assignment - Data Lake Architecture
In this assignment you will explore the management of big data using Data Lake technology. This Assessment Task relates to the following Learning Outcomes:
- Obtain a high level of technical competency in standard and advanced methods for big data technologies.
- Understand the current status of and recognize future trends in big data technologies.
- Develop a competency with emerging big data technologies, applications and tools.
Part 1 - Data Lake Components
In the lecture, you were introduced to the high-level concepts of why a Data Lake is needed and what it is. The goal of this assignment is to take a deep dive into Data Lake architecture and propose a design pattern for organizing a collection of datasets that holds a vast amount of data gathered from various private and open data islands. Your design should specify the following components in some detail:
Data Ingestion Component:
a. You need to research and identify the different types of data (from structured to unstructured) and the modes of data ingestion (e.g., batch, micro-batch, real-time), and briefly explain them.
b. Identify the existing Big Data Technologies and Tools for ingesting big data, e.g., Hortonworks DataFlow.
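As a starting point for item (a), the three ingestion modes can be contrasted in a few lines. This is only a minimal sketch: the function names and the in-memory `events` list are hypothetical stand-ins for a real source and a real ingestion tool, and nothing here reflects how any particular product works.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batch_ingest(records: Iterable[dict]) -> List[dict]:
    """Batch: read the whole source, then load it in one pass."""
    return list(records)


def micro_batch_ingest(records: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Micro-batch: load small fixed-size groups as they fill up."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def stream_ingest(records: Iterable[dict]) -> Iterator[dict]:
    """Real-time (streaming): hand over each record as soon as it arrives."""
    for record in records:
        yield record


# Hypothetical source: five small event records.
events = [{"id": i} for i in range(5)]
batches = list(micro_batch_ingest(events, size=2))
```

The same records reach the lake in all three cases. What differs is how long a record waits before it becomes available, which is the trade-off your explanation should address.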
Data Organization Component:
a. You need to research and compare various techniques for organizing data, e.g., Directory Structure, Version Control and Database Management Systems.
b. Identify the existing Database Management Systems for each category, e.g., MySQL for relational DBs and MongoDB for NoSQL document-oriented DBs.
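To make the relational versus document-oriented distinction concrete, the sketch below stores the same reading both ways. It is an illustration under stated assumptions: Python's built-in `sqlite3` stands in for a relational DBMS such as MySQL, a plain dictionary of JSON strings stands in for a document store such as MongoDB, and the `sensor` table and its fields are made up for the example.

```python
import json
import sqlite3

# Relational: the schema is fixed up front and every row must fit it.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sensor (id INTEGER PRIMARY KEY, site TEXT NOT NULL, reading REAL NOT NULL)"
)
conn.execute(
    "INSERT INTO sensor (id, site, reading) VALUES (?, ?, ?)", (1, "north", 21.5)
)
row = conn.execute("SELECT site, reading FROM sensor WHERE id = ?", (1,)).fetchone()

# Document-oriented: each record is a self-describing document, so an extra
# field such as "tags" needs no schema change.
document_store = {}
document_store[1] = json.dumps({"site": "north", "reading": 21.5, "tags": ["calibrated"]})
doc = json.loads(document_store[1])
```

Adding the `tags` field to the relational version would require altering the table, whereas the document version accepts it directly. That difference is one axis along which the techniques in item (a) can be compared.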
Data Security and Governance Component:
a. You need to research and identify the requirements for governing data access rights and the rights to define and modify data.
b. Identify the existing trust, security, and privacy issues in Big Data.
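One common way to express the access, definition, and modification rights in item (a) is a role-to-permission mapping. The sketch below is a deliberately small illustration: the role names, the three actions, and the `is_allowed` helper are all hypothetical, and a real governance layer would also need authentication, auditing, and per-dataset policies.

```python
from typing import Dict, Set

# Hypothetical roles and the actions each one may perform on a dataset.
ROLE_PERMISSIONS: Dict[str, Set[str]] = {
    "analyst": {"read"},
    "steward": {"read", "define"},
    "owner": {"read", "define", "modify"},
}


def is_allowed(role: str, action: str) -> bool:
    """Return True only if the role is known and grants the action."""
    return action in ROLE_PERMISSIONS.get(role, set())
```

Unknown roles receive no permissions at all, which follows the deny-by-default principle.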
Indexing and Search Component:
a. You need to research the topic of "Federated Search" and identify technologies that facilitate the simultaneous search of multiple searchable resources.
b. Identify the existing Big Data Technologies and Tools for indexing and searching big data, e.g., Elasticsearch, as well as relevant research outcomes.
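The idea behind federated search in item (a) is to send one query to several independent sources at the same time and merge the answers. The sketch below shows that pattern only. The three in-memory sources and the substring matching are assumptions made for illustration, not the behaviour of Elasticsearch or any other product.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

# Hypothetical searchable resources, each holding a few document titles.
SOURCES: Dict[str, List[str]] = {
    "catalog": ["data lake overview", "graph search primer"],
    "wiki": ["data ingestion notes", "security policy"],
    "papers": ["incremental graph matching", "data lake service"],
}


def search_source(name: str, term: str) -> List[Tuple[str, str]]:
    """Search a single source and tag each hit with the source name."""
    return [(name, doc) for doc in SOURCES[name] if term in doc]


def federated_search(term: str) -> List[Tuple[str, str]]:
    """Query every source concurrently, then merge and sort the hits."""
    with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
        per_source = list(pool.map(lambda name: search_source(name, term), SOURCES))
    merged = [hit for hits in per_source for hit in hits]
    return sorted(merged)
```

Calling `federated_search("graph")` returns hits from both the catalog and the papers source in a single list. A real system would also need result ranking and handling for slow or failing sources.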
Data Analytics Component:
a. You need to research and compare the techniques for analysing the data (from structured to unstructured) and extracting insight from them.
b. Identify the existing Big Data Technologies and Tools for analysing big data, e.g., SAS Tools (such as SAS Text Analytics), Microsoft ML platform, Amazon ML Platform, and Apache Mahout.
Data Visualization Component:
a. You need to research and identify the techniques for visualizing the data.
b. Identify the existing Big Data Technologies and Tools for visualizing big data, e.g., SAS Visual Analytics. Other examples include D3.js and vis.js.
Part 2 - Data Lake Architecture
Design Patterns are formalized best practices that one can use to solve common problems when designing a system. Refer to the Data Lake components in Part 1, and propose a Data Lake architecture for the problem of graph search in big graph databases. Read the following papers to gain an understanding of a typical Data Lake architecture and of graph-based search:
1. A. Beheshti, B. Benatallah, R. Nouri, V. Chhieng, H. Xiong, and X. Zhao, "CoreDB: a Data Lake Service," Conference on Information and Knowledge Management (CIKM), 2017.
2. G. Sun, G. Liu, Y. Wang, M. A. Orgun, and X. Zhou, "Incremental Graph Pattern based Node Matching," IEEE International Conference on Data Engineering (ICDE), 2018.
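To get a feel for the graph search problem before reading the second paper, the sketch below finds the data-graph nodes that can play a chosen role in a small labelled pattern. It is a naive backtracking search written for illustration, not the incremental algorithm from the paper. The graph, the labels (`PM`, `SE`, `TE`), and the function name `match_nodes` are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical data graph: node -> label, plus directed edges.
labels: Dict[str, str] = {"a": "PM", "b": "SE", "c": "SE", "d": "PM", "e": "TE"}
edges: Set[Tuple[str, str]] = {("a", "b"), ("b", "e"), ("d", "c")}

# Hypothetical pattern: a PM node pointing to an SE node pointing to a TE node.
p_labels: Dict[str, str] = {"p": "PM", "s": "SE", "t": "TE"}
p_edges: List[Tuple[str, str]] = [("p", "s"), ("s", "t")]


def match_nodes(target: str) -> Set[str]:
    """Return the data nodes that the pattern node `target` maps to in some full match."""
    order = list(p_labels)
    results: Set[str] = set()

    def extend(assignment: Dict[str, str]) -> None:
        if len(assignment) == len(order):
            results.add(assignment[target])
            return
        u = order[len(assignment)]
        for v, label in labels.items():
            # The candidate must carry the right label and not be used already.
            if label != p_labels[u] or v in assignment.values():
                continue
            candidate = {**assignment, u: v}
            # Every pattern edge whose endpoints are both assigned must exist in the data graph.
            if all(
                (candidate[x], candidate[y]) in edges
                for x, y in p_edges
                if x in candidate and y in candidate
            ):
                extend(candidate)

    extend({})
    return results
```

Here `match_nodes("p")` yields only node `a`, because `d` reaches an `SE` node that has no `TE` successor. Exhaustive search like this does not scale to big graphs and recomputes everything after each update, which is the motivation for incremental approaches.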