Lesson 1: Big Data Hadoop Stack

Hadoop Basics

Open-source framework for storage and processing on clusters of commodity hardware

Why is it scalable, and why is it useful?

Batch processing framework

Move computation to data

Higher level abstraction: Pig Latin, Hive (SQL)

Core modules: Hadoop Common (shared libraries and utilities), HDFS, Hadoop MapReduce, YARN
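
To make the MapReduce module a little more concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps for a word count. It is only an illustration of the programming model, not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this note, and a real job would run the phases in parallel across the cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # Run every line through the mapper, group by word, then reduce each group.
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data hadoop", "hadoop stack", "big big data"])
```

Here counts maps each word to its total, e.g. counts["big"] is 3.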

HDFS: distributed, scalable, and portable file system written in Java to support the Hadoop framework.

Notice the naming convention: the upper level is the MapReduce (MR) layer (JobTracker, TaskTracker); the lower level is the HDFS layer (NameNode, DataNode).

Typically a single NameNode but a cluster of DataNodes.
The Secondary NameNode does NOT take over when the primary NameNode is down. It takes periodic snapshots (checkpoints) of the NameNode's metadata.
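
A rough sketch of the split between one name node (metadata only) and many data nodes (the actual bytes) might look like the following. Everything here is a toy assumption: the class ToyNameNode, the tiny BLOCK_SIZE and REPLICATION values, and the round-robin placement are invented for illustration, and real HDFS uses much larger blocks and a rack-aware placement policy.

```python
BLOCK_SIZE = 4    # bytes; toy value, real HDFS blocks are far larger
REPLICATION = 2   # toy replication factor


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut a file's bytes into fixed-size blocks (the last one may be short)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class ToyNameNode:
    """Holds only metadata: which data nodes store each block of each file."""

    def __init__(self, datanode_ids):
        self.datanode_ids = list(datanode_ids)
        self.block_map = {}   # filename -> [(block_index, [datanode ids])]
        self._next = 0        # round-robin cursor (assumed placement policy)

    def allocate(self, filename, num_blocks, replication=REPLICATION):
        n = len(self.datanode_ids)
        placements = []
        for block_index in range(num_blocks):
            # Pick `replication` consecutive data nodes, starting at the cursor.
            replicas = [self.datanode_ids[(self._next + r) % n]
                        for r in range(replication)]
            self._next = (self._next + 1) % n
            placements.append((block_index, replicas))
        self.block_map[filename] = placements
        return placements


# Data nodes store block contents; the name node never sees the bytes.
datanodes = {"dn1": {}, "dn2": {}, "dn3": {}}
namenode = ToyNameNode(datanodes)

blocks = split_into_blocks(b"hello hadoop")
for block_index, replicas in namenode.allocate("notes.txt", len(blocks)):
    for dn in replicas:
        datanodes[dn][("notes.txt", block_index)] = blocks[block_index]

# Reading back: ask the name node where blocks live, fetch from a data node.
restored = b"".join(datanodes[replicas[0]][("notes.txt", i)]
                    for i, replicas in namenode.block_map["notes.txt"])
```

Because the block map lives only in the name node, losing it means losing track of every file, which is why snapshots of that metadata matter.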

JobTracker: the client submits a request to the JobTracker; the JobTracker pushes work out to the TaskTrackers, keeping the load as balanced as possible.
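
One simple way to picture "keeping the load balanced" is a least-loaded assignment, sketched below with a heap. This is an assumed toy policy, not the actual scheduler: the function assign_tasks and the tracker names are hypothetical, and a real job tracker also weighs things like data locality and available slots.

```python
import heapq


def assign_tasks(tasks, tracker_ids):
    """Give each task to whichever task tracker currently has the fewest tasks."""
    # Heap entries are (current_load, tracker_id); the smallest load pops first.
    heap = [(0, tracker_id) for tracker_id in tracker_ids]
    heapq.heapify(heap)
    assignment = {tracker_id: [] for tracker_id in tracker_ids}
    for task in tasks:
        load, tracker_id = heapq.heappop(heap)
        assignment[tracker_id].append(task)
        heapq.heappush(heap, (load + 1, tracker_id))
    return assignment


assignment = assign_tasks([f"map-{i}" for i in range(7)], ["tt1", "tt2", "tt3"])
```

With seven tasks and three trackers, no tracker ends up with more than one task above any other.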

MRv2: MapReduce version 2
Provides a processing model that goes beyond the map-reduce processing model.

Splits up the JobTracker's duties (resource management, job scheduling, job monitoring) into two:

a global ResourceManager and a per-application ApplicationMaster
YARN focuses on scheduling and is compatible with MapReduce

Can now support additional engines such as graph processing, iterative modeling, etc.
This makes it well suited to machine learning (ML)

Higher-level Hadoop stack
Have data, want to access it using a SQL-like language
ZooKeeper plays a role like Google's Chubby

Sqoop (SQL to Hadoop): can load an entire table into HDFS

HBase: key component; supports data that needs fast, random access on top of HDFS. Based on Google BigTable; key-value storage, not a relational database
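
The difference between a key-value store with random access and a relational table can be sketched with a tiny in-memory model that keeps row keys sorted, so both point lookups and range scans are cheap. This is only a loose analogy under stated assumptions: ToySortedStore, its methods, and the "family:qualifier" column names are invented here and are not the API of any real store.

```python
import bisect


class ToySortedStore:
    """Toy key-value store: rows are looked up by key, and keys stay sorted."""

    def __init__(self):
        self._keys = []   # sorted list of row keys
        self._rows = {}   # row key -> {column: value}

    def put(self, row_key, column, value):
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)   # keep keys in sorted order
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Random access: fetch one row directly by its key."""
        return self._rows.get(row_key)

    def scan(self, start_key, stop_key):
        """Range scan over keys in [start_key, stop_key)."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]


store = ToySortedStore()
store.put("user#002", "info:name", "Bob")
store.put("user#001", "info:name", "Ann")
store.put("user#010", "info:name", "Cy")

row = store.get("user#001")
first_two = store.scan("user#001", "user#010")
```

Note there is no schema and no join: each row is just a key with whatever columns were written to it.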

Pig: scripting language that describes data analysis problems as data flows. Pig can be used with JRuby, Jython, and Java

Pig is used for ETL and supports UDFs (user-defined functions)
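
The "data flow" idea (load, filter, group, aggregate, each step feeding the next) can be mimicked in plain Python as below. This is not Pig Latin, just an analogy: the step functions and the sample records are made up, and filter_ok stands in for where a user-defined function would plug into the flow.

```python
from collections import defaultdict

# Hypothetical sample records standing in for a log file.
records = [
    {"user": "ann", "status": 200, "bytes": 120},
    {"user": "bob", "status": 500, "bytes": 0},
    {"user": "ann", "status": 200, "bytes": 80},
]


def load(rows):
    """Load step: stream records one at a time."""
    yield from rows


def filter_ok(rows):
    """Filter step: keep only successful requests (plays the role of a UDF)."""
    return (r for r in rows if r["status"] == 200)


def group_by_user(rows):
    """Group step: collect records under their user key."""
    groups = defaultdict(list)
    for r in rows:
        groups[r["user"]].append(r)
    return groups


def total_bytes(groups):
    """Aggregate step: sum bytes per user."""
    return {user: sum(r["bytes"] for r in rs) for user, rs in groups.items()}


result = total_bytes(group_by_user(filter_ok(load(records))))
```

Each function consumes the previous step's output, which is the shape an ETL script takes when written as a flow rather than as one query.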

Apache Hive: data warehouse system with a SQL-like language, HiveQL, that facilitates querying. Compiles to MapReduce jobs
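
To see why a SQL-like query can compile down to MapReduce, compare a GROUP BY with a hand-written mapper and reducer that produce the same answer. The sketch uses Python's built-in sqlite3 purely as a stand-in SQL engine (it is not Hive), and the emp table, mapper, and reducer are hypothetical names for this note.

```python
import sqlite3
from collections import defaultdict

rows = [("alice", "eng"), ("bob", "eng"), ("carol", "ops")]

# SQL side: count employees per department.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, dept TEXT)")
conn.executemany("INSERT INTO emp VALUES (?, ?)", rows)
sql_result = dict(conn.execute("SELECT dept, COUNT(*) FROM emp GROUP BY dept"))
conn.close()


# MapReduce side: the GROUP BY column becomes the key, COUNT(*) becomes a sum.
def mapper(row):
    name, dept = row
    return dept, 1


def reducer(dept, ones):
    return dept, sum(ones)


grouped = defaultdict(list)
for key, value in map(mapper, rows):   # the shuffle groups values by key
    grouped[key].append(value)
mr_result = dict(reducer(k, v) for k, v in grouped.items())
```

Both sql_result and mr_result hold the same per-department counts, which is the correspondence a query compiler exploits.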

Oozie: workflow scheduling system; workflows are DAGs of jobs; job scheduling

Flume: collecting, aggregating, and moving log data
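
A workflow expressed as a DAG means each job lists the jobs it depends on, and the scheduler runs them in an order that respects those edges. A minimal sketch with Python's standard graphlib module (Python 3.9+) follows; the job names and the run_order helper are hypothetical, and a real workflow scheduler adds triggers, retries, and so on.

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs that must finish before it starts.
workflow = {
    "import_table": set(),
    "clean_logs": set(),
    "join": {"import_table", "clean_logs"},
    "report": {"join"},
}


def run_order(dag):
    """Return one valid execution order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(dag).static_order())


order = run_order(workflow)
```

Any order returned places "join" after both of its inputs and "report" last; the two independent jobs at the start could also run in parallel.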

parallel database technology

Spark: in-memory processing, multi-stage, good for complex analysis (very good for machine learning)

Works on HDFS directly ~ Hive (compiles to MapReduce)