Lesson 1: Big Data Hadoop Stack

Hadoop Basics

Open-source framework for storage and processing on a cluster of commodity hardware

Why is it scalable? Why is it useful?

Batch processing framework

Move computation to data

Higher level abstraction: Pig Latin, Hive (SQL)

  • Processes:
    • Tasktracker + jobtracker
    • Namenode + Datanode

Apache Framework: Basic Modules

  • Hadoop Common: common libraries and utilities
  • HDFS
  • Hadoop MapReduce
  • YARN

HDFS

A distributed, scalable, and portable file system, written in Java, that supports the Hadoop framework.

Notice the naming convention in the figure: the upper level is the MapReduce layer, the lower level is the HDFS layer.

Namenode

  • typically a single namenode but a cluster of datanodes

  • the secondary namenode does NOT take over when the primary namenode is down; it only takes periodic snapshots (checkpoints) of the namenode's metadata
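The namenode/datanode split can be sketched in a few lines. This is only a toy model, not HDFS code: it assumes files are cut into fixed-size blocks (true of HDFS, though not stated in these notes), and the class names, the tiny `BLOCK_SIZE`, and the round-robin placement are all made up for illustration. The point is that the namenode keeps metadata only, while the datanodes hold the bytes.

```python
from collections import defaultdict

BLOCK_SIZE = 8  # bytes; tiny on purpose (real HDFS blocks are far larger)


class DataNode:
    """Stores raw block contents, keyed by block id."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}


class NameNode:
    """Holds only metadata: which blocks make up a file and where they live."""

    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.file_to_blocks = defaultdict(list)  # path -> [(block_id, datanode name)]

    def write(self, path, data):
        for offset in range(0, len(data), BLOCK_SIZE):
            index = offset // BLOCK_SIZE
            block_id = f"{path}#{index}"
            # Round-robin placement: a hypothetical policy chosen for brevity.
            node = self.datanodes[index % len(self.datanodes)]
            node.blocks[block_id] = data[offset:offset + BLOCK_SIZE]
            self.file_to_blocks[path].append((block_id, node.name))

    def read(self, path):
        by_name = {node.name: node for node in self.datanodes}
        return b"".join(
            by_name[name].blocks[block_id]
            for block_id, name in self.file_to_blocks[path]
        )


datanodes = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]
namenode = NameNode(datanodes)
original = b"hello hadoop distributed fs"
namenode.write("/logs/a.txt", original)
restored = namenode.read("/logs/a.txt")
```

If the single `NameNode` object is lost, the block-to-datanode mapping is lost with it, which is why snapshots of the namenode's metadata matter.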

MapReduce Engine

jobtracker: the client submits a job to the jobtracker; the jobtracker pushes work out to the tasktrackers, keeping the load as balanced as possible
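A rough, single-process sketch of this flow, using word count as the job. Nothing here is the Hadoop API: `run_job`, `map_phase`, `reduce_phase`, and the "least-loaded tracker" rule are illustrative assumptions standing in for the jobtracker handing splits to tasktrackers, followed by the usual map, shuffle, and reduce steps.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit (word, 1) for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_phase(word, counts):
    """Reduce: combine all counts emitted for one word."""
    return word, sum(counts)


def run_job(lines, num_trackers=2):
    # "Jobtracker": hand each input split to the least-loaded tasktracker.
    assignments = defaultdict(list)
    for line in lines:
        tracker = min(range(num_trackers), key=lambda t: len(assignments[t]))
        assignments[tracker].append(line)

    # Each "tasktracker" runs the map function on its own splits.
    mapped = []
    for tracker in sorted(assignments):
        for line in assignments[tracker]:
            mapped.extend(map_phase(line))

    # Shuffle: group intermediate values by key.
    grouped = defaultdict(list)
    for word, count in mapped:
        grouped[word].append(count)

    # Reduce: one call per key.
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


result = run_job(["big data big cluster", "data moves to compute", "big jobs"])
```

In a real cluster the map tasks run in parallel on different machines, ideally on the nodes that already hold the input data.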

YARN

MRv2: MapReduce version 2
Provides a processing model that goes beyond the map-reduce processing model.

Splits the jobtracker's responsibilities (resource management, job scheduling, job monitoring) into two parts:

  • a global ResourceManager
  • a per-application ApplicationMaster

YARN focuses on scheduling and stays compatible with MapReduce

Can now support additional engines such as graph processing and iterative modeling,
which makes it a good fit for ML workloads

Hadoop Zoo

Higher-level Hadoop stack
Have data, want to access it using a SQL-like language
ZooKeeper plays a role similar to Google's Chubby

SQOOP

SQL-to-Hadoop: can import an entire table from a relational database into HDFS

HBASE:

Key component: supports data that needs fast, random access on top of HDFS.
Based on Google BigTable; key-value storage, not a relational database.

PIG

Scripting language that describes data analysis problems as data flows. Pig can be used with JRuby, Jython, and Java.

Pig is used for ETL and supports UDFs (user-defined functions).
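To show what "data analysis as a data flow" looks like, here is a plain-Python imitation of a small filter, group, and aggregate pipeline. It is not Pig Latin and does not use Pig; the records, the `long_visit` function (playing the part of a UDF), and the 10-second threshold are all invented for the example. The comments name the Pig Latin operator each step loosely corresponds to.

```python
from collections import defaultdict

# Hypothetical input records: (user, url, seconds_on_page)
records = [
    ("alice", "/home", 12),
    ("bob", "/home", 3),
    ("alice", "/docs", 40),
    ("carol", "/docs", 25),
    ("bob", "/docs", 1),
]


def long_visit(record):
    """A user-defined function (UDF) used as a filter predicate."""
    return record[2] >= 10


# Step 1 (like FILTER ... BY): keep only rows the UDF accepts.
filtered = [record for record in records if long_visit(record)]

# Step 2 (like GROUP ... BY): collect rows per url.
grouped = defaultdict(list)
for user, url, seconds in filtered:
    grouped[url].append(seconds)

# Step 3 (like FOREACH ... GENERATE): one output row per group,
# holding (number of visits, total seconds).
summary = {url: (len(secs), sum(secs)) for url, secs in grouped.items()}
```

Each step consumes the output of the previous one, which is the data-flow style these notes describe.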

HIVE

Apache Hive: a data warehouse system. Its SQL-like language, HiveQL, facilitates querying and compiles to MapReduce jobs.

Oozie

Workflow scheduling system: job scheduling, with workflows expressed as DAGs (directed acyclic graphs)
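A minimal sketch of DAG-based job scheduling, using Python's standard `graphlib` module (Python 3.9+) rather than Oozie itself. The job names and dependencies are hypothetical; the idea is only that a job becomes runnable once everything it depends on has finished.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each job maps to the set of jobs it depends on.
workflow = {
    "import_tables": set(),
    "clean_logs": set(),
    "join_data": {"import_tables", "clean_logs"},
    "build_report": {"join_data"},
}

sorter = TopologicalSorter(workflow)
sorter.prepare()  # raises CycleError if the graph is not a DAG

run_order = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # jobs whose dependencies are all done
    run_order.append(ready)             # jobs in one batch could run in parallel
    sorter.done(*ready)
```

Here `run_order` ends up as three batches: the two independent jobs first, then the join, then the report.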

ZooKeeper

  • Operational services: distributed configuration services
  • Centralized, synchronization

Flume:

collecting, aggregating, and moving log data

Some Additional Components

parallel database technology

Spark: in-memory processing, multi-stage, good for complex analysis (especially machine learning)

Impala

Works on HDFS directly; compare with Hive, which compiles to MapReduce