Overview of Hadoop Stack

Tez: execution engine, runs on top of YARN

Applications and Framework

HBase: column-oriented store modeled on Google's BigTable
Hive: data summarization and querying
Pig: data flow language
Spark: computation engine

HDFS and HDFS2

HDFS

Goal:

Requirements:

Resilience: can handle failures
Scalability: the single namespace becomes a limiting issue
Application locality
Portability

Design:

One NameNode
Multiple DataNodes
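
The single-NameNode design can be pictured with a toy model: the NameNode keeps only metadata (which blocks make up a file and which DataNodes hold each replica), while the DataNodes hold the data. This is a minimal sketch, not HDFS code. The class `ToyNameNode`, the round-robin placement, the tiny `BLOCK_SIZE`, and the replication factor of 3 are all assumptions made for illustration.

```python
BLOCK_SIZE = 128   # toy block size in bytes (assumption for the sketch)
REPLICATION = 3    # replicas per block (assumption for the sketch)


class ToyNameNode:
    """Hypothetical metadata server: maps files to blocks and blocks to DataNodes."""

    def __init__(self, datanodes):
        self.datanodes = list(datanodes)
        self.block_map = {}   # path -> list of (block_index, [datanode names])
        self._next = 0        # round-robin cursor for replica placement

    def add_file(self, path, size):
        """Split a file into blocks and assign each block to several DataNodes."""
        n_blocks = max(1, -(-size // BLOCK_SIZE))          # ceiling division
        n_replicas = min(REPLICATION, len(self.datanodes))
        blocks = []
        for i in range(n_blocks):
            replicas = [
                self.datanodes[(self._next + r) % len(self.datanodes)]
                for r in range(n_replicas)
            ]
            self._next += 1
            blocks.append((i, replicas))
        self.block_map[path] = blocks
        return blocks

    def readable_after_failure(self, path, failed):
        """A file stays readable if every block has at least one live replica."""
        return all(
            any(dn not in failed for dn in replicas)
            for _, replicas in self.block_map[path]
        )
```

With four DataNodes and a 300-byte file, the sketch creates three blocks. Losing two DataNodes still leaves every block with a live replica, which is the resilience requirement in miniature. The single `block_map` is also why one NameNode becomes a scalability limit.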

HDFS2:

HDFS Federation
multiple namespaces
block pools

Design:

Multiple NameNode servers
Multiple namespaces
Block pools
High availability
Heterogeneous storage: SSD, RAM, etc.

YARN

JobTracker
TaskTracker: receives requests from the JobTracker and processes them
- executes tasks per JobTracker request

Main idea: separate scheduling and resource management from job tracking
Job tracking responsibility is spread across multiple ApplicationMasters
High-availability ResourceManager
Uses cgroups to manage resources used by containers
YARN web services
Timeline server

Hadoop Execution Environment

Original model:
MapReduce: works for a good number of things but not all, for instance workloads with iteration (like ML algorithms)
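
The MapReduce model can be sketched in plain Python as three steps: map each record to key-value pairs, group the pairs by key (the shuffle), and reduce each group. This is only an in-memory illustration of the programming model, not the Hadoop API. The function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

An iterative algorithm would have to call something like `word_count` once per iteration, and in Hadoop each such job writes its output to storage before the next one starts. That overhead is why the model fits iteration poorly.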

Tez, YARN, Spark (DAG)
In-memory caching

Tez:

Can handle DAGs
Dynamic DAG changes
Customizable data format (not necessarily key-value)
Pig and Hive already use Tez
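
To make the DAG idea concrete, here is a minimal sketch of running tasks in dependency order, using Python's standard `graphlib` module (Python 3.9+). It is not the Tez API. The helper `run_dag` and the shape of its `tasks` and `deps` arguments are assumptions made for illustration.

```python
from graphlib import TopologicalSorter


def run_dag(tasks, deps):
    """Run each task after all of its upstream tasks.

    tasks: name -> function taking a dict of upstream results
    deps:  name -> set of upstream task names
    Returns (results by task name, execution order).
    """
    results = {}
    order = []
    # static_order() yields every node only after all of its predecessors.
    for name in TopologicalSorter(deps).static_order():
        upstream = {d: results[d] for d in deps.get(name, ())}
        results[name] = tasks[name](upstream)
        order.append(name)
    return results, order
```

A pipeline such as load, then two independent filters, then a join is a single DAG here, whereas a strict map-then-reduce model would need several chained jobs to express it.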

Spark:

In-memory processing
Can handle cyclic data flows

Hadoop Resource Scheduling

Motivations for scheduling?
- important work may have to wait
Policies
- FIFO (default)
- Fair Share
- Capacity
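
The motivation above can be illustrated with a toy simulation comparing FIFO with fair share. This is a simplified sketch, not the Hadoop schedulers: jobs are just amounts of work, all jobs arrive at time zero, time advances in whole ticks, and any share a job does not need within a tick is wasted. The function names `fifo` and `fair_share` are hypothetical, and the Capacity policy is not modeled.

```python
def fifo(jobs, slots):
    """jobs: list of (name, work_units) in arrival order.

    Each job takes all slots until it is done; returns finish tick per job.
    """
    finish, clock = {}, 0
    for name, work in jobs:
        clock += -(-work // slots)   # ceiling division: ticks this job needs
        finish[name] = clock
    return finish


def fair_share(jobs, slots):
    """Every tick, split the slots evenly among the jobs still running."""
    remaining = dict(jobs)
    finish, clock = {}, 0
    while remaining:
        clock += 1
        share = slots / len(remaining)
        for name in list(remaining):
            remaining[name] -= share
            if remaining[name] <= 1e-9:   # tolerate float rounding
                finish[name] = clock
                del remaining[name]
    return finish
```

With `jobs = [("big", 100), ("small", 4)]` and 4 slots, FIFO makes the small job wait behind the big one, while fair share lets it finish after a couple of ticks at a small cost to the big job.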

Hadoop based application and services