Week 3 covers the basic concepts of scalable computing, plus Hadoop-related topics such as why Hadoop was chosen, and the Hadoop ecosystem, including HDFS, YARN, MapReduce, and more.
Basic Scalable Computing Concepts
Commodity clusters are a cost-effective way of achieving data-parallel scalability for big data applications.
Programming model for big data -> Programmability on top of DFS (Distributed File Systems)
Requirements for big data programming models
- Support big data operations: split large volumes of data, provide fast data access, distribute computations to nodes
- Handle fault tolerance: replicate data partitions, recover files when needed
- Enable adding more racks
- Optimized for specific data types
Fault-tolerance
It is NOT practical to restart everything every time a failure happens. The ability to recover from such failures is called fault tolerance. Two neat solutions emerged, namely:
- Redundant data storage
- Data-parallel job restart
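The second idea can be sketched in a few lines of Python. This is a toy simulation, not Hadoop code, and all names in it are mine: the point is that only the partitions whose tasks failed get re-run, rather than restarting the whole job.

```python
import random

def process_partition(partition):
    """Pretend to process one data partition; may fail transiently."""
    if random.random() < 0.3:  # simulated node failure
        raise RuntimeError("node failure")
    return sum(partition)

def run_job(partitions, max_retries=5):
    """Data-parallel job restart: reschedule only the failed partitions."""
    results = {}
    pending = dict(enumerate(partitions))
    for _ in range(max_retries):
        failed = {}
        for idx, part in pending.items():
            try:
                results[idx] = process_partition(part)
            except RuntimeError:
                failed[idx] = part  # retry just this partition next round
        pending = failed
        if not pending:
            break
    return [results[i] for i in sorted(results)]

data = [[1, 2], [3, 4], [5, 6]]
print(run_job(data))  # each partition summed independently, e.g. [3, 7, 11]
```

A whole-job restart would throw away the results of the partitions that succeeded; here they are kept in `results` and never recomputed.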
Quote
When explaining why the foundation matters, Dr. Ilkay quoted David Allan Coe:
"When we admire a building, we should not look only at its beauty; what stands the test of time is its underlying structure." In other words, don't be drawn in only by the surface; study the foundation. The point of the quote here is: don't look only at Hadoop's surface, but understand its foundational capabilities and the basic problems it solves, such as big data operations, fault tolerance, adding racks, and optimization for different data types. So who is this David Allan Coe?
David was a well-known American musician of the 1970s-80s, and his photos radiate that era: long blond hair, chains, a guitar, tattoos. I listened to a few of his songs, such as Mona Lisa Lost Her Smile and She Used to Love Me a Lot; they are in a blues and country music style, and quite good. He also covered the classic Unchained Melody, in a style very different from the Bobby Hatfield version most of us know.
Getting Started with Hadoop
MapReduce
MapReduce is a scalable programming model that simplifies distributed processing of data. It consists of three main steps: Map, Shuffle, and Reduce. An easy way to think about a MapReduce job is to compare it with the act of 'delegating' a large task to a group of people, and then combining the results of each person's effort to produce the final outcome.
A typical example is a group of people making pasta sauce together. Source: https://words.sdsc.edu/words-data-science/mapreduce Of course, when presenting this example to a Chinese audience, you could swap in making hot pot, which might feel more relatable :)
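The three steps can also be traced in plain Python. This is a toy word count, not actual Hadoop code, and the function names are mine:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.split()]

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: summarize each group, here by summing the counts."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["my apple is red", "my grape is green"]
pairs = chain.from_iterable(map_phase(d) for d in docs)  # mappers run per document
counts = reduce_phase(shuffle_phase(pairs))
print(counts)  # {'my': 2, 'apple': 1, 'is': 2, 'red': 1, 'grape': 1, 'green': 1}
```

In the delegation analogy, each call to `map_phase` is one helper working on their own document, and `reduce_phase` is the act of combining everyone's partial results.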
Hadoop ecosystem
- HDFS: scalable storage, fault tolerance
- YARN: flexible scheduling and resource management. YARN schedules jobs on 40,000+ servers at Yahoo
- MapReduce: simplified programming model, Map->apply(), Reduce->summarize(). Google used MapReduce for indexing web sites.
- Pig: dataflow scripting. Created at Yahoo.
- Hive: SQL-like queries. Created at Facebook.
- Giraph: used by Facebook to analyze social graphs
- Storm, Spark, Flink: real-time and in-memory processing
- HBase, Cassandra, MongoDB: NoSQL for non-files, key-values, sparse tables. HBase used for Facebook’s Messaging Platform
- Zookeeper: created by Yahoo to wrangle services named after animals, for management, synchronization, high availability, configuration
YARN
When to reconsider Hadoop?
Quiz
What are the specific benefits of a distributed file system? (A, B, C)
A. Data scalability
B. High concurrency
C. High fault tolerance
D. Large storage (This is NOT a benefit specific to DFS, it is a benefit to long-term information storage, which is a more general idea.)
What is a way to enable fault tolerance? (C)
A. Distributed computing
B. Better LAN connection
C. Data-parallel job restart
D. System wide restart
What are the two key components of HDFS and what are they used for? (B)
A. FASTA for genome sequence and Rasters for geospatial data.
B. NameNode for metadata and DataNode for block storage.
C. NameNode for block storage and DataNode for metadata.
What is the job of the NameNode? (A)
A. Coordinates operations and assigns tasks to DataNodes
B. Listens from DataNode for block creation, deletion, and replication
C. For gene sequencing calculations
As covered in the slides, which of the following are the major goals of Hadoop? (A, B, D, E, F)
A. Handle Fault Tolerance
B. Facilitate a Shared Environment
C. Latency Sensitive Tasks
D. Optimized for a Variety of Data Types
E. Provide Value for Data
F. Enable Scalability
Assignment: Understand by Doing: MapReduce
Your job is to perform the steps of MapReduce to calculate a count of the number of squares, stars, circles, hearts and triangles in the dataset shown in the picture above. You should follow the steps of MapReduce as they were explained in this video.
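The same Map/Shuffle/Reduce steps can be walked through in Python. Since the picture is not reproduced here, the dataset below is a made-up stand-in, split into partitions as if each sat on a different node:

```python
from collections import defaultdict

# Hypothetical stand-in for the picture's dataset, one list per node
partitions = [
    ["square", "star", "circle", "heart"],
    ["triangle", "circle", "square", "star"],
    ["heart", "square", "circle", "triangle"],
]

# Map: each node emits a (shape, 1) pair for every shape in its partition
mapped = [(shape, 1) for part in partitions for shape in part]

# Shuffle: group the pairs by shape
grouped = defaultdict(list)
for shape, one in mapped:
    grouped[shape].append(one)

# Reduce: sum each group to get the final counts
counts = {shape: sum(ones) for shape, ones in grouped.items()}
print(counts)  # {'square': 3, 'star': 2, 'circle': 3, 'heart': 2, 'triangle': 2}
```

For the actual assignment, replace `partitions` with the shapes shown in the picture; the three steps stay the same.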