Navy Xie


A full-stack programmer.


Welcome to my blog.

Introduction to Big Data - Week 3

第3周的主要内容是,可扩展计算的基本概念,以及,Hadoop相关概念,比如,为什么选择Hadoop;Hadoop的生态系统,包括HDFS、YARN、MapReduce等。

Basic Scalable Computing Concepts

The commodity clusters are a cost effective way of achieving data parallel scalability for big data applications.

Programming model for big data -> Programmability on top of DFS (Distributed File Systems)

Requirements for big data programming models

  1. Support big data operation split volumes fo data access data fast distribute computations to nodes

  2. Handle fault tolerance replicate data partitions recover files when needed

  3. Enable adding more racks
  4. Optimized for specific data types

Fault-tolerance

It is NOT practical to restart everything every time, if failure happens. The ability to recover from such failures is called Fault-tolerance. Two neat solutions emerged. Namely,

  • Redundant data storage
  • Data-parallel job restart

Quote

在讲Foundation重要的原因时,Dr. Ilkay提到了David Allan的名言:

Quote about foundation

“我们在欣赏一个建筑物时,不应只是去看它的美,经得住时间考验的,是它的基础结构。” 意思是说,不要光是被外表的东西吸引,要去研究它的基础。这里引用这句话的目的是说,不要仅看到Hadoop表面的东西,要去理解它的基础功能,要解决的基本问题,比如,大数据操作、容错、增加Rack,以及数据类型优化等。 那么这个 David Allan Coe 又是何许人也呢?

David Allan Coe

David是美国70-80年代有名的音乐人,从图片中可以看到浓浓的年代气息,金色长发、链子、吉他、纹身。我找了几首他的歌听了一下,如,Mona Lisa Lost Her SmileShe Used to Love Me a Lot, 属于 bluescountry music 风格,挺好听的。他也曾演绎过那首经典的 Unchained Melody,跟我们听的比较多的 Bobby Hatfield 版风格很不同。

Getting Started with Hadoop

MapReduce

MapReduce

Map-Reduce is a scalable programming model that simplifies distributed processing of data. It consists of three main steps: Mapping, Shuffling and Reducing. An easy way to think about a Map-Reduce job is to compare it with act of ‘delegating’ a large task to a group of people, and then combining the result of each person’s effort, to produce the final outcome.

一个典型的例子是,大家一起做意大利面调味汁(Pasta Sauce)。Source: https://words.sdsc.edu/words-data-science/mapreduce 当然,如果对中国人介绍这个例子,可以考虑换成做火锅,可能接地气 :)

Hadoop ecosystem

Hadoop ecosystem

  • HDFS: scalable storage, fault tolerance
  • YARN: flexible scheduling and resource management. YARN schedules jobs on 40,000+ servers at Yahoo
  • MapReduce: simplified programming model, Map->apply(), Reduce->summarize(). Google used MapReduce for indexing web sites.
  • Pig: dataflow scripting. Created at Yahoo.
  • Hive: SQL-like queries. Created at Facebook.
  • Giraph: used by Facebook to analyze social graphs
  • Storm, Spark, Flink: real-time and in-memory processing
  • HBase, Cassandra, MongoDB: NoSQL for non-files, key-values, sparse tables. HBase used for Facebook’s Messaging Platform
  • Zookeeper: created by Yahoo to wrangle services named after animals, for management, synchronization, high-avilability, configuration

YARN

YARN

When to reconsider Hadoop?

Reconsider Hadoop

Quiz

What are the specific benefits to a distributed file system? (A, B, C)

A. Data scalability
B. High concurrency
C. High fault tolerance
D. Large storage (This is NOT a benefit specific to DFS, it is a benefit to long-term information storage, which is a more general idea.)

What is a way to enable fault tolerance? (C)

A. Distributed computing
B. Better LAN connection
C. Data-parallel job restart
D. System wide restart

What are the two key components of HDFS and what are they used ofr? (B)

A. FASTA for genome sequence and Rasters for geospatial data.
B. NameNode for metadat and DataNode for block storage.
C. NameNode for block storage and DataNode for metadata.

What is the job of the NameNode? (A)

A. Coordinate operations and assigns tasks to Data Nodes
B. Listens from DataNode for block creation, deletion, and replication
C. For gene sequencing calculations

As covered in the slides, which of the following are the major goals of Hadoop? (A, B, D, E, F)

A. Handle Fault Tolerance
B. Facilitate a Shared Environment
C. Latency Sensitive Tasks
D. Optimized for a Variety of Data Types
E. Provide Value for Data
F. Enable Scalability

Assignment: Understand by Doing: MapReduce

Your job is to perform the steps of MapReduce to calculate a count of the number of squares, stars, circles, hearts and triangles in the dataset shown in the picture above. You should follow the steps of MapReduce as they were explained in this video.

MapReduce Assignment

最近的文章

学习平台比较

上次跟大家聊了“终身学习”,那么多的学习平台,各自的特点是什么?如何选择呢?我这里帮助大家熟悉一下 Coursera / Udacity / Udemy 等MOOC平台。我这里主要是针对机器学习(ML, Machine Learning)的学习。传统学习方法买几本经典的书籍,从数学和算法看起,同时在网上阅读一些文章,或观看视频。这样学习的好处是,对形成系统的理论体系有帮助。缺点是,学习效率不高,可能花了几周在线性代数和微积分上,还是写不出或读不懂机器学习代码,所学知识与实际问题解决之间...…

Learning继续阅读
更早的文章

终身学习 (1)

终身学习这个概念,近几年,越来越多的被人提起。我们在快速变化的时代面前,感到无所适从,也对层出不穷的新挑战,感到困惑。似乎,除了不断学习,终身学习,我们别无他法。如何一边工作,照顾家庭,一边还持续学习呢?我们来听听一位IT大叔的分享。现在人们都在说终身学习,原因是,我们这个时代,跟我们父辈那个年代差别太大。消费方式、教育程度、住什么房子、有没有小汽车等等,我觉得都不是重点。 最典型的是,我们父辈那个年代的人,可以凭一门手艺生活一辈子。 通常,一个木匠,一个个体户,或一个机关人员,从学校...…

Learning继续阅读