Navy Xie


A full-stack programmer.


Welcome to my blog.

Introduction to Big Data - Week 2

Week 2 covers the characteristics of big data, including the six Vs and the five steps of data analysis.

Characteristics of Big Data

  • Volume == size
  • Variety == complexity
  • Velocity == speed
  • Veracity == quality
  • Valence == connectedness
  • Value

The course mentions a 1977 film that shows how powers of ten scale in the universe. The way it tells the story is impressive, although the film is only hosted on YouTube: https://www.youtube.com/watch?v=0fKBhvDjuy0

Powers of Ten takes us on an adventure in magnitudes. Starting at a picnic by the lakeside in Chicago, this famous film transports us to the outer edges of the universe. Every ten seconds we view the starting point from ten times farther out until our own galaxy is visible only as a speck of light among many others. Returning to Earth with breathtaking speed, we move inward, into the hand of the sleeping picnicker, with ten times more magnification every ten seconds. Our journey ends inside a proton of a carbon atom within a DNA molecule in a white blood cell. POWERS OF TEN © 1977 EAMES OFFICE LLC (Available at www.eamesoffice.com)

There is also a "small" definition of big data: "Big Data" is often used to refer to any dataset that is difficult to manage using a traditional database system.

The Process of Data Analysis

The steps of data analysis are:

  • Step 1: Acquiring Data
  • Step 2-A: Exploring Data
  • Step 2-B: Pre-Processing Data
  • Step 3: Analyzing Data
  • Step 4: Communicating Results
  • Step 5: Turning Insights into Action
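To make the steps above more concrete, here is a minimal Python sketch that walks a tiny in-memory dataset through each one. Everything in it is an assumption for illustration and does not come from the course: the sample records, the function names (`acquire`, `explore`, `preprocess`, `analyze`, `communicate`, `act`), and the 15 °C threshold are all hypothetical.

```python
from statistics import mean

# Hypothetical raw records. In practice, Step 1 would read from files,
# databases, or APIs instead of a hard-coded list.
RAW_RECORDS = [
    {"city": "San Diego", "temp_c": "21.5"},
    {"city": "San Diego", "temp_c": "22.5"},
    {"city": "Chicago", "temp_c": "10.0"},
    {"city": "Chicago", "temp_c": None},  # a missing value to clean up
    {"city": "Chicago", "temp_c": "12.0"},
]


def acquire():
    """Step 1: acquire the data (here, just copy the sample list)."""
    return list(RAW_RECORDS)


def explore(records):
    """Step 2-A: explore the data by counting rows and missing values."""
    missing = sum(1 for r in records if r["temp_c"] is None)
    return {"rows": len(records), "missing": missing}


def preprocess(records):
    """Step 2-B: drop rows with missing values and convert strings to floats."""
    return [
        {"city": r["city"], "temp_c": float(r["temp_c"])}
        for r in records
        if r["temp_c"] is not None
    ]


def analyze(records):
    """Step 3: compute the mean temperature per city."""
    by_city = {}
    for r in records:
        by_city.setdefault(r["city"], []).append(r["temp_c"])
    return {city: mean(temps) for city, temps in by_city.items()}


def communicate(averages):
    """Step 4: turn the results into short, readable report lines."""
    return [f"{city}: {avg:.1f} C" for city, avg in sorted(averages.items())]


def act(averages, threshold_c=15.0):
    """Step 5: derive an action, here flagging cities below a threshold."""
    return sorted(city for city, avg in averages.items() if avg < threshold_c)


def run_pipeline():
    raw = acquire()
    profile = explore(raw)
    clean = preprocess(raw)
    averages = analyze(clean)
    return profile, communicate(averages), act(averages)


if __name__ == "__main__":
    profile, report, actions = run_pipeline()
    print(profile)
    print(report)
    print(actions)
```

Each function maps to one step, so the order of calls in `run_pipeline` mirrors the list above. Real pipelines would of course loop back (for example, exploring again after pre-processing), which this sketch leaves out.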

Quite a few other concepts are covered here as well. For example, a data science strategy includes:

  • Aim
  • Policy
  • Plan
  • Action

Big data engineering includes:

  • Acquire
  • Prepare
  • Analyze
  • Report
  • Act

The data science activities:

  • purpose: big data strategy
  • people: a group of researchers comprised of people with complementary skills
  • process: data science steps or tasks, such as data collection, data cleaning, data processing / analysis, result visualization, resulting in a data science workflow
  • platform: computing resources
  • programmability: programming tools

Quiz

I got the following questions wrong.

What are the challenges with big data that has high volume? (D)

A. Speed Increase in Processing
B. Effectiveness and Cost
C. Storage and Accessibility
D. Cost, Scalability, and Performance

Which of the following is the best way to describe why it is crucial to process data in real-time? (D)

A. More accurate.
B. Batch processing is an older method that is not as accurate as real-time processing.
C. More expensive to batch process.
D. Prevents missed opportunities.

What is done to the data in the preparation stage? (E)

A. select analytical techniques
B. retrieve data (acquire stage)
C. build models
D. identify data sets and query data (acquire stage)
E. cleaning, integrating, and packaging (step 2-b, pre-process data)

What is the first step in finding a right problem to tackle in data science? (B)

A. ask the right questions
B. define the problem
C. define goals
D. assess the situation

Which is a technique mentioned in the videos for building a model? (D)

A. investigation
B. evaluation
C. validation
D. analysis

Summary

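This week centers on the characteristics of big data. It introduces the six Vs at a fairly conceptual level and then walks through the steps of data analysis. This will help us in future work to look, from a professional perspective, at how a data analysis effort is set up, how it is carried out, and how it produces value.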
