Week 2 covers the characteristics of big data: the six Vs, plus the five steps of data analysis.
Characteristics of Big Data
- Volume == size
- Variety == complexity
- Velocity == speed
- Veracity == quality
- Valence == connectedness
- Value
The lecture mentions a 1977 film about how powers of ten scale in the universe. The way it tells the story is impressive, though the film is only hosted on YouTube: https://www.youtube.com/watch?v=0fKBhvDjuy0
Powers of Ten takes us on an adventure in magnitudes. Starting at a picnic by the lakeside in Chicago, this famous film transports us to the outer edges of the universe. Every ten seconds we view the starting point from ten times farther out until our own galaxy is visible only as a speck of light among many others. Returning to Earth with breathtaking speed, we move inward- into the hand of the sleeping picnicker- with ten times more magnification every ten seconds. Our journey ends inside a proton of a carbon atom within a DNA molecule in a white blood cell. POWERS OF TEN © 1977 EAMES OFFICE LLC (Available at www.eamesoffice.com)
There is also a "small" definition of big data: "Big Data" is often used to refer to any dataset that is difficult to manage using a traditional database system.
The Process of Data Analysis
The steps of data analysis are:
- Step 1: Acquiring Data
- Step 2-A: Exploring Data
- Step 2-B: Pre-Processing Data
- Step 3: Analyzing Data
- Step 4: Communicating Results
- Step 5: Turning Insights into Action
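To make the steps above concrete, here is a minimal sketch in Python that maps each step to one small function. Everything in it is an assumption made for illustration and does not come from the course: the sample records, the function names, and the 20-degree threshold in the last step are all hypothetical.

```python
from statistics import mean

# Hypothetical raw records; in practice Step 1 would query a database or an API.
RAW = [
    {"city": "Chicago", "temp_c": "21.5"},
    {"city": "Chicago", "temp_c": "23.5"},
    {"city": "Boston", "temp_c": None},  # a missing value to clean up later
    {"city": "Boston", "temp_c": "18.0"},
]

def acquire():
    """Step 1: acquire data (here, just copy the in-memory sample)."""
    return list(RAW)

def explore(rows):
    """Step 2-A: explore the data with simple counts."""
    missing = sum(1 for r in rows if r["temp_c"] is None)
    return {"rows": len(rows), "missing_temp": missing}

def preprocess(rows):
    """Step 2-B: drop rows with missing values and convert strings to floats."""
    return [
        {"city": r["city"], "temp_c": float(r["temp_c"])}
        for r in rows
        if r["temp_c"] is not None
    ]

def analyze(rows):
    """Step 3: compute the mean temperature per city."""
    by_city = {}
    for r in rows:
        by_city.setdefault(r["city"], []).append(r["temp_c"])
    return {city: mean(vals) for city, vals in by_city.items()}

def communicate(result):
    """Step 4: render the result as a plain-text report."""
    return "\n".join(f"{city}: {avg:.1f} C" for city, avg in sorted(result.items()))

def act(result, threshold=20.0):
    """Step 5: turn the insight into an action (hypothetical rule: flag warm cities)."""
    return [city for city, avg in sorted(result.items()) if avg > threshold]

rows = acquire()
summary = explore(rows)
clean = preprocess(rows)
result = analyze(clean)
report = communicate(result)
actions = act(result)

print(summary)
print(report)
print(actions)
```

The point of the sketch is only the shape of the workflow: each step consumes the output of the previous one, and exploring the data (Step 2-A) is what reveals the missing value that pre-processing (Step 2-B) then removes.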
The lectures also cover quite a few other concepts. For example, a data science strategy consists of:
- Aim
- Policy
- Plan
- Action
Big data engineering consists of:
- Acquire
- Prepare
- Analyze
- Report
- Act
The data science activities:
- purpose: big data strategy
- people: a group of researchers with complementary skills
- process: data science steps or tasks, such as data collection, data cleaning, data processing / analysis, result visualization, resulting in a data science workflow
- platform: computing resources
- programmability: programming tools
Quiz
I got the following questions wrong. The correct answer is given in parentheses.
What are the challenges with big data that has high volume? (D)
A. Speed Increase in Processing
B. Effectiveness and Cost
C. Storage and Accessibility
D. Cost, Scalability, and Performance
Which of the following is the best way to describe why it is crucial to process data in real-time? (D)
A. More accurate.
B. Batch processing is an older method that is not as accurate as real-time processing.
C. More expensive to batch process.
D. Prevents missed opportunities.
What is done to the data in the preparation stage? (E)
A. select analytical techniques
B. retrieve data (acquire stage)
C. build models
D. identify data sets and query data (acquire stage)
E. cleaning, integrating, and packaging (step 2-b, pre-process data)
What is the first step in finding a right problem to tackle in data science? (B)
A. ask the right questions
B. define the problem
C. define goals
D. assess the situation
Which is a technique mentioned in the videos for building a model? (D)
A. investigation
B. evaluation
C. validation
D. analysis
Summary
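This week centers on the characteristics of big data. It introduces the six Vs at a fairly conceptual level and then walks through the steps of data analysis. This should help in future work by giving a professional view of how a data analysis effort is set up, how it proceeds, and how it produces value.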