DATA100 Note [3]

EDA and Data Wrangling

Unboxing of New Data Set

1. Unboxing of New Phone

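Imagine that a new phone has just arrived. One possible workflow is:

Power on → Try things out → Find a setting that feels awkward → Adjust the setting → Try again → Find another setting that feels awkward → Adjust the setting → ...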

2. Unboxing of New Data Set

We have just received a new data set:

Load Data → EDA → Notice something is off → Wrangling → EDA → Notice something else is off → Wrangling → ...

3. EDA (Exploratory Data Analysis) and Wrangling

EDA is an open-ended, informal analysis paradigm.

It focuses mainly on:

  1. Structure

  2. Granularity, Scope, and Temporality

  3. Faithfulness

3.1 Data's Structure


  • Format:

    • CSV
    • TSV
    • JSON
    • ...
  • Variable Type:

    • Quantitative
      • Continuous
      • Discrete
    • Qualitative
      • Ordinal
      • Nominal

4. Data's Structure

4.1 File Format: CSV, TSV, JSON, ...

CSV (first two lines of the file):

'Year,Candidate,Party,Popular vote,Result,%\n'
'1824,Andrew Jackson,Democratic-Republican,151271,loss,57.21012204\n'

TSV (first two lines of the file; the leading \ufeff is a byte order mark):

'\ufeffYear\tCandidate\tParty\tPopular vote\tResult\t%\n'
'1824\tAndrew Jackson\tDemocratic-Republican\t151271\tloss\t57.21012204\n'

JSON:

[
  {
    "Candidate": "Andrew Jackson",
    "Party": "Democratic-Republican",
    "Popular vote": 151271
  }
]
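
The three samples above can be read with Python's standard library. The following is a minimal sketch, assuming the file contents are already in memory as strings; the variable names are illustrative and not part of the original note. It also shows one practical difference between the formats: CSV and TSV fields arrive as strings, while JSON keeps numbers as numbers.

```python
import csv
import io
import json

# Sample contents copied from the note (held in memory instead of on disk).
csv_text = (
    "Year,Candidate,Party,Popular vote,Result,%\n"
    "1824,Andrew Jackson,Democratic-Republican,151271,loss,57.21012204\n"
)
tsv_text = (
    "\ufeffYear\tCandidate\tParty\tPopular vote\tResult\t%\n"
    "1824\tAndrew Jackson\tDemocratic-Republican\t151271\tloss\t57.21012204\n"
)
json_text = (
    '[{"Candidate": "Andrew Jackson", '
    '"Party": "Democratic-Republican", "Popular vote": 151271}]'
)

# CSV: comma-separated, first line is the header.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

# TSV: tab-separated; strip the byte order mark so the first
# column is named "Year" rather than "\ufeffYear".
tsv_rows = list(
    csv.DictReader(io.StringIO(tsv_text.lstrip("\ufeff")), delimiter="\t")
)

# JSON: a list of objects; value types are preserved.
json_rows = json.loads(json_text)

print(type(csv_rows[0]["Popular vote"]))   # str
print(type(json_rows[0]["Popular vote"]))  # int
```

When reading a real file, opening it with `encoding="utf-8-sig"` removes a leading byte order mark automatically.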

4.2 Metadata

[
  {
    "Meta": {
      "Size": "1.2MB",
      "Date": "2021-10-01"
    },
    "Data": [
      {
        "Candidate": "Andrew Jackson",
        "Popular vote": 151271
      }, // ... more data section
    ]
  }
]
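
When a file bundles metadata with the records like this, a first step is to separate the two. The sketch below assumes a strictly valid JSON version of the sample (no comments, no trailing commas, and only the one record shown); the variable names are illustrative.

```python
import json
from datetime import date

# A valid-JSON version of the sample above, with the elided records left out.
raw = """
[
  {
    "Meta": {"Size": "1.2MB", "Date": "2021-10-01"},
    "Data": [
      {"Candidate": "Andrew Jackson", "Popular vote": 151271}
    ]
  }
]
"""

document = json.loads(raw)
entry = document[0]          # the sample wraps everything in an outer list

meta = entry["Meta"]         # information about the data
records = entry["Data"]      # the data itself

# Metadata values are plain strings; convert them before using them.
collected_on = date.fromisoformat(meta["Date"])

print(meta["Size"], collected_on.year, len(records))
```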

4.3 Variable Type
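
The variable type determines which summaries are meaningful. The following is a rough sketch using made-up values (the column names, the values, and the `SIZE_ORDER` list are all hypothetical): averages suit quantitative variables, ordering suits ordinal ones, and nominal variables support only counting.

```python
from collections import Counter
from statistics import mean

# Hypothetical columns, one per variable type.
heights_cm = [172.4, 165.0, 180.3]   # quantitative, continuous
siblings = [0, 2, 1]                 # quantitative, discrete
sizes = ["M", "S", "L"]              # qualitative, ordinal
colors = ["red", "blue", "red"]      # qualitative, nominal

# Quantitative: arithmetic is meaningful.
mean_height = mean(heights_cm)
total_siblings = sum(siblings)

# Ordinal: categories have an order, but it must be stated explicitly;
# alphabetical order would put "L" before "M" before "S".
SIZE_ORDER = ["S", "M", "L"]
sorted_sizes = sorted(sizes, key=SIZE_ORDER.index)

# Nominal: no order and no arithmetic, only counts.
most_common_color = Counter(colors).most_common(1)[0][0]

print(sorted_sizes, most_common_color)
```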

5. Data's Granularity, Scope, and Temporality

  • Granularity: How detailed is the data about each individual?

  • Scope: How well do the samples cover the target population?

  • Temporality: How timely is the data?
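
Each of the three questions above can be turned into a quick check. The sketch below uses a small illustrative list of election records; the field names follow the earlier samples, and the choice of (Year, Candidate) as the key is an assumption.

```python
from collections import Counter

# Illustrative records in the shape of the election samples.
records = [
    {"Year": 1824, "Candidate": "Andrew Jackson"},
    {"Year": 1824, "Candidate": "John Quincy Adams"},
    {"Year": 1828, "Candidate": "Andrew Jackson"},
]

# Granularity: what does one row represent?
# Here we check that each row is one candidate in one election.
rows_per_key = Counter((r["Year"], r["Candidate"]) for r in records)
one_row_per_candidate_year = all(n == 1 for n in rows_per_key.values())

# Scope: which elections does the data cover at all?
years_covered = sorted({r["Year"] for r in records})

# Temporality: what time span do the records come from?
first_year, last_year = years_covered[0], years_covered[-1]

print(one_row_per_candidate_year, years_covered, first_year, last_year)
```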

6. Data's Faithfulness

6.1 Signs that data may not be faithful

  • Unrealistic or “incorrect” values
  • Violations of obvious dependencies
  • Clear signs that data was entered by hand
  • Signs of data falsification
  • Duplicated records or fields
  • Truncated data
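
Some of these signs can be checked mechanically. The following is a minimal sketch covering two of them, duplicated records and unrealistic values. The rows are hypothetical: the first comes from the earlier sample, the second is a deliberate copy, and the third has a made-up vote share above 100%.

```python
from collections import Counter

# Hypothetical rows: (year, candidate, vote share in percent).
rows = [
    (1824, "Andrew Jackson", 57.21012204),
    (1824, "Andrew Jackson", 57.21012204),  # exact duplicate
    (1828, "Andrew Jackson", 157.2),        # made-up impossible share
]

# Duplicated records: any row that appears more than once.
counts = Counter(rows)
duplicates = [row for row, n in counts.items() if n > 1]

# Unrealistic values: a percentage must lie between 0 and 100.
out_of_range = [row for row in rows if not 0.0 <= row[2] <= 100.0]

print(duplicates)
print(out_of_range)
```

Flagged rows are candidates for a closer look, not automatic deletions; a duplicate may be legitimate depending on the data's granularity.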

6.2 Missing Values (Abnormal Values)

  • Many abnormal values are actually just missing values.

Three typical ways to deal with missing values: drop the affected records, keep them but mark them as NaN, or impute a replacement value.
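
The three options can be sketched with the standard library alone. The example assumes a hypothetical sensor column in which `-999.0` was used as a placeholder for "no reading", and it uses the mean as the imputed value; both choices are illustrative only.

```python
import math
from statistics import mean

# Hypothetical readings; -999.0 is an assumed placeholder for "missing".
SENTINEL = -999.0
raw = [12.0, -999.0, 15.0, 9.0]

# Option 1: drop the records with missing values.
dropped = [v for v in raw if v != SENTINEL]

# Option 2: keep the records but mark the value as NaN,
# so it can no longer be mistaken for a real measurement.
as_nan = [math.nan if v == SENTINEL else v for v in raw]

# Option 3: impute, here with the mean of the observed values.
fill_value = mean(dropped)
imputed = [fill_value if math.isnan(v) else v for v in as_nan]

print(dropped)   # [12.0, 15.0, 9.0]
print(imputed)   # [12.0, 12.0, 15.0, 9.0]
```

Leaving the placeholder in place would be the worst choice here: the mean of `raw` is about -240.75, far from any real reading.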

7. Summary

  • Data Overview: Assess the data's date, size, organization, and structure.

  • Individual Analysis: Investigate each field/attribute/dimension.

  • Pairwise Analysis: Explore relationships between dimensions.

  • Along the way, we can:

    • Visualize
    • Validate assumptions
    • Address anomalies
    • Document everything (ideally in a Jupyter Notebook)