DATA100 Note [4]

Visualization#

Encoding Information into Intuition

1. Goals of Visualization#

  • To broaden your understanding of the data.

  • To communicate results/conclusions to others.

Altogether, these goals emphasize the fact that visualizations aren’t a matter of making “pretty” pictures.

2. An Overview of Distributions#

  • The total frequency of all categories must sum to 100%

  • Total count should sum to the total number of datapoints if we’re using raw counts.
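The two properties above can be checked with a small sketch. The category values below are hypothetical and invented purely for illustration; they are not taken from the `wb` dataset.

```python
from collections import Counter

# Hypothetical categorical data, invented for illustration only.
continents = ["Africa", "Asia", "Asia", "Europe", "Africa", "Asia"]

counts = Counter(continents)                          # raw count per category
n = len(continents)                                   # total number of datapoints
proportions = {k: v / n for k, v in counts.items()}   # relative frequency per category

total_count = sum(counts.values())            # should equal n
total_proportion = sum(proportions.values())  # should equal 1 (i.e. 100%)
```

Whichever scale is plotted, raw counts or proportions, the bars describe the same distribution; only the axis units differ.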

Most of the time, we’re plotting the distribution of the data.

3. Variable Types Inform Plot Choice#

Recall the types of variables:

3.1 Qualitative Variables: Bar Plots#

import seaborn as sns # seaborn is typically given the alias sns
sns.countplot(data = wb, x = 'Continent')

3.2 Quantitative Variables: Box, Violin and Hist#

Box#

sns.boxplot(data=wb, y='Gross domestic product: % growth : 2016');

Violin#

sns.violinplot(data=wb, y='Gross domestic product: % growth : 2016');

Histograms#

sns.histplot(data=wb, x="Gross n...", stat="density")

Overlap#

sns.boxplot(data=wb, x="Continent", y='Gross n...')
sns.histplot(data=wb, x="Gross n...: 2016", hue="Hemisphere", stat="density")

4. Evaluating Histograms#

  • Skewness and Tails

    • Skewed left vs skewed right
    • Left tail vs right tail
  • Outliers

    • Using percentiles
  • Modes

    • Most commonly occurring data

4.1 Skewness and Tails#

Right Skew and Right Tail#

Left Skew and Left Tail#

4.2 Outliers#

Loosely speaking, an outlier is defined as a data point that lies an abnormally large distance away from other values.
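One way to make "abnormally large distance" concrete with percentiles is the common 1.5 × IQR rule. The sketch below is only one possible convention, not a definition from this note, and the values are hypothetical.

```python
import statistics

# Hypothetical values; 40.0 is deliberately far from the rest.
values = [2.1, 2.4, 2.5, 2.7, 3.0, 3.1, 3.3, 3.6, 40.0]

# 25th, 50th and 75th percentiles of the data.
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1  # interquartile range

# Assumed rule of thumb: flag points more than 1.5 * IQR outside the quartiles.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in values if v < lower or v > upper]
```

The 1.5 multiplier is a convention; a different threshold would flag a different set of points.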

4.3 Modes#

We describe a “mode” of a histogram as a peak in the distribution.
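As a rough illustration, peaks can be counted from a histogram's bin heights. The helper `count_peaks` and the bin heights below are hypothetical, and the sketch ignores flat plateaus.

```python
def count_peaks(bin_counts):
    """Count bins strictly taller than both neighbours (edges compare against 0)."""
    padded = [0] + list(bin_counts) + [0]
    return sum(
        1
        for left, mid, right in zip(padded, padded[1:], padded[2:])
        if mid > left and mid > right
    )

# Hypothetical bin heights for two histograms.
unimodal = [1, 4, 9, 5, 2]
bimodal = [2, 8, 3, 1, 6, 9, 4]

n_unimodal = count_peaks(unimodal)
n_bimodal = count_peaks(bimodal)
```

The number of peaks found this way depends on the bin width, so the same data can look unimodal with wide bins and multimodal with narrow ones.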

4.4 Challenge#

In this image, it’s hard to tell how many modes the distribution has. Ambiguities like this motivate us to consider Kernel Density Estimation (KDE).

5. KDE (Kernel Density Estimation)#

  • A kernel density estimate (KDE) is a smooth, continuous function that approximates a curve.

  • More formally, a KDE attempts to approximate the underlying probability distribution from which our dataset was drawn.

5.1 Constructing KDE#

  1. Place a kernel at each datapoint.

  2. Normalize the kernels to have a total area of 1 (across all kernels).

  3. Sum the normalized kernels.

Assume we want to compute a KDE for this dataset: [ 2.2, 2.8, 3.7, 5.3, 5.7 ]

Step 1: Place a kernel at each datapoint.

Step 2: Normalize the kernels to have a total area of 1.

Step 3: Sum the normalized kernels.
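The three steps can be sketched numerically for this dataset. The Gaussian kernel, the bandwidth `alpha = 1.0`, and the evaluation grid are assumptions made for illustration.

```python
import numpy as np

data = np.array([2.2, 2.8, 3.7, 5.3, 5.7])  # dataset from the text
alpha = 1.0                                  # bandwidth: an assumed value
grid = np.linspace(-4.0, 12.0, 1601)         # points at which the curves are evaluated

# Step 1: one Gaussian kernel per datapoint (rows = datapoints, columns = grid points).
diffs = grid[None, :] - data[:, None]
kernels = np.exp(-diffs**2 / (2 * alpha**2)) / np.sqrt(2 * np.pi * alpha**2)

# Step 2: each kernel has area 1, so scale each by 1/n to make the total area 1.
normalized = kernels / len(data)

# Step 3: sum the scaled kernels to get the density estimate on the grid.
kde = normalized.sum(axis=0)

# Riemann-sum check of the total area under the estimate (should be close to 1).
area = kde.sum() * (grid[1] - grid[0])
```

Plotting `grid` against `kde` would give the smooth curve; the area check confirms the result behaves like a probability density.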

5.2 Kernel Functions and Bandwidths#

A general KDE formula is given below:

$$f_{\alpha}(x) = \frac{1}{n} \sum_{i=1}^{n} K_{\alpha}(x, x_i)$$

This closely resembles a convolution.

  1. $K_{\alpha}(x, x_i)$ is the kernel centered on the observation $x_i$.

  2. $n$ is the number of observed datapoints that we have.

  3. Each $x_i \in \{x_1, x_2, \dots, x_n\}$ represents an observed datapoint.

The most common kernel is the Gaussian kernel.

$$K_{\alpha}(x, x_i) = \frac{1}{\sqrt{2\pi\alpha^2}} e^{-\frac{(x - x_i)^2}{2\alpha^2}}$$

  1. $\alpha$ is the bandwidth of the kernel, which plays the role of the standard deviation.

  2. $x_i$ is the center of the kernel, which plays the role of the mean.
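To see how the bandwidth changes the estimate, here is a minimal sketch of the Gaussian kernel as a normal density with mean $x_i$ and standard deviation $\alpha$. The function names `gaussian_kernel` and `kde` are hypothetical helpers, not library functions.

```python
import math

def gaussian_kernel(x, x_i, alpha):
    """Gaussian density with mean x_i and standard deviation alpha, evaluated at x."""
    return math.exp(-((x - x_i) ** 2) / (2 * alpha**2)) / math.sqrt(2 * math.pi * alpha**2)

def kde(x, data, alpha):
    """Average of the kernels centered on each observation."""
    return sum(gaussian_kernel(x, x_i, alpha) for x_i in data) / len(data)

data = [2.2, 2.8, 3.7, 5.3, 5.7]  # same dataset as in section 5.1

# A small bandwidth gives a tall, narrow bump at each datapoint;
# a large bandwidth spreads each kernel out and flattens the curve.
peak_narrow = kde(2.2, data, alpha=0.1)
peak_wide = kde(2.2, data, alpha=10.0)
```

A very small bandwidth tends to produce a spiky estimate that follows individual points, while a very large one smooths away most of the structure.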

Gaussian kernel KDE with bandwidths 0.1, 1.0, 2.0, and 10.0:

6. Multiple Quantitative Variables#

  • Up until now, we’ve discussed how to visualize single-variable distributions.

  • Going beyond this, we want to understand the relationship between pairs of numerical variables.

6.1 Scatter Plots#

import matplotlib.pyplot as plt # matplotlib.pyplot is typically given the alias plt
plt.scatter(wb["per c..."], wb['Adult l...'])

But this plot suffers from overplotting…

We can shrink the markers and add small random noise (jitter) to their positions.
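A minimal sketch of jittering follows. The data, the helper `jitter`, and the noise scale of 0.05 are all hypothetical choices for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical overlapping points: several observations share the same coordinates.
x = [1.0, 1.0, 1.0, 2.0, 2.0, 3.0]
y = [5.0, 5.0, 5.0, 6.0, 6.0, 7.0]

def jitter(values, scale=0.05):
    """Return a copy of values with uniform noise in [-scale, scale] added to each."""
    return [v + random.uniform(-scale, scale) for v in values]

x_jittered = jitter(x)
y_jittered = jitter(y)

# With matplotlib, a smaller marker size is set through the `s` argument, e.g.
# plt.scatter(x_jittered, y_jittered, s=5)
```

The noise scale should stay small relative to the spacing of the data, so that points separate visually without misrepresenting their values.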

6.2 Linear Plots#

sns.lmplot(data = wb, x = "per c...", y = "Adult l...")

6.3 Joint Plots#

sns.jointplot(data = wb, x = "per c...", y = "Adult l...")

6.4 Hex Plots#

sns.jointplot(data = wb, x = "per c...", y = "Adult l...", kind = "hex")

Hex plots can be thought of as two-dimensional histograms!
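The two-dimensional histogram idea can be sketched with NumPy. Note that this uses square bins rather than hexagons, and the data are randomly generated for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical paired measurements, generated only for illustration.
x = rng.normal(size=500)
y = x + rng.normal(scale=0.5, size=500)

# Count how many points fall into each cell of a 10 x 10 grid of square bins.
counts, x_edges, y_edges = np.histogram2d(x, y, bins=10)
```

Each entry of `counts` plays the same role as the colour of one hexagon: darker cells in a hex plot correspond to bins that contain more points.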

6.5 Contour Plots#

Contour plots can be thought of as two-dimensional KDEs!

sns.kdeplot(data = wb, x = "per c...", y = "Adult l...", fill = True)

7. Transformation#

As mentioned earlier, we want to reveal relationships between variables.

However, plotting the raw data alone is limiting: not every plot makes an association visible.

Consider the following plot.

We can try applying a transformation!

7.1 Making Transformation#

Step 1: Observe the plot

Step 2: Transform on X axis

plt.scatter(np.log(df["inc"]), df["lit"])

Step 3: Transform on Y axis

plt.scatter(np.log(df["inc"]), df["lit"]**4)

Step 4: Linear regression

from sklearn.linear_model import LinearRegression # To be discussed later
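Until `LinearRegression` is covered, the fitting step can be sketched with `np.polyfit`, used here as a plain least-squares stand-in rather than the method of this note. The `inc` and `lit` arrays are hypothetical, constructed so that `lit**4` is exactly linear in `log(inc)`.

```python
import numpy as np

# Hypothetical data built so that lit**4 is exactly linear in log(inc).
inc = np.array([100.0, 300.0, 1000.0, 3000.0, 10000.0])
lit = (2.0 * np.log(inc) + 5.0) ** 0.25

# Fit a straight line to the transformed variables (degree-1 least squares).
# np.polyfit returns coefficients from highest degree to lowest: slope, then intercept.
m, b = np.polyfit(np.log(inc), lit**4, deg=1)
```

Because the data were constructed from a slope of 2 and an intercept of 5, the fit should recover values very close to those; real data would leave some residual error.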

7.2 Inverting the Transformation#

$$y^4 = m(\log{x}) + b \quad \to \quad y = [m(\log{x}) + b]^{\frac{1}{4}}$$
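The inversion can be written as a small function. The helper `predict_y` and the values of `m` and `b` are hypothetical, chosen only so that the arithmetic is easy to check.

```python
import math

def predict_y(x, m, b):
    """Undo the fourth-power transform: y = (m * log(x) + b) ** (1/4)."""
    return (m * math.log(x) + b) ** 0.25

# Hypothetical slope and intercept from a fit on the transformed scale.
m, b = 2.0, 5.0
y_hat = predict_y(1000.0, m, b)

# Raising the prediction back to the fourth power recovers the linear expression.
linear_value = m * math.log(1000.0) + b
```

The fourth root is only real-valued when `m * log(x) + b` is non-negative, so predictions are meaningful only for inputs in that range.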

7.3 Tukey-Mosteller Bulge Diagram#

This diagram is a good guide when determining possible transformations.

8. Visualization Theory#

Remember, we had two goals for visualizing data. Visualization Theory is particularly important in:

  • Helping us understand the data and results,
  • Communicating our results and conclusions with others.

8.1 Information Channels#

Visualizations are able to convey information through various encodings.

Besides the encodings shown in the image, the relative position of marks is also an important channel.

Each visualization should contain at least one accurate channel.

Thus, don’t use pie charts anymore!

8.2 What is Good Encoding?#

  • No wrong information encoded
    • Example: Are abortion and cancer related?

❌

✔️

  • No redundant information encoded
    • Example: How did cases change from Mar 21, 2020 to May 6, 2020?

❌

✔️

  • Encode information linearly
    • Example: Showing the distribution of numerical strength in 2D.

❌

✔️

  • Encode information linearly
    • Larger numbers should be mapped to higher grayscale values.

❌

✔️

9. Summary#

Good visualizations are always made by intuitive and empathetic people.

DATA100 Note [4]
https://zivmax.top/posts/data100/data100-note-4/
Author
Zivmax
Published at
2024-04-03