DATA100 Note [4]


Encoding Information into Intuition

1. Goals of Visualization

  • To broaden your understanding of the data.

  • To communicate results/conclusions to others.

Altogether, these goals emphasize the fact that visualizations aren’t a matter of making “pretty” pictures.

2. An Overview of Distributions

A distribution describes how often each value of a variable occurs. It must satisfy two properties:

  • The frequencies of all categories must sum to 100%.

  • If we are using raw counts, the counts must sum to the total number of datapoints.

Most of the time, we are plotting the distribution of the data.
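
The two properties can be checked directly on a column of data. The sketch below uses a small made-up Series in place of a real column such as `wb['Continent']`; the values are an assumption for illustration only.

```python
import pandas as pd

# Hypothetical toy column standing in for a qualitative variable.
continents = pd.Series(["Africa", "Asia", "Asia", "Europe", "Africa", "Asia"])

counts = continents.value_counts()                     # raw count per category
proportions = continents.value_counts(normalize=True)  # relative frequency per category

# Raw counts add up to the number of datapoints.
total_count = int(counts.sum())
# Relative frequencies add up to 1, i.e. 100%.
total_proportion = float(proportions.sum())

print(total_count, round(total_proportion, 6))
```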

3. Variable Types Inform Plot Choice

Recall the types of variables: qualitative (nominal or ordinal) and quantitative (discrete or continuous).

3.1 Qualitative Variables: Bar Plots

import seaborn as sns # seaborn is typically given the alias sns
sns.countplot(data = wb, x = 'Continent')

3.2 Quantitative Variables: Box Plots, Violin Plots, and Histograms


sns.boxplot(data=wb, y='Gross domestic product: % growth : 2016');


sns.violinplot(data=wb, y='Gross domestic product: % growth : 2016');


sns.histplot(data=wb, x="Gross n...", stat="density")


sns.boxplot(data=wb, x="Continent", y='Gross n...')
sns.histplot(data=wb, x="Gross n...: 2016", hue="Hemisphere", stat="density")

4. Evaluating Histograms

  • Skewness and Tails

    • Skewed left vs skewed right
    • Left tail vs right tail
  • Outliers

    • Using percentiles
  • Modes

    • Most commonly occurring data

4.1 Skewness and Tails

Left skew: the distribution has a long left tail.

Right skew: the distribution has a long right tail.
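
A rough numeric check of skew direction is to compare the mean with the median: a long right tail tends to pull the mean above the median, and a long left tail tends to pull it below. The sketch below uses synthetic data (an assumption, not the `wb` dataset), and the helper `mean_minus_median` is a hypothetical name. It is a heuristic, not a formal test.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic samples: the exponential distribution has a long right tail,
# and its mirror image has a long left tail.
right_tailed = rng.exponential(scale=1.0, size=10_000)
left_tailed = -right_tailed

def mean_minus_median(values):
    """Positive suggests a long right tail, negative a long left tail."""
    return float(np.mean(values) - np.median(values))

print(mean_minus_median(right_tailed), mean_minus_median(left_tailed))
```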

4.2 Outliers

Loosely speaking, an outlier is defined as a data point that lies an abnormally large distance away from other values.
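
One common way to make "abnormally large distance" concrete with percentiles is the 1.5 × IQR rule, the same convention box plots usually use for their whiskers. The sketch below assumes that rule; the function name `flag_outliers`, the multiplier `k`, and the data are illustrative choices, not something fixed by this note.

```python
import numpy as np

def flag_outliers(values, k=1.5):
    """Return a boolean mask marking points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])  # first and third quartiles
    iqr = q3 - q1                             # interquartile range
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Hypothetical data with one value far from the rest.
data = np.array([2.1, 2.4, 2.5, 2.7, 2.9, 3.0, 3.2, 9.5])
mask = flag_outliers(data)
print(data[mask])
```

Whether a flagged point should be removed is a separate judgment call; the rule only marks candidates.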

4.3 Modes

We describe a “mode” of a histogram as a peak in the distribution.

4.4 Challenge

In this image, it is hard to tell how many modes the distribution has. It is these ambiguities that motivate us to consider using Kernel Density Estimation (KDE).

5. KDE (Kernel Density Estimation)

  • A kernel density estimate (KDE) is a smooth, continuous function that approximates a curve.

  • More formally, a KDE attempts to approximate the underlying probability distribution from which our dataset was drawn.

5.1 Constructing KDE

  1. Place a kernel at each datapoint.

  2. Normalize the kernels to have a total area of 1 (across all kernels).

  3. Sum the normalized kernels.

Assume we want to build a KDE for this dataset: [2.2, 2.8, 3.7, 5.3, 5.7]

Step 1: Place a kernel at each datapoint.

Step 2: Normalize the kernels to have a total area of 1.

Step 3: Sum the normalized kernels.
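
The three steps can be sketched in NumPy for the dataset above. This is a minimal illustration that assumes a Gaussian kernel with bandwidth 1.0; the grid `xs` and the helper `gaussian_kernel` are choices made here, not part of the original procedure.

```python
import numpy as np

points = np.array([2.2, 2.8, 3.7, 5.3, 5.7])
alpha = 1.0  # assumed bandwidth (standard deviation of each kernel)

def gaussian_kernel(x, center, alpha):
    """Gaussian density with mean `center` and standard deviation `alpha`."""
    return np.exp(-(x - center) ** 2 / (2 * alpha ** 2)) / np.sqrt(2 * np.pi * alpha ** 2)

xs = np.linspace(-5.0, 13.0, 2001)  # evaluation grid, wide enough to cover the tails

# Step 1: place one kernel at each datapoint (each kernel has area 1).
kernels = np.array([gaussian_kernel(xs, p, alpha) for p in points])

# Step 2: scale each kernel by 1/n so the total area across all kernels is 1.
normalized = kernels / len(points)

# Step 3: sum the normalized kernels to get the density estimate.
kde = normalized.sum(axis=0)

# Riemann-sum check that the estimate integrates to roughly 1.
dx = xs[1] - xs[0]
area = float(kde.sum() * dx)
print(round(area, 4))
```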

5.2 Kernel Functions and Bandwidths

A general “KDE formula” is given below:

$$f_{\alpha}(x) = \frac{1}{n} \sum_{i=1}^{n} K_{\alpha}(x, x_i)$$

This closely resembles a convolution: the same kernel is centered on every datapoint and the results are averaged.

  1. $K_{\alpha}(x, x_i)$ is the kernel centered on observation $i$.

  2. $n$ is the number of observed datapoints.

  3. Each $x_i \in \{x_1, x_2, \dots, x_n\}$ represents an observed datapoint.

The most common kernel is the Gaussian kernel.

$$K_{\alpha}(x, x_i) = \frac{1}{\sqrt{2\pi\alpha^2}} e^{-\frac{(x-x_i)^2}{2\alpha^2}}$$

  1. $\alpha$ is the bandwidth of the kernel, which plays the role of the standard deviation.

  2. $x_i$ is the center of the kernel, which plays the role of the mean.

Gaussian kernel KDE with bandwidths $0.1$, $1.0$, $2.0$, and $10.0$:
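
The effect of the bandwidth can also be sketched numerically. The code below evaluates a Gaussian KDE (standard Gaussian density with standard deviation equal to the bandwidth) on the five-point dataset from Section 5.1 and counts the local peaks of each curve. The grid and the helpers `kde_curve` and `count_peaks` are assumptions made for this illustration.

```python
import numpy as np

points = np.array([2.2, 2.8, 3.7, 5.3, 5.7])
xs = np.linspace(-40.0, 50.0, 9001)  # evaluation grid with step 0.01

def kde_curve(xs, points, alpha):
    """Average of Gaussian kernels with bandwidth alpha, one per datapoint."""
    diffs = xs[:, None] - points[None, :]
    kernels = np.exp(-diffs ** 2 / (2 * alpha ** 2)) / np.sqrt(2 * np.pi * alpha ** 2)
    return kernels.mean(axis=1)

def count_peaks(curve):
    """Count grid points that are strictly higher than both neighbours."""
    inner = curve[1:-1]
    return int(np.sum((inner > curve[:-2]) & (inner > curve[2:])))

peaks = {alpha: count_peaks(kde_curve(xs, points, alpha))
         for alpha in [0.1, 1.0, 2.0, 10.0]}
print(peaks)
```

With the smallest bandwidth every datapoint produces its own peak, while with the largest bandwidth the kernels merge into a single smooth bump. A small bandwidth follows the data closely but is noisy; a large one is smooth but can hide real structure.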

6. Multiple Quantitative Variables

  • Up until now, we’ve discussed how to visualize single-variable distributions.

  • Going beyond this, we want to understand the relationship between pairs of numerical variables.

6.1 Scatter Plots

plt.scatter(wb["per c..."], wb['Adult l...'])

However, this plot suffers from overplotting: many marks overlap, which hides how dense the points really are.

We can shrink the marks and add a small amount of random noise (jitter) to their positions.
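
Jittering can be sketched as adding small uniform noise to each coordinate. The data below is synthetic (rounded values that pile up on a 5 × 5 grid), and the helper `jitter` and the noise scale of 0.2 are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: 200 points that share only a few distinct coordinates.
x = rng.integers(0, 5, size=200).astype(float)
y = rng.integers(0, 5, size=200).astype(float)

def jitter(values, scale, rng):
    """Shift each value by uniform noise in [-scale, scale]."""
    return values + rng.uniform(-scale, scale, size=len(values))

x_jit = jitter(x, 0.2, rng)
y_jit = jitter(y, 0.2, rng)

# Before jittering, many marks sit exactly on top of each other.
distinct_before = len(set(zip(x, y)))
distinct_after = len(set(zip(x_jit, y_jit)))
print(distinct_before, distinct_after)
```

The jittered arrays can then be passed to `plt.scatter(x_jit, y_jit, s=5)`, where a small marker size `s` shrinks the marks. The noise should stay small relative to the spacing of the data so the plot is not misleading.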

6.2 Linear Plots

sns.lmplot(data = wb, x = "per c...", y = "Adult l...")

6.3 Joint Plots

sns.jointplot(data = wb, x = "per c...", y = "Adult l...")

6.4 Hex Plots

sns.jointplot(data = wb, x = "per c...", y = "Adult l...", kind = "hex")

Hex plots can be thought of as two-dimensional histograms!
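
The binning idea behind a two-dimensional histogram can be sketched with `np.histogram2d`. Note the assumption: this uses rectangular bins rather than the hexagonal bins of a hex plot, and the data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic pair of correlated quantitative variables.
x = rng.normal(size=1000)
y = 0.5 * x + rng.normal(scale=0.5, size=1000)

# Count how many points fall in each cell of a 10 x 10 grid.
counts, x_edges, y_edges = np.histogram2d(x, y, bins=10)

print(counts.shape, int(counts.sum()))
```

Each cell count plays the role of a bar height in a one-dimensional histogram; a hex plot shows the same kind of count through the color of each hexagon.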

6.5 Contour Plots

Contour plots can be thought of as two-dimensional KDEs!

sns.kdeplot(data = wb, x = "per c...", y = "Adult l...", fill = True)

7. Transformation

As noted above, we want to reveal relationships between variables.

However, plotting the raw data alone is limiting: not every plot makes an association visible.

Consider the following plot.

We can try applying a transformation!

7.1 Applying Transformations

Step 1: Observe the plot

Step 2: Transform the X axis

plt.scatter(np.log(df["inc"]), df["lit"])

Step 3: Transform the Y axis

plt.scatter(np.log(df["inc"]), df["lit"]**4)

Step 4: Linear regression

from sklearn.linear_model import LinearRegression # Covered in a later note

7.2 Inverting the Transformation

$$y^4 = m(\log{x}) + b \quad \to \quad y = [m(\log{x}) + b]^{\frac{1}{4}}$$
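
The fit-then-invert idea can be sketched end to end. The example below is an illustration under stated assumptions: the data is synthetic and noise-free, generated so that $y^4$ is exactly linear in $\log x$, and `np.polyfit` is swapped in for the `LinearRegression` class to keep the sketch dependency-light. The names `m_true`, `b_true`, and `predict` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): y^4 = m_true * log(x) + b_true, with no noise.
m_true, b_true = 2.0, 5.0
x = rng.uniform(1.0, 100.0, size=200)
y = (m_true * np.log(x) + b_true) ** 0.25

# Fit a straight line in the transformed space: y^4 against log(x).
m, b = np.polyfit(np.log(x), y ** 4, deg=1)

def predict(x_new):
    """Invert the y^4 transformation to predict y on the original scale."""
    return (m * np.log(x_new) + b) ** 0.25

print(round(float(m), 4), round(float(b), 4))
```

The fourth root is only defined on the reals when $m \log x + b \ge 0$, so predictions are meaningful only in the range of $x$ where that holds.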

7.3 Tukey-Mosteller Bulge Diagram

This diagram is a good guide when determining possible transformations.
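
Once the diagram suggests a direction, candidate transformations can be compared numerically. The sketch below is one possible way to do this, not a method prescribed by this note: it generates synthetic data in which $y$ grows like $\log x$ and compares the linear correlation of $y$ with several transformations of $x$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (assumption for illustration): y grows like log(x) plus small noise.
x = rng.uniform(1.0, 1000.0, size=500)
y = 3.0 * np.log(x) + rng.normal(scale=0.1, size=500)

# Candidate transformations of x that compress large values.
candidates = {
    "x": x,
    "sqrt(x)": np.sqrt(x),
    "log(x)": np.log(x),
}

# Pearson correlation between each transformed x and y.
scores = {name: float(np.corrcoef(values, y)[0, 1])
          for name, values in candidates.items()}

# The transformation with the strongest linear association.
best = max(scores, key=lambda name: abs(scores[name]))
print(best)
```

Correlation alone is a coarse criterion; plotting the transformed data remains the main check for linearity.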

8. Visualization Theory

Remember, we had two goals for visualizing data. Visualization Theory is particularly important in:

  • Helping us understand the data and results,
  • Communicating our results and conclusions with others.

8.1 Information Channels

Visualizations are able to convey information through various encodings.

Besides the channels shown in the image, the relative position of marks is also an important channel.

Each visualization should contain at least one accurate channel.

For this reason, avoid pie charts: they rely on angle and area, which are hard to judge accurately.

8.2 What is Good Encoding?

  • No wrong information encoded
    • Example: Are abortion and cancer related?



  • No redundant information encoded
    • Example: How did the number of cases change from Mar 21, 2020 to May 6, 2020?



  • Encode information linearly
    • Example: Showing the distribution of a numerical strength in 2D.
    • Larger numbers should be mapped to higher grayscale values.



9. Summary

Good visualizations are made by intuitive and empathetic people.