Visualization

Encoding Information into Intuition

1. Goals of Visualization

To broaden your understanding of the data.
To communicate results/conclusions to others.

Altogether, these goals emphasize the fact that visualizations aren’t a matter of making “pretty” pictures.

2. An Overview of Distributions

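A distribution describes how frequently each value of a variable occurs. Two properties always hold:
The total frequency of all categories must sum to 100%.
If we’re using raw counts, the total count must equal the total number of datapoints.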
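As a quick check of these two properties, here is a minimal sketch using only the standard library. The `continents` list is a made-up sample for illustration, not the `wb` dataset used in these notes.

```python
from collections import Counter

# Hypothetical sample of a qualitative variable (not taken from the wb dataset).
continents = ["Africa", "Asia", "Asia", "Europe", "Africa", "Asia"]

counts = Counter(continents)   # raw count per category
total = len(continents)        # total number of datapoints

# Relative frequency (proportion) of each category.
proportions = {category: count / total for category, count in counts.items()}

# Raw counts add up to the number of datapoints.
print(sum(counts.values()) == total)

# Proportions add up to 1 (that is, 100%), up to floating-point error.
print(abs(sum(proportions.values()) - 1.0) < 1e-9)
```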

Most of the time, we're plotting the distribution of the data.

3. Variable Types Inform Plot Choice

Recall the two broad types of variable, qualitative and quantitative. The type of a variable determines which plots suit it.

3.1 Qualitative Variables: Bar Plots

import seaborn as sns  # seaborn is typically given the alias sns
sns.countplot(data=wb, x='Continent')

3.2 Quantitative Variables: Box Plots, Violin Plots and Histograms

Box

sns.boxplot(data=wb, y='Gross domestic product: % growth : 2016');

Violin

sns.violinplot(data=wb, y='Gross domestic product: % growth : 2016');

Histograms

sns.histplot(data=wb, x="Gross n...", stat="density")
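With `stat="density"`, the bar heights are scaled so that the total area of the bars is 1 rather than showing raw counts. The sketch below reproduces that scaling by hand on made-up numbers. The helper `density_histogram` and the `growth` values are hypothetical and only illustrate the arithmetic; they are not seaborn's implementation.

```python
def density_histogram(values, num_bins):
    """Return (bin_width, heights), with heights scaled so total bar area is 1.

    Assumes the values are not all identical (otherwise the bin width is 0).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # The maximum value would land one past the last bin, so clamp it.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    n = len(values)
    # Density height = count / (n * bin width), so that sum(height * width) == 1.
    heights = [c / (n * width) for c in counts]
    return width, heights


# Hypothetical growth percentages, for illustration only.
growth = [-1.5, 0.3, 1.1, 1.9, 2.4, 2.8, 3.0, 3.6, 4.2, 7.5]
width, heights = density_histogram(growth, num_bins=5)
area = sum(h * width for h in heights)
print(round(area, 6))
```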

Overlap

sns.boxplot(data=wb, x="Continent", y='Gross n...')
sns.histplot(data=wb, x="Gross n...: 2016", hue="Hemisphere", stat="density")

4. Evaluating Histograms

Skewness and Tails
- Skewed left vs skewed right
- Left tail vs right tail
Outliers
- Using percentiles
Modes
- Most commonly occurring data

4.1 Skewness and Tails

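Skewed right: the distribution has a long right tail, with the bulk of the data on the left.

Skewed left: the distribution has a long left tail, with the bulk of the data on the right.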
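A rough numeric check for skew direction is to compare the mean with the median, since a long tail tends to pull the mean toward it. This heuristic is an assumption added here for illustration. It is not part of the notes above and can fail on unusual distributions, so treat it as a hint rather than a definition. The function name and sample lists are hypothetical.

```python
from statistics import mean, median


def skew_direction(values):
    """Guess the skew direction by comparing mean and median (heuristic only)."""
    m, med = mean(values), median(values)
    if m > med:
        return "right"   # mean pulled up by a long right tail
    if m < med:
        return "left"    # mean pulled down by a long left tail
    return "symmetric"


# Made-up samples: one with a long right tail, one with a long left tail.
right_tailed = [1, 1, 2, 2, 3, 20]
left_tailed = [-20, 1, 2, 2, 3, 3]

print(skew_direction(right_tailed))
print(skew_direction(left_tailed))
```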

4.2 Outliers

Loosely speaking, an outlier is defined as a data point that lies an abnormally large distance away from other values.
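One common way to make "abnormally large distance" concrete with percentiles is the 1.5 × IQR rule: flag any point more than 1.5 interquartile ranges below the 25th percentile or above the 75th. The notes do not prescribe this rule, so the threshold of 1.5 is an assumed convention, and the helper name and sample are hypothetical.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the middle 50%.

    k=1.5 is a common convention, assumed here; it is not fixed by the notes.
    """
    q1, _, q3 = quantiles(values, n=4)   # 25th, 50th, 75th percentiles
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Made-up sample with one value far from the rest.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
outliers = iqr_outliers(sample)
print(outliers)
```

Note that `statistics.quantiles` offers more than one interpolation method, so the exact cutoffs can differ slightly from those of other tools.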

4.3 Modes

We describe a “mode” of a histogram as a peak in the distribution.

4.4 Challenge

In some histograms, it's hard to tell how many modes there are: a small bump may be a genuine peak or just noise. It is these ambiguities that motivate us to consider using Kernel Density Estimation (KDE).
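One way to see the ambiguity is to count the peaks of the same data under different bin counts. The sketch below uses made-up values and two hypothetical helpers, `histogram_counts` and `count_peaks`. It only illustrates that the apparent number of modes depends on the binning.

```python
def histogram_counts(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins   # assumes the values are not all identical
    counts = [0] * num_bins
    for v in values:
        # Clamp so the maximum value lands in the last bin.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return counts


def count_peaks(heights):
    """Count local maxima in a sequence of bar heights (a plateau counts once)."""
    # Collapse runs of equal heights so a flat-topped peak is counted once.
    collapsed = [h for i, h in enumerate(heights) if i == 0 or h != heights[i - 1]]
    padded = [float("-inf")] + collapsed + [float("-inf")]
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i] > padded[i - 1] and padded[i] > padded[i + 1]
    )


# Made-up data with two clusters, around 1 and around 5.
values = [1, 1.2, 1.1, 2, 5, 5.1, 5.2, 4.9]

peaks_fine = count_peaks(histogram_counts(values, num_bins=4))
peaks_coarse = count_peaks(histogram_counts(values, num_bins=1))
print(peaks_fine, peaks_coarse)
```

With four bins the two clusters show up as separate peaks, while a single bin hides them.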

5. KDE (Kernel Density Estimation)

A kernel density estimate (KDE) is a smooth, continuous function that approximates a curve.
More formally, a KDE attempts to approximate the underlying probability distribution from which our dataset was drawn.

5.1 Constructing KDE

Place a kernel at each datapoint.
Normalize the kernels to have a total area of 1 (across all kernels).
Sum the normalized kernels.

Suppose we want to construct a KDE for this dataset: [ 2.2, 2.8, 3.7, 5.3, 5.7 ]

Step 1: Place a kernel at each datapoint.

Step 2: Normalize the kernels to have a total area of 1.

Step 3: Sum the normalized kernels.
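The three steps can be sketched in plain Python for the dataset above. A Gaussian kernel with bandwidth 1.0 is assumed here purely for illustration, since the steps themselves do not fix a kernel or a bandwidth. The names `gaussian_kernel` and `kde` are hypothetical.

```python
import math


def gaussian_kernel(x, center, alpha):
    """Gaussian bump centered at `center` with standard deviation `alpha` (area 1)."""
    exponent = -((x - center) ** 2) / (2 * alpha ** 2)
    return math.exp(exponent) / math.sqrt(2 * math.pi * alpha ** 2)


data = [2.2, 2.8, 3.7, 5.3, 5.7]
alpha = 1.0  # assumed bandwidth, chosen only for this illustration


def kde(x):
    # Step 1: place a kernel at each datapoint and evaluate it at x.
    kernel_values = [gaussian_kernel(x, x_i, alpha) for x_i in data]
    # Step 2: scale each kernel by 1/n so the areas total 1 across all kernels.
    normalized = [k / len(data) for k in kernel_values]
    # Step 3: sum the normalized kernels.
    return sum(normalized)


# Riemann-sum check that the estimate integrates to roughly 1.
step = 0.01
grid = [-10 + i * step for i in range(3001)]  # covers -10 to 20
area = sum(kde(x) for x in grid) * step
print(round(area, 3))
```

Each Gaussian kernel already has area 1, so dividing by the number of datapoints is what brings the total area back to 1.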

5.2 Kernel Functions and Bandwidths

A general KDE formula is given below:

$f_{\alpha}(x) = \frac{1}{n} \sum_{i=1}^{n} K_{\alpha}(x, x_i)$

This closely resembles a convolution: each datapoint contributes one copy of the kernel, and the copies are averaged.

$K_{\alpha}(x, x_i)$ is the kernel centered on the $i$-th observation.
$n$ is the number of observed datapoints that we have.
Each $x_i \in \{x_1, x_2,\dots, x_n \}$ represents an observed datapoint.

The most common kernel is the Gaussian kernel.

$K_{\alpha}(x, x_i) = \frac{1}{\sqrt{2\pi\alpha^2}} e^{-\frac{(x-x_i)^2}{2\alpha^2}}$

$\alpha$ is the bandwidth of the kernel; for the Gaussian kernel it plays the role of the standard deviation.
$x_i$ is the center of the kernel; for the Gaussian kernel it plays the role of the mean.

Gaussian kernel KDE with bandwidths $0.1$, $1.0$, $2.0$ and $10.0$:
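To get a feel for the effect of the bandwidth without a plot, the sketch below evaluates a Gaussian KDE of the example dataset on a grid and counts its local maxima for each bandwidth. The helpers `gaussian_kde` and `count_modes`, the grid range of 0 to 8, and the grid resolution are all assumptions made for this illustration.

```python
import math


def gaussian_kde(x, data, alpha):
    """Average of Gaussian kernels with bandwidth alpha centered on each datapoint."""
    norm = math.sqrt(2 * math.pi * alpha ** 2)
    total = sum(math.exp(-((x - x_i) ** 2) / (2 * alpha ** 2)) / norm for x_i in data)
    return total / len(data)


def count_modes(data, alpha, lo=0.0, hi=8.0, steps=800):
    """Count strict interior local maxima of the KDE on an evenly spaced grid."""
    ys = [gaussian_kde(lo + i * (hi - lo) / steps, data, alpha) for i in range(steps + 1)]
    return sum(1 for i in range(1, steps) if ys[i - 1] < ys[i] > ys[i + 1])


data = [2.2, 2.8, 3.7, 5.3, 5.7]
mode_counts = {alpha: count_modes(data, alpha) for alpha in (0.1, 1.0, 2.0, 10.0)}
print(mode_counts)
```

A small bandwidth tends to give a spiky estimate with a peak near every datapoint, while a large bandwidth smooths the estimate toward a single broad bump.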

6. Multiple Quantitative Variables

Up until now, we’ve discussed how to visualize single-variable distributions.
Going beyond this, we want to understand the relationship between pairs of numerical variables.