# 7  Exploratory data analysis

Abstract. This chapter explains how to use data analysis and visualization techniques to understand and communicate the structure and story of our data. It first introduces the reader to exploratory statistics and data visualization in R and Python. Then, it discusses how unsupervised machine learning, in particular clustering and dimensionality reduction techniques, can be used to group similar cases or to decrease the number of features in a dataset.

Keywords. descriptive statistics, visualization, unsupervised machine learning, clustering, dimensionality reduction

Objectives:

• Be able to conduct an exploratory data analysis
• Understand the principles of unsupervised machine learning
• Be able to conduct a cluster analysis
• Be able to apply dimension reduction techniques

In this chapter we use the R packages tidyverse, glue, maps and factoextra for data analysis and visualization. For Python we use pandas and numpy for data analysis and matplotlib, seaborn and geopandas for visualization. Additionally, in Python we use scikit-learn and scipy for cluster analysis and bioinfokit for visualizing some of its results. You can install these packages with the code below if needed (see Section 1.4 for more details):

!pip3 install pandas matplotlib seaborn geopandas
!pip3 install scikit-learn scipy bioinfokit
!pip3 install descartes
install.packages(c("tidyverse", "glue", "maps",
"factoextra"))

After installing, you need to import (activate) the packages every session:

# General packages
import itertools
import pandas as pd
import numpy as np

# Packages for visualizing
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd

# Packages for clustering
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering
import scipy.cluster.hierarchy as sch
from sklearn.decomposition import PCA
import bioinfokit.visuz
library(tidyverse)
library(glue)
library(maps)
library(factoextra)

## 7.1 Simple Exploratory Data Analysis

Now that you are familiar with data structures (Chapter 5) and data wrangling (Chapter 6) you are probably eager to get some real insights into your data beyond the basic techniques we briefly introduced in Chapter 2.

As we outlined in Chapter 1, the computational analysis of communication can be bottom-up or top-down, inductive or deductive. Just as in traditional research methods (Bryman, 2012), an inductive bottom-up approach is sometimes a goal in itself: after all, explorative analyses are invaluable for generating hypotheses that can be tested in follow-up research. But even when you are conducting a deductive, hypothesis-testing study, it is a good idea to start by describing your dataset using the tools of exploratory data analysis to get a better picture of your data. In fact, we could even go as far as saying that obtaining details like frequency tables, cross-tabulations, and summary statistics (mean, median, mode, etc.) is always necessary, even if your research questions or hypotheses require further complex analysis. For the computational analysis of communication, a significant amount of time may actually be invested at this stage.

Exploratory data analysis (EDA), as originally conceived by Tukey (1977), can be a very powerful framework to prepare and evaluate data, as well as to understand its properties and generate insights at any stage of your research. It is mandatory to do some EDA before any sophisticated analysis to know if the data is clean enough, if there are missing values and outliers, and how the distributions are shaped. Furthermore, before making any multivariate or inferential analysis we might want to know the specific frequencies for each variable, their measures of central tendency, their dispersion, and so on. We might also want to integrate frequencies of different variables into a single table to have an initial picture of their interrelations.

To illustrate how to do this in R and Python, we will use existing representative survey data to analyze how support for migrants or refugees in Europe changes over time and differs per country. The Eurobarometer (freely available at the Leibniz Institute for the Social Sciences – GESIS) has contained these specific questions since 2015. We might pose questions about the variation of a single variable, or describe the covariation of different variables to find patterns in our data. In this section, we will compute basic statistics to answer these questions, and in the next section we will visualize them by plotting within- and between-variable behavior of a selected group of features of the Eurobarometer conducted in November 2017 among 33193 Europeans.

For most of the EDA we will use tidyverse in R and pandas as well as numpy and scipy in Python (Example 7.1). After loading a clean version of the survey data stored in a csv file (using the tidyverse function read_csv in R and the pandas function read_csv in Python) and checking the dimensions of our data frame (33193 x 17), we probably want to get a global picture of each of our variables by getting a frequency table. This table shows the frequency of different outcomes for every case in a distribution. This means that we can know how many cases we have for each number or category in the distribution of every variable, which is useful to gain an initial understanding of our data.

In this book, we use pandas data frames a lot: they make our lives easier compared to native data types (Section 3.1), and they already integrate a lot of functionality of underlying math and statistics packages such as numpy and scipy. However, you do not have to force your data into a data frame if a different structure makes more sense in your script. numpy and scipy will happily calculate the mean, median, skewness, and kurtosis of the values in a list, or the correlation between two lists. It's up to you.
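For instance, the summary statistics just mentioned can be computed directly on plain Python lists, without any data frame. A minimal sketch (the variable names and values here are made up for illustration):

```python
import numpy as np
from scipy import stats

# Two hypothetical variables stored in plain lists
support = [1, 2, 2, 3, 4, 4, 4, 5]
age = [21, 34, 28, 45, 52, 40, 39, 60]

print(np.mean(support))                 # mean
print(np.median(support))               # median
print(stats.skew(support))              # skewness
print(stats.kurtosis(support))          # (excess) kurtosis
print(np.corrcoef(support, age)[0, 1])  # Pearson correlation
```

numpy and scipy accept the lists as-is; a pandas data frame only becomes useful once you need labeled columns, grouping, or joins.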

Example 7.1 Load data from Eurobarometer survey and select some variables

url = "https://cssbook.net/d/eurobarom_nov_2017.csv"
d2 = pd.read_csv(url)
print("Shape of my filtered data =", d2.shape)
Shape of my filtered data = (33193, 17)
print(d2.columns)
Index(['survey', 'uniqid', 'date', 'country', 'marital_status', 'educational',
'gender', 'age', 'occupation', 'type_community',
'household_composition', 'support_refugees', 'support_migrants',
'date_n', 'support_refugees_n', 'support_migrants_n', 'educational_n'],
dtype='object')
url = "https://cssbook.net/d/eurobarom_nov_2017.csv"
d2 = read_csv(url)
glue("{nrow(d2)} rows x {ncol(d2)} columns")
33193 rows x 17 columns
colnames(d2)
 [1] "survey"                "uniqid"                "date"
[4] "country"               "marital_status"        "educational"
[7] "gender"                "age"                   "occupation"
[10] "type_community"        "household_composition" "support_refugees"
[13] "support_migrants"      "date_n"                "support_refugees_n"
[16] "support_migrants_n"    "educational_n"        

Let us first get the distribution of the categorical variable gender by creating tables that include absolute and relative frequencies. The frequency tables (using the dplyr functions group_by and summarize in R, and the pandas method value_counts in Python) reveal that 17716 (53.37%) women and 15477 (46.63%) men answered this survey (Example 7.2). We can do the same with the level of support for refugees [support_refugees] ("To what extent do you agree or disagree with the following statement: our country should help refugees") and find that 4957 (14.93%) respondents totally agreed with this statement, 12695 (38.25%) tended to agree, 5391 (16.24%) tended to disagree and 3574 (10.77%) totally disagreed.

Example 7.2 Absolute and relative frequencies of support of refugees and gender.

print(d2["gender"].value_counts())
Woman    17716
Man      15477
Name: gender, dtype: int64
print(d2["gender"].value_counts(normalize=True))
Woman    0.533727
Man      0.466273
Name: gender, dtype: float64
d2 %>%
group_by(gender) %>%
summarise(frequency = n()) %>%
mutate(rel_freq = frequency / sum(frequency))   
# A tibble: 2 × 3
gender frequency rel_freq
<chr>      <int>    <dbl>
1 Man        15477    0.466
2 Woman      17716    0.534
print(d2["support_refugees"].value_counts())
Tend to agree       12695
Tend to disagree     5391
Totally agree        4957
Totally disagree     3574
Name: support_refugees, dtype: int64
print(d2["support_refugees"].value_counts(normalize=True, dropna=False))
Tend to agree       0.382460
NaN                 0.198114
Tend to disagree    0.162414
Totally agree       0.149339
Totally disagree    0.107673
Name: support_refugees, dtype: float64
d2 %>%
group_by(support_refugees) %>%
summarise(frequency = n()) %>%
mutate(rel_freq = frequency / sum(frequency)) 
# A tibble: 5 × 3
support_refugees frequency rel_freq
<chr>                <int>    <dbl>
1 Tend to agree        12695    0.382
2 Tend to disagree      5391    0.162
3 Totally agree         4957    0.149
4 Totally disagree      3574    0.108
5 <NA>                  6576    0.198

Before diving any further into between-variable analysis, you might have noticed that there are some missing values in the data. Such values make up an important share of the data in many real social and communication analyses (just remember that you cannot be forced to answer every question in a telephone or face-to-face survey!). From a statistical point of view, there are many approaches to address missing values: for example, we can drop either the rows or columns that contain any of them, or we can impute the missing values by predicting them based on their relation with other variables – as we did in Section 6.2 by replacing the missing values with the column mean. It goes beyond the scope of this chapter to explain all the imputation methods (and, in fact, mean imputation has some serious drawbacks when used in subsequent analysis), but at least we need to know how to identify the missing values in our data and how to drop the cases that contain them from our dataset.
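The difference between the two strategies can be seen on a toy variable; a minimal sketch with made-up numbers, not the Eurobarometer data:

```python
import numpy as np
import pandas as pd

# A toy variable with two missing values (made-up numbers)
s = pd.Series([4.0, np.nan, 2.0, 5.0, np.nan, 3.0])

print(s.isna().sum())          # count the missing values: 2
dropped = s.dropna()           # strategy 1: drop the missing cases
imputed = s.fillna(s.mean())   # strategy 2: impute with the column mean

print(len(dropped))            # 4 cases remain after dropping
print(imputed.mean())          # mean imputation leaves the mean unchanged...
print(imputed.std() < s.std()) # ...but artificially shrinks the variance
```

The last line illustrates one of the drawbacks of mean imputation mentioned above: the imputed series is less dispersed than the observed data.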

In the case of the variable support_refugees we can count its missing data (6576 cases) with the base R function is.na and the pandas method isna. Then we may decide to drop all the records that contain these values using the tidyr function drop_na in R and the pandas function dropna in Python (Example 7.3). By doing this we get a cleaner dataset and can continue with a more sophisticated EDA with cross-tabulation and summary statistics for groups of cases.

Example 7.3 Drop missing values

n_miss = d2["support_refugees"].isna().sum()
print(f"# of missing values: {n_miss}")
# of missing values: 6576
d2 = d2.dropna()
print(f"Shape after dropping NAs: {d2.shape}")
Shape after dropping NAs: (23448, 17)
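The cross-tabulations mentioned above can be computed with the pandas function crosstab. A minimal sketch on a made-up micro-dataset (the column names mirror those in d2, but the values are invented for illustration):

```python
import pandas as pd

# Made-up micro-dataset with the same column names as d2
toy = pd.DataFrame({
    "gender": ["Woman", "Man", "Woman", "Man", "Woman"],
    "support_refugees": ["Tend to agree", "Totally agree", "Tend to agree",
                         "Tend to disagree", "Totally agree"],
})

# Absolute counts, and row-wise shares per gender
counts = pd.crosstab(toy["gender"], toy["support_refugees"])
shares = pd.crosstab(toy["gender"], toy["support_refugees"], normalize="index")
print(counts)
print(shares.round(2))
```

With normalize="index" each row sums to 1, so the shares can be read as the distribution of support within each gender.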
cor.test(support_combined$refugees,
    support_combined$migrants, method = "pearson")

	Pearson's product-moment correlation

data:  support_combined$refugees and support_combined$migrants
t = 9.0133, df = 8, p-value = 1.833e-05
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8127522 0.9893855
sample estimates:
      cor
0.9541243

Another useful representation is the heatmap. This figure can help you plot a continuous variable using a color scale and show its relation with another two variables. This means that you represent your data as colors, which might be useful for understanding patterns. For example, we may wonder what the level of support for refugees is given the nationality and the gender of the individuals. For this visualization, it is necessary to create a proper data frame (Example 7.14) to plot the heatmap, in which each value of your continuous variable support_refugees_n is included in a table where each axis (x = gender, y = country) represents a categorical variable. This pivoted table, stored in an object called pivot_data, can be generated using some of the commands explained above.

Example 7.14 Create a data frame to plot the heatmap

pivot_data = pd.pivot_table(
    d2, values="support_refugees_n", index=["country"], columns="gender"
)
pivot_data = d2 %>%
    select(gender, country, support_refugees_n) %>%
    group_by(country, gender) %>%
    summarise(score = mean(support_refugees_n))

In the figure produced by Example 7.15, the lighter the blue, the greater the support in each combination of country $$\times$$ gender. You can see that the level of support is similar in countries such as Slovenia or Spain, and differs in the Czech Republic or Austria. It also seems that women have a higher level of support. For this default heatmap we can use the ggplot2 function geom_tile in R and the seaborn function heatmap in Python. To personalize the scale colors (e.g. if we want a scale of blues) we can use the ggplot2 function scale_fill_gradient in R or the parameter cmap of the seaborn function heatmap in Python.

Example 7.15 Heatmap of country, gender, and support for refugees

plt.figure(figsize=(10, 6))
sns.heatmap(pivot_data, cmap="Blues", cbar_kws={"label": "support_refugees_n"})
plt.show()
ggplot(pivot_data, aes(x = gender, y = fct_rev(country), fill = score)) +
    geom_tile() +
    scale_fill_gradient2(low="white", high="blue")

As you will notice, one of the goals of EDA is exploring the variance of our variables, which includes some uncertainty about their behavior. We will introduce you to two basic plots to visually communicate this uncertainty. Firstly, ribbons and area plots can help us clearly mark a predefined interval around a variable in order to interpret its variance over some cases. Let us set this interval at 0.15 points in the above-mentioned plots of the average support for refugees or migrants by day: we can see that the lines tend to converge on the very last day and are more separated by day four. This simple representation can be produced in R using the ggplot2 function geom_ribbon and in Python using the parameter ci of the seaborn function lineplot.

Example 7.16 Add ribbons to the graph lines of support for refugees and migrants

sns.lineplot(
    x="date_n",
    y="support_refugees_n",
    data=d2,
    color="blue",
    ci=100,
    label="Refugees",
)
sns.lineplot(
    x="date_n",
    y="support_migrants_n",
    data=d2,
    color="red",
    ci=100,
    label="Migrants",
)
plt.xlabel("Day")
plt.ylabel("Level of support")
plt.title("Support for refugees and migrants")
plt.show()
ggplot(support_long, aes(x=date_n, y=support, color=group)) +
    geom_line(size=1.5) +
    geom_ribbon(aes(fill=group, ymin=support-0.15, ymax=support+0.15),
        alpha=.1, lty=0) +
    ggtitle("Support for refugees and migrants")

### 7.2.3 Plotting Geospatial Data

Plotting geospatial data is a more powerful tool to compare countries or other regions. Maps are very easy to understand and can have great impact on all kinds of readers, which makes them a useful representation for a wide range of studies that any computational analyst has to deal with. Geospatial data is based on the specific location of any country, region, city or geographical area, marked by its coordinates, latitude and longitude, which can then be used to build points and polygon areas. Coordinates are normally mandatory to plot any data on a map, but are not always provided in our raw data. In those cases, we must join the geographical information we have (i.e. the name of a country) with its coordinates in order to have an accurate data frame for plotting geospatial data. Some libraries in R and Python might directly read and interpret different kinds of geospatial information by recognizing strings such as "France" or "Paris", but in the end these will be converted into coordinates.

Using the very same data as our example, we might want to plot on a map the level of support for refugees by country. Firstly, we should create a data frame with the average level of support for refugees by country (supports_country). Secondly, we must install an existing library that provides accurate geospatial information. In the case of R, we recommend the package maps, which contains the function map_data that helps you generate an object with geospatial information on specific areas, countries or regions that can be easily read and plotted by ggplot2. Even if not explained in this book, we also recommend ggmap in R (Kahle and Wickham, 2013). When working with Python we recommend geopandas, which works very well with pandas and matplotlib (it will also need some additional packages such as descartes). In Example 7.17 we illustrate how to plot a world map (from existing geographical information). We then save a partial map into the object some_eu_maps containing the European countries that participated in the survey. After that, we merge supports_country and some_eu_maps (by region) and get a complete data frame called support_map with coordinates for each country (Example 7.18). Finally, we plot it using the ggplot2 function geom_polygon in R and the geopandas method plot in Python (Example 7.19). Voilà: a nice and comprehensible representation of our data with a scale of colors!

Example 7.17 Simple world map

supports_country = (
    d2.groupby(["country"])["support_refugees_n"]
    .mean()
    .to_frame()
    .reset_index()
)
# Load a world map and plot it
wmap = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
wmap = wmap.rename(columns={"name": "country"})
wmap.plot()
supports_country = d2 %>%
    group_by(country) %>%
    summarise(m=mean(support_refugees_n, na.rm=TRUE))
# Load a world map and plot it
wmap = map_data("world")
ggplot(wmap, aes(x=long, y=lat, group=group)) +
    geom_polygon(fill="lightgray", colour = "white")

Example 7.18 Select EU countries and join the map with Eurobarometer data

countries = [
    "Portugal", "Spain", "France", "Germany", "Austria", "Belgium",
    "Netherlands", "Ireland", "Denmark", "Poland", "UK", "Latvia",
    "Cyprus", "Croatia", "Slovenia", "Hungary", "Slovakia",
    "Czech republic", "Greece", "Finland", "Italy", "Luxemburg",
    "Sweden", "Bulgaria", "Estonia", "Lithuania", "Malta", "Romania",
]
m = wmap.loc[wmap["country"].isin(countries)]
m = pd.merge(supports_country, m, on="country")
countries = c(
    "Portugal", "Spain", "France", "Germany", "Austria", "Belgium",
    "Netherlands", "Ireland", "Denmark", "Poland", "UK", "Latvia",
    "Cyprus", "Croatia", "Slovenia", "Hungary", "Slovakia",
    "Czech republic", "Greece", "Finland", "Italy", "Luxemburg",
    "Sweden", "Bulgaria", "Estonia", "Lithuania", "Malta", "Romania")
m = wmap %>%
    rename(country=region) %>%
    filter(country %in% countries) %>%
    left_join(supports_country, by="country")

Example 7.19 Map of Europe with the average level of support for refugees by country

m = gpd.GeoDataFrame(m, geometry=m["geometry"])
m.plot(
    column="support_refugees_n",
    legend=True,
    cmap="OrRd",
    legend_kwds={"label": "Level of support"},
).set_title("Support for refugees by country")
ggplot(m, aes(long, lat, group=group)) +
    geom_polygon(aes(fill = m), color="white") +
    scale_fill_viridis_c(option="B") +
    labs(title="Support for refugees by country", fill="Level of support")

### 7.2.4 Other Possibilities

There are many other ways of visualizing data. For EDA we have covered in this chapter only some of the most-used techniques, but they might still be limited for your future work. There are many books that cover data visualization in detail, such as Tufte (2006), Cairo (2019), and Kirk (2016). There are also many online resources, such as the Python Graph Gallery and the R Graph Gallery, which introduce you to other useful plot types. These sites include code examples, many using the ggplot, matplotlib and seaborn packages introduced here, but also other packages such as bokeh or plotly for interactive plots.

## 7.3 Clustering and Dimensionality Reduction

So far, we have reviewed traditional statistical exploratory and visualization techniques that any social scientist should be able to apply. A more computational next step in your EDA workflow is using machine learning (ML) to let your computer "learn" about your data and in turn give you more initial insights. ML is a branch of artificial intelligence that uses algorithms to interact with data and obtain some patterns or rules that characterize that data. We normally distinguish between supervised machine learning (SML) and unsupervised machine learning (UML). In Chapter 8, we will come back to this distinction. For now, it may suffice to say that the main characteristic of unsupervised methods is that we do not have any measurement available for a dependent variable, label, or categorization that we want to predict. Instead, we want to identify patterns in the data without knowing in advance what these may look like. In this, unsupervised machine learning is very much an inductive, bottom-up technique (see Chapter 1 and Boumans and Trilling (2016)).

In this chapter, we will focus on UML as a means of finding groups and latent dimensions in our data, which can also help to reduce our number of variables. Specifically, we will use base R and Python's scikit-learn to conduct $$k$$-means clustering, hierarchical clustering, and principal component analysis (PCA), as well as the closely related singular value decomposition (SVD).

In data mining, clustering is a UML technique that uses a set of descriptive variables to identify underlying groups in our data, which we will call clusters. Imagine we want to explore how European countries can be grouped based on their average support for refugees/migrants, age and educational level. We might create some a priori groups (such as southern versus northern countries), but cluster analysis would be a great method to let the data "talk" and then create the most appropriate groups for this specific case. As in all UML, the groups will come unlabeled and the computational analyst will be in charge of finding an appropriate and meaningful label for each cluster to better communicate the results.

### 7.3.1 $$k$$-means Clustering

$$k$$-means is a very frequently used algorithm to perform cluster analysis. Its main advantage is that, compared to the hierarchical clustering methods we will discuss later, it is very fast and does not consume many resources. This makes it especially useful for larger datasets. $$k$$-means cluster analysis is a method that takes any number of observations (cases) and groups them into a given number of clusters based on the proximity of each observation to the mean of the formed cluster (centroid). Mathematically, we measure this proximity as the distance of any given point to its cluster center, which can be expressed as:

$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \|x_n - \mu_k\|^2$$

where $$\|x_n - \mu_k\|$$ is the distance between the data point $$x_n$$ and the center of the cluster $$\mu_k$$. Instead of taking the mean, some variations of this algorithm take the median ($$k$$-medians) or a representative observation, also called a medoid ($$k$$-medoids or partitioning around medoids, PAM) as a way to optimize the initial method.

Because $$k$$-means clustering calculates distances between cases, these distances need to be meaningful – which is only the case if the scales on which the variables are measured are comparable. If all your variables are measured on the same (continuous) scale with the same endpoints, you may be fine. In most cases, you need to normalize your data by transforming them into, for instance, $$z$$-scores, or a scale from 0 to 1. Hence, the first thing we do in our example is to prepare a proper dataset with only continuous variables, scaling the data (for comparability) and avoiding missing values (drop or impute). In Example 7.20 we will use the variables support for refugees (support_refugees_n), support for migrants (support_migrants_n), age (age) and educational level measured in years of education (educational_n), and will create a data frame d3 with the mean of all these variables for each country (each observation will be a country).

$$k$$-means requires us to specify the number of clusters, $$k$$, in advance. This is a tricky question, and (besides arbitrarily deciding $$k$$!), you essentially need to re-estimate your model multiple times with different $$k$$s. The simplest method to obtain the optimal number of clusters is to estimate the variability within the groups for different runs. This means that we must run $$k$$-means for different numbers of clusters (e.g. 1 to 15 clusters) and then choose the smallest number of clusters beyond which the variability hardly decreases anymore. When you generate and plot a vector with the variability, or more technically, the within-cluster sum of squares (WSS), obtained after each execution, it is easy to identify the optimal number: just look at the bend (knee or elbow) to find the point after which the WSS decreases much more slowly (three clusters in our example).

Example 7.20 Getting the optimal number of clusters

# Average variables by country and scale
d3 = d2.groupby(["country"])[
    ["support_refugees_n", "support_migrants_n", "age", "educational_n"]
].mean()
scaler = StandardScaler()
d3_s = scaler.fit_transform(d3)

# Store sum of squares for 1..15 clusters
wss = []
for i in range(1, 16):
    km_out = KMeans(n_clusters=i, n_init=20)
    km_out.fit(d3_s)
    wss.append(km_out.inertia_)
plt.plot(range(1, 16), wss, marker="o")
plt.xlabel("Number of clusters")
plt.ylabel("Within groups sum of squares")
plt.show()
# Average variables by country and scale
d3_s = d2 %>%
    group_by(country) %>%
    summarise(
        m_refugees=mean(support_refugees_n, na.rm=T),
        m_migrants=mean(support_migrants_n, na.rm=T),
        m_age=mean(age, na.rm=T),
        m_edu=mean(educational_n, na.rm=T)) %>%
    column_to_rownames(var="country") %>%
    scale()

# Store sum of squares for 1..15 clusters
wss = list()
for (i in 1:15) {
    km.out = kmeans(d3_s, centers=i, nstart=25)
    wss[[i]] = tibble(k=i, ss=km.out$tot.withinss)
}
wss = bind_rows(wss)
ggplot(wss, aes(x=k, y=ss)) +
geom_line() + geom_point() +
xlab("Number of Clusters") +
ylab("Within groups sum of squares")
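The elbow logic can be checked on synthetic data: the WSS always shrinks as $$k$$ grows, but the drop flattens once $$k$$ passes the true number of groups. A sketch using scikit-learn, with arbitrary blob parameters of our own choosing:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three well-separated synthetic clusters (parameters are arbitrary)
X, _ = make_blobs(n_samples=90, centers=3, cluster_std=0.5, random_state=42)

wss = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wss.append(km.inertia_)

# WSS is non-increasing in k; the large drops happen up to the true k (3)
drops = [wss[i] - wss[i + 1] for i in range(len(wss) - 1)]
print([round(d, 1) for d in drops])
```

Plotting wss against k for this data would show a clear elbow at three clusters, mirroring the figure produced by Example 7.20.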

Now we can estimate our final model (Example 7.21). We use 25 random starting configurations for the centroids (the algorithm will keep the solution that minimizes the cost). The default of this parameter (nstart) in R is 1, but it is recommended to set a higher number (e.g. 20 to 50) to guarantee the maximum benefit of the method. The base R function kmeans and the scikit-learn class KMeans in Python will produce the clustering. You can observe the (scaled) mean of each variable in each cluster, as well as the corresponding cluster for each observation.

Example 7.21 Using $$k$$-means to group countries based on the average support for refugees and migrants, age, and educational level

# Compute k-means with k = 3
km_res = KMeans(n_clusters=3, n_init=25).fit(d3_s)
print(km_res)
KMeans(n_clusters=3, n_init=25)
print("K-means cluster sizes:", np.bincount(km_res.labels_))
K-means cluster sizes: [12 13  3]
print(f"Cluster means: {km_res.cluster_centers_}")
Cluster means: [[ 0.66164163  0.64025687 -0.02468681 -0.39044418]
[-0.89000978 -0.82574663 -0.3892184  -0.21560025]
[ 1.2101425   1.01720791  1.78536032  2.49604445]]
print("Clustering vector:")
Clustering vector:
print(np.column_stack((d3.index, km_res.labels_)))
[['Austria' 0]
['Belgium' 0]
['Bulgaria' 1]
['Croatia' 1]
['Cyprus' 0]
['Czech republic' 1]
['Denmark' 2]
['Estonia' 1]
['Finland' 2]
['France' 0]
['Germany' 0]
['Greece' 1]
['Hungary' 1]
['Ireland' 0]
['Italy' 1]
['Latvia' 1]
['Lithuania' 1]
['Luxemburg' 0]
['Malta' 0]
['Netherlands' 0]
['Poland' 1]
['Portugal' 0]
['Romania' 1]
['Slovakia' 1]
['Slovenia' 1]
['Spain' 0]
['Sweden' 2]
['UK' 0]]
print("Within cluster sum of squares:")
Within cluster sum of squares:
print(km_res.inertia_)
42.50488485460034
set.seed(123)
km.res = kmeans(d3_s, 3, nstart=25)
print(km.res)
K-means clustering with 3 clusters of sizes 13, 3, 12

Cluster means:
m_refugees m_migrants       m_age      m_edu
1 -0.8739722 -0.8108671 -0.38220489 -0.2117152
2  1.1883363  0.9988783  1.75318903  2.4510670
3  0.6497192  0.6287198 -0.02424197 -0.3834086

Clustering vector:
Austria        Belgium       Bulgaria        Croatia         Cyprus
3              3              1              1              3
Czech republic        Denmark        Estonia        Finland         France
1              2              1              2              3
Germany         Greece        Hungary        Ireland          Italy
3              1              1              3              1
Latvia      Lithuania      Luxemburg          Malta    Netherlands
1              1              3              3              3
Poland       Portugal        Romania       Slovakia       Slovenia
1              3              1              1              1
Spain         Sweden             UK
3              2              3

Within cluster sum of squares by cluster:
[1] 20.485188  2.984011 17.517654
(between_SS / total_SS =  62.0 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      
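The R output above reports between_SS / total_SS = 62.0%. scikit-learn does not print this ratio, but it can be derived from inertia_ (the total within-cluster sum of squares) and the total sum of squares. A sketch on synthetic stand-in data (the real d3_s would work the same way):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (arbitrary parameters)
X, _ = make_blobs(n_samples=60, centers=3, cluster_std=0.8, random_state=1)
km = KMeans(n_clusters=3, n_init=25, random_state=1).fit(X)

total_ss = ((X - X.mean(axis=0)) ** 2).sum()  # total sum of squares
within_ss = km.inertia_                       # R's tot.withinss
between_ss = total_ss - within_ss             # R's betweenss
print(f"between_SS / total_SS = {100 * between_ss / total_ss:.1f} %")
```

A high ratio means the clusters capture most of the variability in the data, which is why R reports it as a quick quality check.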

Using the function fviz_cluster of the library factoextra in R, or the pyplot function scatter in Python, you can get a visualization of the clusters. In Example 7.22 you can clearly identify that the clusters correspond to Nordic countries (more support for foreigners, higher education and age), Central and Southern European countries (middle support, lower education and age), and Eastern European countries (less support, lower education and age).

Example 7.22 Visualization of clusters

for cluster in range(km_res.n_clusters):
    plt.scatter(
        d3_s[km_res.labels_ == cluster, 0],
        d3_s[km_res.labels_ == cluster, 1],
        label=f"Cluster {cluster}",
    )
plt.scatter(
    km_res.cluster_centers_[:, 0],
    km_res.cluster_centers_[:, 1],
    s=250,
    marker="*",
)
plt.legend(scatterpoints=1)
plt.show()

fviz_cluster(km.res, d3_s, ellipse.type="norm")

### 7.3.2 Hierarchical Clustering

Another method to conduct a cluster analysis is hierarchical clustering, which builds a hierarchy of clusters that we can visualize in a dendrogram. This algorithm has two versions: a bottom-up approach (observations begin in their own clusters), also called agglomerative, and a top-down approach (all observations begin in one cluster), also called divisive. We will follow the bottom-up approach in this chapter, and when you look at the dendrogram you will see how this strategy repeatedly combines the two nearest clusters at the bottom into a larger one at the top. The algorithm starts by putting every observation in its own cluster and estimating the distance between every pair of points; it then merges the closest pair and iteratively computes the distance between each new cluster and the existing ones. How this distance between clusters is measured is determined by the linkage method we choose (complete, single, average, centroid, or Ward's linkage). Ward's linkage is a good default choice: it minimizes the variance of the clusters being merged. In doing so, it tends to produce roughly evenly sized clusters and is less sensitive to noise and outliers than some of the other methods. In Example 7.23 we will use the function hcut of the package factoextra in R and the scikit-learn function AgglomerativeClustering in Python to compute the hierarchical clustering.

A big advantage of hierarchical clustering is that, once estimated, you can freely choose the number of clusters in which to group your cases without re-estimating the model. If you decide, for instance, to use four instead of three clusters, then the cases in one of your three clusters are divided into two subgroups. With $$k$$-means, in contrast, a three-cluster solution can be completely different from a four-cluster solution. However, this comes at a big cost: hierarchical clustering requires a lot more computing resources and may therefore not be feasible for large datasets.
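This property can be seen with scipy: the tree is estimated only once, and fcluster simply cuts the stored tree at whatever number of clusters you ask for. A sketch on synthetic data with parameters of our own choosing:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

# Synthetic data with four compact groups (arbitrary parameters)
X, _ = make_blobs(n_samples=30, centers=4, cluster_std=0.4, random_state=0)

Z = linkage(X, method="complete")  # the hierarchy is estimated only once

labels3 = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 clusters
labels4 = fcluster(Z, t=4, criterion="maxclust")  # cut into 4 clusters

# The 4-cluster solution only splits one of the 3 clusters further:
# every 4-cluster lies entirely inside a single 3-cluster
print(sorted(set(labels3)), sorted(set(labels4)))
```

Because both cuts come from the same tree, the solutions are nested: moving from three to four clusters never reshuffles cases across the other clusters, which is exactly what a re-estimated $$k$$-means model cannot guarantee.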

Example 7.23 Using hierarchical clustering to group countries based on the average support for refugees and migrants, age and educational level

hc_res = AgglomerativeClustering(affinity="euclidean", linkage="complete")
hc_res.fit_predict(d3_s)
array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 0])
print(hc_res)
AgglomerativeClustering(linkage='complete')
hc.res <- hcut(d3_s, hc_method="complete")
summary(hc.res)
            Length Class  Mode
merge        54    -none- numeric
height       27    -none- numeric
order        28    -none- numeric
labels       28    -none- character
method        1    -none- character
call          3    -none- call
dist.method   1    -none- character
cluster      28    -none- numeric
nbclust       1    -none- numeric
silinfo       3    -none- list
size          2    -none- numeric
data        112    -none- numeric  

We can then plot the dendrogram with the base R function plot and, in Python, the function dendrogram from the scipy module cluster.hierarchy. The summary of the initial model suggests two clusters (size=2), but by looking at the dendrogram you can choose the number of clusters you want to work with by choosing a height at which to cut the tree (for example, a height of four to get three clusters).

Example 7.24 Dendrogram to visualize the hierarchical clustering

dendrogram = sch.dendrogram(
    sch.linkage(d3_s, method="complete"),
    labels=list(d3.index),
    leaf_rotation=90,
)
plot(hc.res, cex=0.5)

If you re-run the hierarchical clustering for three clusters (Example 7.25) and visualize it (Example 7.26) you will get a graph similar to the one produced by $$k$$-means.

Example 7.25 Re-run hierarchical clustering with three clusters

hc_res = AgglomerativeClustering(
    n_clusters=3, affinity="euclidean", linkage="complete"
)
hc_res.fit_predict(d3_s)
array([0, 0, 2, 0, 0, 2, 1, 2, 1, 0, 0, 0, 2, 0, 0, 2, 2, 0, 0, 0, 0, 0,
0, 2, 0, 0, 1, 0])
print(hc_res)
AgglomerativeClustering(n_clusters=3)
hc.res = hcut(d3_s, k=3, hc_method="complete")
summary(hc.res)
            Length Class  Mode
merge        54    -none- numeric
height       27    -none- numeric
order        28    -none- numeric
labels       28    -none- character
method        1    -none- character
call          3    -none- call
dist.method   1    -none- character
cluster      28    -none- numeric
nbclust       1    -none- numeric
silinfo       3    -none- list
size          3    -none- numeric
data        112    -none- numeric  

Example 7.26 Visualize the hierarchical clustering with three clusters

for cluster in range(hc_res.n_clusters):
    plt.scatter(
        d3_s[hc_res.labels_ == cluster, 0],
        d3_s[hc_res.labels_ == cluster, 1],
        label=cluster,
    )
plt.legend(scatterpoints=1)
plt.show()

fviz_cluster(hc.res, d3_s, ellipse.type="convex")

### 7.3.3 Principal Component Analysis and Singular Value Decomposition

Cluster analyses are in principle used to group similar cases. Sometimes, we want to group similar variables instead. A well-known method for this is principal component analysis (PCA). This unsupervised method is useful for reducing the dimensionality of your data by creating new uncorrelated variables or components that describe the original dataset. PCA uses linear transformations to create principal components that are ordered by the amount of variance they explain (the first component captures the largest variance). We get as many principal components as there are variables in the dataset, but by looking at the cumulative variance we can select just a few of these components that explain most of the variance, and thus work with a smaller, summarized data frame that is more convenient for many tasks (e.g., those that require avoiding multicollinearity or that need to be more computationally efficient). By simplifying the complexity of our data we gain a first understanding of how our variables are related, and also of how our observations might be grouped. Every component has a specific loading for each original variable, which tells you how the old variables are represented in the new components. This statistical technique is especially useful in EDA when working with high-dimensional datasets, but it can be used in many other situations.

The mathematics behind PCA is relatively easy to understand. For the sake of simplicity, however, we will just say that to obtain the principal components the algorithm first computes the mean of each variable and then the covariance matrix of the data. This matrix contains the covariance between every pair of variables; the output is a square matrix whose number of rows and columns equals the number of dimensions of the original dataset. Specifically, we can calculate the covariance of the variables $$x$$ and $$y$$ with the formula:

$cov_{x,y}=\frac{\sum_{i=1}^{N}(x_{i}-\bar{x})(y_{i}-\bar{y})}{N-1}$

Second, using the covariance matrix, the algorithm computes the eigenvectors and their corresponding eigenvalues, and then drops the eigenvectors with the lowest eigenvalues. With this reduced matrix it transforms the original values into the new subspace to obtain the principal components that summarize the original dataset.
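As a rough sketch of these steps (on toy data, not the Eurobarometer data), the whole procedure fits in a few lines of numpy:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))          # toy data: 100 cases, 4 variables

# Step 1: center each variable on its mean
Xc = X - X.mean(axis=0)

# Step 2: covariance matrix (4 x 4, one row/column per variable)
C = np.cov(Xc, rowvar=False)           # uses the N-1 denominator

# Step 3: eigenvectors and eigenvalues of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]      # sort by explained variance, descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4: project onto the top-2 eigenvectors to get two components
scores = Xc @ eigvecs[:, :2]
print(scores.shape)  # (100, 2)
```

The variance of each resulting component equals the corresponding eigenvalue, which is exactly the "explained variance" that prcomp and scikit-learn's PCA report.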

Let us now conduct a PCA on the Eurobarometer data. In Example 7.27 we will re-use the sub-data frame d3 containing the means of four variables (support for refugees, support for migrants, age, and educational level) for each of the 30 European countries. The question is: can we build a new data frame with fewer than four variables that still explains most of the variance, or in other words, that represents our original dataset well enough, but with fewer dimensions? Because our features are measured on different scales, it is normally recommended to center (to mean 0) and scale (to standard deviation 1) the data. You may also know this transformation as calculating $$z$$-scores. We can perform the PCA in R using the base function prcomp and in Python using the function PCA from the scikit-learn module decomposition.
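As a quick illustration of this standardization step (with made-up numbers, not the chapter's data):

```python
import numpy as np

# Two made-up variables on very different scales
X = np.array([[20.0, 1000.0],
              [30.0, 2000.0],
              [40.0, 6000.0]])

# z-scores: subtract each column's mean and divide by its standard
# deviation; this is what sklearn's StandardScaler or R's scale() do
X_s = (X - X.mean(axis=0)) / X.std(axis=0)

# Each column now has mean 0 and standard deviation 1
print(X_s.mean(axis=0).round(2))  # [0. 0.]
print(X_s.std(axis=0).round(2))   # [1. 1.]
```

Without this step, the variable with the largest raw variance (here, the second column) would dominate the first principal component regardless of its substantive importance.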

Example 7.27 Principal component analysis (PCA) of a data frame with 30 records and 4 variables

pca_n = PCA()
pca = pca_n.fit_transform(d3_s)
pca_df = pd.DataFrame(data=pca, columns=["PC1", "PC2", "PC3", "PC4"])
pca_df.index = d3.index
print(pca_df.head())
               PC1       PC2       PC3       PC4
country
Austria  -0.103285 -1.220018 -0.535673  0.066888
Belgium  -0.029355 -0.084707  0.051515  0.227609
Bulgaria -1.660518  0.949533 -0.480337 -0.151837
Croatia  -1.267502 -0.819093 -0.920657  0.843682
Cyprus    0.060590 -0.195928  0.573670  0.812519
pca_df_2 = pd.DataFrame(
data=pca_n.components_.T, columns=["PC1", "PC2", "PC3", "PC4"]
)
pca_df_2.index = d3.columns
print(pca_df_2)
                         PC1       PC2       PC3       PC4
support_refugees_n  0.573292 -0.369010  0.139859  0.718058
support_migrants_n  0.513586 -0.533140 -0.094283 -0.665659
age                 0.445117  0.558601  0.670994 -0.199005
educational_n       0.457642  0.517261 -0.722023  0.041073
pca = prcomp(d3_s, scale=TRUE)
head(pca$x)
                       PC1         PC2         PC3         PC4
Austria         0.10142430  1.19803416 -0.52602091  0.06568251
Belgium         0.02882642  0.08318037  0.05058673  0.22350741
Bulgaria        1.63059623 -0.93242257 -0.47168110 -0.14910143
Croatia         1.24466229  0.80433304 -0.90406722  0.82847927
Cyprus         -0.05949790  0.19239739  0.56333279  0.79787757
Czech republic  2.17979559 -0.66348027 -0.74991773 -0.08459459
pca$rotation
                  PC1        PC2         PC3         PC4
m_refugees -0.5732924  0.3690096  0.13985877  0.71805799
m_migrants -0.5135857  0.5331396 -0.09428317 -0.66565948
m_age      -0.4451170 -0.5586008  0.67099390 -0.19900549
m_edu      -0.4576422 -0.5172613 -0.72202312  0.04107309

The generated object with the PCA contains different elements in R (sdev, rotation, center, scale and x) or attributes in Python (components_, explained_variance_, explained_variance_ratio_, singular_values_, mean_, n_components_, n_features_, n_samples_, and noise_variance_). In the resulting object we can see the values of the four principal components for each country, and the loadings (the entries of the eigenvectors) of the variables on each principal component. In our example we can see that support for refugees and migrants load most strongly on PC1, while age and educational level load most strongly on PC2. If we plot the first two principal components using the base function biplot in R and the library bioinfokit in Python (Example 7.28), we can clearly see how the variables are associated with either PC1 or PC2 (we might also want to plot any other pair of the four components!). But we also get a picture of how the countries are grouped based only on these two new variables.

Example 7.28 Plot PC1 and PC2

var1 = round(pca_n.explained_variance_ratio_[0], 2)
var2 = round(pca_n.explained_variance_ratio_[1], 2)
bioinfokit.visuz.cluster.biplot(
cscore=pca,
labels=pca_df_2.index.values,
var1=var1,
var2=var2,
show=True,
)

biplot(x = pca, scale = 0, cex = 0.6,
col = c("blue4", "brown3"))

So far we are not sure how many components are enough to accurately represent our data, so we need to know how much variance (the square of the standard deviation) each component explains. We can get the values (Example 7.29) and plot the proportion of explained variance (Example 7.30). We find that the first component explains 57.85% of the variance, the second 27.97%, the third 10.34%, and the fourth just 3.83%.

Example 7.29 Proportion of variance explained

print("Proportion of variance explained:")
Proportion of variance explained:
print(pca_n.explained_variance_ratio_)
[0.57848569 0.27974794 0.10344996 0.03831642]
print("Proportion of variance explained:")
[1] "Proportion of variance explained:"
prop_var = tibble(pc=1:4,
var=pca$sdev^2 / sum(pca$sdev^2))
prop_var
# A tibble: 4 × 2
pc    var
<int>  <dbl>
1     1 0.578
2     2 0.280
3     3 0.103
4     4 0.0383

Example 7.30 Plot of the proportion of variance explained

plt.bar([1, 2, 3, 4], pca_n.explained_variance_ratio_)
plt.ylabel("Proportion of variance explained")
plt.xlabel("Principal component")
plt.xticks([1, 2, 3, 4])
plt.show()

ggplot(prop_var, aes(x=pc, y=var)) +
geom_col() +
scale_y_continuous(limits = c(0,1)) +
xlab("Principal component") +
ylab("Proportion of variance explained")

When we estimate (Example 7.31) and plot (Example 7.32) the cumulative explained variance, it is easy to see that with just the first two components we explain 85.82% of the variance. It now seems a good deal to reduce our dataset from four variables to two, i.e., half of the data, while retaining most of the original information.
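This kind of threshold rule can also be automated. A small sketch using the ratios reported in Example 7.29:

```python
import numpy as np

# Explained-variance ratios as reported in Example 7.29
ratios = np.array([0.57848569, 0.27974794, 0.10344996, 0.03831642])
cum = np.cumsum(ratios)

# Keep the smallest number of components whose cumulative ratio
# passes the chosen threshold
threshold = 0.85
n_keep = int(np.searchsorted(cum, threshold) + 1)
print(n_keep)  # 2
```

scikit-learn can do this selection for you: passing a float between 0 and 1 as n_components to PCA (e.g., PCA(n_components=0.85)) keeps just enough components to explain that share of the variance.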

Example 7.31 Cumulative explained variance

cvar = np.cumsum(pca_n.explained_variance_ratio_)
cvar
array([0.57848569, 0.85823362, 0.96168358, 1.        ])
cvar = mutate(prop_var, var=cumsum(var))
cvar
# A tibble: 4 × 2
     pc   var
  <int> <dbl>
1     1 0.578
2     2 0.858
3     3 0.962
4     4 1    

Example 7.32 Plot of the cumulative explained variance

plt.plot(cvar)
plt.xlabel("number of components")
plt.xticks(np.arange(len(cvar)), np.arange(1, len(cvar) + 1))
plt.ylabel("cumulative explained variance")
plt.show()

ggplot(cvar, aes(x=pc, y=var)) +
geom_point() +
geom_line() +
theme_bw() +
xlab("Principal component") +
ylab("Cumulative explained variance")

And what if we want to use this PCA and run a clustering (as explained above) with just these two new variables instead of the four original ones? Just repeat the $$k$$-means procedure, but now using a new, smaller data frame that selects PC1 and PC2 from the PCA. After estimating the optimal number of clusters (three again!) we can compute and visualize the clusters, and get a very similar picture to the one obtained in the previous examples, with minor differences such as the Netherlands changing cluster (it is now more similar to the Nordic countries!). This last exercise is a good example of how to combine different techniques in EDA.

Example 7.33 Combining PCA to reduce dimensionality and $$k$$-means to group countries

# Generate a new dataset with first components
d5 = pca[:, 0:2]
d5[0:5]
array([[-0.10328545, -1.22001827],
       [-0.02935539, -0.08470675],
       [-1.66051792,  0.94953266],
       [-1.26750204, -0.81909268],
       [ 0.06058969, -0.19592792]])

# Get optimal number of clusters
wss = []
for i in range(1, 15):
    km_out = KMeans(n_clusters=i, n_init=20)
    km_out.fit(d5)
    wss.append(km_out.inertia_)

# Plot sum of squares vs. number of clusters
plt.plot(range(1, 15), wss, marker="o")
plt.xlabel("Number of clusters")
plt.ylabel("Within groups sum of squares")
plt.show()

# Compute again with k = 3 and visualize

km_res_5 = KMeans(n_clusters=3, n_init=25).fit(d5)
for cluster in range(km_res_5.n_clusters):
    plt.scatter(
        d5[km_res_5.labels_ == cluster, 0],
        d5[km_res_5.labels_ == cluster, 1],
        label=cluster,
    )
plt.scatter(
    km_res_5.cluster_centers_[:, 0],
    km_res_5.cluster_centers_[:, 1],
    s=250,
    marker="*",
)
plt.legend(scatterpoints=1)
plt.show()

# Generate a new dataset with first components
d5 = pca$x[, c("PC1", "PC2")]
head(d5)
                       PC1         PC2
Austria         0.10142430  1.19803416
Belgium         0.02882642  0.08318037
Bulgaria        1.63059623 -0.93242257
Croatia         1.24466229  0.80433304
Cyprus         -0.05949790  0.19239739
Czech republic  2.17979559 -0.66348027

# Get optimal number of clusters
wss = list()
for (i in 1:15) {
    km.out = kmeans(d5, centers=i, nstart=20)
    wss[[i]] = tibble(k=i, ss=km.out$tot.withinss)
}
wss = bind_rows(wss)

# Plot sum of squares vs. number of clusters
ggplot(wss, aes(x=k, y=ss)) + geom_line() +
xlab("Number of Clusters") +
ylab("Within groups sum of squares")

# Compute again with k = 3 and visualize
set.seed(123)
km.res_5 <- kmeans(d5, 3, nstart = 25)
fviz_cluster(km.res_5, d5, ellipse.type = "norm")

When your dataset gets bigger, though, you may not actually use PCA but the closely related singular value decomposition, SVD. The two are closely interrelated, and in fact SVD can be used "under the hood" to estimate a PCA. While PCA is taught in many classical statistics textbooks in the social sciences, SVD usually is not. Yet it has a great advantage: in the way it is implemented in scikit-learn, it does not require storing the (dense) covariance matrix in memory (see the feature box in Section 11.4.1 for more information on sparse versus dense matrices). This means that once your dataset grows bigger than typical survey datasets, a PCA may quickly become impossible to estimate, whereas the SVD can still be estimated without requiring many resources. Therefore, especially when you are working with textual data, you will often see SVD used instead of PCA. For all practical purposes, the way you use and interpret the results stays the same.
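The close relation between the two can be sketched in numpy (on toy data): the SVD of the centered data matrix yields the same component scores as PCA, without ever forming the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))           # toy data: 50 cases, 4 variables
Xc = X - X.mean(axis=0)

# PCA route: eigendecomposition of the covariance matrix
C = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
pca_scores = Xc @ eigvecs[:, order]

# SVD route: decompose the data matrix directly
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_scores = U * S                      # covariance matrix never built

# The columns agree up to sign (eigenvector signs are arbitrary)
print(np.allclose(np.abs(pca_scores), np.abs(svd_scores)))  # True
```

For sparse data such as document-term matrices, scikit-learn's TruncatedSVD exploits exactly this: it computes only the leading components and never densifies the input.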

1. Original data ZA6928_v1-0-0.csv was cleaned and prepared for the exercise. The preparation of the data is documented in the notebooks cleaning_eurobarometer_py.ipynb and cleaning_eurobarometer_r.ipynb.↩︎

5. A $$z$$-transformation means rescaling data to a mean of 0 and a standard deviation of 1.↩︎
6. We can re-run this cluster analysis using $$k$$-medoids or partitioning around medoids (PAM) and get similar results (the three medoids are: Slovakia, Belgium and Denmark), both in data and visualization. In R you must install the package cluster, which contains the function pam, and in Python the package scikit-learn-extra with the class KMedoids.↩︎