
# 7. Exploratory data analysis

Abstract This chapter explains how to use data analysis and visualization techniques to understand and communicate the structure and story of our data. It first introduces the reader to exploratory statistics and data visualization in R and Python. Then, it discusses how unsupervised machine learning, in particular clustering and dimensionality reduction techniques, can be used to group similar cases or to decrease the number of features in a dataset.
Keywords: descriptive statistics, visualization, unsupervised machine learning, clustering, dimensionality reduction
Chapter objectives:
• Be able to conduct an exploratory data analysis
• Understand the principles of unsupervised machine learning
• Be able to conduct a cluster analysis
• Be able to apply dimension reduction techniques
In this chapter we use the R packages tidyverse, glue, maps, and factoextra for data analysis and visualization. For Python we use pandas and numpy for data analysis and matplotlib, seaborn, and geopandas for visualization. Additionally, in Python we use scikit-learn and scipy for cluster analysis. You can install these packages with the code below if needed (see Section 1.4 for more details):
Python code
!pip3 install pandas matplotlib seaborn geopandas 
!pip3 install scikit-learn scipy bioinfokit 
!pip3 install descartes  
R code
install.packages(c("tidyverse", "glue", "maps", 
                   "factoextra"))

After installing, you need to import (activate) the packages every session:
Python code
%matplotlib inline
# General packages
import itertools
import pandas as pd
import numpy as np
# Packages for visualizing
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd
# Packages for clustering
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import (KMeans, 
    AgglomerativeClustering)
import scipy.cluster.hierarchy as sch
from sklearn.decomposition import PCA
import bioinfokit.visuz  
R code
library(tidyverse)
library(glue)
library(maps)
library(factoextra)



## 7.1. Simple Exploratory Data Analysis

Now that you are familiar with data structures (Chapter 5) and data wrangling (Chapter 6) you are probably eager to get some real insights into your data beyond the basic techniques we briefly introduced in Chapter 2.

As we outlined in Chapter 1, the computational analysis of communication can be bottom-up or top-down, inductive or deductive. Just as in traditional research methods (Bryman, 2012), an inductive bottom-up approach is sometimes a goal in itself: after all, explorative analyses are invaluable for generating hypotheses that can be tested in follow-up research. But even when you are conducting a deductive, hypothesis-testing study, it is a good idea to start by describing your dataset using the tools of exploratory data analysis to get a better picture of your data. In fact, we could even go as far as saying that obtaining details like frequency tables, cross-tabulations, and summary statistics (mean, median, mode, etc.) is always necessary, even if your research questions or hypotheses require further complex analysis. For the computational analysis of communication, a significant amount of time may actually be invested at this stage.

Exploratory data analysis (EDA), as originally conceived by Tukey (1977), can be a very powerful framework to prepare and evaluate data, as well as to understand its properties and generate insights at any stage of your research. It is mandatory to do some EDA before any sophisticated analysis to know if the data is clean enough, if there are missing values and outliers, and how the distributions are shaped. Furthermore, before making any multivariate or inferential analysis we might want to know the specific frequencies for each variable, their measures of central tendency, their dispersion, and so on. We might also want to integrate frequencies of different variables into a single table to have an initial picture of their interrelations.

To illustrate how to do this in R and Python, we will use existing representative survey data to analyze how support for migrants or refugees in Europe changes over time and differs per country. The Eurobarometer (freely available at the Leibniz Institute for the Social Sciences – GESIS) has contained these specific questions since 2015. We might pose questions about the variation of a single variable or describe the covariation of different variables to find patterns in our data. In this section, we will compute basic statistics to answer these questions, and in the next section we will visualize them by plotting within- and between-variable behaviors of a selected group of features of the Eurobarometer administered in November 2017 to 33193 Europeans.

For most of the EDA we will use tidyverse in R and pandas as well as numpy and scipy in Python (Example 7.1). After loading a clean version of the survey data[1] stored in a csv file (using the tidyverse function read_csv in R and the pandas function read_csv in Python) and checking the dimensions of our data frame (33193 x 17), we probably want to get a global picture of each of our variables by getting a frequency table. This table shows the frequency of the different outcomes in a distribution. This means that we can see how many cases we have for each number or category in the distribution of every variable, which is useful in order to have an initial understanding of our data.

pandas versus pure numpy/scipy In this book, we use pandas data frames a lot: they make our lives easier compared to native data types (Section 3.1), and they already integrate a lot of functionality of underlying math and statistics packages such as numpy and scipy. However, you do not have to force your data into a data frame if a different structure makes more sense in your script. numpy and scipy will happily calculate mean, median, skewness, and kurtosis of the values in a list, or the correlation between two lists. It's up to you.
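For instance, the statistics mentioned in the box above can be computed directly on plain Python lists (the values below are made up for illustration and are not part of the survey data):

```python
import numpy as np
from scipy import stats

# Plain Python lists -- no DataFrame needed
ages = [25, 31, 48, 52, 60, 33, 41]
incomes = [21, 30, 52, 55, 70, 33, 45]

print(np.mean(ages))                  # arithmetic mean
print(np.median(ages))                # median
print(stats.skew(ages))               # skewness
print(stats.kurtosis(ages))           # (excess) kurtosis
print(np.corrcoef(ages, incomes))     # 2x2 correlation matrix
print(stats.pearsonr(ages, incomes))  # Pearson's r and its p-value
```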

#### Example 7.1. Load data from Eurobarometer survey and select some variables

Python code
url="https://cssbook.net/d/eurobarom_nov_2017.csv"
d2=pd.read_csv(url)
print("Shape of my filtered data =", d2.shape)
print("Variables:", d2.columns)  
R code
url="https://cssbook.net/d/eurobarom_nov_2017.csv"
d2= read_csv(url, col_names = TRUE)
glue("{nrow(d2)} row x {ncol(d2)} columns")
colnames(d2)  
Python output
Shape of my filtered data = (33193, 17)
Variables: Index(['survey', 'uniqid', 'date', 'country', 'marital_status', 'educational',
'gender', 'age', 'occupation', 'type_community',
'household_composition', 'support_refugees', 'support_migrants',
'date_n', 'support_refugees_n', 'support_migrants_n', 'educational_n'],
dtype='object')
R output
33193 row x 17 columns
[1] "survey"                "uniqid"                "date"
[4] "country"               "marital_status"        "educational"
[7] "gender"                "age"                   "occupation"
[10] "type_community"        "household_composition" "support_refugees"
[13] "support_migrants"      "date_n"                "support_refugees_n"
[16] "support_migrants_n"    "educational_n"

Let us first get the distribution of the categorical variable gender by creating tables that include absolute and relative frequencies. The frequency tables (using the dplyr functions group_by and summarize in R, and the pandas function value_counts in Python) reveal that 17716 (53.37%) women and 15477 (46.63%) men answered this survey (Example 7.2). We can do the same with the level of support for refugees [support_refugees] (To what extent do you agree or disagree with the following statement: our country should help refugees) and find that 4957 (14.93%) persons totally agreed with this statement, 12695 (38.25%) tended to agree, 5391 (16.24%) tended to disagree, and 3574 (10.77%) totally disagreed.

#### Example 7.2. Absolute and relative frequencies of support of refugees and gender.

Python code
print(d2["gender"].value_counts())
print(d2["gender"].value_counts(normalize=True))


R code
d2 %>%
  group_by(gender) %>%
  summarise(frequency = n()) %>%
  mutate(rel_freq = frequency / sum(frequency))     
R output. Note that Python output may look slightly different
# A tibble: 2 × 3
  gender frequency rel_freq
  <chr>      <int>    <dbl>
1 Man        15477 0.466273
2 Woman      17716 0.533727
Python code
print(d2["support_refugees"].value_counts())
print(d2["support_refugees"].value_counts(
    normalize=True,dropna=False))

R code
d2 %>%
  group_by(support_refugees) %>%
  summarise(frequency = n()) %>%
  mutate(rel_freq = frequency / sum(frequency))   
R output. Note that Python output may look slightly different
# A tibble: 5 × 3
  support_refugees frequency  rel_freq
  <chr>                <int>     <dbl>
1 Tend to agree        12695 0.3824602
2 Tend to disagree      5391 0.1624138
3 Totally agree         4957 0.1493387
4 Totally disagree      3574 0.1076733
5 NA                    6576 0.1981141

Before diving any further into between-variable analyses, you might have noticed that there are some missing values in the data. Such values make up a substantial share of the data in many real social science and communication analyses (just remember that respondents cannot be forced to answer every question in a telephone or face-to-face survey!). From a statistical point of view, there are many approaches to address missing values: for example, we can drop the rows or columns that contain any of them, or we can impute the missing values by predicting them from their relation with other variables – as we did in Section 6.2 by replacing the missing values with the column mean. It goes beyond the scope of this chapter to explain all the imputation methods (and, in fact, mean imputation has some serious drawbacks when used in subsequent analysis), but at least we need to know how to identify the missing values in our data and how to drop the cases that contain them from our dataset.
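As a toy sketch of these two options in pandas (with made-up values, not the Eurobarometer data), dropping and mean imputation look like this:

```python
import pandas as pd
import numpy as np

# Hypothetical toy data with one missing age (np.nan)
toy = pd.DataFrame({"age": [25, 31, np.nan, 52],
                    "gender": ["Man", "Woman", "Woman", "Man"]})

# Option 1: drop every row that contains a missing value
dropped = toy.dropna()
print(dropped.shape)  # (3, 2)

# Option 2: mean imputation -- replace NaN with the column mean
imputed = toy.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].mean())
print(imputed["age"].tolist())  # [25.0, 31.0, 36.0, 52.0]
```

Note how imputation keeps all four rows but invents a value (36.0, the mean of the observed ages), which is why it can distort subsequent analyses.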

In the case of the variable support_refugees we can count its missing data (6576 cases) with the base R function is.na and the pandas method isna[2]. Then we may decide to drop all the records that contain these values from our dataset using the tidyr function drop_na in R and the pandas method dropna in Python[3] (Example 7.3). By doing this we get a cleaner dataset and can continue with a more sophisticated EDA with cross-tabulation and summary statistics for groups of cases.

#### Example 7.3. Drop missing values

Python code
n_miss = d2["support_refugees"].isna().sum()
print(f"# of missing values: {n_miss}")

d2 = d2.dropna()
print(f"Shape after dropping NAs: {d2.shape}")  
R code
n_miss = sum(is.na(d2$support_refugees))
print(glue("# of missing values: {n_miss}"))

d2 = d2 %>% drop_na()
print(glue("Rows after dropping NAs: {nrow(d2)}"))  
Python output
# of missing values: 6576
Shape after dropping NAs: (23448, 17)
R output
# of missing values: 6576
Rows after dropping NAs: 23448

Now let us cross-tabulate gender and support_refugees to get an initial idea of what the relationship between these two variables might be. For this purpose we create a contingency table or cross-tabulation with the frequencies of each combination of categories (using the dplyr functions group_by, summarize, and pivot_wider in R, and the pandas function crosstab in Python; Example 7.4). From this table you can easily see that 2178 women totally agreed with helping refugees and 1524 men totally disagreed. Furthermore, other interesting questions about our data might now arise if we compute summary statistics for groups of cases (using again the dplyr functions group_by and summarize together with the base function mean in R; and the pandas function groupby with mean in Python). For example, you might wonder what the average age was of the women who totally agreed (52.42) or totally disagreed (53.20) with helping refugees. This approach opens up a huge number of possible analyses by grouping variables and estimating different statistics beyond the mean, such as count, sum, median, mode, minimum, or maximum.

#### Example 7.4. Cross tabulation of support of refugees and gender, and summary statistics

Python code
print("Crosstab gender and support_refugees:")
print(pd.crosstab(d2["support_refugees"], 
                  d2["gender"]))

print("Summary statistics for group of cases:")
print(d2.groupby(["support_refugees", "gender"])
      ["age"].mean())


R code
print("Crosstab gender and support_refugees:")
d2 %>%
  group_by(gender, support_refugees)%>%
  summarise(n=n())%>%
  pivot_wider(values_from="n",names_from="gender")

print("Summary statistics for group of cases:")
d2 %>%
  group_by(support_refugees, gender)%>%
  summarise(mean_age=mean(age, na.rm = TRUE))  
Python output
Crosstab gender and support_refugees:
gender             Man  Woman
support_refugees
Tend to agree     5067   5931
Tend to disagree  2176   2692
Totally agree     2118   2178
Totally disagree  1524   1762
Summary statistics for group of cases:
support_refugees  gender
Tend to agree     Man       54.073022
                  Woman     53.373799
Tend to disagree  Man       52.819853
                  Woman     52.656761
Totally agree     Man       53.738905
                  Woman     52.421947
Totally disagree  Man       52.368110
                  Woman     53.203746
Name: age, dtype: float64
R output
[1] "Crosstab gender and support_refugees:"
support_refugees Man  Woman
1 Tend to agree    5067 5931
2 Tend to disagree 2176 2692
3 Totally agree    2118 2178
4 Totally disagree 1524 1762
[1] "Summary statistics for group of cases:"
support_refugees gender mean_age
1 Tend to agree    Man    54.07302
2 Tend to agree    Woman  53.37380
3 Tend to disagree Man    52.81985
4 Tend to disagree Woman  52.65676
5 Totally agree    Man    53.73890
6 Totally agree    Woman  52.42195
7 Totally disagree Man    52.36811
8 Totally disagree Woman  53.20375
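As mentioned above, grouped summaries are not limited to the mean: in pandas, the agg method accepts several statistics at once. A small sketch with hypothetical data (the column names and values below are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data mimicking the survey structure
toy = pd.DataFrame({
    "gender": ["Man", "Woman", "Man", "Woman", "Man"],
    "age": [54, 53, 52, 61, 40]})

# Several summary statistics per group in a single call
summary = toy.groupby("gender")["age"].agg(
    ["count", "mean", "median", "min", "max"])
print(summary)
```

The result is a data frame with one row per group and one column per requested statistic, which is often a convenient shape for reporting descriptives.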

## 7.2. Visualizing Data

Data visualization is a powerful technique for both understanding data yourself and communicating the story of your data to others. Based on ggplot2 in R and matplotlib and seaborn in Python, this section covers histograms, line and bar graphs, scatterplots and heatmaps. It touches on combining multiple graphs, communicating uncertainty with boxplots and ribbons, and plotting geospatial data. In fact, visualizing data is an important stage in both EDA and advanced analytics, and we can use graphs to obtain important insights into our data. For example, if we want to visualize the age and the support for refugees of European citizens, we can plot a histogram and a bar graph, respectively.
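As a first taste of what such plots look like in matplotlib, here is a minimal sketch: the ages are simulated (the real survey data is not loaded here), while the bar heights reuse the support_refugees frequencies reported in Example 7.2. The Agg backend and the output filename are assumptions for running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Simulated ages as a stand-in for the real survey variable
rng = np.random.default_rng(42)
ages = rng.normal(50, 15, 1000).clip(18, 95)

# Frequencies of support_refugees taken from Example 7.2
support = {"Totally agree": 4957, "Tend to agree": 12695,
           "Tend to disagree": 5391, "Totally disagree": 3574}

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(ages, bins=20)  # histogram for a numeric variable
ax1.set(title="Age", xlabel="age", ylabel="frequency")
ax2.bar(list(support), list(support.values()))  # bar graph for a categorical one
ax2.set(title="Support for refugees")
ax2.tick_params(axis="x", rotation=45)
fig.tight_layout()
fig.savefig("eda_plots.png")
```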