8 Statistical Modeling and Supervised Machine Learning

Abstract. This chapter introduces the reader to the world of supervised machine learning. It starts by outlining how classical statistical techniques such as regression models can be used for prediction. It then provides an overview of frequently-used techniques from Naïve Bayes classifiers to neural networks.

Keywords. supervised machine learning

Objectives:
- Understand the principles of supervised machine learning
- Be able to run a predictive model
- Be able to evaluate the performance of a predictive model

Note

In this chapter, we use the Python package statsmodels for classical statistical modeling, before we move on to use a dedicated machine learning package, scikit-learn. In R, we use base R for statistical modeling, rsample for splitting our dataset, caret for machine learning, and pROC for determining the Receiver Operating Characteristic (ROC) curve. Note that caret requires additional packages for the actual machine learning models: naivebayes, LiblineaR, and randomForest. You can install them as follows (see Section 1.4 for more details):

Python code
R code

!pip3 install pandas numpy matplotlib statsmodels scikit-learn

install.packages(c("randomForest", "rsample",
    "glue", "caret", "naivebayes", "LiblineaR",
    "pROC", "e1071"))

After installing, you need to import (activate) the packages every session:

Python code
R code

# Data handling, math, and plotting
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Classical statistical modeling
import statsmodels.formula.api as smf

# ML: Preprocessing
from sklearn import preprocessing

# ML: Train/test splits, cross validation,
# gridsearch
from sklearn.model_selection import (
    train_test_split,
    cross_val_score,
    GridSearchCV,
)

# ML: Different models
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# ML: Model evaluation
from sklearn.metrics import (
    confusion_matrix,
    classification_report,
    roc_curve,
    auc,
    cohen_kappa_score,
    make_scorer,
    f1_score,
)

library(tidyverse)
library(rsample)
library(glue)
library(caret)
library(naivebayes)
library(LiblineaR)
library(randomForest)
library(pROC)

Note that in Python, we could also simply write import sklearn once instead of all the from sklearn import ... lines. But our approach saves a lot of typing later on, as we can simply write classification_report instead of sklearn.metrics.classification_report, for instance.

In this chapter, we introduce the basic concepts and ideas behind machine learning. We will outline how machine learning relates to traditional statistical approaches that you already might know (and as you will see, there is a lot of overlap), present different types of models, and discuss how to validate them. Later in this book (Section 11.4), we will specifically apply the knowledge you gain from this chapter to the analysis of textual data, arguably one of the most interesting tasks in the computational analysis of communication.

In this chapter, we focus on supervised machine learning (SML) – a form of machine learning, where we aim to predict a variable that, for at least a part of our data, is known. SML is usually applied to classification and regression problems. To illustrate the idea, imagine that you are interested in predicting gender, based on Twitter biographies. You determine the gender for some of the biographies yourself and hand these examples over to the computer. The computer “learns” this classification from your examples, and can then be used to predict the gender for other Twitter biographies for which you do not know the gender.

In unsupervised machine learning (UML), in contrast, you do not have such examples. Therefore, UML is usually applied to clustering and association problems. We have discussed some of these techniques in Section 7.3, in particular cluster analysis and principal component analysis (PCA). Later, in Section 11.5, we will discuss topic modeling, an unsupervised method to extract so-called topics from textual data.

Even though both approaches can be combined (for instance, one could first reduce the amount of data using PCA or SVD, and then predict some outcome), they can be seen as fundamentally different, from both theoretical and conceptual points of view. Unsupervised machine learning is a bottom-up approach and corresponds to inductive reasoning: you do not have a hypothesis of, for instance, which topics are present in a corpus of text; you rather let the topics emerge from the data. Supervised machine learning, in contrast, is a top-down approach and can be seen as more deductive: you define a priori which topics to predict.

8.1 Statistical Modeling and Prediction

Machine learning, many people joke, is nothing other than a fancy name for statistics. And, in fact, there is some truth to this: if you say “logistic regression”, this will sound familiar to both statisticians and machine learning practitioners. Hence, it does not make much sense to distinguish between statistics on the one hand and machine learning on the other hand. Still, there are some differences between traditional statistical approaches that you may have learned about in your statistics classes and the machine learning approach, even if some of the same mathematical tools are used. One may say that the focus is a different one, and the objective we want to achieve may differ.

Let us illustrate this with an example. media.csv¹ contains a few columns from survey data on how many days per week respondents turn to different media types (radio, newspaper, tv and Internet) in order to follow the news². It also contains their age (in years), their gender (coded as female = 0, male = 1), and their education (on a 5-point scale).

A straightforward question to ask is how far the sociodemographic characteristics of the respondents explain their media use. Social scientists would typically approach this question by running a regression analysis. Such an analysis tells us how some independent variables $x_1, x_2, \ldots, x_n$ can explain $y$. In an ordinary least squares (OLS) regression, we would estimate $y=\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n$.

In a typical social-science paper, we would then interpret the coefficients that we estimated, and say something like: when $x_1$ increases by one unit, $y$ increases by $\beta_1$. We sometimes call this “the effect of $x_1$ on $y$” (even though, of course, it depends on the study design whether the relationship can really be interpreted as a causal effect). Additionally, we might look at the explained variance $R^2$, to assess how well the model fits our data. In Example 8.1 we use this regression approach to model the relationship between age and gender on the one hand and the number of days per week a person reads a newspaper on the other. We fit the linear model using the stats function lm in R and the statsmodels function ols (imported from the module statsmodels.formula.api) in Python.

Example 8.1 Obtaining a model through estimating an OLS regression

Python code
R code

df = pd.read_csv("https://cssbook.net/d/media.csv")
mod = smf.ols(formula="newspaper ~ age + gender", data=df).fit()
# mod.summary() would give a lot more info,
# but we only care about the coefficients:
mod.params

Intercept   -0.089560
age          0.067620
gender       0.176665
dtype: float64

df = read.csv("https://cssbook.net/d/media.csv")
mod = lm(formula = "newspaper ~ age + gender",
         data = df)
# summary(mod) would give a lot more info, 
# but we only care about the coefficients:
mod


Call:
lm(formula = "newspaper ~ age + gender", data = df)

Coefficients:
(Intercept)          age       gender  
   -0.08956      0.06762      0.17666

Most traditional social-scientific analyses stop after reporting and interpreting the coefficients of age ($\beta = 0.0676$) and gender ($\beta = 0.1767$), as well as their standard errors, confidence intervals, p-values, and the total explained variance (19%). But we can go a step further. Given that we have already estimated our regression equation, why not use it to do some prediction?

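We have just estimated that

$\text{newspaper} = -0.08956 + 0.06762 \cdot \text{age} + 0.17666 \cdot \text{gender}$

By just filling in the values for a 20-year-old man, or a 40-year-old woman, we can easily calculate the expected number of days such a person reads the newspaper per week, even if no such person exists in the original dataset.

We learn that

$-0.08956 + 0.06762 \cdot 20 + 0.17666 \cdot 1 \approx 1.44$ for the man, and $-0.08956 + 0.06762 \cdot 40 + 0.17666 \cdot 0 \approx 2.62$ for the woman.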

This was easy to do by hand, but of course, we could do this automatically for a large and essentially unlimited number of cases. This could be as simple as shown in Example 8.2.

Example 8.2 Using the OLS model we estimated before to predict the dependent variable for new data where the dependent variable is unknown.

Python code
R code

newdata = pd.DataFrame([{"gender": 1, "age": 20}, {"gender": 0, "age": 40}])
mod.predict(newdata)

0    1.439508
1    2.615248
dtype: float64

gender = c(1,0)
age = c(20,40)
newdata = data.frame(age, gender)
predict(mod, newdata)

       1        2 
1.439508 2.615248

In doing so, we shift our attention from the interpretation of coefficients to the prediction of the dependent variable for new, unknown cases. We do not care about the actual values of the coefficients; we just need them for our prediction. In fact, in many machine learning models, we will have so many of them that we do not even bother to report them.

As you see, this implies that we proceed in two steps: first, we use some data to estimate our model. Second, we use that model to make predictions.

We used an OLS regression for our first example, because it is very straightforward to interpret and most of our readers will be familiar with it. However, a model can take the form of any function, as long as it takes some characteristics (or “features”) of the cases (in this case, people) as input and returns a prediction.

Using such a simple OLS regression approach for prediction, as we did in our example, can come with a couple of problems, though. One problem is that in some cases, such predictions do not make much sense. For instance, even though we know that the output should be something between 0 and 7 (as that is the number of days in a week), our model will happily predict that once a man reaches the age of 105 (rare, but not impossible), he will read a newspaper on about 7.19 out of 7 days. Similarly, a one-year-old girl will even have a negative amount of newspaper reading. A second problem relates to the models’ inherent assumptions. For instance, in our example it is quite an assumption to make that the relationships between these variables are linear – we will therefore discuss multiple models that do not make such assumptions later in this chapter. And, finally, in many cases, we are actually not interested in getting an accurate prediction of a continuous number (a regression task), but rather in predicting a category. We may want to predict whether a tweet goes viral or not, whether a user comment is likely to contain offensive language or not, whether an article is more likely to be about politics, sports, economy, or lifestyle. In machine learning terms, these tasks are known as classification.
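The out-of-range predictions can be reproduced by plugging the coefficients from Example 8.1 into the regression equation. The following is a minimal sketch: the helper name predict_days is ours, and the coefficients are copied by hand from the output of Example 8.1 rather than taken from the fitted model object.

```python
# Coefficients copied from the output of Example 8.1
INTERCEPT = -0.089560
B_AGE = 0.067620
B_GENDER = 0.176665


def predict_days(age, gender):
    """Predicted newspaper days per week (gender: female = 0, male = 1)."""
    return INTERCEPT + B_AGE * age + B_GENDER * gender


# The two plausible cases from Example 8.2
print(f"{predict_days(20, 1):.2f}")  # 20-year-old man
print(f"{predict_days(40, 0):.2f}")  # 40-year-old woman

# Two cases for which the linear model leaves the possible range of 0 to 7
old_man = predict_days(105, 1)
young_girl = predict_days(1, 0)
print(old_man > 7)     # True: more than seven days per week
print(young_girl < 0)  # True: a negative number of days
```

Nothing in the linear equation knows that a week has only seven days, which is why the model extrapolates beyond the possible range without complaint.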

In the next section, we will outline key terms and concepts in machine learning. After that, we will discuss specific models that you can use for different applications.

8.2 Concepts and Principles

The goal of Supervised Machine Learning can be summarized in one sentence: estimate a model based on some data, and then use the model to predict the expected outcome for some new cases, for which we do not know the outcome yet. This is exactly what we have done in the introductory example in Section 8.1.

But when do we need it?

In short, in any scenario where the following two preconditions are fulfilled. First, we have a large dataset (say, $100000$ headlines) for which we want to predict the class to which each item belongs (say, whether a headline is clickbait or not). Second, for a random subset of the data (say, $2000$ of the headlines), we already know the class, for example because we have manually coded (“annotated”) them.

Before we start using SML, though, we first need to have a common terminology. At the risk of oversimplifying matters, Table 8.1 provides a rough guideline of how some typical machine learning terms translate to statistical terms that you may be familiar with.

Table 8.1: Some common machine learning terms explained
machine learning lingo	statistics lingo
feature	independent variable
label	dependent variable
labeled dataset	dataset with both independent and dependent variables
to train a model	to estimate
classifier (classification)	model to predict nominal outcomes
to annotate	to (manually) code (content analysis)

Let us explain them in more detail by walking through a typical SML workflow.

Before we start, we need to get a labeled dataset. It may be given to us, or we may need to create it ourselves. For instance, often we can draw a random sample of our data and use techniques of manual content analysis (e.g., Riffe et al. 2019) to annotate (i.e., to manually code) the data. You can download an example for this process (annotating the topic of news articles) from dx.doi.org/10.6084/m9.figshare.7314896.v1 (Vermeer 2018).

It is hard to give a rule of thumb for how much labeled data you need. It depends heavily on the type of data and task you have (for instance, whether it is a binary or a multi-class classification problem), and on how evenly the classes are distributed (the class balance): after all, having $10000$ annotated headlines doesn’t help you if $9990$ are not clickbait and only $10$ are. These reservations notwithstanding, it is fair to say that typical sizes in our field are, very roughly speaking, in the order of $1000$ to $10000$ when classifying longer texts (see Burscher et al. 2014), even though researchers studying less rich data sometimes annotate larger datasets (e.g., $60000$ social media messages in Vermeer et al. 2019).
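Before annotating more data, it can help to check how the labels are distributed. The sketch below uses made-up labels that mirror the clickbait example in the text ($10000$ headlines, only $10$ of them clickbait); the variable names are purely illustrative.

```python
import pandas as pd

# Hypothetical labels: 9990 regular headlines and 10 clickbait headlines
labels = pd.Series(["not clickbait"] * 9990 + ["clickbait"] * 10)

# Absolute and relative frequency of each class
counts = labels.value_counts()
shares = labels.value_counts(normalize=True)
print(counts)
print(shares)

# A classifier that always predicts the majority class would be right
# in 99.9% of the cases without having learned anything about clickbait
majority_share = shares.max()
print(f"{majority_share:.3f}")
```

With only ten examples of the minority class, there is very little for a classifier to learn from, no matter how large the dataset is in total.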

Once we have established that this labeled dataset is available and have ensured that it is of good quality, we randomly split it into two datasets: a training dataset and a test dataset.³ We will use the first one to train our model, and the second to test how well our model performs. Common ratios range from 50:50 to 80:20; and especially if the size of your labeled dataset is rather limited, you may want to have a slightly larger training dataset at the expense of a slightly smaller test dataset.

In Example 8.3, we prepare the dataset we already used in Section 8.1 for classification by creating a dichotomous variable (the label) and splitting it into a training and a test dataset. We use y_train to denote the training labels and X_train to denote the feature matrix of the training dataset; y_test and X_test are the corresponding objects for the test dataset. We set a so-called random-state seed to make sure that the random splitting will be the same when re-running the code. We can easily split these datasets using the rsample function initial_split in R and the sklearn function train_test_split in Python.

Example 8.3 Preparing a dataset for supervised machine learning

Python code
R code

df = pd.read_csv("https://cssbook.net/d/media.csv")

df["uses-internet"] = (df["internet"] > 0).replace(
    {True: "user", False: "non-user"}
)
df.dropna(inplace=True)
print("How many people used online news at all?")

How many people used online news at all?

print(df["uses-internet"].value_counts())

user        1262
non-user     803
Name: uses-internet, dtype: int64

X_train, X_test, y_train, y_test = train_test_split(
    df[["age", "education", "gender"]],
    df["uses-internet"],
    test_size=0.2,
    random_state=42,
)
print(f"We have {len(X_train)} training and " f"{len(X_test)} test cases.")

We have 1652 training and 413 test cases.

df = read.csv("https://cssbook.net/d/media.csv")
df = na.omit(df %>% mutate(
    usesinternet=recode(internet, 
            .default="user", `0`="non-user")))

set.seed(42)
df$usesinternet = as.factor(df$usesinternet)
print("How many people used online news at all?")

[1] "How many people used online news at all?"

print(table(df$usesinternet))


non-user     user 
     803     1262

split = initial_split(df, prop = .8)
traindata = training(split)
testdata  = testing(split)

X_train = select(traindata, 
                 c("age", "gender", "education"))
y_train = traindata$usesinternet
X_test = select(testdata, 
                c("age", "gender", "education"))
y_test = testdata$usesinternet

glue("We have {nrow(X_train)} training and {nrow(X_test)} test cases.")

We have 1652 training and 413 test cases.

We now can train our classifier (i.e., estimate our model using the training dataset contained in the objects X_train and y_train). This can be as straightforward as estimating a logistic regression equation (we will discuss different classifiers in Section 8.3). It may be that we first need to create new independent variables, so-called features, a step known as feature engineering, for example by transforming existing variables, combining them, or by converting text to numerical word frequencies. Example 8.4 shows how easy it is to train a classifier using the Naïve Bayes algorithm with packages caret/naivebayes in R and sklearn in Python (this approach will be better explained in Section 8.3.1).

Example 8.4 A simple Naïve Bayes classifier

Python code
R code

myclassifier = GaussianNB()
myclassifier.fit(X_train, y_train)

GaussianNB()

y_pred = myclassifier.predict(X_test)

myclassifier = train(x = X_train, y = y_train, 
                     method = "naive_bayes")
y_pred = predict(myclassifier, newdata = X_test)

But before we can actually use this classifier to do some useful work, we need to test how capable it is of predicting the correct labels, given a set of features. One might think that we could just feed it the same input data (i.e., the same features) again and see whether the predicted labels match the actual labels of the training dataset. In fact, we could do that. But this test would not be strict enough: after all, the classifier has been trained on exactly these data, and therefore one would expect it to perform pretty well. In particular, it may be that the classifier is very good at predicting its own training data, but fails at predicting other data, because it overgeneralizes some idiosyncrasy in the data, a phenomenon known as overfitting (see Figure 8.1).

Figure 8.1: Underfitting and overfitting. Example adapted from https://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html

Instead, we use the features of the test dataset (stored in the object X_test) as input for our classifier, and evaluate how far the predicted labels match the actual labels (stored in y_test). Remember: the classifier has at no point in time seen the actual labels of the test dataset. Therefore, we can in fact calculate how often the prediction is right.⁴
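To make the difference between evaluating on training data and on test data concrete, consider the following sketch. It relies on assumptions that are not part of the chapter's examples: the data are synthetic (generated with make_classification), and we use a decision tree only because an unrestricted tree overfits very easily.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data in which a share of the labels is assigned at random (noise)
X, y = make_classification(
    n_samples=1000, n_features=10, flip_y=0.2, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# An unrestricted tree can memorize the training data, including the noise
tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)

train_acc = tree.score(X_tr, y_tr)
test_acc = tree.score(X_te, y_te)
print(f"Accuracy on training data: {train_acc:.2f}")
print(f"Accuracy on test data:     {test_acc:.2f}")
```

The accuracy on the training data says little about how the classifier will do on new cases; only the accuracy on the held-out test data gives a realistic impression.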

Example 8.5 Calculating precision and recall

Python code
R code

print("Confusion matrix:")

Confusion matrix:

print(confusion_matrix(y_test, y_pred))

[[ 55 106]
 [ 40 212]]

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

    non-user       0.58      0.34      0.43       161
        user       0.67      0.84      0.74       252

    accuracy                           0.65       413
   macro avg       0.62      0.59      0.59       413
weighted avg       0.63      0.65      0.62       413

print(confusionMatrix(y_pred, y_test))

Confusion Matrix and Statistics

          Reference
Prediction non-user user
  non-user       62   53
  user           99  199
                                          
               Accuracy : 0.632           
                 95% CI : (0.5834, 0.6786)
    No Information Rate : 0.6102          
    P-Value [Acc > NIR] : 0.1958408       
                                          
                  Kappa : 0.1843          
                                          
 Mcnemar's Test P-Value : 0.0002623       
                                          
            Sensitivity : 0.3851          
            Specificity : 0.7897          
         Pos Pred Value : 0.5391          
         Neg Pred Value : 0.6678          
             Prevalence : 0.3898          
         Detection Rate : 0.1501          
   Detection Prevalence : 0.2785          
      Balanced Accuracy : 0.5874          
                                          
       'Positive' Class : non-user

print("Confusion matrix:")

[1] "Confusion matrix:"

confmat = table(testdata$usesinternet, y_pred)
print(confmat)

          y_pred
           non-user user
  non-user       62   99
  user           53  199

print("Precision for predicting True internet")

[1] "Precision for predicting True internet"

print("users and non-internet-users:")

[1] "users and non-internet-users:"

precision = diag(confmat) / colSums(confmat)
print(precision)

 non-user      user 
0.5391304 0.6677852

print("Recall for predicting True internet")

[1] "Recall for predicting True internet"

print("users and non-internet-users:")

[1] "users and non-internet-users:"

recall = (diag(confmat) / rowSums(confmat))
print(recall)

 non-user      user 
0.3850932 0.7896825

As shown in Example 8.5, we can create a confusion matrix (generated with caret function confusionMatrix in R and sklearn function confusion_matrix in Python), and then estimate two measures: precision and recall (using base R calculations in R and sklearn function classification_report in Python). In a binary classification, the confusion matrix is a useful table in which each column usually represents the number of cases in a predicted class, and each row the number of cases in the real or actual class. With this matrix (see Figure 8.2) we can then estimate the number of true positives (TP) (correct prediction), false positives (FP) (incorrect prediction), true negatives (TN) (correct prediction) and false negatives (FN) (incorrect prediction).

Figure 8.2: Visual representation of a confusion matrix.
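The precision and recall values in the Python classification report of Example 8.5 can also be derived directly from the confusion matrix, analogous to the base R calculation. The sketch below copies the matrix by hand from the Python output of Example 8.5 and uses numpy; the variable names are ours.

```python
import numpy as np

# Confusion matrix from the Python output of Example 8.5
# (rows: actual class, columns: predicted class; order: non-user, user)
confmat = np.array([[55, 106],
                    [40, 212]])

# Correct predictions are on the diagonal
correct = np.diag(confmat)

# Precision: correct predictions divided by all predictions per class (columns)
precision = correct / confmat.sum(axis=0)
# Recall: correct predictions divided by all actual cases per class (rows)
recall = correct / confmat.sum(axis=1)

print(precision.round(2))
print(recall.round(2))
```

Rounded to two decimals, these are the values that classification_report lists in its precision and recall columns.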

For a better understanding of these concepts, imagine that we build a sentiment classifier that predicts – based on the text of a movie review – whether it is a positive review or a negative review. Let us assume that the goal of training this classifier is to build an app that recommends only good movies to the user. There are two things that we want to achieve: we want to find as many positive films as possible (recall), but we also want that the selection we found only contains positive films (precision).

Precision is calculated as $\frac{\rm{TP}}{\rm{TP}+\rm{FP}}$, where TP are true positives and FP are false positives. For example, if our classifier retrieves 200 films that it classifies as positive films, but only 150 of them indeed are positive films, then the precision is $\frac{150}{150+50} = \frac{150}{200} = 0.75$.

Recall is calculated as $\frac{\rm{TP}}{\rm{TP}+\rm{FN}}$, where TP are true positives and FN are false negatives. If we know that the classifier from the previous paragraph missed 20 positive films, then the recall is $\frac{150}{150+20} = \frac{150}{170} \approx 0.88$.

In other words: recall measures how many of the cases we wanted to find we actually found. Precision measures how much of what we have found is actually correct.

Often, we have to make a trade-off between precision and recall. For example, just retrieving every film would give us a recall of 1.0 (after all, we didn’t miss a single positive film). But on the other hand, we retrieved all the negative films as well, so precision will be extremely low. It can depend on the task at hand whether precision or recall is more important. In the section on validation, we discuss this trade-off in detail, as well as other metrics such as accuracy, $F_1$-score or the area under the curve (AUC).
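The extreme case of this trade-off, retrieving everything, can be sketched with a few lines of code. The collection below is hypothetical (170 positive and 830 negative films, numbers we made up for illustration); precision_score and recall_score come from sklearn.metrics.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical collection: 170 positive films (1) and 830 negative films (0)
y_true = [1] * 170 + [0] * 830

# "Retrieve everything": label every single film as positive
y_pred_all = [1] * len(y_true)

recall_all = recall_score(y_true, y_pred_all)
precision_all = precision_score(y_true, y_pred_all)
print(f"Recall:    {recall_all:.2f}")
print(f"Precision: {precision_all:.2f}")
```

Recall is perfect because no positive film is missed, while precision simply equals the share of positive films in the collection.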

8.3 Classical Machine Learning: From Naïve Bayes to Neural Networks

To do supervised machine learning, we can use several models, all of which have different advantages and disadvantages, and are more useful for some use cases than for others. We limit ourselves to the most common ones in this chapter. The website of scikit-learn (www.scikit-learn.org) gives a good overview of more alternatives.

8.3.1 Naïve Bayes

The Naïve Bayes classifier is a very simple classifier that is often used as a “baseline”. Before estimating more complicated and resource-intensive models, it is a good idea to estimate a simpler model first, to assess how much better the other model actually is. Sometimes, the simple model might even be just fine.

The Naïve Bayes classifier allows you to predict a binary outcome, such as: “Is this message spam or not?”, “Is this article about politics or not?”, “Will this go viral or not?”. It, in fact, also allows you to do the same with more than one category, and both the Python and the R implementation will happily let you train a Naïve Bayes classifier on nominal data, such as whether an article is about politics, sports, the economy, or something different.

For the sake of simplicity, we will discuss a binary example, though.

As its name suggests, a Naïve Bayes classifier is based on Bayes’ theorem, and it is “naïve”. It may sound a bit weird to call a model “naïve”, but what it actually means is not so much that it is stupid, but that it makes very far-reaching assumptions about the data (hence, it is naïve). Specifically, it assumes that all features are independent from each other. Of course, that is hardly ever the case – for instance, in a survey data set, while age and gender indeed are generally independent from each other, this is not the case for education, political interest, media use, and so on. And in textual data, whether a word $W_1$ is used is not independent from the use of word $W_2$ – after all, both are not randomly drawn from a dictionary, but depend on the topic of the text (and other things). Astonishingly, even though these assumptions are regularly violated, the Naïve Bayes classifier works reasonably well in practice.

The Bayes part of the Naïve Bayes classifier comes from the fact that it uses Bayes’ formula, $P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}$.
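To see how Bayes’ formula turns the probability of a feature given a class into the probability of a class given a feature, consider a small worked example. All three input probabilities below are invented for illustration and do not come from any dataset in this chapter.

```python
# Hypothetical numbers for a spam filter (all three are assumptions)
p_spam = 0.2             # P(spam): share of messages that are spam
p_word_given_spam = 0.5  # P(word | spam)
p_word_given_ham = 0.05  # P(word | not spam)

# P(word): total probability of seeing the word in any message
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' formula: P(spam | word) = P(word | spam) * P(spam) / P(word)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(f"{p_spam_given_word:.2f}")
```

Although only one in five messages is spam in this made-up scenario, observing the word raises the probability of spam considerably, because the word is ten times more likely to occur in spam than in other messages.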

8.3.2 Train, Validate, Test

By now, we have established which measures we can use to decide which model to use. For all of them, we have assumed that we split our labeled dataset into two: a training dataset and a test dataset. The logic behind it was simple: if we calculate precision and recall on the training data itself, our assessment would be too optimistic – after all, our models have been trained on exactly these data, so predicting the label isn’t too hard. Assessing the models on a different dataset, the test dataset, instead, gives us an assessment of what precision and recall look like if the labels haven’t been seen earlier – which is exactly what we want to know.

Unfortunately, if we calculate precision and recall (or any other metric) for multiple models on the same test dataset, and use these results to determine which model to use, we can run into a problem: we may avoid overfitting of our model on the training data, but we now risk overfitting it on the test data! After all, we could tweak our models until they fit our test data perfectly, even if this makes the predictions for other cases worse.

One way to avoid this is to split the original data into three datasets instead of two: a training dataset, a validation dataset, and a test dataset. We train multiple model configurations on the training dataset and calculate the metrics of interest for all of them on the validation dataset. Once we have decided on a final model, we calculate its performance (once) on the test dataset, to get an unbiased estimate of its performance.
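Such a three-way split can be obtained by calling train_test_split twice. The sketch below uses randomly generated stand-in data, and the 60:20:20 ratio is our own choice for illustration, not a recommendation from the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a labeled dataset: 1000 cases, 3 features
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
y = rng.integers(0, 2, size=1000)

# Step 1: set aside 20% of all cases as the final test dataset
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Step 2: split the remainder into training (75%) and validation (25%),
# which corresponds to 60% and 20% of the original data
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)
print(len(X_train), len(X_val), len(X_test))
```

Model configurations would then be compared on X_val and y_val, while X_test and y_test stay untouched until the final model has been chosen.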

8.3.3 Cross-validation and Grid Search

In an ideal world, we would have a huge labeled dataset and would not need to worry about the decreasing size of our training dataset as we set aside our validation and test datasets.

Unfortunately, our labeled datasets in the real world have a limited size, and setting aside too many cases can be problematic. Especially if you are already on a tight budget, setting aside not only a test dataset, but also a validation dataset of meaningful size may lead to critically small training datasets. While we have addressed the problem of overfitting, this could lead to underfitting: we may have removed the only examples of some specific feature combination, for instance.

A common approach to address this issue is $k$-fold cross-validation. To do this, we split our training data into $k$ partitions, known as folds. We then estimate our model $k$ times, and each time leave one of the folds aside for validation. Hence, every fold serves exactly once as the validation dataset, and is exactly $k-1$ times part of the training data. We then simply average the $k$ values of the evaluation metric we are interested in.
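The procedure can be written out as an explicit loop, which may help to see what convenience functions for cross-validation do internally. This is a sketch under assumptions: the data are synthetic, and we use sklearn's KFold to create the folds.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic stand-in for the training data
X, y = make_classification(n_samples=500, n_features=5, random_state=42)

k = 5
folds = KFold(n_splits=k, shuffle=True, random_state=42)

scores = []
times_validated = np.zeros(len(X), dtype=int)
for train_idx, val_idx in folds.split(X):
    # Estimate the model on k-1 folds ...
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    # ... and evaluate it on the fold that was left aside
    scores.append(model.score(X[val_idx], y[val_idx]))
    times_validated[val_idx] += 1

# Average the k accuracy values
mean_accuracy = np.mean(scores)
print(f"M={mean_accuracy:.2f}, SD={np.std(scores):.3f}")
```

The array times_validated confirms that every case is used for validation exactly once.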

If our classifier generalizes well, we would expect that our metric of interest (e.g., the accuracy, or the $F_1$-score, …) is very similar in all folds. Example 8.6 performs a cross-validation based on the logistic regression classifier we built above. We see that the standard deviation is really low, indicating that there are almost no changes between the runs, which is great.

Running the same cross-validation on our random forest, instead, would produce not only worse (lower) means, but also worse (higher) standard deviations, even though also here, there are no dramatic changes between the runs.

Example 8.6 Cross-validation

Python code
R code

myclassifier = LogisticRegression(solver="lbfgs")
acc = cross_val_score(
    estimator=myclassifier, X=X_train, y=y_train, scoring="accuracy", cv=5
)
print(acc)

[0.64652568 0.64048338 0.62727273 0.64242424 0.63636364]

print(f"M={acc.mean():.2f}, SD={acc.std():.3f}")

M=0.64, SD=0.007

myclassifier = train(x = X_train, y = y_train,
    method = "glm", family="binomial",
    metric="Accuracy", trControl = trainControl(
     method = "cv", number = 5, 
     returnResamp ="all", savePredictions=TRUE))
print(myclassifier$resample)

   Accuracy     Kappa parameter Resample
1 0.6646526 0.2564808      none    Fold1
2 0.6616314 0.2441998      none    Fold2
3 0.6606061 0.2057079      none    Fold3
4 0.6575758 0.2099241      none    Fold4
5 0.6333333 0.1670491      none    Fold5

print(myclassifier$results)

  parameter  Accuracy     Kappa AccuracySD    KappaSD
1      none 0.6555598 0.2166724 0.01267959 0.03525159

Very often, cross-validation is used when we want to compare many different model specifications, for example to find optimal hyperparameters. Hyperparameters are parameters of the model that are not estimated from the data. These depend on the model, but could for example be the estimation method to use, the number of times a bootstrap should be repeated, etc. Very good examples are the hyperparameters of support vector machines (see above): it is hard to know how soft our margins should be (the $C$), and we may also be unsure about the right kernel (Example 8.8), or in the case of a polynomial kernel, how many degrees we want to consider.

Using the help function (e.g., RandomForestClassifier? in Python), you can look up which hyperparameters you can specify. For a random forest classifier, for instance, this includes the number of estimators in the model, the criterion, and whether or not to use bootstrapping. Example 8.7, Example 8.8, and Example 8.9 illustrate how you can automatically assess which values you should choose.
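Besides the interactive help, the available hyperparameters can also be listed programmatically. As a small sketch: every scikit-learn estimator has a get_params method that returns its hyperparameters together with their current values (the defaults, if nothing was specified).

```python
from sklearn.ensemble import RandomForestClassifier

# get_params() returns all hyperparameters and their current (default) values
params = RandomForestClassifier().get_params()

# Show the three hyperparameters mentioned in the text
for name in ["n_estimators", "criterion", "bootstrap"]:
    print(name, "=", params[name])
```

The keys of this dictionary are exactly the names that can be used when specifying a grid of candidate values.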

Note that in R, not all parameters are “tunable” using standard caret. Therefore, an exact replication of the grid searches in Example 8.7 and Example 8.8 would require either manual comparisons or writing a so-called caret extension.

Example 8.7 A simple gridsearch in Python

Python code

f1scorer = make_scorer(f1_score, pos_label="user")

myclassifier = RandomForestClassifier()

grid = {
    "n_estimators": [10, 50, 100, 200],
    "criterion": ["gini", "entropy"],
    "bootstrap": [True, False],
}
search = GridSearchCV(
    estimator=myclassifier, param_grid=grid, scoring=f1scorer, cv=5
)
search.fit(X_train, y_train)

GridSearchCV(cv=5, estimator=RandomForestClassifier(),
             param_grid={'bootstrap': [True, False],
                         'criterion': ['gini', 'entropy'],
                         'n_estimators': [10, 50, 100, 200]},
             scoring=make_scorer(f1_score, pos_label=user))

print(search.best_params_)

{'bootstrap': True, 'criterion': 'entropy', 'n_estimators': 200}

print(classification_report(y_test, search.predict(X_test)))

              precision    recall  f1-score   support

    non-user       0.43      0.38      0.40       161
        user       0.63      0.68      0.65       252

    accuracy                           0.56       413
   macro avg       0.53      0.53      0.53       413
weighted avg       0.55      0.56      0.56       413
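The best_params_ attribute only reports the winning combination. To see how all combinations compare, you can inspect the cv_results_ attribute of a fitted GridSearchCV object, for example by converting it to a data frame. The following is a minimal sketch on synthetic data with a deliberately small grid; the names X_demo, y_demo, demo_grid, demo_search, and results are assumptions for illustration and not part of the examples above.

```python
# Sketch: inspecting all grid search results, not only the best one
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic (made-up) data with labels 0 and 1
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, random_state=0
)

demo_grid = {"n_estimators": [10, 50], "criterion": ["gini", "entropy"]}
demo_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=demo_grid,
    scoring="f1",  # F1 score of the class labeled 1
    cv=5,
)
demo_search.fit(X_demo, y_demo)

# One row per hyperparameter combination
results = pd.DataFrame(demo_search.cv_results_)
columns = [
    "param_n_estimators",
    "param_criterion",
    "mean_test_score",
    "std_test_score",
    "rank_test_score",
]
print(results[columns].sort_values("rank_test_score"))
```

If the mean scores of several combinations lie within roughly one standard deviation of each other, the choice between them probably matters little, and you may prefer the cheaper model.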

Example 8.8 A gridsearch in Python using multiple CPUs

Python code

myclassifier = SVC(gamma="scale")

grid = {"C": [100, 1e4], "kernel": ["linear", "rbf", "poly"], "degree": [3, 4]}

search = GridSearchCV(
    estimator=myclassifier,
    param_grid=grid,
    scoring=f1scorer,
    cv=5,
    n_jobs=-1,  # use all cpus
    verbose=10,
)
search.fit(X_train_scaled, y_train)

Fitting 5 folds for each of 12 candidates, totalling 60 fits
GridSearchCV(cv=5, estimator=SVC(), n_jobs=-1,
             param_grid={'C': [100, 10000.0], 'degree': [3, 4],
                         'kernel': ['linear', 'rbf', 'poly']},
             scoring=make_scorer(f1_score, pos_label=user), verbose=10)

print(f"Hyperparameters {search.best_params_} " "give the best performance:")

Hyperparameters {'C': 100, 'degree': 3, 'kernel': 'poly'} give the best performance:

print(classification_report(y_test, search.predict(X_test_scaled)))

              precision    recall  f1-score   support

    non-user       0.58      0.04      0.08       161
        user       0.62      0.98      0.76       252

    accuracy                           0.62       413
   macro avg       0.60      0.51      0.42       413
weighted avg       0.60      0.62      0.49       413
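The 12 candidates reported above are the product of 2 values for C, 3 kernels, and 2 degrees. Because SVC ignores degree for all kernels except the polynomial one, some of these fits are redundant. As a sketch of one possible alternative, param_grid can also be given as a list of dictionaries, so that degree is only varied for the polynomial kernel. The names full_grid, split_grid, n_full, and n_split are assumptions for illustration; ParameterGrid is used here only to count the resulting candidates.

```python
# Sketch: counting the candidates of a full grid versus a split grid
from sklearn.model_selection import ParameterGrid

# The grid as used in Example 8.8
full_grid = {
    "C": [100, 1e4],
    "kernel": ["linear", "rbf", "poly"],
    "degree": [3, 4],
}

# Alternative: vary degree only where it has an effect
split_grid = [
    {"C": [100, 1e4], "kernel": ["linear", "rbf"]},
    {"C": [100, 1e4], "kernel": ["poly"], "degree": [3, 4]},
]

n_full = len(ParameterGrid(full_grid))
n_split = len(ParameterGrid(split_grid))
print(n_full, n_split)  # prints: 12 8
```

Passing split_grid as param_grid to GridSearchCV would then fit 8 instead of 12 candidates per fold, which can save noticeable time with slow models such as support vector machines.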

Example 8.9 A gridsearch in R

R code

# Create the grid of parameters
grid = expand.grid(Loss=c("L1","L2"),
                   cost=c(100,1000))

# Train the model using our previously defined 
# parameters
gridsearch = train(x = X_train, y = y_train,
    preProcess = c("center", "scale"), 
    method = "svmLinear3", 
    trControl = trainControl(method = "cv", 
            number = 5),
    tuneGrid = grid)
gridsearch

L2 Regularized Support Vector Machine (dual) with Linear Kernel 

1652 samples
   3 predictor
   2 classes: 'non-user', 'user' 

Pre-processing: centered (3), scaled (3) 
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 1322, 1322, 1321, 1321, 1322 
Resampling results across tuning parameters:

  Loss  cost  Accuracy   Kappa    
  L1     100  0.6458555  0.1994112
  L1    1000  0.5587091  0.1483755
  L2     100  0.6525185  0.2102270
  L2    1000  0.6525185  0.2102270

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were cost = 100 and Loss = L2.

Note

Supervised machine learning is one of the areas where you really see differences between Python and R. In Python, virtually all you need is available via scikit-learn; because all components are developed within one package, they work together with little friction. In R, we often need to combine caret with various libraries that provide the actual models. You can see this in the gridsearch examples in this section. In scikit-learn, any hyperparameter can be part of the grid, but no hyperparameter has to be. In R, in contrast, you cannot (at least, not easily) put just any parameter of the model in the grid. Instead, you have to look up the “tunable parameters” in the caret documentation, and these must be present in the grid. This means that an exact replication of the grid searches in Example 8.7 and Example 8.8 is not natively supported by caret and requires either manual testing or writing a so-called caret extension.

While in the end, you can find a supervised machine learning solution for all your use cases in R as well, if supervised machine learning is at the core of your project, it may save you a lot of cursing to do this in Python. Hopefully, the package will provide a better solution for machine learning in R in the near future.

You can download the file from cssbook.nl/d/media.csv
For a detailed description of the dataset, see Trilling (2013).
In ?sec-validation, we discuss more advanced approaches, such as splitting into training, validation, and test datasets, or cross-validation.
We assume here that the manual annotation is always right; an assumption that one may, of course, challenge. However, in the absence of any better proxy for reality, we assume that this manual annotation is the so-called gold standard that reflects the ground truth as closely as possible, and that it by definition cannot be outperformed. When creating the manual annotations, it is therefore important to safeguard their quality. In particular, one should calculate and report some reliability measures, such as the intercoder reliability which tests the degree of agreement between two or more annotators in order to check if our classes are well defined and the coders are doing their work correctly.