
# 10. Text as Data

Abstract: This chapter shows how you can analyze texts that are stored as a data frame column or variable, using functions from the package quanteda in R and the package sklearn in Python. Please see Chapter 9 for more information on reading and cleaning text.
Keywords: Text as Data, Document-Term Matrix
Chapter objectives:
• Create a document-term matrix from text
• Perform document and feature selection and weighting
• Understand and use more advanced representations such as n-grams and embeddings
This chapter introduces the quanteda package (R) and the sklearn and nltk packages (Python) for converting text into a document-term matrix. It also introduces the udpipe package for natural language processing. You can install these packages with the code below if needed (see Section 1.4 for more details):
Python code
!pip3 install ufal.udpipe spacy nltk scikit-learn==0.24.2
!pip3 install gensim==4.0.1 wordcloud nagisa conllu tensorflow==2.5.0 tensorflow-estimator==2.5.0

R code
install.packages(c("glue","tidyverse","quanteda", 
    "quanteda.textstats", "quanteda.textplots", 
    "udpipe", "spacyr"))  
After installing, you need to import (activate) the packages every session:
Python code
# Standard library and basic data wrangling
import os
import sys
import urllib
import urllib.request
import re
import regex
import pandas as pd
import numpy as np

# Tokenization
import nltk
from nltk.tokenize import (TreebankWordTokenizer, 
                           WhitespaceTokenizer)
from nltk.corpus import stopwords
nltk.download('stopwords')
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfVectorizer)
import nagisa

# For plotting word clouds
%matplotlib inline
from matplotlib import pyplot as plt
from wordcloud import WordCloud

# Natural language processing
import spacy
import ufal.udpipe
from gensim.models import KeyedVectors, Phrases
from gensim.models.phrases import Phraser
from ufal.udpipe import Model, Pipeline
import conllu  
R code
library(glue)
library(tidyverse)
# Tokenization
library(quanteda)
library(quanteda.textstats)
library(quanteda.textplots)
# Natural language processing
library(udpipe)
library(spacyr)



## 10.1. The Bag of Words and the Term-Document Matrix

Before you can conduct any computational analysis of text, you need to solve a problem: computations are usually done on numerical data – but you have text. Hence, you must find a way to represent the text by numbers. The document-term matrix (DTM, also called the term-document matrix or TDM) is one common numerical representation of text. It represents a corpus (or set of documents) as a matrix or table, where each row represents a document, each column represents a term (word), and the numbers in each cell show how often that word occurs in that document.

#### Example 10.1. Example document-term matrix

Python code
texts = [
    "The caged bird sings with a fearful trill", 
    "for the caged bird sings of freedom"]
cv = CountVectorizer()
d = cv.fit_transform(texts)
# Create a dataframe of the word counts to inspect
# - todense transforms the dtm into a dense matrix
# - get_feature_names() gives a list of the words
pd.DataFrame(d.todense(), 
             columns=cv.get_feature_names())   
R code
texts = c(
    "The caged bird sings with a fearful trill", 
    "for the caged bird sings of freedom")
d = tokens(texts) %>% dfm()
# Inspect by converting to a (dense) matrix
convert(d, "matrix") 


R output. Note that Python output may look slightly different
A matrix: 2 × 11 of type dbl
      the caged bird sings with a fearful trill for of freedom
text1   1     1    1     1    1 1       1     1   0  0       0
text2   1     1    1     1    0 0       0     0   1  1       1

As an example, Example 10.1 shows a DTM made from two lines of the famous poem by Maya Angelou. The resulting matrix has two rows, one for each line; and 11 columns, one for each unique term (word). The cells show how often each term occurs in each document: the word “bird” occurs once in each line, but the word “with” occurs only in the first line (text1) and not in the second (text2).

In R, you can use the dfm function from the quanteda package (Benoit et al., 2018). This function takes a vector or column of texts and transforms it directly into a DTM (which quanteda actually calls a document-feature matrix, hence the function name dfm). In Python, you achieve the same by creating an object of the CountVectorizer class, which has a fit_transform method.
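
Note that fit_transform both learns the vocabulary and produces the matrix in a single step; once fitted, the vectorizer can also count new documents against the same columns using transform. The snippet below is a minimal sketch of this (our own illustration, not one of the chapter's examples), reusing the texts from Example 10.1:

Python code
# Minimal sketch: fit_transform learns the vocabulary and counts words;
# transform reuses that vocabulary, so new texts get the same columns
# (words not seen during fitting are silently dropped)
cv = CountVectorizer()
dtm = cv.fit_transform(texts)
dtm_new = cv.transform(["the bird sings of freedom"])
print(dtm_new.shape)  # (1, 11): one document, same 11 columns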

### 10.1.1. Tokenization

In order to turn a corpus into a matrix, each text needs to be tokenized, meaning that it must be split into a list (vector) of words. This seems trivial, as English (and most Western) text generally uses spaces to demarcate words. However, even for English there are a number of edge cases. For example, should “haven't” be seen as a single word, or two?

#### Example 10.2. Differences between tokenizers

Python code
text = "I haven't seen John's derring-do"
tokenizer = CountVectorizer().build_tokenizer()
print(tokenizer(text))


R code
text = "I haven't seen John's derring-do"
tokens(text)


Python output
['haven', 'seen', 'John', 'derring', 'do']
R output
Tokens consisting of 1 document.
text1 :
[1] "I"          "haven't"    "seen"       "John's"     "derring-do"

Example 10.2 shows how Python and R deal with the sentence “I haven't seen John's derring-do”. For Python, we first use CountVectorizer.build_tokenizer to access the built-in tokenizer. As you can see in the output, this tokenizes “haven't” to haven, which of course has a radically different meaning. Moreover, it silently drops all single-character tokens, including the t (from haven't), the s (from John's), and the word I.

In the box “Tokenizing in Python” below, we therefore discuss some alternatives. For instance, the TreebankWordTokenizer included in the nltk package splits “haven't” into have and n't, which is arguably a more reasonable outcome. Unfortunately, this tokenizer assumes that the text has already been split into sentences, and it also includes punctuation as tokens by default. To circumvent this, we can introduce a custom tokenizer based on the Treebank tokenizer, which first splits the text into sentences (using nltk.sent_tokenize) – see the box for more details, and the sketch below for an impression of this approach.
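
The snippet below is a minimal sketch of such a custom tokenizer (our own illustration; the box may implement this differently, and the helper name treebank_tokenize is ours). It splits the text into sentences, tokenizes each sentence with the Treebank tokenizer, and keeps only tokens that contain at least one letter, assuming the imports from the start of this chapter:

Python code
# nltk.sent_tokenize needs the 'punkt' sentence-splitting model:
nltk.download('punkt')

def treebank_tokenize(text):
    tokenizer = TreebankWordTokenizer()
    tokens = []
    for sentence in nltk.sent_tokenize(text):
        # keep only tokens containing at least one letter,
        # which drops punctuation-only tokens such as "."
        tokens += [t for t in tokenizer.tokenize(sentence)
                   if regex.search(r"\p{L}", t)]
    return tokens

print(treebank_tokenize("I haven't seen John's derring-do."))
# ['I', 'have', "n't", 'seen', 'John', "'s", 'derring-do']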

For R, we simply call the tokens function from the quanteda package. This keeps haven't and John's as single words, which is probably less desirable than splitting them, but at least better than outputting the word haven.

As this simple example shows, even a relatively simple sentence is tokenized differently by the tokenizers considered here (and see the box on tokenization in Python). Depending on the research question, these differences might or might not be important. However, it is always a good idea to check the output of this (and other) preprocessing steps so you understand what information is kept or discarded.
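
As a quick illustration of such a check (our own sketch, not part of the chapter's examples), you can run the same sentence through two of the nltk tokenizers imported above and compare the results side by side:

Python code
# Sanity check: compare two nltk tokenizers on the same sentence
text = "I haven't seen John's derring-do"
for tok in [WhitespaceTokenizer(), TreebankWordTokenizer()]:
    print(tok.__class__.__name__, tok.tokenize(text))
# WhitespaceTokenizer ['I', "haven't", 'seen', "John's", 'derring-do']
# TreebankWordTokenizer ['I', 'have', "n't", 'seen', 'John', "'s", 'derring-do']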