12 Text Mining & NLP
There are various types of analysis you can do with text data, such as n-grams, sentiment analysis, and topic modelling. Various packages are available; the main ones used in this chapter are tidytext and tm (which builds on NLP).
12.0.1 Text data
The source data for your text may come in various formats, for example a single string or a data frame. Ideally you will put these into a tidy format.
12.0.2 Tidy text
Ideally you will convert your text data into a tidy format using the tidytext package.
library(tidytext)
textdoc %>% unnest_tokens(input=text, output=word, token="words", to_lower = TRUE) -> tidytext
# This splits up your text into 'tokens'.
# By default a token is a word, but other options include "characters", "sentences","ngrams".
# By default all text will be converted to lowercase, and punctuation (eg .,!?£$&) will be removed.
# Numbers are not removed.

tidytext includes a stop_words tibble, which contains stopwords sourced from the SMART, snowball, and onix lexicons. You can remove these stopwords from your tidy data with an anti join, or create your own custom stopword tibble.
## # A tibble: 1,149 x 2
## word lexicon
## <chr> <chr>
## 1 a SMART
## 2 a's SMART
## 3 able SMART
## 4 about SMART
## 5 above SMART
## 6 according SMART
## 7 accordingly SMART
## 8 across SMART
## 9 actually SMART
## 10 after SMART
## # ... with 1,139 more rows
## Joining, by = "word"
# Remove stopwords from a particular source (eg. snowball) from your data
tidytext %>% anti_join(filter(stop_words, lexicon=="snowball")) -> tidytext2
## Joining, by = "word"
# Custom stopwords
custom_sw <- tibble(word=c("a","in","on","and"))
tidytext %>% anti_join(custom_sw) -> tidytext2
## Joining, by = "word"
12.0.3 Stemming
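A common approach is to stem the tidy tokens with the SnowballC package. A minimal sketch, assuming the tidytext2 tibble created above (the use of wordStem here is an illustration, not code from the original notes):

library(dplyr)
library(SnowballC)
# Reduce each word to its stem, eg. "running" and "runs" both become "run"
tidytext2 %>% mutate(stem = wordStem(word)) -> tidytext_stemmed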
12.0.4 Basic text stats
If the text data is in a tidy format you can easily use dplyr to manipulate it, for example to produce word frequencies:
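The word counts below were presumably produced by something like this (a sketch, assuming the tidytext tibble from 12.0.2):

tidytext %>% count(word, sort = TRUE)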
## # A tibble: 13 x 2
## word n
## <chr> <int>
## 1 a 3
## 2 far 2
## 3 and 1
## 4 away 1
## 5 dark 1
## 6 galaxy 1
## 7 in 1
## 8 night 1
## 9 on 1
## 10 once 1
## 11 stormy 1
## 12 time 1
## 13 upon 1
12.0.5 N-gram analysis
To look at neighbouring words, use the unnest_tokens function again:
textdoc %>% unnest_tokens(input=text, output=ngram, token="ngrams", n=2) -> ngramdoc
# All text will be converted to lowercase, and punctuation (eg .,!?£$&) will be removed, but numbers are not removed.
# N-gram frequency
ngramdoc %>% count(ngram, sort = TRUE)
## # A tibble: 13 x 2
## ngram n
## <chr> <int>
## 1 a dark 1
## 2 a galaxy 1
## 3 a time 1
## 4 and stormy 1
## 5 dark and 1
## 6 far away 1
## 7 far far 1
## 8 galaxy far 1
## 9 in a 1
## 10 on a 1
## 11 once upon 1
## 12 stormy night 1
## 13 upon a 1
12.0.6 Text Preparation with tm
An alternative way of working with text data is to use the tm package, which includes text cleaning functions.
This package uses a data structure called a Corpus.
library(tm)
# Convert your text data into a Corpus
myCorpus <- VCorpus(VectorSource(textdoc))
# You can clean your corpus using the tm_map() function. This has various options:
tm_map(myCorpus, content_transformer(tolower)) # Changes text to lowercase
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 2
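A few of the other cleaning transformations that tm provides (these are standard tm functions, shown as a sketch rather than the exact calls used in the original chunk):

tm_map(myCorpus, removePunctuation) # Removes punctuation
tm_map(myCorpus, removeNumbers)     # Removes digits
tm_map(myCorpus, stripWhitespace)   # Collapses repeated whitespace
tm_map(myCorpus, stemDocument)      # Stems words (uses the SnowballC package)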
# A stopwords() function is also available, and can be combined with removeWords, for example:
tm_map(myCorpus, removeWords, c(stopwords("en")) ) # Removes stopwords in the snowball list
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 2
12.1 Create a Term Document Matrix
Transforming text data into a matrix allows you to do further modelling, such as LDA, Naive Bayes, or regression.

From tidy data
# First, the tidy data needs to be summarised so that it contains the count of each token per document.
tidytext %>% count(line, word, sort = TRUE) -> tidytext_count
# Now use the cast_dtm function to convert this into a Document Term Matrix.
tidytext_count %>% cast_dtm(line, word, n) -> myDTM
## <<DocumentTermMatrix (documents: 3, terms: 13)>>
## Non-/sparse entries: 15/24
## Sparsity : 62%
## Maximal term length: 6
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs a and away dark far galaxy in once time upon
## 1 1 0 0 0 0 0 0 1 1 1
## 2 1 0 1 0 2 1 1 0 0 0
## 3 1 1 0 1 0 0 0 0 0 0
12.1.1 From a Corpus
myDTM <- tm::DocumentTermMatrix(myCorpus)
# By default words will be converted to lowercase, and words with <3 characters are removed. To alter the defaults, you can specify a control list.
myDTM <- tm::DocumentTermMatrix(myCorpus, control = list(tolower=FALSE,
                                                          wordLengths=c(1,Inf),
                                                          removePunctuation=FALSE))

12.1.2 Matrix Management
To perform calculations on very large matrices you may need to use the slam package.
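The output below appears to come from summing the rows and columns of the matrix; with slam this could be done as follows (a sketch, the exact calls are an assumption):

library(slam)
row_sums(myDTM) # Total number of terms in each document
col_sums(myDTM) # Total count of each term across the corpus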
## 1 2
## 3 16
## 1 2 3 a and away. dark far galaxy in
## 1 1 1 3 1 1 1 2 1 1
## night on Once stormy time upon
## 1 1 1 1 1 1
12.2 Topic Modelling & Latent Dirichlet Allocation (LDA)
Topic Modelling is a common method for discovering topics in text, such as free-text comments. Most topic modelling techniques, such as LDA (Latent Dirichlet Allocation), require you to choose the number of topics, and will then use an algorithm to create the topics.
- You must interpret what these topics mean
- Some words are equally likely to appear across topics, so a word like “account” could appear in several topic lists.
You can build an unsupervised LDA model from a DocumentTermMatrix:
library(topicmodels) # Not available #
my_lda <- LDA(myDTM, k=4, control = list(seed=87533)) # k is the number of topics you want.
# Option 1 : using tidytext to examine topic probabilities
topics_beta <- tidy(my_lda, matrix="beta") # beta represents word/topic probabilities
topicstats <- topics_beta %>% group_by(topic) %>% top_n(10,beta) %>% ungroup() %>% arrange(topic, -beta) #Top terms for each topic
topics_gamma <- tidy(my_lda, matrix="gamma") # gamma represents document/topic probabilities
classification <- topics_gamma %>% group_by(document) %>% top_n(1,gamma) %>% ungroup() # Most likely topic for each document
# Merge back to original document
classification <- mutate(classification, id=as.numeric(document))
final <- left_join(textdoc, classification, by="id")

Initially, each word from each document is randomly assigned to a topic. Gibbs sampling is used to re-assign topics: each document is taken in turn, and the percentage of its words currently assigned to each topic is calculated (eg. 12%/24%/36%/28% for topics A/B/C/D). For each word in the document, the algorithm also calculates how often that word appears in each topic (eg. 2.5%/2.1%/1.7%/0.5% of topics A/B/C/D). These two sets of percentages are multiplied together and used as weights to randomly re-assign the word to a new topic. This process is repeated for every word, at least 2000 times. [NB. During this process the word being assessed is temporarily removed from all calculations.]
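As a toy illustration of the re-assignment step described above (the percentages are the ones from the example; the code is only a sketch):

doc_topic_prop  <- c(A = 0.12,  B = 0.24,  C = 0.36,  D = 0.28)  # % of this document's words per topic
word_topic_prop <- c(A = 0.025, B = 0.021, C = 0.017, D = 0.005) # how often this word appears in each topic
weights <- doc_topic_prop * word_topic_prop                      # multiply the two sets of percentages
sample(names(weights), size = 1, prob = weights)                 # randomly re-assign the word using these weights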
LDA Implications

A word is more likely to be re-assigned to another topic if lots of neighbouring words already belong to it, or if another topic has a higher concentration of that word. A word is more likely to keep its existing topic if it is part of the majority topic within its document, or if the word is spread evenly across topics. Words that only appear once in the corpus shouldn't have a significant effect on the creation of topics. Relatively uncommon words (that appear 2-3 times in the corpus) should all be assigned the same topic quite quickly.
LDA terms

phi : the likelihood that a word appears in a topic (ie. the frequency of ‘w’ in a topic, divided by the frequency of ‘w’ across the corpus).
theta : the proportion of words in a document that were assigned to each topic (nb. alpha has been added, so 0% is not possible).
alpha : the Dirichlet prior on the document-topic distribution; lower values make each document concentrate on fewer topics, higher values spread documents more evenly across topics.
The LDA function stores results in the following attributes :
@n : The total number of words in the corpus
@terms : A simple list of all the distinct words in the corpus
@beta : A table (topics x words) containing the log of phi
@gamma : A table (documents x topics) containing theta
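Since @beta stores the log of phi, the word/topic probabilities can be recovered with exp(). The topicmodels package also has a posterior() function that returns the same information. A sketch, assuming the my_lda model fitted above:

phi  <- exp(my_lda@beta)  # back-transform the logged word/topic probabilities
post <- posterior(my_lda)
post$terms   # word probabilities per topic (phi)
post$topics  # topic probabilities per document (theta)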
More complex analysis of LDA, including a graph of the top terms per topic:
topicstats %>% mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) + # plot beta by theme
geom_col(show.legend = FALSE) + # as a bar plot
facet_wrap(~ topic, scales = "free") + # with each topic in a separate plot
labs(x = NULL, y = "Beta") + # no x label, change y label
coord_flip() # turn bars sideways

Since topic models are usually unsupervised, it can be difficult to assess how effective/reliable they are. The topics generated by LDA may not make much sense if most comments:

- contain few words
- contain too many words covering multiple themes
- are too general/generic/non-specific

LDA can also be sensitive, so it could produce very different results depending on the number of topics, or when re-running with additional data.
12.3 Clustering - Similarity between topics
Objects with multiple features/dimensions (such as comments) can be grouped together based on how similar they are. First we need to measure the distances between all the objects.
12.4 Calculating distance between objects
The example below shows 3 simple objects (A,B,C) with x and y coordinates.
| Obs | x | y |
|---|---|---|
| A | 1 | 1 |
| B | 4 | 1 |
| C | 4 | 5 |

For simple 2D objects you can calculate the standard (Euclidean) distance using Pythagoras’ theorem. This can be done with the dist() function.
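The distance matrix below was presumably produced by something like this (reconstructing the three points from the table above; the exact code is an assumption):

points <- data.frame(x = c(1, 4, 4), y = c(1, 1, 5)) # A, B, C
dist(points) # Euclidean distance by default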
## 1 2
## 2 3
## 3 5 4
This produces a matrix showing the distance between each object:
- Between 1(A) and 2(B) the distance is 3
- Between 1(A) and 3(C) the distance is 5
- Between 2(B) and 3(C) the distance is 4
12.5 Non-Euclidean Distances - Jensen-Shannon
The topics generated by a topic model consist of thousands of words (dimensions), as shown in the phi matrix. We can also use dist() to calculate distances between multi-dimensional objects, but you may want to use a different method to calculate distance.
Since phi contains probability distributions, a divergence measure such as Kullback-Leibler or Jensen-Shannon can be used. https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence These compare the relative values of phi for each element of a topic (eg. the relative likelihood of ‘tax’ appearing in Topic 1 versus Topic 2), and compute an overall difference.
This produces a matrix showing the distance between each Topic:
- Between Topic 1 and 2 the distance is 2.038
- Between Topic 1 and 3 the distance is 2.297
- Between Topic 2 and 3 the distance is 1.817

Nb. The units of distance may not be meaningful.
This means that Topics that share a similar word (phi) distribution will be closer to one another.
The Jensen-Shannon divergence is similar to Kullback-Leibler, however it compares each distribution to their average rather than to each other directly (ie. P vs average(P,Q) rather than P vs Q). This is meant to mitigate the effects of noise in the data.
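For reference, the two divergences between distributions P and Q can be written as:

\mathrm{KL}(P \,\|\, Q) = \sum_i p_i \log\frac{p_i}{q_i}, \qquad
\mathrm{JS}(P, Q) = \tfrac{1}{2}\,\mathrm{KL}(P \,\|\, M) + \tfrac{1}{2}\,\mathrm{KL}(Q \,\|\, M), \quad M = \tfrac{1}{2}(P + Q)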
jsPCA <- function(phi) {
  # first, we compute a pairwise distance between topic distributions
  # using a symmetric version of KL-divergence
  # http://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence
  jensenShannon <- function(x, y) {
    m <- 0.5 * (x + y)
    lhs <- ifelse(x == 0, 0, x * (log(x) - log(m)))
    rhs <- ifelse(y == 0, 0, y * (log(y) - log(m)))
    0.5 * sum(lhs) + 0.5 * sum(rhs)
  }
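  # [Assumed continuation, not shown in the original notes] This helper appears to be
  # adapted from the LDAvis package's jsPCA(), which goes on to compute the pairwise
  # topic distance matrix with proxy::dist(); the scaling step is covered in 12.6 below.
  distanceMatrix <- proxy::dist(x = phi, method = jensenShannon)
  distanceMatrix
}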
12.6 Scaling - Principal Components
Even after using a method like this to calculate the distances between objects, the result will still have many dimensions. You can use Principal Components Analysis (multi-dimensional scaling) to reduce these, eg. to 2 or 3 dimensions. This will make the data more manageable and suitable for visualisation.
# Multidimensional Scaling - reduces the K by K proximity matrix down to K by 2 components
pca <- stats::cmdscale(distanceMatrix, k = 2, eig=TRUE)
pca$points
plot(pca$points)

The coordinates produced by the scaling are not very meaningful, however the distances between objects (as calculated previously) will be preserved as much as possible.
12.7 Eigenvectors
12.8 K-Means
K-Means Clustering requires a DTM.
kmeans(as.matrix(myDTM), # The Document Term Matrix (kmeans needs an ordinary dense matrix)
       10,               # The number of clusters
       iter.max = 10,    # The maximum number of iterations allowed
       nstart = 3,       # The number of random starting configurations to try
       trace = TRUE)

The kmeans function stores results in the following attributes :
$cluster : The cluster assigned to each document
$centers : The position of the cluster centre
12.9 Naive Bayes Classifiers
library(e1071)
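A minimal sketch of fitting a Naive Bayes classifier to a Document Term Matrix with e1071::naiveBayes (the labels vector is hypothetical, and using myDTM as the feature matrix is an assumption):

labels <- factor(c("praise", "complaint", "praise")) # hypothetical class for each document
x <- as.matrix(myDTM)                 # naiveBayes() needs an ordinary matrix or data frame
nb_model <- naiveBayes(x = x, y = labels)
predict(nb_model, newdata = x)        # predicted class for each document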
12.10 TF-IDF Classifiers (Supervised)
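The tidytext package provides bind_tf_idf() for computing tf-idf weights, which can then be used as features for a supervised classifier. A minimal sketch, assuming the tidytext_count tibble from 12.1 (one row per line/word with a count n):

tidytext_count %>% bind_tf_idf(term = word, document = line, n = n)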
12.11 Sentiment Analysis
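A minimal sketch of lexicon-based sentiment scoring with tidytext's get_sentiments(); joining it onto the tidy tokens from 12.0.2 is an assumption:

library(tidytext)
library(dplyr)
# The "bing" lexicon labels words as positive or negative
tidytext %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(sentiment, sort = TRUE)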
12.12 Word Bubble
To create a word bubble visualisation we can use ggplot2 and the packcircles package, which decides the size and position of the bubbles.
mytext <- tibble('Comment'=c('very good, easy to use once setup, couldnt do what i wanted, very quick way of cahnging details. it took only a few minutes to do what i needed. not all my information was correct. lots of confusing information, really good. quick and easy') )
# Use tidytext to extract all the words and remove stopwords
library(tidytext)
mytext %>% unnest_tokens(input=Comment, output=word, token="words", to_lower = TRUE) -> tidytext
tidytext %>% anti_join(filter(stop_words, lexicon=="snowball")) -> tidytext2
## Joining, by = "word"
# Calculate word frequency
tidytext3 <- group_by(tidytext2, word) %>% summarise(freq=n())
# The packcircles package decides how to arrange a group of circles, automatically calculating their sizes and coordinates
library(packcircles)
circles <- circleProgressiveLayout(tidytext3$freq, sizetype='area') # Circle size is proportional to frequency
# Add coordinates back to list of words
tidytext3 = cbind(tidytext3, circles)
# Produce vertices so that the circles can be drawn
circles2 <- circleLayoutVertices(circles, npoints=40) # Option to choose how many vertices - more means better drawn circle
# Plot circles using ggplot2 & ggiraph
library(ggiraph)
library(ggplot2)
mybc <- ggplot() +
geom_polygon_interactive(data = circles2, aes(x, y, group = id, data_id = id), alpha = 0.6) +
scale_fill_manual(values="steelblue") +
geom_text(data = tidytext3, aes(x, y, size = freq, label = word)) +
scale_size_continuous(range = c(1,13)) +
theme_void() +
theme(legend.position="none") +
coord_equal()
ggiraph(ggobj = mybc, width_svg = 12, height_svg = 12)