12 Text Mining & NLP

There are various types of analysis you can do with text data, such as n-grams, sentiment analysis, and topic modelling. Various packages are available; the main ones are tm and NLP.

12.0.2 Tidy text

Ideally you will convert your text data into a tidy format using the tidytext package.
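As a minimal sketch (the three example sentences are an assumption, reconstructed from the word counts shown later in this chapter), a character vector can be tokenised into a one-word-per-row tibble with unnest_tokens():

library(dplyr)
library(tidytext)

# One row per document, with the raw text in a 'text' column
text_df <- tibble(doc = 1:3,
                  text = c("Once upon a time",
                           "In a galaxy far far away",
                           "On a dark and stormy night"))

# unnest_tokens() lower-cases, strips punctuation and gives one word per row
tidy_words <- text_df %>%
  unnest_tokens(word, text)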


tidytext includes a stop_words tibble, which contains stopwords sourced from the SMART, snowball, and onix lexicons. You can remove these stopwords from your tidy data by doing an anti-join, or create your own custom stopword tibble (see the sketch after the tibble below).

## # A tibble: 1,149 x 2
##    word        lexicon
##    <chr>       <chr>  
##  1 a           SMART  
##  2 a's         SMART  
##  3 able        SMART  
##  4 about       SMART  
##  5 above       SMART  
##  6 according   SMART  
##  7 accordingly SMART  
##  8 across      SMART  
##  9 actually    SMART  
## 10 after       SMART  
## # ... with 1,139 more rows
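A sketch, assuming the tidy_words tibble created above; removing stop words is an anti_join against stop_words, or against your own custom tibble:

# Remove the standard stop words with an anti-join
tidy_no_stops <- tidy_words %>%
  anti_join(stop_words, by = "word")

# Or define a custom stop word tibble (hypothetical words) and anti-join against that
my_stop_words <- tibble(word = c("upon", "once"), lexicon = "custom")
tidy_words %>%
  anti_join(my_stop_words, by = "word")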

12.0.4 Basic text stats

If the text data is in a tidy format you can easily use dplyr to manipulate it, for example to produce word frequencies:
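A minimal sketch, assuming the tidy_words tibble from earlier (stop words are left in here, since the counts below include them):

# Word frequency: one row per distinct word, sorted by count
tidy_words %>%
  count(word, sort = TRUE)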

## # A tibble: 13 x 2
##    word       n
##    <chr>  <int>
##  1 a          3
##  2 far        2
##  3 and        1
##  4 away       1
##  5 dark       1
##  6 galaxy     1
##  7 in         1
##  8 night      1
##  9 on         1
## 10 once       1
## 11 stormy     1
## 12 time       1
## 13 upon       1

12.0.5 N-gram analysis

To look at neighbouring words you need to use the unnest_tokens function again:
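A sketch using the same assumed text_df as above; token = "ngrams" with n = 2 tokenises into pairs of neighbouring words:

# Tokenise into bigrams rather than single words, then count them
text_df %>%
  unnest_tokens(ngram, text, token = "ngrams", n = 2) %>%
  count(ngram, sort = TRUE)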

## # A tibble: 13 x 2
##    ngram            n
##    <chr>        <int>
##  1 a dark           1
##  2 a galaxy         1
##  3 a time           1
##  4 and stormy       1
##  5 dark and         1
##  6 far away         1
##  7 far far          1
##  8 galaxy far       1
##  9 in a             1
## 10 on a             1
## 11 once upon        1
## 12 stormy night     1
## 13 upon a           1

12.0.6 Text Preparation with tm

An alternative way of working with text data is to use the tm package, which includes text cleaning functions.
This package uses a data structure called a Corpus.
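A minimal sketch (the two documents here are assumptions); tm_map() applies a cleaning function to every document in the corpus and returns the modified corpus, which is what the VCorpus summary below shows:

library(tm)

# Build a volatile corpus from a character vector of documents
docs <- c("Once upon a time", "In a galaxy far far away")
corpus <- VCorpus(VectorSource(docs))

# Typical cleaning steps
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)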

## Loading required package: NLP
## 
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
## 
##     annotate
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 2

12.1 Create a Term Document Matrix

Transforming text data into a matrix allows you to do further modelling such as LDA, Naive Bayes, or regression.

12.1.1 From tidy data
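A sketch assuming the tidy data from earlier; tidytext's cast_dtm() turns a tidy table of counts into a tm DocumentTermMatrix:

# Count words per document, then cast to a DocumentTermMatrix
word_counts <- text_df %>%
  unnest_tokens(word, text) %>%
  count(doc, word)

dtm <- word_counts %>%
  cast_dtm(document = doc, term = word, value = n)

dtm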

## <<DocumentTermMatrix (documents: 3, terms: 13)>>
## Non-/sparse entries: 15/24
## Sparsity           : 62%
## Maximal term length: 6
## Weighting          : term frequency (tf)
## Sample             :
##     Terms
## Docs a and away dark far galaxy in once time upon
##    1 1   0    0    0   0      0  0    1    1    1
##    2 1   0    1    0   2      1  1    0    0    0
##    3 1   1    0    1   0      0  0    0    0    0

12.1.2 Matrix Management

To perform calculations on very large matrices you may need to use the slam package.
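A minimal sketch of the slam helpers on a document-term matrix (using the dtm from the earlier sketch); a DocumentTermMatrix is stored as a sparse simple_triplet_matrix, so these functions work on it directly without converting it to a dense matrix:

library(slam)

row_sums(dtm)   # total number of words in each document
col_sums(dtm)   # total frequency of each term across the corpus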

##  1  2 
##  3 16
##      1      2      3      a    and  away.   dark    far galaxy     in 
##      1      1      1      3      1      1      1      2      1      1 
##  night     on   Once stormy   time   upon 
##      1      1      1      1      1      1

12.2 Topic Modelling & Latent Dirichlet Allocation (LDA)

Topic Modelling is a common method for discovering topics in text, such as comments. Most topic modelling techniques, such as LDA (Latent Dirichlet Allocation), require you to choose the number of topics, and will then use an algorithm to create the topics.

  • You must interpret what these topics mean
  • Some words are equally likely to appear across topics; so a word like “account” could appear in both topic lists.

You can build an unsupervised LDA model using a DocumentTermMatrix.
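A minimal sketch using the topicmodels package (the package choice and settings such as k = 2, seed and iter are assumptions, not from the original):

library(topicmodels)

# Fit an LDA model with k topics using Gibbs sampling
lda_model <- LDA(dtm, k = 2, method = "Gibbs",
                 control = list(seed = 1234, iter = 2000))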

Initially, each word from each document is randomly assigned to a topic. Gibbs sampling is used to re-assign topics. This involves taking each document in turn and calculating the % of words that are currently assigned to each topic (eg. 12%/24%/36%/28% for topics A/B/C/D). It also looks at each word in the document, and calculates how often that word appears in each topic (eg. 2.5%/2.1%/1.7%/0.5% of topics A/B/C/D). These two sets of %'s are multiplied together and used as weights to randomly re-assign the word to a new topic. This process is repeated for every word, at least 2000 times. [NB. During this process the word being assessed is temporarily removed from all calculations.]
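A short numeric sketch of the re-assignment step described above, using the illustrative percentages from the text:

# % of words in the current document assigned to each topic
doc_topic  <- c(A = 0.12, B = 0.24, C = 0.36, D = 0.28)
# how often the word being re-assigned appears in each topic
word_topic <- c(A = 0.025, B = 0.021, C = 0.017, D = 0.005)

# Multiply the two, then normalise into sampling probabilities
weights <- doc_topic * word_topic
probs   <- weights / sum(weights)

# Randomly re-assign the word to a topic with these probabilities
sample(names(probs), size = 1, prob = probs)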

LDA implications:

  • A word is more likely to be re-assigned to another topic if lots of neighbouring words already belong to it, or if another topic has a higher concentration of that word.
  • A word is more likely to keep its existing topic if it is part of the majority topic within its document, or because the word is spread evenly across topics.
  • Words that only appear once in the corpus shouldn't have a significant effect on the creation of topics.
  • Relatively uncommon words (that appear 2-3 times in the corpus) should be assigned the same topic quite quickly.

LDA terms:

  • phi : the likelihood that a word appears in a topic (ie. the frequency of 'w' in a topic, divided by the frequency of 'w' across the corpus).
  • theta : the proportion of words in a document that were assigned to each topic (nb. alpha has been added, so 0% is not possible).
  • alpha : the Dirichlet prior on the document-topic distribution; lower values push each document towards containing fewer topics.

The LDA function stores results in the following attributes :
@n : The total number of words in the corpus
@terms : A simple list of all the distinct words in the corpus
@beta : A table (topics x words) containing the log of phi
@gamma : A table (documents x topics) containing theta
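A sketch of pulling these out of the lda_model fitted earlier (slot names as in the topicmodels package):

lda_model@n                    # total number of words in the corpus
head(lda_model@terms)          # distinct words in the corpus

phi   <- exp(lda_model@beta)   # topics x words: word probabilities per topic
theta <- lda_model@gamma       # documents x topics: topic proportions per document

# Top 5 words for topic 1
lda_model@terms[order(phi[1, ], decreasing = TRUE)[1:5]]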

More complex analysis of LDA, including graphs, is also possible.

Since topic models are usually unsupervised, it can be difficult to assess how effective/reliable they are. The topics generated by LDA may not make much sense if most comments:

  • contain few words
  • contain too many words covering multiple themes
  • are too general/generic/non-specific

LDA can be sensitive, so it could produce very different results depending on the number of topics, or when re-running with additional data.

12.3 Clustering - Similarity between topics

Objects with multiple features/dimensions (such as comments) can be grouped together based on how similar they are. First we need to measure the distances between all the objects.

12.4 Calculating distance between objects

The example below shows 3 simple objects (A,B,C) with x and y coordinates.

Obs   x   y
A     1   1
B     4   1
C     4   5

For simple 2D objects you can calculate the standard (Euclidean) distance using Pythagoras' theorem. This can be done with the dist() function.
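A minimal sketch that reproduces the matrix below, using the coordinates from the table above:

# Coordinates for the three objects; rows 1-3 correspond to A, B and C
points <- matrix(c(1, 1,
                   4, 1,
                   4, 5), ncol = 2, byrow = TRUE)

# Euclidean distance between every pair of rows
dist(points)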

##   1 2
## 2 3  
## 3 5 4

This produces a matrix showing the distance between each object:

  • Between 1(A) and 2(B) the distance is 3
  • Between 1(A) and 3(C) the distance is 5
  • Between 2(B) and 3(C) the distance is 4

12.5 Non-Euclidean Distances - Jensen-Shannon

The topics generated by a topic model consist of thousands of words (dimensions), as shown in the phi matrix. We can also use dist() to calculate the distance between multi-dimensional objects, but you may want to use a different method to calculate distance.

Since phi contains probability distributions, a divergence measure such as Kullback-Leibler or Jensen-Shannon can be used. https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence These compare the relative values of phi for each element of a topic (eg. the relative likelihood of 'tax' appearing in Topic 1 versus Topic 2), and compute an overall difference.

This produces a matrix showing the distance between each Topic:

  • Between Topic 1 and 2 the distance is 2.038
  • Between Topic 1 and 3 the distance is 2.297
  • Between Topic 2 and 3 the distance is 1.817

Nb. The units of distance may not be meaningful.

This means that Topics that share a similar word (phi) distribution will be closer to one another.

The Jensen-Shannon divergence is similar to Kullback-Leibler, however it compares each distribution to their average rather than comparing them directly (ie. P vs the average of P and Q, rather than P vs Q). This is meant to mitigate the effects of noise in the data.

jsPCA <- function(phi) {
  # first, we compute a pairwise distance between topic distributions
  # using a symmetric version of KL-divergence
  # http://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence
  jensenShannon <- function(x, y) {
    m <- 0.5 * (x + y)
    lhs <- ifelse(x == 0, 0, x * (log(x) - log(m)))
    rhs <- ifelse(y == 0, 0, y * (log(y) - log(m)))
    0.5 * sum(lhs) + 0.5 * sum(rhs)
  }
  # the closing steps below are reconstructed from the LDAvis package's jsPCA:
  # compute the pairwise topic distances, then scale them down to 2 dimensions
  dist.mat <- proxy::dist(x = phi, method = jensenShannon)
  pca.fit <- stats::cmdscale(dist.mat, k = 2)
  data.frame(x = pca.fit[, 1], y = pca.fit[, 2])
}

12.6 Scaling - Principal Components

Even once you have used a method to calculate the distance between various objects, the objects themselves still have many dimensions. You can use Principal Components Analysis (multi-dimensional scaling) to reduce these, eg. to 2 or 3 dimensions. This will make the data more manageable and suitable for visualisation.

The coordinates produced by the scaling are not very meaningful, however the distance between objects (as calculated previously) will be preserved as much as possible.
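A minimal sketch, assuming the phi matrix and the completed jsPCA function from the previous sections (jsPCA uses classical multi-dimensional scaling via cmdscale internally):

# Scale the Jensen-Shannon distances between topics down to 2 dimensions
topic_coords <- jsPCA(phi)

# The coordinates themselves are not meaningful, but distances are preserved,
# so similar topics will be plotted close together
plot(topic_coords$x, topic_coords$y, xlab = "Dimension 1", ylab = "Dimension 2")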

12.7 Eigenvectors

12.8 K-Means

K-Means Clustering requires a DTM.

The kmeans function stores results in the following attributes :
$cluster : The cluster assigned to each document
$centers : The position of the cluster centre
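A minimal sketch using the dtm from earlier (the number of clusters is an assumption):

# kmeans needs an ordinary numeric matrix, so convert the sparse DTM first
dtm_matrix <- as.matrix(dtm)

km <- kmeans(dtm_matrix, centers = 2)

km$cluster   # the cluster assigned to each document
km$centers   # the position of each cluster centre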

12.9 Naive Bayes Classifiers

library(e1071)
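A minimal sketch of fitting a Naive Bayes classifier with e1071 on the document-term matrix from earlier (the document labels here are hypothetical):

# Hypothetical labels for the three example documents
labels <- factor(c("story", "story", "weather"))

nb_model <- naiveBayes(x = as.matrix(dtm), y = labels)

# Predict the class of each document
predict(nb_model, newdata = as.matrix(dtm))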

12.10 TF-IDF Classifiers (Supervised)

12.11 Sentiment Analysis

12.12 Word Bubble

To create a word bubble visualisation we can use ggplot2 and the packcircles package, which decides the size and position of the bubbles.
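A minimal sketch, assuming a small word-frequency table (the words and counts below are illustrative):

library(packcircles)
library(ggplot2)

# Illustrative word frequencies
word_freq <- data.frame(word = c("galaxy", "far", "stormy", "night", "time"),
                        n    = c(5, 4, 2, 2, 1))

# circleProgressiveLayout() returns a centre (x, y) and radius for each bubble,
# with the bubble area proportional to n
layout  <- circleProgressiveLayout(word_freq$n, sizetype = "area")
circles <- circleLayoutVertices(layout, npoints = 50)

ggplot() +
  geom_polygon(data = circles, aes(x, y, group = id, fill = factor(id)),
               colour = "black", show.legend = FALSE) +
  geom_text(data = cbind(word_freq, layout),
            aes(x, y, label = word, size = n), show.legend = FALSE) +
  coord_equal() +
  theme_void()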
