12 Text Mining & NLP
There are various types of analysis you can do with text data, such as n-grams, sentiment analysis, and topic modelling. Various packages are available; the main ones used in this chapter are tidytext and tm (which builds on NLP).
12.0.1 Text data
The source data for your text may come in various formats, for example a single string or a data frame. Ideally you will put these into a tidy format.
12.0.2 Tidy text
Ideally you will convert your text data into a tidy format using the tidytext package.
library(tidytext)
textdoc %>% unnest_tokens(input=text, output=word, token="words", to_lower = TRUE) -> tidytext
# This splits up your text into 'tokens'.
# By default a token is a word, but other options include "characters", "sentences","ngrams".
# By default all text will be converted to lowercase, and punctuation (eg .,!?£$&) will be removed.
# Numbers are not removed.

tidytext includes a stop_words tibble, which contains stopwords sourced from the SMART, snowball, and onix lexicons. You can remove these stopwords from your tidy data with an anti join, or create your own custom stopword tibble.
## # A tibble: 1,149 x 2
## word lexicon
## <chr> <chr>
## 1 a SMART
## 2 a's SMART
## 3 able SMART
## 4 about SMART
## 5 above SMART
## 6 according SMART
## 7 accordingly SMART
## 8 across SMART
## 9 actually SMART
## 10 after SMART
## # ... with 1,139 more rows
## Joining, by = "word"
# Remove stopwords from a particular source (eg. snowball) from your data
tidytext %>% anti_join(filter(stop_words, lexicon=="snowball")) -> tidytext2
## Joining, by = "word"
# Custom stopwords
custom_sw <- tibble(word=c("a","in","on","and"))
tidytext %>% anti_join(custom_sw) -> tidytext2
## Joining, by = "word"
12.0.3 Stemming
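A common approach is to stem the tidy tokens with the SnowballC package. A minimal sketch, assuming the tidytext2 tibble created above (the use of wordStem here is an illustration, not code from the original notes):

library(dplyr)
library(SnowballC)
# Reduce each word to its stem, eg. "running" and "runs" both become "run"
tidytext2 %>% mutate(stem = wordStem(word)) -> tidytext_stemmed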
12.0.4 Basic text stats
If the text data is in a tidy format you can easily use dplyr to manipulate it, for example to produce word frequencies:
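The word counts below were presumably produced by something like this (a sketch, assuming the tidytext tibble from 12.0.2):

tidytext %>% count(word, sort = TRUE)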
## # A tibble: 13 x 2
## word n
## <chr> <int>
## 1 a 3
## 2 far 2
## 3 and 1
## 4 away 1
## 5 dark 1
## 6 galaxy 1
## 7 in 1
## 8 night 1
## 9 on 1
## 10 once 1
## 11 stormy 1
## 12 time 1
## 13 upon 1
12.0.5 N-gram analysis
To look at neighbouring words, use the unnest_tokens function again:
textdoc %>% unnest_tokens(input=text, output=ngram, token="ngrams", n=2) -> ngramdoc
# All text will be converted to lowercase, and punctuation (eg .,!?£$&) will be removed, but numbers are not removed.
# N-gram frequency
ngramdoc %>% count(ngram, sort = TRUE)
## # A tibble: 13 x 2
## ngram n
## <chr> <int>
## 1 a dark 1
## 2 a galaxy 1
## 3 a time 1
## 4 and stormy 1
## 5 dark and 1
## 6 far away 1
## 7 far far 1
## 8 galaxy far 1
## 9 in a 1
## 10 on a 1
## 11 once upon 1
## 12 stormy night 1
## 13 upon a 1
12.0.6 Text Preparation with tm
An alternative way of working with text data is to use the tm package, which includes text cleaning functions.
This package uses a data structure called a Corpus.
library(tm)
# Convert your text data into a Corpus
myCorpus <- VCorpus(VectorSource(textdoc))
# You can clean your corpus using the tm_map() function. This has various options:
tm_map(myCorpus, content_transformer(tolower)) # Changes text to lowercase
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 2
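A few of the other cleaning transformations that tm provides (these are standard tm functions, shown as a sketch rather than the exact calls used in the original chunk):

tm_map(myCorpus, removePunctuation) # Removes punctuation
tm_map(myCorpus, removeNumbers)     # Removes digits
tm_map(myCorpus, stripWhitespace)   # Collapses repeated whitespace
tm_map(myCorpus, stemDocument)      # Stems words (uses the SnowballC package)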
# A stopwords() function is also available, and can be combined with removeWords, for example:
tm_map(myCorpus, removeWords, c(stopwords("en")) ) # Removes stopwords in the snowball list
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 2
12.1 Create a Term Document Matrix
Transforming text data into a matrix allows you to do further modelling, such as LDA, Naive Bayes, or regression.

From tidy data
# First, the tidy data needs to be summarised so that it contains the count of each token per document.
tidytext %>% count(line, word, sort = TRUE) -> tidytext_count
# Now use the cast_dtm function to convert this into a Document Term Matrix.
tidytext_count %>% cast_dtm(line, word, n) -> myDTM
## <<DocumentTermMatrix (documents: 3, terms: 13)>>
## Non-/sparse entries: 15/24
## Sparsity : 62%
## Maximal term length: 6
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs a and away dark far galaxy in once time upon
## 1 1 0 0 0 0 0 0 1 1 1
## 2 1 0 1 0 2 1 1 0 0 0
## 3 1 1 0 1 0 0 0 0 0 0
12.1.1 From a Corpus
myDTM <- tm::DocumentTermMatrix(myCorpus)
# By default words will be converted to lowercase, and words with <3 characters are removed. To alter the defaults, you can specify a control list.
myDTM <- tm::DocumentTermMatrix(myCorpus, control = list(tolower=FALSE,
                                                          wordLengths=c(1,Inf),
                                                          removePunctuation=FALSE))

12.1.2 Matrix Management
To perform calculations on very large matrices you may need to use the slam package.
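The output below appears to come from summing the rows and columns of the matrix; with slam this could be done as follows (a sketch, the exact calls are an assumption):

library(slam)
row_sums(myDTM) # Total number of terms in each document
col_sums(myDTM) # Total count of each term across the corpus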
## 1 2
## 3 16
## 1 2 3 a and away. dark far galaxy in
## 1 1 1 3 1 1 1 2 1 1
## night on Once stormy time upon
## 1 1 1 1 1 1
12.2 Topic Modelling & Latent Dirichlet Allocation (LDA)
Topic Modelling is a common method for discovering topics in text, such as free-text comments. Most topic modelling techniques, such as LDA (Latent Dirichlet Allocation), require you to choose the number of topics, and will then use an algorithm to create the topics.
- You must interpret what these topics mean
- Some words are equally likely to appear across topics, so a word like “account” could appear in several topic lists.
You can build an unsupervised LDA model from a DocumentTermMatrix:
library(topicmodels) # Not available #
my_lda <- LDA(myDTM, k=4, control = list(seed=87533)) # k is the number of topics you want.
# Option 1 : using tidytext to examine topic probabilities
topics_beta <- tidy(my_lda, matrix="beta") # beta represents word/topic probabilities
topicstats <- topics_beta %>% group_by(topic) %>% top_n(10,beta) %>% ungroup() %>% arrange(topic, -beta) #Top terms for each topic
topics_gamma <- tidy(my_lda, matrix="gamma") # gamma represents document/topic probabilities
classification <- topics_gamma %>% group_by(document) %>% top_n(1,gamma) %>% ungroup() # Most likely topic for each document
# Merge back to original document
classification <- mutate(classification, id=as.numeric(document))
final <- left_join(textdoc, classification, by="id")

Initially, each word from each document is randomly assigned to a topic. Gibbs sampling is used to re-assign topics: each document is taken in turn, and the percentage of its words currently assigned to each topic is calculated (eg. 12%/24%/36%/28% for topics A/B/C/D). For each word in the document, the algorithm also calculates how often that word appears in each topic (eg. 2.5%/2.1%/1.7%/0.5% of topics A/B/C/D). These two sets of percentages are multiplied together and used as weights to randomly re-assign the word to a new topic. This process is repeated for every word, at least 2000 times. [NB. During this process the word being assessed is temporarily removed from all calculations.]
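As a toy illustration of the re-assignment step described above (the percentages are the ones from the example; the code is only a sketch):

doc_topic_prop  <- c(A = 0.12,  B = 0.24,  C = 0.36,  D = 0.28)  # % of this document's words per topic
word_topic_prop <- c(A = 0.025, B = 0.021, C = 0.017, D = 0.005) # how often this word appears in each topic
weights <- doc_topic_prop * word_topic_prop                      # multiply the two sets of percentages
sample(names(weights), size = 1, prob = weights)                 # randomly re-assign the word using these weights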
LDA Implications

A word is more likely to be re-assigned to another topic if lots of neighbouring words already belong to it, or if another topic has a higher concentration of that word. A word is more likely to keep its existing topic if it is part of the majority topic within its document, or if the word is spread evenly across topics. Words that only appear once in the corpus shouldn't have a significant effect on the creation of topics. Relatively uncommon words (that appear 2-3 times in the corpus) should all be assigned the same topic quite quickly.
LDA terms

phi : the likelihood that a word appears in a topic (ie. the frequency of ‘w’ in a topic, divided by the frequency of ‘w’ across the corpus).
theta : the proportion of words in a document that were assigned to each topic (nb. alpha has been added, so 0% is not possible).
alpha : the Dirichlet prior on the document-topic distribution; lower values make each document concentrate on fewer topics, higher values spread documents more evenly across topics.
The LDA function stores results in the following attributes :
@n : The total number of words in the corpus
@terms : A simple list of all the distinct words in the corpus
@beta : A table (topics x words) containing the log of phi
@gamma : A table (documents x topics) containing theta
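Since @beta stores the log of phi, the word/topic probabilities can be recovered with exp(). The topicmodels package also has a posterior() function that returns the same information. A sketch, assuming the my_lda model fitted above:

phi  <- exp(my_lda@beta)  # back-transform the logged word/topic probabilities
post <- posterior(my_lda)
post$terms   # word probabilities per topic (phi)
post$topics  # topic probabilities per document (theta)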
More complex analysis of LDA, including a graph of the top terms per topic:
topicstats %>% mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) + # plot beta by theme
geom_col(show.legend = FALSE) + # as a bar plot
facet_wrap(~ topic, scales = "free") + # with each topic in a separate plot
labs(x = NULL, y = "Beta") + # no x label, change y label
coord_flip() # turn bars sideways

Since topic models are usually unsupervised, it can be difficult to assess how effective/reliable they are. The topics generated by LDA may not make much sense if most comments:

- contain few words
- contain too many words covering multiple themes
- are too general/generic/non-specific

LDA can also be sensitive, so it could produce very different results depending on the number of topics, or when re-running with additional data.
12.3 Clustering - Similarity between topics
Objects with multiple features/dimensions (such as comments) can be grouped together based on how similar they are. First we need to measure the distances between all the objects.
12.4 Calculating distance between objects
The example below shows 3 simple objects (A,B,C) with x and y coordinates.
| Obs | x | y |
|---|---|---|
| A | 1 | 1 |
| B | 4 | 1 |
| C | 4 | 5 |

For simple 2D objects you can calculate the standard (Euclidean) distance using Pythagoras’ theorem. This can be done with the dist() function.
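The distance matrix below was presumably produced by something like this (reconstructing the three points from the table above; the exact code is an assumption):

points <- data.frame(x = c(1, 4, 4), y = c(1, 1, 5)) # A, B, C
dist(points) # Euclidean distance by default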
## 1 2
## 2 3
## 3 5 4
This produces a matrix showing the distance between each object:
- Between 1(A) and 2(B) the distance is 3
- Between 1(A) and 3(C) the distance is 5
- Between 2(B) and 3(C) the distance is 4
12.5 Non-Euclidean Distances - Jensen-Shannon
The topics generated by a topic model consist of thousands of words (dimensions), as shown in the phi matrix. We can also use dist() to calculate distances between multi-dimensional objects, but you may want to use a different method to calculate distance.
Since phi contains probability distributions, a divergence measure such as Kullback-Leibler or Jensen-Shannon can be used. https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence These compare the relative values of phi for each element of a topic (eg. the relative likelihood of ‘tax’ appearing in Topic 1 versus Topic 2), and compute an overall difference.
This produces a matrix showing the distance between each Topic:
- Between Topic 1 and 2 the distance is 2.038
- Between Topic 1 and 3 the distance is 2.297
- Between Topic 2 and 3 the distance is 1.817

Nb. The units of distance may not be meaningful.
This means that Topics that share a similar word (phi) distribution will be closer to one another.
The Jensen-Shannon divergence is similar to Kullback-Leibler, however it compares each distribution to their average rather than to each other directly (ie. P vs average(P,Q) rather than P vs Q). This is meant to mitigate the effects of noise in the data.
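For reference, the two divergences between distributions P and Q can be written as:

\mathrm{KL}(P \,\|\, Q) = \sum_i p_i \log\frac{p_i}{q_i}, \qquad
\mathrm{JS}(P, Q) = \tfrac{1}{2}\,\mathrm{KL}(P \,\|\, M) + \tfrac{1}{2}\,\mathrm{KL}(Q \,\|\, M), \quad M = \tfrac{1}{2}(P + Q)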
jsPCA <- function(phi) {
  # first, we compute a pairwise distance between topic distributions
  # using a symmetric version of KL-divergence
  # http://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence
  jensenShannon <- function(x, y) {
    m <- 0.5 * (x + y)
    lhs <- ifelse(x == 0, 0, x * (log(x) - log(m)))
    rhs <- ifelse(y == 0, 0, y * (log(y) - log(m)))
    0.5 * sum(lhs) + 0.5 * sum(rhs)
  }
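  # [Assumed continuation, not shown in the original notes] This helper appears to be
  # adapted from the LDAvis package's jsPCA(), which goes on to compute the pairwise
  # topic distance matrix with proxy::dist(); the scaling step is covered in 12.6 below.
  distanceMatrix <- proxy::dist(x = phi, method = jensenShannon)
  distanceMatrix
}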
12.6 Scaling - Principal Components
Even after using a method like this to calculate the distances between objects, the result will still have many dimensions. You can use Principal Components Analysis (multi-dimensional scaling) to reduce these, eg. to 2 or 3 dimensions. This will make the data more manageable and suitable for visualisation.
# Multidimensional Scaling - reduces the K by K proximity matrix down to K by 2 components
pca <- stats::cmdscale(distanceMatrix, k = 2, eig=TRUE)
pca$points
plot(pca$points)

The coordinates produced by the scaling are not very meaningful, however the distances between objects (as calculated previously) will be preserved as much as possible.
12.7 Eigenvectors
12.8 K-Means
K-Means Clustering requires a DTM.
kmeans(as.matrix(myDTM), # The Document Term Matrix (kmeans needs an ordinary dense matrix)
       10,               # The number of clusters
       iter.max = 10,    # The maximum number of iterations allowed
       nstart = 3,       # The number of random starting configurations to try
       trace = TRUE)

The kmeans function stores results in the following attributes :
$cluster : The cluster assigned to each document
$centers : The position of the cluster centre
12.9 Naive Bayes Classifiers
library(e1071)
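A minimal sketch of fitting a Naive Bayes classifier to a Document Term Matrix with e1071::naiveBayes (the labels vector is hypothetical, and using myDTM as the feature matrix is an assumption):

labels <- factor(c("praise", "complaint", "praise")) # hypothetical class for each document
x <- as.matrix(myDTM)                 # naiveBayes() needs an ordinary matrix or data frame
nb_model <- naiveBayes(x = x, y = labels)
predict(nb_model, newdata = x)        # predicted class for each document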
12.10 TF-IDF Classifiers (Supervised)
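The tidytext package provides bind_tf_idf() for computing tf-idf weights, which can then be used as features for a supervised classifier. A minimal sketch, assuming the tidytext_count tibble from 12.1 (one row per line/word with a count n):

tidytext_count %>% bind_tf_idf(term = word, document = line, n = n)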
12.11 Sentiment Analysis
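A minimal sketch of lexicon-based sentiment scoring with tidytext's get_sentiments(); joining it onto the tidy tokens from 12.0.2 is an assumption:

library(tidytext)
library(dplyr)
# The "bing" lexicon labels words as positive or negative
tidytext %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(sentiment, sort = TRUE)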
12.12 Word Bubble
To create a word bubble visualisation we can use ggplot2 and the packcircles package, which decides the size and position of the bubbles.
mytext <- tibble('Comment'=c('very good, easy to use once setup, couldnt do what i wanted, very quick way of cahnging details. it took only a few minutes to do what i needed. not all my information was correct. lots of confusing information, really good. quick and easy') )
# Use tidytext to extract all the words and remove stopwords
library(tidytext)
mytext %>% unnest_tokens(input=Comment, output=word, token="words", to_lower = TRUE) -> tidytext
tidytext %>% anti_join(filter(stop_words, lexicon=="snowball")) -> tidytext2
## Joining, by = "word"
# Calculate word frequency
tidytext3 <- group_by(tidytext2, word) %>% summarise(freq=n())
# The packcircles package decides how to arrange a group of circles, automatically calculating their sizes and coordinates
library(packcircles)
circles <- circleProgressiveLayout(tidytext3$freq, sizetype='area') # Circle size is proportional to frequency
# Add coordinates back to list of words
tidytext3 = cbind(tidytext3, circles)
# Produce vertices so that the circles can be drawn
circles2 <- circleLayoutVertices(circles, npoints=40) # Option to choose how many vertices - more means better drawn circle
# Plot circles using ggplot2 & ggiraph
library(ggiraph)
library(ggplot2)
mybc <- ggplot() +
geom_polygon_interactive(data = circles2, aes(x, y, group = id, data_id = id), alpha = 0.6) +
scale_fill_manual(values="steelblue") +
geom_text(data = tidytext3, aes(x, y, size = freq, label = word)) +
scale_size_continuous(range = c(1,13)) +
theme_void() +
theme(legend.position="none") +
coord_equal()
ggiraph(ggobj = mybc, width_svg = 12, height_svg = 12)