Knowledge Discovery – Text Mining Using R

Introduction

Text mining via analysis of online text sources is a vital tool for gaining insight into people's habits and sentiments, and for monitoring social progress. While search engines exist to provide internet users with easier access to existing information, text mining provides the opportunity to identify new knowledge and insights.

As many corporations store the majority of their data in text form, cost efficient text mining can be a valuable asset. One major use of text mining is within marketing where trends can be analysed using text transcripts of interactions with customers. For example, Hewlett Packard uses SAS Text Miner to analyse transcripts of telesales calls, partitioning notes by themes which can be used for subsequent analysis[1].

In addition, text mining of research literature has proven to be enormously useful, particularly in medical research. A notable success is that of the Children’s Memorial Hospital in Chicago where SPSS text mining software has been used for identifying drug targets to potentially cure cancers[2].

Comparison to Traditional Knowledge Extraction

Consider an analysis done by the University of Regensburg, Germany, of bond markets between 1913 and 1919. This shows German investors remained relatively positive until months before the end of World War I. However British and Dutch investors had a vastly different view, as the price of German government bonds fell on British and Dutch markets.

World War I and German Bonds

The full paper[3] highlights many of the weaknesses of such analysis – typically information is incomplete, often originally in paper based format and confined to a narrow section of people.

With social networking, deeper analysis is possible. The following is a social network graph of Twitter data where terms relating to R and data mining are clustered together. Terms that frequently appear in the same tweets are placed closer together, while the most common terms sit at the centre.

Such text mining has proven enormously useful commercially – for search engine optimisation, for identifying hot topics in newsprint and for use in recommender algorithms.

Data Input

Often data can come from many inconvenient sources, so data input and scrubbing tends to be an important and time consuming task. Consider the Reuters-21578 dataset, a collection of Reuters articles from 1987[4]. The following is a sample of the first news item in one of the files:

<!DOCTYPE lewis SYSTEM "lewis.dtd">
<DATE>26-FEB-1987 15:01:01.79</DATE>
<DATELINE>    SALVADOR, Feb 26 - </DATELINE><BODY>Showers continued throughout the week in
the Bahia cocoa zone, alleviating the drought since early
January and improving prospects for the coming temporao,
although normal humidity levels have not been restored,
Comissaria Smith said in its weekly review.
... lines omitted for brevity and illustrative purposes

Luckily, the data has been prepared in a standard format which is explained in an accompanying README file. The article content is contained between the tags <BODY> and </BODY>, while additional category data is included, e.g. topic, location, people, organisations. From this, a basic script can be written to extract the news articles relating to business acquisitions:

con <- file("stdin", open = "r")

#initialise document list
document <- list()
#boolean to check if in news article
body <- FALSE
#boolean to check if article relates to chosen topic
on_topic <- FALSE

while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
 #restore escaped angle brackets
 line <- gsub("&lt;", "<", line)

 if(grepl("<TOPICS>",line)) {
  #filter so only acquisition related news is looked at
  on_topic <- grepl("acq",line)
 }

 if(!body & grepl("<BODY>",line)) {
  #check if reached article start
  body <- TRUE
 } else if(body & grepl("</BODY>",line)) {
  #check if reached article end
  body <- FALSE
  on_topic <- FALSE
 }

 if(grepl("<BODY>",line) & on_topic) {
  #create new document in list and add first line of article
  document[[length(document)+1]] <- strsplit(line,"<BODY>")[[1]][2]
 } else if(body & on_topic) {
  #or add next line of article
  document[[length(document)]] <- paste(document[[length(document)]],line)
 }
}
close(con)

Scrubbing and Preparing Data

Firstly, text is converted to lowercase while whitespace, numbers and punctuation are removed. This ensures multiple forms of a word, e.g. “Word,” and “word”, are now the same. Also, small but common words, e.g. “a” and “is”, need to be removed to get useful results.
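As a rough base-R sketch of these scrubbing steps (the tm package, used below, provides them as built-in transformations):

```r
#toy scrubbing function: lowercase, strip numbers and punctuation,
#then collapse runs of whitespace
scrub <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:digit:][:punct:]]", " ", x)
  gsub("\\s+", " ", trimws(x))
}
scrub("Word, and word: 2 forms!")
# "word and word forms" -- both "Word," and "word" are now the same token
```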

Next, stemming is performed where similar words with different spellings are grouped, e.g. instances of ‘work’, ‘worker’, ‘worked’ and ‘working’ might come under the stem-word ‘work’. Stemming is a challenging problem given languages’ complexity e.g. “share” as a verb is vastly different from “shares” as a noun. In addition, software will often produce unusual stem words e.g. “company” stems to “compani”.
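The idea can be sketched with a naive suffix-stripper; real stemmers such as the Porter algorithm (used by tm via the SnowballC package) handle far more linguistic cases:

```r
#naive suffix-stripping sketch -- a real Porter stemmer
#handles far more cases than this
naive_stem <- function(words) sub("(ing|ed|er|s)$", "", words)
naive_stem(c("work", "worker", "worked", "working"))
#all four collapse to the stem "work"
```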

R provides support for text preparation via the package tm which can be used for texts in a wide variety of languages. The following code shows how to implement text scrubbing on the list of news articles extracted in the previous section:

library(tm)

myStopwords <- c(stopwords('english'), "available", "via",
		 "the", "said","reut","reuter","pct","mln","dlrs","inc","it","ab")
corp <- Corpus(VectorSource(document))
parameters <- list(minDocFreq        = 1,
                   wordLengths       = c(2,Inf),
                   tolower           = TRUE,
                   stripWhitespace   = TRUE,
                   removeNumbers     = TRUE,
                   removePunctuation = TRUE,
                   stemming          = TRUE,  #requires the SnowballC package
                   stopwords         = myStopwords,
                   tokenize          = NULL,
                   weighting         = function(x) weightSMART(x,spec="ltn"))
myTdm <- TermDocumentMatrix(corp, control=parameters)

This creates a term document matrix, which shows the number of occurrences for each term in each document, an essential tool in text analytics.
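To make the structure concrete, here is a toy term document matrix built in base R for two hypothetical token vectors:

```r
#two toy documents as token vectors
docs <- list(c("shares", "offer", "shares"), c("offer", "stock"))
terms <- sort(unique(unlist(docs)))
#count each term's occurrences per document
tdm <- sapply(docs, function(d) table(factor(d, levels = terms)))
tdm
#       [,1] [,2]
#offer     1    1
#shares    2    0
#stock     0    1
```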

Generating Results

Word Cloud

The simplest result that can be generated from the data prepared above is a word cloud. The following code does just this:

library(wordcloud)

m <- as.matrix(myTdm)
#frequency of words in descending order
wordFreq <- sort(rowSums(m), decreasing=TRUE)
#set colour in accordance with word frequency
grayLevels <- gray( (wordFreq+10) / (max(wordFreq)+10) )
#build word cloud based on word frequency
wordcloud(words=names(wordFreq), freq=wordFreq, min.freq=3, random.order=F,
          colors=grayLevels)

And generates the following image:

Word Cloud of Terms Relating to Acquisition News Articles


Clustering

A useful tool when text mining is 'clustering', or grouping of data, e.g. words like 'shares' and 'acquisition' might be part of a cluster. This has enormous use in search engine optimisation and in observing user behaviours on social media.

Before clustering, a look at the term document matrix in R reveals this:

> myTdm
<<TermDocumentMatrix (terms: 1828, documents: 100)>>
Non-/sparse entries: 5261/177539
Sparsity           : 97%
Maximal term length: 17
Weighting          : SMART ltn (SMART)

The term document matrix is quite sparse: 97% of the entries are 0. In other words, many terms appear in only a handful of documents, and working with the whole matrix can be unnecessarily time consuming. The function removeSparseTerms removes such terms.
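Conceptually, removeSparseTerms keeps only terms whose fraction of zero entries is below the given threshold; roughly, in base R:

```r
#keep terms (rows) whose fraction of zero entries is below `sparse`
#-- roughly what tm's removeSparseTerms does
prune_sparse <- function(m, sparse = 0.75) {
  m[rowMeans(m == 0) < sparse, , drop = FALSE]
}
m <- rbind(common = c(1, 2, 1, 1), rare = c(0, 0, 0, 1))
rownames(prune_sparse(m))
# "common" -- "rare" is zero in 75% of documents, so it is dropped
```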

Hierarchical cluster analysis is where the relationship between textual terms is evaluated based on similarity i.e. how similar their rows in a term document matrix are:

#remove sparse terms
myTdm2 <- removeSparseTerms(myTdm, sparse=0.75)
m2 <- as.matrix(myTdm2)
#create distance matrix showing how far apart terms are,
#based on their rows in the term document matrix
distMatrix <- dist(scale(m2))
#apply hierarchical clustering
fit <- hclust(distMatrix, method="ward.D")
#plot dendrogram
plot(fit)
#mark out clusters with rectangles, default colour is red
rect.hclust(fit, k=10)


This easily reveals relationships between common terms, e.g. trading related terms such as "stock", "offer", "common" and "share" sit together on one side of the graph. However, the algorithm is time consuming for large datasets. More efficient approaches are given by the k-means and k-medoids algorithms.

K-means and K-medoids Examples

Consider 4 users and their ratings of 2 movies, each rated from 1 to 5 stars, and try to break the users into 2 clusters.

Problem of Clustering

K-means Algorithm

K-means - Step 1

Step 1: Randomly pick 2 users and let the centroid of each cluster be at their position.

K-means - Step 2

Step 2: Assign each user to the cluster with the closest centroid.

K-means - Step 3

Step 3: Calculate a new centroid for each cluster i.e. the average position of users in each cluster.

K-means - Repeat Steps 2 & 3
Step 4: Repeat steps 2 and 3 until the centroids converge.
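The steps above can be run with R's built-in kmeans function; the ratings below are made up for illustration:

```r
#hypothetical ratings for 4 users x 2 movies (1-5 stars)
ratings <- matrix(c(1, 2,
                    2, 1,
                    4, 5,
                    5, 4), ncol = 2, byrow = TRUE)
set.seed(42)  #for reproducible centroid initialisation
fit <- kmeans(ratings, centers = 2)
fit$cluster  #users 1 & 2 share a cluster, as do users 3 & 4
```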

K-medoids Partitioning Around Medoids (PAM) Algorithm

K-medoids - Step 1

Step 1: Randomly pick 2 users and let the medoid of each cluster be at their position.

K-medoids - Step 2

Step 2: Assign non-medoids to the cluster of the closest medoid and find the partition’s “cost”. The cost is the sum of the distance of each user to the nearest medoid.

Medoids Step 3

Step 3: For each medoid try to find a non-medoid that results in the total cost of the partition being minimal. Repeat this step until the set of medoids converges.
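These steps can be tried with the pam function from the cluster package (a recommended package shipped with R); the ratings below are hypothetical:

```r
library(cluster)

#hypothetical ratings for 4 users x 2 movies (1-5 stars)
ratings <- matrix(c(1, 2,
                    2, 1,
                    4, 5,
                    5, 4), ncol = 2, byrow = TRUE)
fit <- pam(ratings, k = 2)
fit$clustering  #users 1 & 2 in one cluster, users 3 & 4 in the other
fit$medoids     #actual user positions, not averaged centroids
```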

Now the following illustrates the code for k-means clustering:

#remove sparse terms
myTdm2 <- removeSparseTerms(myTdm, sparse=0.75)
m2 <- as.matrix(myTdm2)
#take a guess at number of clusters
k <- 8
#apply k-means
kmeansResult <- kmeans(m2, k)
#print out clusters and terms in each cluster
for (i in 1:k) {
 cat(paste("cluster ", i, ": ", sep=""))
 #terms assigned to cluster i
 cat(rownames(m2)[kmeansResult$cluster == i], "\n")
}

Which outputs the following clusters.

cluster 1: will 
cluster 2: acquisit 
cluster 3: offer 
cluster 4: group 
cluster 5: common share stock 
cluster 6: acquir 
cluster 7: unit 
cluster 8: compani corp

And for the PAM algorithm:

library(cluster)

#remove sparse terms
myTdm2 <- removeSparseTerms(myTdm, sparse=0.75)
m2 <- as.matrix(myTdm2)
#take a guess at number of clusters
k <- 8
#apply k-medoids pam algorithm
pamResult <- pam(m2, k)
#print out clusters and terms in each cluster
for (i in 1:k) {
 cat(paste("cluster ", i, ": ", sep=""))
 #terms assigned to cluster i
 cat(rownames(m2)[pamResult$clustering == i], "\n")
}

cluster 1: acquir compani corp 
cluster 2: acquisit 
cluster 3: common share 
cluster 4: group 
cluster 5: offer 
cluster 6: stock 
cluster 7: unit 
cluster 8: will

The results seen above are random in nature as kmeans and pam initially choose random centroids/medoids, so running the above code will give different results each time. If more stable results are required, particularly when developing code, the set.seed() function can be used.
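A quick base-R illustration of set.seed() making a k-means run repeatable:

```r
#same seed before each run: same random data, same initial centroids
set.seed(123)
a <- kmeans(matrix(rnorm(40), ncol = 2), centers = 2)$cluster
set.seed(123)
b <- kmeans(matrix(rnorm(40), ncol = 2), centers = 2)$cluster
identical(a, b)  #TRUE -- the two runs agree exactly
```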

Social Network Graphs

R has a powerful package, igraph, to generate social network graphs such as the Twitter analysis presented earlier. The following code generates a graph of terms, highlighting the key terms in the documents:

#remove sparse terms
termDocMatrix <- removeSparseTerms(myTdm, sparse=0.75)
termDocMatrix <- as.matrix(termDocMatrix)

#create boolean matrix
termDocMatrix[termDocMatrix>=1] <- 1
#transform into a term-term adjacency matrix
termMatrix <- termDocMatrix %*% t(termDocMatrix)

#create graph based on adjacency matrix
g <- graph.adjacency(termMatrix, weighted=T, mode="undirected")
#remove loops
g <- simplify(g)
#set labels and degrees of vertices
V(g)$label <- V(g)$name
V(g)$degree <- degree(g)

#show graph
plot(g, layout=layout.fruchterman.reingold(g))

Network Graph of Common Terms in Acquisition Themed Documents
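The key step above is the matrix product termDocMatrix %*% t(termDocMatrix): for a boolean term document matrix, it counts for each pair of terms how many documents contain both. A toy example:

```r
#toy boolean term-document matrix: 3 terms x 2 docs
M <- matrix(c(1, 1,
              1, 0,
              0, 1), nrow = 3, byrow = TRUE,
            dimnames = list(c("share", "offer", "stock"), c("d1", "d2")))
#term-term adjacency: co-occurrence counts
adj <- M %*% t(M)
adj["share", "offer"]  #1 -- the terms co-occur in one document
adj["offer", "stock"]  #0 -- they never appear together
```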

Similarly, by numbering documents, a graph can be generated showing which documents are closely related by terms used and which aren’t. There are 100 articles related to acquisitions in the dataset and a graph of how they relate would be too dense to print. The analysis below is instead on articles relating to grain, which produces a clearer network graph.

#remove most common terms, so graph hasn't got too many edges
idx <- which(dimnames(termDocMatrix)$Terms %in% 
            c("tonn", "trade","price","week"))
M <- termDocMatrix[-idx,]
#create document-document adjacency matrix
docsMatrix <- t(M) %*% M
#create graph  based on adjacency matrix
g <- graph.adjacency(docsMatrix, weighted=T, mode = "undirected")
#remove loops
g <- simplify(g)
#remove edges of low weight
#i.e. when 2 terms don't appear that often in the same document
#don't show them connecting
g <- delete.edges(g, E(g)[E(g)$weight <= 1])
#remove isolated vertices
g <- delete.vertices(g, V(g)[degree(g) == 0])
#show graph
plot(g, layout=layout.fruchterman.reingold)

Network graph of documents

Inspecting elements at the centre, e.g. documents 9, 13, 22 and 24, and comparing them to documents on the exterior, e.g. documents 14, 15 and 18, reveals that documents on the interior of the graph relate directly to grain and agriculture and are quite similar in content, while those at the edges are less relevant to grain and somewhat more varied in content:

> document[c(9,13,22,24)]
[1] "The U.S. Agriculture Department is not actively considering
offering subsidized wheat to the Soviet Union \"The grain companies
are trying to get this fired up again,\" an aide to Agriculture
Secretary Richard Lyng said. ... REUTER"

[1] "Indonesia\"s agriculture sector will grow by just 1.0 pct in
calendar 1987 ... Production of Indonesia\"s staple food, rice, is
forecast to fall to around 26.3 mln tonnes ... REUTER"

[1] "All major grain producing countries must do their part to help
reduce global surpluses ... REUTER"

[1] "Grain trade representatives continued to speculate that the
Reagan administration will offer subsidized wheat to the Soviet
Union ... REUTER"

> document[c(14,15,18)]
[1] "China's wheat crop this year is seriously threatened ...  REUTER"

[1] "Canadian and Egyptian wheat negotiators failed to conclude an
agreement on Canadian wheat exports ...  REUTER"

[1] "The French Cereals Intervention Board, ONIC, left its estimate
of French 1986/87 (July/June) soft wheat deliveries ...  Reuter"

Limitations and Issues with Text Mining

While processes such as stemming and the removal of stopwords or profanity have moved on since the 1990s, these processes can never be unambiguously defined. It therefore remains at the user's discretion, and subject to their bias, how they are implemented and whether or not key data patterns are noticed.

Text mining can be prone to bias relating to the age, sex and occupation of typical social network users, and it is difficult to account for sarcasm or for the fact that some users will only write online when in one particular mood, e.g. only reviewing a hotel if an experience is bad. These are only some of the factors that can skew text analysis results, which needs to be borne in mind when reporting them.


Conclusion

The above examples represent only a sample of what can be done with text mining. Given the level of textual information many companies store, text mining can significantly enhance data mining results. Text mining techniques also have uses in other areas of computer science, e.g. clustering can be used to process images and improve their quality.

Given the complexities of language, text mining can be quite prone to biases in data preparation and analysis. Nevertheless, the knowledge, patterns and results obtained by text mining can often reveal new insights that can be corroborated and checked against other forms of analysis, helping to keep such biases in check.


  1. Bringing science to customer relationships
  2. Datamining to solve pediatric brain tumors, 2004
  3. Tobias A. Jopp, University of Regensburg. How did the capital market evaluate Germany’s prospects for winning World War I? Evidence from the Amsterdam market for government bonds. February 2014.
  4. Reuters-21578

Authored by:
Liam Murray

Liam Murray is a data driven individual with a passion for Mathematics, Machine Learning, Data Mining and Business Analytics. Most recently, Liam has focused on Big Data Analytics – leveraging Hadoop and statistically driven languages such as R and Python to solve complex business problems. Previously, Liam spent more than six years within the finance industry working on power, renewables & PFI infrastructure sector projects with a focus on the financing of projects as well as the ongoing monitoring of existing assets. As a result, Liam has an acute awareness of the needs & challenges associated with supporting the advanced analytics requirements of an organization.
