Saturday, January 2, 2016

Text Mining in R for Newbies (like me!)

Lately, I have been drawn to learning R (an open-source statistical program available at https://www.r-project.org/) because of text mining.  Stata also has a text analysis package called txttool, but I haven't studied it yet.  

So over the holidays, I undertook a project to analyze text documents.  Text mining is on the rise because it can give you insights for your research when you want to surface dominant themes.  I did some content analysis before using NVivo at the University of Melbourne, and even with NVivo I found analyzing qualitative data quite a challenge.  So, let's begin with our small project.

Step 1: Install R (go to the link above)

Step 2: Install RStudio (a user interface for R that is very helpful for newbies)
This is what RStudio looks like.
[Screenshot: the RStudio window]



Step 3: Install packages
In the top-left panel, type the following:

install.packages('tm', dependencies=TRUE)
install.packages('wordcloud', dependencies=TRUE)

The tm (text mining) and wordcloud packages let you mine the text in any document and then turn the most frequent words into word clouds. 
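If you rerun your script often, you may not want to reinstall the packages every time. Here is a minimal sketch that installs only what is missing. The helper name missing_packages is my own (hypothetical), and the interactive() check is an assumption so that nothing gets installed when the script runs unattended.

```r
# Hypothetical helper: returns the packages from `pkgs` that are not installed yet.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("tm", "wordcloud"))

# Install only when something is missing and you are in an interactive
# session (such as RStudio).
if (interactive() && length(to_install) > 0) {
  install.packages(to_install, dependencies = TRUE)
}
```

If both packages are already installed, to_install is empty and nothing happens.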

Step 4:  Save your documents as text files using Notepad.  I suggest that you create a folder just for this project.  In my case, I saved the text files in C:/Users/grace/Desktop/txtmining
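Before mining anything, it can help to confirm that R can actually see your text files. The sketch below is only an illustration: it uses a temporary folder and two made-up sample files (doc1.txt and doc2.txt are hypothetical names) so that it runs anywhere. With your own documents, you would point folder at your project folder instead.

```r
# Assumption: a temporary folder stands in for your own project folder.
folder <- file.path(tempdir(), "txtmining")
dir.create(folder, showWarnings = FALSE)

# Write two tiny sample documents (made-up text, for illustration only).
writeLines("Text mining in R is fun.", file.path(folder, "doc1.txt"))
writeLines("Word clouds show the most frequent words.", file.path(folder, "doc2.txt"))

# List the .txt files that R can see in the folder.
txt_files <- list.files(folder, pattern = "\\.txt$")
print(txt_files)
```

If a file you expected is not listed, check that it was saved with the .txt extension and in the right folder.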

Step 5: Type the code
This is not the best code ever for text mining, and it was borrowed from other R blogs.  I selected only the code that provides a simple solution to my problem of mining text documents.  

library(wordcloud)
library(tm)
## You have to load the packages with library() so that you can use them.

setwd("C:/Users/grace/Desktop/txtmining")
## This sets the working directory

txtdata <- Corpus(DirSource("C:/Users/grace/Desktop/txtmining"))
inspect(txtdata)
## This reads every file in the folder into a corpus and then shows a summary of it

txtdata <- tm_map(txtdata, stripWhitespace)
## This collapses extra blank spaces into a single space

txtdata <- tm_map(txtdata, content_transformer(tolower))
## This transforms uppercase to lowercase (e.g. 'DEPED' to 'deped')

txtdata <- tm_map(txtdata, removeWords, stopwords('en'))
## This removes common words that are not necessary to your analysis (e.g. is, are, shall, in, the, etc.)
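To see what these cleaning steps do without touching your own files, you can try them on a tiny in-memory corpus. This is a minimal sketch with made-up sample sentences. VCorpus and VectorSource come from the tm package, while sample_text, sample_corpus and cleaned are names I chose for illustration. Note that this sketch runs stripWhitespace last, a small variation on the order above, so that the gaps left behind by removed words are collapsed too.

```r
library(tm)

# Made-up sample sentences, used only for illustration.
sample_text <- c("The   DEPED issued the NEW guidelines.",
                 "Teachers are in the schools.")
sample_corpus <- VCorpus(VectorSource(sample_text))

# Same cleaning steps as in the post, applied to the sample corpus.
sample_corpus <- tm_map(sample_corpus, content_transformer(tolower))
sample_corpus <- tm_map(sample_corpus, removeWords, stopwords("en"))
sample_corpus <- tm_map(sample_corpus, stripWhitespace)

# Pull the cleaned text back out as a plain character vector.
cleaned <- unname(sapply(sample_corpus, as.character))
print(cleaned)
```

Printing cleaned after each step, one step at a time, is an easy way to check what each transformation changes.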

wordcloud(txtdata)
## This shows the word cloud of your texts
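A word cloud is nice to look at, but you may also want the actual word counts behind it. Here is a minimal sketch, again on made-up sample sentences, that builds a term-document matrix with tm and sums the counts per word. The removePunctuation step and the names sample_text, tdm and word_freq are my additions for illustration and are not part of the code above. Keep in mind that TermDocumentMatrix by default drops words shorter than three letters.

```r
library(tm)
library(wordcloud)

# Made-up sample sentences, used only for illustration.
sample_text <- c("Text mining in R is fun",
                 "Word clouds show frequent words",
                 "Text mining finds frequent text")
sample_corpus <- VCorpus(VectorSource(sample_text))
sample_corpus <- tm_map(sample_corpus, content_transformer(tolower))
sample_corpus <- tm_map(sample_corpus, removePunctuation)
sample_corpus <- tm_map(sample_corpus, removeWords, stopwords("en"))

# Count how often each word appears across all documents.
tdm <- TermDocumentMatrix(sample_corpus)
word_freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
print(head(word_freq, 10))

# Draw the cloud from the counts; set.seed keeps the layout repeatable.
set.seed(123)
wordcloud(names(word_freq), word_freq, min.freq = 1, random.order = FALSE)
```

With random.order = FALSE the most frequent words are drawn first, so they end up near the center of the cloud.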
