We’ve been exploring the power of the programming language R for data mining. In this post we will use R to visualize tweets as a word cloud to find out what people are tweeting about the NBA (#nba). A word cloud is a visual representation of word frequency: the more often a word appears in our tweet sample, the larger it is drawn. Please see Twitter Analytics Using R Part 1: Extract Tweets for how to extract data from Twitter. The final result should look similar to the following:
1. Extract Tweets
Load the Twitter authentication and extract tweets using #nba.
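#load the twitteR package and the saved credentials from Part 1
library(twitteR)
load("twitter authentication.Rdata")
registerTwitterOAuth(cred)
#search for English tweets tagged #nba
tweets <- searchTwitter("#nba", n=1499, cainfo="cacert.pem", lang="en")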
tweets.text <- sapply(tweets, function(x) x$getText())
2. Clean Up Text
We have already been authenticated and have retrieved the text of the tweets tagged #nba. The first step in creating a word cloud is to clean up the text by converting it to lowercase and removing punctuation, usernames, links, etc. We use the function gsub to replace unwanted text; gsub replaces every occurrence of a given pattern. Although there are alternative packages that can perform this operation, we have chosen gsub because of its simplicity and readability.
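To make the "all occurrences" point concrete, here is a minimal sketch that compares gsub with its sibling sub, which stops after the first match. The sample string is made up for illustration and does not come from the real tweet sample.

```r
# Hypothetical sample text, invented for this illustration.
sample.text <- "nba finals: nba fans love the nba"

# sub() replaces only the first match of the pattern.
first.only <- sub("nba", "NBA", sample.text)

# gsub() replaces every match of the pattern.
all.matches <- gsub("nba", "NBA", sample.text)

print(first.only)
print(all.matches)
```

Only the first "nba" changes in first.only, while all three change in all.matches, which is why gsub is the better fit for cleaning text.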
#convert all text to lower case
tweets.text <- tolower(tweets.text)
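Remove the retweet marker (“rt”), matched as a whole word so that words containing “rt” (such as “sports”) are left intact
tweets.text <- gsub("\\brt\\b", "", tweets.text)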
Remove @UserName
tweets.text <- gsub("@\\w+", "", tweets.text)
Remove punctuation
tweets.text <- gsub("[[:punct:]]", "", tweets.text)
Remove links
tweets.text <- gsub("http\\w+", "", tweets.text)
Collapse repeated spaces and tabs into a single space (so neighboring words are not glued together)
tweets.text <- gsub("[ \t]{2,}", " ", tweets.text)
Remove blank spaces at the beginning
tweets.text <- gsub("^ ", "", tweets.text)
Remove blank spaces at the end
tweets.text <- gsub(" $", "", tweets.text)
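The cleanup steps above can also be bundled into one helper, which makes them easy to try on a single tweet. This is only a sketch: clean.tweet and the sample tweet are hypothetical, and the sketch assumes that "rt" should be matched as a whole word and that runs of whitespace should collapse to a single space.

```r
# clean.tweet is a hypothetical helper name, not part of any package.
clean.tweet <- function(x) {
  x <- tolower(x)                  # convert to lower case
  x <- gsub("\\brt\\b", "", x)     # drop the retweet marker (whole word only)
  x <- gsub("@\\w+", "", x)        # drop @UserName mentions
  x <- gsub("[[:punct:]]", "", x)  # drop punctuation
  x <- gsub("http\\w+", "", x)     # drop links (punctuation is already gone)
  x <- gsub("[ \t]{2,}", " ", x)   # collapse runs of spaces and tabs
  x <- gsub("^ +| +$", "", x)      # trim leading and trailing spaces
  x
}

# Made-up tweet for illustration only.
sample.tweet <- "RT @hoopsfan: Lakers win!! http://t.co/abc123 #NBA"
cleaned.tweet <- clean.tweet(sample.tweet)
print(cleaned.tweet)
```

Because gsub is vectorized, the same helper can be applied to the whole character vector at once, for example clean.tweet(tweets.text).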
3. Remove Stop Words
In the next step we will use the text mining package tm to remove stop words. A stop word is a commonly used word such as “the”. Stop words should not be included in the analysis. If tm is not already installed you will need to install it (available from the Comprehensive R Archive Network).
#install tm if not already installed
install.packages("tm")
library("tm")
#create corpus
tweets.text.corpus <- Corpus(VectorSource(tweets.text))
#clean up by removing stop words (stopwords() defaults to English)
tweets.text.corpus <- tm_map(tweets.text.corpus, removeWords, stopwords())
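To see what stop word removal does to a single piece of text, here is a base R sketch of the idea. It does not use tm: the short stop word list and the sample sentence are invented for illustration, and tm's own list is much longer and its implementation differs.

```r
# Tiny hand-written stop word list, for illustration only.
my.stopwords <- c("the", "a", "is", "in", "to")

# Made-up sample text.
sample.text <- "the lakers game is in la tonight"

# Split into words, drop the stop words, and join the rest back together.
words  <- strsplit(sample.text, " ", fixed = TRUE)[[1]]
kept   <- words[!(words %in% my.stopwords)]
result <- paste(kept, collapse = " ")
print(result)
```

What remains is the set of words that carry meaning for the analysis, which is what we want the word cloud to emphasize.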
4. Generate Word Cloud
Now we’ll generate the word cloud using the wordcloud package. For this example we plot no more than 150 words (max.words = 150), each occurring at least twice (min.freq = 2), in randomly assigned colors. Because random.order is set to FALSE, words are plotted in decreasing order of frequency, so the most frequent words tend to land near the center. If wordcloud is not already installed you will need to install it (available from the Comprehensive R Archive Network).
#install wordcloud if not already installed
install.packages("wordcloud")
library("wordcloud")
#generate wordcloud
wordcloud(tweets.text.corpus, min.freq = 2, scale = c(7, 0.5), colors = brewer.pal(8, "Dark2"), random.color = TRUE, random.order = FALSE, max.words = 150)
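If you want to check which words would make it into the cloud, you can count word frequencies yourself. The base R sketch below mimics the effect of min.freq and max.words on a few made-up tweets; it is not how wordcloud is implemented, and the cap is set to 2 instead of 150 so the effect is visible on such a small sample.

```r
# Hypothetical cleaned tweets, invented for illustration.
cleaned <- c("lakers win tonight", "lakers lose", "heat win", "paul traded")

# Count how often each word appears across all tweets.
words <- unlist(strsplit(cleaned, " +"))
freq  <- sort(table(words), decreasing = TRUE)

# Keep words that appear at least min.freq times, then cap at max.words.
min.freq  <- 2
max.words <- 2
shown <- freq[freq >= min.freq]
shown <- shown[seq_len(min(max.words, length(shown)))]
print(shown)
```

Only "lakers" and "win" appear twice, so they are the only words that pass the min.freq filter here.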
Summary
This post highlights how easily R can extract and visualize Twitter data as a word cloud. There are many ways to represent data in R, and you’ll need to dig deeper to fully understand how all the words are related to the NBA. “Lakers” might be obvious, but “Paul” or “girlfriend” might require more context. In the next post we will learn how we can perform sentiment analysis and chart the analysis results as a graph using R.