Two Years On

It’s been just over 2 years since my last post. For a large variety of life reasons (a change in role at work, an influx of new health-related hobbies, and more), this blog fell by the wayside.

However, it’s time to change that – a few months ago, I made the decision to more properly pursue this love of data; to move beyond collating Excel sheets, throwing them into Tableau, and seeing what I could see. So, for the last couple of months, I’ve been undertaking Springboard’s Foundations of Data Science course, and truly learning to love data – from R, to stats, to data wrangling, to exploratory data analysis; feature analysis and linear regression and logistic regression and clustering and trees and text analytics and so much more.

It’s been a nice feeling – to move beyond a shallow understanding of data, to start to see the magic behind the curtain. I know now it’s something that I want to do more of, over time – more courses, more in a job, more in my spare time.

And so, you’ll be seeing more from this little blog of mine.

As a return, I want to share with you my capstone project – the final project I created to complete the course.

I give you “Superbowl 50 & the Twitterverse”, written in R Markdown.

So, pull up a chair, get some popcorn, and enjoy. The show begins after the break. Oh, a word of warning – it’s long…









Superbowl 50 & the Twitterverse





Superbowl 50. The culmination of the National Football League season in the US – the champions of the American Football Conference playing the champions of the National Football Conference, for the right to call themselves the best.

I’ve never been an American football fan, but every year, the Superbowl forces its way into the consciousness of myself and many others. It’s impossible to ignore – the lauded/derided (depending on who’s singing) national anthem and halftime show; the platform for so many advertisers to get their wares in front of a huge domestic (and global) audience ($5 million for 30 seconds – but for that, the opportunity to get in front of 111.9 million Americans); and of course, the game itself.

I hadn’t intended for my capstone project to be on the Superbowl, but thanks to limitations on Twitter’s API (more on this later), like a quarterback forced to scramble at the last minute to make that final touchdown, I found myself changing my approach.

However, in the spirit of good data scientists everywhere, I wanted the data to tell the story – even to me. So I stayed away from everything bar the final result, designed and ran a number of different treatments on the data, and let it tell me its story. And it did.

Why Superbowl 50?

My initial capstone project was to be in the area of Natural Language Processing (NLP) – more accurately, a sentiment analysis of Twitter information: tweets sent when the Web Summit (a huge technology summit) in Ireland decided to uproot and move to Lisbon. However, I hit a speedbump – the Twitter API doesn’t allow you to pull information from further back than 1 week. The website does – however, I didn’t have the time to figure out how best to scrape this information.

So, thinking on my feet, I knew what I wanted to do – pull data, transform that data, and mine it for trends and a story or two. That approach could apply to any event, and Superbowl 50 had just happened. Now, I was still hamstrung a little – Twitter wouldn’t let me pull tweets from the time of the game itself; my data set was tweets from a day or two afterwards.

It wasn’t ideal. But it was enough. And throughout the report, I’ll call out where it was limiting.

Natural Language Processing – An Approach

The approach I took – a Twitter data pull, transformation of the data, and the surfacing of stories where possible – was straightforward:

  • Extract relevant tweets from Twitter, pulling a large sample for each hashtag, and store these. The Twitter data would be composed of 2 sections:
      • The tweet text itself
      • The tweeter – who the person was, their location, name, and any other salient information from their Twitter profile (all the elements available through the user class in the twitteR package)
  • Store the content as a data frame, and use the tm package (R’s most popular text mining package) on the data
  • Convert the tweet content to a corpus (a large and structured set of texts)
  • Transform the tweet text using a number of standard approaches:
      • Convert the text to lowercase
      • Remove retweets, numbers, URLs, and extra whitespace
      • Remove stopwords (words of no real help – “a”, “the”, “and”, “or”, and more)
      • Stem words where needed (so that words which reference the same thing are treated the same)
  • Build a Document-Term Matrix from the corpus (a matrix of the remaining words, to allow for analysis)
  • Look at frequency (how often key terms are mentioned in tweets), clustering (do these terms fit into logical families? Can patterns be observed?), etc.
  • Perform sentiment analysis – for each tweet, look at the positive and negative words used, and determine a sentiment score: the more negative the score, the more negative the tweet, and vice versa (see the sketch after this list)
  • Include additional elements that make sense (e.g. a word cloud, a geographical analysis of where people were tweeting from, a time analysis to look at sentiment change over time, etc.)
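
To make the sentiment-scoring idea concrete, here’s a minimal sketch of that scoring step. The real analysis would lean on a proper opinion lexicon – the pos_words and neg_words vectors here are tiny placeholders, purely for illustration:

score_sentiment <- function(tweet, pos_words, neg_words) {
  # Split the tweet into lowercase words
  words <- unlist(strsplit(tolower(tweet), "\\s+"))
  # Score = count of positive words minus count of negative words
  sum(words %in% pos_words) - sum(words %in% neg_words)
}

score_sentiment("What a great game", c("great", "win"), c("fumble", "loss"))
## [1] 1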

In the following sections, I’ll discuss each of these in more detail, looking at what I did, why I did it, and the outcome.

Part 0: Loading the relevant libraries

Before anything else, I set my working directory, and load all the relevant libraries for this project. Most of these are related to text mining and graphing.

library(twitteR)       # Twitter API access
library(RCurl)
library(stringr)
library(tm)            # text mining
library(plyr)
library(dplyr)
library(reshape2)
library(tidyr)
library(wordcloud)
library(SnowballC)     # word stemming
library(RColorBrewer)
library(fpc)           # clustering
library(cluster)
library(ggplot2)       # graphing
library(scales)
library(ape)

Part 1: Getting the data – playing nice with Twitter

I used the brilliant twitteR package to access the Twitter API. This would allow me to pull out a number of tweets using hashtags I wanted, store them in R, and use them as my data set.

Using twitteR is exceptionally straightforward.

Firstly, I set myself up on Twitter as an app developer, as I needed the relevant details for an OAuth handshake (the authenticated link between my code and Twitter). This involved creating an app at Twitter (https://dev.twitter.com/) – logging in with my Twitter account and creating a new application. This gave me the necessary API keys, which I then coded into R. As these are linked to me, I’ve kept them anonymous in this report. However, setting up the keys and the handshake is very straightforward.

consumer_key <- "INSERT YOUR OWN HERE"
consumer_secret <- "INSERT YOUR OWN HERE"
access_token <- "INSERT YOUR OWN HERE"
access_secret <- "INSERT YOUR OWN HERE"

Then, I created the Twitter handshake itself.

setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
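
Though not part of the original flow, a quick way to confirm the handshake succeeded is to look up any public account – an optional aside, using the league’s public @NFL handle here:

nfl_user <- getUser("NFL")   # look up a public account to confirm the connection
nfl_user$name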

And finally, I pulled my data out of Twitter. To give myself a large data set, I used 4 main hashtags from the Superbowl (#superbowl, #superbowl50, #nfl and #sb50) and, for each, pulled 10,000 tweets – easily done with the searchTwitter function. I also restricted the pull to English-language tweets from the 6th of February onwards.

superbowl <- searchTwitter("#superbowl", n = 10000, lang = "en", since = "2016-02-06")
superbowl50 <- searchTwitter("#superbowl50", n = 10000, lang = "en", since = "2016-02-06")
nfl <- searchTwitter("#nfl", n = 10000, lang = "en", since = "2016-02-06")
sb50 <- searchTwitter("#sb50", n = 10000, lang = "en", since = "2016-02-06")

I haven’t run the above as code (unlike the R code in the rest of this report, which runs inline), as re-running it would overwrite the data I already have – Twitter’s API would return only the most recent tweets. Instead, I’ll load the data sets I’ve already pulled into this report.

load("capstone.RData")

The searchTwitter function pulls back some very interesting information: the text of the status; the screen name of the user who posted it; a unique ID; the screen name and ID of the user it was in reply to (if applicable); when the status was created; whether it has been favorited; the number of times it has been retweeted; and the longitude and latitude of the user (if available).
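
For illustration, each element returned is a status object whose fields can be read directly (field names as per the twitteR documentation):

first_tweet <- superbowl[[1]]
first_tweet$text          # the status text
first_tweet$screenName    # who posted it
first_tweet$created      # when it was created
first_tweet$retweetCount # how many times it has been retweeted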

So, now I had 4 data sets – but they were messy…

Part 2: Cleaning the data – “You need to make a corpse?!?!”

Next up, it was time to clean up what I had.

Firstly, I wanted to strip retweets from my extracted Twitter data, as these would simply add too much noise. The twitteR package offers a strip_retweets() function for this exact task.

superbowl_no_rt <- strip_retweets(superbowl)
superbowl50_no_rt <- strip_retweets(superbowl50)
nfl_no_rt <- strip_retweets(nfl)
sb50_no_rt <- strip_retweets(sb50)

Then, I stored the reduced data in data frames – my later work would be done on these data frames (or, more accurately, on a combined data frame from all 4 hashtags) and on a corpus formed from the pulled data (discussed below).

superbowl_df <- twListToDF(superbowl_no_rt)
superbowl50_df <- twListToDF(superbowl50_no_rt)
nfl_df <- twListToDF(nfl_no_rt)
sb50_df <- twListToDF(sb50_no_rt)

# rbind takes any number of data frames in a single call
combined_df <- rbind(superbowl_df, superbowl50_df, nfl_df, sb50_df)
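
One caveat with combining four hashtag pulls: a tweet tagged with more than one of the hashtags (say, #superbowl and #sb50) will appear more than once. I haven’t de-duplicated here, but if it mattered for a given analysis, the unique tweet ID from twListToDF makes it a one-liner – a sketch, not something applied in this report:

# Hypothetical de-duplication on the unique tweet ID
combined_df_unique <- combined_df[!duplicated(combined_df$id), ]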

Next, text mining requires the creation of a corpus (not a corpse). So putting away my poison, I created a corpus from the text of the combined data frame, using the tm package.

A corpus is a large collection of documents – in this case, the text of all the tweets.

combined_corpus <- Corpus(VectorSource(combined_df$text))

Once created, I performed a number of standard transformations on the corpus, using the tm package. These transformations are par for the course in text mining: converting all the characters to lowercase; removing punctuation symbols; removing numbers; removing URLs; removing stopwords (words such as “and”, “or”, “the”, etc.); stemming words (reducing similar words to the same stem so they are treated the same – e.g. removing the “-ally” from “fantastically”, so that any occurrences of “fantastically” in the data set are treated as “fantastic”); and removing extra whitespace.

combined_corpus <- tm_map(combined_corpus, removePunctuation)
combined_corpus <- tm_map(combined_corpus, removeNumbers)
# content_transformer() wraps a plain text function (such as gsub or tolower)
# so that the result remains a valid tm corpus
combined_corpus <- tm_map(combined_corpus, content_transformer(function(x) gsub("http[[:alnum:]]*", "", x)))
combined_corpus <- tm_map(combined_corpus, removeWords, stopwords("english"))
combined_corpus <- tm_map(combined_corpus, stemDocument)
combined_corpus <- tm_map(combined_corpus, stripWhitespace)
# Note: lowercasing happens after stopword removal, so capitalised stopwords
# ("The", "And") survive as lowercase terms - they appear in the frequency list later
combined_corpus <- tm_map(combined_corpus, content_transformer(tolower))

This leaves us with a corpus containing only the most relevant words – the words on which the most insightful analysis can be done.

The next step is to convert the corpus into a Document-Term Matrix (DTM) – a mathematical matrix that describes the frequency of terms occurring in a collection of documents, where rows correspond to documents in the collection and columns correspond to terms.
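
To make that concrete, here’s a toy example, entirely separate from the Superbowl data:

toy_corpus <- Corpus(VectorSource(c("broncos win the game",
                                    "panthers lose the game")))
toy_dtm <- DocumentTermMatrix(toy_corpus)
inspect(toy_dtm)   # 2 rows (documents) x 6 columns (terms); entries are counts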

Alongside the DTM, I also create a sparse DTM, ignoring terms with a document frequency below a given threshold (e.g. removing words which may appear only once) – hopefully making the remaining terms more relevant.

In this case, however, I didn’t remove too many terms (allowing 99% sparsity): while there are a lot of terms occurring frequently, the sheer number of documents means that anything below 99% carves off a lot of valuable data.

# Thanks to content_transformer() above, the corpus is still a valid tm corpus,
# so we can build the Document-Term Matrix directly
combined_DTM <- DocumentTermMatrix(combined_corpus)
# Keep only terms that appear in at least ~1% of documents
combined_DTMs <- removeSparseTerms(combined_DTM, 0.99)

Looking at the dimensions of both the DTM and sparse DTM, we can see the impact of creating the sparse DTM – the original DTM has 17,655 terms, while the sparse DTM has 118 terms.

dim(combined_DTM)
## [1] 16596 17655
dim(combined_DTMs)
## [1] 16596   118

Part 3: Frequency (frequency, frequency, frequency)

Now that we have our DTM, our sparse DTM, and our combined data frame, we can start to look for some stories in the data.

Firstly, we’ll look at the most frequent words – in our case, words occurring at least 100 times.

min_freq <- 100
# Term frequencies, summed across all documents in the sparse DTM
term_freq <- colSums(as.matrix(combined_DTMs))
term_freq <- subset(term_freq, term_freq >= min_freq)
freq_words_df <- data.frame(term = names(term_freq), freq = term_freq)
# List every term in the full DTM occurring at least min_freq times
findFreqTerms(combined_DTM, lowfreq = min_freq)
##   [1] "<U+653C><U+3E64><U+613C><U+3E30><U+623C><U+3E64><U+653C><U+3E64><U+623C><U+3E31><U+383C><U+3E39>" "<U+653C><U+3E64><U+613C><U+3E30><U+623C><U+3E64><U+653C><U+3E64><U+623C><U+3E34><U+393C><U+3E65>"
##   [3] "ads"                      "amp"                     
##   [5] "and"                      "apparel"                 
##   [7] "auto"                     "avosfrommexico"          
##   [9] "avosinspace"              "back"                    
##  [11] "bay"                      "bears"                   
##  [13] "bengals"                  "best"                    
##  [15] "beyonce"                  "beyoncé"                 
##  [17] "big"                      "black"                   
##  [19] "blue"                     "bowl"                    
##  [21] "box"                      "brady"                   
##  [23] "break"                    "broncos"                 
##  [25] "browns"                   "brunomars"               
##  [27] "buzz"                     "cam"                     
##  [29] "camnewton"                "can"                     
##  [31] "card"                     "cardinals"               
##  [33] "cards"                    "case"                    
##  [35] "champions"                "chargers"                
##  [37] "check"                    "chiefs"                  
##  [39] "chrome"                   "city"                    
##  [41] "code"                     "coldplay"                
##  [43] "colts"                    "combine"                 
##  [45] "commercial"               "commercials"             
##  [47] "congrats"                 "cowboys"                 
##  [49] "dallas"                   "dallascowboys"           
##  [51] "day"                      "denver"                  
##  [53] "denverbroncos"            "didnt"                   
##  [55] "dolphins"                 "dont"                    
##  [57] "draft"                    "eagles"                  
##  [59] "eli"                      "elimanning"              
##  [61] "espn"                     "fan"                     
##  [63] "fans"                     "fave"                    
##  [65] "favorite"                 "five"                    
##  [67] "footbal"                  "football"                
##  [69] "for"                      "free"                    
##  [71] "freeride"                 "full"                    
##  [73] "fumble"                   "game"                    
##  [75] "get"                      "giants"                  
##  [77] "going"                    "good"                    
##  [79] "gopats"                   "got"                     
##  [81] "great"                    "green"                   
##  [83] "halftime"                 "hat"                     
##  [85] "here"                     "hes"                     
##  [87] "houston"                  "how"                     
##  [89] "its"                      "jaguars"                 
##  [91] "jersey"                   "just"                    
##  [93] "know"                     "ladygaga"                
##  [95] "large"                    "last"                    
##  [97] "latest"                   "learn"                   
##  [99] "like"                     "live"                    
## [101] "logo"                     "look"                    
## [103] "looking"                  "lose"                    
## [105] "loser"                    "losing"                  
## [107] "loss"                     "lot"                     
## [109] "love"                     "make"                    
## [111] "manning"                  "manziel"                 
## [113] "media"                    "mens"                    
## [115] "miller"                   "mvp"                     
## [117] "national"                 "nba"                     
## [119] "new"                      "news"                    
## [121] "newton"                   "next"                    
## [123] "nfl"                      "night"                   
## [125] "now"                      "nwt"                     
## [127] "nygiants"                 "oakland"                 
## [129] "off"                      "one"                     
## [131] "orleans"                  "packers"                 
## [133] "panini"                   "panthers"                
## [135] "parade"                   "party"                   
## [137] "patriots"                 "people"                  
## [139] "performance"              "peyton"                  
## [141] "peytonmanning"            "pittsburgh"              
## [143] "play"                     "players"                 
## [145] "podcast"                  "promo"                   
## [147] "raiders"                  "rams"                    
## [149] "ravens"                   "redskins"                
## [151] "reebok"                   "report"                  
## [153] "ride"                     "right"                   
## [155] "rivera"                   "rookie"                  
## [157] "saints"                   "sale"                    
## [159] "san"                      "say"                     
## [161] "says"                     "seahawks"                
## [163] "season"                   "seattle"                 
## [165] "see"                      "sex"                     
## [167] "shirt"                    "show"                    
## [169] "size"                     "sport"                   
## [171] "sports"                   "spot"                    
## [173] "steelers"                 "still"                   
## [175] "sunday"                   "super"                   
## [177] "superbowl"                "taxis"                   
## [179] "team"                     "texans"                  
## [181] "that"                     "the"                     
## [183] "think"                    "this"                    
## [185] "time"                     "today"                   
## [187] "tom"                      "top"                     
## [189] "topps"                    "tshirt"                  
## [191] "uber"                     "ubercomedrive"           
## [193] "update"                   "via"                     
## [195] "video"                    "vikings"                 
## [197] "von"                      "vote"                    
## [199] "want"                     "was"                     
## [201] "washington"               "watch"                   
## [203] "watching"                 "way"                     
## [205] "weight"                   "what"                    
## [207] "why"                      "will"                    
## [209] "win"                      "winning"                 
## [211] "winter"                   "wire"                    
## [213] "with"                     "womens"                  
## [215] "won"                      "year"                    
## [217] "you"                      "your"

The first two “terms” are garbled non-English tokens, which we’ll remove.

# Remove the first two (garbled) rows
freq_words_df <- freq_words_df[-c(1, 2), ]
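
Before digging into the terms themselves, a quick bar chart of the top terms makes the distribution easier to scan – an illustrative sketch using the ggplot2 package loaded earlier:

# Plot the 20 most frequent terms, most frequent at the top
top_terms <- head(freq_words_df[order(-freq_words_df$freq), ], 20)
ggplot(top_terms, aes(x = reorder(term, freq), y = freq)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "Term", y = "Occurrences across all tweets")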

So, some interesting things are already jumping out. There are a few terms we would expect: terms relating to Cam Newton (the Carolina Panthers’ quarterback, and the league’s MVP) and Peyton Manning (the Denver Broncos’ best-known player); the teams themselves; the event. However, there are also a few others:

  • Terms relating to music acts – perhaps the national anthem or the half-time show (Beyonce, Coldplay)?
  • Terms relating to taxis and free rides – potentially a promotion of some sort (uber, freeride, ubercomedrive, taxis, promo)?

Let’s dig a little deeper and look at the associations between a set of these frequent terms and other terms (to see which terms are used together), plotting a few of them. We’ll use the findAssocs() function for this investigation. For this iteration, we’ll look at anything with a correlation greater than 0.25. We could lower this to find more terms at a weaker correlation, but we’ll start with 0.25 for now.
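
Each findAssocs() result below needs converting into a data frame, and the do.call/Map incantation for doing so is dense enough to be worth wrapping in a small helper first (behaviour identical to writing it out longhand each time):

assoc_to_df <- function(assoc_list) {
  # findAssocs() returns a named list with one element per search term;
  # each element is a named numeric vector of associated terms and their correlations.
  # Bind these into a single data frame of (xterm, cor, term) rows.
  do.call(rbind, Map(function(d, n) {
    cbind.data.frame(xterm = if (length(d) > 0) names(d) else NA,
                     cor = if (length(d) > 0) d else NA,
                     term = n)
  }, assoc_list, names(assoc_list)))
}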

assoc <- 0.25
broncos_assoc <- findAssocs(combined_DTM, "broncos", assoc)
avos_assoc <- findAssocs(combined_DTM, "avosinspace", assoc)
cam_assoc <- findAssocs(combined_DTM, "cam", assoc)
patriots_assoc <- findAssocs(combined_DTM, "patriots", assoc)
panther_assoc <- findAssocs(combined_DTM, "panther", assoc)
newton_assoc <- findAssocs(combined_DTM, "newton", assoc)
uber_assoc <- findAssocs(combined_DTM, "uber", assoc)
beyonce_assoc <- findAssocs(combined_DTM, "beyonc", assoc)
peyton_assoc <- findAssocs(combined_DTM, "peyton", assoc)
coldplay_assoc <- findAssocs(combined_DTM, "coldplay", assoc)
broncos_associations_df <- assoc_to_df(broncos_assoc)

Key terms associated with “Broncos” are “stanleynelson” (an Emmy-award-winning filmmaker who made “The Black Panthers: Vanguard of the Revolution”), “black” and “revolutionary” (both terms from the documentary’s title). A little bit random…

avos_associations_df <- assoc_to_df(avos_assoc)

Key terms associated with “avosinspace” are the same as for “Broncos”: “stanleynelson”, “black” and “revolutionary”.

cam_associations_df <- assoc_to_df(cam_assoc)

Key terms associated with “cam” (Cam Newton) are “newton” (for obvious reasons), “fumble”, “criticism”, “didnt” and “learn”… It sounds like the Panthers’ quarterback did something wrong – and a quick Google search confirms it: “Carolina Panthers quarterback Cam Newton has been harshly criticised for appearing to hesitate instead of jumping on the loose football he had fumbled late in the fourth quarter. The Denver Broncos recovered the ball, and on the ensuing drive C.J. Anderson found the back of the end zone to seal Denver’s Super Bowl victory” (http://uk.businessinsider.com/cam-newton-explains-why-he-didnt-jump-on-the-fumble-in-super-bowl-50-2016-2).

patriots_associations_df <- assoc_to_df(patriots_assoc)

“Patriots” refers to the New England Patriots, the defending champions from the previous year, who were beaten by the Broncos in the AFC Championship game. That left a little bad blood between the teams, so it seems Patriots fans were vocal on Twitter (“gopats”).

In addition, it’s evident Uber ran a promotion (“uber”, “taxi”, “getride”, “code”, “promo”) – we’ll look at Uber later on.

panther_associations_df <- assoc_to_df(panther_assoc)

As with “Broncos” and “avosinspace”, key terms associated with “Panther” are “stanleynelson”, “black” and “revolutionary”. This would make sense, given the association in names between the team and the documentary.

newton_associations_df <- assoc_to_df(newton_assoc)