Sentiment Analysis of Twitter Data: Uber / Lyft
By Andy Keeton and Tim Karfs
Project Overview
Uber and Lyft (collectively referred to as ridesharing) are changing the ways people think about mobility. Whether you like it or not, cities are the testing grounds for a new wave of mobility offerings that could end up replacing the need for people to own cars. Personally owned vehicles (POVs) are the second highest monthly expense for households behind mortgage or rent payments. As ridesharing services become cheaper, and easier to use, more people are expected to shift driving priorities - this could cut down on billions of tons of vehicle-related emissions (see Rocky Mountain Institute's Peak Car Ownership). If you ask someone on the street, they may feel reluctant to reveal their true preferences about ridesharing, but social media represents an interesting channel to analyze market sentiments. So what can social media data really tell us about how people are thinking about the transition to ridesharing?
Software companies are capitalizing on the abundant flow of information coming from Twitter accounts, using Python packages such as Twython and Tweepy to collect market data from social media users. Crimson Hexagon is a firm based out of Boston that has developed a major presence in this area. A recent blog post describing what social media conversations tell readers about consumer loyalty demonstrates just how far tech companies are willing to go to analyze how people are thinking about trending topics. However, the tools created by companies like Crimson Hexagon are expensive, making them impractical for those interested in exploring smaller-scale Twitter conversations.
We set out to understand how people around the country felt about ridesharing (specifically, Uber and Lyft) through the very personal, yet impersonal, lens of Twitter. We were hoping to find out the overall sentiment toward Lyft and Uber, as well as a state-based distribution of this sentiment - do people in different states feel differently about Lyft and Uber, and how significant is this difference?
Using a variety of specialized packages within Python, we were able to explore sentiment towards Lyft and Uber on Twitter (more on this later). Although it was much more complicated than you might think, we were able to determine the state in which many Twitter users were located. This allowed us to build an understanding of how sentiment changed from state to state, as well as an interesting view into the differences in Twitter activity surrounding Lyft and Uber across the country.
The overall results were promising. While most people’s tweets are rather sentiment-free, there was an observable and significant positive skew in overall sentiment toward Uber and Lyft. However, the results from our distributional analysis were not as cut and dried. Due to a relatively low sample size, we were unable to make any definitive conclusions about the geographic distribution of Lyft and Uber sentiment across the United States. We were, however, able to identify a couple of key distributional differences. The quantity of tweets tended to be larger in liberal states with a higher overall population, more city centers, and a higher proportion of millennials. Additionally, these states tended to have tweets that were favorited more, meaning that the influence of these Twitter users may be higher. Fun fact: the tweet with the largest number of favorites came from a conservative political analyst in Chicago - it was favorited 41,179 times!
Methods
Using multiple specialized packages in Python, we were able to download specific tweets from Twitter, determine the state-level location of the tweet, analyze the sentiment of each tweet, and then map our findings by state. We pulled tweets from the Twitter API (note: API stands for “application programming interface” and is the background data protocol and tool that allows information to be shared across platforms) that contained the text “Uber” or “Lyft” once daily for 19 days. Due to the limitations set by Twitter, we were only able to collect 100 tweets per day per search term, meaning that our final count of tweets totaled 3,800. Each tweet that was collected contained the following information: user name, date/time of tweet, text of tweet, user-specified location of Twitter account, and number of favorites for the tweet.
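The daily collection step described above can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' actual code: the credential handling is simplified, and the helper names (`collect_tweets`, `tweet_record`, and the `sample` status dict) are hypothetical.

```python
def collect_tweets(search_term, app_key, app_secret, count=100):
    """Pull up to `count` recent tweets matching `search_term` (one daily run).

    Requires Twython and valid Twitter API credentials (placeholders here);
    real usage also involves OAuth access tokens, omitted for brevity.
    """
    from twython import Twython
    twitter = Twython(app_key, app_secret)
    results = twitter.search(q=search_term, count=count, lang="en")
    return [tweet_record(status) for status in results["statuses"]]

def tweet_record(status):
    """Keep only the five fields used in the analysis."""
    return {
        "user": status["user"]["screen_name"],
        "time": status["created_at"],
        "text": status["text"],
        "location": status["user"]["location"],
        "favorites": status["favorite_count"],
    }

# A fabricated status dict mimicking the Twitter API's JSON shape,
# so the record extraction can be shown without a live API call:
sample = {
    "user": {"screen_name": "rider123", "location": "Austin, TX"},
    "created_at": "Mon Dec 10 18:00:00 +0000 2018",
    "text": "My Lyft driver was great today",
    "favorite_count": 3,
}
record = tweet_record(sample)
```

Running `tweet_record(sample)` yields a flat dictionary with the user name, timestamp, text, user-specified location, and favorite count, matching the fields listed above.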
Once these tweets were collected, we ran a series of functions to determine each tweet’s state-level location. This proved tricky due to the inherent randomness and error within user-specified location information (e.g., one might indicate their location as “my house,” which clearly doesn’t convey any geographical information). After removing tweets from users located outside of the United States and those with inadequate or no location information, we were left with a total of 1,345 tweets (approximately 35% of our initial collection).
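The location-cleaning step might look something like the sketch below. The pre-filter heuristics and function names are our own illustrations; the Geopy call mirrors the Nominatim geocoder's documented interface but is not the authors' exact pipeline.

```python
def usable_location(raw):
    """Pre-filter: drop empty or clearly non-geographic location strings
    before spending a geocoding call on them. The junk-word set is a
    hypothetical example, not an exhaustive list."""
    junk = {"my house", "earth", "everywhere", "the internet"}
    return bool(raw) and any(c.isalpha() for c in raw) and raw.lower() not in junk

def state_from_location(raw):
    """Resolve a user-typed location string to a U.S. state via Geopy's
    Nominatim geocoder (network call; illustrative only)."""
    from geopy.geocoders import Nominatim
    geolocator = Nominatim(user_agent="rideshare-sentiment-sketch")
    place = geolocator.geocode(raw, addressdetails=True, country_codes="us")
    if place is None:
        return None  # could not geocode, or not in the United States
    return place.raw.get("address", {}).get("state")
```

Tweets whose locations fail the pre-filter, or that Nominatim cannot place inside the United States, would be dropped, which is how a collection of 3,800 shrinks to 1,345 usable tweets.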
We then ran a standard TextBlob sentiment analysis on the tweets to determine the overall sentiment of each tweet. This produced a single number between -1 and 1, with -1 meaning highly negative and 1 meaning highly positive. A chart of the distribution of overall sentiment can be found below.
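The sentiment step is a direct use of TextBlob's polarity score; the `sentiment_bucket` labeling helper is our own illustrative addition for charting, not part of the authors' code.

```python
def tweet_polarity(text):
    """TextBlob polarity in [-1, 1]: -1 highly negative, 1 highly positive."""
    from textblob import TextBlob
    return TextBlob(text).sentiment.polarity

def sentiment_bucket(polarity):
    """Coarse label, e.g. for coloring a distribution chart (our addition)."""
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"
```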
From here, we were able to group all tweets by state and determine the following state-based measurements, which were graphed (see below): mean sentiment, total number of tweets, and total (sum) of favorites. This helped us gain an understanding of the distribution of sentiment and Twitter activity toward Uber and Lyft across the country.
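The state-level grouping described above can be sketched with a pandas `groupby` (assuming the cleaned tweets sit in a DataFrame; the column names and toy data are ours):

```python
import pandas as pd

# Toy stand-in for the cleaned tweet table (state, sentiment, favorites):
tweets = pd.DataFrame({
    "state": ["TX", "TX", "CO"],
    "sentiment": [0.5, -0.1, 0.2],
    "favorites": [3, 1, 10],
})

# One row per state: mean sentiment, tweet count, and summed favorites,
# matching the three state-based measurements listed above.
by_state = tweets.groupby("state").agg(
    mean_sentiment=("sentiment", "mean"),
    tweet_count=("sentiment", "size"),
    total_favorites=("favorites", "sum"),
)
```

Each of the three columns in `by_state` then feeds one of the three state maps discussed in the plot descriptions.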
Implications and Limitations
Our analysis shows that people generally tweet positively towards Uber and Lyft. This is a good sign for the ridesharing industry, as any positive sentiment (however small) is good. However, our sample size was extremely small, with most states having fewer than 10 total tweets. Therefore, the implications of our study do not extend greatly outside of the scope of this project.
An additional limitation was the lack of geolocation data on Twitter. Users can indicate their exact location if they’d like, but this is very rare. Instead, most simply specify a generalized location (e.g., Austin, TX or CO) or include no location at all. This made it tough to get an accurate distribution of tweets. And unfortunately for Twitter data analysts like us, this trend doesn’t seem to be changing any time soon. The Congressional testimony of yet another tech company executive on December 11th suggests that many people remain uneasy about how much of their own personal information is available for companies to use.
Even with these limitations, we were able to gain a solid initial understanding of our problem and develop an analysis procedure that can easily be expanded for a more complex and comprehensive understanding of the issue. Now we just need someone to help us collect Twitter data for a year. So, who’s up for the challenge?
Data Sources (References)
- Tweets: All tweets and their corresponding information were collected from Twitter using the Python package Twython with search terms "Uber" and "Lyft"
- Locational data: All locational (state and country) data were derived using Geopy from user-indicated location data on Twitter
- Sentiments: All sentiments were derived using TextBlob from tweet text data on Twitter
- USA state boundaries: U.S. Department of Commerce, U.S. Census Bureau, Geography Division/Cartographic Products Branch
Our Code
Below is the code that we used to collect tweets from the Twitter API, process the data to determine location and sentiment, and plot the results. The code blocks used to collect tweets and process the data (Code Block 1 and Code Block 2) are listed in Markdown, as they were run iteratively over time due to pull limits and time constraints: 100 tweets per search criterion per day were collected due to the Twitter API restrictions, and then segments of 500 tweets were analyzed due to the Earthpy limitations. These two code blocks produced files that were used iteratively, and ultimately a single set of output files was produced. That set of files is used in Code Block 3 to analyze the data (this code will run smoothly and produce the plots).
Packages and Functions
The following are the packages and functions that we used throughout the three different code blocks (as code cells):
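The original code cells are not reproduced in this chunk, so the list below is a reconstruction from the Data Sources section and the analysis steps described above; the inclusion of pandas and matplotlib is our assumption.

```python
# Reconstructed package list (not the authors' original cells);
# pandas and matplotlib are assumed from the grouping and plotting steps.
packages_used = {
    "twython": "download tweets from the Twitter API",
    "geopy": "resolve user-specified locations to states",
    "textblob": "score tweet sentiment",
    "earthpy": "workflow helpers (noted above for its batch-size limits)",
    "pandas": "tabulate and group tweets by state (assumed)",
    "matplotlib": "plot the distribution and state maps (assumed)",
}
```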
Plot description
This plot shows the distribution of sentiment values for all tweets collected in the United States. Tweets receiving a sentiment value of -1 have a highly negative sentiment and tweets receiving a value of 1 have a highly positive sentiment. An example of a tweet with a value of -1 is: "These Uber fares are insane in ATL." An example of a value of 1 is: "#lyft can Ronald get the driver of the year award I thought him showing up in a Santa suit was awesome." While the vast majority of tweets were valued at a relatively low sentiment (both positive and negative), there is a clear positive slant in overall sentiment, as evidenced by the taller green columns.
Plot descriptions
These plots show the state-based geographical distribution of average tweet sentiment, total number of tweets, and total number of favorites for our collection. The "Mean Tweet Sentiment" plot ranges from dark red (negative sentiment) to dark green (positive sentiment). While there are some states on the extremes, those tend to be states with very low total tweet values. The states with a significant number of tweets tend to be in the middle in terms of mean sentiment. Therefore, no clear distribution can be observed. The "Total Number of Tweets" plot ranges from light green (fewest number of tweets) to dark green (greatest number of tweets). It clearly demonstrates that Twitter activity around Uber and Lyft is greatest in liberal states with larger populations, more city centers, and higher proportions of millennials. The "Total (Sum) Number of Favorites from All Tweets" plot also ranges from light green (fewest number of favorites) to dark green (greatest number of favorites). This plot doesn't indicate much, but it does show an interesting phenomenon where certain states (namely New York, Illinois, and California) have Twitter users with large followings. This might mean that Twitter activity in these states has a greater influence than that from other states, but this cannot be definitively concluded with the small sample size.