Stress Analysis on Twitter
Background
The proposal was to use Twitter analysis as an alternative source of data to document stress throughout the United States over time, using the Twitter Streaming API via the Tweepy library.
- Analyzed Tweets
- Mined Additional Data
- Developed Algorithms
- Generated Visualizations
- Determined Results
- Identified Challenges
The project objective was to perform exploratory analysis on the stress-related tweets, followed by sentiment analysis and classification of the tweets by stress type.
Goals
- Continuous Tweet Mining
- Determining Dependent Variables
- Classifying Tweets
Stress Types
Data Flow
Use Case
System Architecture
This system has three tiers. The first tier is the user interface, where the user interacts with the application; the application provides different visualizations depending on the type of interaction the user generates. Next is the logic tier, where data munging and data analysis are performed: aggregating the data, applying analytic models to it, and parsing it. Lastly, the data tier stores the Twitter data obtained from streaming through Python and R extensions. The logic tier and the data tier interact heavily with one another because data is constantly pulled out of the database for analysis.
Dataset
Approach
The approach was to obtain the dataset, then analyze the tweets and their geo-locations. This is where the challenge of incomplete data came into play, which resulted in mining additional data. The dataset was then passed through the sentiment analysis algorithm and the classification algorithms.
Data Mining
Obtained tweets for the different stress types by using their respective symptoms as search keywords, as sketched below.
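A minimal sketch of this keyword-based streaming, assuming the Tweepy 3.x streaming API; the symptom keyword lists and credential strings are placeholders rather than the project's actual values:

```python
import json
import tweepy

# Illustrative symptom keywords per stress type; the real lists used in the
# project are not reproduced here.
STRESS_KEYWORDS = {
    "acute":    ["tension headache", "racing heart", "panic"],
    "episodic": ["constant worry", "always rushed", "irritable"],
    "chronic":  ["burnout", "exhausted every day", "hopeless"],
}

class StressStreamListener(tweepy.StreamListener):
    """Writes each incoming stress-related tweet to a JSON-lines file."""

    def __init__(self, out_path="stress_tweets.jsonl", **kwargs):
        super().__init__(**kwargs)
        self.out = open(out_path, "a")

    def on_status(self, status):
        record = {
            "id": status.id_str,
            "text": status.text,
            "created_at": str(status.created_at),
            "coordinates": status.coordinates,  # often None, hence incomplete data
        }
        self.out.write(json.dumps(record) + "\n")

    def on_error(self, status_code):
        # Disconnect on rate limiting (HTTP 420) instead of retrying aggressively.
        if status_code == 420:
            return False

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")

listener = StressStreamListener()
stream = tweepy.Stream(auth=auth, listener=listener)

# Track all symptom keywords across the three stress types.
track = [kw for kws in STRESS_KEYWORDS.values() for kw in kws]
stream.filter(track=track, languages=["en"])
```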
Sentiment Analysis
Using the ‘text’ field from the acquired tweets, we passed the text into the sentiment analysis algorithm to determine each tweet's sentiment (polarity: positive, negative, or neutral) and only used the negative-sentiment tweets to determine stress. Sentiment analysis was performed on each individual dataset. In this step we first began processing the tweet text and removing duplicate tweets. From this subset, we then passed the individual tweets through our normalization process, where emojis, punctuation, extra whitespace, numbers, symbols, and URLs are removed from the text. These clean tweets are then passed through the sentiment analysis algorithm, where the NLTK package's sentiment analyzer is utilized. Each document is represented as a label/sentence pair; the sentence is then tokenized, so it is represented by a list of strings. We split the instances separately to keep a balanced, uniform class distribution in both the training and testing sets.
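A minimal sketch of that normalization step, assuming simple regex-based cleaning (the exact patterns the project used are not documented here):

```python
import re
import string

URL_RE       = re.compile(r"https?://\S+|www\.\S+")
NUMBER_RE    = re.compile(r"\d+")
# Stripping non-ASCII characters acts as a crude emoji/symbol filter.
NON_ASCII_RE = re.compile(r"[^\x00-\x7F]+")

def normalize_tweet(text):
    """Remove URLs, emojis, numbers, punctuation, and extra whitespace."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = NON_ASCII_RE.sub(" ", text)
    text = NUMBER_RE.sub(" ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def deduplicate(tweets):
    """Drop exact duplicate texts (e.g. retweets of the same content)."""
    seen, unique = set(), []
    for t in tweets:
        cleaned = normalize_tweet(t)
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            unique.append(cleaned)
    return unique
```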
The lexicon used in this process comes from the VADER package. VADER performs the sentiment analysis and returns the polarity for each sentence, or in our case each tweet, passed into the sentiment analyzer. The polarity scores returned are positive, negative, neutral, and compound. The only two polarities we are really interested in are the positive and negative scores. For our case, we wanted to capture the tweets where the negative polarity is greater than the positive polarity. We used this logic because, to us, stress is a negative behavior and has a negative impact on an individual's well-being. The image below is an example of the input tweets and the output polarity belonging to each tweet.
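Alongside that example, here is a minimal sketch of the polarity filtering, using NLTK's bundled VADER analyzer; the sample tweets are illustrative, and the keep-if-negative-exceeds-positive rule follows the description above:

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")          # one-time download of the VADER lexicon
sia = SentimentIntensityAnalyzer()

def is_stress_candidate(text):
    """Keep only tweets whose negative polarity outweighs the positive polarity."""
    scores = sia.polarity_scores(text)  # {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}
    return scores["neg"] > scores["pos"]

tweets = [
    "deadlines everywhere, my head is pounding and i cant sleep",  # illustrative
    "great workout today, feeling relaxed and happy",              # illustrative
]
negative_tweets = [t for t in tweets if is_stress_candidate(t)]
```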
Building All Words Corpus
The all-words corpus consists of all the words that exist across the three collections of different stress types. This corpus is later compared with the sentiment-polarity dataset to create our training and testing sets, and it is used to pull the useful words, also known as the feature set, out of a tweet for more accurate analysis.
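A minimal sketch of building the all-words corpus and turning it into a feature set, following the standard NLTK frequency-distribution pattern (the 3,000-word vocabulary cutoff is an assumption, not a documented project parameter):

```python
import nltk

def build_all_words_corpus(collections):
    """collections: dict mapping stress type -> list of normalized tweet strings."""
    all_words = []
    for tweets in collections.values():
        for tweet in tweets:
            all_words.extend(tweet.split())
    freq = nltk.FreqDist(all_words)
    # Keep the most common words as the feature vocabulary (cutoff is assumed).
    return [word for word, _ in freq.most_common(3000)]

def find_features(tweet_tokens, word_features):
    """Represent a tweet as word -> presence flags, as expected by NLTK classifiers."""
    tokens = set(tweet_tokens)
    return {word: (word in tokens) for word in word_features}
```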
Building Training/Testing Data
Once we had short-listed the tweets by polarity, we began building our training and testing datasets. The training set is built from the collection of tweets whose negative score was higher than their positive score. From these tweets, we can classify them into the different stress types by referring back to the keywords that exist for each stress type. If keywords for a particular stress type occur most frequently in a tweet, then we classify the tweet into that stress type and use that stress type as the tag in our training and testing sets.
The result of this process is the set of tweets that we filtered through the sentiment analysis, together with the tag assigned in the process above. We split the data into training and testing sets at an 80/20 ratio for more accurate prediction. Below is a sample of the training and testing set. The first element inside the parentheses is the tweet after being normalized and split into a list of words, a method also known as tokenizing. The second element is the tag indicating which stress type the tweet was assigned.
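A minimal sketch of the (tokenized tweet, stress-type tag) format and the 80/20 split, under the assumption that a plain random shuffle and slice were used:

```python
import random

def build_labeled_documents(tagged_tweets):
    """tagged_tweets: list of (normalized_text, stress_type) pairs."""
    return [(text.split(), stress_type) for text, stress_type in tagged_tweets]

def train_test_split_80_20(documents, seed=42):
    """Shuffle, then take the first 80% for training and the rest for testing."""
    random.Random(seed).shuffle(documents)
    cut = int(len(documents) * 0.8)
    return documents[:cut], documents[cut:]

# Example of a single labeled document:
# (['deadline', 'headache', 'cant', 'sleep'], 'acute')
```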
Building Testing/Training Sets
Iterated through all the tweets' “text” fields and checked whether each tweet contains keywords of the stress-type symptoms, then tagged the tweet with a stress type according to the occurrence of those symptoms in the tweet.
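A minimal sketch of that tagging rule, reusing the illustrative STRESS_KEYWORDS mapping from the data-mining sketch; a tweet is tagged with the stress type whose symptom keywords appear most often:

```python
def tag_stress_type(text, stress_keywords):
    """Return the stress type whose symptom keywords occur most often in the tweet,
    or None if no symptom keyword is present."""
    counts = {
        stress_type: sum(text.count(kw) for kw in keywords)
        for stress_type, keywords in stress_keywords.items()
    }
    best_type, best_count = max(counts.items(), key=lambda item: item[1])
    return best_type if best_count > 0 else None
```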
Classification Algorithms
The documents were randomized to ensure that all records in the training/testing sets were shuffled. Naive Bayes, Logistic Regression, Stochastic Gradient Descent, and Linear Support Vector were the classification algorithms used. Each model provided a prediction of the classification among the three stress types.
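A minimal sketch of training and comparing the four classifiers on NLTK-style feature sets, using scikit-learn models wrapped in NLTK's SklearnClassifier (default hyperparameters, not the project's settings):

```python
import nltk
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import LinearSVC

def evaluate_classifiers(training_set, testing_set):
    """training_set / testing_set: lists of (feature_dict, stress_type) pairs."""
    models = {
        "Naive Bayes": MultinomialNB(),
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Stochastic Gradient Descent": SGDClassifier(),
        "Linear Support Vector": LinearSVC(),
    }
    results = {}
    for name, model in models.items():
        classifier = SklearnClassifier(model).train(training_set)
        results[name] = nltk.classify.accuracy(classifier, testing_set)
    return results
```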
Python Packages: NLTK, scikit-learn, NumPy, pandas, and Tweepy
Visualization: Tableau
Results
We streamed 17,380 tweets, of which 9,386 were classified as Acute Stress, 7,857 as Episodic Stress, and 137 as Chronic Stress.
Challenges
- Cleaning Noisy Tweets
- Incomplete Dataset
- Language Barrier
- Determining the Geo-Coordinates
- Determining the Stress-related Keywords