Emergency Response Organization
Goal: Gain hands-on experience processing data, extracting information, and discovering patterns or knowledge using data mining methods.
Scenario: You are an information professional for an emergency response organization, and you analyze and monitor social media data streams during a disaster event. You want to identify the information and patterns that help you understand the situation as the event unfolds.
Key task: Organize the information stored in several event-related documents to identify patterns that improve understanding of the event. Use clustering, a data mining technique, in the Weka tool to find groups of similar documents, then provide a visual analysis of the discovered patterns using a tool of your choice.
Milestone 1: Data Acquisition and Conversion
Milestone 2: Data Preprocessing
Milestone 3: Data Mining Tool Preparation – WEKA Tool
Milestone 4: Cluster Analysis
Milestone 5: Visualization
Milestone 1: Data Acquisition and Conversion
As the information professional in this scenario, I was provided with a JSON file consisting of objects from social media data streams during a disaster event. I checked that the file was valid JSON using JSON Lint Validator, a browser-based tool.
I used Sublime Text 3 to view the JSON file and format it with indentation. The file contains 3135 JSON objects, each with the attributes LATITUDE, MESSAGE, LONGITUDE, DOCUMENT_ID, and DATETIME.
Next, I wrote a Python script to convert the JSON file to a CSV file, encoding the content as UTF-8.
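The conversion script itself is not reproduced here. A minimal sketch of such a conversion, assuming the file holds one top-level JSON array of flat objects and using a hypothetical function name `json_to_csv`, might look like this:

```python
import csv
import json

# Column names taken from the attributes listed above.
FIELDS = ["DOCUMENT_ID", "DATETIME", "LATITUDE", "LONGITUDE", "MESSAGE"]


def json_to_csv(json_path, csv_path):
    """Convert a JSON array of flat objects to a UTF-8 CSV; return the row count."""
    with open(json_path, encoding="utf-8") as src:
        records = json.load(src)  # assumption: one top-level JSON array

    with open(csv_path, "w", encoding="utf-8", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=FIELDS)
        writer.writeheader()
        for record in records:
            # Attributes missing from an object are written as empty cells.
            writer.writerow({field: record.get(field, "") for field in FIELDS})
    return len(records)
```

Opening the output with `newline=""` lets the `csv` module control line endings, which avoids blank rows on Windows.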
Milestone 2: Data Preprocessing
For data preprocessing, I wrote a Python script whose processTweet() method uses regular expression patterns to substitute @twitterUsername mentions and URLs (starting with http or https).
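The original script is not shown, so the following is only a sketch of that kind of substitution. The placeholder tokens `AT_USER` and `URL` are assumptions, since the write-up does not say what the mentions and links were replaced with, and the function is named `process_tweet` here to follow Python naming conventions.

```python
import re

# Placeholder tokens are assumptions, not taken from the original script.
MENTION_TOKEN = "AT_USER"
URL_TOKEN = "URL"

MENTION_RE = re.compile(r"@\w+")
URL_RE = re.compile(r"https?://\S+")


def process_tweet(text):
    """Replace URLs and @username mentions with placeholder tokens."""
    # URLs go first so an "@" inside a link is not treated as a mention.
    text = URL_RE.sub(URL_TOKEN, text)
    text = MENTION_RE.sub(MENTION_TOKEN, text)
    return text
```

For example, `process_tweet("RT @RedCross: donate at https://t.co/abc now")` yields `"RT AT_USER: donate at URL now"`.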
I then loaded the CSV file into RStudio and split the DATETIME column into two separate columns, Date and Time, using sapply and strsplit.
# Load the cleaned CSV produced in the previous step
ReadCSV = read.csv("C:/Users/sneha/Desktop/AIT/Sem III/AIT 690-HemantPurohit/IndividualProject/FinalizedCleanData.csv", header = TRUE)
# Split DATETIME on the space: the first piece is the date, the second is the time
ReadCSV$Date <- sapply(strsplit(as.character(ReadCSV$DATETIME), " "), "[", 1)
ReadCSV$Time <- sapply(strsplit(as.character(ReadCSV$DATETIME), " "), "[", 2)
# Write the result back to the same file
write.csv(ReadCSV, file = "C:/Users/sneha/Desktop/AIT/Sem III/AIT 690-HemantPurohit/IndividualProject/FinalizedCleanData.csv")
For data cleaning, I removed punctuation marks, digits, the RT marker, usernames, URLs, hashtags, and the # (pound) symbol in RStudio. I then converted the content to lowercase and used as.character to coerce it to character type.
Compiled R-Script screenshot of cleaning up the data
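Since the R script survives only as a screenshot caption, here is a rough Python sketch of equivalent cleaning steps. It assumes that whole hashtags and usernames are dropped (one possible reading of the description), and `clean_message` is a hypothetical name.

```python
import re
import string

# Map every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_message(text):
    """Strip URLs, usernames, hashtags, RT markers, digits, and punctuation."""
    text = str(text)                               # coerce to a string
    text = re.sub(r"https?://\S+", " ", text)      # URLs
    text = re.sub(r"[@#]\w+", " ", text)           # @usernames and #hashtags
    text = re.sub(r"\bRT\b", " ", text)            # retweet marker
    text = re.sub(r"\d+", " ", text)               # digits
    text = text.translate(_PUNCT_TO_SPACE)         # remaining punctuation
    return " ".join(text.lower().split())          # lowercase, collapse spaces
```

URLs are removed before punctuation so that the link is dropped as a whole rather than being split into stray word fragments.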
For text-mining, I installed the “tm” package and loaded the library to utilize its features of creating a corpus and for removing the stop words.
I created an exceptions vector holding the negation words that should be exempt from the default stop-word list, so that they are not removed.
Python script to remove stop words and compute the frequency of each term in the CSV file.
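The script is not included in the write-up. The sketch below shows the general idea of removing stop words while keeping negation words, then counting term frequencies. The stop-word and exception lists are tiny illustrative stand-ins, not the lists actually used.

```python
from collections import Counter

# Illustrative stop-word list only; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "to", "of", "for", "is", "are",
              "and", "in", "not", "no", "nor"}
# Negation words kept despite being stop words (hypothetical selection).
EXCEPTIONS = {"not", "no", "nor"}


def term_frequencies(messages):
    """Count terms across cleaned messages, skipping non-exempt stop words."""
    active_stop_words = STOP_WORDS - EXCEPTIONS
    counts = Counter()
    for message in messages:
        counts.update(t for t in message.split() if t not in active_stop_words)
    return counts
```

Calling `term_frequencies(messages).most_common(10)` then gives the ten most frequent remaining terms.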
Milestone 3: Data Mining Tool Preparation – WEKA Tool
In the Weka Explorer, I loaded the ARFF file converted from the CSV. The loaded ARFF file has two attributes: the terms and their associated frequencies. I then applied the StringToWordVector filter. In the filter's GUI Object Editor, I set both IDF Transform and TF Transform to TRUE.
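To give a feel for what the two transforms do, here is a small Python sketch. It assumes the formulas given in the Weka filter documentation, log(1 + f) for the TF transform and f * log(N / df) for the IDF transform, and assumes that enabling both multiplies them together. It is an illustration, not a reimplementation of the filter, and it ignores the filter's other options.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """Return one {term: weight} dict per document, weight = log(1+f) * log(N/df)."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: math.log(1 + f) * math.log(n_docs / doc_freq[term])
            for term, f in counts.items()
        })
    return vectors
```

A term that occurs in every document gets weight 0 under this weighting, which is why very common words carry little information for clustering.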
Milestone 4: Cluster Analysis
Below is the result of visualizing the cluster assignments. There are 5 clusters, Cluster 0 to Cluster 4, each shown in a distinct color.
Plot Matrix of all the attributes associated.
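The write-up does not name the clustering algorithm that was run in Weka. Purely as an illustration, and assuming a k-means style clusterer (Weka's SimpleKMeans is one common choice), the core loop can be sketched in plain Python as follows. The function names are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=100, seed=0):
    """Basic k-means on a list of numeric vectors; returns (assignments, centroids)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]  # random initial centroids
    assignments = [None] * len(points)

    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments

        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids
```

With document vectors as `points` and `k=5`, `assignments` would play the role of the Cluster 0 to Cluster 4 labels, although the actual Weka run may differ in initialization and distance handling.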
Milestone 5: Data Visualization
Below are screenshots of each cluster CSV file with its associated word cloud. I used an online word cloud generating tool named TAGXEDO.
Cluster-0 has 384 records. The largest and boldest term in its word cloud is “Hurricane”. Other frequent terms include “Affected”, “SandyHelp”, “Donate”, “Victims”, and “Text”.
Cluster-1 has 1057 records. The largest and boldest terms in its word cloud are “Fundraiser”, “Help”, and “Efforts”. Other frequent terms include “Support”, “Victims”, “Volunteer”, and “Donate”.
Cluster-2 has 440 records. The largest and boldest terms in its word cloud are “American”, “Recovery”, “Redcross”, and “Help”. Other frequent terms include “Efforts”, “Support”, “Victims”, “Affected”, and “Million”.
Cluster-3 has 1209 records. The largest and boldest terms in its word cloud are “Donate”, “Victims”, “Affected”, and “Help”. Other frequent terms include “Volunteer”, “Support”, “People”, “Raise”, “Need”, and “Money”.
Cluster-4 has 45 records. The largest and boldest terms in its word cloud are “Way”, “Help”, “Relief”, and “Victims”. Other frequent terms include “Volunteering”, “United”, “Obama”, “Affected”, and “Recovery”.
Top 10 words from the tag cloud of each cluster, with their frequencies.
| # | Cluster 0 | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 |
|---|---|---|---|---|---|
| 1 | Hurricane (353) | Relief (1075) | Redcross (443) | Help (909) | Way (47) |
| 2 | Donate (241) | Effort (377) | Recovery (368) | Victims (596) | Help (33) |
| 3 | Victim (147) | Help (272) | Relief (174) | Affected (168) | Relief (32) |
| 4 | Sandy (140) | Donate (202) | Help (165) | Donation (158) | Recovery (30) |
| 5 | Immediate (75) | Donation (99) | American (103) | Please (156) | Efforts (28) |
| 6 | Text (74) | Fund (91) | Support (100) | Redcross (139) | United (25) |
| 7 | Please (53) | Please (83) | Victim (89) | Text (141) | Affected (23) |
| 8 | Relief (39) | Fundraiser (72) | Donation (67) | Volunteer (139) | Obama (20) |
| 9 | Affected (34) | Support (67) | Million (58) | Shelter (130) | Volunteering (18) |
| 10 | Sandyhelp (31) | Fundraiser (60) | Proceeds (46) | Make (84) | Rebuild (13) |
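The counts above came from the word cloud tool. A comparable per-cluster ranking could be computed directly, as in this sketch, which assumes the messages are already cleaned and that a parallel list of cluster labels is available. The function name is hypothetical, and the numbers it produces need not match the tool's, since the tool may group or filter words differently.

```python
from collections import Counter


def top_terms_per_cluster(messages, clusters, n=10):
    """Return {cluster: [(term, count), ...]} with the n most frequent terms."""
    counters = {}
    for message, cluster in zip(messages, clusters):
        counters.setdefault(cluster, Counter()).update(message.split())
    return {cluster: counter.most_common(n)
            for cluster, counter in counters.items()}
```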
Data: The JSON file consists of JSON objects with attributes such as MESSAGE, LATITUDE, LONGITUDE, and DATETIME.
Information: The information block gives meaning to the data. The JSON file is converted into a CSV file for further processing. The raw data, which contains digits, punctuation marks, special characters, URLs (HTTP/HTTPS), and retweets, is cleaned using regular expressions in RStudio and then stored in CSV format for easy access. In this context, I define information as the transformed, clean data.
Knowledge: The CSV file containing the transformed, clean data is loaded into the WEKA tool for data mining. The “StringToWordVector” filter is applied to convert the string-valued MESSAGE attribute into word vectors. A clustering algorithm is then applied to form 5 clusters, each containing a distinct set of information. The cluster assignments are visualized and the result buffer is saved in ARFF format. The original file with its attributes is merged with the post-clustering ARFF file to associate each message with its cluster. A word cloud is then generated for each cluster.
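The merge step can be sketched as follows. The sketch assumes the cluster labels have already been read from the saved ARFF file in the same row order as the original data, and that they look like `cluster0` to `cluster4`. Both points are assumptions about the export, and `attach_clusters` is a hypothetical helper.

```python
def attach_clusters(rows, cluster_labels):
    """Return copies of the rows with an integer CLUSTER field added."""
    if len(rows) != len(cluster_labels):
        raise ValueError("row count and cluster label count differ")
    merged = []
    for row, label in zip(rows, cluster_labels):
        merged_row = dict(row)  # copy so the input rows stay untouched
        # Assumed label format: "cluster3" -> 3
        merged_row["CLUSTER"] = int(label.replace("cluster", ""))
        merged.append(merged_row)
    return merged
```

Merging by position only works if neither file was reordered or filtered after clustering, which is why the length check is there.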
Understanding: I generated 5 word clouds, one per cluster, and used the top 10 most frequent words in each cluster to categorize the terms that stand out.
Cluster-0 Category: Immediate Donation
Text “Sandyhelp” for immediate donation to the victims of Hurricane Sandy.
Cluster-1 Category: Fund-Raiser Campaign
Support the fund-raiser campaign by donating help for the affected.
Cluster-2 Category: American Red Cross Damage Recovery
The American Red Cross helps repair the damage using donation proceeds.
Cluster-3 Category: Volunteering to Provide Help
Volunteer to help affected victims persevere through the aftermath of Hurricane Sandy by providing shelter.
Cluster-4 Category: Political Impact
Obama’s recovery efforts to help Sandy survivors by rebuilding homes and reopening businesses.
Wisdom: This phase represents the ability to use knowledge and experience to make good decisions. Now that we know the cluster categories, we can assign any new message to its associated cluster.
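One crude way to illustrate assigning a new message is keyword overlap with the top-10 terms of each cluster from the table above. This is only a stand-in for applying the trained clustering model, and `assign_cluster` is a hypothetical helper.

```python
# Top-10 terms per cluster, lowercased from the table above.
TOP_TERMS = {
    0: {"hurricane", "donate", "victim", "sandy", "immediate", "text",
        "please", "relief", "affected", "sandyhelp"},
    1: {"relief", "effort", "help", "donate", "donation", "fund",
        "please", "fundraiser", "support"},
    2: {"redcross", "recovery", "relief", "help", "american", "support",
        "victim", "donation", "million", "proceeds"},
    3: {"help", "victims", "affected", "donation", "please", "redcross",
        "text", "volunteer", "shelter", "make"},
    4: {"way", "help", "relief", "recovery", "efforts", "united",
        "affected", "obama", "volunteering", "rebuild"},
}


def assign_cluster(message):
    """Return the cluster sharing the most top terms with the message, or None."""
    tokens = set(message.lower().split())
    scores = {c: len(tokens & terms) for c, terms in TOP_TERMS.items()}
    best = max(scores, key=scores.get)  # ties go to the lowest cluster number
    return best if scores[best] > 0 else None
```

Because many top terms such as "help" and "relief" appear in several clusters, this approach is ambiguous for short messages. Distance to the cluster centroids in the word-vector space would be the more faithful approach.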