Analysis of Airlines Passengers

Goal: Gain hands-on experience extracting and using metadata in the data mining process.

Scenario: You are a data scientist for airline A, analyzing a customer database. You want to identify the factors that help explain why some customers fly with your airline and why others cancel. Your data science team wants to recommend these factors to the advertising team, for example as demographic-specific packages to attract more customers.

Key Task: Extract descriptive metadata and create corresponding features for a classification model: given a customer, will they fly or not?

P.S. Below is the report, categorized into different milestones with their respective screenshots.

Milestone 1: Data Acquisition and Conversion
Milestone 2: Metadata Extraction and Imputation
Milestone 3: Metadata Exploration
Milestone 4: Attribute Preparation and Engineering – Unsupervised Learning Algorithm
Milestone 5: Prediction Modeling and Visualization

Milestone 1: Data Acquisition and Conversion

I downloaded the JSON dataset programmatically, then validated and reformatted it using JSONLint.

          

I converted the JSON file into CSV for easier data manipulation and interpretation.

I ran the Python script (JSONtoCSV.py) from the command prompt.

Output Fields

CUSTOMER ID: Unique numeric string
DESCRIPTION: String
SEAT CLASS: Categorical {1, 2, 3}
GUESTS: Numeric
FARE: Numeric
SUCCESS: Binary categorical {0, 1}
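
The conversion step can be sketched roughly as follows. This is not the original JSONtoCSV.py, only a minimal illustration that assumes the JSON file is an array of flat records whose keys match the output field names; the sample record is hypothetical.

```python
import csv
import io
import json

# Column order follows the output fields listed in this report.
# Assumption: the JSON keys carry exactly these names.
FIELDS = ["CUSTOMER ID", "DESCRIPTION", "SEAT CLASS", "GUESTS", "FARE", "SUCCESS"]


def json_to_csv(json_text: str) -> str:
    """Convert a JSON array of flat records into CSV text."""
    records = json.loads(json_text)
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=FIELDS, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for record in records:
        # Missing keys become empty cells instead of raising an error.
        writer.writerow({name: record.get(name, "") for name in FIELDS})
    return buffer.getvalue()


# Hypothetical sample record, not taken from the real dataset.
sample = (
    '[{"CUSTOMER ID": "1", "DESCRIPTION": "Doe, Mr. John;30", '
    '"SEAT CLASS": 3, "GUESTS": 0, "FARE": 7.9, "SUCCESS": 0}]'
)
print(json_to_csv(sample))
```

The csv module quotes the DESCRIPTION value automatically because it contains a comma, which keeps the column count stable for later splitting.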

Milestone 2: Metadata Extraction and Imputation

The metadata types I observed in the DESCRIPTION data field are First Name, Last Name, and Age. The descriptive metadata that can be derived from it are the First Name, Last Name, Age, Gender (from the Suffix), and name-based ethnicities with their probabilities (e.g. South Slavs, Romania, Italy, Baltics, French, Russian, German).

I installed the “splitstackshape” package and loaded the “splitstackshape” library.


I used cSplit() to split the DESCRIPTION data field into separate columns on the comma (,) separator.

Below is the output of the file after writing to the CSV file.

I then split the DESCRIPTION data field into further columns on the semicolon (;), comma (,) and period (.) separators using cSplit().
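
For readers who do not use R, the same kind of split can be sketched in Python. The layout of the DESCRIPTION string used here ("Last, Suffix. First;Age") is an assumption for illustration, as is the sample value; the real field may be laid out differently.

```python
import re


def split_description(description: str) -> list[str]:
    """Split a DESCRIPTION value on ';', ',' and '.' and trim each part."""
    parts = re.split(r"[;,.]", description)
    # Drop empty fragments produced by adjacent separators.
    return [part.strip() for part in parts if part.strip()]


# Hypothetical value in the assumed "Last, Suffix. First;Age" layout.
print(split_description("Doe, Mr. John;30"))
```

Splitting on the period also breaks decimal ages such as 30.5 into two parts, so a real pipeline would likely separate the age on the semicolon first.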

I extracted metadata both with logic in the code and with an external API.
External API used: NAME-PRISM API (http://www.name-prism.com/)
This free API returns name-based ethnicities with their respective probabilities. The additional metadata was extracted from the NAME2 data field, which holds the first name: for each person’s first name, the API returns 39 distinct ethnicities, each with a probability value.

Output .txt file:

JSON to CSV Conversion

Imputing Data Values

To impute the missing values, I computed the mean of the AGE column, which had NA values, and wrote the result to a CSV file.
I then merged the “JustTheAge.csv” file with the original CSV file.
The Merged File: The CSV file below contains the mean-imputed values of the AGE data field and the 39 distinct ethnicities retrieved from the NAME-PRISM API. I also derived the GENDER of each person listed by installing the GENDER and GENDERDATA packages in R. I took the levels of the SUFFIX data field and replaced each one with the corresponding gender label, male, female or unisex (could be either male or female), using its index in the levels.
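
Mean imputation itself is simple enough to show in a few lines. The sketch below is illustrative only: the ages are made up, and missing values are represented as None rather than R's NA.

```python
from statistics import mean


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical ages; None marks a missing value.
ages = [22.0, None, 38.0, None, 30.0]
print(impute_mean(ages))
```

Because every missing age receives the same value, group averages computed afterwards are pulled towards the overall mean.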

Milestone 3: Metadata Exploration – Visualizations of Distributions

As there were 39 distinct ethnicities retrieved, I chose 7 European ethnicities [South Slavs, Romania, Italy, Baltics, French, Russian, German]. First, I visualized the count of all the data fields [SUCCESS, FARE, AGE, GENDER, GUESTS, SEATCLASS]. I then compared the SUCCESS data field with the remaining columns: GENDER, AGE, FARE, GUESTS and SEAT CLASS. All the visualizations include the ethnicity (probability) values.
Tool Used: Tableau

SUCCESS by AGE & GENDER

The visualization above is divided into 2 categories: passengers who “Did not Fly” and passengers who “Will Fly”. Below is the breakdown of the success count by the age and gender data fields.

Female – Did not Fly
Average Age: 29.70
Count of additional metadata: 17
Male – Did not Fly
Average Age: 29.70
Count of additional metadata: 108
Unisex – Did not Fly
Average Age: 29.70
Count of additional metadata: 1
Female – Will Fly
Average Age: 29.70
Count of additional metadata: 36
Male – Will Fly
Average Age: 29.70
Count of additional metadata: 16
Unisex – Will Fly
Average Age: 32
Count of additional metadata: 1

 

SUCCESS by GENDER & FARE

The count of males who did not fly at the fare amount $7.9 is 36, and the count who will fly at $26.6 is 7. The count of females who did not fly at $31.3 is 5, and the count who will fly at $13.0 is 12. The count of the unisex population who did not fly at $26.6 is 2, and the count who will fly at $30.5 is 2. From these observations, potential customers are likely to fly with a ticket at an average fare of about $34.01 (unisex), $32.27 (female) and $32.39 (male).

SUCCESS by GENDER & GUESTS

The box-plot visualization above shows the success count of potential passengers by number of guests and gender.

Female, will not fly: 37 (4.15%) with 0 guests, 26 (2.91%) with 1 guest, 3 (0.33%) with 2 guests, 7 (0.7%) with 3 guests, 1 (0.11%) with 5 guests.
Female, will fly: 136 (15.2%) with 0 guests, 80 (8.97%) with 1 guest, 10 (1.12%) with 2 guests, 4 (0.44%) with 3 guests, 2 (0.22%) with 4 guests.
Male, will not fly: 354 (39.7%) with 0 guests, 70 (7.78%) with 1 guest, 11 (1.23%) with 2 guests, 5 (0.56%) with 3 guests, 4 (0.44%) with 5 guests.
Male, will fly: 70 (7.85%) with 0 guests, 32 (3.59%) with 1 guest, 2 (0.22%) with 2 guests, 1 (0.11%) with 4 guests.
Unisex, will not fly: 6 (0.67%) with 0 guests, 1 (0.11%) with 1 guest.
Unisex, will fly: 4 (0.44%) with 0 guests, 1 (0.11%) with 2 guests.

In the box plot above, there is an outlier in the male category: a count of 354 customers travelling alone (without any guests).

Box plot Details

                 Female                    Male                      Unisex
                 Did not Fly   Will Fly    Did not Fly   Will Fly    Did not Fly   Will Fly
Upper Whisker    26            136         70            70          6             4
Upper Hinge      16.5          80          40.5          51          3.5           4
Median           4             10          11            17          1             2.5
Lower Hinge      3             4           4.5           1.5         1             1
Lower Whisker    1             2           4             1           1             1
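
The box plot details can be approximated from the guest counts quoted in the text. The sketch below uses Tukey-style hinges (the median of each half, with the overall median included in both halves when the number of values is odd) and whiskers at the most extreme values within 1.5 times the hinge spread. Whether Tableau uses exactly this method is an assumption; it reproduces the two "Will Fly" columns for Female and Male, while other columns may differ because the text may not list every count.

```python
from statistics import median


def five_number_summary(values):
    """Tukey-style box plot summary with 1.5 * spread whisker fences."""
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2  # the median joins both halves when n is odd
    lower_hinge = median(data[:half])
    upper_hinge = median(data[n - half:])
    spread = upper_hinge - lower_hinge
    low_fence = lower_hinge - 1.5 * spread
    high_fence = upper_hinge + 1.5 * spread
    # Values outside the fences are treated as outliers.
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "upper_whisker": max(inside),
        "upper_hinge": upper_hinge,
        "median": median(data),
        "lower_hinge": lower_hinge,
        "lower_whisker": min(inside),
    }


# Guest-count tallies quoted in the text above.
female_will_fly = [136, 80, 10, 4, 2]
male_will_fly = [70, 32, 2, 1]
male_did_not_fly = [354, 70, 11, 5, 4]

print(five_number_summary(female_will_fly))
print(five_number_summary(male_will_fly))
print(five_number_summary(male_did_not_fly)["upper_whisker"])
```

For the male "Did not Fly" counts, 354 falls outside the upper fence, so the upper whisker stops at 70, which is consistent with the outlier noted for that group.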

SEATCLASS by GENDER and GUESTS

The plot above shows that, in the male population, the largest group, 346 (38.82% of the total), would prefer travelling in third class (seat class 3), followed by 112 in first class (seat class 1) and 106 in second class (seat class 2). In the female population, the largest group, 144 (16.16% of the total), would prefer third class, followed by 93 in first class and 76 in second class. In the unisex population, the largest group, 11 (1.23% of the total), would prefer first class, and 2 would prefer second class.

SUCCESS by SEATCLASS & GENDER

In the female category, the largest counts are in seat class 3: 72 did not fly and 72 will fly. In the male population, 299 did not fly in seat class 3 and 47 will fly in seat class 3. In the unisex category, 6 will not fly in seat class 1 and 5 will fly in seat class 1.

Milestone 4: Attribute Preparation and Engineering – Unsupervised Learning Algorithm

I used the write.arff() method to convert the CSV file into an ARFF file. I converted all the values written in scientific notation in the additional metadata columns, i.e. the ethnicity columns, to NUMERIC and wrote them into an ARFF file (separately for each ethnicity; there are 39 ethnicities in total). As the data contained a mix of numeric and nominal attributes, I used the NumericToNominal filter to convert SEAT CLASS, SUCCESS, and GENDER (numeric values) into nominal values for classification.
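
To make the ARFF step concrete, here is a small Python sketch that writes an ARFF header and data section by hand. It is not the R write.arff() call used in this report. The relation name, attribute names and rows are hypothetical, and declaring an attribute as nominal directly in the header is shown only as an approximation of what the NumericToNominal filter produces.

```python
def plain_decimal(text: str) -> str:
    """Rewrite a number given in scientific notation as a fixed-point string."""
    return format(float(text), ".10f")


def to_arff(relation, attributes, rows):
    """Build ARFF text.

    attributes is a list of (name, kind) pairs, where kind is either the
    string "numeric" or a list of nominal labels.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if kind == "numeric":
            lines.append(f"@attribute {name} numeric")
        else:
            labels = ",".join(str(label) for label in kind)
            lines.append(f"@attribute {name} {{{labels}}}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# Hypothetical attributes and rows for illustration only.
attributes = [
    ("SEATCLASS", [1, 2, 3]),   # nominal
    ("FARE", "numeric"),
    ("ITALY", "numeric"),       # one ethnicity probability column
    ("SUCCESS", [0, 1]),        # nominal class attribute
]
rows = [
    [3, 7.9, plain_decimal("1.2e-05"), 0],
    [1, 26.6, plain_decimal("4.5e-03"), 1],
]
print(to_arff("passengers", attributes, rows))
```

Attribute names containing spaces would need quoting in ARFF, which is why the sketch uses SEATCLASS without a space.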

For feature selection (under the Select Attributes tab), I used the InfoGainAttributeEval attribute evaluator with the Ranker search method. InfoGainAttributeEval measures how much each feature contributes to decreasing the overall entropy, and Ranker evaluates each attribute individually and lists the results in rank order.
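
The idea behind information gain ranking can be sketched in plain Python. This is a toy illustration of the concept, not Weka's implementation; the feature names and values below are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(feature, labels):
    """Reduction in class entropy after splitting on a nominal feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [label for f, label in zip(feature, labels) if f == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


# Hypothetical toy data: 1 = will fly, 0 = did not fly.
success = [1, 1, 0, 0]
features = {
    "perfect_split": ["a", "a", "b", "b"],  # separates the classes fully
    "no_signal": ["a", "b", "a", "b"],      # tells us nothing about the class
}

# Rank attributes by information gain, highest first.
ranking = sorted(
    features, key=lambda name: information_gain(features[name], success), reverse=True
)
for name in ranking:
    print(name, round(information_gain(features[name], success), 4))
```

An attribute that separates the classes completely recovers the full class entropy as gain, while an attribute unrelated to the class scores zero.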

The stratified cross validation paints a more realistic picture.

  • The Correctly Classified Instances show that 79.1246% of test instances were correctly classified, often considered as the sample accuracy.
  • The Incorrectly Classified Instances show that 20.8754% of test instances were incorrectly classified (the error rate).
  • The Kappa Statistic value is 0.5533 (which is greater than 0), hence the classifier is doing better than chance.
  • ROC Area = 0.741. A value of 1 indicates an ‘optimal’ classifier and 0.5 indicates random guessing, so the model separates the two classes reasonably well.

The (2 x 2) confusion matrix above contains raw counts.

a = Did not Fly and b = Will Fly

TN (True Negatives) = 467;
The number of ‘did not fly’ instances
correctly classified as ‘did not fly’.
FP (False Positives) = 82;
The number of ‘did not fly’ instances
incorrectly classified as ‘will fly’.
FN (False Negatives) = 104;
The number of ‘will fly’ instances
incorrectly classified as ‘did not fly’.
TP (True Positives) = 238;
The number of ‘will fly’ instances
correctly classified as ‘will fly’.
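
The reported accuracy and Kappa statistic can be checked directly from these four counts. The short sketch below treats ‘Will Fly’ as the positive class; precision, recall and false positive rate are added for reference and are derived from the same counts.

```python
# Confusion matrix counts from this report ('Will Fly' is the positive class).
tn, fp, fn, tp = 467, 82, 104, 238
total = tn + fp + fn + tp

accuracy = (tp + tn) / total

# Agreement expected by chance: for each class, multiply the actual total
# by the predicted total, then divide the sum by total squared.
p_chance = ((tn + fp) * (tn + fn) + (fn + tp) * (fp + tp)) / total**2
kappa = (accuracy - p_chance) / (1 - p_chance)

precision = tp / (tp + fp)             # share of 'will fly' predictions that are right
recall = tp / (tp + fn)                # true positive rate (sensitivity)
false_positive_rate = fp / (fp + tn)   # 1 - specificity

print(f"instances: {total}")
print(f"accuracy:  {accuracy:.4%}")
print(f"kappa:     {kappa:.4f}")
print(f"precision: {precision:.4f}")
print(f"recall:    {recall:.4f}")
print(f"FPR:       {false_positive_rate:.4f}")
```

The accuracy and Kappa values printed here agree with the 79.1246% and 0.5533 reported in the cross-validation summary.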

 

ROC Curve

The coordinates of the points on the ROC curve are the False Positive Rate (1 − Specificity) and the True Positive Rate (Sensitivity) for each unique threshold value. The ROC curve leans towards the upper-left corner and the area under the ROC curve is 0.7408 (AUC = 74.08%), which indicates a reasonably good classifier: an AUC of 1 corresponds to a curve that stays at the 100% level across the top, while 0.5 is no better than chance.
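
How ROC points and the area under the curve are obtained from classifier scores can be sketched as follows. The labels and scores are hypothetical toy values, not output from the model in this report, and the area is computed with the trapezoidal rule.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by lowering the decision threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last instance of a tied score.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve using the trapezoidal rule."""
    return sum(
        (x2 - x1) * (y1 + y2) / 2
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )


# Hypothetical toy data: 1 = will fly, 0 = did not fly.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
print(points)
print(round(auc(points), 4))
```

The resulting area equals the fraction of positive/negative pairs in which the positive instance receives the higher score, which is a common way to interpret AUC.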

 

Precision-Recall Curve (Alternative to ROC Curve)
When comparing the ROC curve and the Precision-Recall curve, the True Positive Rate axis is common to both, since recall is the True Positive Rate. The Precision-Recall curve leans towards the upper-right corner, which is where the curve of a better classifier lies.
