The Spaceship Titanic Kaggle Competition

Photo: NASA’s James Webb Space Telescope

Doing the original Titanic competition was quite fun, so I decided to do the spaceship version as well. I worked on this competition entirely in Python; the Jupyter notebook can be found here. The aim is to predict which passengers get transported from the spaceship.

The Data

Alongside the passenger ID and whether or not that person got transported, the dataset includes information on the passenger's home planet, whether or not they spent the journey in cryo sleep, their cabin number, destination, age, VIP status, and how much money they spent on room service, the food court, the shopping mall, the spa, and the VR deck. I excluded the passengers' names, simplified the cabin number to indicate only the port or starboard side, and shortened the destination names to their initial letters. I dropped rows with missing values in home planet, age, or destination, and replaced NaNs in the other columns with 0 or false.
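In pandas, this cleaning might look something like the following (a sketch: the column names match the Kaggle dataset, but the exact steps are my reconstruction, not necessarily the notebook's):

```python
import pandas as pd

# Load the training data; column names follow the Kaggle dataset.
df = pd.read_csv("train.csv")

# Drop the passenger names entirely.
df = df.drop(columns=["Name"])

# Cabin has the form "deck/num/side"; keep only the side (P or S).
df["Cabin"] = df["Cabin"].str.split("/").str[2]

# Shorten the destination names to their initial letters (T, C, P).
df["Destination"] = df["Destination"].str[0]

# Drop rows with missing home planet, age, or destination...
df = df.dropna(subset=["HomePlanet", "Age", "Destination"])

# ...and fill the remaining NaNs: False for booleans, 0 for spending.
df[["CryoSleep", "VIP"]] = df[["CryoSleep", "VIP"]].fillna(False)
spend_cols = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]
df[spend_cols] = df[spend_cols].fillna(0)
```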

During the initial data exploration, cryo sleep emerged as the most important factor: 66% of those in cryo sleep were transported, compared to only 18% of those who weren't.
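A grouped mean is enough to surface this kind of split; something along these lines (continuing from the cleaned df above), with the same pattern covering the home planet, destination, VIP, and cabin comparisons that follow:

```python
# Share of transported passengers, split by cryo sleep status.
print(df.groupby("CryoSleep")["Transported"].mean())

# The same split works for the other categorical features.
for col in ["HomePlanet", "Destination", "VIP", "Cabin"]:
    print(df.groupby(col, dropna=False)["Transported"].mean(), "\n")
```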

The home planet had a slight effect: passengers from Earth were more likely to not be transported, passengers from Europa were more likely to be transported, and passengers from Mars had a 50/50 chance.

The destination also had a slight effect: passengers on their way to C were more likely to be transported, those on their way to T were more likely not to be transported, and the few passengers to P had a 50/50 chance.

There were very few passengers with VIP status, and this did not considerably affect their transport odds.

Passengers with a port cabin were less likely to be transported, passengers in a starboard cabin more likely, and those few without a cabin had 50/50 odds.

The correlation matrix highlights this as well – transported correlates most strongly with cryo sleep, followed by the spending on room service, the spa, and the VR deck. Destination and home planet each have a weak correlation, while VIP, cabin and age do not seem relevant.
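To get everything into a single correlation matrix, the categorical columns need a numeric encoding first. One way to do it (my sketch; the notebook may have encoded things differently):

```python
# Label-encode the categoricals and cast booleans to ints so that
# every column can appear in one correlation matrix.
encoded = df.drop(columns=["PassengerId"])
for col in ["HomePlanet", "Destination", "Cabin"]:
    encoded[col] = encoded[col].factorize()[0]
for col in ["CryoSleep", "VIP", "Transported"]:
    encoded[col] = encoded[col].astype(int)

# Correlations with the target, strongest first.
print(encoded.corr()["Transported"].sort_values(key=abs, ascending=False))
```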

Cryo sleep is itself negatively correlated with spending on luxuries such as the spa, because people who are asleep cannot buy anything. Destination doesn't correlate with anything else, and home planet correlates only weakly with spending, age, and VIP status. So these two features should carry signal of their own.

Looking only at passengers who weren't in cryo sleep, transported still correlates with the individual spending items, though to a lesser extent than before. However, transported doesn't correlate with the total spending amount, home planet, or destination. It does, however, correlate with any spending, i.e. whether or not a person spent anything at all, and it correlates weakly with age and cabin.

There is no correlation between the amount spent and age, but there is a strong correlation between any spending and age (since children don't have money to spend).

The individual luxury items were only weakly correlated with each other, so spending money on one does not mean spending money on all.

So far, we know that cryo sleep is the main factor in being transported. For awake passengers, having spent no money increases the odds of transport, as does being in a starboard cabin.

The Model

I included the features Cryo Sleep, Cabin, and Any_Spending and tried a logistic regression, k-nearest neighbour, and decision tree model. The logistic regression had the highest score (74%) and lowest mean absolute error (0.28), followed by k-nearest neighbour (65% score, error 0.34).
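A sketch of this comparison (the train/validation split, the random seed, and the encoding details are my assumptions; the derived Any_Spending feature is built here as well):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

spend_cols = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]

# Derived feature: did the passenger spend anything at all?
df["Any_Spending"] = df[spend_cols].sum(axis=1) > 0

# One numeric feature frame to select subsets from later.
features = df[["CryoSleep", "Any_Spending", "Age"] + spend_cols].astype(float)
for col in ["Cabin", "Destination"]:
    features[col] = df[col].factorize()[0]
y = df["Transported"].astype(int)

X_train, X_val, y_train, y_val = train_test_split(features, y, random_state=0)

cols = ["CryoSleep", "Cabin", "Any_Spending"]
for model in (LogisticRegression(), KNeighborsClassifier(), DecisionTreeClassifier()):
    model.fit(X_train[cols], y_train)
    acc = model.score(X_val[cols], y_val)  # accuracy ("score")
    mae = mean_absolute_error(y_val, model.predict(X_val[cols]))
    print(f"{type(model).__name__}: score={acc:.2f}, error={mae:.2f}")
```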

Next, I tried a logistic regression with some additional features (Destination and Age), which improved the error very slightly (0.26) but also lowered the score slightly (73%). Using fewer features (only Cryo and Any_Spending) resulted in the same score (73%) but a lower error (0.25). Including only Cryo and the individual spending features (RoomService etc.) resulted in the highest score (77%) and the lowest error (0.23). Finally, including all features with a correlation above 0.1 (see the first correlation matrix above) resulted in a lower score (75%) and a slightly higher error (0.24).
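These subset experiments can reuse the same split; a hypothetical loop (the exact subsets tried and the max_iter setting are mine):

```python
# Re-run the logistic regression over a few of the feature subsets above.
feature_sets = {
    "base + Destination + Age": ["CryoSleep", "Cabin", "Any_Spending", "Destination", "Age"],
    "Cryo + Any_Spending": ["CryoSleep", "Any_Spending"],
    "Cryo + spending items": ["CryoSleep"] + spend_cols,
}

for name, cols in feature_sets.items():
    model = LogisticRegression(max_iter=1000).fit(X_train[cols], y_train)
    acc = model.score(X_val[cols], y_val)
    mae = mean_absolute_error(y_val, model.predict(X_val[cols]))
    print(f"{name}: score={acc:.2f}, error={mae:.2f}")
```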

I then used GridSearchCV to cross-validate the model version that had cryo sleep and the individual spending options as features (the one that scored the highest). With tuned hyperparameters, it reached an accuracy of 77.73% (a slight improvement from the previous version at 77.69%).
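In scikit-learn this is GridSearchCV; the parameter grid below is a guess, since the post doesn't list which hyperparameters were tuned:

```python
from sklearn.model_selection import GridSearchCV

best_cols = ["CryoSleep"] + spend_cols

# Hypothetical grid; the actual tuned hyperparameters aren't stated above.
param_grid = {"C": [0.01, 0.1, 1, 10, 100], "solver": ["lbfgs", "liblinear"]}
search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, cv=5, scoring="accuracy"
)
search.fit(features[best_cols], y)
print(search.best_params_, f"CV accuracy: {search.best_score_:.4f}")
```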

Finally, I ran the tuned model on the test data set and submitted my predictions to the competition with an accuracy score of 77%.
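For completeness, producing the submission file looks roughly like this (my sketch; note that the test set keeps all its rows, since Kaggle needs a prediction for every passenger):

```python
# Apply the same fills to the test set; no rows are dropped here,
# because the submission needs a prediction for every passenger.
test = pd.read_csv("test.csv")
test["CryoSleep"] = test["CryoSleep"].fillna(False)
test[spend_cols] = test[spend_cols].fillna(0)

X_test = test[["CryoSleep"] + spend_cols].astype(float)
preds = search.best_estimator_.predict(X_test).astype(bool)

# Kaggle expects a PassengerId column plus a boolean Transported column.
pd.DataFrame(
    {"PassengerId": test["PassengerId"], "Transported": preds}
).to_csv("submission.csv", index=False)
```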
