Exploring the Titanic Dataset with R: A Beginner’s Guide to EDA

The Titanic dataset contains information about the passengers who were aboard the ill-fated Titanic, including their demographics, ticket information, cabin information, and survival status.

This dataset is often used for exploring various data analysis techniques and machine learning algorithms. In this article, we will explore the Titanic dataset using R and perform exploratory data analysis (EDA) to understand the data better.

Loading the Titanic Dataset

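The Titanic dataset can be downloaded from various sources. The “titanic” package, which is available on the Comprehensive R Archive Network (CRAN), provides passenger-level data frames such as titanic_train, while base R ships an aggregated table of counts called Titanic in its datasets package. This article works with the built-in Titanic table, so installing the titanic package is optional. To load the dataset, we can use the following code:

# Optional: install and load the titanic package for the
# passenger-level data frames (titanic_train, titanic_test)
# install.packages("titanic")
# library(titanic)

# Load the built-in Titanic table from base R's datasets package
data("Titanic")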
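
For readers who want passenger-level rows (name, ticket, cabin, and so on) rather than aggregated counts, here is a minimal sketch that peeks at titanic_train from the titanic package. It assumes the package is installed and is guarded with requireNamespace(), so that part is simply skipped otherwise. The names builtin_class and builtin_total are introduced here for illustration only.

```r
# Base R's aggregated table: available without any extra package
builtin_class <- class(datasets::Titanic)   # the object is a contingency table
builtin_total <- sum(datasets::Titanic)     # total number of people counted
builtin_class
builtin_total

# Passenger-level data: only if the titanic package is installed (assumption)
if (requireNamespace("titanic", quietly = TRUE)) {
  str(titanic::titanic_train)
}
```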

Understanding the Titanic Dataset

Before we dive into the EDA, let’s understand the structure of the Titanic dataset. We can use the str() function to get the structure of the dataset:

str(Titanic)

The output of the above code shows that the Titanic dataset is a 4-dimensional table (array) of counts with dimensions Class, Sex, Age, and Survived. The Class dimension has four levels (1st, 2nd, 3rd, and Crew), the Sex dimension has two levels (Male and Female), the Age dimension has two levels (Child and Adult), and the Survived dimension has two levels (No and Yes).
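
To make the structure concrete, the sketch below lists the level names of each dimension and flattens the table into a data frame with one row per combination of levels. The name titanic_df is introduced here for illustration only.

```r
# Level names of each of the four dimensions
dimnames(Titanic)

# Flatten the 4-d table into a long data frame:
# one row per Class/Sex/Age/Survived combination, with the count in Freq
titanic_df <- as.data.frame(Titanic)
head(titanic_df)

# 4 * 2 * 2 * 2 = 32 combinations, covering 2201 people in total
nrow(titanic_df)
sum(titanic_df$Freq)
```

The long format is often easier to work with when the counts need to be filtered or plotted, while the original table form is convenient for the margin calculations used in the rest of this article.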

Exploring the Titanic Dataset

Now that we understand the structure of the Titanic dataset, let’s perform an EDA to understand the data better. We will start by looking at the overall survival rate of the people on board.

# Calculate the overall survival rate:
# number of survivors divided by the total number of people on board
overall_survival_rate <- sum(Titanic[, , , "Yes"]) / sum(Titanic)
overall_survival_rate

The output of the above code shows that the overall survival rate of those on board (passengers and crew) was around 32%. Now, let’s look at the survival rate by sex.

# Calculate the survival rate by sex:
# collapse to a Sex x Survived table, then take row proportions
sex_survival_rate <- prop.table(margin.table(Titanic, c(2, 4)), margin = 1)
sex_survival_rate

The output of the above code shows that the survival rate of females (about 73%) was much higher than that of males (about 21%). Now, let’s look at the survival rate by class.

# Calculate the survival rate by class:
# collapse to a Class x Survived table, then take row proportions
class_survival_rate <- prop.table(margin.table(Titanic, c(1, 4)), margin = 1)
class_survival_rate

The output of the above code shows that the survival rate of first-class passengers (about 62%) was higher than that of second-class passengers (about 41%), third-class passengers (about 25%), and crew (about 24%). Finally, let’s look at the survival rate by age group.

# Calculate the survival rate by age group:
# collapse to an Age x Survived table, then take row proportions
age_survival_rate <- prop.table(margin.table(Titanic, c(3, 4)), margin = 1)
age_survival_rate

The output of the above code shows that the survival rate of children (about 52%) was higher than that of adults (about 31%).
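
The three breakdowns above follow the same pattern, so they can be wrapped in a small helper. The function survival_rate_by() below is a hypothetical name, not part of any package. It is a minimal sketch that divides the number of survivors in each group by the total number of people in that group, using apply() with dimension names. Passing two dimensions gives a cross-breakdown, such as sex within class.

```r
# Hypothetical helper: share of survivors for each level (or combination
# of levels) of the given dimensions of the built-in Titanic table
survival_rate_by <- function(dims) {
  survivors <- apply(Titanic[, , , "Yes"], dims, sum)  # survivors per group
  everyone  <- apply(Titanic, dims, sum)               # all people per group
  survivors / everyone
}

# Single-dimension breakdowns
survival_rate_by("Sex")
survival_rate_by("Class")
survival_rate_by("Age")

# Cross-breakdown: survival rate for each Class and Sex combination
survival_rate_by(c("Class", "Sex"))
```

Looking at two dimensions at once can reveal patterns that the single-dimension rates hide, for example how the gap between males and females varies from one class to another.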
