Data transformation with R

Data transformation is a crucial step in data analysis, and R provides many tools for transforming and manipulating data, both in base R and in packages such as dplyr.

Suppose you have a data frame called “mydata” that contains information about some customers: their name, age, gender, and income. Here is a sample of what the data might look like:

   name  age gender income
1   Bob   25      M  50000
2 Alice   30      F  60000
3   Tom   35      M  70000
4   Sue   40      F  80000
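
To follow along, the sample can be recreated as a data frame. This is a minimal sketch that simply types in the four rows shown above; the column types (character for name and gender, numeric for age and income) are an assumption, since the original only shows the printed table.

```r
# Recreate the sample customer data as a base R data frame.
# Column types are assumed: character for name/gender, numeric for age/income.
mydata <- data.frame(
  name   = c("Bob", "Alice", "Tom", "Sue"),
  age    = c(25, 30, 35, 40),
  gender = c("M", "F", "M", "F"),
  income = c(50000, 60000, 70000, 80000),
  stringsAsFactors = FALSE
)

# Inspect the structure and contents before transforming anything.
str(mydata)
print(mydata)
```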

Here are some common transformations that you can apply to this dataset in R:

  1. Subset the data:

You can select a subset of the data based on some criteria using the subset() function. For example, you can select only the customers who are over 30 years old:

mydata_subset <- subset(mydata, age > 30)

This will create a new dataset called “mydata_subset” that contains only the rows where age is greater than 30.
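
The same filter can also be written with logical indexing in base R. The sketch below rebuilds the sample rows so it runs on its own and compares the two approaches; the name “by_index” is only illustrative. One difference worth knowing: subset() drops rows where the condition is NA, while bracket indexing keeps them as NA rows. The sample data has no missing ages, so both give the same customers here.

```r
# Sample data (typed in from the table above).
mydata <- data.frame(
  name   = c("Bob", "Alice", "Tom", "Sue"),
  age    = c(25, 30, 35, 40),
  gender = c("M", "F", "M", "F"),
  income = c(50000, 60000, 70000, 80000),
  stringsAsFactors = FALSE
)

# subset(): rows where age is strictly greater than 30.
mydata_subset <- subset(mydata, age > 30)

# Equivalent logical indexing; the trailing comma keeps all columns.
by_index <- mydata[mydata$age > 30, ]

print(mydata_subset)
```

Note that Alice, who is exactly 30, is excluded because the comparison is strict; use age >= 30 to include her.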

  2. Rename columns:

You can rename the columns in the dataset using the colnames() function. For example, you can rename the “gender” column to “sex”:

colnames(mydata)[3] <- "sex"

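This will rename the third column (which is the “gender” column) to “sex”. Unlike the other examples, this modifies “mydata” in place rather than creating a new dataset, so the later steps refer to the column as “sex”.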
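
Renaming by position breaks silently if the column order ever changes. A slightly more defensive sketch looks the column up by its current name instead; it uses only base R and rebuilds the sample rows so it can run on its own.

```r
# Sample data (typed in from the table above).
mydata <- data.frame(
  name   = c("Bob", "Alice", "Tom", "Sue"),
  age    = c(25, 30, 35, 40),
  gender = c("M", "F", "M", "F"),
  income = c(50000, 60000, 70000, 80000),
  stringsAsFactors = FALSE
)

# Find the column by name rather than by position, then rename it.
# If no column is called "gender", nothing is changed.
names(mydata)[names(mydata) == "gender"] <- "sex"

print(names(mydata))
```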

  3. Reorder columns:

You can reorder the columns in the dataset using the select() function from the dplyr package. For example, you can move the “income” column to the front of the dataset:

library(dplyr)
mydata_new <- select(mydata, income, everything())

This will create a new dataset called “mydata_new” that has the “income” column as the first column, followed by the other columns in the original dataset.

  4. Create new columns:

You can create new columns in the dataset based on some calculation or function using the mutate() function from the dplyr package. For example, you can create a new column called “income_log” that contains the natural logarithm of the “income” column (log() uses base e by default):

mydata_new <- mutate(mydata, income_log = log(income))

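This will create a new dataset called “mydata_new” that has a new column called “income_log” containing the natural logarithm of the “income” column. Because it starts from “mydata” again, it replaces any earlier “mydata_new” instead of building on it.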

  5. Group and summarize data:

You can group the data based on some variable and summarize the data using the group_by() and summarize() functions from the dplyr package. For example, you can group the data by “sex” and calculate the average income for each sex:

mydata_summary <- mydata %>%
  group_by(sex) %>%
  summarize(avg_income = mean(income))

This will create a new dataset called “mydata_summary” that has two rows (one for each sex) and two columns: “sex”, which identifies the group, and “avg_income”, which contains the average income for that group.
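
As a cross-check that does not depend on dplyr, the same group averages can be computed with aggregate() from base R. This is a sketch on the sample rows, with the column already renamed to “sex”; the name “avg_by_sex” is only illustrative. For the sample data the averages work out to 70000 for F (Alice and Sue) and 60000 for M (Bob and Tom).

```r
# Sample data with the column already renamed to "sex".
mydata <- data.frame(
  name   = c("Bob", "Alice", "Tom", "Sue"),
  age    = c(25, 30, 35, 40),
  sex    = c("M", "F", "M", "F"),
  income = c(50000, 60000, 70000, 80000),
  stringsAsFactors = FALSE
)

# Mean income per sex using the formula interface of aggregate().
avg_by_sex <- aggregate(income ~ sex, data = mydata, FUN = mean)

# aggregate() keeps the original column name, so rename it to match
# the avg_income column used in the dplyr example.
names(avg_by_sex)[names(avg_by_sex) == "income"] <- "avg_income"

print(avg_by_sex)
```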
