Exploring Data Using dplyr in R


In data science projects, data is typically stored in tabular form. R has a built-in data frame type for such data, but the dplyr library offers a more powerful and flexible toolset for manipulating it. In this tutorial, you will learn how dplyr simplifies the exploration and manipulation of tabular data.

By the end of this tutorial, you will be familiar with:

  • How to manage a data frame.
  • How to execute common operations on a data frame.

Let’s get started!

Overview

This tutorial is divided into two main sections:

  1. Getting Started with dplyr
  2. Exploring a Dataset

Getting Started with dplyr

To use the dplyr library in R, you must install it using the following command:

install.packages("dplyr")

Alternatively, you can install the tidyverse package, which bundles dplyr together with several other packages commonly used in data science:

install.packages("tidyverse")

Once installed, load the library to access its features:

library(dplyr)

The dplyr library is designed for manipulating data frames, which are the primary structure for storing tabular data in R. Here’s how to create a simple data frame:

df <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 30, 35),
  occupation = c("Software Engineer", "Data Scientist", "Product Manager")
)

dplyr provides several “verbs” to manipulate data frames, including:

  • filter(): Selects rows based on column values.
  • slice(): Selects rows by their position.
  • arrange(): Sorts rows by column values.
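
As a minimal sketch of these row verbs, the snippet below applies each one to the small data frame created earlier (recreated here so the snippet runs on its own). The result names older, first_two, and by_age_desc are made up for illustration.

```r
library(dplyr)

# Same toy data frame as above, recreated so this snippet is self-contained
df <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 30, 35),
  occupation = c("Software Engineer", "Data Scientist", "Product Manager")
)

# filter(): keep only rows where age is greater than 26
older <- filter(df, age > 26)

# slice(): keep rows by position (here, the first two rows)
first_two <- slice(df, 1:2)

# arrange(): sort rows by age, largest first
by_age_desc <- arrange(df, desc(age))

print(older)
print(first_two)
print(by_age_desc)
```

None of these verbs modifies df in place. Each returns a new data frame, which is why the results are assigned to new names.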

For column operations, you have:

  • select(): Chooses a subset of columns.
  • rename(): Changes the names of columns.
  • mutate(): Modifies existing columns or creates new ones.
  • relocate(): Rearranges columns.
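
The column verbs can be sketched the same way on the toy data frame. The new column names job and age_months are hypothetical and chosen only for this example.

```r
library(dplyr)

# Same toy data frame as above, recreated so this snippet is self-contained
df <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 30, 35),
  occupation = c("Software Engineer", "Data Scientist", "Product Manager")
)

# select(): keep only the name and age columns
name_age <- select(df, name, age)

# rename(): the syntax is new_name = old_name
renamed <- rename(df, job = occupation)

# mutate(): add a column computed from an existing one
with_months <- mutate(df, age_months = age * 12)

# relocate(): with no position given, the column moves to the front
reordered <- relocate(df, occupation)

print(names(name_age))
print(names(renamed))
print(with_months)
print(names(reordered))
```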

Additionally, you can perform group-wise operations similar to SQL:

  • group_by(): Marks a table as grouped by one or more columns.
  • ungroup(): Removes the grouping, returning a regular table.
  • summarize(): Collapses each group into a single row.
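
The toy data frame above has no repeated values to group on, so the sketch below uses a hypothetical staff table with a team column. It shows summarize() collapsing each group to one row, and mutate() working within each group until ungroup() is called.

```r
library(dplyr)

# Hypothetical data with a repeated grouping column (team)
staff <- data.frame(
  name = c("Alice", "Bob", "Charlie", "Dana"),
  team = c("A", "B", "A", "B"),
  age = c(25, 30, 35, 40)
)

# group_by() only marks the groups; the rows themselves are unchanged
grouped <- group_by(staff, team)

# summarize() collapses each group into one row
team_summary <- summarize(grouped, n = n(), mean_age = mean(age))

# On a grouped table, mutate() computes mean(age) per team;
# ungroup() then drops the grouping so later steps see a regular table
centered <- grouped |>
  mutate(age_diff = age - mean(age)) |>
  ungroup()

print(team_summary)
print(centered)
```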

Exploring a Dataset

Let’s apply dplyr to explore a specific dataset.

We’ll use the Boston housing dataset. The raw file is hosted by the UCI Machine Learning Repository and has no header row, so you supply the column names when reading it into R:

boston_url <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data'
Boston <- read.table(boston_url, col.names = c("crim", "zn", "indus", "chas", "nox", "rm", "age", "dis", "rad", "tax", "ptratio", "black", "lstat", "medv"))

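Alternatively, the same dataset ships with the MASS package. Load only the dataset rather than attaching the whole package, because calling library(MASS) after library(dplyr) masks dplyr's select() with MASS::select() and breaks the select() calls later in this tutorial:

# Load just the Boston data frame without attaching MASS
data("Boston", package = "MASS")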

Converting the data frame to a tibble with as_tibble() gives a more compact and informative printout:

library(tibble)
as_tibble(Boston)

This will provide a nicely formatted output of the dataset, highlighting its dimensions and variable types. You’ll see something like this:

# A tibble: 506 × 14
     crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio black lstat  medv
    <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <dbl>   <dbl> <dbl> <dbl> <dbl>
1 0.00632    18  2.31     0 0.538  6.58  65.2  4.09     1   296    15.3  397.  4.98    24
...

The first line shows that the dataset has 506 rows and 14 columns. The next two lines list each column's name and data type.

To focus on specific columns, you can apply the select() function:

select(Boston, c(crim, medv)) |> as_tibble()

This returns only the two columns of interest. To examine the relationship between columns, you can plot them:

Boston |> mutate(inv_crim = 1/crim) |> select(c(inv_crim, medv)) |> plot()

This example computes the reciprocal of the crime rate and plots it against the median home value, which helps you spot a potential correlation between the two.
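
To put a number on what the plot suggests, one option is the Pearson correlation from base R's cor(). The sketch below is an illustration rather than part of the original walkthrough: it assumes the MASS package is installed and reads Boston from it, and the names cors, cor_crim, and cor_inv_crim are made up for this example.

```r
library(dplyr)

# Take Boston from MASS without attaching the package (assumes MASS is installed)
Boston <- MASS::Boston

# Correlation of medv with the crime rate and with its reciprocal
cors <- Boston |>
  mutate(inv_crim = 1 / crim) |>
  summarize(
    cor_crim = cor(crim, medv),
    cor_inv_crim = cor(inv_crim, medv)
  )

print(cors)
```

Because no group_by() is applied, summarize() treats the whole table as one group and returns a single row.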

You can also get descriptive statistics for every column at once with the base R summary() function:

summary(Boston)

This prints the minimum, quartiles, mean, and maximum of each numeric column. To compare groups, such as median home values in tracts that border the Charles River versus those that do not, use group_by():

group_by(Boston, chas) |> summarize(avg = mean(medv), sd = sd(medv))

This groups the dataset by chas and returns one row per group, with the mean and standard deviation of the median home value in each.
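
A group mean is easier to interpret when you also know how many rows are behind it. As a hedged extension of the example above, the sketch below adds a row count with n() and the group median. It assumes the MASS package is installed, and the names by_river, n, and med are chosen only for illustration.

```r
library(dplyr)

# Take Boston from MASS without attaching the package (assumes MASS is installed)
Boston <- MASS::Boston

# One row per chas value: group size plus mean, median, and sd of medv
by_river <- Boston |>
  group_by(chas) |>
  summarize(
    n = n(),
    avg = mean(medv),
    med = median(medv),
    sd = sd(medv)
  )

print(by_river)
```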

Conclusion

In this tutorial, you learned how to use the dplyr library for data manipulation in R. Specifically, you learned:

  • How to create and manage a data frame.
  • Common operations to filter, select, and summarize data.

Further Reading

For additional resources on dplyr and data manipulation in R:

Books:

  • Beginning Data Science in R 4 (2nd Edition) by Thomas Mailund
