9 The dplyr package

  • The dplyr package by Hadley Wickham contains a framework for data manipulation in R.

  • When working with data you must:

    • Be very clear with what you want to do.
    • Describe those tasks in the form of a computer program.
    • Execute the program.
  • The dplyr package makes these steps fast and easy:

    • By constraining your options, it helps you think about your data manipulation challenges.
    • It provides simple verbs, i.e., functions that correspond to the most common data manipulation tasks, to help you translate your thoughts into code.
  • To understand the dplyr package we will use the starwars database:

> library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
> dim(starwars)
## [1] 87 14
> starwars
## # A tibble: 87 × 14
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
##  2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
##  3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
##  4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
##  5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
##  6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
##  7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
##  8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
##  9 Biggs D…    183    84 black      light      brown           24   male  mascu…
## 10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
## # ℹ 77 more rows
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>
> 
> # to get a wider view run
> 
> # View(starwars)

Note that starwars is a tibble (a modern representation of the data frame). It’s particularly useful for large datasets because it only prints the first few rows. You can learn more about tibbles here. Note that you can convert data frames to tibbles with as_tibble() function.

9.1 The pipe

All of the dplyr functions take a data frame (or tibble) as the first argument. Rather than forcing the user to either save intermediate objects or nest functions, dplyr provides the %>% operator from magrittr package.

x %>% f(y) turns into f(x, y) so the result from one step is then “piped” into the next step. You can use the pipe to rewrite multiple operations that you can read left-to-right, top-to-bottom (reading the pipe operator as “then”).

9.2 Single table verbs

  • dplyr aims to provide a function for each basic verb of data manipulation. These verbs can be organized into three categories based on the component of the dataset that they work with:

    • Rows:
      • filter() chooses rows based on column values.
      • slice() chooses rows based on location.
      • arrange() changes the order of the rows.
    • Columns:
      • select() changes whether or not a column is included.
      • rename() changes the name of columns.
      • mutate() changes the values of columns and creates new columns.
      • relocate() changes the order of the columns.
    • Groups of rows:
      • summarise() collapses a group into a single row.
  • Filter rows with filter():

filter() allows you to select a subset of rows in a data frame. Like all single verbs, the first argument is the tibble (or data frame). The second and subsequent arguments refer to variables within that data frame, selecting rows where the expression is TRUE. For example, we can select all character with light skin color and brown eyes with:

> library(magrittr)
> library(dplyr)
> starwars %>% filter(skin_color == "light", eye_color == "brown")
## # A tibble: 7 × 14
##   name      height  mass hair_color skin_color eye_color birth_year sex   gender
##   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
## 1 Leia Org…    150    49 brown      light      brown             19 fema… femin…
## 2 Biggs Da…    183    84 black      light      brown             24 male  mascu…
## 3 Padmé Am…    185    45 brown      light      brown             46 fema… femin…
## 4 Cordé        157    NA brown      light      brown             NA <NA>  <NA>  
## 5 Dormé        165    NA brown      light      brown             NA fema… femin…
## 6 Raymus A…    188    79 brown      light      brown             NA male  mascu…
## 7 Poe Dame…     NA    NA brown      light      brown             NA male  mascu…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

By the way, the shortcut to the pipe is Shift+Command+M (MacOS) or Ctrl+Shift+M (Win).

This is roughly equivalent to this base R code:

> attach(starwars)
## The following object is masked _by_ .GlobalEnv:
## 
##     gender
## The following object is masked from dat2:
## 
##     gender
## The following object is masked from dat:
## 
##     gender
> starwars[skin_color == "light" & eye_color == "brown", ]
## # A tibble: 7 × 14
##   name      height  mass hair_color skin_color eye_color birth_year sex   gender
##   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
## 1 Leia Org…    150    49 brown      light      brown             19 fema… femin…
## 2 Biggs Da…    183    84 black      light      brown             24 male  mascu…
## 3 Padmé Am…    185    45 brown      light      brown             46 fema… femin…
## 4 Cordé        157    NA brown      light      brown             NA <NA>  <NA>  
## 5 Dormé        165    NA brown      light      brown             NA fema… femin…
## 6 Raymus A…    188    79 brown      light      brown             NA male  mascu…
## 7 Poe Dame…     NA    NA brown      light      brown             NA male  mascu…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>
  • Arrange rows with arrange():

arrange() works similarly to filter() except that instead of filtering or selecting rows, it reorders them. It takes a data frame, and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:

> starwars %>% arrange(height, mass)
## # A tibble: 87 × 14
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Yoda         66    17 white      green      brown            896 male  mascu…
##  2 Ratts T…     79    15 none       grey, blue unknown           NA male  mascu…
##  3 Wicket …     88    20 brown      brown      brown              8 male  mascu…
##  4 Dud Bolt     94    45 none       blue, grey yellow            NA male  mascu…
##  5 R2-D2        96    32 <NA>       white, bl… red               33 none  mascu…
##  6 R4-P17       96    NA none       silver, r… red, blue         NA none  femin…
##  7 R5-D4        97    32 <NA>       white, red red               NA none  mascu…
##  8 Sebulba     112    40 none       grey, red  orange            NA male  mascu…
##  9 Gasgano     122    NA none       white, bl… black             NA male  mascu…
## 10 Watto       137    NA black      blue, grey yellow            NA male  mascu…
## # ℹ 77 more rows
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

Use desc() to order a column in descending order:

> starwars %>% arrange(desc(height))
## # A tibble: 87 × 14
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Yarael …    264    NA none       white      yellow          NA   male  mascu…
##  2 Tarfful     234   136 brown      brown      blue            NA   male  mascu…
##  3 Lama Su     229    88 none       grey       black           NA   male  mascu…
##  4 Chewbac…    228   112 brown      unknown    blue           200   male  mascu…
##  5 Roos Ta…    224    82 none       grey       orange          NA   male  mascu…
##  6 Grievous    216   159 none       brown, wh… green, y…       NA   male  mascu…
##  7 Taun We     213    NA none       grey       black           NA   fema… femin…
##  8 Rugor N…    206    NA none       green      orange          NA   male  mascu…
##  9 Tion Me…    206    80 none       grey       black           NA   male  mascu…
## 10 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
## # ℹ 77 more rows
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>
  • Choose rows using their position with slice():

slice() lets you index rows by their (integer) locations. It allows you to select, remove, and duplicate rows. We can get characters from row numbers 5 through 10:

> starwars %>% slice(5:10)
## # A tibble: 6 × 14
##   name      height  mass hair_color skin_color eye_color birth_year sex   gender
##   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
## 1 Leia Org…    150    49 brown      light      brown             19 fema… femin…
## 2 Owen Lars    178   120 brown, gr… light      blue              52 male  mascu…
## 3 Beru Whi…    165    75 brown      light      blue              47 fema… femin…
## 4 R5-D4         97    32 <NA>       white, red red               NA none  mascu…
## 5 Biggs Da…    183    84 black      light      brown             24 male  mascu…
## 6 Obi-Wan …    182    77 auburn, w… fair       blue-gray         57 male  mascu…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>
> 
> # remeber head() and tail()?
> starwars %>% slice_head(n = 3)
## # A tibble: 3 × 14
##   name      height  mass hair_color skin_color eye_color birth_year sex   gender
##   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
## 1 Luke Sky…    172    77 blond      fair       blue              19 male  mascu…
## 2 C-3PO        167    75 <NA>       gold       yellow           112 none  mascu…
## 3 R2-D2         96    32 <NA>       white, bl… red               33 none  mascu…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>
> starwars %>% slice_tail(n = 5)
## # A tibble: 5 × 14
##   name      height  mass hair_color skin_color eye_color birth_year sex   gender
##   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
## 1 Finn          NA    NA black      dark       dark              NA male  mascu…
## 2 Rey           NA    NA brown      light      hazel             NA fema… femin…
## 3 Poe Dame…     NA    NA brown      light      brown             NA male  mascu…
## 4 BB8           NA    NA none       none       black             NA none  mascu…
## 5 Captain …     NA    NA none       none       unknown           NA fema… femin…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>
> 
> # slice_sample() randomly selects rows
> starwars %>% slice_sample(n = 4)
## # A tibble: 4 × 14
##   name      height  mass hair_color skin_color eye_color birth_year sex   gender
##   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
## 1 Adi Gall…    184    50 none       dark       blue              NA fema… femin…
## 2 IG-88        200   140 none       metal      red               15 none  mascu…
## 3 Finis Va…    170    NA blond      fair       blue              91 male  mascu…
## 4 Shmi Sky…    163    NA black      fair       brown             72 fema… femin…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>
> 
> # use the option prop to choose a certain proportion of the cases
> starwars %>% slice_sample(prop = 0.1)
## # A tibble: 8 × 14
##   name      height  mass hair_color skin_color eye_color birth_year sex   gender
##   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
## 1 BB8           NA    NA none       none       black             NA none  mascu…
## 2 Poggle t…    183    80 none       green      yellow            NA male  mascu…
## 3 Adi Gall…    184    50 none       dark       blue              NA fema… femin…
## 4 Jocasta …    167    NA white      fair       blue              NA fema… femin…
## 5 Lobot        175    79 none       light      blue              37 male  mascu…
## 6 Yarael P…    264    NA none       white      yellow            NA male  mascu…
## 7 Jabba De…    175  1358 <NA>       green-tan… orange           600 herm… mascu…
## 8 Luke Sky…    172    77 blond      fair       blue              19 male  mascu…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

slice_min() and slice_max() select rows with highest or lowest values of a variable. Note that we first must choose only the values which are not NA:

> starwars %>% filter(!is.na(height)) %>% slice_max(height)
## # A tibble: 1 × 14
##   name      height  mass hair_color skin_color eye_color birth_year sex   gender
##   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
## 1 Yarael P…    264    NA none       white      yellow            NA male  mascu…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>
> starwars %>% filter(!is.na(height)) %>% slice_max(height, n = 3)
## # A tibble: 3 × 14
##   name      height  mass hair_color skin_color eye_color birth_year sex   gender
##   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
## 1 Yarael P…    264    NA none       white      yellow            NA male  mascu…
## 2 Tarfful      234   136 brown      brown      blue              NA male  mascu…
## 3 Lama Su      229    88 none       grey       black             NA male  mascu…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>
  • Select columns with select():

Often you work with large datasets with many columns but only a few are actually of interest to you. select() allows you to rapidly zoom in on a useful subset using operations that usually only work on numeric variable positions:

> # select columns by name
> starwars %>% select(hair_color, skin_color, eye_color)
## # A tibble: 87 × 3
##    hair_color    skin_color  eye_color
##    <chr>         <chr>       <chr>    
##  1 blond         fair        blue     
##  2 <NA>          gold        yellow   
##  3 <NA>          white, blue red      
##  4 none          white       yellow   
##  5 brown         light       brown    
##  6 brown, grey   light       blue     
##  7 brown         light       blue     
##  8 <NA>          white, red  red      
##  9 black         light       brown    
## 10 auburn, white fair        blue-gray
## # ℹ 77 more rows
> 
> # select all columns between hair_color and eye_color 
> starwars %>% select(hair_color:eye_color)
## # A tibble: 87 × 3
##    hair_color    skin_color  eye_color
##    <chr>         <chr>       <chr>    
##  1 blond         fair        blue     
##  2 <NA>          gold        yellow   
##  3 <NA>          white, blue red      
##  4 none          white       yellow   
##  5 brown         light       brown    
##  6 brown, grey   light       blue     
##  7 brown         light       blue     
##  8 <NA>          white, red  red      
##  9 black         light       brown    
## 10 auburn, white fair        blue-gray
## # ℹ 77 more rows
> 
> # select all columns except those from hair_color to eye_color 
> starwars %>% select(!(hair_color:eye_color))
## # A tibble: 87 × 11
##    name    height  mass birth_year sex   gender homeworld species films vehicles
##    <chr>    <int> <dbl>      <dbl> <chr> <chr>  <chr>     <chr>   <lis> <list>  
##  1 Luke S…    172    77       19   male  mascu… Tatooine  Human   <chr> <chr>   
##  2 C-3PO      167    75      112   none  mascu… Tatooine  Droid   <chr> <chr>   
##  3 R2-D2       96    32       33   none  mascu… Naboo     Droid   <chr> <chr>   
##  4 Darth …    202   136       41.9 male  mascu… Tatooine  Human   <chr> <chr>   
##  5 Leia O…    150    49       19   fema… femin… Alderaan  Human   <chr> <chr>   
##  6 Owen L…    178   120       52   male  mascu… Tatooine  Human   <chr> <chr>   
##  7 Beru W…    165    75       47   fema… femin… Tatooine  Human   <chr> <chr>   
##  8 R5-D4       97    32       NA   none  mascu… Tatooine  Droid   <chr> <chr>   
##  9 Biggs …    183    84       24   male  mascu… Tatooine  Human   <chr> <chr>   
## 10 Obi-Wa…    182    77       57   male  mascu… Stewjon   Human   <chr> <chr>   
## # ℹ 77 more rows
## # ℹ 1 more variable: starships <list>
> 
> # select all columns ending with color
> starwars %>% select(ends_with("color"))
## # A tibble: 87 × 3
##    hair_color    skin_color  eye_color
##    <chr>         <chr>       <chr>    
##  1 blond         fair        blue     
##  2 <NA>          gold        yellow   
##  3 <NA>          white, blue red      
##  4 none          white       yellow   
##  5 brown         light       brown    
##  6 brown, grey   light       blue     
##  7 brown         light       blue     
##  8 <NA>          white, red  red      
##  9 black         light       brown    
## 10 auburn, white fair        blue-gray
## # ℹ 77 more rows

There are a number of helper functions you can use within select(), like starts_with(), ends_with(), matches() and contains(). These let you quickly match larger blocks of variables that meet some criterion. See ?select for more details.

You can rename variables with rename() by using named arguments:

> starwars %>% rename(home_world = homeworld)
## # A tibble: 87 × 14
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
##  2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
##  3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
##  4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
##  5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
##  6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
##  7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
##  8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
##  9 Biggs D…    183    84 black      light      brown           24   male  mascu…
## 10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
## # ℹ 77 more rows
## # ℹ 5 more variables: home_world <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>
  • Add new columns with mutate():

Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns. This is the job of mutate():

> starwars %>% mutate(height_m = height / 100)
## # A tibble: 87 × 15
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
##  2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
##  3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
##  4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
##  5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
##  6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
##  7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
##  8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
##  9 Biggs D…    183    84 black      light      brown           24   male  mascu…
## 10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
## # ℹ 77 more rows
## # ℹ 6 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>, height_m <dbl>
  • Change column order with relocate():

Use a similar syntax as select() to move blocks of columns at once:

> starwars %>% relocate(sex:homeworld, .before = height)
## # A tibble: 87 × 14
##    name      sex   gender homeworld height  mass hair_color skin_color eye_color
##    <chr>     <chr> <chr>  <chr>      <int> <dbl> <chr>      <chr>      <chr>    
##  1 Luke Sky… male  mascu… Tatooine     172    77 blond      fair       blue     
##  2 C-3PO     none  mascu… Tatooine     167    75 <NA>       gold       yellow   
##  3 R2-D2     none  mascu… Naboo         96    32 <NA>       white, bl… red      
##  4 Darth Va… male  mascu… Tatooine     202   136 none       white      yellow   
##  5 Leia Org… fema… femin… Alderaan     150    49 brown      light      brown    
##  6 Owen Lars male  mascu… Tatooine     178   120 brown, gr… light      blue     
##  7 Beru Whi… fema… femin… Tatooine     165    75 brown      light      blue     
##  8 R5-D4     none  mascu… Tatooine      97    32 <NA>       white, red red      
##  9 Biggs Da… male  mascu… Tatooine     183    84 black      light      brown    
## 10 Obi-Wan … male  mascu… Stewjon      182    77 auburn, w… fair       blue-gray
## # ℹ 77 more rows
## # ℹ 5 more variables: birth_year <dbl>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>
  • Summarise values with summarise():

The last verb is summarise(). It collapses a data frame to a single row:

> starwars %>% summarise(height = mean(height, na.rm = TRUE))
## # A tibble: 1 × 1
##   height
##    <dbl>
## 1   175.