9 The dplyr package
The
dplyr
package by Hadley Wickham contains a framework for data manipulation in R.When working with data you must:
- Be very clear with what you want to do.
- Describe those tasks in the form of a computer program.
- Execute the program.
The dplyr package makes these steps fast and easy:
- By constraining your options, it helps you think about your data manipulation challenges.
- It provides simple
verbs
, i.e., functions that correspond to the most common data manipulation tasks, to help you translate your thoughts into code.
To understand the
dplyr
package we will use thestarwars
database:
> library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
> dim(starwars)
## [1] 87 14
> starwars
## # A tibble: 87 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sk… 172 77 blond fair blue 19 male mascu…
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## 4 Darth V… 202 136 none white yellow 41.9 male mascu…
## 5 Leia Or… 150 49 brown light brown 19 fema… femin…
## 6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
## 7 Beru Wh… 165 75 brown light blue 47 fema… femin…
## 8 R5-D4 97 32 <NA> white, red red NA none mascu…
## 9 Biggs D… 183 84 black light brown 24 male mascu…
## 10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
## # ℹ 77 more rows
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
>
> # to get a wider view run
>
> # View(starwars)
Note that starwars
is a tibble
(a modern representation of the data frame). It’s particularly useful for large datasets because it only prints the first few rows. You can learn more about tibbles here. Note that you can convert data frames to tibbles with as_tibble()
function.
9.1 The pipe
All of the dplyr
functions take a data frame (or tibble) as the first argument. Rather than forcing the user to either save intermediate objects or nest functions, dplyr
provides the %>%
operator from magrittr
package.
x %>% f(y)
turns into f(x, y)
so the result from one step is then “piped” into the next step. You can use the pipe to rewrite multiple operations that you can read left-to-right, top-to-bottom (reading the pipe operator as “then”).
9.2 Single table verbs
dplyr
aims to provide a function for each basicverb
of data manipulation. These verbs can be organized into three categories based on the component of the dataset that they work with:- Rows:
filter()
chooses rows based on column values.slice()
chooses rows based on location.arrange()
changes the order of the rows.
- Columns:
select()
changes whether or not a column is included.rename()
changes the name of columns.mutate()
changes the values of columns and creates new columns.relocate()
changes the order of the columns.
- Groups of rows:
summarise()
collapses a group into a single row.
- Rows:
Filter rows with
filter()
:
filter()
allows you to select a subset of rows in a data frame. Like all single verbs, the first argument is the tibble (or data frame). The second and subsequent arguments refer to variables within that data frame, selecting rows where the expression is TRUE
. For example, we can select all character with light skin color and brown eyes with:
> library(magrittr)
> library(dplyr)
> starwars %>% filter(skin_color == "light", eye_color == "brown")
## # A tibble: 7 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Leia Org… 150 49 brown light brown 19 fema… femin…
## 2 Biggs Da… 183 84 black light brown 24 male mascu…
## 3 Padmé Am… 185 45 brown light brown 46 fema… femin…
## 4 Cordé 157 NA brown light brown NA <NA> <NA>
## 5 Dormé 165 NA brown light brown NA fema… femin…
## 6 Raymus A… 188 79 brown light brown NA male mascu…
## 7 Poe Dame… NA NA brown light brown NA male mascu…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
By the way, the shortcut to the pipe is Shift+Command+M (MacOS) or Ctrl+Shift+M (Win).
This is roughly equivalent to this base R code:
> attach(starwars)
## The following object is masked _by_ .GlobalEnv:
##
## gender
## The following object is masked from dat2:
##
## gender
## The following object is masked from dat:
##
## gender
> starwars[skin_color == "light" & eye_color == "brown", ]
## # A tibble: 7 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Leia Org… 150 49 brown light brown 19 fema… femin…
## 2 Biggs Da… 183 84 black light brown 24 male mascu…
## 3 Padmé Am… 185 45 brown light brown 46 fema… femin…
## 4 Cordé 157 NA brown light brown NA <NA> <NA>
## 5 Dormé 165 NA brown light brown NA fema… femin…
## 6 Raymus A… 188 79 brown light brown NA male mascu…
## 7 Poe Dame… NA NA brown light brown NA male mascu…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
- Arrange rows with
arrange()
:
arrange()
works similarly to filter()
except that instead of filtering or selecting rows, it reorders them. It takes a data frame, and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:
> starwars %>% arrange(height, mass)
## # A tibble: 87 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Yoda 66 17 white green brown 896 male mascu…
## 2 Ratts T… 79 15 none grey, blue unknown NA male mascu…
## 3 Wicket … 88 20 brown brown brown 8 male mascu…
## 4 Dud Bolt 94 45 none blue, grey yellow NA male mascu…
## 5 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## 6 R4-P17 96 NA none silver, r… red, blue NA none femin…
## 7 R5-D4 97 32 <NA> white, red red NA none mascu…
## 8 Sebulba 112 40 none grey, red orange NA male mascu…
## 9 Gasgano 122 NA none white, bl… black NA male mascu…
## 10 Watto 137 NA black blue, grey yellow NA male mascu…
## # ℹ 77 more rows
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
Use desc()
to order a column in descending order:
> starwars %>% arrange(desc(height))
## # A tibble: 87 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Yarael … 264 NA none white yellow NA male mascu…
## 2 Tarfful 234 136 brown brown blue NA male mascu…
## 3 Lama Su 229 88 none grey black NA male mascu…
## 4 Chewbac… 228 112 brown unknown blue 200 male mascu…
## 5 Roos Ta… 224 82 none grey orange NA male mascu…
## 6 Grievous 216 159 none brown, wh… green, y… NA male mascu…
## 7 Taun We 213 NA none grey black NA fema… femin…
## 8 Rugor N… 206 NA none green orange NA male mascu…
## 9 Tion Me… 206 80 none grey black NA male mascu…
## 10 Darth V… 202 136 none white yellow 41.9 male mascu…
## # ℹ 77 more rows
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
- Choose rows using their position with
slice()
:
slice()
lets you index rows by their (integer) locations. It allows you to select, remove, and duplicate rows. We can get characters from row numbers 5 through 10:
> starwars %>% slice(5:10)
## # A tibble: 6 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Leia Org… 150 49 brown light brown 19 fema… femin…
## 2 Owen Lars 178 120 brown, gr… light blue 52 male mascu…
## 3 Beru Whi… 165 75 brown light blue 47 fema… femin…
## 4 R5-D4 97 32 <NA> white, red red NA none mascu…
## 5 Biggs Da… 183 84 black light brown 24 male mascu…
## 6 Obi-Wan … 182 77 auburn, w… fair blue-gray 57 male mascu…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
>
> # remeber head() and tail()?
> starwars %>% slice_head(n = 3)
## # A tibble: 3 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sky… 172 77 blond fair blue 19 male mascu…
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
> starwars %>% slice_tail(n = 5)
## # A tibble: 5 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Finn NA NA black dark dark NA male mascu…
## 2 Rey NA NA brown light hazel NA fema… femin…
## 3 Poe Dame… NA NA brown light brown NA male mascu…
## 4 BB8 NA NA none none black NA none mascu…
## 5 Captain … NA NA none none unknown NA fema… femin…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
>
> # slice_sample() randomly selects rows
> starwars %>% slice_sample(n = 4)
## # A tibble: 4 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Adi Gall… 184 50 none dark blue NA fema… femin…
## 2 IG-88 200 140 none metal red 15 none mascu…
## 3 Finis Va… 170 NA blond fair blue 91 male mascu…
## 4 Shmi Sky… 163 NA black fair brown 72 fema… femin…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
>
> # use the option prop to choose a certain proportion of the cases
> starwars %>% slice_sample(prop = 0.1)
## # A tibble: 8 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 BB8 NA NA none none black NA none mascu…
## 2 Poggle t… 183 80 none green yellow NA male mascu…
## 3 Adi Gall… 184 50 none dark blue NA fema… femin…
## 4 Jocasta … 167 NA white fair blue NA fema… femin…
## 5 Lobot 175 79 none light blue 37 male mascu…
## 6 Yarael P… 264 NA none white yellow NA male mascu…
## 7 Jabba De… 175 1358 <NA> green-tan… orange 600 herm… mascu…
## 8 Luke Sky… 172 77 blond fair blue 19 male mascu…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
slice_min()
and slice_max()
select rows with highest or lowest values of a variable. Note that we first must choose only the values which are not NA
:
> starwars %>% filter(!is.na(height)) %>% slice_max(height)
## # A tibble: 1 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Yarael P… 264 NA none white yellow NA male mascu…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
> starwars %>% filter(!is.na(height)) %>% slice_max(height, n = 3)
## # A tibble: 3 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Yarael P… 264 NA none white yellow NA male mascu…
## 2 Tarfful 234 136 brown brown blue NA male mascu…
## 3 Lama Su 229 88 none grey black NA male mascu…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
- Select columns with
select()
:
Often you work with large datasets with many columns but only a few are actually of interest to you. select()
allows you to rapidly zoom in on a useful subset using operations that usually only work on numeric variable positions:
> # select columns by name
> starwars %>% select(hair_color, skin_color, eye_color)
## # A tibble: 87 × 3
## hair_color skin_color eye_color
## <chr> <chr> <chr>
## 1 blond fair blue
## 2 <NA> gold yellow
## 3 <NA> white, blue red
## 4 none white yellow
## 5 brown light brown
## 6 brown, grey light blue
## 7 brown light blue
## 8 <NA> white, red red
## 9 black light brown
## 10 auburn, white fair blue-gray
## # ℹ 77 more rows
>
> # select all columns between hair_color and eye_color
> starwars %>% select(hair_color:eye_color)
## # A tibble: 87 × 3
## hair_color skin_color eye_color
## <chr> <chr> <chr>
## 1 blond fair blue
## 2 <NA> gold yellow
## 3 <NA> white, blue red
## 4 none white yellow
## 5 brown light brown
## 6 brown, grey light blue
## 7 brown light blue
## 8 <NA> white, red red
## 9 black light brown
## 10 auburn, white fair blue-gray
## # ℹ 77 more rows
>
> # select all columns except those from hair_color to eye_color
> starwars %>% select(!(hair_color:eye_color))
## # A tibble: 87 × 11
## name height mass birth_year sex gender homeworld species films vehicles
## <chr> <int> <dbl> <dbl> <chr> <chr> <chr> <chr> <lis> <list>
## 1 Luke S… 172 77 19 male mascu… Tatooine Human <chr> <chr>
## 2 C-3PO 167 75 112 none mascu… Tatooine Droid <chr> <chr>
## 3 R2-D2 96 32 33 none mascu… Naboo Droid <chr> <chr>
## 4 Darth … 202 136 41.9 male mascu… Tatooine Human <chr> <chr>
## 5 Leia O… 150 49 19 fema… femin… Alderaan Human <chr> <chr>
## 6 Owen L… 178 120 52 male mascu… Tatooine Human <chr> <chr>
## 7 Beru W… 165 75 47 fema… femin… Tatooine Human <chr> <chr>
## 8 R5-D4 97 32 NA none mascu… Tatooine Droid <chr> <chr>
## 9 Biggs … 183 84 24 male mascu… Tatooine Human <chr> <chr>
## 10 Obi-Wa… 182 77 57 male mascu… Stewjon Human <chr> <chr>
## # ℹ 77 more rows
## # ℹ 1 more variable: starships <list>
>
> # select all columns ending with color
> starwars %>% select(ends_with("color"))
## # A tibble: 87 × 3
## hair_color skin_color eye_color
## <chr> <chr> <chr>
## 1 blond fair blue
## 2 <NA> gold yellow
## 3 <NA> white, blue red
## 4 none white yellow
## 5 brown light brown
## 6 brown, grey light blue
## 7 brown light blue
## 8 <NA> white, red red
## 9 black light brown
## 10 auburn, white fair blue-gray
## # ℹ 77 more rows
There are a number of helper functions you can use within select()
, like starts_with()
, ends_with()
, matches()
and contains()
. These let you quickly match larger blocks of variables that meet some criterion. See ?select
for more details.
You can rename variables with rename()
by using named arguments:
> starwars %>% rename(home_world = homeworld)
## # A tibble: 87 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sk… 172 77 blond fair blue 19 male mascu…
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## 4 Darth V… 202 136 none white yellow 41.9 male mascu…
## 5 Leia Or… 150 49 brown light brown 19 fema… femin…
## 6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
## 7 Beru Wh… 165 75 brown light blue 47 fema… femin…
## 8 R5-D4 97 32 <NA> white, red red NA none mascu…
## 9 Biggs D… 183 84 black light brown 24 male mascu…
## 10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
## # ℹ 77 more rows
## # ℹ 5 more variables: home_world <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
- Add new columns with
mutate()
:
Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns. This is the job of mutate()
:
> starwars %>% mutate(height_m = height / 100)
## # A tibble: 87 × 15
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sk… 172 77 blond fair blue 19 male mascu…
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## 4 Darth V… 202 136 none white yellow 41.9 male mascu…
## 5 Leia Or… 150 49 brown light brown 19 fema… femin…
## 6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
## 7 Beru Wh… 165 75 brown light blue 47 fema… femin…
## 8 R5-D4 97 32 <NA> white, red red NA none mascu…
## 9 Biggs D… 183 84 black light brown 24 male mascu…
## 10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
## # ℹ 77 more rows
## # ℹ 6 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>, height_m <dbl>
- Change column order with
relocate()
:
Use a similar syntax as select()
to move blocks of columns at once:
> starwars %>% relocate(sex:homeworld, .before = height)
## # A tibble: 87 × 14
## name sex gender homeworld height mass hair_color skin_color eye_color
## <chr> <chr> <chr> <chr> <int> <dbl> <chr> <chr> <chr>
## 1 Luke Sky… male mascu… Tatooine 172 77 blond fair blue
## 2 C-3PO none mascu… Tatooine 167 75 <NA> gold yellow
## 3 R2-D2 none mascu… Naboo 96 32 <NA> white, bl… red
## 4 Darth Va… male mascu… Tatooine 202 136 none white yellow
## 5 Leia Org… fema… femin… Alderaan 150 49 brown light brown
## 6 Owen Lars male mascu… Tatooine 178 120 brown, gr… light blue
## 7 Beru Whi… fema… femin… Tatooine 165 75 brown light blue
## 8 R5-D4 none mascu… Tatooine 97 32 <NA> white, red red
## 9 Biggs Da… male mascu… Tatooine 183 84 black light brown
## 10 Obi-Wan … male mascu… Stewjon 182 77 auburn, w… fair blue-gray
## # ℹ 77 more rows
## # ℹ 5 more variables: birth_year <dbl>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
- Summarise values with
summarise()
:
The last verb is summarise()
. It collapses a data frame to a single row: