3 Data structures
In R we have objects which are functions and objects which are data.
- Function examples:
sin()
integrate()
plot()
paste()
- Data examples:
42
1:5
"R"
matrix(1:12, nrow=4, ncol=3)
data.frame(a=1:5, tmt=c("a","b","a","b","a"))
list(x=2, y="abc", z=1:10)
3.1 Vector
> # Vector of numbers, e.g:
> c(1,1.2,pi,exp(1))
## [1] 1.000 1.200 3.142 2.718
>
> # We can have vectors of other things too, e.g:
> c(TRUE,1==2)
## [1] TRUE FALSE
> c("a","ab","abc")
## [1] "a" "ab" "abc"
>
> # But not combinations, e.g:
> c("a",5,1==2)
## [1] "a" "5" "FALSE"
> # Notice that R just turned everything into characters!
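The conversion to characters seen above follows a general rule: when c() is given mixed types, everything is converted to the most flexible type present. The following is a minimal sketch of that rule; the object names v1, v2, v3 and mixed are only illustrative.

```r
# Coercion order in c(), from least to most flexible:
# logical < integer < double < character
v1 <- c(TRUE, 2L)     # logical and integer  -> integer
v2 <- c(2L, 3.5)      # integer and double   -> double
v3 <- c(3.5, "a")     # double and character -> character

typeof(v1)
typeof(v2)
typeof(v3)

# If the elements must keep their own types, a list can hold them
# (lists are covered in section 3.5)
mixed <- list("a", 5, 1 == 2)
sapply(mixed, class)
```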
3.1.1 Constructing vectors
> # Integers from 9 to 17
> x<-9:17
> x
## [1] 9 10 11 12 13 14 15 16 17
>
> # A sequence of 11 numbers from 0 to 1
> y<-seq(0,1,length.out=11)
> y
## [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
>
> # The same number or the same vector several times
> z<-rep(1:2, 5)
> z
## [1] 1 2 1 2 1 2 1 2 1 2
>
> # Combine numbers, vectors or both into a new vector
> xz10<-c(x,z,10)
> xz10
## [1] 9 10 11 12 13 14 15 16 17 1 2 1 2 1 2 1 2 1 2 10
3.1.2 Index and logical index
> # Define a vector with integers from (-5) to 5 and extract the numbers with absolute value less than 3:
> x<- (-5):5
> x
## [1] -5 -4 -3 -2 -1 0 1 2 3 4 5
>
> # by their index in the vector:
> x[4:8]
## [1] -2 -1 0 1 2
>
> # or, by negative selection (set a minus in front of the indices we don't want):
> x[-c(1:3,9:11)]
## [1] -2 -1 0 1 2
>
> # A logical vector can be defined by:
> index<-abs(x)<3
> index
## [1] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
>
> # Now this vector can be used to extract the wanted numbers:
> x[index]
## [1] -2 -1 0 1 2
3.2 Factor
- A special kind of vector is a factor. It has a known finite set of levels (options), e.g:
> # gl = generate levels
> gl(2,10, labels=c("male", "female"))
## [1] male male male male male male male male male male
## [11] female female female female female female female female female female
## Levels: male female
>
> # One could also do (note that as.factor() sorts the levels alphabetically):
> as.factor(c(rep("male",10),rep("female",10)))
## [1] male male male male male male male male male male
## [11] female female female female female female female female female female
## Levels: female male
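The two factors above hold the same values but differ in the order of their levels. If the order matters (for instance in tables or plots), it can be set explicitly with factor(). A small sketch, with illustrative object names:

```r
sex <- c(rep("male", 10), rep("female", 10))

# as.factor() sorts the levels alphabetically
f_default <- as.factor(sex)
levels(f_default)

# factor() with an explicit levels argument keeps the order we give
f_ordered <- factor(sex, levels = c("male", "female"))
levels(f_ordered)

# Counts per level, reported in level order
table(f_ordered)
```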
3.3 Matrix and array
- Similar to vectors we can have matrices of objects of the same type, e.g:
> matrix(c(1,2,3,4,5,6)+pi,nrow=2)
## [,1] [,2] [,3]
## [1,] 4.142 6.142 8.142
## [2,] 5.142 7.142 9.142
>
> matrix(c(1,2,3,4,5,6)+pi,nrow=2)<6
## [,1] [,2] [,3]
## [1,] TRUE FALSE FALSE
## [2,] TRUE FALSE FALSE
>
> # We can create higher order arrays, e.g:
> array(c(1:24), dim=c(4,3,2))
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 5 9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 13 17 21
## [2,] 14 18 22
## [3,] 15 19 23
## [4,] 16 20 24
3.3.1 Constructing matrices
>
> # Combine rows into a matrix
> A<-rbind(1:3, c(1,1,2))
> A
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 1 1 2
>
> # Or columns
> B<-cbind(1:3, c(1,1,2))
> B
## [,1] [,2]
## [1,] 1 1
## [2,] 2 1
## [3,] 3 2
>
> # Define a matrix from one long vector
> C<-matrix(c(1,0,0,1,1,0,1,1,1), nrow=3, ncol=3)
> C
## [,1] [,2] [,3]
## [1,] 1 1 1
## [2,] 0 1 1
## [3,] 0 0 1
>
> # The matrix can also be filled row by row by adding the argument "byrow=TRUE". Try!
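For readers who want to check their attempt, here is a small sketch of the byrow variant, using the same input vector as above (the names v, C_col and C_row are only illustrative):

```r
v <- c(1, 0, 0, 1, 1, 0, 1, 1, 1)

# Default: the vector fills the matrix column by column
C_col <- matrix(v, nrow = 3, ncol = 3)

# With byrow = TRUE the vector fills the matrix row by row
C_row <- matrix(v, nrow = 3, ncol = 3, byrow = TRUE)
C_row

# Filling by row gives the transpose of filling by column
identical(C_row, t(C_col))
```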
3.3.2 Index and logical index
> A<-matrix((-4):5, nrow=2, ncol=5)
> A
## [,1] [,2] [,3] [,4] [,5]
## [1,] -4 -2 0 2 4
## [2,] -3 -1 1 3 5
>
>
> # Negative values
> A[A<0]
## [1] -4 -3 -2 -1
>
> # Assignments
> A[A<0]<-0
> A
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 0 0 2 4
## [2,] 0 0 1 3 5
>
> # Matrix rows can be selected by
> A[2,]
## [1] 0 0 1 3 5
>
> # and similarly for columns
> A[,c(2,4)]
## [,1] [,2]
## [1,] 0 2
## [2,] 0 3
3.3.3 Properties of vectors and matrices
- The R function mode(), when applied to a vector or to a matrix, reports the type of the elements stored in it:
> A<-matrix(rep(c(TRUE,FALSE),2),nrow=2)
>
> B<-rnorm(4)
>
> C<-matrix(LETTERS[1:9],nrow=3)
>
> A;B;C
## [,1] [,2]
## [1,] TRUE TRUE
## [2,] FALSE FALSE
## [1] -0.006513 -1.435758 0.353105 1.109455
## [,1] [,2] [,3]
## [1,] "A" "D" "G"
## [2,] "B" "E" "H"
## [3,] "C" "F" "I"
>
> mode(A); mode(B); mode(C)
## [1] "logical"
## [1] "numeric"
## [1] "character"
- Vectors and matrices have lengths: the length is the number of elements:
> x<-matrix(c(NA,2:12),ncol=3)
> x
## [,1] [,2] [,3]
## [1,] NA 5 9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12
>
> length(x[1,])
## [1] 3
>
> length(x)
## [1] 12
>
> # The dimension of a matrix is its number of rows and columns; the number of columns is the second element:
> dim(x); dim(x)[2]
## [1] 4 3
## [1] 3
3.3.4 Naming rows and columns in a matrix
- We can add names to a matrix with the colnames() and rownames() functions:
> x<-matrix(rnorm(12),nrow=4)
> x
## [,1] [,2] [,3]
## [1,] 1.1041 1.3221 -0.4545
## [2,] 0.7094 1.2795 1.3075
## [3,] 1.2753 0.3815 -0.5322
## [4,] 1.3026 -0.2334 0.8438
>
> colnames(x)<-paste("data",1:3,sep="")
>
> rownames(x)<-paste("obs",1:4,sep="")
>
> x
## data1 data2 data3
## obs1 1.1041 1.3221 -0.4545
## obs2 0.7094 1.2795 1.3075
## obs3 1.2753 0.3815 -0.5322
## obs4 1.3026 -0.2334 0.8438
>
> y<-matrix(rnorm(15),nrow=5)
> y
## [,1] [,2] [,3]
## [1,] 2.2341 2.9116 1.04936
## [2,] 0.2674 -0.3239 0.55235
## [3,] -1.3993 -0.9896 -0.23531
## [4,] -1.1132 -0.4892 -0.53210
## [5,] 0.2939 1.3329 -0.07947
>
> colnames(y)<-LETTERS[1:ncol(y)]
>
> rownames(y)<-letters[1:nrow(y)]
>
> y
## A B C
## a 2.2341 2.9116 1.04936
## b 0.2674 -0.3239 0.55235
## c -1.3993 -0.9896 -0.23531
## d -1.1132 -0.4892 -0.53210
## e 0.2939 1.3329 -0.07947
3.3.5 Matrix multiplication
> M<-matrix(rnorm(20),nrow=4,ncol=5)
> N<-matrix(rnorm(15),nrow=5,ncol=3)
>
> M%*%N
## [,1] [,2] [,3]
## [1,] 7.7378 -3.1252 0.6263
## [2,] -0.3942 2.7825 1.3672
## [3,] 2.7417 -0.9445 2.6622
## [4,] -1.6990 1.0751 -2.3093
>
> # Can we compute N%*%M? No! N (5x3) and M (4x5) are not conformable: ncol(N) differs from nrow(M). Try to run:
> # N%*%M
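The rule behind this is that A%*%B requires ncol(A) to equal nrow(B). A minimal sketch that checks the dimensions and catches the error instead of stopping; the seed is an assumption made only for reproducibility, so the numbers will differ from those printed above.

```r
set.seed(1)  # assumed seed, not taken from the text
M <- matrix(rnorm(20), nrow = 4, ncol = 5)
N <- matrix(rnorm(15), nrow = 5, ncol = 3)

# ncol(M) = 5 equals nrow(N) = 5, so M %*% N works and is 4 x 3
dim(M %*% N)

# ncol(N) = 3 differs from nrow(M) = 4, so N %*% M fails;
# tryCatch() returns the error message instead of stopping
res <- tryCatch(N %*% M, error = function(e) conditionMessage(e))
res

# The transposes are conformable in the reverse order,
# and their product equals the transpose of M %*% N
dim(t(N) %*% t(M))
all.equal(t(N) %*% t(M), t(M %*% N))
```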
3.3.6 Additional functions
> M<-matrix(rnorm(16),nrow=4,ncol=4)
>
> dim(M)
## [1] 4 4
>
> t(M)
## [,1] [,2] [,3] [,4]
## [1,] 0.05753 0.7945 -0.1886 -0.6634
## [2,] -1.40129 -0.2262 0.4228 -2.0906
## [3,] -1.22525 -1.1664 0.6378 0.1316
## [4,] -0.87634 -0.1150 -0.3167 0.6809
>
> det(M)
## [1] 0.7816
>
> (invM <- solve(M))
## [,1] [,2] [,3] [,4]
## [1,] -0.90080 2.9977 3.546 0.99662
## [2,] 0.08001 -0.9423 -1.422 -0.71750
## [3,] -0.57779 1.3912 2.835 0.81010
## [4,] -0.52035 -0.2415 -1.458 0.08008
>
> eigen(M)
## eigen() decomposition
## $values
## [1] 0.8602+0.4772i 0.8602-0.4772i -0.2852+0.8523i -0.2852-0.8523i
##
## $vectors
## [,1] [,2] [,3] [,4]
## [1,] -0.280724+0.1135i -0.280724-0.1135i -0.6954+0.0000i -0.6954+0.0000i
## [2,] -0.003257-0.2056i -0.003257+0.2056i 0.2166+0.3230i 0.2166-0.3230i
## [3,] -0.352944+0.2702i -0.352944-0.2702i -0.1923-0.1829i -0.1923+0.1829i
## [4,] 0.817598+0.0000i 0.817598+0.0000i -0.3494+0.4156i -0.3494-0.4156i
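A few sanity checks can make these functions less of a black box. The sketch below assumes a seed for reproducibility (so the matrix is not the one printed above) and uses all.equal() rather than == because the results are only exact up to rounding error.

```r
set.seed(1)  # assumed seed, not taken from the text
M <- matrix(rnorm(16), nrow = 4, ncol = 4)
invM <- solve(M)

# M times its inverse should be the identity matrix (up to rounding)
round(M %*% invM, 10)
all.equal(M %*% invM, diag(4))

# The determinant is unchanged by transposition
all.equal(det(M), det(t(M)))

# The product of the eigenvalues equals the determinant;
# Re() drops the zero imaginary part if the eigenvalues are complex
ev <- eigen(M)$values
all.equal(Re(prod(ev)), det(M))
```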
3.4 Data frame
- A special data object is called a data frame (data.frame). We can create data frames by reading data in from files or by using the function as.data.frame() on a set of vectors. A data frame is a set of parallel vectors, where the vectors can be of different types, e.g:
## course hours
## 1 CTA 39
## 2 PSP 65
## 3 RM 52
## course hours
## [1,] "CTA" "39"
## [2,] "PSP" "65"
## [3,] "RM" "52"
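The commands that produced the two outputs above are not shown. The following sketch would give output of the same form; it is an assumption about what was run, using data.frame() for the first object and cbind() for the second, with illustrative object names.

```r
course <- c("CTA", "PSP", "RM")
hours  <- c(39, 65, 52)

# A data frame keeps each column in its own type
df <- data.frame(course = course, hours = hours, stringsAsFactors = FALSE)
df
sapply(df, class)

# A matrix must have a single type, so the numbers become characters
m <- cbind(course, hours)
m
mode(m)
```

This is the practical difference between the two structures: hours can be used in arithmetic in the data frame, but not in the matrix.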
3.4.1 Data frames: adding and removing columns
## x y
## 1 A 1
## 2 B 2
## 3 C 3
## [1] "A" "B" "C"
## [1] "A" "B" "C"
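The definition of dat is not shown above. A sketch that would give the same printed output, assuming dat was built with data.frame() and that the two identical lines are a column extracted in two different ways:

```r
# stringsAsFactors = FALSE keeps x as character in all R versions
dat <- data.frame(x = c("A", "B", "C"), y = 1:3, stringsAsFactors = FALSE)
dat

# Two equivalent ways of extracting a single column as a vector
dat$x
dat[["x"]]
```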
> # It is simple to add or remove a column:
>
> dat$z <- dat$y^2
> dat$name <- c("A1", "A2", "A3")
> dat$y<-NULL
> dat
## x z name
## 1 A 1 A1
## 2 B 4 A2
## 3 C 9 A3
3.4.2 Data frames: merging data frames
## course hours
## 1 CTA 39
## 2 PSP 65
## 3 RM 52
## course credits
## 1 RM 6
## 2 CTA 4
## 3 PSP 8
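The definitions of the two data frames are not shown above. A sketch that reproduces them, assuming they were built directly with data.frame() and named df1 and df2 as in the merge() call that follows:

```r
# Hours per course
df1 <- data.frame(course = c("CTA", "PSP", "RM"), hours = c(39, 65, 52))

# Credits per course, listed in a different order
df2 <- data.frame(course = c("RM", "CTA", "PSP"), credits = c(6, 4, 8))

df1
df2
```

Note that the rows of df2 are in a different order from those of df1; merge() matches rows on the values of the course column, not on their position.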
> # We can merge that information into one data set by:
>
> df12 <- merge(df1, df2, by="course")
> df12
## course hours credits
## 1 CTA 39 4
## 2 PSP 65 8
## 3 RM 52 6
3.4.3 Data frames: getting dimension, column info and others
## [1] "Ozone" "Solar.R" "Wind" "Temp" "Month" "Day"
## [1] "integer"
## [1] "numeric"
## [1] 153 6
## [1] 153
## [1] 6
## 'data.frame': 153 obs. of 6 variables:
## $ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
## $ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
## $ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
## $ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
## $ Month : int 5 5 5 5 5 5 5 5 5 5 ...
## $ Day : int 1 2 3 4 5 6 7 8 9 10 ...
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
## 7 23 299 8.6 65 5 7
## 8 19 99 13.8 59 5 8
## 9 8 19 20.1 61 5 9
## 10 NA 194 8.6 69 5 10
## Ozone Solar.R Wind Temp Month Day
## 151 14 191 14.3 75 9 28
## 152 18 131 8.0 76 9 29
## 153 20 223 11.5 68 9 30
## Ozone Solar.R Wind Temp Month Day
## 145 23 14 9.2 71 9 22
## 146 36 139 10.3 81 9 23
## 147 7 49 10.3 69 9 24
## 148 14 20 16.6 63 9 25
## 149 30 193 6.9 70 9 26
## 150 NA 145 13.2 77 9 27
## 151 14 191 14.3 75 9 28
## 152 18 131 8.0 76 9 29
## 153 20 223 11.5 68 9 30
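The calls behind the output above are not shown. The output refers to the built-in airquality data set, and the following standard functions would give output of the same form; the exact calls and arguments are an assumption.

```r
names(airquality)          # column names
class(airquality$Ozone)    # type of one column
class(airquality$Wind)
dim(airquality)            # rows and columns
nrow(airquality)
ncol(airquality)
str(airquality)            # compact overview of all columns

head(airquality, n = 3)    # first rows
head(airquality, n = 10)
tail(airquality, n = 3)    # last rows
tail(airquality, n = 9)
```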
3.4.4 Data frames: the subset() function
- Let’s look at the airquality data again:
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
- Logical indexing applies to data frames:
- … but a neat function is built in for making subsets of data:
## Ozone Temp
## 29 45 81
## 35 NA 84
## 36 NA 85
## 38 29 82
## 39 NA 87
## 40 71 90
## 41 39 87
## 42 NA 93
## 43 NA 92
## 44 23 82
## 61 NA 83
## 62 135 84
## 63 49 85
## 64 32 81
## 65 NA 84
## 66 64 83
## 67 40 83
## 68 77 88
## 69 97 92
## 70 97 92
## 71 85 89
## 72 NA 82
## 74 27 81
## 75 NA 91
## 77 48 81
## 78 35 82
## 79 61 84
## 80 79 87
## 81 63 85
## 83 NA 81
## 84 NA 82
## 85 80 86
## 86 108 85
## 87 20 82
## 88 52 86
## 89 82 88
## 90 50 86
## 91 64 83
## 92 59 81
## 93 39 81
## 94 9 81
## 95 16 82
## 96 78 86
## 97 35 85
## 98 66 87
## 99 122 89
## 100 89 90
## 101 110 90
## 102 NA 92
## 103 NA 86
## 104 44 86
## 105 28 82
## 117 168 81
## 118 73 86
## 119 NA 88
## 120 76 97
## 121 118 94
## 122 84 96
## 123 85 94
## 124 96 91
## 125 78 92
## 126 73 93
## 127 91 93
## 128 47 87
## 129 32 84
## 134 44 81
## 143 16 82
## 146 36 81
## Ozone Solar.R Wind Month Day
## 1 41 190 7.4 5 1
## 32 NA 286 8.6 6 1
## 62 135 269 4.1 7 1
## 93 39 83 6.9 8 1
## 124 96 167 6.9 9 1
## Ozone Solar.R Wind
## 1 41 190 7.4
## 2 36 118 8.0
## 3 12 149 12.6
## 4 18 313 11.5
## 5 NA NA 14.3
## 6 28 NA 14.9
## 7 23 299 8.6
## 8 19 99 13.8
## 9 8 19 20.1
## 10 NA 194 8.6
## 11 7 NA 6.9
## 12 16 256 9.7
## 13 11 290 9.2
## 14 14 274 10.9
## 15 18 65 13.2
## 16 14 334 11.5
## 17 34 307 12.0
## 18 6 78 18.4
## 19 30 322 11.5
## 20 11 44 9.7
## 21 1 8 9.7
## 22 11 320 16.6
## 23 4 25 9.7
## 24 32 92 12.0
## 25 NA 66 16.6
## 26 NA 266 14.9
## 27 NA NA 8.0
## 28 23 13 12.0
## 29 45 252 14.9
## 30 115 223 5.7
## 31 37 279 7.4
## 32 NA 286 8.6
## 33 NA 287 9.7
## 34 NA 242 16.1
## 35 NA 186 9.2
## 36 NA 220 8.6
## 37 NA 264 14.3
## 38 29 127 9.7
## 39 NA 273 6.9
## 40 71 291 13.8
## 41 39 323 11.5
## 42 NA 259 10.9
## 43 NA 250 9.2
## 44 23 148 8.0
## 45 NA 332 13.8
## 46 NA 322 11.5
## 47 21 191 14.9
## 48 37 284 20.7
## 49 20 37 9.2
## 50 12 120 11.5
## 51 13 137 10.3
## 52 NA 150 6.3
## 53 NA 59 1.7
## 54 NA 91 4.6
## 55 NA 250 6.3
## 56 NA 135 8.0
## 57 NA 127 8.0
## 58 NA 47 10.3
## 59 NA 98 11.5
## 60 NA 31 14.9
## 61 NA 138 8.0
## 62 135 269 4.1
## 63 49 248 9.2
## 64 32 236 9.2
## 65 NA 101 10.9
## 66 64 175 4.6
## 67 40 314 10.9
## 68 77 276 5.1
## 69 97 267 6.3
## 70 97 272 5.7
## 71 85 175 7.4
## 72 NA 139 8.6
## 73 10 264 14.3
## 74 27 175 14.9
## 75 NA 291 14.9
## 76 7 48 14.3
## 77 48 260 6.9
## 78 35 274 10.3
## 79 61 285 6.3
## 80 79 187 5.1
## 81 63 220 11.5
## 82 16 7 6.9
## 83 NA 258 9.7
## 84 NA 295 11.5
## 85 80 294 8.6
## 86 108 223 8.0
## 87 20 81 8.6
## 88 52 82 12.0
## 89 82 213 7.4
## 90 50 275 7.4
## 91 64 253 7.4
## 92 59 254 9.2
## 93 39 83 6.9
## 94 9 24 13.8
## 95 16 77 7.4
## 96 78 NA 6.9
## 97 35 NA 7.4
## 98 66 NA 4.6
## 99 122 255 4.0
## 100 89 229 10.3
## 101 110 207 8.0
## 102 NA 222 8.6
## 103 NA 137 11.5
## 104 44 192 11.5
## 105 28 273 11.5
## 106 65 157 9.7
## 107 NA 64 11.5
## 108 22 71 10.3
## 109 59 51 6.3
## 110 23 115 7.4
## 111 31 244 10.9
## 112 44 190 10.3
## 113 21 259 15.5
## 114 9 36 14.3
## 115 NA 255 12.6
## 116 45 212 9.7
## 117 168 238 3.4
## 118 73 215 8.0
## 119 NA 153 5.7
## 120 76 203 9.7
## 121 118 225 2.3
## 122 84 237 6.3
## 123 85 188 6.3
## 124 96 167 6.9
## 125 78 197 5.1
## 126 73 183 2.8
## 127 91 189 4.6
## 128 47 95 7.4
## 129 32 92 15.5
## 130 20 252 10.9
## 131 23 220 10.3
## 132 21 230 10.9
## 133 24 259 9.7
## 134 44 236 14.9
## 135 21 259 15.5
## 136 28 238 6.3
## 137 9 24 10.9
## 138 13 112 11.5
## 139 46 237 6.9
## 140 18 224 13.8
## 141 13 27 10.3
## 142 24 238 10.3
## 143 16 201 8.0
## 144 13 238 12.6
## 145 23 14 9.2
## 146 36 139 10.3
## 147 7 49 10.3
## 148 14 20 16.6
## 149 30 193 6.9
## 150 NA 145 13.2
## 151 14 191 14.3
## 152 18 131 8.0
## 153 20 223 11.5
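The calls that produced the three subsets above are not shown. The sketch below gives subsets with the same rows and columns; the exact calls are an assumption, and the object names are illustrative. The last lines show the same selection written with logical indexing.

```r
# Rows with Temp above 80, keeping only Ozone and Temp
hot <- subset(airquality, Temp > 80, select = c(Ozone, Temp))
head(hot, 3)

# First day of each month, dropping the Temp column
first_days <- subset(airquality, Day == 1, select = -Temp)
first_days

# All rows, keeping the columns from Ozone to Wind
three_cols <- subset(airquality, select = Ozone:Wind)
head(three_cols, 3)

# The first subset again, using logical indexing
hot_idx <- airquality[airquality$Temp > 80, c("Ozone", "Temp")]
all.equal(hot, hot_idx)
```

One difference is worth knowing: when the condition evaluates to NA for a row, subset() drops that row, whereas logical indexing with [ ] returns a row of NAs. Temp has no missing values, so the two agree here.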
3.4.5 Data frames: the summary() function
- The summary() function gives you a range of statistics…
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.70 7.40 9.70 9.96 11.50 20.70
… that you could alternatively obtain using the R functions min(), max(), mean(), median() and quantile().
- The summary of a data frame gives the summary of each column:
> summary(airquality)
## Ozone Solar.R Wind Temp Month
## Min. : 1.0 Min. : 7 Min. : 1.70 Min. :56.0 Min. :5.00
## 1st Qu.: 18.0 1st Qu.:116 1st Qu.: 7.40 1st Qu.:72.0 1st Qu.:6.00
## Median : 31.5 Median :205 Median : 9.70 Median :79.0 Median :7.00
## Mean : 42.1 Mean :186 Mean : 9.96 Mean :77.9 Mean :6.99
## 3rd Qu.: 63.2 3rd Qu.:259 3rd Qu.:11.50 3rd Qu.:85.0 3rd Qu.:8.00
## Max. :168.0 Max. :334 Max. :20.70 Max. :97.0 Max. :9.00
## NA's :37 NA's :7
## Day
## Min. : 1.0
## 1st Qu.: 8.0
## Median :16.0
## Mean :15.8
## 3rd Qu.:23.0
## Max. :31.0
##
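The single-column summary at the start of this section matches the Wind column of the full summary, so it was presumably obtained from airquality$Wind (an assumption, since the call is not shown). A sketch of that call next to the individual functions mentioned in the text:

```r
w <- airquality$Wind

# All six statistics at once
s <- summary(w)
s

# The same statistics one at a time
min(w)
max(w)
mean(w)
median(w)
quantile(w, probs = c(0.25, 0.75))  # 1st and 3rd quartile
```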
3.4.6 Data frames: missing values
R uses the special value NA to code missing values. The result of arithmetic involving NAs becomes NA as well.
- We need a special function, is.na(), to filter out NAs.
- To get rid of NAs in a column we can use:
> s <- subset(airquality, !is.na(Ozone))
>
> colMeans(s)
## Ozone Solar.R Wind Temp Month Day
## 42.129 NA 9.862 77.871 7.198 15.534
- Note that the argument na.rm=TRUE can be passed to most summary functions, e.g. sum(), mean(), sd():
## [1] 42.13
## Ozone Solar.R Wind Temp Month Day
## 42.129 185.932 9.958 77.882 6.993 15.804
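The points made in this section can be tried out on a small vector. The sketch below uses an illustrative vector x; the last two calls are an assumption about what produced the two outputs just above.

```r
x <- c(1, NA, 3)

# Arithmetic involving NA gives NA
x + 1
sum(x)

# is.na() flags the missing values; negate it to keep the rest
is.na(x)
x[!is.na(x)]

# Comparing with == does not work: the result is NA everywhere
x == NA

# na.rm = TRUE drops the missing values before computing
sum(x, na.rm = TRUE)
mean(airquality$Ozone, na.rm = TRUE)
colMeans(airquality, na.rm = TRUE)
```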
3.5 Lists
- A list is the most general object type. Elements can be of different types and lengths, e.g:
> list(a=1,b="Lisbon",c=c(1,2,3),d=list(e=matrix(1:4,2), f=function(x)x^2))
## $a
## [1] 1
##
## $b
## [1] "Lisbon"
##
## $c
## [1] 1 2 3
##
## $d
## $d$e
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
##
## $d$f
## function(x)x^2
- The objects returned from many of the built-in functions in R are fairly complicated lists!
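A short sketch of how list elements are accessed, using the list from above stored under the illustrative name l. The lm() fit at the end is just one example of a built-in function that returns a list; the choice of model is an assumption made for illustration.

```r
l <- list(a = 1, b = "Lisbon", c = c(1, 2, 3),
          d = list(e = matrix(1:4, 2), f = function(x) x^2))

# Elements are accessed by name with $ or [[ ]]
l$b
l[["c"]][2]

# Nested elements are reached by chaining
l$d$e[2, 1]
l$d$f(3)

# Single brackets return a (shorter) list, double brackets the element itself
class(l["a"])
class(l[["a"]])

# A fitted linear model is a list with many named components
fit <- lm(Ozone ~ Temp, data = airquality)
is.list(fit)
names(fit)
```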