3 Data structures

In R we have objects which are functions and objects which are data.

  • Function examples:
    • sin()

    • integrate()

    • plot()

    • paste()

  • Data examples:
    • 42

    • 1:5

    • "R"

    • matrix(1:12, nrow=4, ncol=3)

    • data.frame(a=1:5, tmt=c("a","b","a","b","a"))

    • list(x=2, y="abc", z=1:10)

3.1 Vector

> # Vector of numbers, e.g:
> c(1,1.2,pi,exp(1))
## [1] 1.000 1.200 3.142 2.718
> 
> # We can have vectors of other things too, e.g:
> c(TRUE,1==2)
## [1]  TRUE FALSE
> c("a","ab","abc")
## [1] "a"   "ab"  "abc"
> 
> # But not combinations, e.g:
> c("a",5,1==2)
## [1] "a"     "5"     "FALSE"
> # Notice that R just turned everything into characters!
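
The conversion shown above follows a fixed order: logical values are turned into numbers, and numbers are turned into character strings, whenever types are mixed in one vector. A minimal sketch using only base R (the names v1 and v2 are chosen here just for illustration):

```r
# Mixing logical and numeric values gives a numeric vector (TRUE -> 1, FALSE -> 0)
v1 <- c(TRUE, 2, FALSE)
class(v1)    # "numeric"

# Mixing anything with a character string gives a character vector
v2 <- c("a", 5, 1 == 2)
class(v2)    # "character"

# A conversion can also be requested explicitly
as.numeric("5") + 1    # 6
```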

3.1.1 Constructing vectors

> # Integers from 9 to 17
> x<-9:17
> x
## [1]  9 10 11 12 13 14 15 16 17
> 
> # A sequence of 11 numbers from 0 to 1
> y<-seq(0,1,length=11)
> y
##  [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
> 
> # The same number or the same vector several times
> z<-rep(1:2, 5)
> z
##  [1] 1 2 1 2 1 2 1 2 1 2
> 
> # Combine numbers, vectors or both into a new vector
> xz10<-c(x,z,10)
> xz10
##  [1]  9 10 11 12 13 14 15 16 17  1  2  1  2  1  2  1  2  1  2 10
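
The functions seq() and rep() take further arguments that are often handy. The sketch below shows a few of them; the variable names are only illustrative:

```r
# A sequence with a fixed step size instead of a fixed length
s <- seq(0, 1, by = 0.25)

# Repeat each element in turn, rather than the whole vector
r_each <- rep(1:2, each = 3)

# Repeat the elements a different number of times
r_times <- rep(c("a", "b"), times = c(2, 3))

s          # 0.00 0.25 0.50 0.75 1.00
r_each     # 1 1 1 2 2 2
r_times    # "a" "a" "b" "b" "b"
```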

3.1.2 Index and logical index

> # Define a vector with integers from (-5) to 5 and extract the numbers with absolute value less than 3:
> x<- (-5):5
> x
##  [1] -5 -4 -3 -2 -1  0  1  2  3  4  5
> 
> # by their index in the vector:
> x[4:8]
## [1] -2 -1  0  1  2
> 
> # or, by negative selection (set a minus in front of the indices we don't want):
> x[-c(1:3,9:11)]
## [1] -2 -1  0  1  2
> 
> # A logical vector can be defined by:
> index<-abs(x)<3
> index 
##  [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
> 
> # Now this vector can be used to extract the wanted numbers:
> x[index]
## [1] -2 -1  0  1  2
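
A logical index can also be turned into positions, combined with other conditions, or counted. A short sketch on the same vector:

```r
x <- (-5):5

# Positions of the elements that satisfy the condition
which(abs(x) < 3)      # 4 5 6 7 8

# Conditions can be combined with & (and) and | (or).
# Mind the space in "x < -3": written as x<-3 it would assign 3 to x.
x[x < -3 | x > 3]      # -5 -4  4  5

# Number of elements that satisfy the condition (TRUE counts as 1)
sum(abs(x) < 3)        # 5
```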

3.2 Factor

  • A special kind of vector is a factor. It has a known finite set of levels (options), e.g:
> # gl = generate levels
> gl(2,10, labels=c("male", "female"))
##  [1] male   male   male   male   male   male   male   male   male   male  
## [11] female female female female female female female female female female
## Levels: male female
> 
> # One could also do (as.factor() sorts the levels alphabetically, so their order differs):
> as.factor(c(rep("male",10),rep("female",10)))
##  [1] male   male   male   male   male   male   male   male   male   male  
## [11] female female female female female female female female female female
## Levels: female male
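
If the order of the levels matters, it can be stated explicitly with factor(). The following sketch (the names sex and f are illustrative) also shows how to inspect a factor:

```r
sex <- c(rep("male", 10), rep("female", 10))

# factor() lets us choose the level order ourselves
f <- factor(sex, levels = c("male", "female"))

levels(f)        # the set of levels, in the chosen order
table(f)         # number of observations per level
as.integer(f)    # the integer codes stored internally (1 = male, 2 = female)
```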

3.3 Matrix and array

  • Similar to vectors we can have matrices of objects of the same type, e.g:
> matrix(c(1,2,3,4,5,6)+pi,nrow=2)
##       [,1]  [,2]  [,3]
## [1,] 4.142 6.142 8.142
## [2,] 5.142 7.142 9.142
> 
> matrix(c(1,2,3,4,5,6)+pi,nrow=2)<6
##      [,1]  [,2]  [,3]
## [1,] TRUE FALSE FALSE
## [2,] TRUE FALSE FALSE
> 
> # We can create higher order arrays, e.g:
> array(c(1:24), dim=c(4,3,2))
## , , 1
## 
##      [,1] [,2] [,3]
## [1,]    1    5    9
## [2,]    2    6   10
## [3,]    3    7   11
## [4,]    4    8   12
## 
## , , 2
## 
##      [,1] [,2] [,3]
## [1,]   13   17   21
## [2,]   14   18   22
## [3,]   15   19   23
## [4,]   16   20   24
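
Elements of a higher order array are addressed with one index per dimension. A small sketch on the array above, stored here under the illustrative name arr:

```r
arr <- array(1:24, dim = c(4, 3, 2))

dim(arr)         # 4 3 2: four rows, three columns, two layers
arr[2, 3, 1]     # row 2, column 3 of the first layer: 10
arr[, , 2]       # the whole second layer, as a 4 x 3 matrix
```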

3.3.1 Constructing matrices

> 
> # Combine rows into a matrix
> A<-rbind(1:3, c(1,1,2))
> A
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    1    1    2
> 
> # Or columns
> B<-cbind(1:3, c(1,1,2))
> B
##      [,1] [,2]
## [1,]    1    1
## [2,]    2    1
## [3,]    3    2
> 
> # Define a matrix from one long vector
> C<-matrix(c(1,0,0,1,1,0,1,1,1), nrow=3, ncol=3)
> C
##      [,1] [,2] [,3]
## [1,]    1    1    1
## [2,]    0    1    1
## [3,]    0    0    1
> 
> # The matrix can also be filled row by row by adding the argument byrow=TRUE. Try!
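
For comparison, here is a sketch of the same vector filled in both ways (v, C_col and C_row are illustrative names):

```r
v <- c(1, 0, 0, 1, 1, 0, 1, 1, 1)

C_col <- matrix(v, nrow = 3, ncol = 3)                # filled column by column
C_row <- matrix(v, nrow = 3, ncol = 3, byrow = TRUE)  # filled row by row

C_row

# For a square matrix, filling by rows gives the transpose of filling by columns
all(C_row == t(C_col))    # TRUE
```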

3.3.2 Index and logical index

> A<-matrix((-4):5, nrow=2, ncol=5)
> A
##      [,1] [,2] [,3] [,4] [,5]
## [1,]   -4   -2    0    2    4
## [2,]   -3   -1    1    3    5
> 
> 
> # Negative values 
> A[A<0]
## [1] -4 -3 -2 -1
> 
> # Assignments
> A[A<0]<-0
> A
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    0    0    2    4
## [2,]    0    0    1    3    5
> 
> # Matrix rows can be selected by
> A[2,]
## [1] 0 0 1 3 5
> 
> # and similarly for columns
> A[,c(2,4)] 
##      [,1] [,2]
## [1,]    0    2
## [2,]    0    3
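
Note that selecting a single row or column returns a plain vector. A brief sketch of how to keep the matrix structure with the argument drop=FALSE (r and r_mat are illustrative names):

```r
A <- matrix((-4):5, nrow = 2, ncol = 5)

# Selecting a single row returns a plain vector ...
r <- A[2, ]
dim(r)                          # NULL: no dimensions left

# ... unless drop = FALSE is given, which keeps the matrix structure
r_mat <- A[2, , drop = FALSE]
dim(r_mat)                      # 1 5

# Row and column indices can be combined
A[1, c(2, 4)]                   # -2  2
```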

3.3.3 Properties of vectors and matrices

  • The R function mode(), applied to a vector or a matrix, returns the type of the elements it stores:
> A<-matrix(rep(c(TRUE,FALSE),2),nrow=2)
> 
> B<-rnorm(4)
> 
> C<-matrix(LETTERS[1:9],nrow=3)
> 
> A;B;C
##       [,1]  [,2]
## [1,]  TRUE  TRUE
## [2,] FALSE FALSE
## [1] -0.006513 -1.435758  0.353105  1.109455
##      [,1] [,2] [,3]
## [1,] "A"  "D"  "G" 
## [2,] "B"  "E"  "H" 
## [3,] "C"  "F"  "I"
> 
> mode(A); mode(B); mode(C)
## [1] "logical"
## [1] "numeric"
## [1] "character"
  • Vectors and matrices have a length, which is the number of elements:
> x<-matrix(c(NA,2:12),ncol=3)
> x
##      [,1] [,2] [,3]
## [1,]   NA    5    9
## [2,]    2    6   10
## [3,]    3    7   11
## [4,]    4    8   12
> 
> length(x[1,])
## [1] 3
> 
> length(x)
## [1] 12
> 
> # The dimension of a matrix is the number of rows and columns; the number of columns is the second element:
> dim(x); dim(x)[2]
## [1] 4 3
## [1] 3
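
The functions nrow() and ncol() are shortcuts for the two elements of dim(). The sketch below also contrasts mode() with typeof(), which distinguishes integers from doubles:

```r
x <- matrix(c(NA, 2:12), ncol = 3)

nrow(x); ncol(x)                   # same as dim(x)[1] and dim(x)[2]
length(x) == nrow(x) * ncol(x)     # TRUE

# mode() groups integers and doubles as "numeric"; typeof() tells them apart
mode(1:3); typeof(1:3)             # "numeric", "integer"
mode(1.5); typeof(1.5)             # "numeric", "double"
```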

3.3.4 Naming rows and columns in a matrix

  • We can add names to a matrix with the colnames() and rownames() functions:
> x<-matrix(rnorm(12),nrow=4)
> x
##        [,1]    [,2]    [,3]
## [1,] 1.1041  1.3221 -0.4545
## [2,] 0.7094  1.2795  1.3075
## [3,] 1.2753  0.3815 -0.5322
## [4,] 1.3026 -0.2334  0.8438
> 
> colnames(x)<-paste("data",1:3,sep="")
> 
> rownames(x)<-paste("obs",1:4,sep="")
> 
> x
##       data1   data2   data3
## obs1 1.1041  1.3221 -0.4545
## obs2 0.7094  1.2795  1.3075
## obs3 1.2753  0.3815 -0.5322
## obs4 1.3026 -0.2334  0.8438
> 
> y<-matrix(rnorm(15),nrow=5)
> y
##         [,1]    [,2]     [,3]
## [1,]  2.2341  2.9116  1.04936
## [2,]  0.2674 -0.3239  0.55235
## [3,] -1.3993 -0.9896 -0.23531
## [4,] -1.1132 -0.4892 -0.53210
## [5,]  0.2939  1.3329 -0.07947
> 
> colnames(y)<-LETTERS[1:ncol(y)]
> 
> rownames(y)<-letters[1:nrow(y)]
> 
> y
##         A       B        C
## a  2.2341  2.9116  1.04936
## b  0.2674 -0.3239  0.55235
## c -1.3993 -0.9896 -0.23531
## d -1.1132 -0.4892 -0.53210
## e  0.2939  1.3329 -0.07947
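
Once a matrix has names, they can be used for indexing in place of numbers. A sketch under the assumption of an arbitrary seed (set.seed(1) is added here only to make the random numbers reproducible, so the values differ from those printed above):

```r
set.seed(1)  # arbitrary seed, not part of the original example
y <- matrix(rnorm(15), nrow = 5)
colnames(y) <- LETTERS[1:ncol(y)]
rownames(y) <- letters[1:nrow(y)]

y["b", "C"]          # a single element, by row and column name
y[c("a", "e"), ]     # two rows, all columns
y[, "A"]             # one column, returned as a named vector

dimnames(y)          # both sets of names at once, as a list
```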

3.3.5 Matrix multiplication

> M<-matrix(rnorm(20),nrow=4,ncol=5)
> N<-matrix(rnorm(15),nrow=5,ncol=3)
> 
> M%*%N
##         [,1]    [,2]    [,3]
## [1,]  7.7378 -3.1252  0.6263
## [2,] -0.3942  2.7825  1.3672
## [3,]  2.7417 -0.9445  2.6622
## [4,] -1.6990  1.0751 -2.3093
> 
> # Can we compute N%*%M? No! N (5 x 3) and M (4 x 5) are not conformable. Try to run:
> # N%*%M
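
The rule behind this is that the number of columns of the left matrix must equal the number of rows of the right matrix. A sketch that checks the rule and catches the error with try() (the seed is arbitrary and only added for reproducibility):

```r
set.seed(1)  # arbitrary seed, not part of the original example
M <- matrix(rnorm(20), nrow = 4, ncol = 5)
N <- matrix(rnorm(15), nrow = 5, ncol = 3)

# M %*% N is defined because ncol(M) equals nrow(N)
ncol(M) == nrow(N)      # TRUE
dim(M %*% N)            # 4 3

# N %*% M is not defined because ncol(N) differs from nrow(M);
# try() catches the error so that a script keeps running
res <- try(N %*% M, silent = TRUE)
class(res)              # "try-error"

# Note that * is element-wise multiplication, not matrix multiplication
dim(M * M)              # 4 5
```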

3.3.6 Additional functions

> M<-matrix(rnorm(16),nrow=4,ncol=4)
> 
> dim(M)
## [1] 4 4
> 
> t(M)
##          [,1]    [,2]    [,3]    [,4]
## [1,]  0.05753  0.7945 -0.1886 -0.6634
## [2,] -1.40129 -0.2262  0.4228 -2.0906
## [3,] -1.22525 -1.1664  0.6378  0.1316
## [4,] -0.87634 -0.1150 -0.3167  0.6809
> 
> det(M)
## [1] 0.7816
> 
> (invM <- solve(M))
##          [,1]    [,2]   [,3]     [,4]
## [1,] -0.90080  2.9977  3.546  0.99662
## [2,]  0.08001 -0.9423 -1.422 -0.71750
## [3,] -0.57779  1.3912  2.835  0.81010
## [4,] -0.52035 -0.2415 -1.458  0.08008
> 
> eigen(M)
## eigen() decomposition
## $values
## [1]  0.8602+0.4772i  0.8602-0.4772i -0.2852+0.8523i -0.2852-0.8523i
## 
## $vectors
##                   [,1]              [,2]            [,3]            [,4]
## [1,] -0.280724+0.1135i -0.280724-0.1135i -0.6954+0.0000i -0.6954+0.0000i
## [2,] -0.003257-0.2056i -0.003257+0.2056i  0.2166+0.3230i  0.2166-0.3230i
## [3,] -0.352944+0.2702i -0.352944-0.2702i -0.1923-0.1829i -0.1923+0.1829i
## [4,]  0.817598+0.0000i  0.817598+0.0000i -0.3494+0.4156i -0.3494-0.4156i
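
A quick way to convince oneself that solve() returned the inverse is to multiply it back. The sketch below does this and also uses solve() with a right-hand side; the seed and the vector b are assumptions made for illustration, so the numbers differ from those printed above:

```r
set.seed(1)  # arbitrary seed, not part of the original example
M <- matrix(rnorm(16), nrow = 4, ncol = 4)
invM <- solve(M)

# The product should be the identity matrix, up to rounding error
round(M %*% invM, 10)
all.equal(M %*% invM, diag(4))

# The determinant of the transpose equals the determinant of the matrix
all.equal(det(t(M)), det(M))

# solve(M, b) solves the linear system M x = b without forming the inverse
b <- c(1, 2, 3, 4)
x <- solve(M, b)
all.equal(as.vector(M %*% x), b)
```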

3.4 Data-frame

  • A special data object is the data frame (class data.frame). We can create data frames by reading data from files or by calling the function data.frame() on a set of vectors. A data frame is a set of parallel vectors of equal length, where the vectors can be of different types, e.g:
> MAS <- data.frame(course=c("CTA","PSP","RM"), hours=c(39,65,52))
> MAS
##   course hours
## 1    CTA    39
## 2    PSP    65
## 3     RM    52
> # Compare to a matrix
> cbind(course=c("CTA","PSP","RM"), hours=c(39,65,52))
##      course hours
## [1,] "CTA"  "39" 
## [2,] "PSP"  "65" 
## [3,] "RM"   "52"
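
The practical consequence is that each column of a data frame keeps its own type, whereas the matrix stores everything as character. A sketch, assuming R 4.0 or later (where strings are not converted to factors by default); the name m is illustrative:

```r
MAS <- data.frame(course = c("CTA", "PSP", "RM"), hours = c(39, 65, 52))

# Each column keeps its own type
sapply(MAS, class)

# so arithmetic on the numeric column works directly
mean(MAS$hours)    # 52

# In the matrix everything was converted to character
m <- cbind(course = c("CTA", "PSP", "RM"), hours = c(39, 65, 52))
mode(m)            # "character"
```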

3.4.1 Data frames: adding and removing columns

> dat <- data.frame(x=LETTERS[1:3], y=1:3)
> dat
##   x y
## 1 A 1
## 2 B 2
## 3 C 3
> dat[,1]
## [1] "A" "B" "C"
> dat$x
## [1] "A" "B" "C"
> # It is simple to add or remove a column:
> 
> dat$z <- dat$y^2
> dat$name <- c("A1", "A2", "A3")
> dat$y<-NULL
> dat
##   x z name
## 1 A 1   A1
## 2 B 4   A2
## 3 C 9   A3
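
Columns can also be addressed by name with [[ ]] or [ , ], which is convenient when the column name is stored in a variable. A sketch of the same steps written this way:

```r
dat <- data.frame(x = LETTERS[1:3], y = 1:3)

# Add columns by name
dat[["z"]] <- dat$y^2
dat[, "name"] <- paste0("A", 1:3)

# Remove a column by keeping all the others
dat <- dat[, names(dat) != "y"]

names(dat)    # "x" "z" "name"
```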

3.4.2 Data frames: merging data frames

> df1 <- data.frame(course=c("CTA","PSP","RM"), hours=c(39,65,52))
> df1
##   course hours
## 1    CTA    39
## 2    PSP    65
## 3     RM    52
> df2 <- data.frame(course=c("RM","CTA","PSP"), credits=c(6,4,8))
> df2
##   course credits
## 1     RM       6
## 2    CTA       4
## 3    PSP       8
> # We can merge that information into one data set by:
> 
> df12 <- merge(df1, df2, by="course")
> df12
##   course hours credits
## 1    CTA    39       4
## 2    PSP    65       8
## 3     RM    52       6
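
By default merge() keeps only the rows whose key occurs in both data frames. The sketch below uses a hypothetical second table, df3, in which "XYZ" is an invented course and "RM" is missing, to show the arguments all and all.x:

```r
df1 <- data.frame(course = c("CTA", "PSP", "RM"), hours = c(39, 65, 52))
# Hypothetical table for illustration: "XYZ" is invented, "RM" is left out
df3 <- data.frame(course = c("CTA", "PSP", "XYZ"), credits = c(4, 8, 5))

# Default: keep only the courses found in both data frames
inner <- merge(df1, df3, by = "course")

# all = TRUE keeps every course and fills the gaps with NA
full <- merge(df1, df3, by = "course", all = TRUE)

# all.x = TRUE keeps every row of the first data frame only
left <- merge(df1, df3, by = "course", all.x = TRUE)

nrow(inner); nrow(full); nrow(left)    # 2, 4, 3
```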

3.4.3 Data frames: getting dimension, column info and others

> df <- airquality
> 
> names(df)
## [1] "Ozone"   "Solar.R" "Wind"    "Temp"    "Month"   "Day"
> class(df$Ozone)
## [1] "integer"
> class(df$Wind)
## [1] "numeric"
> dim(df)
## [1] 153   6
> nrow(df)
## [1] 153
> ncol(df)
## [1] 6
> # Get an overview of the object structure:
> 
> str(df)
## 'data.frame':    153 obs. of  6 variables:
##  $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
##  $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
##  $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
##  $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
##  $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...
> # First rows of a data frame:
> 
> head(airquality, 3)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
> head(airquality, 10)
##    Ozone Solar.R Wind Temp Month Day
## 1     41     190  7.4   67     5   1
## 2     36     118  8.0   72     5   2
## 3     12     149 12.6   74     5   3
## 4     18     313 11.5   62     5   4
## 5     NA      NA 14.3   56     5   5
## 6     28      NA 14.9   66     5   6
## 7     23     299  8.6   65     5   7
## 8     19      99 13.8   59     5   8
## 9      8      19 20.1   61     5   9
## 10    NA     194  8.6   69     5  10
> # Last rows of a data frame:
> 
> tail(airquality, 3)
##     Ozone Solar.R Wind Temp Month Day
## 151    14     191 14.3   75     9  28
## 152    18     131  8.0   76     9  29
## 153    20     223 11.5   68     9  30
> tail(airquality, 9)
##     Ozone Solar.R Wind Temp Month Day
## 145    23      14  9.2   71     9  22
## 146    36     139 10.3   81     9  23
## 147     7      49 10.3   69     9  24
## 148    14      20 16.6   63     9  25
## 149    30     193  6.9   70     9  26
## 150    NA     145 13.2   77     9  27
## 151    14     191 14.3   75     9  28
## 152    18     131  8.0   76     9  29
## 153    20     223 11.5   68     9  30

3.4.4 Data frames: the subset() function

  • Let’s look at the airquality data again:
> head(airquality, 3)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
  • Logical indexing applies to data frames:
> datA <- airquality[airquality$Temp>80,c("Ozone","Temp")]
  • … but a neat function is built in for making subsets of data:
> (datA <- subset(airquality, Temp > 80, select = c(Ozone, Temp)))
##     Ozone Temp
## 29     45   81
## 35     NA   84
## 36     NA   85
## 38     29   82
## 39     NA   87
## 40     71   90
## 41     39   87
## 42     NA   93
## 43     NA   92
## 44     23   82
## 61     NA   83
## 62    135   84
## 63     49   85
## 64     32   81
## 65     NA   84
## 66     64   83
## 67     40   83
## 68     77   88
## 69     97   92
## 70     97   92
## 71     85   89
## 72     NA   82
## 74     27   81
## 75     NA   91
## 77     48   81
## 78     35   82
## 79     61   84
## 80     79   87
## 81     63   85
## 83     NA   81
## 84     NA   82
## 85     80   86
## 86    108   85
## 87     20   82
## 88     52   86
## 89     82   88
## 90     50   86
## 91     64   83
## 92     59   81
## 93     39   81
## 94      9   81
## 95     16   82
## 96     78   86
## 97     35   85
## 98     66   87
## 99    122   89
## 100    89   90
## 101   110   90
## 102    NA   92
## 103    NA   86
## 104    44   86
## 105    28   82
## 117   168   81
## 118    73   86
## 119    NA   88
## 120    76   97
## 121   118   94
## 122    84   96
## 123    85   94
## 124    96   91
## 125    78   92
## 126    73   93
## 127    91   93
## 128    47   87
## 129    32   84
## 134    44   81
## 143    16   82
## 146    36   81
> (datB <- subset(airquality, Day == 1, select = -Temp))
##     Ozone Solar.R Wind Month Day
## 1      41     190  7.4     5   1
## 32     NA     286  8.6     6   1
## 62    135     269  4.1     7   1
## 93     39      83  6.9     8   1
## 124    96     167  6.9     9   1
> (datC <- subset(airquality, select = Ozone:Wind))
##     Ozone Solar.R Wind
## 1      41     190  7.4
## 2      36     118  8.0
## 3      12     149 12.6
## 4      18     313 11.5
## 5      NA      NA 14.3
## 6      28      NA 14.9
## 7      23     299  8.6
## 8      19      99 13.8
## 9       8      19 20.1
## 10     NA     194  8.6
## 11      7      NA  6.9
## 12     16     256  9.7
## 13     11     290  9.2
## 14     14     274 10.9
## 15     18      65 13.2
## 16     14     334 11.5
## 17     34     307 12.0
## 18      6      78 18.4
## 19     30     322 11.5
## 20     11      44  9.7
## 21      1       8  9.7
## 22     11     320 16.6
## 23      4      25  9.7
## 24     32      92 12.0
## 25     NA      66 16.6
## 26     NA     266 14.9
## 27     NA      NA  8.0
## 28     23      13 12.0
## 29     45     252 14.9
## 30    115     223  5.7
## 31     37     279  7.4
## 32     NA     286  8.6
## 33     NA     287  9.7
## 34     NA     242 16.1
## 35     NA     186  9.2
## 36     NA     220  8.6
## 37     NA     264 14.3
## 38     29     127  9.7
## 39     NA     273  6.9
## 40     71     291 13.8
## 41     39     323 11.5
## 42     NA     259 10.9
## 43     NA     250  9.2
## 44     23     148  8.0
## 45     NA     332 13.8
## 46     NA     322 11.5
## 47     21     191 14.9
## 48     37     284 20.7
## 49     20      37  9.2
## 50     12     120 11.5
## 51     13     137 10.3
## 52     NA     150  6.3
## 53     NA      59  1.7
## 54     NA      91  4.6
## 55     NA     250  6.3
## 56     NA     135  8.0
## 57     NA     127  8.0
## 58     NA      47 10.3
## 59     NA      98 11.5
## 60     NA      31 14.9
## 61     NA     138  8.0
## 62    135     269  4.1
## 63     49     248  9.2
## 64     32     236  9.2
## 65     NA     101 10.9
## 66     64     175  4.6
## 67     40     314 10.9
## 68     77     276  5.1
## 69     97     267  6.3
## 70     97     272  5.7
## 71     85     175  7.4
## 72     NA     139  8.6
## 73     10     264 14.3
## 74     27     175 14.9
## 75     NA     291 14.9
## 76      7      48 14.3
## 77     48     260  6.9
## 78     35     274 10.3
## 79     61     285  6.3
## 80     79     187  5.1
## 81     63     220 11.5
## 82     16       7  6.9
## 83     NA     258  9.7
## 84     NA     295 11.5
## 85     80     294  8.6
## 86    108     223  8.0
## 87     20      81  8.6
## 88     52      82 12.0
## 89     82     213  7.4
## 90     50     275  7.4
## 91     64     253  7.4
## 92     59     254  9.2
## 93     39      83  6.9
## 94      9      24 13.8
## 95     16      77  7.4
## 96     78      NA  6.9
## 97     35      NA  7.4
## 98     66      NA  4.6
## 99    122     255  4.0
## 100    89     229 10.3
## 101   110     207  8.0
## 102    NA     222  8.6
## 103    NA     137 11.5
## 104    44     192 11.5
## 105    28     273 11.5
## 106    65     157  9.7
## 107    NA      64 11.5
## 108    22      71 10.3
## 109    59      51  6.3
## 110    23     115  7.4
## 111    31     244 10.9
## 112    44     190 10.3
## 113    21     259 15.5
## 114     9      36 14.3
## 115    NA     255 12.6
## 116    45     212  9.7
## 117   168     238  3.4
## 118    73     215  8.0
## 119    NA     153  5.7
## 120    76     203  9.7
## 121   118     225  2.3
## 122    84     237  6.3
## 123    85     188  6.3
## 124    96     167  6.9
## 125    78     197  5.1
## 126    73     183  2.8
## 127    91     189  4.6
## 128    47      95  7.4
## 129    32      92 15.5
## 130    20     252 10.9
## 131    23     220 10.3
## 132    21     230 10.9
## 133    24     259  9.7
## 134    44     236 14.9
## 135    21     259 15.5
## 136    28     238  6.3
## 137     9      24 10.9
## 138    13     112 11.5
## 139    46     237  6.9
## 140    18     224 13.8
## 141    13      27 10.3
## 142    24     238 10.3
## 143    16     201  8.0
## 144    13     238 12.6
## 145    23      14  9.2
## 146    36     139 10.3
## 147     7      49 10.3
## 148    14      20 16.6
## 149    30     193  6.9
## 150    NA     145 13.2
## 151    14     191 14.3
## 152    18     131  8.0
## 153    20     223 11.5
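
Logical indexing and subset() agree for Temp, which has no missing values, but they treat NAs in the condition differently. A sketch using Ozone, which does contain NAs (the threshold 100 and the object names are chosen only for illustration):

```r
# Logical indexing: rows where the condition is NA come back as rows of NAs
by_index <- airquality[airquality$Ozone > 100, c("Ozone", "Temp")]

# subset(): rows where the condition is NA are dropped
by_subset <- subset(airquality, Ozone > 100, select = c(Ozone, Temp))

nrow(by_index)     # matching rows plus one row per NA in Ozone
nrow(by_subset)    # matching rows only

# Wrapping the condition in which() selects the same rows as subset()
by_which <- airquality[which(airquality$Ozone > 100), c("Ozone", "Temp")]
identical(rownames(by_which), rownames(by_subset))
```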

3.4.5 Data frames: the summary() function

  • The summary() function gives you a range of statistics…
> summary(airquality$Wind)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.70    7.40    9.70    9.96   11.50   20.70
  • … that you could alternatively obtain using the R functions min(), max(), mean(), median(), quantile().

  • The summary of a data frame gives the summary of each column:

> summary(airquality)
##      Ozone          Solar.R         Wind            Temp          Month     
##  Min.   :  1.0   Min.   :  7   Min.   : 1.70   Min.   :56.0   Min.   :5.00  
##  1st Qu.: 18.0   1st Qu.:116   1st Qu.: 7.40   1st Qu.:72.0   1st Qu.:6.00  
##  Median : 31.5   Median :205   Median : 9.70   Median :79.0   Median :7.00  
##  Mean   : 42.1   Mean   :186   Mean   : 9.96   Mean   :77.9   Mean   :6.99  
##  3rd Qu.: 63.2   3rd Qu.:259   3rd Qu.:11.50   3rd Qu.:85.0   3rd Qu.:8.00  
##  Max.   :168.0   Max.   :334   Max.   :20.70   Max.   :97.0   Max.   :9.00  
##  NA's   :37      NA's   :7                                                  
##       Day      
##  Min.   : 1.0  
##  1st Qu.: 8.0  
##  Median :16.0  
##  Mean   :15.8  
##  3rd Qu.:23.0  
##  Max.   :31.0  
## 
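
The individual functions mentioned above can be applied one by one. A short sketch on the Wind column (stored under the illustrative name w):

```r
w <- airquality$Wind

min(w); max(w)
mean(w); median(w)

# quantile() returns the minimum, the quartiles and the maximum by default
quantile(w)

# Specific quantiles are requested through the probs argument
quantile(w, probs = c(0.25, 0.75))
```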

3.4.6 Data frames: missing values

  • R uses the special value NA to code missing values.

  • The result of arithmetic involving NAs becomes NA as well:

> colMeans(airquality)
##   Ozone Solar.R    Wind    Temp   Month     Day 
##      NA      NA   9.958  77.882   6.993  15.804
  • We need the special function is.na() to detect NAs:
> is.na(NA)
## [1] TRUE
  • To drop the rows where a given column is NA we can use (note that Solar.R still contains NAs afterwards):
> s <- subset(airquality, !is.na(Ozone))
> 
> colMeans(s)
##   Ozone Solar.R    Wind    Temp   Month     Day 
##  42.129      NA   9.862  77.871   7.198  15.534
  • Note that the argument na.rm=TRUE can be passed to most summary functions e.g. sum(), mean(), sd():
> mean(airquality$Ozone, na.rm=TRUE)
## [1] 42.13
> # or
> 
> colMeans(airquality,na.rm=TRUE)
##   Ozone Solar.R    Wind    Temp   Month     Day 
##  42.129 185.932   9.958  77.882   6.993  15.804
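
The reason is.na() is needed is that a comparison with NA is itself NA. The sketch below shows this, counts the missing values per column, and keeps only the complete rows (the name complete is illustrative):

```r
# Comparing with NA using == gives NA, not TRUE or FALSE
NA == NA
c(1, NA, 3) == NA

# is.na() gives a proper logical vector
is.na(c(1, NA, 3))             # FALSE TRUE FALSE

# Number of missing values in each column
colSums(is.na(airquality))

# Keep only the rows without any missing value
complete <- airquality[complete.cases(airquality), ]
nrow(complete)
```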

3.5 Lists

  • A list is the most general object type. Elements can be of different types and lengths, e.g:
> list(a=1,b="Lisbon",c=c(1,2,3),d=list(e=matrix(1:4,2), f=function(x)x^2))
## $a
## [1] 1
## 
## $b
## [1] "Lisbon"
## 
## $c
## [1] 1 2 3
## 
## $d
## $d$e
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
## 
## $d$f
## function(x)x^2
  • The objects returned from many of the built-in functions in R are fairly complicated lists!
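
Elements of a list are extracted with $ or [[ ]], while single brackets return a shorter list. The sketch below stores the list above under the illustrative name lst and then looks at a fitted linear model as one example of a list returned by a built-in function (the model Ozone ~ Temp is chosen only for illustration):

```r
lst <- list(a = 1, b = "Lisbon", c = c(1, 2, 3),
            d = list(e = matrix(1:4, 2), f = function(x) x^2))

lst$b              # element by name: "Lisbon"
lst[["c"]][2]      # second entry of the vector stored in c: 2
lst$d$f(3)         # call the function stored in the nested list: 9
class(lst["a"])    # single brackets return a list: "list"

# A fitted linear model is a list with many components
fit <- lm(Ozone ~ Temp, data = airquality)
is.list(fit)       # TRUE
names(fit)
```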