4 Reading in data

  • R can import data from (but not only):

    • Excel
    • SAS
    • SPSS
    • STATA
  • Typically, is easier to install a package to read in data. For instance, the package Hmisc contains functions that handles import from SPSS.

4.1 Reading data from user input

  • Frequently, data is collected in white space separated columns, where the first line indicate the variable name (try to use the scan() function):
> # Read 7 items for each x and y:
> # x <- scan()
> # y <- scan()

4.2 The R working directory

  • R has a search path, the R working directory, where it stores its workspace and look for files. You can locate the working directory with the “get working directory” command:
> getwd()
> # will show your own working directory
  • The working directory can be changed with the “set working directory” command:
> setwd("c:/otherdir")
  • For files stored in the working directory or subfolders, you can just specify the path from the working directory when reading them. Example: if the data is located in the “Data” folder in your working directory, write:
> mydat<-read.table("Data/filename.xyz")

4.3 Reading data from a text file (*.dat)

  • The function read.table() is designed to read this format:
> mydat <- read.table("InFiles/filename.dat", header = TRUE)
> # mydat <- read.table("use your own working directory", header = TRUE)
> mydat
##   x1  x2   x3
## 1  2 0.3 0.01
## 2  2 1.0 0.11
## 3  3 2.1 0.04
## 4  3 2.2 0.02
## 5  1 0.1 0.10
## 6  1 0.2 0.60
> str(mydat)
## 'data.frame':    6 obs. of  3 variables:
##  $ x1: int  2 2 3 3 1 1
##  $ x2: num  0.3 1 2.1 2.2 0.1 0.2
##  $ x3: num  0.01 0.11 0.04 0.02 0.1 0.6
  • The read.table() function has a lot of optional arguments:
> args(read.table)
## function (file, header = FALSE, sep = "", quote = "\"'", dec = ".", 
##     numerals = c("allow.loss", "warn.loss", "no.loss"), row.names, 
##     col.names, as.is = !stringsAsFactors, tryLogical = TRUE, 
##     na.strings = "NA", colClasses = NA, nrows = -1, skip = 0, 
##     check.names = TRUE, fill = !blank.lines.skip, strip.white = FALSE, 
##     blank.lines.skip = TRUE, comment.char = "#", allowEscapes = FALSE, 
##     flush = FALSE, stringsAsFactors = FALSE, fileEncoding = "", 
##     encoding = "unknown", text, skipNul = FALSE) 
## NULL
  • Some of the important ones are:

    • header: is the first line variable names or not?
    • sep: what character is used to separate the columns?
    • dec: what character is used as decimal separator?
    • nrows: how many rows do we want to read?
    • na.strings: what string represent a missing value?
    • skip: how many lines to skip before start reading?
    • comment.char: what character in the beginning of a line should indicate that the line should be skipped?
  • Consider the data file testdat1.dat and open it with Excel. Read the file with attention.

  • Consider now the same file imported to R:

> dat<-read.table("InFiles/testdat1.dat", header=TRUE, skip=5, nrow=2)
> dat
##   a b c
## 1 1 2 3
## 2 4 5 6
  • Consider the data file testdat2.dat and open it with Excel. Read the file with attention.

  • Consider now the same file imported to R:

> dat2<-read.table("InFiles/testdat2.dat", header=TRUE, na.strings=".", comment.char=";", dec=",")
> dat2
##   A   B  C
## 1 1 2.0  3
## 2 4 3.2  2
## 3 1 5.0 NA
## 4 5 4.0  6

4.4 Variants of read.table()

  • Other functions which are useful for reading data frames from files are:

    • read.csv(): comma separated, dot as decimal point
    • read.csv2(): semicolon separated, comma as decimal point
    • read.fwf(): fixed width format
  • read.csv() and read.csv2() are adapted to Excel tables saved as csv files. Which one you need to use depends on your system’s regional settings.

  • Additional arguments are similar to those of read.table().

  • Eg:

> read.csv("file", header=TRUE, sep=";", dec=",")

4.5 Reading text files from Excel

  • Install package readxl and use the read_excel function.
> library(readxl)
> my_excel_data <- read_excel("InFiles/Babies.xlsx", sheet= "Babies")
> my_excel_data <- as.data.frame(my_excel_data)
> my_excel_data
##     Age Male babies Female babies
## 1  1841       40.15         42.25
## 2  1842       40.07         41.92
## 3  1843       40.59         42.62
## 4  1844       40.20         42.28
## 5  1845       41.12         43.22
## 6  1846       39.08         41.22
## 7  1847       37.46         39.57
## 8  1848       38.81         40.97
## 9  1849       36.82         38.54
## 10 1850       41.83         43.71
## 11 1851       39.97         41.93
## 12 1852       39.35         41.47
## 13 1853       38.86         41.11
## 14 1854       38.52         40.44
## 15 1855       39.62         41.86
## 16 1856       41.36         43.58
## 17 1857       39.99         41.86
## 18 1858       38.59         40.50
## 19 1859       39.44         41.37
## 20 1860       40.76         43.13
## 21 1861       40.45         42.79
## 22 1862       40.94         43.32
## 23 1863       39.19         41.55
## 24 1864       38.37         40.80
## 25 1865       38.40         41.10
## 26 1866       38.77         41.40
## 27 1867       40.54         43.45
## 28 1868       40.37         43.03
## 29 1869       39.94         42.74
## 30 1870       39.16         42.05
## 31 1871       39.62         42.65
## 32 1872       41.17         44.27
## 33 1873       41.66         44.92
## 34 1874       40.62         43.59
## 35 1875       39.91         42.99
## 36 1876       40.96         44.37
## 37 1877       41.93         45.45
## 38 1878       40.45         43.63
## 39 1879       41.88         45.13
  • Other possibility:
> my_excel_data <- read_excel(file.choose())
> 
> # Try this one using the babies file!