4 Reading in data
R can import data from (but not only):
- Excel
- SAS
- SPSS
- STATA
- …
Typically, is easier to install a package to read in data. For instance, the package
Hmisc
contains functions that handles import from SPSS.
4.1 Reading data from user input
- Frequently, data is collected in white space separated columns, where the first line indicate the variable name (try to use the
scan()
function):
4.2 The R working directory
- R has a search path, the R working directory, where it stores its workspace and look for files. You can locate the working directory with the “get working directory” command:
- The working directory can be changed with the “set working directory” command:
- For files stored in the working directory or subfolders, you can just specify the path from the working directory when reading them. Example: if the data is located in the “Data” folder in your working directory, write:
4.3 Reading data from a text file (*.dat)
- The function
read.table()
is designed to read this format:
> mydat <- read.table("InFiles/filename.dat", header = TRUE)
> # mydat <- read.table("use your own working directory", header = TRUE)
> mydat
## x1 x2 x3
## 1 2 0.3 0.01
## 2 2 1.0 0.11
## 3 3 2.1 0.04
## 4 3 2.2 0.02
## 5 1 0.1 0.10
## 6 1 0.2 0.60
> str(mydat)
## 'data.frame': 6 obs. of 3 variables:
## $ x1: int 2 2 3 3 1 1
## $ x2: num 0.3 1 2.1 2.2 0.1 0.2
## $ x3: num 0.01 0.11 0.04 0.02 0.1 0.6
- The
read.table()
function has a lot of optional arguments:
> args(read.table)
## function (file, header = FALSE, sep = "", quote = "\"'", dec = ".",
## numerals = c("allow.loss", "warn.loss", "no.loss"), row.names,
## col.names, as.is = !stringsAsFactors, tryLogical = TRUE,
## na.strings = "NA", colClasses = NA, nrows = -1, skip = 0,
## check.names = TRUE, fill = !blank.lines.skip, strip.white = FALSE,
## blank.lines.skip = TRUE, comment.char = "#", allowEscapes = FALSE,
## flush = FALSE, stringsAsFactors = FALSE, fileEncoding = "",
## encoding = "unknown", text, skipNul = FALSE)
## NULL
Some of the important ones are:
header
: is the first line variable names or not?sep
: what character is used to separate the columns?dec
: what character is used as decimal separator?nrows
: how many rows do we want to read?na.strings
: what string represent a missing value?skip
: how many lines to skip before start reading?comment.char
: what character in the beginning of a line should indicate that the line should be skipped?
Consider the data file
testdat1.dat
and open it with Excel. Read the file with attention.Consider now the same file imported to R:
> dat<-read.table("InFiles/testdat1.dat", header=TRUE, skip=5, nrow=2)
> dat
## a b c
## 1 1 2 3
## 2 4 5 6
Consider the data file
testdat2.dat
and open it with Excel. Read the file with attention.Consider now the same file imported to R:
4.4 Variants of read.table()
Other functions which are useful for reading data frames from files are:
read.csv()
: comma separated, dot as decimal pointread.csv2()
: semicolon separated, comma as decimal pointread.fwf()
: fixed width format
read.csv()
andread.csv2()
are adapted to Excel tables saved as csv files. Which one you need to use depends on your system’s regional settings.Additional arguments are similar to those of
read.table()
.Eg:
4.5 Reading text files from Excel
- Install package
readxl
and use theread_excel
function.
> library(readxl)
> my_excel_data <- read_excel("InFiles/Babies.xlsx", sheet= "Babies")
> my_excel_data <- as.data.frame(my_excel_data)
> my_excel_data
## Age Male babies Female babies
## 1 1841 40.15 42.25
## 2 1842 40.07 41.92
## 3 1843 40.59 42.62
## 4 1844 40.20 42.28
## 5 1845 41.12 43.22
## 6 1846 39.08 41.22
## 7 1847 37.46 39.57
## 8 1848 38.81 40.97
## 9 1849 36.82 38.54
## 10 1850 41.83 43.71
## 11 1851 39.97 41.93
## 12 1852 39.35 41.47
## 13 1853 38.86 41.11
## 14 1854 38.52 40.44
## 15 1855 39.62 41.86
## 16 1856 41.36 43.58
## 17 1857 39.99 41.86
## 18 1858 38.59 40.50
## 19 1859 39.44 41.37
## 20 1860 40.76 43.13
## 21 1861 40.45 42.79
## 22 1862 40.94 43.32
## 23 1863 39.19 41.55
## 24 1864 38.37 40.80
## 25 1865 38.40 41.10
## 26 1866 38.77 41.40
## 27 1867 40.54 43.45
## 28 1868 40.37 43.03
## 29 1869 39.94 42.74
## 30 1870 39.16 42.05
## 31 1871 39.62 42.65
## 32 1872 41.17 44.27
## 33 1873 41.66 44.92
## 34 1874 40.62 43.59
## 35 1875 39.91 42.99
## 36 1876 40.96 44.37
## 37 1877 41.93 45.45
## 38 1878 40.45 43.63
## 39 1879 41.88 45.13
- Other possibility: