3 Working With Data in R
In this chapter we will apply what we learned in the previous section to real data. We will learn how to read and write data in R and manipulate the data, using native functions from R.
3.1 Setting the Working Directory
When you open R and RStudio it will work in a default directory. So, before we can load data, we have to set our working directory, in which R looks for our data files.
To see our current working directory, type:getwd()
. We can change or set our working director using the setwd("")
function. When running R on a Windows machine, make sure to change the slashes to forward slashes “/
”, or escape the backwards slashes to set a path like "C:\\Users\\Name\\"
.
3.2 Clearing Objects
As a next step, you may want to start each R-script with a clean slate (e.g. “clear all” at the beginning of STATA do-file).
To empty the memory and remove all objects, type:
3.3 Saving and Loading R Scripts and Native R Files
You can save your current R script (*.R
) by clicking on the save
button in the top left of RStudio. To save one or multiple objects (e.g. data frames) from your current work space as an *.Rdata
file, type save()
:
x <- 1:10
save(x, file = "X.RData")
To load them again, use
load(file = "X.RData")
x
#> [1] 1 2 3 4 5 6 7 8 9 10
We strongly recommend to NEVER overwrite your original data source!
3.4 Loading Data from CSV Files
We usually begin by loading data from some format. In this course, we will mostly use CSV files, but R can read a large variety of formats (.dta, .shp, .txt, .nc, etc.). To get started, we load the following CSV file directly from the web, that contains information on movie ratings:
3.5 Diagnostics in R
str()
reports the internal structure of an object. The argument vec.len = 1
controls the length
of the printed output.
str(df[,1:3], vec.len=1)
#> 'data.frame': 62 obs. of 3 variables:
#> $ BOX : int 19167085 63106589 ...
#> $ MPRATING: int 4 2 ...
#> $ BUDGET : num 28 150 ...
The head()
function displays the first few (5 by default) rows of an object and is particularly useful,
when you work with larger datasets.
head(df,3)
#> BOX MPRATING BUDGET STARPOWR SEQUEL
#> 1 19167085 4 28.0 19.83 0
#> 2 63106589 2 150.0 32.69 1
#> 3 5401605 4 37.4 15.69 0
The summary()
function reports the summary statistics of an object.
summary(df[,1:3])
#> BOX MPRATING BUDGET
#> Min. : 511920 Min. :1.000 Min. : 5.00
#> 1st Qu.: 6956492 1st Qu.:2.000 1st Qu.: 30.50
#> Median :16930926 Median :3.000 Median : 37.40
#> Mean :20720651 Mean :2.968 Mean : 53.29
#> 3rd Qu.:26696144 3rd Qu.:4.000 3rd Qu.: 60.00
#> Max. :70950500 Max. :4.000 Max. :200.00
The # View()
functions opens the dataset for you. Only use this, if the data is not very big.
# View(df[,1:3])
Exercise 7: Problems when Loading Data
Load
test.csv
in the exercise folder.Inspect the objects using some commands we just applied. Try
str()
,head()
,summary()
and# View()
.Only looking at the first two columns: What kind of problems do you encounter and how to solve the issue? Hint: Go to the helpfile of
read.csv()
and watch out for thedec
argument.The third column is a character vector. Use the following line to convert the entries to numerical units
df$d <- as.numeric(df$c)
- Calculate the sum of all entries in the third column. Did you expect the result?
Answer
## Problems when loading data
### 1.
csv_file <- "data/test.csv"
df <- read.csv(csv_file)
### 2.
str(df)
#> 'data.frame': 3 obs. of 3 variables:
#> $ a: chr "1,2" "1,4" "1,8"
#> $ b: chr "2,2" "2,4" NA
#> $ c: chr "3,2" "3,4" "."
head(df)
#> a b c
#> 1 1,2 2,2 3,2
#> 2 1,4 2,4 3,4
#> 3 1,8 <NA> .
summary(df)
#> a b c
#> Length:3 Length:3 Length:3
#> Class :character Class :character Class :character
#> Mode :character Mode :character Mode :character
# View(df)
### 3.
df <- read.csv(csv_file, dec=",")
### 4.
df$d <- as.numeric(df$c)
sum(df$d)
#> [1] NA
# This result is not correct. Solution?
df$d <- df$c
str(df$d) # still a factor variable with old factors
#> chr [1:3] "3,2" "3,4" "."
df$d <- as.numeric(df$d)
str(df$d) # this is better, but wait the numbers are different. Solution?
#> num [1:3] NA NA NA
#start all over again
df$d <- df$c # copy the column
head(df$d)
#> [1] "3,2" "3,4" "."
df$d <- sub(",", ".", df$d)
df$d <- as.numeric(as.character(df$d))
# now get the sum
sum(df$d) # sum with any NA is NA
#> [1] NA
sum(df$d, na.rm=TRUE)
#> [1] 6.6
# good thing to do:
df <- read.csv(csv_file, dec=",", stringsAsFactors = FALSE)
3.6 Subsetting Data Frames
When manipulating data, we often would like to only address some elements of a data frame. Hence, we need subsetting. Before we begin, remember that a data frame is a collection of row- and column-vectors. Much like vectors, we can subset a data frame by specifying which parts of rows and columns we would like to work with. To subset a data frame df
, we use square brackets, and first name rows and then columns (just like for matrices), i.e. df[rows,columns]
. If we leave one side blank, we keep either all rows or all columns, respectively.
We can subset a data frame in many ways, again, just like for vectors or matrices:
- Choose the elements to keep, e.g.
1:5
- Alternatively, choose the elements not to keep, e.g.
-c(1,3,5)
- Further, we can use logical vectors, e.g.
x[x>=3]
- Or, again, keep the opposite, e.g.
x[x!=3]
In addition, data frames have special operators. The $
sign allows addressing a specific column, e.g. df$country
. Alternatively we can use the function subset()
. In addition, we can also subset data conditionally on other functions.
Let’s see how subsetting data frames work in practice:
# Select some rows directly, keep all columns:
df[1:5,]
# Now, take the same rows, but only for
# one variable/column:
df[1:5,"BOX"]
# The same can be done in further ways:
df$BOX[1:5]
df[1:5,1] # If BOX is the first Column!
df[1:5,colnames(df)=="BOX"]
subset(df,select=BOX) # subset df to column "BOX"
subset(df[1:5,],select=BOX) # subset and keep rows 1 to 5
df[df$BOX <= mean(df$BOX),] # select all rows smaller or equal to the mean of the variable
df[!is.na(df$BOX),] # To drop missing values
df[df$BOX != 0,] # Drop Null-Values
3.7 Generating and Replacing Variables in Data Frames
Our data on movies, df
, contains data on box office returns. Box office returns are measured in USD. Suppose, we would like to change the scale to millions of USD instead. We would need to either replace “BOX” or create a new variable. We can overwrite a specific column in a data frame using the assign function <-
.
# replace, different Syntax same result
df$BOX <- df$BOX / 10^6
df[,"BOX"] <- df[,"BOX"] / 10^6
Note, how you always need to specify the data frame you are using even on the right hand side. Otherwise R will search for a vector named “BOX” in the workspace and not in the data frame.
Let’s save the code we have written so far by clicking on save
in RStudio. You may call the file Example1.R
. You may also want to save the data frame we have created, so that next time you don’t have to run the entire script to continue working with the movie ratings data in another example.
As mentioned in Section 3.3 we can simply save the data using the save()
function. Note that you have name the file and use the .RData
file ending.
save(df, file = "MovieData.RData")
Run and save the script again. Next time we can open the data with load(file = "MovieData.RData")
and the data frame df
will show up in your workspace again.
Exercise 8: Working with Data Frames
Question 1: Getting familiar with subsetting
Get some data and try to subset it (don’t overwrite the df):
data(WorldPhones)
df <- data.frame(WorldPhones)
- Keep the first 2 observations
- Drop the last 2 observations; Hint:
nrow()
displays the number of rows - Drop Africa
- Keep only North America, Europe, Oceania, and Asia
- Keep only the years where Europeans had more than 35,000 phones
- Keep Oceania and Asia when Asia has more than 5,000 phones, use here the
subset()
function
Answer
### Question 1: Getting familiar with subsetting
## Load Data:
data(WorldPhones)
df <- data.frame(WorldPhones)
# Keep the first 2 observations
subs <- df[1:2,]
# Drop the last 2 observations
subs <- df[-c(nrow(df),(nrow(df)-1)),]
subs1 <- df[1:5,]
# Drop Africa
subs <- df[,colnames(df)!="Africa"]
# alternative
subs1 <- df[,-6]
# Keep only North America, Europe, Oceania, and Asia
subs <- df[,c("N.Amer","Europe","Asia","Oceania")]
# Keep only the years where Europeans had more than 35,000 phones
subs <- df[df$Europe>=35000,]
# Keep Oceania and Asia when Asia has more than 5,000 phones, use `subset()` function
subs <- subset(df,subset=Asia>=5000,select=c("Oceania","Asia"))
Question 2: Generating and Replacing
Let’s use data on 18th century Switzerland and make a “livelihood”-index for the typical Swiss person of that time…
data(swiss)
df <- swiss
- Start by adding some indicator variables to the data frame. For each, first add the column by assigning 0 to all observations, then add a 1 for those where the condition is fulfilled:
- The majority of people are catholic
- Agriculture is above 30 percent
- Infant mortality is below 20
- Fertility is above 60, but below 80
- Calculate how each observation scores over all dummies (check out the
rowSums()
function) and sort the data frame descending in the points they scored. You sort the data frame via “subsetting” its rows with the ordered vector, which in turn you produce with theorder()
function.
Answer
### Question 2: Generating and Replacing ###
## Load Data
data(swiss)
df <- swiss
head(df)
#> Fertility Agriculture Examination Education
#> Courtelary 80.2 17.0 15 12
#> Delemont 83.1 45.1 6 9
#> Franches-Mnt 92.5 39.7 5 5
#> Moutier 85.8 36.5 12 7
#> Neuveville 76.9 43.5 17 15
#> Porrentruy 76.1 35.3 9 7
#> Catholic Infant.Mortality
#> Courtelary 9.96 22.2
#> Delemont 84.84 22.2
#> Franches-Mnt 93.40 20.2
#> Moutier 33.77 20.3
#> Neuveville 5.16 20.6
#> Porrentruy 90.57 26.6
# The Majority of People are Catholic
df$cath.dummy <- 0
df$cath.dummy[df$Catholic>=50] <- 1
# Agriculture is above 30 percent
df$agri.dummy <- 0
df$agri.dummy[df$Agriculture>30] <- 1
# Infant Mortality is below 20
df$infant.dummy <- 0
df$infant.dummy[df$Infant.Mortality<20] <- 1
# Fertility is above 60, but below 80
df$fert.dummy <- 0
df$fert.dummy[df$Fertility>60 & df$Fertility<80] <- 1
# Add up dummies...
df$total <- rowSums(df[,7:10])
# ...and sort descending in the "points"
index <- order(df$total,decreasing = TRUE)
df <- df[index,]
# View(df)