3 Working With Data in R

In this chapter we will apply what we learned in the previous section to real data. We will learn how to read and write data in R and manipulate the data, using native functions from R.

3.1 Setting the Working Directory

When you open R and RStudio it will work in a default directory. So, before we can load data, we have to set our working directory, in which R looks for our data files.

To see our current working directory, type:getwd(). We can change or set our working director using the setwd("") function. When running R on a Windows machine, make sure to change the slashes to forward slashes/”, or escape the backwards slashes to set a path like "C:\\Users\\Name\\".

3.2 Clearing Objects

As a next step, you may want to start each R-script with a clean slate (e.g. “clear all” at the beginning of STATA do-file).

To empty the memory and remove all objects, type:

rm(list = ls())

3.3 Saving and Loading R Scripts and Native R Files

You can save your current R script (*.R) by clicking on the save button in the top left of RStudio. To save one or multiple objects (e.g. data frames) from your current work space as an *.Rdata file, type save():

x <- 1:10
save(x, file = "X.RData")

To load them again, use

load(file = "X.RData")
x
#>  [1]  1  2  3  4  5  6  7  8  9 10

We strongly recommend to NEVER overwrite your original data source!

3.4 Loading Data from CSV Files

We usually begin by loading data from some format. In this course, we will mostly use CSV files, but R can read a large variety of formats (.dta, .shp, .txt, .nc, etc.). To get started, we load the following CSV file directly from the web, that contains information on movie ratings:

url <- "http://www.stern.nyu.edu/~wgreene/Text/"
df <- read.csv(paste0(url,"Edition7/TableF4-3.csv"))
df <- df[,1:5] # keep only first 5 variables

3.5 Diagnostics in R

str() reports the internal structure of an object. The argument vec.len = 1 controls the length of the printed output.

str(df[,1:3], vec.len=1)
#> 'data.frame':    62 obs. of  3 variables:
#>  $ BOX     : int  19167085 63106589 ...
#>  $ MPRATING: int  4 2 ...
#>  $ BUDGET  : num  28 150 ...

The head() function displays the first few (5 by default) rows of an object and is particularly useful, when you work with larger datasets.

head(df,3)
#>        BOX MPRATING BUDGET STARPOWR SEQUEL
#> 1 19167085        4   28.0    19.83      0
#> 2 63106589        2  150.0    32.69      1
#> 3  5401605        4   37.4    15.69      0

The summary() function reports the summary statistics of an object.

summary(df[,1:3])
#>       BOX              MPRATING         BUDGET      
#>  Min.   :  511920   Min.   :1.000   Min.   :  5.00  
#>  1st Qu.: 6956492   1st Qu.:2.000   1st Qu.: 30.50  
#>  Median :16930926   Median :3.000   Median : 37.40  
#>  Mean   :20720651   Mean   :2.968   Mean   : 53.29  
#>  3rd Qu.:26696144   3rd Qu.:4.000   3rd Qu.: 60.00  
#>  Max.   :70950500   Max.   :4.000   Max.   :200.00

The # View() functions opens the dataset for you. Only use this, if the data is not very big.

# View(df[,1:3])

Exercise 7: Problems when Loading Data

  1. Load test.csv in the exercise folder.

  2. Inspect the objects using some commands we just applied. Try str(),head(), summary() and # View().

  3. Only looking at the first two columns: What kind of problems do you encounter and how to solve the issue? Hint: Go to the helpfile of read.csv() and watch out for the dec argument.

  4. The third column is a character vector. Use the following line to convert the entries to numerical units

df$d <- as.numeric(df$c)
  1. Calculate the sum of all entries in the third column. Did you expect the result?
Answer
## Problems when loading data
### 1.
csv_file <- "data/test.csv"
df <- read.csv(csv_file)

### 2.
str(df)
#> 'data.frame':    3 obs. of  3 variables:
#>  $ a: chr  "1,2" "1,4" "1,8"
#>  $ b: chr  "2,2" "2,4" NA
#>  $ c: chr  "3,2" "3,4" "."
head(df)
#>     a    b   c
#> 1 1,2  2,2 3,2
#> 2 1,4  2,4 3,4
#> 3 1,8 <NA>   .
summary(df)
#>       a                  b                  c            
#>  Length:3           Length:3           Length:3          
#>  Class :character   Class :character   Class :character  
#>  Mode  :character   Mode  :character   Mode  :character
# View(df)

### 3. 
df <- read.csv(csv_file, dec=",")

### 4. 
df$d <- as.numeric(df$c)
sum(df$d) 
#> [1] NA

# This result is not correct. Solution?
df$d <- df$c
str(df$d) # still a factor variable with old factors
#>  chr [1:3] "3,2" "3,4" "."
df$d <- as.numeric(df$d)
str(df$d) # this is better, but wait the numbers are different. Solution?
#>  num [1:3] NA NA NA

#start all over again
df$d <- df$c # copy the column
head(df$d)
#> [1] "3,2" "3,4" "."
df$d <- sub(",", ".", df$d)
df$d <- as.numeric(as.character(df$d))

# now get the sum
sum(df$d) # sum with any NA is NA
#> [1] NA
sum(df$d, na.rm=TRUE)
#> [1] 6.6

# good thing to do:
df <- read.csv(csv_file, dec=",", stringsAsFactors = FALSE)

3.6 Subsetting Data Frames

When manipulating data, we often would like to only address some elements of a data frame. Hence, we need subsetting. Before we begin, remember that a data frame is a collection of row- and column-vectors. Much like vectors, we can subset a data frame by specifying which parts of rows and columns we would like to work with. To subset a data frame df, we use square brackets, and first name rows and then columns (just like for matrices), i.e. df[rows,columns]. If we leave one side blank, we keep either all rows or all columns, respectively.

We can subset a data frame in many ways, again, just like for vectors or matrices:

  • Choose the elements to keep, e.g. 1:5
  • Alternatively, choose the elements not to keep, e.g. -c(1,3,5)
  • Further, we can use logical vectors, e.g. x[x>=3]
  • Or, again, keep the opposite, e.g. x[x!=3]

In addition, data frames have special operators. The $ sign allows addressing a specific column, e.g. df$country. Alternatively we can use the function subset(). In addition, we can also subset data conditionally on other functions.

Let’s see how subsetting data frames work in practice:

# Select some rows directly, keep all columns:
df[1:5,]
# Now, take the same rows, but only for
# one variable/column:
df[1:5,"BOX"]
# The same can be done in further ways:
df$BOX[1:5]
df[1:5,1] # If BOX is the first Column!
df[1:5,colnames(df)=="BOX"]
subset(df,select=BOX) # subset df to column "BOX"
subset(df[1:5,],select=BOX) # subset and keep rows 1 to 5
df[df$BOX <= mean(df$BOX),] # select all rows smaller or equal to the mean of the variable 
df[!is.na(df$BOX),] # To drop missing values
df[df$BOX != 0,] # Drop Null-Values

3.7 Generating and Replacing Variables in Data Frames

Our data on movies, df, contains data on box office returns. Box office returns are measured in USD. Suppose, we would like to change the scale to millions of USD instead. We would need to either replace “BOX” or create a new variable. We can overwrite a specific column in a data frame using the assign function <-.

# replace, different Syntax same result
df$BOX <- df$BOX / 10^6
df[,"BOX"] <- df[,"BOX"] / 10^6

Note, how you always need to specify the data frame you are using even on the right hand side. Otherwise R will search for a vector named “BOX” in the workspace and not in the data frame.

Let’s save the code we have written so far by clicking on save in RStudio. You may call the file Example1.R. You may also want to save the data frame we have created, so that next time you don’t have to run the entire script to continue working with the movie ratings data in another example.

As mentioned in Section 3.3 we can simply save the data using the save() function. Note that you have name the file and use the .RData file ending.

save(df, file = "MovieData.RData")

Run and save the script again. Next time we can open the data with load(file = "MovieData.RData") and the data frame df will show up in your workspace again.

Exercise 8: Working with Data Frames

Question 1: Getting familiar with subsetting

Get some data and try to subset it (don’t overwrite the df):

data(WorldPhones)
df <- data.frame(WorldPhones)
  • Keep the first 2 observations
  • Drop the last 2 observations; Hint: nrow()displays the number of rows
  • Drop Africa
  • Keep only North America, Europe, Oceania, and Asia
  • Keep only the years where Europeans had more than 35,000 phones
  • Keep Oceania and Asia when Asia has more than 5,000 phones, use here the subset() function
Answer
### Question 1: Getting familiar with subsetting 

## Load Data:
data(WorldPhones)
df <- data.frame(WorldPhones)

# Keep the first 2 observations
subs <- df[1:2,]

# Drop the last 2 observations
subs <- df[-c(nrow(df),(nrow(df)-1)),]
subs1 <- df[1:5,]

# Drop Africa
subs <- df[,colnames(df)!="Africa"]

# alternative
subs1 <- df[,-6]

# Keep only North America, Europe, Oceania, and Asia
subs <- df[,c("N.Amer","Europe","Asia","Oceania")]

# Keep only the years where Europeans had more than 35,000 phones
subs <- df[df$Europe>=35000,]

# Keep Oceania and Asia when Asia has more than 5,000 phones, use `subset()` function
subs <- subset(df,subset=Asia>=5000,select=c("Oceania","Asia"))

Question 2: Generating and Replacing

Let’s use data on 18th century Switzerland and make a “livelihood”-index for the typical Swiss person of that time…

data(swiss)
df <- swiss
  1. Start by adding some indicator variables to the data frame. For each, first add the column by assigning 0 to all observations, then add a 1 for those where the condition is fulfilled:
    • The majority of people are catholic
    • Agriculture is above 30 percent
    • Infant mortality is below 20
    • Fertility is above 60, but below 80
  2. Calculate how each observation scores over all dummies (check out the rowSums() function) and sort the data frame descending in the points they scored. You sort the data frame via “subsetting” its rows with the ordered vector, which in turn you produce with the order() function.
Answer
### Question 2: Generating and Replacing ###

## Load Data
data(swiss)
df <- swiss
head(df)
#>              Fertility Agriculture Examination Education
#> Courtelary        80.2        17.0          15        12
#> Delemont          83.1        45.1           6         9
#> Franches-Mnt      92.5        39.7           5         5
#> Moutier           85.8        36.5          12         7
#> Neuveville        76.9        43.5          17        15
#> Porrentruy        76.1        35.3           9         7
#>              Catholic Infant.Mortality
#> Courtelary       9.96             22.2
#> Delemont        84.84             22.2
#> Franches-Mnt    93.40             20.2
#> Moutier         33.77             20.3
#> Neuveville       5.16             20.6
#> Porrentruy      90.57             26.6

# The Majority of People are Catholic
df$cath.dummy <- 0
df$cath.dummy[df$Catholic>=50] <- 1

# Agriculture is above 30 percent
df$agri.dummy <- 0
df$agri.dummy[df$Agriculture>30] <- 1

# Infant Mortality is below 20
df$infant.dummy <- 0
df$infant.dummy[df$Infant.Mortality<20] <- 1

# Fertility is above 60, but below 80
df$fert.dummy <- 0
df$fert.dummy[df$Fertility>60 & df$Fertility<80] <- 1

# Add up dummies...
df$total <- rowSums(df[,7:10])

#  ...and sort descending in the "points"
index <- order(df$total,decreasing = TRUE)
df <- df[index,]
# View(df)