Introduction to R

Francisco Rowe (@fcorowe)

2021-11-05

Why R?

RStudio is often used to implement R coding

To run R or RStudio, just double click on the R or RStudio icon. We will use RStudio:

Fig. 1. RStudio features.

If you would like to know more about the various features of RStudio, watch this video

Setting the working directory

Use setwd( ) to set the working directory.

Note: It is good practice to not include spaces when naming folders and files. Use underscores or dots.
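As a minimal sketch of the call, the lines below record the current working directory, switch to a temporary folder and then switch back. The folder used here, tempdir( ), is only a stand-in (an assumption); replace it with the path to your own project folder.

```r
# Record the current working directory so it can be restored later
old_wd <- getwd()

# tempdir() is a stand-in for your own project folder (an assumption)
setwd(tempdir())
new_wd <- getwd()   # confirm where R is now looking for files
print(new_wd)

# Return to the original working directory
setwd(old_wd)
```

Calling getwd( ) after setwd( ) is a cheap way to confirm that R is pointing at the folder you expect before reading or saving files.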

R Scripts vs Notebooks

An R script:

To create an R script in RStudio:

mtcars
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

An R Notebook

To create an R Notebook:

Fig. 2. YAML metadata for notebooks.

  1. Use the Insert command on the editor toolbar;
  2. Use the keyboard shortcut Ctrl + Alt + I or Cmd + Option + I (Mac); or,
  3. Type the chunk delimiters ```{r} and ```

In a code chunk you can produce text output, tables and graphics, and write code. You can control these outputs via chunk options, which are provided inside the curly brackets, e.g.

Fig. 3. Code chunk example. Details on the various options can be found here

RStudio also offers a Preview option on the toolbar, which can be used to create PDF, HTML and Word versions of the notebook. To do this, choose from the drop-down list menu knit to ...

Getting Help

You can use help or ? to ask for details for a specific function:

help(sqrt) #or ?sqrt
And using example provides examples for said function:


example(sqrt)
## 
## sqrt> require(stats) # for spline
## 
## sqrt> require(graphics)
## 
## sqrt> xx <- -9:9
## 
## sqrt> plot(xx, sqrt(abs(xx)),  col = "red")
## 
## sqrt> lines(spline(xx, sqrt(abs(xx)), n=101), col = "pink")
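
If you do not remember the exact name of a function, two further base R helpers may be useful. The short sketch below uses sqrt only as an illustration.

```r
# apropos() lists objects whose names contain the given text
matches <- apropos("sqrt")
print(matches)

# args() shows the arguments a function accepts
args(sqrt)

# help.search("square root") searches the help pages by keyword;
# it is left commented out here because it opens the help viewer
# help.search("square root")
```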

Variables and objects

An object is a data structure having attributes and methods. In fact, everything in R is an object!

A variable is a type of data object. Data objects also include lists, vectors, matrices and text.

In R a variable can be created by using the symbol <- to assign a value to a variable name. The variable name is entered on the left of <- and the value on the right. Note: Data objects can be given any name, provided that they start with a letter of the alphabet, and include only letters of the alphabet, numbers and the characters . and _. Hence AgeGroup, Age_Group and Age.Group are all valid names for an R data object. Note also that R is case-sensitive, so agegroup and AgeGroup would be treated as different data objects.
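
The naming rules and case-sensitivity can be checked directly. In the sketch below the object names and values are made up for illustration. make.names( ) is a base R function that converts a string into a syntactically valid name, so a string that comes back unchanged was already valid.

```r
# Names that differ only in case refer to different objects
AgeGroup <- "adult"
agegroup <- "child"
identical(AgeGroup, agegroup)

# Check candidate names: valid names are returned unchanged by make.names()
candidates <- c("AgeGroup", "Age_Group", "Age.Group", "2age", "age group")
valid <- make.names(candidates) == candidates
data.frame(candidates, valid)
```

The last two candidates fail the check because one starts with a number and the other contains a space.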

To save the value 28 to a variable (data object) labelled age, run the code:

age <- 28

To inspect the contents of the data object age run the following line of code:

age
## [1] 28

Find out what kind (class) of data object age is using:

class(age) 
## [1] "numeric"

Inspect the structure of the age data object:

str(age) 
##  num 28

What if we have more than one response? We can use the c( ) function to combine multiple values into one data vector object:

age <- c(28, 36, 25, 24, 32)
age
## [1] 28 36 25 24 32
class(age) #Still numeric..
## [1] "numeric"
str(age) #..but now a vector (set) of 5 separate values
##  num [1:5] 28 36 25 24 32

Note that on each line in the code above any text following the # character is ignored by R when executing the code. Instead, text following a # can be used to add comments to the code to make clear what the code is doing. Two marks of good code are a clear layout and clear commentary on the code.

Basic Data Types in R

There are a number of data types. Four are the most common: numeric, integer, character and logical. In R, numeric is the default type for numbers. It stores all numbers as double-precision floating-point numbers (numbers with decimals), since most statistical calculations involve non-integer values.

num <- 4.5 # Decimal values
class(num)
## [1] "numeric"
int <- as.integer(4) # Whole numbers. Note integers are also numerics.
class(int)
## [1] "integer"
cha <- "are you enjoying this?" # text or string. You can also type `as.character("are you enjoying this?")`
class(cha)
## [1] "character"
log <- 2 < 1 # assigns TRUE or FALSE. In this case, FALSE as 2 is greater than 1
log
## [1] FALSE
class(log)
## [1] "logical"
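
To see how these types relate to one another, the sketch below uses typeof( ) together with the is.*( ) and as.*( ) families of base R functions. The values are illustrative only.

```r
x_num <- 4.5      # stored as a double-precision number
x_int <- 4L       # the L suffix creates an integer directly
x_chr <- "28"     # a number typed inside quotes is text

typeof(x_num)         # "double"
typeof(x_int)         # "integer"
is.numeric(x_int)     # TRUE: integers also count as numeric

# Text must be converted before it can be used in arithmetic
x_conv <- as.numeric(x_chr) + 1
x_conv                # 29
```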

A factor variable assigns a numeric code to each possible category (level) in a variable. Behind the scenes, R stores the variable using these numeric codes to save space and speed up computing. For example, compare the size of a list of 10,000 males and females to a list of 10,000 1s and 0s. At the same time R also saves the category names associated with each numeric code (level). These are used for display purposes.

For example, the variable gender, converted to a factor, would be stored as a series of 1s and 2s, where 1 = female and 2 = male; but would be displayed in all outputs using their category labels of female and male.

Creating a factor

To convert a numeric or character vector into a factor use the factor( ) function. For instance:

gender <- c("female","male","male","female","female") # create a gender variable
gender <- factor(gender) # replace character vector with a factor version
gender
## [1] female male   male   female female
## Levels: female male
class(gender)
## [1] "factor"
str(gender)
##  Factor w/ 2 levels "female","male": 1 2 2 1 1

Now gender is a factor and is stored as a series of 1s and 2s, with 1s representing females and 2s representing males. The function levels( ) lists the levels (categories) associated with a given factor variable:

levels(gender)
## [1] "female" "male"

The categories are reported in the order that they have been numbered (starting from 1). Hence from the output we can infer that females are coded as 1, and males as 2.
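
A short sketch of what happens behind the scenes, reusing the gender values from above. as.integer( ) reveals the numeric codes, the levels argument of factor( ) sets the coding order explicitly, and object.size( ) gives an approximate comparison of memory use. The 10,000-element vector is simulated purely for illustration.

```r
gender <- factor(c("female", "male", "male", "female", "female"))
as.integer(gender)    # underlying codes: 1 2 2 1 1

# By default levels are sorted alphabetically; set the order explicitly if needed
gender2 <- factor(c("female", "male", "male", "female", "female"),
                  levels = c("male", "female"))
as.integer(gender2)   # now male = 1 and female = 2: 2 1 1 2 2

# Rough memory comparison on a longer, simulated vector
set.seed(1)
chr_vec <- sample(c("female", "male"), 10000, replace = TRUE)
fct_vec <- factor(chr_vec)
object.size(chr_vec)
object.size(fct_vec)
```

The exact sizes reported depend on your R installation, so treat them as a rough guide rather than fixed figures.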

Data Frames

R stores different types of data using different types of data structure. Data are normally stored as a data.frame. A data frame contains one row per observation (e.g. wards) and one column per attribute (e.g. population and health).

We create three variables wards, population (pop) and people with good health (ghealth). We use 2011 census data counts for total population and good health for wards in Liverpool.

wards <- c("Allerton and Hunts Cross","Anfield","Belle Vale","Central","Childwall","Church","Clubmoor","County","Cressington","Croxteth","Everton","Fazakerley","Greenbank","Kensington and Fairfield","Kirkdale","Knotty Ash","Mossley Hill","Norris Green","Old Swan","Picton","Princes Park","Riverside","St Michael's","Speke-Garston","Tuebrook and Stoneycroft","Warbreck","Wavertree","West Derby","Woolton","Yew Tree")
pop <- c(14853,14510,15004,20340,13908,13974,15272,14045,14503,
                14561,14782,16786,16132,15377,16115,13312,13816,15047,
                16461,17009,17104,18422,12991,20300,16489,16481,14772,
                14382,12921,16746)
ghealth <- c(7274,6124,6129,11925,7219,7461,6403,5930,7094,6992,
                 5517,7879,8990,6495,6662,5981,7322,6529,7192,7953,
                 7636,9001,6450,8973,7302,7521,7268,7013,6025,7717)

Note that pop and ghealth are numeric and wards contains characters.

Creating A Data Frame

We can create a data frame and examine its structure:

df <- data.frame(wards, pop, ghealth)
df # or use View(df)
##                       wards   pop ghealth
## 1  Allerton and Hunts Cross 14853    7274
## 2                   Anfield 14510    6124
## 3                Belle Vale 15004    6129
## 4                   Central 20340   11925
## 5                 Childwall 13908    7219
## 6                    Church 13974    7461
## 7                  Clubmoor 15272    6403
## 8                    County 14045    5930
## 9               Cressington 14503    7094
## 10                 Croxteth 14561    6992
## 11                  Everton 14782    5517
## 12               Fazakerley 16786    7879
## 13                Greenbank 16132    8990
## 14 Kensington and Fairfield 15377    6495
## 15                 Kirkdale 16115    6662
## 16               Knotty Ash 13312    5981
## 17             Mossley Hill 13816    7322
## 18             Norris Green 15047    6529
## 19                 Old Swan 16461    7192
## 20                   Picton 17009    7953
## 21             Princes Park 17104    7636
## 22                Riverside 18422    9001
## 23             St Michael's 12991    6450
## 24            Speke-Garston 20300    8973
## 25 Tuebrook and Stoneycroft 16489    7302
## 26                 Warbreck 16481    7521
## 27                Wavertree 14772    7268
## 28               West Derby 14382    7013
## 29                  Woolton 12921    6025
## 30                 Yew Tree 16746    7717
str(df) # or use glimpse(df) from the dplyr package
## 'data.frame':    30 obs. of  3 variables:
##  $ wards  : chr  "Allerton and Hunts Cross" "Anfield" "Belle Vale" "Central" ...
##  $ pop    : num  14853 14510 15004 20340 13908 ...
##  $ ghealth: num  7274 6124 6129 11925 7219 ...

Referencing Data Frames

Throughout this module, you will need to refer to particular parts of a data frame: perhaps a particular column (an area attribute), or a particular subset of respondents. Hence it is worth spending some time now mastering this particular skill.

The relevant R function, [ ], has the format [row,col] or, more generally, [set of rows, set of cols].

Run the following commands to get a feel of how to extract different slices of the data:

df # whole data.frame
df[1, 1] # contents of first row and column
df[2, 2:3] # contents of the second row, second and third columns
df[1, ] # first row, ALL columns [the default if no columns specified]
df[ ,1:2] # ALL rows; first and second columns
df[c(1,3,5), ] # rows 1,3,5; ALL columns
df[ , 2] # ALL rows; second column (by default results containing only 
             #one column are converted back into a vector)
df[ , 2, drop=FALSE] # ALL rows; second column (returned as a data.frame)

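In the above, note that we have used two other R functions:

  1. the colon operator :, which generates a sequence of integers (e.g. 1:2); and,
  2. c( ), which combines values into a vector (e.g. c(1,3,5)).

Run both of these functions on their own to get a better understanding of what they do.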
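
Rows can also be selected with a logical condition rather than by position, which is how a particular subset of observations is usually extracted. The sketch below rebuilds a three-ward version of the data frame from the values above. The object name df_small and the 15,000 threshold are arbitrary choices for illustration.

```r
df_small <- data.frame(
  wards   = c("Allerton and Hunts Cross", "Anfield", "Belle Vale"),
  pop     = c(14853, 14510, 15004),
  ghealth = c(7274, 6124, 6129)
)

# The condition returns one TRUE/FALSE value per row
big <- df_small$pop > 15000
big

# Use it in the row position to keep only rows where it is TRUE
df_small[big, ]

# Combine it with a column selection
df_small[big, c("wards", "pop")]
```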

Three other methods for referencing the contents of a data.frame make direct use of the variable names within the data.frame, which tends to make for easier to read/understand code:

df[,"pop"] # variable name in quotes inside the square brackets
df$pop # variable name prefixed with $ and appended to the data.frame name
# or you can use attach
attach(df)
pop # but be careful if you already have a pop variable in your local workspace
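
If you would rather avoid attach( ), base R's with( ) evaluates an expression inside a data frame without changing the search path. A minimal sketch, using a small stand-in data frame (df_demo is a hypothetical name):

```r
df_demo <- data.frame(pop     = c(14853, 14510, 15004),
                      ghealth = c(7274, 6124, 6129))

# A workspace object with the same name as a column
pop <- "not the column"

# Inside with(), pop and ghealth refer to the columns of df_demo
share <- with(df_demo, ghealth / pop)
round(share, 3)

# If you do use attach(), call detach() when you have finished
```

Because with( ) looks in the data frame first, the workspace object called pop does not interfere with the calculation.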

To check the variables available, use names( ):

names(df)
## [1] "wards"   "pop"     "ghealth"

Read Data

Ensure your memory is clear

rm(list=ls()) # rm for targeted deletion / ls for listing all existing objects

There are many commands to read / load data into R. The command to use will depend upon the format in which the data have been saved. Normally they are saved in csv format from Excel or other software packages. So we use either:

To read files in other formats, refer to this useful DataCamp tutorial.

census <- read.csv("../data/census_data.csv")
head(census)
##        code                     ward pop16_74 higher_managerial   pop ghealth
## 1 E05000886 Allerton and Hunts Cross    10930              1103 14853    7274
## 2 E05000887                  Anfield    10712               312 14510    6124
## 3 E05000888               Belle Vale    10987               432 15004    6129
## 4 E05000889                  Central    19174              1346 20340   11925
## 5 E05000890                Childwall    10410              1123 13908    7219
## 6 E05000891                   Church    10569              1843 13974    7461
# NOTE: always ensure you are setting the correct directory leading to the data.
# It may differ from your existing working directory
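
Because a wrong path is a common reason for read.csv( ) to fail, it can help to check that the file exists first. The sketch below writes a tiny csv to a temporary folder and reads it back. The file name demo_census.csv and its cut-down contents are made up so that the example runs anywhere.

```r
# Build a path in a platform-independent way
csv_path <- file.path(tempdir(), "demo_census.csv")

# Create a small csv file to read (a stand-in for your own data file)
writeLines(c("code,ward,pop",
             "E05000886,Allerton and Hunts Cross,14853",
             "E05000887,Anfield,14510"),
           csv_path)

file.exists(csv_path)   # should be TRUE before attempting to read

demo <- read.csv(csv_path)
str(demo)
```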

Quickly inspect the data:

  1. What class?
  2. What R data types?
  3. What data types?

# 1
class(census)
# 2 & 3
str(census)

Just interested in the variable names:

names(census)
## [1] "code"              "ward"              "pop16_74"         
## [4] "higher_managerial" "pop"               "ghealth"

or want to view the data:

View(census)

Manipulating Data Using tidyverse

Adding New Variables

Usually you want to add / create new variables in your data frame using existing variables, e.g. computing percentages by dividing two variables. There are many ways in which you can do this, e.g. referencing a data frame as we have done above, or using $ (e.g. census$pop). For this module, we’ll use tidyverse:

library(tidyverse) # loads dplyr, which provides mutate() and the %>% pipe
census <- census %>% mutate(per_ghealth = ghealth / pop)

Note we used a pipe operator %>%, which helps make the code more efficient and readable. For more details, see Grolemund, Garrett, and Hadley Wickham, 2019. When using the pipe operator, remember to first indicate the data frame before %>%.

Grolemund, Garrett, and Hadley Wickham. 2019. R for Data Science. O’Reilly, US.

Note also the use of a variable name before the = sign, inside the brackets after mutate, to indicate the name of the new variable.
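
For comparison, the same calculation can be written in base R with the $ notation mentioned above. The sketch uses a three-row stand-in for census (named census_demo here, a hypothetical name) so that it runs on its own.

```r
census_demo <- data.frame(
  ward    = c("Allerton and Hunts Cross", "Anfield", "Belle Vale"),
  pop     = c(14853, 14510, 15004),
  ghealth = c(7274, 6124, 6129)
)

# Create the new variable as a proportion (0 to 1)
census_demo$per_ghealth <- census_demo$ghealth / census_demo$pop

# Multiply by 100 if a percentage is preferred
census_demo$pct_ghealth <- 100 * census_demo$per_ghealth

census_demo
```

Note that per_ghealth is a proportion, so a value of 0.5 corresponds to 50%.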

Selecting Variables

Usually you want to select a subset of variables for your analysis, as storing very large data sets in R's memory can reduce the processing speed of your machine. A selection of data can be achieved by using the select function:

ndf <- census %>% select(ward, pop16_74, per_ghealth)

Again, first indicate the data frame and then the variables you want to select to build a new data frame. Note the code chunk above has created a new data frame called ndf. Explore it.

Filtering Data

You may also want to filter values based on defined conditions. You may want to filter observations greater than a certain threshold or only areas within a certain region. For example, you may want to select areas with a percentage of good health population over 50%:

ndf2 <- census %>% filter(per_ghealth > 0.5)

You can use more than one variable to set conditions. Use “,” to add a condition.
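
A sketch of filtering on two conditions at once, assuming the dplyr package (part of tidyverse) is installed. The six rows reproduce values shown elsewhere in this document. The 14,000 population threshold is an arbitrary choice for illustration.

```r
library(dplyr)

demo <- data.frame(
  ward    = c("Allerton and Hunts Cross", "Anfield", "Belle Vale",
              "Central", "Childwall", "Church"),
  pop     = c(14853, 14510, 15004, 20340, 13908, 13974),
  ghealth = c(7274, 6124, 6129, 11925, 7219, 7461)
) %>%
  mutate(per_ghealth = ghealth / pop)

# Conditions separated by "," must all hold (logical AND)
both <- demo %>% filter(per_ghealth > 0.5, pop > 14000)
both$ward

# Use | inside a single condition for logical OR
either <- demo %>% filter(per_ghealth > 0.5 | pop > 14000)
nrow(either)
```

Here only Central satisfies both conditions, whereas every ward satisfies at least one of them.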

Joining Data Frames

When working with spatial data, we often need to join data. To this end, you need a common unique id variable. Let’s say we want to add a data frame containing census data on households for Liverpool, and join the new attributes to one of the existing data frames in the workspace. First we will read the data frame we want to join (i.e. census_data2.csv).

# read data
census2 <- read.csv("../data/census_data2.csv")
# visualise data structure
str(census2)
## 'data.frame':    30 obs. of  3 variables:
##  $ geo_code               : chr  "E05000886" "E05000887" "E05000888" "E05000889" ...
##  $ households             : int  6359 6622 6622 7139 5391 5884 6576 6745 6317 6024 ...
##  $ socialrented_households: int  827 1508 2818 1311 374 178 2859 1564 1023 1558 ...

The variable geo_code in this data frame corresponds to the code in the existing data frame, and the values are unique, so they can be automatically matched by using the merge() function. The merge() function uses two arguments: x and y. The former refers to data frame 1 and the latter to data frame 2. Both of these data frames must have an id variable containing the same information. Note they can have different names. Another key argument to include is all.x=TRUE, which tells the function to keep all the records in x, but only those in y that match, in case there are discrepancies in the id variable.

# join data frames
join_dfs <- merge(census, census2, by.x="code", by.y="geo_code", all.x = TRUE)
# check data
head(join_dfs)
##        code                     ward pop16_74 higher_managerial   pop ghealth
## 1 E05000886 Allerton and Hunts Cross    10930              1103 14853    7274
## 2 E05000887                  Anfield    10712               312 14510    6124
## 3 E05000888               Belle Vale    10987               432 15004    6129
## 4 E05000889                  Central    19174              1346 20340   11925
## 5 E05000890                Childwall    10410              1123 13908    7219
## 6 E05000891                   Church    10569              1843 13974    7461
##   per_ghealth households socialrented_households
## 1   0.4897327       6359                     827
## 2   0.4220538       6622                    1508
## 3   0.4084911       6622                    2818
## 4   0.5862832       7139                    1311
## 5   0.5190538       5391                     374
## 6   0.5339201       5884                     178
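
To see what all.x = TRUE does when the id variables do not match perfectly, consider the toy example below. The data frames x_df and y_df are hypothetical cut-down versions of the two census tables, with one code deliberately missing from y_df.

```r
x_df <- data.frame(code = c("E05000886", "E05000887", "E05000888"),
                   pop  = c(14853, 14510, 15004))

# y_df deliberately lacks a row for E05000888
y_df <- data.frame(geo_code   = c("E05000886", "E05000887"),
                   households = c(6359, 6622))

# Keep every row of x_df; unmatched rows get NA in the columns from y_df
left <- merge(x_df, y_df, by.x = "code", by.y = "geo_code", all.x = TRUE)
left

# Default behaviour: keep only the rows found in both data frames
inner <- merge(x_df, y_df, by.x = "code", by.y = "geo_code")
nrow(inner)
```

Checking for NA values in the joined columns after a merge is a quick way to spot records that failed to match.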

Saving Data

It may also be convenient to save your R workspace, which contains all the objects that you have created, by using the save.image( ) function:

save.image("gds.RData")

This creates a file labelled “gds.RData” in your working directory. You can load this at a later stage using the load( ) function.

load("gds.RData")

Alternatively, you can save / export your data into a csv file. The first argument in the function is the object name, and the second is the name of the csv file we want to create.

write.csv(join_dfs, "join_censusdfs.csv")
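
By default write.csv( ) also writes the row names as an unnamed first column, which reappears as a column called X when the file is read back. If that is not wanted, set row.names = FALSE. A minimal sketch with a made-up data frame and temporary files (all object and file names here are hypothetical):

```r
out_df <- data.frame(code = c("E05000886", "E05000887"),
                     pop  = c(14853, 14510))

path_default <- file.path(tempdir(), "with_rownames.csv")
path_clean   <- file.path(tempdir(), "without_rownames.csv")

write.csv(out_df, path_default)                    # row names included
write.csv(out_df, path_clean, row.names = FALSE)   # row names dropped

names(read.csv(path_default))   # "X" "code" "pop"
names(read.csv(path_clean))     # "code" "pop"
```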