@fcorowe
RStudio is often used to implement R coding.
To run R or RStudio, just double click on the R or RStudio icon. We will use RStudio:
If you would like to know more about the various features of RStudio, watch this video
Use `setwd()` to set your working directory.
Note: It is good practice to not include spaces when naming folders and files. Use underscores or dots.
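As a minimal sketch of checking and changing the working directory: the folder name `my_project` is hypothetical, and it is created inside R's temporary directory so the example is safe to run on any machine.

```r
# Remember where we started so we can return at the end
old_wd <- getwd()

# Hypothetical project folder (no spaces in the name), created under tempdir()
project_dir <- file.path(tempdir(), "my_project")
dir.create(project_dir, showWarnings = FALSE)

setwd(project_dir)   # make it the working directory
new_wd <- getwd()    # confirm the change
basename(new_wd)

setwd(old_wd)        # go back to the original directory
```

In your own work you would pass the path of your module folder to `setwd()` instead of a temporary directory.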
An R script:
Open a new script file: File > New File > R Script
Write some code in your new script window, e.g. type mtcars
Run the script. Click anywhere on the line of code, then hit Ctrl + Enter (Windows) or Cmd + Enter (Mac) to run the command, or select the code chunk and click Run in the top-right corner of your script window. If you do that, you should get:
mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
An R Notebook
Code chunks are delimited by `` ```{r} `` and `` ``` ``.
In a code chunk you can produce text output, tables and graphics, and write code! You can control these outputs via chunk options, which are provided inside the curly brackets.
Execute code: hit “Run Current Chunk”, or press Ctrl + Shift + Enter (Windows) or Cmd + Shift + Enter (Mac)
Save an R notebook: File > Save As. A notebook has a `*.Rmd` extension and when it is saved a `*.nb.html` file is automatically created. The latter is a self-contained HTML file which contains both a rendered copy of the notebook with all current chunk outputs and a copy of the `*.Rmd` file itself.
RStudio also offers a Preview option on the toolbar which can be used to create PDF, HTML and Word versions of the notebook. To do this, choose Knit to ... from the drop-down menu.
You can use `help` or `?` to ask for details of a specific function:
help(sqrt) #or ?sqrt
Using `example` provides examples for that function:
Example sqrt
example(sqrt)
##
## sqrt> require(stats) # for spline
##
## sqrt> require(graphics)
##
## sqrt> xx <- -9:9
##
## sqrt> plot(xx, sqrt(abs(xx)), col = "red")
##
## sqrt> lines(spline(xx, sqrt(abs(xx)), n=101), col = "pink")
An object is a data structure having attributes and methods. In fact, everything in R is an object!
A variable is a type of data object. Data objects also include lists, vectors, matrices and text.
In R a variable can be created by using the symbol `<-` to assign a value to a variable name. The variable name is entered on the left of `<-` and the value on the right. Note: Data objects can be given any name, provided that they start with a letter of the alphabet, and include only letters of the alphabet, numbers and the characters `.` and `_`. Hence AgeGroup, Age_Group and Age.Group are all valid names for an R data object. Note also that R is case-sensitive, so agegroup and AgeGroup would be treated as different data objects.
To save the value 28 to a variable (data object) labelled age, run the code:
age <- 28
To inspect the contents of the data object age run the following line of code:
age
## [1] 28
Find out what kind (class) of data object age is using:
class(age)
## [1] "numeric"
Inspect the structure of the age data object:
str(age)
## num 28
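The naming rules and case sensitivity described above can be checked directly. This is a small sketch; the age-band values and the name "2nd group" are made up for illustration.

```r
Age_Group <- "25-34"   # underscores are allowed
Age.Group <- "35-44"   # dots are allowed
agegroup  <- "45-54"   # all lower case
AgeGroup  <- "55-64"   # different capitalisation, so a different object

# R treats the two spellings as distinct objects
same_value <- identical(agegroup, AgeGroup)
same_value

# make.names() shows how R would repair a name that breaks the rules
# (here: starts with a number and contains a space)
fixed <- make.names("2nd group")
fixed
```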
What if we have more than one response? We can use the `c( )` function to combine multiple values into one data vector object:
age <- c(28, 36, 25, 24, 32)
age
## [1] 28 36 25 24 32
class(age) #Still numeric..
## [1] "numeric"
str(age) #..but now a vector (set) of 5 separate values
## num [1:5] 28 36 25 24 32
Note that on each line in the code above any text following the `#` character is ignored by R when executing the code. Instead, text following a `#` can be used to add comments to the code to make clear what the code is doing. Two marks of good code are a clear layout and clear commentary on the code.
There are a number of data types. Four are the most common: numeric, integer, character and logical. In R, numeric is the default type for numbers. It stores all numbers as double-precision floating-point numbers (numbers with decimals), because most statistical calculations involve decimal values.
num <- 4.5 # Decimal values
class(num)
## [1] "numeric"
int <- as.integer(4) # Natural numbers. Note integers are also numerics.
class(int)
## [1] "integer"
cha <- "are you enjoying this?" # text or string. You can also type `as.character("are you enjoying this?")`
class(cha)
## [1] "character"
log <- 2 < 1 # assigns TRUE or FALSE. In this case, FALSE as 2 is greater than 1
log
## [1] FALSE
class(log)
## [1] "logical"
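To see the difference between the storage type and the class of these objects, the following sketch may help. The object names `x_dbl` and `x_int` are introduced here only for illustration.

```r
x_dbl <- 4    # a plain number is stored as a double
x_int <- 4L   # the L suffix creates an integer directly

typeof(x_dbl)   # storage type of the first value
typeof(x_int)   # storage type of the second value

# Both count as numeric, even though their classes differ
both_numeric <- is.numeric(x_dbl) && is.numeric(x_int)
both_numeric

# Conversions between types
as.numeric("3.14")   # character to number
as.character(28)     # number to character
as.integer(TRUE)     # logical to integer
```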
A factor variable assigns a numeric code to each possible category (level) in a variable. Behind the scenes, R stores the variable using these numeric codes to save space and speed up computing. For example, compare the size of a list of 10,000 males and females to a list of 10,000 1s and 0s. At the same time R also saves the category names associated with each numeric code (level). These are used for display purposes.
For example, the variable gender, converted to a factor, would be stored as a series of 1s and 2s, where 1 = female and 2 = male; but would be displayed in all outputs using their category labels of female and male.
Creating a factor
To convert a numeric or character vector into a factor use the `factor( )` function. For instance:
gender <- c("female","male","male","female","female") # create a gender variable
gender <- factor(gender) # replace character vector with a factor version
gender
## [1] female male male female female
## Levels: female male
class(gender)
## [1] "factor"
str(gender)
## Factor w/ 2 levels "female","male": 1 2 2 1 1
Now gender is a factor and is stored as a series of 1s and 2s, with 1s representing females and 2s representing males. The function `levels( )` lists the levels (categories) associated with a given factor variable:
levels(gender)
## [1] "female" "male"
The categories are reported in the order that they have been numbered (starting from 1). Hence from the output we can infer that females are coded as 1, and males as 2.
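The numeric codes behind a factor can be inspected with `as.integer()`. The sketch below also shows that the coding follows the order of the levels, which is alphabetical unless you set it yourself; the object name `gender_m_first` is introduced here only for illustration.

```r
gender <- factor(c("female", "male", "male", "female", "female"))

codes <- as.integer(gender)   # the underlying numeric codes
codes

# The levels argument sets the order of the categories explicitly
gender_m_first <- factor(c("female", "male", "male", "female", "female"),
                         levels = c("male", "female"))
levels(gender_m_first)
as.integer(gender_m_first)    # the codes now follow the new level order

table(gender)                 # counts per category
```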
R stores different types of data using different types of data structure. Data are normally stored as a data.frame. A data frame contains one row per observation (e.g. wards) and one column per attribute (e.g. population and health).
We create three variables: wards, population (`pop`) and people with good health (`ghealth`). We use 2011 census data counts for total population and good health for wards in Liverpool.
wards <- c("Allerton and Hunts Cross","Anfield","Belle Vale","Central","Childwall","Church","Clubmoor","County","Cressington","Croxteth","Everton","Fazakerley","Greenbank","Kensington and Fairfield","Kirkdale","Knotty Ash","Mossley Hill","Norris Green","Old Swan","Picton","Princes Park","Riverside","St Michael's","Speke-Garston","Tuebrook and Stoneycroft","Warbreck","Wavertree","West Derby","Woolton","Yew Tree")
pop <- c(14853,14510,15004,20340,13908,13974,15272,14045,14503,
14561,14782,16786,16132,15377,16115,13312,13816,15047,
16461,17009,17104,18422,12991,20300,16489,16481,14772,
14382,12921,16746)
ghealth <- c(7274,6124,6129,11925,7219,7461,6403,5930,7094,6992,
5517,7879,8990,6495,6662,5981,7322,6529,7192,7953,
7636,9001,6450,8973,7302,7521,7268,7013,6025,7717)
Note that `pop` and `ghealth` are numeric and `wards` contains characters.
We can create a data frame and examine its structure:
df <- data.frame(wards, pop, ghealth)
df # or use View(df)
## wards pop ghealth
## 1 Allerton and Hunts Cross 14853 7274
## 2 Anfield 14510 6124
## 3 Belle Vale 15004 6129
## 4 Central 20340 11925
## 5 Childwall 13908 7219
## 6 Church 13974 7461
## 7 Clubmoor 15272 6403
## 8 County 14045 5930
## 9 Cressington 14503 7094
## 10 Croxteth 14561 6992
## 11 Everton 14782 5517
## 12 Fazakerley 16786 7879
## 13 Greenbank 16132 8990
## 14 Kensington and Fairfield 15377 6495
## 15 Kirkdale 16115 6662
## 16 Knotty Ash 13312 5981
## 17 Mossley Hill 13816 7322
## 18 Norris Green 15047 6529
## 19 Old Swan 16461 7192
## 20 Picton 17009 7953
## 21 Princes Park 17104 7636
## 22 Riverside 18422 9001
## 23 St Michael's 12991 6450
## 24 Speke-Garston 20300 8973
## 25 Tuebrook and Stoneycroft 16489 7302
## 26 Warbreck 16481 7521
## 27 Wavertree 14772 7268
## 28 West Derby 14382 7013
## 29 Woolton 12921 6025
## 30 Yew Tree 16746 7717
str(df) # or use glimpse(df) from the dplyr package
## 'data.frame': 30 obs. of 3 variables:
## $ wards : chr "Allerton and Hunts Cross" "Anfield" "Belle Vale" "Central" ...
## $ pop : num 14853 14510 15004 20340 13908 ...
## $ ghealth: num 7274 6124 6129 11925 7219 ...
Throughout this module, you will need to refer to particular parts of a data frame - perhaps a particular column (an area attribute); or a particular subset of respondents. Hence it is worth spending some time now mastering this particular skill.
The relevant R function, `[ ]`, has the format `[row,col]` or, more generally, `[set of rows, set of cols]`.
Run the following commands to get a feel of how to extract different slices of the data:
df # whole data.frame
df[1, 1] # contents of first row and column
df[2, 2:3] # contents of the second row, second and third columns
df[1, ] # first row, ALL columns [the default if no columns specified]
df[ ,1:2] # ALL rows; first and second columns
df[c(1,3,5), ] # rows 1,3,5; ALL columns
df[ , 2] # ALL rows; second column (by default results containing only
#one column are converted back into a vector)
df[ , 2, drop=FALSE] # ALL rows; second column (returned as a data.frame)
In the above, note that we have used two other R functions:
1:3
The colon operator tells R to produce a sequence of whole numbers, including the named start and end points.
c(1,3,5)
Tells R to combine the values within the brackets into one vector.
Run both of these functions on their own to get a better understanding of what they do.
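Besides selecting rows by position, rows can be selected by a condition, which is one way to obtain a particular subset. The following is a minimal sketch using the first five wards from the data above; the object names `df_small`, `rows_135` and `big_wards` are introduced here only for illustration.

```r
# Small data frame using the first five wards from the example above
df_small <- data.frame(
  wards   = c("Allerton and Hunts Cross", "Anfield", "Belle Vale",
              "Central", "Childwall"),
  pop     = c(14853, 14510, 15004, 20340, 13908),
  ghealth = c(7274, 6124, 6129, 11925, 7219),
  stringsAsFactors = FALSE
)

1:3          # sequence of whole numbers from 1 to 3
c(1, 3, 5)   # vector holding the three values supplied

rows_135 <- df_small[c(1, 3, 5), ]   # rows picked by position

# Rows picked by a condition: wards with more than 15,000 residents
big_wards <- df_small[df_small$pop > 15000, "wards"]
big_wards
```

The condition `df_small$pop > 15000` produces a logical vector with one TRUE or FALSE per row, and only the rows marked TRUE are returned.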
Three other methods for referencing the contents of a data.frame make direct use of the variable names within the data.frame, which tends to make for easier to read/understand code:
df[,"pop"] # variable name in quotes inside the square brackets
df$pop # variable name prefixed with $ and appended to the data.frame name
# or you can use attach
attach(df)
pop # but be careful if you already have a pop variable in your local workspace
To check which variables are available, use `names( )`:
names(df)
## [1] "wards" "pop" "ghealth"
Ensure your memory is clear
rm(list=ls()) # rm for targeted deletion / ls for listing all existing objects
There are many commands to read / load data into R. The command to use will depend upon the format in which the data have been saved. Normally they are saved in csv format from Excel or other software packages. So we use one of:
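df <- read.table("path/file_name.csv", header = FALSE, sep =",") # general reader, separator stated explicitly
df <- read.csv("path/file_name.csv", header = FALSE) # comma-separated values
df <- read.csv2("path/file_name.csv", header = FALSE) # semicolon-separated values with decimal commas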
To read files in other formats, refer to this useful DataCamp tutorial
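The readers above can be tried without any external file. This sketch writes two tiny files to a temporary location and reads them back; the file contents are made-up sample rows, and `header = TRUE` is used because these files do have a header row.

```r
# Write a small comma-separated file to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c("ward,pop,ghealth",
             "Anfield,14510,6124",
             "Central,20340,11925"), csv_path)

# read.csv() assumes a header row and a comma separator
a <- read.csv(csv_path)

# read.table() needs both stated explicitly
b <- read.table(csv_path, header = TRUE, sep = ",")

names(a)

# read.csv2() is for files that use ';' as separator and ',' as decimal mark
csv2_path <- tempfile(fileext = ".csv")
writeLines(c("ward;share", "Anfield;0,42"), csv2_path)
c2 <- read.csv2(csv2_path)
c2$share
```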
census <- read.csv("../data/census_data.csv")
head(census)
## code ward pop16_74 higher_managerial pop ghealth
## 1 E05000886 Allerton and Hunts Cross 10930 1103 14853 7274
## 2 E05000887 Anfield 10712 312 14510 6124
## 3 E05000888 Belle Vale 10987 432 15004 6129
## 4 E05000889 Central 19174 1346 20340 11925
## 5 E05000890 Childwall 10410 1123 13908 7219
## 6 E05000891 Church 10569 1843 13974 7461
# NOTE: always ensure you are setting the correct directory leading to the data.
# It may differ from your existing working directory
Quickly inspect the data:
1. What class?
2. What R data types?
3. What data types?
# 1
class(census)
# 2 & 3
str(census)
Just interested in the variable names:
names(census)
## [1] "code" "ward" "pop16_74"
## [4] "higher_managerial" "pop" "ghealth"
or want to view the data:
View(census)
tidyverse
Usually you want to add new variables to your data frame using existing variables, e.g. computing percentages by dividing two variables. There are many ways in which you can do this, e.g. referencing a data frame as we have done above, or using `$` (e.g. `census$pop`). For this module, we’ll use `tidyverse`:
library(tidyverse) # provides the pipe %>% and mutate()
census <- census %>% mutate(per_ghealth = ghealth / pop)
Note we used a pipe operator `%>%`, which helps make the code more efficient and readable; for more details, see Grolemund and Wickham (2019). When using the pipe operator, remember to first indicate the data frame before `%>%`.
Grolemund, Garrett, and Hadley Wickham. 2019. R for Data Science. O’Reilly, US.
Note also the use of a variable name before the `=` sign, inside the brackets after `mutate`, to indicate the name of the new variable.
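For comparison, here is a minimal sketch of the base R route mentioned above, which references columns with `$`. The three-row data frame `census_small` is a cut-down stand-in for `census`, and `pct_ghealth` is a hypothetical second variable added only to illustrate `transform()`.

```r
census_small <- data.frame(
  ward    = c("Allerton and Hunts Cross", "Anfield", "Central"),
  pop     = c(14853, 14510, 20340),
  ghealth = c(7274, 6124, 11925)
)

# Create the new column by referencing existing columns with $
census_small$per_ghealth <- census_small$ghealth / census_small$pop

# transform() does the same without repeating the data frame name
census_alt <- transform(census_small, pct_ghealth = 100 * ghealth / pop)

round(census_small$per_ghealth, 3)
names(census_alt)
```

Both routes give the same values as `mutate`; the tidyverse version tends to read more cleanly once several steps are chained together.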
Usually you want to select a subset of variables for your analysis, as storing too large data sets in your R memory can reduce the processing speed of your machine. A selection of data can be achieved by using the `select` function:
ndf <- census %>% select(ward, pop16_74, per_ghealth)
Again, first indicate the data frame and then the variables you want to select to build a new data frame. Note the code chunk above has created a new data frame called `ndf`. Explore it.
You may also want to filter values based on defined conditions. You may want to filter observations greater than a certain threshold or only areas within a certain region. For example, you may want to select areas with a percentage of good health population over 50%:
ndf2 <- census %>% filter(per_ghealth > 0.5)
You can use more than one variable to set conditions. Use “,” to add a condition.
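As a sketch of filtering on two conditions at once, assuming the dplyr package (part of the tidyverse) is installed: the four-row data frame is a cut-down stand-in for `census`, and the 15,000 population threshold is an arbitrary value chosen for illustration.

```r
library(dplyr)

census_small <- data.frame(
  ward    = c("Allerton and Hunts Cross", "Anfield", "Central", "Childwall"),
  pop     = c(14853, 14510, 20340, 13908),
  ghealth = c(7274, 6124, 11925, 7219)
) %>%
  mutate(per_ghealth = ghealth / pop)

# Conditions separated by a comma must ALL hold (logical AND)
healthy_large <- census_small %>%
  filter(per_ghealth > 0.5, pop > 15000)

# Use | inside a single condition for logical OR
healthy_or_large <- census_small %>%
  filter(per_ghealth > 0.5 | pop > 15000)

healthy_large$ward
nrow(healthy_or_large)
```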
When working with spatial data, we often need to join data. To this end, you need a common unique `id variable`. Let’s say we want to add a data frame containing census data on households for Liverpool, and join the new attributes to one of the existing data frames in the workspace. First we will read the data frame we want to join (i.e. `census_data2.csv`).
# read data
census2 <- read.csv("../data/census_data2.csv")
# visualise data structure
str(census2)
## 'data.frame': 30 obs. of 3 variables:
## $ geo_code : chr "E05000886" "E05000887" "E05000888" "E05000889" ...
## $ households : int 6359 6622 6622 7139 5391 5884 6576 6745 6317 6024 ...
## $ socialrented_households: int 827 1508 2818 1311 374 178 2859 1564 1023 1558 ...
The variable `geo_code` in this data frame corresponds to the `code` in the existing data frame and they are unique, so they can be automatically matched by using the `merge()` function. The `merge()` function uses two arguments: `x` and `y`. The former refers to data frame 1 and the latter to data frame 2. Both of these data frames must have an `id` variable containing the same information. Note they can have different names. Another key argument to include is `all.x=TRUE`, which tells the function to keep all the records in `x`, but only those in `y` that match, in case there are discrepancies in the `id` variable.
# join data frames
join_dfs <- merge(census, census2, by.x="code", by.y="geo_code", all.x = TRUE)
# check data
head(join_dfs)
## code ward pop16_74 higher_managerial pop ghealth
## 1 E05000886 Allerton and Hunts Cross 10930 1103 14853 7274
## 2 E05000887 Anfield 10712 312 14510 6124
## 3 E05000888 Belle Vale 10987 432 15004 6129
## 4 E05000889 Central 19174 1346 20340 11925
## 5 E05000890 Childwall 10410 1123 13908 7219
## 6 E05000891 Church 10569 1843 13974 7461
## per_ghealth households socialrented_households
## 1 0.4897327 6359 827
## 2 0.4220538 6622 1508
## 3 0.4084911 6622 2818
## 4 0.5862832 7139 1311
## 5 0.5190538 5391 374
## 6 0.5339201 5884 178
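The effect of `all.x = TRUE` is easiest to see when the two data frames do not match perfectly. In this sketch the three ward codes and their values come from the data above, but the missing third ward in `right` is a made-up gap introduced for illustration.

```r
left <- data.frame(code = c("E05000886", "E05000887", "E05000888"),
                   pop  = c(14853, 14510, 15004))

# The second table uses a different key name and lacks the third ward
right <- data.frame(geo_code   = c("E05000886", "E05000887"),
                    households = c(6359, 6622))

keep_all <- merge(left, right, by.x = "code", by.y = "geo_code", all.x = TRUE)
matched  <- merge(left, right, by.x = "code", by.y = "geo_code")

nrow(keep_all)               # every row of left is kept
is.na(keep_all$households)   # the unmatched ward is filled with NA
nrow(matched)                # without all.x, only matching rows remain
```

Checking for `NA` values after a join is a quick way to spot discrepancies between the two id variables.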
It may also be convenient to save your R workspace, which contains all the objects that you have created, by using the `save.image( )` function:
save.image("gds.RData")
This creates a file labelled “gds.RData” in your working directory. You can load this at a later stage using the `load( )` function.
load("gds.RData")
Alternatively you can save / export your data into a `csv` file. The first argument in the function is the object name, and the second is the name of the csv file we want to create.
write.csv(join_dfs, "join_censusdfs.csv")
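One detail worth knowing is that `write.csv()` also writes the row names as an extra first column unless told otherwise. The sketch below, which uses a made-up two-row data frame and a temporary file so nothing is overwritten, shows the `row.names = FALSE` argument and reads the file back to check the result.

```r
out <- data.frame(code       = c("E05000886", "E05000887"),
                  households = c(6359, 6622))

csv_path <- tempfile(fileext = ".csv")   # temporary file path

# row.names = FALSE leaves the row numbers out of the file
write.csv(out, csv_path, row.names = FALSE)

back <- read.csv(csv_path)   # read the file back in
names(back)
nrow(back)
```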