This session introduces the structure, tools and key concepts that we will use throughout the course. While this first session may seem to revolve around basic ideas, they provide the foundation for the rest. Let’s start with a general definition of Statistics.
Note: This session is part of Introduction to Statistical Learning in R. Introduction – Data Types & Probability Distributions by Francisco Rowe is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Statistics =
Descriptive Statistics +
Inferential Statistics
Descriptive Statistics: organise, summarise, describe and present quantitative data.
Inferential Statistics: draw conclusions about a population by examining a small representative sample - subject to sampling error.
R is a freely available language and environment for statistical computing and graphics which provides a wide variety of statistical and graphical techniques. It has gained widespread use in academia and industry. R offers a wider array of functionality than a traditional statistics package such as SPSS. It is composed of core (base) functionality and is expandable through libraries hosted on CRAN, a network of FTP and web servers around the world that store identical, up-to-date versions of code and documentation for R.
Commands are sent to R using either the terminal / command line or the R Console, which is installed with R on either Windows or OS X. On Linux, there is no equivalent of the console; however, third-party solutions exist. On your own machine, R can be installed from here.
Normally RStudio is used to implement R coding. RStudio is an integrated development environment (IDE) for R and provides a more user-friendly front-end to R than the front-end provided with R.
To run R or RStudio, just double click on the R or RStudio icon. Throughout this course, we will be using RStudio:
If you would like to know more about the various features of RStudio, watch this video
Before we start any analysis, make sure to set the path to the directory where we are working. We can easily do that with setwd(). In the following line, please replace the path with that of the folder where you have placed this file, and where the data folder lives.
#setwd('path/to/your/folder') # placeholder path; setwd() expects a folder, not a file such as a .csv
#setwd('.')
Note: It is good practice to not include spaces when naming folders and files. Use underscores or dots.
You can check your current working directory by typing:
getwd()
## [1] "/Users/franciscorowe/Dropbox/Francisco/Research/github_projects/sl/code"
The Console window provides a means of entering commands for immediate execution.
To demonstrate, we will use the Console window to introduce the use of R as a simple calculator.
In the Console window, type the sum:
Hit enter to find the result.
The main mathematical operators and functions used by R are illustrated below:
20 / 10
## [1] 2
20 * 10
## [1] 200
20 + 10
## [1] 30
20 - 10
## [1] 10
20^10
## [1] 1.024e+13
sqrt(20)
## [1] 4.472136
log(20)
## [1] 2.995732
R uses something known as operator precedence: some mathematical operations, such as multiplication, are undertaken before other lower priority operations, such as addition. Use brackets () around the operations you want R to perform first.
log(20+10*(4/2))
## [1] 3.688879
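To see precedence at work, the short sketch below compares the same numbers with and without brackets. The values are arbitrary and chosen purely for illustration.

```r
# Multiplication is evaluated before addition
2 + 3 * 4      # 14
# Brackets force the addition to happen first
(2 + 3) * 4    # 20
# Exponentiation is evaluated before the unary minus
-2^2           # -4
(-2)^2         # 4
```

When in doubt, adding brackets costs nothing and makes the intended order explicit to anyone reading the code.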
An R script is a series of commands that you can execute at one time, which helps you save time: you don’t have to repeat the same steps every time you want to execute the same process with different datasets. An R script is just a plain text file with R commands in it.
To create an R script in RStudio, you need to
Open a new script file: File > New File > R Script
Write some code in your new script window by typing eg. mtcars
Run the script. Click anywhere on the line of code, then hit Ctrl + Enter (Windows) or Cmd + Enter (Mac) to run the command, or select the code chunk and click Run in the top-right corner of your script window. If you do that, you should get:
mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
An R Notebook is an R Markdown document with descriptive text and code chunks that can be executed independently and interactively, with output visible immediately beneath a code chunk - see Xie, Allaire, and Grolemund (2019).
Reference: Xie, Yihui, JJ Allaire, and Garrett Grolemund. 2019. R Markdown: The Definitive Guide. CRC Press, Taylor & Francis, Chapman & Hall Book. https://bookdown.org/yihui/rmarkdown/
To create an R Notebook, you need to:
Insert code chunks: place your code between the chunk delimiters ```{r} and ```
In a code chunk you can produce text output, tables, graphics and write code! You can control these outputs via chunk options, which are provided inside the curly brackets.
Execute code: hit “Run Current Chunk”, Ctrl + Shift + Enter or Cmd + Shift + Enter (Mac)
Save an R notebook: File > Save As. A notebook has a *.Rmd extension and when it is saved a *.nb.html file is automatically created. The latter is a self-contained HTML file which contains both a rendered copy of the notebook with all current chunk outputs and a copy of the *.Rmd file itself.
RStudio also offers a Preview option on the toolbar which can be used to create PDF, HTML and Word versions of the notebook. To do this, choose Knit to ... from the drop-down menu.
You can use help or ? to ask for details for a specific function:
help(sqrt) #or ?sqrt
And using example provides examples for said function:
Example sqrt
example(sqrt)
##
## sqrt> require(stats) # for spline
##
## sqrt> require(graphics)
##
## sqrt> xx <- -9:9
##
## sqrt> plot(xx, sqrt(abs(xx)), col = "red")
##
## sqrt> lines(spline(xx, sqrt(abs(xx)), n=101), col = "pink")
An object is a data structure having attributes and methods. In fact, everything in R is an object!
A variable is a type of data object. Data objects also include lists, vectors, matrices and text.
In R a variable can be created by using the symbol <- to assign a value to a variable name. The variable name is entered on the left of <- and the value on the right. Note: Data objects can be given any name, provided that they start with a letter of the alphabet, and include only letters of the alphabet, numbers and the characters . and _. Hence AgeGroup, Age_Group and Age.Group are all valid names for an R data object. Note also that R is case-sensitive, so agegroup and AgeGroup would be treated as different data objects.
To save the value 28 to a variable (data object) labelled age, run the code:
age <- 28
To inspect the contents of the data object age run the following line of code:
age
## [1] 28
Find out what kind (class) of data object age is using:
class(age)
## [1] "numeric"
Inspect the structure of the age data object:
str(age)
## num 28
What if we have more than one response? We can use the c( ) function to combine multiple values into one data vector object:
age <- c(28, 36, 25, 24, 32)
age
## [1] 28 36 25 24 32
class(age) #Still numeric..
## [1] "numeric"
str(age) #..but now a vector (set) of 5 separate values
## num [1:5] 28 36 25 24 32
Note that on each line in the code above any text following the # character is ignored by R when executing the code. Instead, text following a # can be used to add comments to the code to make clear what the code is doing. Two marks of good code are a clear layout and clear commentary on the code.
There are a number of data types. Four are the most common: numeric, integer, character and logical. In R, numeric is the default type for numbers. It stores all numbers as double-precision floating-point numbers (numbers with decimals), since most statistical calculations involve decimal values.
num <- 4.5 # Decimal values
class(num)
## [1] "numeric"
int <- as.integer(4) # Whole numbers. Note integers are also numeric: is.numeric(int) returns TRUE.
class(int)
## [1] "integer"
cha <- "are you enjoying this?" # text or string. You can also type `as.character("are you enjoying this?")`
class(cha)
## [1] "character"
log <- 2 < 1 # assigns TRUE or FALSE. In this case, FALSE as 2 is greater than 1
log
## [1] FALSE
class(log)
## [1] "logical"
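The four types above can be checked and converted with the is.* and as.* families of functions. The sketch below uses made-up values (whole and mixed are hypothetical names) to show a type check and what happens when types are mixed in one vector.

```r
# Hypothetical values, used only to illustrate type checks and coercion
whole <- 4L                 # the L suffix creates an integer directly
is.integer(whole)           # TRUE
is.numeric(whole)           # TRUE: integers also count as numeric

mixed <- c(1, "a", TRUE)    # mixing types in one vector coerces everything to character
class(mixed)                # "character"

as.numeric("4.5") + 1       # text converted to a number: 5.5
```

A vector holds a single type, so R silently converts mixed input to the most general type present. This is worth remembering when a column of numbers unexpectedly shows up as character.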
TASK #1: Create a variable called income, with the following five respondent values: high, low, low, middle, high.
In statistics, we differentiate between data to capture:
Qualitative attributes, which categorise objects eg. gender, marital status. To measure these attributes, we use categorical data, which can be divided into nominal and ordinal data.
Quantitative attributes, which are measured using discrete or continuous data.
In R these types of random variables are represented by the following types of R data object:
variables | objects |
---|---|
nominal | factor |
ordinal | ordered factor |
discrete | numeric |
continuous | numeric |
We have already encountered the R data type numeric. The next section introduces the factor data type.
A factor variable assigns a numeric code to each possible category (level) in a variable. Behind the scenes, R stores the variable using these numeric codes to save space and speed up computing. For example, compare the size of a list of 10,000 males and females to a list of 10,000 1s and 0s. At the same time R also saves the category names associated with each numeric code (level). These are used for display purposes.
For example, the variable gender, converted to a factor, would be stored as a series of 1s and 2s, where 1 = female and 2 = male; but would be displayed in all outputs using their category labels of female and male.
To convert a numeric or character vector into a factor use the factor( ) function. For instance:
gender <- c("female","male","male","female","female") # create a gender variable
gender <- factor(gender) # replace character vector with a factor version
gender
## [1] female male male female female
## Levels: female male
class(gender)
## [1] "factor"
str(gender)
## Factor w/ 2 levels "female","male": 1 2 2 1 1
Now gender is a factor and is stored as a series of 1s and 2s, with 1s representing females and 2s representing males. The function levels( ) lists the levels (categories) associated with a given factor variable:
levels(gender)
## [1] "female" "male"
The categories are reported in the order that they have been numbered (starting from 1). Hence from the output we can infer that females are coded as 1, and males as 2.
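The integer codes behind a factor can also be inspected directly. The sketch below uses a hypothetical vector sex_chr (not part of the survey data) to display the codes and to compare the memory used by the character and factor versions. Exact sizes depend on your platform, so none are quoted here.

```r
# Hypothetical example vector, named differently to avoid overwriting gender
sex_chr <- rep(c("female", "male"), times = 5000)  # 10,000 text values
sex_fct <- factor(sex_chr)                         # the same data as a factor

head(as.integer(sex_fct))   # underlying numeric codes: 1 = female, 2 = male
levels(sex_fct)             # labels attached to codes 1 and 2

# Compare the memory footprint of the two representations
object.size(sex_chr)
object.size(sex_fct)
```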
By default the levels of the factor (variable categories) are allocated in alphabetical order. Hence in the example above female = 1 and male = 2.
Sometimes an alternative ordering is required, for example male = 1 and female = 2.
For nominal variables the solution is to specify the required order of the levels when calling the factor( ) function via the levels argument:
gender2 <- factor(gender, levels= c("male", "female"))
gender2
## [1] female male male female female
## Levels: male female
For ordinal variables, such as income (income bracket), we create an ordered factor by calling the ordered( ) rather than factor( ) function, including the levels argument, which specifies the required category order:
income <- ordered(income, levels = c("low", "middle", "high"))
income
## [1] high low low middle high
## Levels: low < middle < high
class(income)
## [1] "ordered" "factor"
str(income)
## Ord.factor w/ 3 levels "low"<"middle"<..: 3 1 1 2 3
levels(income)
## [1] "low" "middle" "high"
Note that if we didn’t use the levels argument, then the default behaviour of ordered( ), like factor( ), is to order the categories alphabetically.
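A brief sketch of why the order matters, using a hypothetical variable size rather than the survey data: once the levels are ranked, comparisons such as "at least medium" become meaningful.

```r
# Hypothetical ordinal variable, used only for illustration
size <- c("small", "large", "medium", "small")

# Without levels, the categories are ordered alphabetically
levels(ordered(size))       # "large" "medium" "small"

# With levels, the intended ranking is respected
size_ord <- ordered(size, levels = c("small", "medium", "large"))
size_ord >= "medium"        # FALSE TRUE TRUE FALSE
max(size_ord)               # large
```

With the alphabetical default, "small" would rank above "large", so comparisons and sorted output would be misleading.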
TASK #2: Run the following line of code, then convert the resulting variable into a factor with the categories ordered car, train, bus, bicycle:
travel_mode <- c("train", "bicycle", "bus", "car", "car")
R stores different types of data using different types of data structure. Survey data are normally stored as a data.frame. A data.frame for a survey contains one row per respondent and one column per respondent attribute (eg. age, gender and income).
For example, if we have:
age <- c(28, 36, 25, 24, 32)
gender <- c("female", "male", "male", "female", "female")
income <- c("high", "low", "low", "middle", "high")
We can create a data frame and examine its structure:
df <- data.frame(age, gender, income)
df # or use View(df)
## age gender income
## 1 28 female high
## 2 36 male low
## 3 25 male low
## 4 24 female middle
## 5 32 female high
str(df) # or use glimpse(df) from the dplyr package
## 'data.frame': 5 obs. of 3 variables:
## $ age : num 28 36 25 24 32
## $ gender: chr "female" "male" "male" "female" ...
## $ income: chr "high" "low" "low" "middle" ...
Throughout this course, you will need to refer to particular parts of a dataframe - perhaps a particular column (respondent attribute); or a particular subset of respondents. Hence it is worth spending some time now mastering this particular skill.
The relevant R function, [ ], has the format [row,col] or, more generally, [set of rows, set of cols].
Run the following commands to get a feel of how to extract different slices of the survey data:
df # whole data.frame
## age gender income
## 1 28 female high
## 2 36 male low
## 3 25 male low
## 4 24 female middle
## 5 32 female high
df[1, 1] # contents of first row and column
## [1] 28
df[2, 2:3] # contents of the second row, second and third columns
## gender income
## 2 male low
df[1, ] # first row, ALL columns [the default if no columns specified]
## age gender income
## 1 28 female high
df[ ,1:2] # ALL rows; first and second columns
## age gender
## 1 28 female
## 2 36 male
## 3 25 male
## 4 24 female
## 5 32 female
df[c(1,3,5), ] # rows 1,3,5; ALL columns
## age gender income
## 1 28 female high
## 3 25 male low
## 5 32 female high
df[ , 2] # ALL rows; second column (by default results containing only
# one column are converted back into a vector)
## [1] "female" "male" "male" "female" "female"
df[ , 2, drop=FALSE] # ALL rows; second column (returned as a data.frame)
## gender
## 1 female
## 2 male
## 3 male
## 4 female
## 5 female
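Rows can also be selected by a condition rather than by position, which is how a particular subset of respondents is usually extracted. The sketch below rebuilds the same five respondents under a hypothetical name, survey, so that df is left untouched.

```r
# Hypothetical copy of the survey data, so the existing df is not modified
survey <- data.frame(
  age    = c(28, 36, 25, 24, 32),
  gender = c("female", "male", "male", "female", "female"),
  income = c("high", "low", "low", "middle", "high")
)

survey$age > 30                           # logical vector: one TRUE/FALSE per row
survey[survey$age > 30, ]                 # rows where the condition is TRUE
survey[survey$gender == "female", "age"]  # ages of female respondents only
```

The condition goes in the row position of [ , ], and R keeps only the rows for which it evaluates to TRUE.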
In the above, note that we have used two other R functions:
1:3
The colon operator tells R to produce a sequence of whole numbers including the named start and end points.
c(1,3,5)
Tells R to combine the contents within the brackets into one vector.
Run both of these functions on their own to get a better understanding of what they do.
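A few variations to try in the Console. The numbers are arbitrary and only meant to illustrate how the two functions behave and combine.

```r
1:3                 # sequence 1 2 3
5:2                 # also works in descending order: 5 4 3 2
c(1, 3, 5)          # three separate values combined into one vector
c(1:3, 10)          # the two can be mixed: 1 2 3 10
length(c(1:3, 10))  # number of elements: 4
```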
Three other methods for referencing the contents of a data.frame make direct use of the variable names within the data.frame, which tends to make for easier to read/understand code:
df[,"age"] # variable name in quotes inside the square brackets
## [1] 28 36 25 24 32
df$age # variable name prefixed with $ and appended to the data.frame name
## [1] 28 36 25 24 32
# or you can use attach
attach(df)
## The following objects are masked _by_ .GlobalEnv:
##
## age, gender, income
age # but be careful if you already have an age variable in your local workspace
## [1] 28 36 25 24 32
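The masking message above is the main risk of attach(). One alternative, offered here as a suggestion rather than as part of the original workflow, is with(), which looks up column names inside the data frame for a single expression only. The data frame survey below is hypothetical.

```r
# Hypothetical data frame, used only for illustration
survey <- data.frame(age = c(28, 36, 25, 24, 32))

# with() evaluates the expression using the columns of survey,
# without attaching anything to the search path
with(survey, mean(age))      # 29
with(survey, age[age > 30])  # 36 32
```

Because nothing stays attached, there is no need to remember to call detach() afterwards.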
To check the variables available, use the names( ) function:
names(df)
## [1] "age" "gender" "income"
TASK #3: Given the above, can you find three different ways of extracting the income of the first and third respondents in your survey dataset?
We will use the Quarterly Labour Force Survey (QLFS). QLFS is a quarterly sample survey of households living at private addresses in the United Kingdom. The survey is conducted by the Office for National Statistics. Its purpose is to provide information on the UK labour market. We will be using the file ‘qlfs.Rdata’, which contains a small sub-set of the information collected by the QLFS in the first quarter (January to March) 2012. For the purposes of this course, I have cleaned and pruned the original dataset, and saved the resulting file in R format (.Rdata). The data and relevant documentation are available in the data folder.
Ensure your memory is clear
rm(list=ls()) # rm for targeted deletion / ls for listing all existing objects
There are many commands to read / load data into R. The command to use will depend upon the format in which the data have been saved. Normally they are saved in csv format from Excel or other software packages. So we use either:
df <- read.table("path/file_name.csv", header = FALSE, sep =",")
df <- read.csv("path/file_name.csv", header = FALSE) # comma-separated values
df <- read.csv2("path/file_name.csv", header = FALSE) # semicolon-separated values with decimal commas
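To try these commands without a data file at hand, the sketch below writes a small made-up table to a temporary csv file and reads it back. The names tmp, toy and toy_in are hypothetical, and the example also shows the effect of the header argument.

```r
# Write a small hypothetical table to a temporary csv file
tmp <- tempfile(fileext = ".csv")
toy <- data.frame(age = c(28, 36), income = c("high", "low"))
write.csv(toy, tmp, row.names = FALSE)

# The file has a header row, so header = TRUE (the default for read.csv)
toy_in <- read.csv(tmp, header = TRUE)
str(toy_in)

# With header = FALSE the column names are read as the first row of data
nrow(read.csv(tmp, header = FALSE))  # 3 rows instead of 2

unlink(tmp)  # remove the temporary file
```

Checking str() straight after reading a file is a quick way to confirm that the header and column types were interpreted as intended.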
To read files in other formats, refer to this useful DataCamp tutorial.
Because I have already saved the data in R format, we can use load():
load("../data/data_qlfs.RData")
# NOTE: always ensure you are setting the correct directory leading to the data.
# It may differ from your existing working directory
What class?
What R data types?
What Survey data types?
# 1
class(qlfs)
# 2 & 3
str(qlfs)
Just interested in the variable names:
names(qlfs)
## [1] "Case_ID" "Hhold_ID" "Family_ID" "Person_ID"
## [5] "HHoldSize" "FamilySize" "Age" "AgeGroup"
## [9] "Sex" "MaritalStatus" "EthnicGroup" "CountryOfBirth"
## [13] "Religion" "WorkStatus" "NSSEC" "HighestQual"
## [17] "DegreeClass" "HoursWorked" "GrossPay" "NetPay"
## [21] "LastMoved" "TravelMode" "TravelTime" "GovtRegion"
## [25] "Tenure" "FamilyType" "YoungestChild" "FamilyToddlers"
or want to view the data:
View(qlfs)
We can think of two types of data distributions: theoretical distributions and empirical distributions.
The four most commonly used data distributions in Social Sciences are:
Continuous probability distributions: Normal and Student’s t.
Discrete probability distributions: Binomial and Poisson.
Understanding probability distributions
Random variables can have a set of different values. Every value has a probability of occurrence. The probabilities of the values form a probability distribution for the random variable. If you want to learn more about probabilities, this is an excellent course: Statistics 110
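As a small illustration of this idea, the sketch below tabulates a Binomial distribution for a made-up experiment (three tosses of a fair coin) and computes one probability for a standard Normal variable. The variable names heads and probs are hypothetical.

```r
# Hypothetical experiment: number of heads in 3 tosses of a fair coin
heads <- 0:3
probs <- dbinom(heads, size = 3, prob = 0.5)  # P(X = 0), ..., P(X = 3)

data.frame(heads, probs)  # the probability distribution as a table
sum(probs)                # probabilities over all possible values add up to 1

# For a continuous variable, probabilities refer to ranges of values:
# P(-1.96 < Z < 1.96) for a standard Normal variable, roughly 0.95
pnorm(1.96) - pnorm(-1.96)
```

R follows the same naming pattern for other distributions: a d prefix for the density or probability of a value and a p prefix for cumulative probabilities, as in dpois() and pt().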
Let’s say we want to understand the gender and ethnic composition of the population in our data.
# attach our dataset (ie. qlfs) to the R search path.
attach(qlfs)
# create table with counts
counts <- table(EthnicGroup, Sex)
counts
## Sex
## EthnicGroup Male Female
## White 34672 37951
## Mixed/multiple ethnic background 267 360
## Indian 836 851
## Pakistani 521 534
## Bangladeshi 190 201
## Chinese 143 204
## Other Asian background 360 420
## Black/African/Caribbean/Black British 755 964
## Other ethnic group 479 511
# row proportion
prop.table(counts, 1)
## Sex
## EthnicGroup Male Female
## White 0.4774245 0.5225755
## Mixed/multiple ethnic background 0.4258373 0.5741627
## Indian 0.4955542 0.5044458
## Pakistani 0.4938389 0.5061611
## Bangladeshi 0.4859335 0.5140665
## Chinese 0.4121037 0.5878963
## Other Asian background 0.4615385 0.5384615
## Black/African/Caribbean/Black British 0.4392088 0.5607912
## Other ethnic group 0.4838384 0.5161616
# column proportion
prop.table(counts, 2)
# missing information on ethnicity by sex
table(Sex, is.na(EthnicGroup))
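The second argument of prop.table() decides which totals are used as the denominator. The sketch below uses made-up counts (toy is a hypothetical table, not taken from the QLFS) to show that 1 gives proportions within each row and 2 within each column.

```r
# Hypothetical counts, used only to illustrate the margin argument
toy <- matrix(c(30, 70, 20, 80), nrow = 2, byrow = TRUE,
              dimnames = list(group = c("A", "B"), sex = c("Male", "Female")))

row_prop <- prop.table(toy, 1)  # each row sums to 1
col_prop <- prop.table(toy, 2)  # each column sums to 1

rowSums(row_prop)
colSums(col_prop)
round(100 * row_prop, 1)        # row percentages
```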
You can also use bar graphs for categorical variables using ggplot, which has a basic structure of three components:
The data
Geometries
Aesthetic mapping:
In terms of code, these components are structured:
ggplot( data = *data frame*) +
geom_xxx( aes(x=*variable*, y=*variable*) )
For more details, refer to https://rafalab.github.io/dsbook/ggplot2.html
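A minimal sketch of the three components, assuming the ggplot2 package is installed and using mtcars (a dataset shipped with R) instead of the survey data:

```r
library(ggplot2)  # assumes the ggplot2 package is installed

# 1) the data: mtcars, a dataset shipped with R
# 2) the geometry: bars
# 3) the aesthetic mapping: number of cylinders on the x axis
p <- ggplot(data = mtcars) +
  geom_bar(aes(x = factor(cyl)))

p  # printing the object draws the plot
```

Storing the plot in an object, here the hypothetical name p, makes it easy to add further layers later with +.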
eg.Distribution by ethnicity
# one variable
library(ggplot2) # provides ggplot(); install once with install.packages("ggplot2")
ggplot(data=qlfs) +
geom_bar( aes( x= EthnicGroup) )
Distribution by sex
# two variables: split by sex
ggplot(data=qlfs) +
geom_bar( aes( x= EthnicGroup, fill= Sex) )
Of course, there are various options to make your plots look pretty. Look at the various options here: https://ggplot2.tidyverse.org/reference/
A quick way to look at the distribution of a continuous variable is to create a histogram:
Histogram of Net Pay
# create histogram
hist(NetPay)
We can draw a better version using ggplot but first let’s create a data frame excluding all negative and non-finite values:
# create new data frame
library(dplyr) # provides filter() and the pipe operator %>%
df <- qlfs %>% filter(!is.na(NetPay)) %>%
filter(NetPay >= 0)
# remember to:
detach(qlfs)
# and:
attach(df)
Note we used a pipe operator to make the code more efficient and readable - for more details, see Grolemund and Wickham (2019).
Reference: Grolemund, Garrett, and Hadley Wickham. 2019. R for Data Science. O’Reilly, US. https://r4ds.had.co.nz.
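To make the idea of piping concrete, the sketch below compares nested calls with a piped version on made-up data. It deliberately swaps in the base R pipe |> (available from R 4.1) and subset() in place of %>% and filter(), so that it runs without extra packages. The data frame pay is hypothetical.

```r
# Hypothetical pay data, used only to illustrate piping
pay <- data.frame(NetPay = c(250, NA, -9, 400))

# Nested calls are read from the inside out
nested <- subset(subset(pay, !is.na(NetPay)), NetPay >= 0)

# Piped calls are read top to bottom; |> is the base R pipe (R 4.1 or later)
piped <- pay |>
  subset(!is.na(NetPay)) |>
  subset(NetPay >= 0)

identical(nested, piped)  # TRUE
piped$NetPay              # 250 400
```

Both versions do the same work. The pipe simply passes the result on its left as the first argument of the function on its right.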
Histogram of Net Pay by Sex
# create new histogram
# 1) the base
hist_pay <- ggplot(data = df) +
# 2) the histogram
geom_histogram(bins = 100, aes(x = NetPay, y = ..density..)) +
# 3) overlay density plot
geom_density(alpha=0.5, colour="#FF6666", aes(x = NetPay))
hist_pay
Another way to visualise data distributions is using boxplots:
Boxplot of Net Pay by Sex
box_pay <- ggplot(data = df) +
geom_boxplot(aes(x = Sex, y= NetPay))
box_pay
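The numbers summarised by a boxplot can be inspected with base R's boxplot.stats(). The sketch below uses made-up values (pay_values is hypothetical). Note that ggplot2 computes its box from quartiles, which can differ slightly from the hinges used here.

```r
# Hypothetical pay values, with one unusually large observation
pay_values <- c(1, 2, 3, 4, 100)

bstats <- boxplot.stats(pay_values)
bstats$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
bstats$out    # points beyond the whiskers, drawn individually: 100
```

Points listed in out are those lying more than 1.5 times the box height beyond the hinges, which is why they appear as separate dots on the plot.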
Exercise to work on your own
Function | Description |
---|---|
setwd() | set working directory |
getwd() | show current working directory |
help() | open the help page for a function |
class() | check data class |
str() | inspect data structure |
c() | combine values into one vector |
factor() | create a factor variable |
levels() | ask for levels of a variable |
data.frame() | create a data frame |
View() | open data frame |
attach()/detach() | attach/detach a data frame |
ggplot() | plot data |