Introduction

Data Types & Probability Distributions

Francisco Rowe

2020-08-31

Note: This session is part of Introduction to Statistical Learning in R. Introduction – Data Types & Probability Distributions by Francisco Rowe is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

This session introduces the structure, tools and key concepts that we will use throughout the course. While this first session may seem to revolve around basic ideas, these ideas provide the foundation for the rest. Let’s start by generally defining Statistics.

Statistics = Descriptive Statistics + Inferential Statistics

1 Introducing R

R is a freely available language and environment for statistical computing and graphics which provides a wide variety of statistical and graphical techniques. It has gained widespread use in academia and industry. R offers a wider array of functionality than a traditional statistics package such as SPSS. It is composed of core (base) functionality and is expandable through libraries hosted on CRAN, a network of FTP and web servers around the world that store identical, up-to-date versions of code and documentation for R.

Commands are sent to R using either the terminal / command line or the R Console, which is installed with R on either Windows or OS X. On Linux, there is no equivalent of the console; however, third-party solutions exist. On your own machine, R can be installed from here.

Normally RStudio is used to implement R coding. RStudio is an integrated development environment (IDE) for R and provides a more user-friendly front-end to R than the front-end provided with R.

To run R or RStudio, just double click on the R or RStudio icon. Throughout this course, we will be using RStudio:

Fig. 1. RStudio features.

If you would like to know more about the various features of RStudio, watch this video

2 Setting the working directory

Before we start any analysis, we need to set the path to the directory where we are working. We can easily do that with setwd(). In the following line, please replace the path with the path to the folder where you have placed this file, and where the data folder lives.

#setwd('path/to/your/folder') # placeholder path: setwd() takes a folder, not a file such as a .csv
#setwd('.')

Note: It is good practice to not include spaces when naming folders and files. Use underscores or dots.

You can check your current working directory by typing:

getwd()
## [1] "/Users/franciscorowe/Dropbox/Francisco/Research/github_projects/sl/code"
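
As a minimal sketch of how setwd() and getwd() work together, the lines below move to a temporary folder and back again. The use of tempdir() and the file name are assumptions made only for illustration; in practice you would point to your own course folder.

```r
# Remember where we started
old_wd <- getwd()

# Move to a temporary folder (tempdir() is used here only for illustration)
setwd(tempdir())
getwd()

# Build a relative path without typing the separators by hand
demo_path <- file.path("data", "my_file.csv")  # hypothetical file name
demo_path

# Return to the original working directory
setwd(old_wd)
```

Storing the original directory first makes it easy to undo the change once you are done.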

3 R as a calculator

3.1 The Console window

The Console window provides a means of entering commands for immediate execution.

To demonstrate, we will use the Console window to introduce the use of R as a simple calculator.

In the Console window, type the sum:

Hit enter to find the result.
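
For instance, a sum such as the following could be typed in the Console (the numbers are arbitrary and chosen only for illustration):

```r
# Type this in the Console and hit enter
2 + 3
## [1] 5
```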

3.2 Mathematical operators

The main mathematical operators and functions used in R are:

20 / 10
## [1] 2
20 * 10
## [1] 200
20 + 10
## [1] 30
20 - 10
## [1] 10
20^10
## [1] 1.024e+13
sqrt(20)
## [1] 4.472136
log(20)
## [1] 2.995732

3.3 Operator precedence

R uses something known as operator precedence: some mathematical operations, such as multiplication, are undertaken before other lower priority operations, such as addition. Use brackets () around the operations you want R to perform first.

log(20+10*(4/2))
## [1] 3.688879
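
To see the effect of precedence more directly, here is a short sketch comparing the same numbers with and without brackets. The values are arbitrary examples.

```r
2 + 3 * 4      # multiplication first: 2 + 12
## [1] 14
(2 + 3) * 4    # brackets first: 5 * 4
## [1] 20
-2^2           # the power is applied before the minus sign: -(2^2)
## [1] -4
(-2)^2         # brackets make the minus sign part of the base
## [1] 4
```

When in doubt, adding brackets costs nothing and makes the intended order explicit.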

4 R Scripts and Notebooks

An R script is a series of commands that you can execute at one time, which helps you save time: you don’t have to repeat the same steps every time you want to run the same process with different datasets. An R script is just a plain text file with R commands in it.

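To create an R script in RStudio, you need to open a new script file (File > New File > R Script), type some code in it, eg. mtcars, and run it. Running mtcars prints a dataset that comes with R: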

mtcars
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

An R Notebook is an R Markdown document with descriptive text and code chunks that can be executed independently and interactively, with output visible immediately beneath a code chunk - see Xie, Allaire, and Grolemund (2019).

Xie, Yihui, JJ Allaire, and Garrett Grolemund. 2019. R Markdown: The Definitive Guide. CRC Press, Taylor & Francis, Chapman & Hall Book. https://bookdown.org/yihui/rmarkdown/.

To create an R Notebook, you need to open a new notebook file (File > New File > R Notebook) and insert code chunks.

Fig. 2. YAML metadata for notebooks.

To insert a code chunk, you can:

  1. use the Insert command on the editor toolbar;
  2. use the keyboard shortcut Ctrl + Alt + I or Cmd + Option + I (Mac); or,
  3. type the chunk delimiters ```{r} and ```

In a code chunk you can produce text output, tables and graphics, and write code! You can control these outputs via chunk options, which are provided inside the curly brackets, eg.

Fig. 3. Code chunk example. Details on the various options: https://rmarkdown.rstudio.com/lesson-3.html

RStudio also offers a Preview option on the toolbar, which can be used to create PDF, HTML and Word versions of the notebook. To do this, choose one of the Knit to ... options from the drop-down menu.

5 Getting Help

You can use help or ? to ask for details for a specific function:

help(sqrt) #or ?sqrt

And using example( ) provides examples for said function:

Fig. Example sqrt.

example(sqrt)
## 
## sqrt> require(stats) # for spline
## 
## sqrt> require(graphics)
## 
## sqrt> xx <- -9:9
## 
## sqrt> plot(xx, sqrt(abs(xx)),  col = "red")
## 
## sqrt> lines(spline(xx, sqrt(abs(xx)), n=101), col = "pink")

6 Variables and objects

An object is a data structure having attributes and methods. In fact, everything in R is an object!

A variable is a type of data object. Data objects also include lists, vectors, matrices and text.

In R a variable can be created by using the symbol <- to assign a value to a variable name. The variable name is entered on the left of <- and the value on the right. Note: Data objects can be given any name, provided that they start with a letter of the alphabet, and include only letters of the alphabet, numbers and the characters . and _. Hence AgeGroup, Age_Group and Age.Group are all valid names for an R data object. Note also that R is case-sensitive, so agegroup and AgeGroup would be treated as different data objects.
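
The naming rules can be checked with a short sketch. The values assigned below are made up for illustration; only the names matter.

```r
# Three valid names, and three separate objects
AgeGroup  <- "25-34"
Age_Group <- "35-44"
Age.Group <- "45-54"

# R is case-sensitive: agegroup is a fourth, separate object
agegroup <- "55-64"

AgeGroup
agegroup
identical(AgeGroup, agegroup)  # FALSE, because they hold different values
```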

To save the value 28 to a variable (data object) labelled age, run the code:

age <- 28

To inspect the contents of the data object age run the following line of code:

age
## [1] 28

Find out what kind (class) of data object age is using:

class(age) 
## [1] "numeric"

Inspect the structure of the age data object:

str(age) 
##  num 28

What if we have more than one response? We can use the c( ) function to combine multiple values into one data vector object:

age <- c(28, 36, 25, 24, 32)
age
## [1] 28 36 25 24 32
class(age) #Still numeric..
## [1] "numeric"
str(age) #..but now a vector (set) of 5 separate values
##  num [1:5] 28 36 25 24 32

Note that on each line in the code above any text following the # character is ignored by R when executing the code. Instead, text following a # can be used to add comments to the code to make clear what the code is doing. Two marks of good code are a clear layout and clear commentary on the code.

7 Basic Data Types

There are a number of data types; four are the most common: numeric, integer, character and logical. In R, numeric is the default type for numbers. It stores all numbers as floating-point numbers (numbers with decimals), because most statistical calculations involve non-integer values.

num <- 4.5 # Decimal values
class(num)
## [1] "numeric"
int <- as.integer(4) # Natural numbers. Note integers are also numerics.
class(int)
## [1] "integer"
cha <- "are you enjoying this?" # text or string. You can also type `as.character("are you enjoying this?")`
class(cha)
## [1] "character"
log <- 2 < 1 # assigns TRUE or FALSE. In this case, FALSE as 2 is greater than 1
log
## [1] FALSE
class(log)
## [1] "logical"
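
A few base R helper functions make it easier to check and convert these types. The sketch below uses new variable names (int_demo and so on) that are not part of the course code.

```r
int_demo <- 4L              # the L suffix creates an integer directly
class(int_demo)             # "integer"
is.numeric(int_demo)        # TRUE: integers also count as numeric

typeof(4.5)                 # "double": the storage type behind class "numeric"

num_demo <- as.numeric("4.5") + 1  # convert text to a number before arithmetic
num_demo

is.logical(2 < 1)           # TRUE: a comparison always returns a logical value
```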

TASK #1: Create a variable called income, with the following five respondent values: high, low, low, middle, high.

8 Random Variables

In statistics, we differentiate between data that capture nominal, ordinal, discrete and continuous variables.

In R these types of random variables are represented by the following types of R data object:

variables    objects
nominal      factor
ordinal      ordered factor
discrete     numeric
continuous   numeric

We have already encountered the R data type numeric. The next section introduces the factor data type.

9 Factor

9.1 What is a factor?

A factor variable assigns a numeric code to each possible category (level) in a variable. Behind the scenes, R stores the variable using these numeric codes to save space and speed up computing. For example, compare the size of a list of 10,000 males and females to a list of 10,000 1s and 0s. At the same time R also saves the category names associated with each numeric code (level). These are used for display purposes.

For example, the variable gender, converted to a factor, would be stored as a series of 1s and 2s, where 1 = female and 2 = male; but would be displayed in all outputs using their category labels of female and male.
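
The codes and labels described above can be inspected directly. This is a minimal sketch using a hypothetical vector called gender_demo; the size comparison at the end is only indicative, as exact sizes depend on your system.

```r
# A small character vector converted to a factor
gender_demo <- factor(c("female", "male", "male", "female", "female"))

as.integer(gender_demo)  # the numeric codes stored behind the scenes
## [1] 1 2 2 1 1
levels(gender_demo)      # the labels used for display
## [1] "female" "male"

# Rough size comparison on a longer vector of 10,000 values
big_chr <- rep(c("female", "male"), times = 5000)
object.size(big_chr)
object.size(factor(big_chr))
```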

9.2 Using factors to define nominal variables

To convert a numeric or character vector into a factor use the factor( ) function. For instance:

gender <- c("female","male","male","female","female") # create a gender variable
gender <- factor(gender) # replace character vector with a factor version
gender
## [1] female male   male   female female
## Levels: female male
class(gender)
## [1] "factor"
str(gender)
##  Factor w/ 2 levels "female","male": 1 2 2 1 1

Now gender is a factor and is stored as a series of 1s and 2s, with 1s representing females and 2s representing males. The function levels( ) lists the levels (categories) associated with a given factor variable:

levels(gender)
## [1] "female" "male"

The categories are reported in the order that they have been numbered (starting from 1). Hence from the output we can infer that females are coded as 1, and males as 2.

9.3 Ordering factor levels for nominal variables

By default the levels of the factor (variable categories) are allocated in alphabetical order. Hence in the example above female = 1 and male = 2.

Sometimes an alternative ordering is required, for example male = 1 and female = 2.

For nominal variables the solution is to specify the required order of the levels when calling the factor( ) function, via its levels argument:

gender2 <- factor(gender, levels= c("male", "female"))
gender2
## [1] female male   male   female female
## Levels: male female
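
Reordering the levels changes the underlying codes but not the displayed values. The sketch below repeats the gender example with new object names (g1, g2) so that it can be run on its own.

```r
g1 <- factor(c("female", "male", "male", "female", "female"))
g2 <- factor(g1, levels = c("male", "female"))

as.integer(g1)  # default alphabetical order: female = 1, male = 2
## [1] 1 2 2 1 1
as.integer(g2)  # requested order: male = 1, female = 2
## [1] 2 1 1 2 2

# The labels themselves are unchanged
as.character(g1) == as.character(g2)
```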

9.4 Using factors to define ordinal variables

For ordinal variables, such as income (income bracket), we create an ordered factor by calling the ordered( ) rather than the factor( ) function, using the levels argument to specify the required category order:

income <- ordered(income, levels = c("low", "middle", "high"))
income
## [1] high   low    low    middle high  
## Levels: low < middle < high
class(income)
## [1] "ordered" "factor"
str(income)
##  Ord.factor w/ 3 levels "low"<"middle"<..: 3 1 1 2 3
levels(income)
## [1] "low"    "middle" "high"

Note that if we didn’t use the levels argument, then the default behaviour of ordered( ), like factor( ), is to order the categories alphabetically.
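
The difference between the default and an explicit ordering can be sketched as follows. The object names (inc_chr, inc_default, inc_ordered) are hypothetical and chosen so as not to overwrite your own income variable.

```r
inc_chr <- c("high", "low", "low", "middle", "high")

# Default: alphabetical order, which is not meaningful here
inc_default <- ordered(inc_chr)
levels(inc_default)
## [1] "high"   "low"    "middle"

# Explicit order: low < middle < high
inc_ordered <- ordered(inc_chr, levels = c("low", "middle", "high"))

# Ordered factors support comparisons that respect the level order
below_high <- inc_ordered < "high"
below_high
## [1] FALSE  TRUE  TRUE  TRUE FALSE
```

Comparisons such as < are only defined for ordered factors, which is one practical reason to declare ordinal variables as such.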

TASK #2: Run the following line of code, then convert the resulting variable into a factor with the categories ordered car, train, bus, bicycle:

travel_mode <- c("train", "bicycle", "bus", "car", "car")

10 Data Frames

R stores different types of data using different types of data structure. Survey data are normally stored as a data.frame. A data.frame for a survey contains one row per respondent and one column per respondent attribute (eg. age, gender and income).

For example, if we have:

age <- c(28, 36, 25, 24, 32)
gender <- c("female", "male", "male", "female", "female")
income <- c("high", "low", "low", "middle", "high")

10.1 Create a data frame

We can create a data frame and examine its structure:

df <- data.frame(age, gender, income)
df # or use View(df)
##   age gender income
## 1  28 female   high
## 2  36   male    low
## 3  25   male    low
## 4  24 female middle
## 5  32 female   high
str(df) # or use glimpse(df) from the dplyr package
## 'data.frame':    5 obs. of  3 variables:
##  $ age   : num  28 36 25 24 32
##  $ gender: chr  "female" "male" "male" "female" ...
##  $ income: chr  "high" "low" "low" "middle" ...

10.2 Referencing data frame

Throughout this course, you will need to refer to particular parts of a dataframe - perhaps a particular column (respondent attribute); or a particular subset of respondents. Hence it is worth spending some time now mastering this particular skill.

The relevant R function, [ ], has the format [row,col] or, more generally, [set of rows, set of cols].

Run the following commands to get a feel of how to extract different slices of the survey data:

df # whole data.frame
##   age gender income
## 1  28 female   high
## 2  36   male    low
## 3  25   male    low
## 4  24 female middle
## 5  32 female   high
df[1, 1] # contents of first row and column
## [1] 28
df[2, 2:3] # contents of the second row, second and third columns
##   gender income
## 2   male    low
df[1, ] # first row, ALL columns [the default if no columns specified]
##   age gender income
## 1  28 female   high
df[ ,1:2] # ALL rows; first and second columns
##   age gender
## 1  28 female
## 2  36   male
## 3  25   male
## 4  24 female
## 5  32 female
df[c(1,3,5), ] # rows 1,3,5; ALL columns
##   age gender income
## 1  28 female   high
## 3  25   male    low
## 5  32 female   high
# ALL rows; second column (by default results containing only
# one column are converted back into a vector)
df[ , 2]
## [1] "female" "male"   "male"   "female" "female"
df[ , 2, drop=FALSE] # ALL rows; second column (returned as a data.frame)
##   gender
## 1 female
## 2   male
## 3   male
## 4 female
## 5 female

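In the above, note that we have used two other R functions: the colon operator, as in 2:3, which produces the sequence of whole numbers from the start value to the end value; and c( ), which combines the values inside the brackets into one vector.

Run both of these functions on their own to get a better understanding of what they do.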
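
The two helpers used in the indexing code above, the colon operator and c( ), can be run on their own like this:

```r
1:3          # the colon operator: whole numbers from 1 to 3
## [1] 1 2 3
c(1, 3, 5)   # c() combines the values into one vector
## [1] 1 3 5
```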

Three other methods for referencing the contents of a data.frame make direct use of the variable names within the data.frame, which tends to make for easier to read/understand code:

df[,"age"] # variable name in quotes inside the square brackets
## [1] 28 36 25 24 32
df$age # variable name prefixed with $ and appended to the data.frame name
## [1] 28 36 25 24 32
# or you can use attach
attach(df)
## The following objects are masked _by_ .GlobalEnv:
## 
##     age, gender, income
age # but be careful if you already have an age variable in your local workspace
## [1] 28 36 25 24 32

To check the variables available, use names( ):

names(df)
## [1] "age"    "gender" "income"

TASK #3: Given the above, can you find three different ways of extracting the income of the first and third respondents in your survey dataset?

11 Read Data

11.1 Introducing the data

We will use the Quarterly Labour Force Survey (QLFS). QLFS is a quarterly sample survey of households living at private addresses in the United Kingdom. The survey is conducted by the Office for National Statistics. Its purpose is to provide information on the UK labour market. We will be using the file ‘qlfs.Rdata’, which contains a small sub-set of the information collected by the QLFS in the first quarter (January to March) 2012. For the purposes of this course, I have cleaned and pruned the original dataset, and saved the resulting file in R format (.Rdata). The data and relevant documentation are available in the data folder.

11.2 Load data

Ensure your memory is clear

rm(list=ls()) # rm for targeted deletion / ls for listing all existing objects

There are many commands to read / load data into R. The command to use will depend upon the format in which the data have been saved. Normally they are saved in csv format from Excel or other software packages, in which case a function such as read.csv() can be used.
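
As a minimal sketch of reading a csv file with base R's read.csv(), the lines below first write a tiny made-up data frame to a temporary file and then read it back. The data, the temporary file and the object names are assumptions for illustration only; with your own data you would pass the path to your csv file instead.

```r
# Write a tiny data frame to a temporary csv file (illustrative data only)
tmp_csv <- tempfile(fileext = ".csv")
write.csv(data.frame(age = c(28, 36), gender = c("female", "male")),
          tmp_csv, row.names = FALSE)

# Read it back in; header = TRUE treats the first line as column names
survey_demo <- read.csv(tmp_csv, header = TRUE)
str(survey_demo)
```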

To read files in other formats, refer to this useful DataCamp tutorial

Because I have already saved the data in R format, we can use load():

load("../data/data_qlfs.RData") 
# NOTE: always ensure you are setting the correct directory leading to the data. 
# It may differ from your existing working directory
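
Unlike read.csv(), load() does not return the data itself: it restores the saved objects under their original names. The sketch below illustrates this with a made-up data frame and a temporary file, both of which are assumptions for illustration.

```r
# Save a small data frame in R format (illustrative data only)
demo_df <- data.frame(id = 1:3, pay = c(250, 310, 180))
tmp_rdata <- tempfile(fileext = ".RData")
save(demo_df, file = tmp_rdata)

# Remove it from memory, then load it back
rm(demo_df)
loaded_names <- load(tmp_rdata)

loaded_names  # load() returns the names of the objects it restored
demo_df       # the object is available again under its original name
```

This is why, after loading the course data, the data frame is referred to by the name it was saved under rather than by a name you choose.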

11.3 Quickly inspect the data

  1. What class?

  2. What R data types?

  3. What Survey data types?

# 1
class(qlfs)
# 2 & 3
str(qlfs)

Just interested in the variable names:

names(qlfs)
##  [1] "Case_ID"        "Hhold_ID"       "Family_ID"      "Person_ID"     
##  [5] "HHoldSize"      "FamilySize"     "Age"            "AgeGroup"      
##  [9] "Sex"            "MaritalStatus"  "EthnicGroup"    "CountryOfBirth"
## [13] "Religion"       "WorkStatus"     "NSSEC"          "HighestQual"   
## [17] "DegreeClass"    "HoursWorked"    "GrossPay"       "NetPay"        
## [21] "LastMoved"      "TravelMode"     "TravelTime"     "GovtRegion"    
## [25] "Tenure"         "FamilyType"     "YoungestChild"  "FamilyToddlers"

Or, to view the data:

View(qlfs)

12 Data Distributions

We can think of two types of data distributions: theoretical distributions and empirical distributions.

12.1 Theoretical data distributions

The four most commonly used data distributions in Social Sciences are:

Continuous probability distributions: Normal and Student’s t.

Discrete probability distributions: Binomial and Poisson.

Fig. Continuous distributions.

Fig. Discrete distributions.

Understanding probability distributions

Random variables can have a set of different values. Every value has a probability of occurrence. The probabilities of the values form a probability distribution for the random variable. If you want to learn more about probabilities, this is an excellent course: Statistics 110
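
Base R provides functions for each of the four distributions mentioned above. The sketch below is illustrative only: the parameter values (10 trials, a mean of 3, 5 degrees of freedom) are arbitrary choices, not values used elsewhere in the course.

```r
# Discrete: Binomial probabilities for 0 to 10 successes in 10 trials
p_binom <- dbinom(0:10, size = 10, prob = 0.5)
sum(p_binom)  # probabilities over all possible values add up to 1

# Discrete: Poisson probability of exactly 2 events when the mean is 3
dpois(2, lambda = 3)

# Continuous: probability that a standard Normal value falls below 1.96
p_norm <- pnorm(1.96, mean = 0, sd = 1)
p_norm

# Continuous: the same probability for a Student's t with 5 degrees of freedom
# (smaller, because the t distribution has heavier tails)
p_t <- pt(1.96, df = 5)
p_t
```

For a discrete variable we ask for the probability of each value (dbinom, dpois); for a continuous variable we ask for the probability of a range of values (pnorm, pt).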

12.2 Visualising data distributions

12.2.1 Categorical variables

Let’s say we want to understand the gender and ethnic composition of the population in our data.

# attach our dataset (ie. qlfs) to the R search path.
attach(qlfs)
# create table with counts
counts <- table(EthnicGroup, Sex)
counts
##                                        Sex
## EthnicGroup                              Male Female
##   White                                 34672  37951
##   Mixed/multiple ethnic background        267    360
##   Indian                                  836    851
##   Pakistani                               521    534
##   Bangladeshi                             190    201
##   Chinese                                 143    204
##   Other Asian background                  360    420
##   Black/African/Caribbean/Black British   755    964
##   Other ethnic group                      479    511
# row proportion
prop.table(counts, 1)
##                                        Sex
## EthnicGroup                                  Male    Female
##   White                                 0.4774245 0.5225755
##   Mixed/multiple ethnic background      0.4258373 0.5741627
##   Indian                                0.4955542 0.5044458
##   Pakistani                             0.4938389 0.5061611
##   Bangladeshi                           0.4859335 0.5140665
##   Chinese                               0.4121037 0.5878963
##   Other Asian background                0.4615385 0.5384615
##   Black/African/Caribbean/Black British 0.4392088 0.5607912
##   Other ethnic group                    0.4838384 0.5161616
# column proportion
prop.table(counts, 2)
# missing information on ethnicity by sex
table(Sex, is.na(EthnicGroup))
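
The same steps can be tried on a small made-up dataset, which makes it easier to check the numbers by hand. The data frame toy and its values are hypothetical and are not taken from the QLFS. The sketch uses the $ notation rather than attach( ) so that it runs on its own.

```r
# Toy survey data (hypothetical values, not from the QLFS)
toy <- data.frame(
  group = c("A", "A", "A", "B", "B", NA),
  sex   = c("Male", "Female", "Female", "Male", "Female", "Male")
)

# Counts: rows with a missing group are left out by default
counts_toy <- table(toy$group, toy$sex)
counts_toy

row_prop <- prop.table(counts_toy, 1)  # row proportions: each row adds up to 1
col_prop <- prop.table(counts_toy, 2)  # column proportions: each column adds up to 1
row_prop
col_prop

# Missing group information by sex
table(toy$sex, is.na(toy$group))
```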

You can also use bar graphs for categorical variables using ggplot which has a basic structure of three components:

  • The data

  • Geometries

  • Aesthetic mapping

In terms of code, these components are structured:

ggplot(data = *data frame*) +
  geom_xxx(aes(x = *variable*, y = *variable*))

For more details, refer to https://rafalab.github.io/dsbook/ggplot2.html

eg.

Fig. Distribution by ethnicity.

# load the plotting package first
library(ggplot2)
# one variable
ggplot(data=qlfs) +
  geom_bar( aes( x= EthnicGroup) )

Fig. Distribution by sex.

# two variables: split by sex
ggplot(data=qlfs) +
  geom_bar( aes( x= EthnicGroup, fill= Sex) )

Of course, there are various options to make your plots look pretty. Look at the various options here: https://ggplot2.tidyverse.org/reference/

12.2.2 Continuous variables

A quick way is creating a histogram:

Fig. Histogram of Net Pay.

# create histogram
hist(NetPay)

We can draw a better version using ggplot but first let’s create a data frame excluding all negative and non-finite values:

# load dplyr for the pipe (%>%) and filter()
library(dplyr)
# create new data frame
df <- qlfs %>% filter(!is.na(NetPay)) %>% 
    filter(NetPay >= 0)
# remember to:
detach(qlfs)
# and:
attach(df)

Note we used a pipe operator to make the code more efficient and readable - for more details, see Grolemund and Wickham (2019).

Grolemund, Garrett, and Hadley Wickham. 2019. R for Data Science. O’Reilly, US. https://r4ds.had.co.nz.

Fig. Histogram of Net Pay by Sex.

# create new histogram 
  # 1) the base
hist_pay <- ggplot(data = df) +
  # 2) the histogram
  # after_stat(density) is the current ggplot2 spelling of ..density..
  geom_histogram(bins = 100, aes(x = NetPay, y = after_stat(density))) +
  # 3) overlay density plot
  geom_density(alpha=0.5, colour="#FF6666", aes(x = NetPay))
hist_pay

Another way to visualise data distributions is using boxplots:

Fig. Boxplot of Net Pay by Sex.

box_pay <- ggplot(data = df) +
  geom_boxplot(aes(x = Sex, y= NetPay))
box_pay
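
A boxplot summarises a distribution through its minimum, quartiles, median and maximum, and flags unusually distant values. As a rough numerical counterpart, base R's fivenum() and boxplot.stats() return these quantities. The pay values below are made up for illustration and are not from the QLFS.

```r
# Hypothetical weekly pay values, with one unusually high value
pay_demo <- c(120, 150, 180, 200, 230, 260, 300, 340, 900)

# Minimum, lower hinge, median, upper hinge, maximum
five_demo <- fivenum(pay_demo)
five_demo
## [1] 120 180 230 300 900

# Values that a boxplot would draw as separate points
out_demo <- boxplot.stats(pay_demo)$out
out_demo
## [1] 900
```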

Exercise to work on your own

  • Explore if the net pay distribution differs by ethnic group by creating a boxplot
  • What is the largest socioeconomic group?
  • What socioeconomic group shows the largest gender net pay gap?

13 Appendix: Concepts and Functions to Remember

Function             Description
setwd()              set working directory
getwd()              show current working directory
help()               open the help page for a function
class()              check data class
str()                inspect data structure
c()                  combine values into one vector
factor()             create a factor variable
levels()             ask for levels of a variable
data.frame()         create a data frame
View()               open data frame
attach()/detach()    attach/detach a data frame
ggplot()             plot data