Introduction to Statistical Learning in R

Computational Notebooks

Francisco Rowe (@fcorowe)

Description

This site introduces the course Introduction to Statistical Learning in R. The course provides an introduction to statistics and probability covering essential topics in descriptive and inferential statistics and supervised machine learning. It adopts a problem-to-solution teaching approach, defining a practical problem and illustrating how statistics can enable understanding to make critically informed decisions about a population by examining a random sample. It uses a learning-by-doing approach based on real-world examples in various contexts. This also teaches how to conduct statistical data analysis in R. The course is organised around 6 sessions. Each session is designed to provide a combination of key statistical concepts and practical application through the use of R.

The course comprises three main components. The first component focuses on descriptive statistics, including descriptive statistics of different data types, common probability distributions and measures of centrality and dispersion. The second component involves inferential statistics covering hypothesis testing, confidence intervals, correlation, regression analysis, supervised machine learning approaches and cross-validation.

Learning outcomes

Having successfully completed this course, you will be able to:

  1. Conduct exploratory statistical data analysis.
  2. Have an understanding of elementary probability distributions and data types.
  3. Perform correlation and regression data analysis using real-world data.
  4. Assess the statistical significance between different data types.
  5. Carry out statistical data analysis in R.
  6. Have a basic understanding of supervised machine learning and cross-validation.

Structure

The notes for each session are:

Data

For this course, we uses 2011 Census data for England and Wales, and the Quarterly Labour Force Survey (QLFS). QLFS is a quarterly sample survey of households living at private addresses in the United Kingdom. The survey is conducted by the Office for National Statistics. Its purpose is to provide information on the UK labour market. A file labelled ‘qlfs.Rdata’ contains the data. It is a small sub-set of the information collected by the QLFS in the first quarter (January to March) 2012.

For the purposes of this course, I have cleaned and pruned the original dataset, and saved the resulting file in R format (.Rdata). The census data are in aggregate format at the district level and is labelled ‘data_census’. The data and relevant documentation will be available in the data folder.

Dependencies

This course requires the following libraries installed on your machine:

Each notebook has slightly different dependencies that cater to the topics covered in each session. To run the notebook for the first session, for example, you will need to install the following libraries: tufte, knitr, tidyverse and kableExtra You can install package mypackage by running the command install.packages("mypackage") on the R prompt or through the Tools --> Install Packages... menu in RStudio.

Further reading on tidyverse, see R for Data Science Grolemund and Wickham (2019Grolemund, Garrett, and Hadley Wickham. 2019. R for Data Science. O’Reilly, US. https://r4ds.had.co.nz.).

If you use material from this course, you can give appropriate attribution by using the following citation: Rowe (2020Rowe, Francisco. 2020. “Introduction to Statistical Learning in R.” https://doi.org/10.5281/zenodo.4007043.)

You can get the materials from this course as a download of a .zip file or by going directly to the GitHub repository.