This site introduces the course Introduction to Statistical Learning in R. The course provides an introduction to statistics and probability, covering essential topics in descriptive statistics, inferential statistics and supervised machine learning. It adopts a problem-to-solution teaching approach: each topic starts from a practical problem and illustrates how statistics supports critically informed decisions about a population based on a random sample. It uses a learning-by-doing approach based on real-world examples in various contexts. It also teaches how to conduct statistical data analysis in R. The course is organised around six sessions. Each session combines key statistical concepts with practical application in R.
The course comprises three main components. The first component focuses on descriptive statistics, including descriptive statistics of different data types, common probability distributions and measures of centrality and dispersion. The second component involves inferential statistics, covering hypothesis testing, confidence intervals, correlation and regression analysis. The third component covers supervised machine learning approaches and cross-validation.
Having successfully completed this course, you will be able to:
The notes for each session are:
Session 1 Introduction to R: Data types & probability distributions
Session 2 Descriptive Statistics: Measures of centrality & dispersion for continuous & categorical data
Session 3 Statistical Significance: Hypothesis testing & confidence intervals
Session 4 Correlation: Correlation visualisation & measures
Session 5 Regression Analysis: Linear regression, dummy variables & logistic regression
Session 6 Supervised Machine Learning: Tree Regressions, Random Forest & Cross-validation
For this course, we use 2011 Census data for England and Wales, and the Quarterly Labour Force Survey (QLFS). The QLFS is a quarterly sample survey of households living at private addresses in the United Kingdom. The survey is conducted by the Office for National Statistics. Its purpose is to provide information on the UK labour market. A file labelled ‘qlfs.Rdata’ contains the data. It is a small subset of the information collected by the QLFS in the first quarter (January to March) of 2012.
For the purposes of this course, I have cleaned and pruned the original dataset, and saved the resulting file in R format (.Rdata). The census data are in aggregate format at the district level and are labelled ‘data_census’. The data and relevant documentation will be available in the data folder.
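As a minimal sketch of how an .Rdata file such as ‘qlfs.Rdata’ might be read into R, the snippet below uses base R's load(). The path data/qlfs.Rdata is an assumption based on the mention of a data folder; adjust it to wherever the file sits on your machine. The names of the objects stored inside the file are not stated here, so the code simply prints whatever load() restores.

```r
# Assumed location: the text says the data live in the data folder,
# but the exact path on your machine may differ.
qlfs_path <- file.path("data", "qlfs.Rdata")

if (file.exists(qlfs_path)) {
  # load() restores the saved objects into the workspace and
  # invisibly returns their names as a character vector
  loaded_objects <- load(qlfs_path)
  print(loaded_objects)
} else {
  message("File not found: ", qlfs_path)
}
```

Printing the returned names is a convenient way to find out what the file contains before working with it.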
This course requires the following libraries installed on your machine:
Each notebook has slightly different dependencies that cater to the topics covered in each session. To run the notebook for the first session, for example, you will need to install the following libraries:
kableExtra
You can install a package mypackage by running the command install.packages("mypackage") at the R prompt or through the Tools --> Install Packages... menu in RStudio.
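The installation step can also be scripted so that only packages not yet present are installed. The sketch below is one possible approach, not part of the course material: find_missing() is a hypothetical helper, and the package vector contains only kableExtra because that is the only library named here. The interactive() guard is a design choice to avoid triggering installs when a notebook is rendered non-interactively.

```r
# Hypothetical helper: return the packages in `pkgs` that are not installed
find_missing <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

# Only kableExtra is named in the text; extend this vector as needed
required_packages <- c("kableExtra")
to_install <- find_missing(required_packages)

# Install only what is missing, and only in an interactive session
if (length(to_install) > 0 && interactive()) {
  install.packages(to_install)
}
```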
If you use material from this course, you can give appropriate attribution by using the following citation: Rowe (2020).
Rowe, Francisco. 2020. “Introduction to Statistical Learning in R.” https://doi.org/10.5281/zenodo.4007043.