Training Calendar

An Introduction to Machine Learning using Stata - In collaboration with Lancaster University

Online | 2 days (7th April 2021 - 8th April 2021) | Stata | Intermediate, Introductory
Automation, Big Data, Data Management, Programming, Statistics

Overview

Recent years have witnessed an unprecedented availability of information on social, economic, and health-related phenomena. Researchers, practitioners, and policymakers now have access to huge datasets (so-called “Big Data”) on people, companies and institutions, web and mobile devices, satellites, etc., at increasing speed and detail.

Machine learning is a relatively new approach to data analytics that sits at the intersection of statistics, computer science, and artificial intelligence. Its primary objective is to turn information into knowledge and value by “letting the data speak”. Machine learning limits prior assumptions on data structure and relies on a model-free philosophy, favouring algorithm development, computational procedures, and graphical inspection over tight assumptions, algebraic development, and analytical solutions. Computationally unfeasible only a few years ago, machine learning is a product of the computer era: of today's machines' computing power and ability to learn, of hardware development, and of continuous software upgrades.

This course is a primer on machine learning techniques using Stata. Various machine learning packages are now available within Stata, but some of these are not well known to all Stata users. This course fills that gap by familiarising participants with Stata's potential to draw knowledge and value from large, and possibly noisy, datasets. The teaching approach is based on graphical language and intuition more than on algebra. The sessions make use of instructional as well as real-world examples, and balance theory and practical sessions evenly.

After the course, participants are expected to have an improved understanding of Stata's potential for machine learning, and to be able to master research tasks including, among others:

  • factor-importance detection,
  • signal-from-noise extraction,
  • correct model specification,
  • model-free classification, both from a data-mining and a causal perspective.

Course Agenda

    DAY 1:
    1. The basics of Machine Learning
  • Machine Learning: definition, rationale, usefulness
  • Supervised vs. unsupervised learning
  • Regression vs. classification problems
  • Inference vs. prediction
  • Sampling vs. specification error
  • Coping with the fundamental non-identifiability of E(y|x)
  • Parametric vs. non-parametric models
  • The trade-off between prediction accuracy and model interpretability
  • Goodness-of-fit measures
  • Measuring the quality of fit: in-sample vs. out-of-sample prediction power
  • The bias-variance trade-off and the Mean Square Error (MSE) minimization
  • Training vs. test mean square error
  • The information criteria approach
  • Machine Learning and Artificial Intelligence
  • The Stata/Python integration: an overview
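
As a taste of the Stata/Python integration mentioned above (available from Stata 16, and requiring a local Python installation), Python code can run inline in a do-file; this is a minimal sketch, not course material:

```stata
* Inline Python in a Stata do-file (Stata 16+); requires a Python installation
python:
# Python code runs here; the sfi module (shipped with Stata) bridges back to Stata
import sys
print(sys.version)
end
```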

    2. Resampling and validation methods
  • Estimating training and test error
  • Validation
  • The validation set approach
  • Training and test mean square error
  • Cross-Validation
  • K-fold cross-validation
  • Leave-one-out cross-validation
  • Bootstrap
  • The bootstrap algorithm
  • Bootstrap vs. cross-validation for validation purposes
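
The validation-set approach listed above can be sketched in Stata 16+; the dataset and variable names below (Stata's bundled auto data) are illustrative only:

```stata
* Validation-set approach: a minimal sketch in Stata 16+
sysuse auto, clear
set seed 12345
splitsample, generate(sample) split(0.75 0.25)   // sample==1: training, sample==2: test
regress price mpg weight if sample == 1          // fit on the training set only
predict yhat                                     // predictions for all observations
generate sqerr = (price - yhat)^2
summarize sqerr if sample == 1                   // training MSE (mean of sqerr)
summarize sqerr if sample == 2                   // test (out-of-sample) MSE
```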

    3. Model selection and regularization
  • Model selection as a correct specification procedure
  • The information criteria approach
  • Subset Selection
  • Best subset selection
  • Backward stepwise selection
  • Forward stepwise Selection
  • Shrinkage Methods
  • Lasso, Ridge, and Elastic Net regression
  • Adaptive Lasso
  • Information criteria and cross validation for Lasso
  • Stata implementation
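
As a taste of the Stata implementation, Stata 16's built-in lasso suite supports cross-validated selection; the covariate choices below are illustrative only:

```stata
* Cross-validated lasso in Stata 16+ (covariates are illustrative)
sysuse auto, clear
set seed 12345
lasso linear price mpg weight length turn displacement, selection(cv)
lassoknots      // the lambda grid and the number of nonzero coefficients
lassocoef       // covariates selected at the CV-optimal lambda
cvplot          // cross-validation function plotted against lambda
* Elastic net follows the same pattern:
elasticnet linear price mpg weight length turn displacement, selection(cv)
```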


    DAY 2:
    4. Discriminant analysis and nearest-neighbour classification
  • The classification setting
  • Bayes optimal classifier and decision boundary
  • Misclassification error rate
  • Discriminant analysis
  • Linear and quadratic discriminant analysis
  • Naive Bayes classifier
  • The K-nearest neighbours classifier
  • Stata implementation
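
Stata's built-in discrim suite covers the classifiers listed above; the grouping variable and covariates below are illustrative only:

```stata
* Linear discriminant analysis and KNN classification with Stata's discrim suite
sysuse auto, clear
discrim lda mpg weight length, group(foreign)        // linear discriminant analysis
estat classtable                                     // resubstitution classification table
discrim knn mpg weight length, group(foreign) k(5)   // 5-nearest-neighbours classifier
estat classtable
```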

    5. Nonparametric regression
  • Beyond parametric models: an overview
  • Local, semi-global, and global approaches
  • Local methods
  • Kernel-based regression
  • Nearest-neighbour regression
  • Semi-global methods
  • Constant step-function
  • Piecewise polynomials
  • Spline regression
  • Global methods
  • Polynomial and series estimators
  • Partially linear models
  • Generalized additive models
  • Stata implementation
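
Two of the approaches above can be sketched with built-in Stata commands (Stata 15+ for npregress); variables are illustrative only:

```stata
* Local (kernel) and semi-global (spline) regression sketches
sysuse auto, clear
set seed 12345
npregress kernel price mpg        // local-linear kernel regression (Stata 15+)
npgraph                           // plot the estimated conditional mean
* Cubic spline regression via mkspline:
mkspline mpgspl = mpg, cubic nknots(4)
regress price mpgspl*
```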

    DAY 3:
    6. Tree-based regression
  • Regression and classification trees
  • Growing a tree via recursive binary splitting
  • Optimal tree pruning via cross-validation
  • Tree-based ensemble methods
  • Bagging, Random Forests, and Boosting
  • Stata implementation
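
Tree ensembles in Stata rely on community-contributed packages; the sketch below assumes the rforest package (Schonlau and Zou), installable from SSC, and its options are illustrative:

```stata
* Random forest regression via the community-contributed rforest package
* (install once with: ssc install rforest)
sysuse auto, clear
rforest price mpg weight length, type(reg) iterations(200)
predict yhat_rf                   // predicted values from the fitted forest
```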

    7. Neural networks
  • The neural network model
  • Neurons, hidden layers, and multi-outcomes
  • Training a neural network
  • Back-propagation via gradient descent
  • Fitting with high dimensional data
  • Fitting remarks
  • Cross-validating neural network hyperparameters
  • Stata implementation

  • Pre-course Reading List: Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani (2013), An Introduction to Statistical Learning with Applications in R, Springer, New York.
  • Post-course Reading List: Trevor Hastie, Robert Tibshirani, and Jerome Friedman (2008), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, second edition, Springer.

Prerequisites

  • Some knowledge of basic statistics and econometrics is required: the notion of conditional expectation and its properties; point and interval estimation; the regression model and its properties; probit and logit regression.
  • Basic knowledge of the Stata software

Terms & Conditions

  • Student registrations: Attendees must provide proof of full-time student status at the time of booking to qualify for the student registration rate (a valid student ID card or an authorised letter of enrolment).
  • Additional discounts are available for multiple registrations.
  • Delegates are provided with temporary licences for the principal software package(s) used in the delivery of the course. It is essential that these temporary training licences are installed on your computers prior to the start of the course.
  • Payment of course fees is required prior to the course start date.
  • Registration closes 1 calendar day prior to the start of the course.
    • 100% of the fee is returned for cancellations made more than 28 calendar days prior to the start of the course.
    • 50% of the fee is returned for cancellations made between 14 and 28 calendar days prior to the start of the course.
    • No fee is returned for cancellations made less than 14 calendar days prior to the start of the course.

The number of attendees is restricted. Please register early to guarantee your place.


All prices exclude VAT or local taxes where applicable.



Timberlake Consultants