This course will review the application of machine learning techniques to both prediction problems and so-called causal problems where a firm or policy maker needs to understand the impact of some form of intervention on a heterogeneous population.
One example is a firm that wishes to understand how a change in pricing affects both aggregate demand and demand in different segments of the population. In another example, a policy maker seeks to understand the impact of an intervention both in terms of some form of average effect and in terms of how individuals differ in the magnitude of the effect. Examples include the impact of job training programmes, the impact of education policies in developing economies, and the differential impact of drugs on survival and recovery.
In this context we make the distinction between the ex post assessment of a change and the ex ante identification of characteristics of individuals that are predictive of the likely impact of such a change.
Using Breiman’s (2001) notion of two cultures in the use of statistical modelling, the course begins with a review of the fundamental differences between machine learning and econometrics.
"There are two cultures in the use of statistical modelling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown." (Breiman, 2001, p. 199)
We contrast a modelling approach where the analyst makes certain assumptions about model specification, including functional form, with an approach where the data mechanism is presumed unknown. In this context we consider the econometrician’s concern for internal validity, alongside the focus within machine learning on ensuring that a model is robust in the sense of generalising to unseen data (external validity).
The course will focus upon topics at the intersection of machine learning and econometrics, covering a mix of theory and applications. In making the distinction between models which are used to solve a prediction problem and models which are used to estimate some form of causal effect, we introduce participants to identification strategies in econometrics. Here it is important to demonstrate how empirical strategies such as unconfoundedness, instrumental variables, and difference-in-differences can be used alongside machine learning methods for prediction.
As a point of departure we make reference to the two broad types of machine learning in terms of supervised and unsupervised learning, making the link to nonparametric regression. We then consider a number of fundamental building blocks, starting with error decomposition in terms of bias and variance, the role of training, estimation and test samples, and the role of regularisation as a means to avoid overfitting.
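As a reminder of the error decomposition referred to above, a standard textbook statement under squared-error loss is sketched below. The notation is an assumption introduced here for illustration and is not taken from the course notes: the outcome is y = f(x) + ε with E[ε] = 0 and Var(ε) = σ², the fitted model \hat{f} is estimated on a training sample that is independent of the new draw of ε, and the expectation is taken over training samples and ε at a fixed point x.

```latex
\[
\mathbb{E}\!\left[\bigl(y - \hat{f}(x)\bigr)^{2}\right]
  = \underbrace{\bigl(\mathbb{E}[\hat{f}(x)] - f(x)\bigr)^{2}}_{\text{squared bias}}
  + \underbrace{\mathbb{E}\!\left[\bigl(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\bigr)^{2}\right]}_{\text{variance}}
  + \underbrace{\sigma^{2}}_{\text{irreducible error}}
\]
```

Regularisation typically accepts some additional bias in exchange for a reduction in the variance term.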
In covering the broad areas where machine learning is used, namely prediction, classification and causal effects, for each case we link the exposition to parametric benchmarks. For prediction we consider the piecewise nonlinear regression model, and high dimensional methods; and for causal effects we consider the specification of models with instrumental variables and treatment effects.
Participants will also be introduced to the use of ensemble methods as an averaging and regularisation device. In this context we will explore a number of general methods for model averaging, including bootstrap aggregation (so-called bagging) and random forests. For machine learning models for prediction, classification and causal effects we provide examples using Stata, R and Python.
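The idea of bagging as an averaging device can be sketched in a few lines of Python. This is a minimal illustration on simulated data, not material from the course: the data-generating process, the choice of a one-nearest-neighbour rule as the high-variance base learner, and all function names are assumptions made for the example.

```python
import numpy as np

def one_nn_predict(x_train, y_train, x_new):
    """Predict with the single nearest training point (a high-variance learner)."""
    nearest = np.abs(x_new[:, None] - x_train[None, :]).argmin(axis=1)
    return y_train[nearest]

def bagged_predict(x_train, y_train, x_new, n_boot=200, seed=0):
    """Average the base learner's predictions over bootstrap resamples."""
    rng = np.random.default_rng(seed)
    n = len(x_train)
    preds = np.empty((n_boot, len(x_new)))
    for b in range(n_boot):
        draw = rng.integers(0, n, size=n)  # sample n rows with replacement
        preds[b] = one_nn_predict(x_train[draw], y_train[draw], x_new)
    return preds.mean(axis=0)

# Simulated data (hypothetical): a smooth signal plus noise.
rng = np.random.default_rng(42)
x = rng.uniform(0, 1, size=100)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.5, size=100)

grid = np.linspace(0.05, 0.95, 50)
truth = np.sin(2 * np.pi * grid)

single = one_nn_predict(x, y, grid)   # one fit on the full sample
bagged = bagged_predict(x, y, grid)   # average over bootstrap fits

print("MSE, single fit:", np.mean((single - truth) ** 2))
print("MSE, bagged fit:", np.mean((bagged - truth) ** 2))
```

Each bootstrap fit is as noisy as the single fit, but averaging across resamples smooths the prediction. Random forests apply the same averaging to regression trees, with additional randomisation over the covariates considered at each split.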
The introduction of time-of-use electricity prices is an example of a policy with heterogeneous effects. Consumers in different socioeconomic groups, with distinct historical intra-day load profiles and behavioural characteristics, may respond differently to the introduction of tariffs that charge different prices for electricity at different times of the day. Customers who can adapt their consumption profile to time-of-use (TOU) tariffs will accrue a benefit, while those who cannot will bear a cost. Those who consume electricity at more expensive peak periods, and who are unable to change their consumption patterns, could end up paying significantly more.
Analysts often describe subpopulations that are of interest a priori, and which can be defined by a known combination of covariates. However, increasingly researchers face a selection problem given a large number of possible covariates alongside uncertainty as to which covariates are important for heterogeneity, and what functional form best describes the association between these covariates and treatment effects.
In assessing whether demographic variables are informative in terms of the impact of TOU tariffs on load profiles, the Customer-Led Network Revolution project noted:
"... a relatively consistent average demand profile across the different demographic groups, with much higher variability within groups than between them. This high variability is seen both in total consumption and in peak demand."
One reason for this finding might be that asking which individual demographic variables matter for the impact of energy policies ignores the fact that many of these variables should be considered together, through their interactions. For example, it may be the (unknown) combination of income, household size, education, and daily usage patterns that describes a particularly responsive or unresponsive group.
Throughout the course we make reference to the problem of identifying the distributional effects of an intervention without succumbing to the problems of data mining (multiplicity). Here we examine the empirical problem of identifying the characteristics of winners and losers following the introduction of a Time-of-Use (TOU) pricing scheme, where the price per kWh of electricity depends on the time of consumption. The pricing scheme is enabled by smart meters, which record consumption every half-hour.
Using machine learning methods we describe the association between the effect of TOU pricing schemes on household electricity demand and a range of variables that are observable before the introduction of the new pricing schemes.
The course is designed both to provide the tools to undertake projects using machine learning (ML) and, critically, to ensure that participants understand and can communicate how the methods work.
Towards this objective, on Day 1, Session 1 we introduce participants to the vernacular of machine learning tools.
In Session 2 we will further explore the links between ML, econometrics and data mining. We also examine how ML utilises data mining tools, suitably adapted to allow inference. The course is designed in such a way as to ensure that participants are given the necessary context to understand the genesis of ML methods. To this end, the first point of departure reviews the ordinary least squares estimator and provides links to ML using kernel density estimation. We also provide the necessary links to econometrics and nonparametric statistics.
Course Notes: Overview, Prediction and Evaluation
Course Notes: ML and Econometrics, Point Dep OLS
Day 2, Session 1 begins with the second point of departure - high dimensional methods in statistics. These methods are used when analysts face a big data problem in terms of which of a large set of explanatory variables to include in a regression model.
We follow this with a practical where participants can explore the use of regularised regression tools with a number of empirical applications.
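As one illustration of a regularised regression tool used for variable selection, the sketch below implements the lasso by cyclic coordinate descent. It is a minimal example under stated assumptions, not the course's own code: the simulated design, the penalty value and the function names are all hypothetical, and the objective is taken to be (1/2n)‖y − Xβ‖² + λ‖β‖₁ with standardised covariates and a centred outcome.

```python
import numpy as np

def soft_threshold(z, lam):
    """Shrink z towards zero by lam, setting small values exactly to zero."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Lasso via cyclic coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0) / n      # per-column scale
    resid = y - X @ beta
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * beta[j]     # partial residual excluding x_j
            rho = X[:, j] @ resid / n
            beta[j] = soft_threshold(rho, lam) / col_ss[j]
            resid -= X[:, j] * beta[j]
    return beta

# Simulated data (hypothetical): 50 candidate covariates, only 3 matter.
rng = np.random.default_rng(1)
n, p = 200, 50
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardise covariates
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + rng.normal(size=n)
y = y - y.mean()                           # centre the outcome

beta_hat = lasso_cd(X, y, lam=0.2)
selected = np.flatnonzero(beta_hat)
print("Selected covariates:", selected)
print("Estimated coefficients:", np.round(beta_hat[selected], 2))
```

The penalty sets many coefficients exactly to zero, which is what makes the lasso a variable selection device. In practice the penalty λ would be chosen by cross-validation rather than fixed in advance.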
In Session 2 we provide an introduction to a number of machine learning methods including regression trees and forests. This is then followed by a practical where we examine the use of ML methods for prediction.
Course Notes: Point Dep II High Dimens Methods, Applications of Regularised Regression
Course Notes: ML and Decision Trees
On Day 3, Session 1, we review some of the fundamentals of machine learning that have been introduced. This includes the use of ML for prediction, classification and causal effects, alongside the key methodological concepts such as the bias-variance trade-off and methods to achieve regularisation.
Session 2 begins with the third point of departure - programme evaluation and treatment effects. We make reference to the work of the Nobel Laureate Esther Duflo, who has made significant contributions to the use of randomised controlled trials, in addition to the utilisation of machine learning methods in this context.
We then examine the use of machine learning methods for causal inference. Relative to some econometric methods, ML techniques have sought to exploit so-called big data to provide a coherent approach to uncovering variation in treatment effects without succumbing to the pitfalls of data mining.
This is followed by a practical where we examine the use of ML methods applied to the impact of time-of-use electricity pricing on individual-level demand response. A key question here is whether it is possible to identify characteristics of households that enable policy makers to identify so-called winners and losers once we move to a price system where prices vary throughout the day.
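To make the idea of relating treatment effects to pre-treatment characteristics concrete, the sketch below estimates how an effect varies with a single covariate in a simulated randomised experiment. Everything here is an assumption for illustration: the data are simulated, the covariate is hypothetical, and separate ordinary least squares fits for treated and control households stand in, as a parametric benchmark, for the machine learning methods used in the practical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Hypothetical pre-treatment covariate, e.g. a standardised peak-usage share.
x = rng.normal(size=n)
# Randomised assignment to the new tariff (1) or the existing tariff (0).
d = rng.integers(0, 2, size=n)
# Simulated outcome with a treatment effect that varies with x:
# tau(x) = -1.0 - 2.0 * x (an assumption of this example).
tau_true = -1.0 - 2.0 * x
y = 0.5 * x + d * tau_true + rng.normal(size=n)

def fit_ols(x, y):
    """Least-squares fit of y on a constant and x; returns (intercept, slope)."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Fit one outcome model per arm, then difference the two fitted models.
coef_treated = fit_ols(x[d == 1], y[d == 1])
coef_control = fit_ols(x[d == 0], y[d == 0])
cate_coef = coef_treated - coef_control

# Estimated conditional average treatment effect for every household.
cate_hat = cate_coef[0] + cate_coef[1] * x

print("Estimated effect at x = 0:", round(cate_coef[0], 2))
print("Estimated change in effect per unit of x:", round(cate_coef[1], 2))
```

Because only pre-treatment information enters the model, the fitted relationship can be used ex ante to flag households likely to gain or lose. Replacing the two linear fits with flexible learners raises the multiplicity concerns discussed above, which is where sample splitting and related safeguards come in.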
The number of attendees is restricted. Please register early to guarantee your place.