The 25th UK Stata Conference (London) – First Announcement and Call for Presentations

The 25th UK Stata Conference (London) takes place on Thursday, 5 and Friday, 6 September 2019 at Cass Business School, London.

The 25th UK Stata Conference is a two-day international event that will provide Stata users from across the United Kingdom and around the world with the opportunity to exchange ideas, experiences, and information on new applications of the software.

Scientific Organisers

Michael Crowther, University of Leicester
Michael.crowther@le.ac.uk

Stephen Jenkins, LSE
s.jenkins@lse.ac.uk

Roger Newson, Imperial College London
r.newson@imperial.ac.uk

If you are interested in presenting, please email the scientific organisers with an abstract, indicating whether you wish to give:

(i) a 20 min talk (followed by 10 min discussion);
(ii) a 10 min talk (followed by 5 min discussion);
(iii) a longer review or tutorial (about an hour); or
(iv) a poster presentation.

Presentation topics might include:

- discussion of user-written Stata programs,
- case studies of research or teaching using Stata,
- discussions of data management problems,
- reviews of analytical issues,
- surveys or critiques of Stata facilities in specific fields, etc.

The deadline for submission of abstracts is 31 May 2019. The final programme will be announced before the end of July 2019.

Please see below for further information about the User Meeting, including registration fees and reduced rates for paper presenters and students.

Fees & Accommodation

Cost (per participant):

Meeting fees                                  Price (inc. VAT)
Non-students - attendance on both days        £96.00
Non-students - attendance on one day only     £66.00
Students - attendance on both days            £66.00
Students - attendance on one day only         £48.00
UGM Dinner (optional)                         £36.00

- All prices include VAT.
- Lunches, refreshments and all meeting materials are included within the registration fees.

Registration

Timberlake Consultants generously sponsors registration fee waivers for presentations (one fee waiver per presentation, regardless of the number of authors involved). Timberlake will also pay a small fee to the presenter of a longer review or tutorial paper. Presenters still need to register.

We can also assist delegates with sourcing accommodation and with general enquiries about travelling to or staying in London.

To find out more and to register for the meeting, please visit our events page here, contact us by email, or call us on +44 (0)20 8697 3377.

Proceedings of Past Meetings

Click here for Past Meetings Proceedings

Stata Tips #21 – Stata 15’s new survival analysis with interval-censored event times


What is it for?

Often, time-to-event or survival data are gathered at particular observation times. A physician will detect the recurrence of cancer only when there is a follow-up appointment, and a biologist might know that a study animal in the wild has died when they visit the site, but not exactly when it happened. In either case, all we know is that the event happened between the two appointments or visits, which is to say that there is a lower limit and an upper limit to the survival time. In statistical parlance, data like this are interval-censored.

Stata version 15 includes a new command, stintreg, which provides you with the familiar streg parametric survival regressions while allowing for interval-censored data: you supply the lower and upper limits of each survival time, and the censoring is accounted for in the likelihood.

An example with medical follow-up data

In this post, we'll apply stintreg to a dataset from David Collett's textbook, "Modelling Survival Data in Medical Research" (3rd edition) (Example 9.4). This is from a real-life study that examined breast retraction, a side-effect of breast cancer treatment, and compared women treated with radiotherapy alone and those treated with radiotherapy and chemotherapy. You can download the data in CSV form here and follow along with this do-file.

There are three variables: treatment group, start time of the interval (the last appointment at which there was no sign of retraction) and end time of the interval (the first appointment where there was retraction). First, let's visualise the dataset. Time is on the vertical axis, and the women are ranked by start time from left to right, in the two treatment groups. The exact time when retraction appeared is somewhere in the grey lines. Ideally, we would see a lower risk of retraction as grey lines stretching off to the top of the chart without red dots, and high risk indicated by red dots low down in the chart. But, it is hard to judge whether there are any differences.
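If you would like to reproduce this sort of chart, here is a minimal sketch (the ranking variable and the plotting constant 50 are our own choices, not taken from the original do-file):

sort treatment start
by treatment: generate rank = _n
* intervals with no observed retraction have end == .; stretch them
* beyond the top of the chart (50 is an arbitrary plotting constant)
generate endplot = cond(end < ., end, 50)
twoway (rspike start endplot rank, lcolor(gs10)) ///
       (scatter end rank if end < ., mcolor(red) msize(small)), ///
       by(treatment) ytitle("Time") xtitle("Women, ranked by start time")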

[Image: Stata graph]

Let's fit some streg models, using naive approximations of the data so that they look like they are exactly known. First, we use the start time (the last retraction-free appointment).

gen naive = (end!=.)

stset start, failure(naive)

streg treatment, distribution(weibull)

[Image: Stata output]

This shows a very strong and statistically significant detrimental effect of chemotherapy. Next, we use the mid-point between the two appointments:

gen naivetime = (start+end)/2

stset naivetime, failure(naive)

streg treatment, distribution(weibull)

[Image: Stata output]

Oh dear! This shows an almost non-existent difference between the two groups, with no statistical significance. Which can we trust? We need to account for the interval censoring in a principled way, as part of the overall probabilistic model for the data, and this is where stintreg comes in.

stintreg treatment, interval(start end) distribution(weibull)

[Image: Stata output]

Now, we can see that the treatment group difference is quite large, and quite compelling in terms of significance. Note that, because the models are very different, we can't compare the log-likelihoods.

Why does this matter?

Interval-censored data are everywhere. Being able to account for the uncertainty in the data gives more honest answers, not only by avoiding bias but by having the right level of uncertainty. If you fill in the "blank" with a single number, then you give Stata the false impression that the data are exactly known, and the inferences (p-values, confidence intervals) that follow will be falsely precise as a result.

 

Stata Tips #20 – Power Analysis

The ultimate question in applied research is: does event A cause event B? The search for causal relationships has pushed applied scientists to rely more and more on field experiments. When setting up an experiment, researchers need to determine the sample size required to detect a treatment effect (the causal impact). Stata 15's power command gives users tremendous flexibility in computing sample size and the power of a test, and in graphing the results. In the following example, we compute the sample size for a specific power; we could equally have computed the power for a specific sample size.

Assume that we want to measure the impact of an innovative teaching program on General Educational Development (GED) completion rates in a young population (between the ages of 17 and 25). To do so, we need to run a randomized controlled trial, randomly assigning participants to the innovative program. We know from the literature that the proportion of the population in question with a GED is around 66%. Experts in the education field expect the new program to increase GED completion rates by 20 percentage points (to 86%). How many study participants do we need to test this hypothesis?

Using the command below, Stata identifies the total number of participants needed in the study (and also the number in each group, treatment and control).

power twoproportions 0.66 0.86, power(0.8) alpha(0.05)

The twoproportions method is used because we are comparing two proportions: 0.66 is the proportion of the population with a GED now, and 0.86 is what we expect that proportion to be after the program. power() specifies the power of the test and alpha() its significance level; both values here are the defaults.

Performing iteration ...

Estimated sample sizes for a two-sample proportions test
Pearson's chi-squared test
Ho: p2 = p1 versus Ha: p2 != p1

Study parameters:

        alpha =    0.0500
        power =    0.8000
        delta =    0.2000  (difference)
           p1 =    0.6600
           p2 =    0.8600

Estimated sample sizes:

            N =       142
  N per group =        71

The result of the power analysis tells us that we need to recruit 142 participants into the study in which we enroll 71 in the new program and rely on the remaining 71 as a control group.
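As a rough cross-check, the standard closed-form approximation for comparing two proportions gives almost the same answer. This sketch is ours, not part of the original post, and power computes its result by iteration rather than from this formula:

local p1 = 0.66
local p2 = 0.86
local pbar = (`p1' + `p2')/2
local za = invnormal(0.975)                  // two-sided alpha = 0.05
local zb = invnormal(0.80)                   // power = 0.8
local n = (`za'*sqrt(2*`pbar'*(1 - `pbar')) + ///
    `zb'*sqrt(`p1'*(1 - `p1') + `p2'*(1 - `p2')))^2 / (`p2' - `p1')^2
display "approximate n per group = " ceil(`n')   // 71, matching the output above

Run it from a do-file so the local macros survive from one line to the next.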

Let us say we now want to relax the 0.8 assumption for the power of the test, and allow four different values ranging from a low of 0.6 to a high of 0.9.

power twoproportions 0.66 0.86, power(0.6 0.7 0.8 0.9) alpha(0.05)

Performing iteration ...

Estimated sample sizes for a two-sample proportions test
Pearson's chi-squared test
Ho: p2 = p1 versus Ha: p2 != p1

  +-----------------------------------------------------------------+
  |   alpha   power       N      N1      N2   delta      p1      p2 |
  |-----------------------------------------------------------------|
  |     .05      .6      90      45      45      .2     .66     .86 |
  |     .05      .7     112      56      56      .2     .66     .86 |
  |     .05      .8     142      71      71      .2     .66     .86 |
  |     .05      .9     188      94      94      .2     .66     .86 |
  +-----------------------------------------------------------------+

The command gives us the sample sizes for the four scenarios, ranging from a low of 90 participants to a high of 188. As expected, the higher the power of the test, the larger the required sample size.

We can graph the above table using the command below:

power twoproportions 0.66 0.86, power(0.6 0.7 0.8 0.9) alpha(0.05) graph

[Image: Stata graph of estimated total sample size versus power]

In some cases, the sample size has been predetermined. For a variety of reasons (eligibility for the program, financial constraints, and so on), the number of participants in a study may be known and fixed. In the above example, assume the number of participants eligible for the innovative education program is 120. How much power do we have with the given sample size across a number of different effect sizes? As a reminder, the power of a test is the probability of making a correct decision (in other words, rejecting the null hypothesis) when the null hypothesis is actually false.

power twoproportions 0.66 (0.71 0.76 0.81 0.86), n(120) alpha(0.05) graph

In the above command, we fix sample size to 120 and suggest 4 different effect sizes each increasing by 5 percentage points from the initial 66% proportion of GED completion rates.

[Image: Stata graph of estimated power versus the experimental-group proportion]

The resulting graph suggests low power, the highest being 73% for an effect of a 20-percentage-point increase in the proportion of GED completion. This is not surprising, given that the first command told us we needed 142 participants to detect the same effect with a power of 80%. With fewer participants, we expect lower power.

Stata Tips #19 – Multilevel Tobit regression models in Stata 15

 


Tobit models are made for censored dependent variables, where the value is sometimes only known within a certain range. Chemical sensors may have a lower limit of detection, for example. In this blog post, we'll use some simulated data so that we know what relationships we expect to see, and they will be censored with an upper limit, or as the jargon goes, right-censored.

Tobit models have been available in Stata for a while, but version 15 now includes multilevel versions with random intercepts and random slopes. Let's look at a simple mathematical representation.

[Image: Algebra for the Multilevel Tobit model]

In these equations, i indexes the individual cases and j the clusters of cases that define the multilevel structure. That might be exam results clustered by student, or trees clustered by plantation, or whatever. One neat way of thinking about the advantage of multilevel models is that they split up the unexplained variance into observation-level and cluster-level variance, and in doing so, give deeper understanding and more nuanced conclusions about the data.

x is an independent variable or predictor, y is the dependent variable or outcome of interest, but we don't observe y directly, only y*, which is the censored version at a threshold h. The betas are simple regression coefficients. Then, there is also a random intercept u, which is different for every cluster in the data's multilevel structure, and a random slope v, which contributes to the effect of x on y. The u and v values are normally distributed around zero, by convention.
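Putting that description into symbols, the model in the image can plausibly be written as follows (our reconstruction from the prose above, not a copy of the image):

$$
\begin{aligned}
y_{ij} &= \beta_0 + \beta_1 x_{ij} + u_j + v_j x_{ij} + \varepsilon_{ij},\\
y^{*}_{ij} &= \min\!\left(y_{ij},\, h\right),\\
u_j \sim N(0, \sigma_u^2), \qquad v_j &\sim N(0, \sigma_v^2), \qquad \varepsilon_{ij} \sim N(0, \sigma_\varepsilon^2).
\end{aligned}
$$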

Simulated data

Let's create some simulated data for an imaginary research study into rehabilitation for amputees learning to use electrically-controlled prosthetic hands. The data are fake but inspired by real-life research.

In these data, there are two training programmes being compared (variable: train) and the participants (identified by the variable: id) are tracked over several weeks (variable: week). They have to complete a test by carrying out four tasks of increasing difficulty (variable: task), and they are timed on the tasks (variable: time), so smaller time is desirable, but there is a maximum time of 30 seconds, so if they have not completed it within 30 seconds, then time is recorded as 30.

This is right-censoring of the dependent variable. We also have to account for the repeated-measures (economists may say "panel data") nature of the data in a multilevel model. There will be a random intercept by id, and a random slope for week, and the study is principally interested in the interaction between week and train, because this would indicate a difference between training programmes in their impact on the rate of recovery (slope relating week to time).

The code to generate these data is in the accompanying do-file. The graphs below show the individual "patients" as lines over time, split by the four tasks and two training groups.
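If you don't have the do-file to hand, here is a minimal sketch of how data with this structure could be simulated; every parameter value below is invented for illustration and will not match the do-file:

clear
set seed 2018
set obs 40                          // 40 participants
generate id = _n
generate train = (id > 20)          // two training programmes
generate u = rnormal(0, 2)          // random intercept by participant
generate v = rnormal(0, 0.3)        // random slope on week
expand 8                            // 8 weekly visits each
bysort id: generate week = _n
expand 4                            // 4 timed tasks at each visit
bysort id week: generate task = _n
generate y = 24 + 2*task - week - 0.8*week*train + u + v*week + rnormal(0, 2)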

[Image: Outcome versus week for task 1]
[Image: Outcome versus week for task 2]
[Image: Outcome versus week for task 3]
[Image: Outcome versus week for task 4]

If only we knew the real values of the dependent variable...

...then we could just fit a familiar multilevel linear regression. Let's call it "Model 1".

mixed y c.week##i.train i.task || id: week, covariance(unstructured)

You can see that this recovers the population parameters we fed into the simulated data, plus-minus some sampling error. Along with the other two models we'll fit, they are summarised in this table:

[Image: Table of parameters from the three models]

Next, we try the same linear regression, having censored our dependent variable above 30 seconds. The only thing we change is to replace y with ystar, which has any values above 30 replaced with exactly 30. This is "Model 2", and it effectively ignores the censoring, so we shouldn't be too surprised to get biased results out. In the table above, the highlighted cells show those parameters whose Model 1 values lie outside the 95% confidence intervals from Model 2.
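Creating the censored variable is one line (assuming the uncensored outcome is called y, as in our sketch above):

generate ystar = min(y, 30)   // any value above 30 is recorded as exactly 30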

mixed ystar c.week##i.train i.task || id: week, covariance(unstructured)

Many of these biases can be understood in terms of the data being pushed down to the red dashed line in the graphs above, and the linear model trying to compensate for that. But, when we use metobit (which has almost identical syntax to mixed) to fit "Model 3", we are asking Stata to integrate out the censored values over the part of the data distribution above the censoring threshold. That requires some assumptions in the model; this linear and normally distributed dataset fits them nicely, of course, but metobit will be justified in any situation where you are happy to use linear regression with the usual assumption of normal and homoscedastic residuals. Here, the biases of Model 2 all disappear, as seen in the table.

metobit ystar c.week##i.train i.task || id: week, ul(30) covariance(unstructured)

We can use margins and marginsplot to help us interpret the model. You might notice that margins takes much longer with metobit than with mixed: we are asking a much harder task of the computer!
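For example, something along these lines (a sketch: the at() range and the predict() option, which follows tobit-family postestimation naming, are our assumptions):

* predicted mean of the censored outcome by training group over weeks;
* ystar(., 30) requests E(y*) with no lower limit and upper limit 30
margins train, at(week=(1(1)8)) predict(ystar(., 30))
marginsplot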

[Image: marginsplot of the predictions from the multilevel Tobit model]

From this graph, it's obvious that the bias in Model 2 has been very well removed by the multilevel Tobit in Model 3.

 

The do-file for this example can be downloaded here.

 

Stata Tips #18 – Fitting more complex Bayesian models with BUGS, Stan or JAGS – from inside Stata

In this post, we will look at a couple of community-contributed commands for fitting Bayesian models to your data. Stata 14 introduced Bayesian functionality for the first time with bayesmh, and version 15 took this further with the bayes: prefix, which can conveniently be added before calling any of 45 estimation commands (just as you might type bootstrap: or svy:), but you can also fit bespoke models with external, free software: BUGS, Stan and JAGS.

Why would I need anything other than Stata's Bayesian commands?

BUGS, Stan and JAGS have been around since before Stata's bayesmh and bayes:. They used to be the only way to fit these kinds of models, so you might well find that someone has written code for one of them that you can easily adapt to your needs.

In particular, The BUGS Book and the Stan manual and case studies contain many examples of specialised models, and wise advice on adapting them.

Also, there are some kinds of model that Stata can't fit. At the time of writing, these include imputation of missing or censored data inside a Bayesian model, Gaussian processes, and geospatial models.

Installing commands for BUGS and JAGS

Various commands for interfacing between Stata and WinBUGS, OpenBUGS and JAGS were written by John Thompson of the University of Leicester in the UK. You can download them all from Stata Press, because they accompany his book Bayesian Analysis With Stata:


net from http://www.stata-press.com/data/bas

net get bayes

do bayes

Then, follow the instructions in the readme.txt file to complete installation. You will need the WinBUGS folder location, so read on if you haven't already installed it.

To work with JAGS, you will need the experimental v2 versions of these commands. We won't go into them in this post, so there's no need to install them now, but I do suggest you consider JAGS in preference to BUGS, as a more actively-maintained and used piece of software, and one that works on all operating systems.

If you are familiar with the version control software, Git, you might prefer to clone my GitHub repository to get all the files in one place, as well as the experimental v2 versions, which we'll discuss below:

git clone https://github.com/robertgrant/winbugsfromstatav2

If you don't do git, you can just point your browser to https://github.com/robertgrant/winbugsfromstatav2/archive/master.zip and unzip this into your personal ado path.

(What's my personal ado path? It's a folder on your computer where your own Stata commands get added, like stuff you install from SSC. You can find it by typing sysdir in Stata, then looking for the PERSONAL directory.)

[Image: Stata output]

This package was originally aimed at running WinBUGS, a graphical user interface to the BUGS program that the Medical Research Council Biostatistics Unit in Cambridge created for Windows computers, and some years ago stopped actively developing. You will need to install WinBUGS if you want to use it (there are instructions here). If you are not using Windows, or just want to use OpenBUGS or JAGS (which both run the same algorithm, the Gibbs sampler, but unlike WinBUGS, do not have a graphical user interface), then you will need to work with John's experimental v2 commands. They are included in my GitHub repository.

You will find instructions to install OpenBUGS on Windows or Linux (there is no guarantee that the Linux version will behave as expected on Mac) here and JAGS here. My personal advice is to use JAGS, which is actively maintained and has a large pool of expert users, who are likely to have detected and fixed any problems before you encounter them. JAGS is the only option for Mac anyway.

If you are considering using JAGS, check out John's blog post on the (relatively minor) differences. Once you install JAGS, look in the Readme.rtf file that is installed to find the location of the executable program on your operating system. You will need to tell Stata this location.

Installing commands for Stan

To use Stan, you need to install CmdStan, and there are instructions here, particularly in Appendix B of the CmdStan manual. Note that, for Windows, you should install RTools as well, and that, at the time of writing, it may be best to avoid CmdStan 2.18.0, which has a lot of speed improvements but which many Windows users have had problems installing.

As with JAGS, you need to know where CmdStan has been installed on your computer, so that you can tell Stata when you run it.

Then, installing the Stata commands is simple because they are on SSC:

ssc install stan

And, if you are using Windows, you will need the windowsmonitor command:

ssc install windowsmonitor

Running a simple model in WinBUGS

Now, having done the tiresome business of installation, we can start sending data and model specification from Stata to some of these Bayesian programs. We will use the simple regression from my previous blog post on Bayes in Stata:

sysuse auto, clear
regress price mpg

We could just put the bayes: prefix in front of this regress command, but the default priors are not suitable here, so to get better results, we specify a normally-distributed prior for the slope, and uniformly-distributed priors for the constant term and the residual variance:

bayes, prior({price: mpg}, normal(0, 4000000)) ///
    prior({price: _cons}, uniform(-10000,25000)) ///
    prior({sigma2}, uniform(0.1, 5000)) ///
    initial({price: _cons} 12000 {price: mpg} -300 {sigma2} 500) ///
    : regress price mpg

If we wrote out this model algebraically, it would be something like this:

[Image: model specification]
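Our best reconstruction of that specification in symbols, matching the priors in the bayes command above (the Normal prior is written with its variance, as Stata does):

$$
\begin{aligned}
\beta_0 &\sim \mathrm{Uniform}(-10000,\ 25000)\\
\beta_1 &\sim \mathrm{Normal}(0,\ 4000000)\\
\sigma^2 &\sim \mathrm{Uniform}(0.1,\ 5000)\\
\mu_i &= \beta_0 + \beta_1\,\mathrm{mpg}_i\\
\mathrm{price}_i &\sim \mathrm{Normal}(\mu_i,\ \sigma^2)
\end{aligned}
$$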

The first three lines are the priors, the fourth deterministically sets the linear predictor mu, given values drawn from the priors, and the fifth defines the likelihood. In BUGS, JAGS or Stan, you specify the model in individual statements like this.

The model code for BUGS or JAGS is the same:

model {
    beta0 ~ dunif(-10000,25000)
    beta1 ~ dnorm(0,0.00000025)
    tau ~ dgamma(0.001,0.001)
    for (i in 1:74) {
        mu[i] <- beta0 + (beta1 * mpg[i])
        price[i] ~ dnorm(mu[i],tau)
    }
}

You might notice two things that differ from the Stata 15 specification. First, instead of variance, we have to specify precision (1/variance), which is why normal(0, 4000000) becomes dnorm(0, 0.00000025): 1/4,000,000 = 0.00000025. Second, BUGS is much less tolerant of priors on precisions other than the gamma distribution, so here we are using a diffuse gamma on tau rather than a uniform on the variance. (Note that this choice has been heavily criticised...)

There are some ways of specifying this model inside the do-file, but to keep it simple, you can just type it up inside WinBUGS or a text editor (like Notepad, Atom, Sublime Text, vim, or whatever). Here, I saved it as autommodel.txt. The rest of the files that WinBUGS will need are generated for you by John Thompson's excellent commands.

wbslist or wbsarray can be used to write your data out into a text format. In my experience, you need to use the format option to get readable results. wbslist will also write out the initial values. Here, we are just running one chain.

wbsscript writes the script file for WinBUGS that tells it where to find the model, the initial values and data, how many iterations to run, and so on. If you know MCMC analysis, the options are pretty self-explanatory. The noquit option keeps WinBUGS open, and Stata won't proceed until you shut it, which is handy while you are debugging and getting your code right.

Having obtained your results by running wbsrun, wbscoda reads the iterations of the Markov chains into Stata, which you can summarise statistically with mcmcstats and visually with mcmctrace, as well as other approaches.

Here are the outputs from mcmcstats:

[Image: Stata output]

Below, you can see the traceplots from mcmctrace, then a bivariate trace where we simply get a line chart of the two betas, which we expect to be quite strongly negatively correlated.

[Image: traceplots]
[Image: bivariate traceplot]

If you wanted to use OpenBUGS instead, you can just add the openbugs option to your wbsscript, wbsrun and wbscoda commands. Thompson's v2 versions of the commands are an experimental approach to extending this to JAGS. They have not been fully tested so you should use them with caution.

Running a simple model in Stan

To do the same analysis in Stan, we need a somewhat different model file:


data {
    int N;
    real mpg[N];
    real price[N];
}
parameters {
    real beta0;
    real beta1;
    real<lower=0> sigma;
}
model {
    real mu[N];
    beta0 ~ uniform(-10000,25000);
    beta1 ~ normal(0,2000);
    sigma ~ uniform(0.1,5000);
    for(i in 1:N) {
        mu[i] = beta0 + (beta1*mpg[i]);
        price[i] ~ normal(mu[i],sigma);
    }
}

Data and parameters have to be declared before they can be used; we assign them to be integers or real numbers in this example. This helps the computer work more efficiently. Also, instead of precision, we can specify the standard deviation, which is usually more intuitive. The algorithm used by Stan is stable with a wide range of prior distributions, so we can use a uniform on the standard deviation, and we do not (usually) have to specify initial values. This model file is saved as auto.stan. Then, we store the number of observations in a global macro N, and use the stan command:

count
global N = r(N)
stan mpg price, globals("N") modelfile("auto.stan") ///
    cmdstandir("/Users/robert/cmdstan-2.16.0") load

Here are the outputs:

[Image: Stata output]

For a very simple example with only 74 observations and 3 parameters, Stan takes notably longer than BUGS. As the size of the problem grows, or as parameters become more correlated, or as the priors become more bespoke (non-conjugate), Stan will win on speed. The load option reads the chain back into Stata, and then we can generate traceplots and a bivariate trace on the betas.

[Image: traceplots]
[Image: bivariate traceplot]

The do-file for all the commands (except installations) in this post can be downloaded by clicking here.

 

Stata Tips #17 – Threshold regression for time series in Stata 15


In time series analysis, sometimes we suspect that relationships among variables change at some point in time. The new threshold command allows you to look for these changes in a statistically informed way, which helps you avoid the potential for bias if you just eyeball line charts and pick the point that fits your expectations.

Here's a simple example. The goldfinch is a small songbird found throughout Eurasia. In these population data from the United Kingdom, you can see a sudden change in the time series at around 1986.

[Image: British Trust for Ornithology population scatterplot with moving average]

Image copyright British Trust for Ornithology

This is generally ascribed to the birds learning how to forage in suburban gardens. Prior to that point, you might have studied the effect of the size of field margins on farms on the goldfinch population, on the basis that the birds eat seeds of wild plants that grow on the margins of cultivated land. The post-1986 data would throw your analysis out; the birds near human habitation were no longer totally dependent on wild plants. Identifying the threshold and fitting different models on either side allows you to improve causal understanding or prediction.

An example with weather data

Let's look at a fairly small dataset: a few weather variables from an observation station to the South of London in August 2018. You can download the data file here.

We will open the file (I suggest you browse it to see what's inside) and declare it to be a time series.

use "kenley.dta", clear

tsset datetime, clocktime delta(1 hour)

 

Next, we make two new variables: decimalday will be handy for plotting, and hoursine is a quick and dirty way of incorporating the daily oscillation in temperature, with minimum about 5 a.m. and maximum about 5 p.m., fairly standard in an English summer. We are imposing a sinusoidal shape on this oscillation, and fixing the time of the minimum and maximum, which is not a great idea for time series analysis, but will simplify what follows and allow us to focus on the threshold command.

generate decimalday = day + (hour/24)
generate hoursine = sin((hour-11)/12*3.14)

[Image: Line chart of temperature, humidity and air pressure change]

If we regress temperature on hoursine, we can evaluate the size of the diurnal variation. Bear in mind that hoursine varies from -1 to 1, so the coefficient tells us the average change from 5 a.m. to 11 a.m. or 11 a.m. to 5 p.m.

regress temperature hoursine

This change is 3.2 degrees Celsius, with 95% confidence interval from 2.8 to 3.6. The intercept term can be interpreted as the average temperature at 11 a.m. (17.2 degrees, confidence interval 17.0 to 17.5). If you compare this to the line chart for temperatures, the size of the sine wave looks fairly sensible, but it is centered around 17 degrees, when actually the month started with temperatures notably above that, which then dropped around the 8th of August to be below that. Looking more closely, the period before the 8th seems to have larger amplitude (height of the wave) than after the 8th. Really, we need to consider at least one threshold point.

[Image: Line chart of temperature and hoursine regression]

To run the corresponding threshold regression, we can simply type:

threshold temperature, threshvar(decimalday) regionvars(hoursine)

threshvar() tells Stata which variable to use to detect the threshold location(s), and regionvars() tells it which variable(s) will be used as predictors on either side of the threshold(s).

[Image: threshold regression of temperature on hoursine]

threshold fits linear regressions (I'll discuss later what you can do in non-Gaussian error situations), and it runs a fairly exhaustive search along the range of threshvar, fitting 557 regressions in this case. Fortunately, linear regressions are fitted by simple matrix algebra and are hence very fast. The default is to look for one threshold, but we will extend that later.

The output includes Akaike information criterion (AIC), Bayesian information criterion (BIC) and Hannan-Quinn information criterion (HQIC); BIC in particular allows inter-model comparisons. This is helpful if you want to compare a model with one threshold to a model with two, for example.

There is also a sum of squared residuals (SSR), which is 4908 for one threshold. We can compare this with the simple regression above, where the residual sum of squares is 9342 – a big improvement! The threshold itself occurs on the night of 7-8 August (decimalday = 7.875), which is indeed the most obvious changepoint in the time series. In fact, this coincides with a cold front moving over the area after a record-breaking period of hot, dry weather.

Visualising the new model's predictions, we can see it does much better, although there might be a case for subdividing the period after the 8th further.

[Image: Line chart of temperature and hoursine regression, one threshold]

The accompanying do-file shows how the number of thresholds, and their locations, can be extracted into Stata macros and reused, for example in graphics, without hard-coding their values.
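In outline, it looks something like this (a sketch: the stored-matrix name e(thresholds) is our assumption, so check ereturn list after running threshold):

* pull the estimated threshold location(s) into macros for reuse
matrix T = e(thresholds)
local th1 = T[1,1]
display "estimated threshold at decimalday = `th1'"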

More than one threshold

To allow for more thresholds, we can just add the option optthresh(4).

threshold ... optthresh(4)

After 4 scans through the data, we get four threshold points. The residual sum of squares is shown as each one is added, ending at 3138 with a BIC of 1114, notably lower than the 1386 of the one-threshold model. The chart below shows their locations in purple, with the one-threshold model's location in orange (overlapping one of the purple lines).

[Image: 4-threshold regression of temperature on hoursine]

The new model clearly has predictions that more closely follow the observed data, and this is supported by the BIC. Bear in mind, though, that the residual sum of squares will almost always decrease as more thresholds are added, so we should be cautious about adding these additional thresholds. I'll come back to this point at the end.

[Image: Line chart of temperature and hoursine regression, four thresholds]

Threshold regression with more predictors

Now that we've seen the basics of how threshold works, let's try it out with a more realistic regression. In the weather data, we might want to predict temperature based on hoursine, as well as the lagged values of humidity and pressure_change.

[Image: regression of temperature on three predictors]

Running threshold with one threshold, we get:

[Image: threshold regression of temperature on three predictors, one threshold]

This has put the one threshold much later in the month, probably because the disturbance in temperature around the 8th is adequately predicted by disturbances in humidity and pressure too. The observed and predicted plot looks like there are a couple of short periods of systematic over- or under-estimation.

[Image: threshold regression of temperature on three predictors, one threshold]

A four-threshold model follows:

[Image: threshold regression of temperature on three predictors, four thresholds]

The coefficients change quite a lot, and some predictors move to having high p-values, in some of the regions. A closer examination of what this means to an expert in weather prediction would perhaps help to trim the thresholds.

[Image: predictions from the threshold regression of temperature on three predictors, four thresholds]

How could threshold regression be used for model building in practice?

As we've seen, threshold regression is an exploratory technique, and the final choice of model has to be made by the analyst, taking theory into consideration alongside statistics like the BIC and residual sums of squares. There is potential to overfit, especially if you set optthresh quite high, but this is really no different from any other model-building procedure. In the weather data, although I am no meteorologist, it seems justified that there is a threshold as the cold front arrives (which was suggested in both these regressions), and perhaps another once the band of rain (the high and aperiodic humidity on the 9th-11th) has passed and is replaced by colder air and showers (which was not suggested by threshold). Beyond that, later thresholds are questionable.

The best way to guard against that is to plot your predictions and residuals and look for patterns that can refine your model. If possible, split your data into a training set (that you run threshold on), and a test set (that you apply the training set's model to). With this cross-validation approach, you can quantify the effect of increasing optthresh.
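A minimal sketch of that idea on the weather data (the cutpoint at day 24 is arbitrary, and we assume predict is available after threshold for the held-out observations):

* train on the first 24 days, test on the rest of the month
threshold temperature if day <= 24, threshvar(decimalday) regionvars(hoursine)
predict temphat
generate sqerr = (temperature - temphat)^2 if day > 24
summarize sqerr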

What about non-Gaussian errors, or when the dependent variable is a count or binary?

Although threshold only tests for thresholds in linear regression (with the usual Gaussian, homoscedastic and independent errors), if you regard it as an exploratory technique pointing the way to a more detailed model, then there is no reason why you could not use it to get a selection of candidate thresholds.

Suppose you have count data and intend to fit a Poisson model. If you take logarithms (adding a small number to avoid log(0)), you could run threshold and then try out the thresholds in poisson, as sketched below.
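A sketch, with hypothetical variables y (a count outcome), x (a predictor) and t (the time variable), and 0.5 as the arbitrary small constant:

generate logy = log(y + 0.5)
threshold logy, threshvar(t) regionvars(x)
* then refit each suggested region with poisson, e.g. for a threshold at t = 15:
poisson y x if t < 15
poisson y x if t >= 15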

Non-Gaussian errors might give you incorrect standard errors but still provide reasonable coefficients and thresholds, to be properly evaluated in a more robust model. You will just have to be careful to spot any outlier residuals, or clumps of outliers, that might induce a threshold where there really isn't enough evidence to justify it, given the "true" residual distribution.

The do-file for this example can be downloaded here.