2020 London Stata Online Conference - Proceedings

26th UK Stata Conference - Online, 10 & 11 September 2020

From datasets to metadatasets in Stata

Roger Newson
Department of Primary Care and Public Health, Imperial College London

Metadatasets are Stata datasets, in files or in frames, which may have one observation per file, per dataset, per variable, or per variable value. Metadatasets can be used to modify a Stata database, or to make a Stata database self-documenting, especially if converted to non-Stata formats, such as HTML or even Microsoft Excel. We present some user-written packages, updated to Stata version 16, for creating and using metadatasets. The xdir package creates a resultsset with one observation per file in a folder conforming to a user-specified pattern. The descgen package inputs an xdir resultsset, and generates a new variable indicating whether each file is a Stata dataset, and other new variables containing dataset attributes, such as the dataset label and characteristics, the sort key of variables, and the numbers of observations and variables. The vallabdef package inputs a dataset with one observation per label name per value per value label, and generates Stata value labels. The vallabsave package loads and saves value labels from and to label-only datasets, and transfers value labels between data frames. The descsave package creates a metadataset with one observation per variable in a dataset, and data on variable attributes (including characteristics). The invdesc package modifies the variable attributes of the dataset in the current frame, inputting a descsave resultsset in a second data frame to set the variable attributes, and inputting value labels from a dataset in a third data frame. The datasets containing the variable attributes and value labels may be produced as resultssets by Stata packages, or produced manually in a spreadsheet using LibreOffice Calc or Microsoft Excel, and input into Stata datasets using import delimited or import excel.

Second Generation P-Values (SGPV) for common estimation commands in Stata

Sven-Kristjan Bormann
School of Economics and Business Administration, University of Tartu, Estonia

This presentation introduces commands to calculate Second Generation P-Values (SGPV) for common estimation commands in Stata. The sgpv command and its companions allow easy calculation of SGPVs and the associated diagnostics, as well as plotting of SGPVs against standard p-values. SGPVs were introduced by Blume et al. (2018, 2019) as an alternative to, and an upgrade of, standard p-values.
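
As a rough illustration of the quantity being computed, the sketch below evaluates an SGPV from an interval estimate and an interval null hypothesis, using the overlap formula as it is usually stated for Blume et al.'s definition. It is written in Python rather than Stata, the function name sgpv and its arguments are hypothetical, and it is not the code of the sgpv command.

```python
def sgpv(ci_lo, ci_hi, null_lo, null_hi):
    """Second-generation p-value for an interval estimate [ci_lo, ci_hi]
    and an interval null hypothesis [null_lo, null_hi].

    Assumed formula: (|I ∩ H0| / |I|) * max(|I| / (2 |H0|), 1).
    """
    ci_len = ci_hi - ci_lo
    null_len = null_hi - null_lo
    if ci_len <= 0 or null_len <= 0:
        raise ValueError("both intervals must have positive length")
    # Length of the overlap between the interval estimate and the null interval
    overlap = max(0.0, min(ci_hi, null_hi) - max(ci_lo, null_lo))
    # The correction factor caps the value at 1/2 when the interval estimate
    # is more than twice as wide as the null interval (inconclusive data)
    return (overlap / ci_len) * max(ci_len / (2.0 * null_len), 1.0)


# Interval estimate entirely outside, entirely inside, and much wider than the null
outside = sgpv(0.5, 1.5, -0.1, 0.1)
inside = sgpv(-0.05, 0.05, -0.1, 0.1)
wide = sgpv(-1.0, 1.0, -0.1, 0.1)
```

A value of 0 means the interval estimate lies wholly outside the null interval, 1 means it lies wholly inside, and values near 1/2 indicate inconclusive data.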

Reference 1, Reference 2

xthst: Testing for slope homogeneity in Stata

Jan Ditzen & Tore Bersvendsen
Heriot-Watt University, Edinburgh & Kristiansand Kommune, Norway

This talk introduces a new community-contributed Stata command, xthst, to test for slope homogeneity in panels with a large number of observations over cross-sectional units and time periods. The program implements the delta test derived by Pesaran and Yamagata (2008). Under the null hypothesis, slope coefficients are homogeneous across cross-sectional units. xthst also includes two extensions. The first is a heteroskedasticity- and autocorrelation-robust test along the lines of Blomquist and Westerlund (2013). The second is a version that is robust to cross-sectional dependence. The talk will cover the econometric theory of the tests, explain xthst and its options, and give empirical examples. Monte Carlo evidence will be presented to show that the test behaves as expected.

  • Blomquist, J., and J. Westerlund. 2013. Testing slope homogeneity in large panels with serial correlation. Economics Letters 121(3): 374-378.
  • Pesaran, M. H., and T. Yamagata. 2008. Testing slope homogeneity in large panels. Journal of Econometrics 142(1): 50-93.
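
To make the delta statistic concrete, here is a minimal sketch for the special case of a single regressor (k = 1), following the usual statement of the Pesaran and Yamagata (2008) statistic: unit-specific slopes are compared with a weighted fixed-effects slope, and the standardized dispersion is approximately standard normal under homogeneity. The sketch is in Python, the names delta_test and simulate are hypothetical, and it covers neither the small-sample adjustment nor the robust extensions offered by xthst.

```python
import math
import random


def delta_test(y, x):
    """Delta statistic for slope homogeneity with one regressor (k = 1).
    y, x: lists of N lists, each holding T observations for one unit."""
    n_units, k = len(y), 1
    sxx, sxy, demeaned = [], [], []
    for yi, xi in zip(y, x):
        t = len(yi)
        ybar, xbar = sum(yi) / t, sum(xi) / t
        yd = [v - ybar for v in yi]          # remove unit fixed effect
        xd = [v - xbar for v in xi]
        demeaned.append((yd, xd))
        sxx.append(sum(a * a for a in xd))
        sxy.append(sum(a * b for a, b in zip(xd, yd)))
    beta_i = [b / a for a, b in zip(sxx, sxy)]        # unit-specific slopes
    beta_fe = sum(sxy) / sum(sxx)                     # pooled fixed-effects slope
    # Unit-specific error variances based on the pooled fixed-effects residuals
    sigma2 = [sum((a - beta_fe * b) ** 2 for a, b in zip(yd, xd)) / (len(yd) - 1)
              for yd, xd in demeaned]
    # Weighted fixed-effects slope
    beta_wfe = (sum(b / s for b, s in zip(sxy, sigma2))
                / sum(a / s for a, s in zip(sxx, sigma2)))
    # Weighted dispersion of the unit slopes around the weighted pooled slope
    s_tilde = sum((bi - beta_wfe) ** 2 * a / s
                  for bi, a, s in zip(beta_i, sxx, sigma2))
    return math.sqrt(n_units) * (s_tilde / n_units - k) / math.sqrt(2 * k)


def simulate(n_units, t, slope_sd, seed):
    """Panel with unit intercepts and slopes drawn around 1 with sd slope_sd."""
    rng = random.Random(seed)
    y, x = [], []
    for _ in range(n_units):
        alpha = rng.gauss(0, 1)
        beta = 1.0 + rng.gauss(0, slope_sd)
        xi = [rng.gauss(0, 1) for _ in range(t)]
        yi = [alpha + beta * v + rng.gauss(0, 1) for v in xi]
        x.append(xi)
        y.append(yi)
    return y, x


delta_hom = delta_test(*simulate(50, 30, 0.0, seed=1))   # homogeneous slopes
delta_het = delta_test(*simulate(50, 30, 1.0, seed=1))   # heterogeneous slopes
```

Large positive values of the statistic point towards heterogeneous slopes; the simulated heterogeneous panel should produce a far larger value than the homogeneous one.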

Unit root tests for explosive behaviour

Jesús Otero & Christopher F. Baum
Universidad del Rosario, Bogotá, Colombia & Boston College, Chestnut Hill, MA

We present the new Stata command radf to compute several tests for explosive behaviour in time series. The command implements the right-tail augmented Dickey and Fuller (1979) (ADF) unit root test, and its further developments based on supremum statistics derived from ADF-type regressions estimated using rolling windows, recursive windows (Phillips, Wu and Yu 2011), and recursive flexible windows (Phillips, Shi and Yu 2015). The command allows for the number of lags of the dependent variable in the test regression to be either specified by the user or endogenously determined using a data-dependent procedure. The use of the command is illustrated with an empirical example.
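
The recursive-window idea can be sketched as follows: compute the ADF t statistic on forward-expanding subsamples that all start at the first observation, and take the supremum. The Python sketch below assumes a test regression with a constant and no lagged differences; the names adf_tstat, sadf, and simulate_ar1 are hypothetical, and the code is not that of radf. Critical values for right-tail tests are non-standard and are not computed here.

```python
import math
import random


def adf_tstat(y):
    """t statistic on rho in: dy_t = a + rho * y_{t-1} + e_t (no lagged differences)."""
    lag = y[:-1]
    dy = [b - a for a, b in zip(y[:-1], y[1:])]
    n = len(dy)
    lag_bar, dy_bar = sum(lag) / n, sum(dy) / n
    sxx = sum((v - lag_bar) ** 2 for v in lag)
    sxy = sum((v - lag_bar) * (d - dy_bar) for v, d in zip(lag, dy))
    rho = sxy / sxx
    a = dy_bar - rho * lag_bar
    rss = sum((d - a - rho * v) ** 2 for v, d in zip(lag, dy))
    se = math.sqrt(rss / (n - 2) / sxx)
    return rho / se


def sadf(y, min_window):
    """Supremum of ADF t statistics over forward-expanding windows y[0:r]."""
    return max(adf_tstat(y[:r]) for r in range(min_window, len(y) + 1))


def simulate_ar1(rho, n, y0, seed):
    """AR(1) series y_t = rho * y_{t-1} + e_t with standard normal errors."""
    rng = random.Random(seed)
    y = [y0]
    for _ in range(n - 1):
        y.append(rho * y[-1] + rng.gauss(0, 1))
    return y


random_walk = simulate_ar1(1.0, 100, 10.0, seed=1)
explosive = simulate_ar1(1.05, 100, 10.0, seed=1)
sadf_rw = sadf(random_walk, min_window=20)
sadf_explosive = sadf(explosive, min_window=20)
```

The explosive series yields a much larger supremum statistic than the random walk, which is the pattern the right-tail tests are designed to detect.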

A gmm recipe to get standard errors for control function and other two-step estimators

Enrique Pinzon
StataCorp, College Station, TX

It is common to use residuals from the first step of estimation as regressors in the second step, where interest lies in the second-step coefficients and effects. Control function methods are one example of such estimators. Getting standard errors in these cases is challenging, and thus bootstrap methods are commonly used. I will illustrate how to use Stata's gmm command to obtain correct standard errors, using cross-sectional and panel-data examples. The GMM estimates give correct coverage and reduce computation time relative to commonly used bootstrap methods.

randregret: A command for fitting Random Regret Minimization Models

Álvaro A. Gutiérrez Vargas, Michel Meulders & Martina Vandebroek
Research Centre for Operations Research and Statistics (ORSTAT), KU Leuven, Belgium

In this article, we describe the randregret command, which implements a variety of Random Regret Minimization (RRM) models. The command allows the user to apply the classic RRM model (Chorus, 2010), the Generalized RRM model (Chorus, 2014), and also the mu-RRM and Pure RRM models (Van Cranenburgh, Guevara and Chorus, 2015). We illustrate the usage of the randregret command using stated choice data on route preferences. The command offers robust and cluster standard error correction using analytical expressions of the score functions. It also offers likelihood ratio tests, which can be used to assess the relevance of a given model specification. Finally, predicted probabilities from each model can be easily computed using the randregretpred postestimation command.

  • Chorus, C. G. 2010. A new model of random regret minimization. European Journal of Transport and Infrastructure Research 10.
  • Chorus, C. G. 2014. A generalized random regret minimization model. Transportation Research Part B: Methodological 68: 224–238.
  • Van Cranenburgh, S., C. A. Guevara, and C. G. Chorus. 2015. New insights on random regret minimization models. Transportation Research Part A: Policy and Practice 74: 91–109.
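
For readers unfamiliar with RRM, the sketch below computes choice probabilities for the classic model as it is usually written for Chorus (2010): the regret of an alternative sums ln(1 + exp(beta_m (x_jm - x_im))) over competing alternatives j and attributes m, and probabilities are a logit in negative regret. The Python function name and the route data are hypothetical, and this is not the randregret implementation.

```python
import math


def rrm_probabilities(alternatives, betas):
    """Classic RRM choice probabilities.
    alternatives: list of attribute tuples, one per alternative.
    betas: one taste parameter per attribute."""
    regrets = []
    for i, x_i in enumerate(alternatives):
        regret = 0.0
        for j, x_j in enumerate(alternatives):
            if j == i:
                continue
            # Attribute-level regret from comparing alternative i with rival j
            for beta, a_i, a_j in zip(betas, x_i, x_j):
                regret += math.log1p(math.exp(beta * (a_j - a_i)))
        regrets.append(regret)
    weights = [math.exp(-r) for r in regrets]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical routes described by (travel time in minutes, cost)
routes = [(30, 5), (40, 4), (45, 6)]
betas = (-0.1, -0.5)   # assumed negative tastes for time and cost
probs = rrm_probabilities(routes, betas)
```

In this made-up example the third route is both slower and more expensive than the first, so it accumulates the most regret and receives the lowest choice probability.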

Agent based models in Mata: Modelling aggregate processes, like the spread of a disease

Maarten Buis
University of Konstanz

An Agent Based Model (ABM) is a simulation in which agents that each follow simple rules interact with one another and thus produce an often surprising outcome at the macro level. The purpose of an ABM is to explore mechanisms through which actions of the individual agents add up to a macro outcome by varying the rules that agents have to follow or varying with whom the agent can interact (for example, varying the network). These models have many applications, like the study of segregation of neighborhoods or the adoption of new technologies. However, the application that is currently most topical is the spread of a disease. In this talk, I will give an introduction on how to implement an ABM in Mata, by going through the simple models I (a sociologist, not an epidemiologist) used to make sense of what is happening with the COVID-19 pandemic.
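
A toy version of such a model can be sketched in a few lines. The example below is in Python rather than Mata and is not taken from the talk: each agent is susceptible (S), infectious (I), or recovered (R), infectious agents meet a few randomly chosen agents per day, and every rule and parameter value (run_abm, n_contacts, p_transmit, days_infectious) is an assumption chosen purely for illustration.

```python
import random


def run_abm(n_agents=500, n_contacts=4, p_transmit=0.1,
            days_infectious=7, n_days=100, seed=42):
    """Simulate a simple S-I-R agent-based model with random mixing.
    Returns a list of (susceptible, infectious, recovered) counts per day."""
    rng = random.Random(seed)
    state = ["S"] * n_agents
    days_left = [0] * n_agents
    state[0], days_left[0] = "I", days_infectious   # one initial case
    history = []
    for _ in range(n_days):
        # Rule 1: each infectious agent meets n_contacts random agents
        newly_infected = []
        for agent in range(n_agents):
            if state[agent] != "I":
                continue
            for _ in range(n_contacts):
                other = rng.randrange(n_agents)
                if state[other] == "S" and rng.random() < p_transmit:
                    newly_infected.append(other)
        # Rule 2: infectious agents recover after days_infectious days
        for agent in range(n_agents):
            if state[agent] == "I":
                days_left[agent] -= 1
                if days_left[agent] == 0:
                    state[agent] = "R"
        # New infections take effect at the end of the day
        for agent in newly_infected:
            if state[agent] == "S":
                state[agent], days_left[agent] = "I", days_infectious
        history.append((state.count("S"), state.count("I"), state.count("R")))
    return history


history = run_abm()
```

Varying n_contacts or replacing the random draw of contacts with a fixed network is the kind of rule change the abstract describes for exploring how individual behaviour adds up to the epidemic curve.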

New Bayesian features: multiple chains, predictions, and more

Yulia Marchenko
StataCorp, College Station, TX

Stata 16 expanded the Bayesian suite of commands with many new features, including multiple chains and Bayesian predictions. This presentation will showcase these features. I will demonstrate how to run multiple chains, including in parallel, and how to use them to check for MCMC convergence. I will show how to compute Bayesian predictions and how to use them for model diagnostic checks. And more.

Non-parametric estimation in multi-state survival models: An update to msaj

Micki Hill, Paul C. Lambert & Michael J. Crowther
University of Leicester, Leicester & Karolinska Institutet, Stockholm

The multistate package in Stata can provide a range of predictions from parametric multi-state models via the predictms command. However, non-parametric estimates produced by the accompanying msaj command were limited. The aim of this work was to update msaj to provide a comprehensive set of non-parametric estimates.

Methods: Two useful metrics in a multi-state model are transition probabilities and expected length of stay. Transition probabilities from a Markov model can be estimated non-parametrically using the empirical Aalen–Johansen estimator (analogous to the Kaplan–Meier estimator in standard survival). Expected length of stay can be estimated by integrating the transition probabilities. In this setting, this involves a summation of rectangles, as the Aalen–Johansen estimator is a step function.
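
The "summation of rectangles" can be illustrated with a short sketch. It assumes the transition probability into a state is already available as a right-continuous step function (jump times and the value held from each jump until the next) and integrates it between a start time and a horizon. The Python function name and the numbers are hypothetical; the sketch does not compute the Aalen–Johansen estimator itself and is not the msaj code.

```python
def length_of_stay(jump_times, probs, start, horizon):
    """Integrate a right-continuous step function from start to horizon.
    probs[k] is the probability held on [jump_times[k], jump_times[k + 1]);
    the last value is held until the horizon.
    Assumes jump_times is sorted and jump_times[0] <= start."""
    total = 0.0
    for k, (t, p) in enumerate(zip(jump_times, probs)):
        next_t = jump_times[k + 1] if k + 1 < len(jump_times) else horizon
        left, right = max(t, start), min(next_t, horizon)
        if right > left:
            total += p * (right - left)   # area of one rectangle
    return total


# Hypothetical probability of occupying a state, dropping at times 2 and 5
times = [0.0, 2.0, 5.0]
state_prob = [1.0, 0.6, 0.3]
los_from_0 = length_of_stay(times, state_prob, start=0.0, horizon=10.0)
los_from_3 = length_of_stay(times, state_prob, start=3.0, horizon=10.0)
```

Because the estimator only changes at observed event times, the integral is exact rather than a numerical approximation.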

Updates to msaj: Previously, only transition probabilities from state 1 at time 0 could be obtained using msaj, along with corresponding confidence intervals. Following the update, the starting state, entry time and exit time can be specified. Estimates can now also be produced for bidirectional models and expected length of stay can be obtained.

Illustrative example: A non-parametric analysis was performed on hospital epidemiology data, which demonstrated how msaj can be implemented. Three parametric multi-state models were also fitted to illustrate how non-parametric estimates can be used as a reference to informally compare models. Transition probabilities and expected length of stay were estimated from state 1 at time 0 and from state 2 at time 3 (relevant metrics for this dataset).

Conclusion: The updated msaj provides a comprehensive set of non-parametric predictions, allowing for analyses with no assumptions made on transition rates and providing a reference for parametric models. Extensions could include fixed horizon predictions and confidence intervals for expected length of stay.

kinkyreg: Instrument-free inference for linear regression models with endogenous regressors

Sebastian Kripfganz & Jan F. Kiviet
University of Exeter Business School & University of Amsterdam

In models with endogenous regressors, a standard regression approach is to exploit just- or over-identifying orthogonality conditions by using instrumental variables. In just-identified models, the identifying orthogonality assumptions cannot be tested without the imposition of other non-testable assumptions. While formal testing of over-identifying restrictions is possible, its interpretation still hinges on the validity of an initial set of untestable just-identifying orthogonality conditions. We present the kinkyreg Stata program for kinky least squares (KLS) inference that adopts an alternative approach to identification. By exploiting non-orthogonality conditions in the form of bounds on the admissible degree of endogeneity, feasible test procedures can be constructed that do not require instrumental variables. The KLS confidence bands can be more informative than confidence intervals obtained from instrumental variable estimation, in particular when the instruments are weak. Moreover, the approach facilitates a sensitivity analysis for the standard instrumental variable inference. In particular, it allows assessment of the validity of previously untestable just-identifying exclusion restrictions. Further KLS-based tests include heteroskedasticity, functional form, and serial correlation tests.
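
The idea of correcting least squares for a postulated degree of endogeneity can be sketched for the simplest case of one regressor and a constant. If the correlation between regressor and error is rho, the OLS slope is inconsistent by rho * sigma_u / sigma_x, and the OLS residual variance converges to sigma_u^2 (1 - rho^2), so both quantities needed for the correction can be recovered from OLS output once rho is fixed. The Python sketch below applies this on simulated data; the function names are hypothetical, it gives point estimates only, and it is not the kinkyreg implementation.

```python
import math
import random


def ols_simple(y, x):
    """OLS slope, residual variance, and variance of x (regression with constant)."""
    n = len(y)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((v - x_bar) ** 2 for v in x)
    sxy = sum((v - x_bar) * (w - y_bar) for v, w in zip(x, y))
    slope = sxy / sxx
    resid_var = sum((w - y_bar - slope * (v - x_bar)) ** 2
                    for v, w in zip(x, y)) / n
    return slope, resid_var, sxx / n


def kls_slope(y, x, rho):
    """Slope corrected for an assumed correlation rho between x and the error."""
    slope, resid_var, x_var = ols_simple(y, x)
    sigma_u = math.sqrt(resid_var / (1.0 - rho ** 2))   # implied error sd
    return slope - rho * sigma_u / math.sqrt(x_var)


# Simulated data with true slope 1 and corr(x, u) = 0.5
rng = random.Random(0)
true_rho = 0.5
x = [rng.gauss(0, 1) for _ in range(5000)]
u = [true_rho * v + math.sqrt(1 - true_rho ** 2) * rng.gauss(0, 1) for v in x]
y = [2.0 + 1.0 * v + e for v, e in zip(x, u)]

ols_estimate = ols_simple(y, x)[0]
kls_at_truth = kls_slope(y, x, true_rho)
# In practice rho is unknown, so the estimate is traced over a grid of values
grid = [(r / 10, kls_slope(y, x, r / 10)) for r in range(-8, 9)]
```

Tracing the corrected slope over a range of admissible correlations, as in the final line, is what produces bands rather than a single point estimate.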

Sample size calculation for an ordered categorical outcome

Ian White, Ella Marley-Zagar, Tim P. Morris, Mahesh K. B. Parmar, Abdel G. Babiker
MRC Clinical Trials Unit at UCL

We describe a new command, artcat, to calculate sample size or power for a clinical trial or similar experiment with an ordered categorical outcome, where analysis is by the proportional odds model.

The command implements an existing and a new method. The existing method is that of Whitehead (1993). The new method is based on creating a weighted data set containing the expected counts per person, and analysing it with ologit. We show how the weighted data set can be used to compute variances under the null and alternative hypotheses and hence to produce a more accurate calculation. We also show that the new method can be extended to handle non-inferiority trials and to settings where the proportional odds model does not fit the expected data.

We illustrate the command and explore the value of an ordered categorical outcome over a binary outcome in various settings.

We show by simulation that the methods perform well and are very similar when treatment effects are moderate. With very large treatment effects, the new method is a little more accurate than Whitehead’s method. The new method also applies to the case of a binary outcome, and we show that it compares favourably with the official power command and the community-contributed artbin.

Reference: Whitehead, J. 1993. Sample size calculations for ordered categorical data. Statistics in Medicine 12(24): 2257–2271.
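
For orientation, the existing method can be sketched as follows, using the formula commonly attributed to Whitehead (1993) for 1:1 allocation: N = 12 (z_{1-alpha/2} + z_{power})^2 / (theta^2 (1 - sum of cubed average category probabilities)), where theta is the log odds ratio. The Python function name and the convention that the odds ratio multiplies the control-arm cumulative odds are assumptions of this sketch, which is not the artcat code and does not cover the new ologit-based method.

```python
import math
from statistics import NormalDist


def whitehead_total_n(p_control, odds_ratio, alpha=0.05, power=0.9):
    """Total sample size (both arms, 1:1 allocation) under proportional odds.
    p_control: category probabilities in the control arm (summing to 1)."""
    theta = math.log(odds_ratio)
    # Cumulative probabilities in the control arm (excluding the final 1.0)
    cum_control, running = [], 0.0
    for p in p_control[:-1]:
        running += p
        cum_control.append(running)
    # Experimental-arm cumulative probabilities implied by proportional odds
    cum_exper = [c * odds_ratio / (1.0 - c + c * odds_ratio) for c in cum_control]
    p_exper = [b - a for a, b in zip([0.0] + cum_exper, cum_exper + [1.0])]
    # Category probabilities averaged over the two arms
    p_bar = [(a + b) / 2.0 for a, b in zip(p_control, p_exper)]
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0) + NormalDist().inv_cdf(power)
    return 12.0 * z ** 2 / (theta ** 2 * (1.0 - sum(p ** 3 for p in p_bar)))


# Hypothetical four-category outcome with an odds ratio of 2
n_ordinal = whitehead_total_n([0.2, 0.3, 0.3, 0.2], odds_ratio=2.0)
# The same odds ratio with the outcome dichotomized into two equal categories
n_binary = whitehead_total_n([0.5, 0.5], odds_ratio=2.0)
```

Comparing n_ordinal with n_binary gives a quick feel for the sample-size saving from retaining the ordered categories, which is the comparison the abstract refers to.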

Fancy graphics: Force-directed diagrams

Philippe van Kerm
University of Luxembourg and Luxembourg Institute of Socio-Economic Research

This short talk discusses and illustrates implementation of force-directed diagrams in Stata. Force-directed layouts use simple stochastic simulation algorithms to position nodes and edges in a two-way plot. They can be used in a range of data visualisation applications, such as network visualisation, or representation of clustering and relationships among observations in the data. We will discuss implementation, examine some examples and discuss pros and cons of using Stata for producing such displays.

f_able: Estimation of marginal effects for models with alternative variable transformations

Fernando Rios-Avila
Levy Economics Institute, Bard College, Annandale-On-Hudson, NY

The margins command is a powerful post-estimation tool that allows the estimation of marginal effects for official and community-contributed commands with well-defined predicted outcomes (see predict). Factor variable notation makes it easy to estimate marginal effects when interactions and polynomials are used, but doing so remains a challenge with other types of transformations, such as splines, logs, or fractional polynomials. This paper describes how the capabilities of margins can be extended to analyze other variable transformations using the command f_able.

Socioeconomic Factors influencing the Spatial Spread of COVID-19 in the United States

Kit Baum & Miguel Henry
Boston College, DIW Berlin & CESIS & Greylock McKinnon Associates

As the COVID-19 pandemic has progressed in the U.S., “hotspots” have been shifting geographically over time to suburban and rural counties showing a high prevalence of the disease.

We analyze daily U.S. county-level variations in COVID-19 confirmed case counts to evaluate the spatial dependence between neighboring counties.

We find strong evidence of county-level socioeconomic factors influencing the spatial spread. We show the potential of combining spatial econometric techniques and socioeconomic factors in assessing the spatial effects of COVID-19 among neighboring counties.

Correlated random effects methods for panel data models with heterogeneous time effects

Jeff Wooldridge
Michigan State University, East Lansing, MI

I propose a correlated random effects (CRE) approach to linear panel data models with heterogeneous time effects. The setting is microeconometric, where the number of time periods is small relative to the number of cross-sectional units. Given T time periods, T different sources of heterogeneity are allowed, and each is allowed to be correlated with time-constant features of the covariates. In the leading case, the CRE approach extends the Mundlak regression by allowing each heterogeneity term to be correlated with the time averages of the time-varying covariates. Additional flexibility is allowed by extracting unit-specific trends from the covariates and using those in the CRE approach. Estimation requires (many) linear regressions. For small T, the approach is an alternative to factor models, which require nonlinear estimation in addition to pre-testing to determine the number of factors. I show straightforward implementation of the new estimators in Stata.
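
As background for the leading case, the sketch below shows the basic Mundlak regression that the approach extends: pooled OLS of the outcome on the covariate and its unit-level time average. In a balanced panel the coefficient on the covariate reproduces the fixed-effects (within) estimate. The Python code uses simulated data and hypothetical helper names (solve, ols); it does not implement the heterogeneous time effects estimators proposed in the talk.

```python
import random


def solve(a, b):
    """Solve the linear system a * coef = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for r in reversed(range(n)):
        coef[r] = (m[r][n] - sum(m[r][c] * coef[c] for c in range(r + 1, n))) / m[r][r]
    return coef


def ols(rows, y):
    """OLS coefficients from the normal equations X'X b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Simulated balanced panel: unit effect a_i is correlated with the covariate
rng = random.Random(0)
n_units, n_periods = 200, 5
rows, y_all = [], []
within_num = within_den = 0.0
for _ in range(n_units):
    a_i = rng.gauss(0, 1)
    x_i = [a_i + rng.gauss(0, 1) for _ in range(n_periods)]
    y_i = [1.0 + 2.0 * v + a_i + rng.gauss(0, 1) for v in x_i]
    x_bar, y_bar = sum(x_i) / n_periods, sum(y_i) / n_periods
    for x_it, y_it in zip(x_i, y_i):
        rows.append([1.0, x_it, x_bar])      # constant, covariate, time average
        y_all.append(y_it)
        within_num += (x_it - x_bar) * (y_it - y_bar)
        within_den += (x_it - x_bar) ** 2

beta_mundlak = ols(rows, y_all)[1]    # coefficient on x in the Mundlak regression
beta_fe = within_num / within_den     # fixed-effects (within) estimator
```

The two estimates coincide up to rounding error, which is why adding time averages is a convenient route to fixed-effects-type estimates using only linear regressions.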

Timberlake Consultants