# xtheckman

## Highlights

• Random-effects panel-data modeling with endogenous selection
• Two-level multilevel models with endogenous selection
• Inference statistics
• Expected means and probabilities
• Marginal effects and contrasts
• Average structural functions (ASFs)
• More ...
• Conditional analysis—specify values of all covariates
• Test whether selection matters
• Population-averaged—values of specified covariates
• Inferences and plots over groups

Heckman selection models adjust for bias when some outcomes are missing not at random. Imagine modeling income. The problem is that income is observed only for those who work. Missingness is not random.

Stata fits Heckman selection models and, new in Stata 16, Stata can fit them with panel (two-level) data.

You want to fit the model

$y_{it} = x_{it}\beta + \alpha_{i} + \varepsilon_{it}$

where $y_{it}$ is sometimes missing. The equation that determines which $y_{it}$ are not missing is

$S_{it} = 1(z_{it}\gamma + v_{i} + u_{it} > 0)$

In these equations, $\alpha_{i}$, $\varepsilon_{it}$, $v_{i}$, and $u_{it}$ will not be estimated. Their correlations with each other, however, will be estimated along with $\beta$ and $\gamma$.

The above model can be fit even though income is not observed for everyone and even if a person's employment status changes over time.
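One way to see what the model has to account for is to take the expectation of the outcome equation over the observations that are actually selected. This is a sketch under the assumption that the four error terms have mean zero and are independent of $x_{it}$ and $z_{it}$:

$E(y_{it} \mid x_{it}, z_{it}, S_{it} = 1) = x_{it}\beta + E(\alpha_{i} + \varepsilon_{it} \mid z_{it}\gamma + v_{i} + u_{it} > 0)$

The last term is zero when $\alpha_{i}$ and $\varepsilon_{it}$ are independent of $v_{i}$ and $u_{it}$. When they are correlated, it is generally not zero and it varies with $z_{it}\gamma$, so it behaves like an omitted variable in a regression that uses only the observed outcomes.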

Why fit a selection model? Because it is possible that people who work and whose income is therefore observed systematically differ from those who do not, and those differences are for unobserved reasons.

For instance, if more productive people are the ones who work, their income will be higher than the income that those who do not work would have earned. Or, if the income of the less productive is lower, they might need to work more. Allowing for selection accommodates either of these alternatives, and others too. After estimation, we can test whether selection matters.
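Both stories can be illustrated with a small simulation of the two equations. The sketch below is written in Python rather than Stata, uses only the standard library, and rests on made-up settings: unit variances for all four error terms, $\gamma = 1$, a single standard normal $z_{it}$, and correlations of $\pm 0.8$ between $\varepsilon_{it}$ and $u_{it}$. The function names are hypothetical. It reports the average of $\alpha_{i} + \varepsilon_{it}$ among observations with $S_{it} = 1$, which would be close to zero if selection did not matter.

```python
import math
import random


def correlated_pair(rng, rho):
    """Draw two standard normal values whose correlation is rho."""
    g1 = rng.gauss(0.0, 1.0)
    g2 = rng.gauss(0.0, 1.0)
    return g1, rho * g1 + math.sqrt(1.0 - rho * rho) * g2


def mean_error_among_selected(rho_re, rho_err, n_ids=1000, n_periods=8, seed=1):
    """Simulate the selection equation and return the average of
    alpha_i + eps_it over the observations with S_it = 1.

    Unit variances and gamma = 1 are illustrative assumptions.
    """
    rng = random.Random(seed)
    gamma = 1.0
    total, n_selected = 0.0, 0
    for _ in range(n_ids):
        alpha, v = correlated_pair(rng, rho_re)      # individual-level effects
        for _ in range(n_periods):
            eps, u = correlated_pair(rng, rho_err)   # observation-level errors
            z = rng.gauss(0.0, 1.0)
            if z * gamma + v + u > 0:                # S_it = 1: outcome is observed
                total += alpha + eps
                n_selected += 1
    return total / n_selected


no_selection = mean_error_among_selected(rho_re=0.0, rho_err=0.0)
negative = mean_error_among_selected(rho_re=0.0, rho_err=-0.8)
positive = mean_error_among_selected(rho_re=0.0, rho_err=0.8)

print(f"uncorrelated errors:  {no_selection:+.3f}")
print(f"negative correlation: {negative:+.3f}")
print(f"positive correlation: {positive:+.3f}")
```

With a positive correlation, the selected observations have above-average unobservables; with a negative correlation, below-average ones. In both cases the outcomes we observe are not representative, which is the situation a selection model is meant to handle.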

## Let's see it work

We have fictional data on 1,000 individuals observed each year from 2011 to 2018, for 8,000 observations in all. Among the variables is income, which is observed only for those who work. We worry that unobservables that affect both working and income might lead to biased results.

To fit the selection model, we must model both income and the probability of working. We model the probability of working as a function of experience, age, region of the country, and whether the person has college or technical college training.

We fit the model

. xtheckman income c.age##c.age i.training#(c.exp##c.exp), select(working = age exp i.region i.training)


If you are new to Stata, c.age##c.age means to include age and age squared in the model. The "c." prefix marks a variable as continuous. The "i." prefix in i.training and i.region marks a variable as categorical and indicates that indicators for its categories are to be included in the model.
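For readers who think in terms of explicit regressors, the following Python sketch lists the terms that the income specification `c.age##c.age i.training#(c.exp##c.exp)` expands to for one observation. It assumes training is coded 0/1. The function `income_terms` is hypothetical, and the labels only imitate Stata's naming; they are not produced by Stata.

```python
def income_terms(age, exp, training):
    """Regressors implied by c.age##c.age i.training#(c.exp##c.exp).

    training is assumed to be coded 0/1; labels are illustrative.
    """
    terms = {
        "age": age,                  # main effect of age
        "c.age#c.age": age * age,    # age squared
    }
    for level in (0, 1):
        in_level = 1 if training == level else 0
        # Separate linear and quadratic experience terms per training level.
        terms[f"{level}.training#c.exp"] = in_level * exp
        terms[f"{level}.training#c.exp#c.exp"] = in_level * exp * exp
    terms["_cons"] = 1               # constant
    return terms


for name, value in income_terms(age=40, exp=10, training=1).items():
    print(f"{name:<24} {value}")
```

Because `#` is used between training and the experience terms rather than `##`, the specification has no main effect of training: each training level gets its own linear and quadratic experience terms, for six slope terms in all.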

The results are

Random-effects regression with selection        Number of obs     =      8,000
                                                      Selected    =      7,235
                                                      Nonselected =        765

Group variable: id                              Number of groups  =      1,000

                                                Obs. per group:
                                                              min =          8
                                                              avg =        8.0
                                                              max =          8

Integration method: mvaghermite                 Integration pts.  =          7

                                                Wald chi2(6)      =   13011.86
Log likelihood = -28748.805                     Prob > chi2       =     0.0000

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
wage         |
         age |   .0841345     .05193     1.62   0.105    -.0176465    .1859154
             |
 c.age#c.age |  -.0006552   .0006167    -1.06   0.288    -.0018638    .0005534
             |
    training#|
       c.exp |
          0  |   .3000872   .0122928    24.41   0.000     .2759939    .3241806
          1  |  -.0994611    .014134    -7.04   0.000    -.1271632   -.0717591
             |
       _cons |   7.744222   1.223347     6.33   0.000     5.346506    10.14194
-------------+----------------------------------------------------------------
working      |
         age |   .0083258    .001507     5.52   0.000      .005372    .0112795
         exp |    .069833    .007423     9.41   0.000     .0552843    .0843818
             |
      region |
          2  |   .1683876   .0653623     2.58   0.010     .0402798    .2964953
          3  |   .0286791   .0630488     0.45   0.649    -.0948944    .1522525
          4  |   .0476718   .0639092     0.75   0.456    -.0775879    .1729315
          5  |   .0054477   .0621657     0.09   0.930    -.1163948    .1272901
             |
  1.training |   .8223611   .0596781    13.78   0.000     .7053942     .939328
       _cons |   .3367662   .0905615     3.72   0.000     .1592688    .5142635
-------------+----------------------------------------------------------------
  var(e.wage)|   81.07829   2.142513                      76.98594    85.38819
corr(e.wor~g,|
      e.wage)|  -.5812249   .0615897    -9.44   0.000    -.6892935    -.447854
-------------+----------------------------------------------------------------
var(wage[id])|   19.93373   1.405619                      17.36067    22.88815
         var(|
 working[id])|   .0989163   .0262816                      .0587635   .16650525
-------------+----------------------------------------------------------------
        corr(|
 working[id],|
    wage[id])|    .258161   .1038653     2.49   0.013      .045996    .4480403
------------------------------------------------------------------------------



The first panel in the results, labeled wage, reports the income equation.

The second panel reports the working (selection) equation.

After that are reported three variances and two correlations. The correlations are of interest.



correlation                      estimate     SE

corr(e.working, e.income)           -0.58   0.06
corr(working[id], income[id])        0.26   0.10



The first correlation is the correlation of the residuals in the income and working (selection) equations, that is, the correlation of $\varepsilon_{it}$ and $u_{it}$.

The second is the correlation of the random effects, the unobservables that do not change over time, that is, the correlation of $\alpha_{i}$ and $v_{i}$.

Selection is an issue if either of these correlations is significantly different from zero. Both are.
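The z statistics and p-values for the two correlations can be reproduced from the estimates and standard errors, assuming the usual Wald statistic (estimate divided by standard error) with a two-sided normal p-value. The Python sketch below is only an arithmetic check. The function name `wald_z_and_p` is hypothetical, and the numbers are copied from the output above.

```python
import math


def wald_z_and_p(estimate, std_err):
    """z statistic for H0: parameter = 0 and its two-sided normal p-value."""
    z = estimate / std_err
    p = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p


# Estimates and standard errors as reported in the estimation output.
reported = {
    "corr(e.working, e.income)": (-0.5812249, 0.0615897),
    "corr(working[id], income[id])": (0.258161, 0.1038653),
}

results = {name: wald_z_and_p(est, se) for name, (est, se) in reported.items()}
for name, (z, p) in results.items():
    print(f"{name:<32} z = {z:6.2f}   p = {p:.3f}")
```

Rounded as in the output, this gives z = -9.44 and z = 2.49, with p-values of 0.000 and 0.013, so each correlation is significant at the 5% level.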