Explore the New Features in Stata 18

Stata 18 is now available! Join us for a complimentary online session where we will review and demonstrate some key new features now available within Stata. You will gain hands-on, practical insight into:

  • Heterogeneous DID
  • Lasso for Cox model
  • Robust Inference for Linear Models
  • IV Quantile Regression

This is a great opportunity to enhance your skills and stay up-to-date with the latest developments in Stata - statistics software for data science.

Customizable tables in Stata with Chuck Huber

Learn how to implement customizable tables in Stata with Chuck Huber (director of statistical outreach at StataCorp). Throughout this helpful guide, Chuck expands on the functionality of the table command.

The table command is flexible for creating tables of many types—tabulations, tables of summary statistics, tables of regression results, and more. table can calculate summary statistics to display in the table. table can also include results from other Stata commands.

This guide consists of seven parts, from introducing the command to using custom styles and labels.

Index

  1. Customizable tables in Stata, part 1: The new table command
  2. Customizable tables in Stata, part 2: The new collect command
  3. Customizable tables in Stata, part 3: The classic table 1
  4. Customizable tables in Stata, part 4: Table of statistical tests
  5. Customizable tables in Stata, part 5: Tables for one regression model
  6. Customizable tables in Stata, part 6: Tables for multiple regression models
  7. Customizable tables in Stata, part 7: Saving and using custom styles and labels

Part 1: The new table command

Today, I’m going to begin a series of blog posts about customizable tables in Stata. We expanded the functionality of the table command. We also developed an entirely new system that allows you to collect results from any Stata command, create custom table layouts and styles, save and use those layouts and styles, and export your tables to the most popular document formats. We even added a new manual to show you how to use this powerful and flexible system.

I want to show you a few examples before we get into the details of creating your own customizable tables. I’ll show you how to re-create these examples in future posts.

[Image: example of a customized table]

Table of statistical test results

Sometimes, we wish to report a formal hypothesis test for a group of variables. The table below reports the means of a group of continuous variables for participants with and without hypertension, the difference between the means, and the p-value from a t test.

[Image: table of statistical test results]

Table for multiple regression models

We may also wish to create a table to compare the results of several regression models. The table below displays the odds ratios and standard errors for the covariates of three logistic regression models along with the AIC and BIC for each model.

[Image: table comparing multiple regression models]

Table for a single regression model

We may also wish to display the results of our final regression model. The table below displays the odds ratio, standard error, z score, p-value, and 95% confidence interval for each covariate in our final model.

[Image: table for a single regression model]

You may prefer a different layout for your tables, and that is the point of this series of blog posts. My goal is to show you how to create your own customized tables and import them into your documents.

The data

Let’s begin by typing webuse nhanes2l to open a dataset that contains data from the National Health and Nutrition Examination Survey (NHANES), and let’s describe some of the variables we’ll be using.

. webuse nhanes2l
(Second National Health and Nutrition Examination Survey)

. describe age sex race height weight bmi highbp        
>          bpsystol bpdiast tcresult tgresult hdresult

Variable      Storage   Display    Value
    name         type    format    label      Variable label
------------------------------------------------------------------------------------------------------------------------
age             byte    %9.0g                 Age (years)
sex             byte    %9.0g      sex        Sex
race            byte    %9.0g      race       Race
height          float   %9.0g                 Height (cm)
weight          float   %9.0g                 Weight (kg)
bmi             float   %9.0g                 Body mass index (BMI)
highbp          byte    %8.0g               * High blood pressure
bpsystol        int     %9.0g                 Systolic blood pressure
bpdiast         int     %9.0g                 Diastolic blood pressure
tcresult        int     %9.0g                 Serum cholesterol (mg/dL)
tgresult        int     %9.0g                 Serum triglycerides (mg/dL)
hdresult        int     %9.0g                 High density lipids (mg/dL)

This dataset contains demographic, anthropometric, and biological measures for participants in the United States. We will ignore the survey weights for now so that we can focus on the syntax for creating tables.

Introduction to the table command

The basic syntax of table is table (RowVars) (ColVars). The example below creates a table for the row variable highbp.

. table (highbp) ()

--------------------------------
                    |  Frequency
--------------------+-----------
High blood pressure |
  0                 |      5,975
  1                 |      4,376
  Total             |     10,351
--------------------------------

By default, the table displays the frequency for each category of highbp and the total frequency. The second set of empty parentheses in this example is not necessary because there is no column variable.

The example below creates a table for the column variable highbp. The first set of empty parentheses is necessary in this example so that table knows that highbp is a column variable.

. table () (highbp)

------------------------------------
          |    High blood pressure
          |       0       1    Total
----------+-------------------------
Frequency |   5,975   4,376   10,351
------------------------------------

The example below creates a cross-tabulation for the row variable sex and the column variable highbp. The row and column totals are included by default.

. table (sex) (highbp)

-----------------------------------
         |    High blood pressure
         |       0       1    Total
---------+-------------------------
Sex      |
  Male   |   2,611   2,304    4,915
  Female |   3,364   2,072    5,436
  Total  |   5,975   4,376   10,351
-----------------------------------

We can remove the row and column totals by including the nototals option.

. table (sex) (highbp), nototals

---------------------------------
         |   High blood pressure
         |          0           1
---------+-----------------------
Sex      |
  Male   |      2,611       2,304
  Female |      3,364       2,072
---------------------------------

We can also specify multiple row or column variables, or both. The example below displays frequencies for categories of sex nested within categories of highbp.

. table (highbp sex) (), nototals

--------------------------------
                    |  Frequency
--------------------+-----------
High blood pressure |
  0                 |
    Sex             |
      Male          |      2,611
      Female        |      3,364
  1                 |
    Sex             |
      Male          |      2,304
      Female        |      2,072
--------------------------------

Or we can display frequencies for categories of highbp nested within categories of sex as in the example below. The order of the variables in the parentheses determines the nesting structure in the table.

. table (sex highbp) (), nototals

------------------------------------
                        |  Frequency
------------------------+-----------
Sex                     |
  Male                  |
    High blood pressure |
      0                 |      2,611
      1                 |      2,304
  Female                |
    High blood pressure |
      0                 |      3,364
      1                 |      2,072
------------------------------------

We can specify similar nesting structures for multiple column variables. The example below displays frequencies for categories of sex nested within categories of highbp.

. table () (highbp sex), nototals

--------------------------------------------
          |        High blood pressure
          |         0                1
          |        Sex              Sex
          |   Male   Female    Male   Female
----------+---------------------------------
Frequency |  2,611    3,364   2,304    2,072
--------------------------------------------

Or we can display frequencies for categories of highbp nested within categories of sex as in the example below. Again, the order of the variables in the parentheses determines the nesting structure in the table.

. table () (sex highbp), nototals

----------------------------------------------------------
          |                       Sex
          |           Male                   Female
          |   High blood pressure     High blood pressure
          |          0           1           0           1
----------+-----------------------------------------------
Frequency |      2,611       2,304       3,364       2,072
----------------------------------------------------------

You can even specify three or more row or column variables. The example below displays frequencies for categories of diabetes nested within categories of sex nested within categories of highbp.

. table (highbp sex diabetes) (), nototals

------------------------------------
                        |  Frequency
------------------------+-----------
High blood pressure     |
  0                     |
    Sex                 |
      Male              |
        Diabetes status |
          Not diabetic  |      2,533
          Diabetic      |         78
      Female            |
        Diabetes status |
          Not diabetic  |      3,262
          Diabetic      |        100
  1                     |
    Sex                 |
      Male              |
        Diabetes status |
          Not diabetic  |      2,165
          Diabetic      |        139
      Female            |
        Diabetes status |
          Not diabetic  |      1,890
          Diabetic      |        182
------------------------------------

The totals() option

We can include totals for a particular row or column variable by including the variable name in the totals() option. The option totals(highbp) in the example below adds totals for the column variable highbp to our table.

. table (sex) (highbp), totals(highbp)

---------------------------------
         |   High blood pressure
         |          0           1
---------+-----------------------
Sex      |
  Male   |      2,611       2,304
  Female |      3,364       2,072
  Total  |      5,975       4,376
---------------------------------

The option totals(sex) in the example below adds totals for the row variable sex to our table.

. table (sex) (highbp), totals(sex)

-----------------------------------
         |    High blood pressure
         |       0        1   Total
---------+-------------------------
Sex      |
  Male   |   2,611    2,304   4,915
  Female |   3,364    2,072   5,436
-----------------------------------

We can also request totals for a particular variable even when there are multiple row or column variables. The example below displays totals for the row variable highbp, even though there are two row variables in the table.

. table (sex highbp) (), totals(highbp)

------------------------------------
                        |  Frequency
------------------------+-----------
Sex                     |
  Male                  |
    High blood pressure |
      0                 |      2,611
      1                 |      2,304
  Female                |
    High blood pressure |
      0                 |      3,364
      1                 |      2,072
  Total                 |
    High blood pressure |
      0                 |      5,975
      1                 |      4,376
------------------------------------

The statistic() option

Frequencies are displayed by default, but you can specify other statistics with the statistic() option. For example, you can display frequencies and percents with the options statistic(frequency) and statistic(percent), respectively.

. table (sex) (highbp),       
>       statistic(frequency)  
>       statistic(percent)    
>       nototals

--------------------------------------
              |   High blood pressure
              |          0           1
--------------+-----------------------
Sex           |
  Male        |
    Frequency |      2,611       2,304
    Percent   |      25.22       22.26
  Female      |
    Frequency |      3,364       2,072
    Percent   |      32.50       20.02
--------------------------------------

We can also include the mean and standard deviation of age with the options statistic(mean age) and statistic(sd age), respectively.

. table (sex) (highbp),           
>       statistic(frequency)      
>       statistic(percent)        
>       statistic(mean age)       
>       statistic(sd age)         
>       nototals

-----------------------------------------------
                       |   High blood pressure
                       |          0           1
-----------------------+-----------------------
Sex                    |
  Male                 |
    Frequency          |      2,611       2,304
    Percent            |      25.22       22.26
    Mean               |    42.8625    52.59288
    Standard deviation |    16.9688    15.88326
  Female               |
    Frequency          |      3,364       2,072
    Percent            |      32.50       20.02
    Mean               |   41.62366    57.61921
    Standard deviation |   16.59921    13.25577
-----------------------------------------------

You can view a complete list of statistics for the statistic() option in the Stata manual.

The nformat() and sformat() options

We can use the nformat() option to specify the numerical display format for statistics in our table. In the example below, the option nformat(%9.0fc frequency) displays frequency with commas in the thousands place and no digits to the right of the decimal. The option nformat(%6.2f mean sd) displays the mean and standard deviation with two digits to the right of the decimal.

. table (sex) (highbp),           
>       statistic(frequency)      
>       statistic(percent)        
>       statistic(mean age)       
>       statistic(sd age)         
>       nototals                  
>       nformat(%9.0fc frequency) 
>       nformat(%6.2f  mean sd)

-----------------------------------------------
                       |   High blood pressure
                       |          0           1
-----------------------+-----------------------
Sex                    |
  Male                 |
    Frequency          |      2,611       2,304
    Percent            |      25.22       22.26
    Mean               |      42.86       52.59
    Standard deviation |      16.97       15.88
  Female               |
    Frequency          |      3,364       2,072
    Percent            |      32.50       20.02
    Mean               |      41.62       57.62
    Standard deviation |      16.60       13.26
-----------------------------------------------

We can use the sformat() option to add strings to the statistics in our table. In the example below, the option sformat("%s%%" percent) adds "%" to the percent statistic, and the option sformat("(%s)" sd) places parentheses around the standard deviation.

. table (sex) (highbp),           
>       statistic(frequency)      
>       statistic(percent)        
>       statistic(mean age)       
>       statistic(sd age)         
>       nototals                  
>       nformat(%9.0fc frequency) 
>       nformat(%6.2f  mean sd)   
>       sformat("%s%%" percent)   
>       sformat("(%s)" sd)

-----------------------------------------------
                       |   High blood pressure
                       |          0           1
-----------------------+-----------------------
Sex                    |
  Male                 |
    Frequency          |      2,611       2,304
    Percent            |     25.22%      22.26%
    Mean               |      42.86       52.59
    Standard deviation |    (16.97)     (15.88)
  Female               |
    Frequency          |      3,364       2,072
    Percent            |     32.50%      20.02%
    Mean               |      41.62       57.62
    Standard deviation |    (16.60)     (13.26)
-----------------------------------------------

The style() option

We can use the style() option to apply a predefined style to a table. In the example below, the option style(table-1) applies Stata’s predefined style table-1 to our table. This style changes the appearance of the row labels. You can view a complete list of Stata’s predefined styles in the manual, and I will show you how to create your own styles in a future blog post.

. table (sex) (highbp),           
>       statistic(frequency)      
>       statistic(percent)        
>       statistic(mean age)       
>       statistic(sd age)         
>       nototals                  
>       nformat(%9.0fc frequency) 
>       nformat(%6.2f  mean sd)   
>       sformat("%s%%" percent)   
>       sformat("(%s)" sd)        
>       style(table-1)

---------------------------------
         |   High blood pressure
         |          0           1
---------+-----------------------
     Sex |
  Male   |      2,611       2,304
         |     25.22%      22.26%
         |      42.86       52.59
         |    (16.97)     (15.88)
         |
Female   |      3,364       2,072
         |     32.50%      20.02%
         |      41.62       57.62
         |    (16.60)     (13.26)
---------------------------------

Conclusion

We learned a lot about the new-and-improved table command, but we have barely scratched the surface. We have learned how to create tables and use the nototals, totals(), statistic(), nformat(), sformat(), and style() options. There are many other options, and you can read about them in the manual. I’ll show you how to use collect to customize the appearance of your tables in my next post.
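In the meantime, as a small taste of the collect system, here is a minimal sketch (it assumes the table above is still in memory and uses the illustrative file name mytable.docx): table stores its results in a collection, so the finished table can be previewed and exported to Word without re-running anything.

. collect preview
. collect export mytable.docx, replace

The file extension tells collect export which format to produce; .html, .pdf, .xlsx, and .tex work in the same way.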

How to use the Stata Table Command


This blog post is written by Dr George Naufal, an assistant research scientist at the Public Policy Research Institute (PPRI) at Texas A&M University and a research fellow at the IZA Institute of Labor Economics.


Tables are in every report, article, and book chapter. Stata offers a way to create and customize tables using the commands table and collect.

We use the NLS data, based on the National Longitudinal Survey of Young Women who were 14-26 years of age in 1968. This is one of the many datasets used for training purposes. We first edit the labels of the variable collgrad:


label variable collgrad "College Graduate"

label define collgrad_label 0 "No" 1 "Yes"

label values collgrad collgrad_label


Then we use the table command to create a simple table of frequencies by college graduation status.


table ( ) (collgrad) (), statistic(frequency)

The three sets of parentheses before the comma can be left empty, contain variable names, or include a specific keyword; they define the row, column, and table dimensions of the output, respectively.
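For example (a minimal sketch; it assumes the data also contain a race variable, as the nlsw88 training dataset does), placing a variable in the third set of parentheses produces a separate table for each of its categories:

table (collgrad) () (race), statistic(frequency)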

If you would like to show the percent instead of frequencies, the code is:


table ( ) (collgrad) (), statistic(percent)

Now say you would like the percent statistic to appear as a column rather than a row; the code becomes:


table (collgrad) (), statistic(percent)


Now, we would like to add a lot more detail to the table: the frequency and percentage of college graduates, plus the mean and standard deviation of age and experience by college graduation status. The code becomes (including formatting options to show the mean and standard deviation to one decimal place):


table ( var ) ( collgrad ) (), statistic(frequency) statistic(percent)   ///
      statistic(mean age exper) statistic(sd age exper)                  ///
      nformat(%9.1f percent) sformat("%s%%" percent)                     ///
      nformat(%9.1f mean) nformat(%9.1f sd)


We can edit the labels of age and exper to make the table cleaner and run the code again.


label variable age "Age (years)"

label variable exper "Total Work Experience (year)"


Say you would like to show the mean and standard deviation in separate sections (rather than in sections for each variable); simply remove var from the first set of parentheses:


table () ( collgrad ) (), statistic(frequency) statistic(percent) statistic(mean  age exper) statistic(sd  age exper) nformat(%9.1f percent) sformat("%s%%" percent) nformat(%9.1f  mean) nformat(%9.1f  sd)


You can export the table to a variety of formats (Word, PDF, HTML, etc.):


collect export "…\table1.docx", as(docx) replace


One neat thing is that you can also build the table above using putdocx; adding the dofile() option to the export command gives you the full code to create the same table with putdocx:


collect export "…\table1.docx", as(docx) dofile("…\commands.do", replace) replace


Finally, you can do everything above through point and click (including changing the borders, font, etc.) by navigating to:

Menu

Statistics > Summaries, tables, and tests > Tables of frequencies, summaries, and command results

                                       College Graduate
                                    No        Yes      Total
-------------------------------------------------------------
Frequency                       10,996      2,552     13,548
Percent                          81.2%      18.8%     100.0%
Age (years)
  Mean                            29.8       32.0       30.2
  Standard deviation               6.5        5.8        6.4
Total work experience
  Mean                             6.5        8.0        6.8
  Standard deviation               4.3        4.5        4.4
Current grade completed
  Mean                            11.8       16.5       12.7
  Standard deviation               1.6        1.0        2.4

For an example of how to use the table editor, check out this video.

You can download all of the Stata code used throughout this blog post via this link: Stata Table Command.

Statistics is the New Machine Learning

Machine Learning (ML) is a rapidly developing field. There is enormous competition to be the inventor of a new method that gets widely adopted, and investors stand to make a fortune from a startup that implements a successful new analytical tool. I find it fascinating to watch the emerging hot topics. Recently, I noticed a trend. Hotshot VCs, take note! (But don't get too excited yet.)


First, let's look at some of the latest must-have techniques.

Time series

Me: Hello, old friend. What are you doing here?

Time series: Well, it turned out that not everything in life is a flat file of independent observations, so the ML people called me in.

Me: I thought you'd retired.

Time series: Ha ha ha! No.

H2O.ai, arguably the market leader in automated machine learning (autoML), have this to say right on the front page of their website:

Award-winning Automatic Machine Learning (AutoML) technology to solve the most challenging problems, including Time Series and Natural Language Processing.

Clearly, they think time series is one of the big selling points, or they wouldn't have put it there*. It even bumped the mighty Deep Learning off the Top Two. And perhaps the reason is that those highly flexible and non-linear modelling tools, like XGBoost, take no account of dependency in the data. Looking online for (for example) "XGBoost for market tick data", you'll find plenty of people who still believe you can push any data into a sufficiently flexible algorithm and get a useful model out. Elsewhere, you can find confusion between analysis of streaming data, where sliding windows of time define batches of data, and true time series models. I would argue that, to model data that are grounded in time and/or space, you should also leverage domain knowledge: that's a phrase we'll be hearing again soon.

Vendors working on the data storage and provisioning side have also been quick to offer a range of time-series-aware database products that scale to Big Data. Just search for "scalable time series database" and you'll find plenty. Five years ago it was a concept that you could shoehorn into a more generic type of database, but not a product in itself that could attract investment and get ML pulses racing.

It's hard not to chuckle as a statistician, seeing the ML folk discover autocorrelation 100 years on, but more constructively, this can help us predict where the next ML boom might come. Dependency does not just arise from time-located data, but also spatial data, networks (social or otherwise), heterogeneous data sources, and hierarchical structures. There is a lot of investment in geographical information systems and networks in the machine learning, AI and Big Data world, so it would be reasonable to see this appearing in the near future.

* - that's assuming they didn't tailor the front page for maximum appeal, having already categorised me under Statistics (I jest)

Uncertainty

For a while, ML algorithms became enormously popular on the strength of their predictions (point estimates). Then, unease set in about the lack of uncertainty in the outputs. Suppose you have to produce a categorical prediction. The algorithm returns the category with the best objective function, and that's the only information you have. Some decisions truly are binary (buy gold, sell zinc), while others are fuzzier, but even the binary ones can often be hedged.

In their book Computer Age Statistical Inference, Brad Efron and Trevor Hastie strike a conciliatory tone between statistics and ML, suggesting that most algorithms for data analysis start out with point estimates, and later, uncertainty is added. The EM algorithm is an example.

It was perhaps the rise of self-driving cars that made the absence of uncertainty too uncomfortable. Decisions do not always have to be taken by the AI; they can be referred to the human driver (let's hope they haven't fallen asleep), but that requires a measure of uncertainty. This problem remains imperfectly solved, despite massive investment, and is likely to provide interesting job opportunities for some time to come.

Now, we can see efforts to add measures of uncertainty into ML parameter estimates and predictions. Often, those look a lot like 95% confidence or credible intervals, and there is a big advantage to adding in domain knowledge. Leading software TensorFlow Probability has this to say:

"The TensorFlow team built TFP for data scientists, statisticians, and ML researchers and practitioners who want to encode domain knowledge to understand data and make predictions."

Without statistical approaches (i.e. mathematical models of the uncertainty, based on probability), adding uncertainty to ML requires re-fitting laborious models many times or adopting some uncomfortable approximation. Any alternative technique that provides a quantified uncertainty output alongside a point estimate, quickly and with well-established long-run properties, is going to be popular with bosses everywhere. Sounds like statistics?

Gaussian processes

These are a class of supervised learning models that use probability but are highly flexible. Taking their starting point from signal processing ideas of the mid-20th century, they envisage the variation in the data as arising from a random process. Between observations, the process moves according to some probability distribution, and Gaussian processes use the normal, or Gaussian, distribution for this, just like Brownian motion. There are also Dirichlet processes for categorical variables.

This necessitates an interesting inversion in conceiving of the estimation and inference problem. Instead of a small number of parameters deterministically leading to a large number of predictions, the data and predictions together are recast as a vector (or higher-dimensional tensor) of values that are subject to the parameters of the underlying process.
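In symbols (a standard textbook formulation, not tied to any particular package), a Gaussian process prior says that any finite collection of function values is jointly normal, with the kernel supplying the covariances:

f \sim \mathcal{GP}\big(m(\cdot),\, k(\cdot,\cdot)\big), \qquad \big(f(x_1), \ldots, f(x_n)\big)^{\top} \sim \mathcal{N}(\mathbf{m}, K), \qquad K_{ij} = k(x_i, x_j).

A prediction at a new point x_* is then just the conditional normal distribution of f(x_*) given the observed values, which is exactly the inversion described above: the whole vector of data and predictions is modelled jointly rather than being generated deterministically from a handful of parameters.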

The flexibility of Gaussian processes has seen them widely applied to time series and spatio-temporal problems in recent years. For autocorrelation over more than one variable, they are faster and far more robust than the conditionally autoregressive (CAR) models of yesteryear. However, this flexibility comes at a cost and requires considerable expertise, which remains in extremely short supply. Gaussian processes are mentioned in many data science and ML courses, but usually only at a high level. Their successful implementation requires knowing how to choose among the esoteric multi-dimensional priors they require, and how to tailor these to new settings. Statistical skills, in short!

Bayesian tuning parameter optimisation

There is no free lunch, in statistics as in ML. If you want to input x1, x2, x3, etc and get a prediction of y, you will need to constrain the space of possible models so that there are a number of parameters to estimate. Non-parametric models are sometimes framed as infinite-dimensional, but the essence is the same: choose your method and let the computer find the unknown parameters that can be combined to give an output. This applies to supervised and unsupervised learning equally, because we haven't said anything about the output y being in the dataset along with the x's.

Arising from computer science, many ML methods do not explicitly state a probabilistic model with parameters, but instead try to efficiently and flexibly find a way of converting x's into y. To get that right, there are tuning parameters that need to be set. (Often in ML, they are called hyperparameters, but this has a different meaning in Bayesian statistics, so I will try to avoid ambiguity.) This is generally done by human-in-the-loop experimentation, and more complex (hence flexible) procedures like neural networks can involve many such tuning parameters, which interact with one another. This gives rise to a notion of tuning the algorithm as art rather than science, but recently, there has been much interest in making the tuning part of the process more robust. You can see the problem, for the boss: the entire ML procedure that they have staked their reputation on depends on one geek who knows which buttons to push. That sort of situation keeps bosses awake at night.

Bayesian methods are the buzzword in tuning parameter optimisation. This holds out the promise of even including uncertainty about the tuning parameters in the final inference, although that is not what most ML people mean by Bayesian optimisation. Bayesian sampling algorithms are all about efficiently exploring parameter space to return the distribution of parameter values that fit the data. You can also apply this to explore tuning parameter space, to return the tuning parameters that lead to the best fit. The main difference is that there are only a few "observations" of tuning parameter values that have been tested; it is not a Big Data setting. In fact, it is generally our new friend Gaussian processes that are used as the tuning parameter model.
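Concretely (a standard formulation of Bayesian optimisation, not specific to any toolkit): treat the validation loss L(\lambda) as an unknown function of the tuning parameters \lambda, fit a Gaussian-process surrogate to the handful of evaluations made so far, and choose the next \lambda to try by maximising an acquisition function such as expected improvement,

\mathrm{EI}(\lambda) \;=\; \mathbb{E}\!\left[\max\{0,\; L_{\min} - L(\lambda)\} \,\middle|\, (\lambda_1, L(\lambda_1)), \ldots, (\lambda_t, L(\lambda_t))\right],

where L_{\min} is the best loss observed so far and the expectation is taken under the surrogate's predictive distribution. That is where the uncertainty quantification earns its keep.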

The ability to automate another part of the ML pipeline means ML engineer jobs being lost, but as always, there is a need for new skills, and those happen to be Bayesian sampling algorithms, probability, uncertainty and modelling small numbers of observations. If only there were a name for this new discipline...

Bayesian updating and windowing

Big Data might have passed the peak on the hype curve, but it remains alluring to many organisations and essential in some settings. Streaming data is a special case, where new data arrive too fast to be included by traditional means. The Covid-19 pandemic has thrown this into focus for many analysts. Epidemiological data, supply chains, and humanitarian aid have all had to be analysed urgently in a rapidly changing problem as data emerge.

The ability to update models rapidly as new data arrive will remain important for some time. That means analysing a new batch of data without having to re-run calculations on old data. In the streaming data paradigm, we might add the new data to a sliding window of time, and remove the oldest data from the window.

Here, Bayesian methods can help, by breaking down the posterior density of the parameters, for which we want estimates and uncertainty, into the product of the prior distribution and a sequence of likelihood functions for each batch.
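In symbols, with batches D_1, \ldots, D_B assumed conditionally independent given the parameters \theta,

p(\theta \mid D_1, \ldots, D_B) \;\propto\; p(\theta) \prod_{b=1}^{B} p(D_b \mid \theta) \;\propto\; p(\theta \mid D_1, \ldots, D_{B-1})\, p(D_B \mid \theta),

so each new batch contributes only its own likelihood, with the posterior from the previous batches playing the role of the prior.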

Although this is alluded to in almost every introductory Bayesian class, it is rarely done because of the difficulty of defining a joint distribution for all unknowns together, and defining its hyperparameters from the posterior sample of the previous batch. Also, the success of the method is contingent on choosing a "right" form of joint distribution, and if you change your mind about it, you must return to the beginning again.

So, in practice, non-parametric updating methods are needed. These do not define a mathematical formula for the prior, but instead use the previous batch's posterior sample to estimate densities and gradients at any new values. This is a topic I have been actively involved in, and which shows promise for Big Data and streaming data settings. However, care has to be taken over the tuning parameters (those little devils again!) and data drift, requiring intensive expert input. It will continue to be a hot topic too, as the volume and velocity of data are only set to grow further.

Explainable AI

Parents everywhere warn their kids that some rowdy game is "fun until someone loses an eye". I have probably said it myself but was too tired at the time to remember. Fitting some complex predictive model to your data, launching it to the public and calling it "AI" is like that too. You're heading for promotion until it gets your employers bad PR because it is inadvertently biased against some group of people. There are many stories of this, and they are often painful to read.

The boss wants an assurance that your AI is "transparent" or "explainable". In other words, they want some best-practice procedure in place that they can cite should the worst happen. And as the boss is the one signing off the software purchase, you can be sure that ML vendors are climbing over each other to add explainability.

There are many ways to do this, which often involve either fitting simpler, interpretable models to the predictions in the locality of the point of interest, or comparing partly matched observations from the dataset. As an output, you can obtain measures of the importance of different "features" or input variables, and an idea of what predictions would be obtained by perturbing the data. Simpler models, matching, variable importance measures, goodness-of-fit stats, smaller data... I suspect you can see where this is going.

Conclusion

Statistics and ML have often been presented as disparate skill sets, with adherents at loggerheads over which is best. This sort of story-telling gets clicks, but is a gross misrepresentation. Statistics and machine learning are two names for the same thing; they represent two paths by which people have arrived at data analysis: from mathematics or from computer science.

There are some cultural differences, just as Leo Breiman observed in his 2001 paper "Statistical Modeling: The Two Cultures", but there is increasing cross-fertilisation of ideas and influences as time goes by. As there is a premium for both groups of people in finding new, effective methods for new problems, we might reasonably expect further convergence.

Once, it was a time of Venn diagrams. Stats and computer science were represented as essential components of data science (whatever that is), along with domain knowledge. There were many variants of these diagrams, increasingly complex and eagerly shared or stolen by bloggers everywhere. A big mistake was introduced when the intersection of the three components was changed from data science to data scientist. These diagrams underwent their own hype curve, inspiring and then disappointing bosses everywhere when they failed to recruit the Unicorn at the centre. But maybe we should think again about the intersection of these three influences. ML has been through a boom time, so to make the next set of advances, wouldn't we expect to see stats and domain knowledge catching up in innovation and in the job market?

This article was written by medical statistician and trainer Robert Grant.

Visit Robert Grant's website.

DISSERTATION SOS: Refresh the Basics of Data Science Software and Make Your Research Shine

Top tips from an academic and expert in the field, Dr. Malvina Marchese.

What is the secret to a great dissertation?

Writing an excellent dissertation requires a mixture of different skills. As an academic, I have supervised many MSc dissertations, and over the years I have found that these are the five things you need to deliver a first-class dissertation:

  1. Pin Down the Research Question. Be very clear about what you are investigating, so the reader knows what your final goal is from the outset and can concentrate on understanding your arguments.
  2. Write a Captivating Introduction. You want to strongly demonstrate the importance of your analysis and highlight how your results contribute to the current literature. So ask yourself: do I really manage to show that my hypothesis is supported by the data that follow?
  3. Make Your Data Speak. This is a top skill to show. Master an econometric software package and make sure that you present informative descriptive statistics and tables, hypothesis tests, and graphs of your data. Let the reader see the core of your information at a glance! Stata and EViews are great for this, with many options for running preliminary tests and preparing great tables of your data, as well as many built-in datasets to help construct your database.
  4. Master the Econometrics. Linear regression? ARIMA? Panel model with fixed or random effects? Regime-switching models? MIDAS models? Choose the most appropriate model for your research question. Stata and EViews offer easy estimation and very informative output for all the models above and many more. Want help identifying the best model for you? Join our SOS Masters Dissertation help class this summer in Stata or EViews, where we will learn how to identify the best econometric model to support your findings and how to convince your reader that the model is robust, with a variety of easy-to-obtain post-estimation diagnostic tests.
  5. Make Sure to Discuss the Interpretation of Your Findings. You've got the best model with Stata or EViews; now make sure to discuss the interpretation of your findings and how they support your hypothesis. Which tests should you comment on? Is the R-squared really so important? Is there more convincing information in your software output that you can use? We will learn how to interpret parameters and findings from a wide range of models to support your conclusions.

Join Dr Marchese for our upcoming SOS Masters dissertation masterclasses, with a focus on Stata, and learn how to score the first you're capable of, all whilst enjoying the write-up and the econometrics!

Full Time Masters and PhD students from academic institutions around the world are eligible to apply for a subsidised place. Find out how to apply for a free spot here. 

Open Data And Policy Evidence Are Here To Stay

Over the first year of Covid-19 in the UK, the demand for data analysis boomed as never before. It seems that more was shared by government departments, agencies and other authoritative sources than ever before. Journalists took this up and news outlets competed to deliver the most up-to-date, meaningful and understandable data-driven content. And critically, the public engaged with data as never before. This transformation of public understanding and appetite for data is here to stay, which means higher standards for transparency around evidence and decision making.

In this article, Robert Grant considers what this will mean for public sector and NGO analysts and decision-makers.

Government

Open data and accountable decision-making are not new, but the public and organisational appetite for data is. This data (or statistics) helps us understand whether policy is justified, and plan for our own circumstances. That places demands on public sector and charitable data sources as never before. Politically, it is very unlikely that the cat will go back into the bag. It no longer seems trustworthy to ask those who are not privy to the data -- the public, companies and smaller public sector organisations -- to follow policy because some unspecified analysis happened on some unspecified data. Only three years ago, it was perfectly feasible for Brexit planning to be based on an expert analysis which the government refused to publish.

If this journey toward open data flows and public understanding of statistics within the timeframe of Covid is a microcosm of a longer-term trend, then what is new? Two rapid changes have appeared: a further degree of openness in national official statistics supporting policy decisions, and critical consumption of those statistics, first by journalists and then by the public.

Open data has been an imperative of UK government since the Cabinet Office's white paper on the subject in 2012. Availability has steadily gone up and latency has come down. There is even a requirement to move from static publication formats to interactive, queryable APIs. This has given an opportunity to local organisations with the technical skills in house to create their own tools that draw on national government data. The Open Data Institute has been actively promoting this efficient reuse of data and hosts a variety of case studies on their website.

Before Covid, government data, however much it may have technically been "open", was not something that journalists invested effort into checking and investigating, let alone members of the public.

Over the course of 2020, we saw a gradual expansion of the detail supporting policy decisions, from national to regional to local statistics, and from point predictions to predictions with uncertainty to alternative parallel models. Interestingly, this was not enough for a parliamentary committee, which issued a public complaint about the government's lack of data sharing and information supporting policy.

Some topics, like positive test results, were also broken down by age groups to illustrate particular trends. This may have been an exciting new level of transparency, but it quickly became the new normal; when the same granularity was not provided for vaccination statistics, it attracted official complaint from the UK Statistics Authority. It seems that at least this official regulator of government statistics will not be content to return to pre-Covid levels of openness.

The public

However useful the local statistics were, it soon became apparent that infections spread along edges in social networks, which are largely not known to government, and spill over (also in the economic sense) geographical boundaries. The obvious next question for critical consumers of the stats is "what is MY risk?" There is usually some way in which it is obvious that each of us will differ from the averages for our nation or even neighbourhood, but it is not at all obvious just how much the numbers will change.

The public have grappled with several new concepts such as incidence, prevalence, exponential growth, and competing statistics for diagnostic test accuracy. These are well-known pitfalls for students in medical statistics and epidemiology, and the fact that the public are stumbling across them means that the general level of understanding is rising fast.

By April 2020, there was even a TV comedy show joke about armchair epidemiology. When the public gained the insight to spot poor analyses, the era of the "data bros" was over.

Journalism

In those early days, when there was a lack of information, we also witnessed the phenomenon of "armchair epidemiologists": in the absence of authoritative forecasting early on, anyone with an Excel spreadsheet might produce a compelling-looking forecast, and be taken seriously by those who are searching frantically for any information.

Among the sins committed in this time were fitting normal probability density functions to cases, fitting polynomial curves to cases, and comparing countries on disease burden by counting cases (Monaco is doing really well!). It's easy to laugh at these errors in retrospect (and in possession of a degree in statistics), but each was adopted briefly by, let's just say, prominent organisations (there's nothing to be gained by naming and shaming after we have all learnt so much in a short time). In short, if accountable people who have the data don't communicate, someone else will. And if those in power don't have data either, they might just get pulled in by the allure of certainty.

David Spiegelhalter and Tim Harford, popular translators of complex analyses, were busier than ever explaining and critiquing the numbers. Often, a reframing of the same number brings a new insight: for example, moving from reporting the number of Covid deaths (hard to contextualise) to the percentage of deaths that were due to Covid.

"We are four weeks behind Italy" also came from the period (10 March 2020) when we had little information except for some confirmed cases by PCR test, which at the time was only being used on the most seriously ill people. But it had the advantage of narrative and referred to demonstrable, empirical events, and mobilised action in government and concern in the public.

Widespread dispute of, first, the Imperial epidemiological model (16 March 2020), which provided only a point prediction for any given timepoint, and later Patrick Vallance's "illustration" of exponential growth (21 Sep 2020), seem to show an intolerance of prognostication based only on theory without data, while predictions without uncertainty will almost inevitably be wrong. I think this is a new development in public discourse. It must be led by journalism and so, perhaps, comes out of a longer trend in data journalism, data visualisation, and numeracy in the media.

What's next?

The UK has an unusual system for data sharing from government: there is a policy of open data, and an independent statistics regulator. That makes it less likely that this recent trend will be reversed here, though it may be elsewhere. We might expect to see other parts of government, central and local, being held to the same standards, but it is not as simple a comparison as that.

Local government (including public health) have to work hard to build and maintain the infrastructure and systems needed for low-latency, high-quality data. Even where they succeed, local data is small data, which means more noise and more risk of identifying individuals.

Also, there are many aspects of policy-making that elude a simple number, notably, where competing interests have to be balanced. This is more the realm of economics than statistics, to develop utility models of the various harms, benefits and costs accruing to different parts of society in different ways. Even then, the politician is traditionally tasked with making a value judgement to synthesize the evidence.

Beyond the numbers, we have all been confronted with the fact that policy succeeds or fails purely on the extent of public understanding and support, or at least faith. Previously, faith was more the norm than critical querying of the statistics behind policy decisions, not least because the stats were hidden from view, or presented in confusing and, to be honest, boring formats.

Analysts and policy-makers need to be prepared to justify decisions more, whether that's in public health or elsewhere. You should expect your audience to be more critical, more quantitatively minded, and more curious than ever before. Covid-19 did that. But before you fear this new age of scrutiny, remember that they also appreciate your efforts more.

Robert Grant will be presenting the Introduction to Model Building Techniques with Stata, 23 June 2021. How do you know if your models are useful? Or maybe even wrong? This one-day course provides an introduction to the techniques used in model building.

This article is written by Robert Grant, a chartered statistician at BayesCamp.


Robert's email: robert@bayescamp.com

How to install Python packages into Stata

Written by Chuck Huber (director of statistical outreach - StataCorp).

Using pip to install Python packages

Let’s begin by typing python query to verify that Python is installed on our system and that Stata is set up to use Python.

The results indicate that Stata is set up to use Python 3.8, so we are ready to install packages.

NumPy is a popular package that is described as "the fundamental package for scientific computing with Python". Many other packages rely on NumPy's mathematical features, so let's begin by installing it. It is possible that NumPy is already installed on my system, and I can check by typing python which numpy in Stata.

NumPy is not found on my system, so I am going to install it. I am using Windows 10, so I will type shell in Stata to open a Windows Command Prompt.

Figure 1: Windows Command Prompt

shell will also open a terminal in Mac or Linux operating systems. Note that experienced Stata users often type ! rather than the word shell.

Next, I will use a program named pip to install NumPy. You can type pip -V in the Windows Command Prompt or terminal in Mac or Linux to see the version and location of your pip program.

Figure 2: pip version and location

The path for pip is the same as the path returned by python query above. You should verify this if you have multiple versions of Python installed on your system.

Next, type pip install numpy in the Command Prompt or terminal, and pip will download and install NumPy in the appropriate location on your system.

Figure 3: pip install numpy

The output tells us that NumPy was installed successfully.

We can verify that NumPy was installed successfully by again typing python which numpy.

Let’s install three more packages that we will use in the future. Pandas is a popular Python package used for importing, exporting, and manipulating data. We can install it by typing pip install pandas in the Command Prompt.

Figure 4: pip install pandas

You can watch a video that demonstrates how to use pip to install Pandas on the Stata YouTube channel.

Matplotlib is a popular package that “is a comprehensive library for creating static, animated, and interactive visualizations in Python”. We can install it by typing pip install matplotlib in the Command Prompt.

Figure 5: pip install matplotlib

Scikit-learn is a popular package for machine learning. We can install it by typing pip install scikit-learn in the Command Prompt.

Figure 6: pip install scikit-learn

Let's use python which to verify that pandas, matplotlib, and scikit-learn are installed.
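Putting the steps together, here is a minimal sketch of the whole sequence run from within Stata (it assumes pip is on your PATH and points at the same Python installation reported by python query; on Windows you may prefer to run the pip line in the Command Prompt opened by shell, as above):

. python query
. python which numpy
. shell pip install numpy pandas matplotlib scikit-learn
. python which numpy
. python which pandas
. python which matplotlib
. python which sklearn

Note that the scikit-learn package is imported under the module name sklearn, which is why the last line checks for sklearn rather than scikit-learn.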

Conclusion

We did it! We successfully installed four of the most popular Python packages using pip. You can use your Internet search engine to find hundreds of other Python packages and install them with pip.

Why you should use Stata and Python – The Power of Data Scraping

Are you tired of spending countless hours navigating websites and taking extensive notes for your research? Say hello to data scraping - a game-changing technique that allows you to extract information from websites and transform it into a spreadsheet.  

Web data scraping is one of the oldest techniques for extracting content from the Web, and it's valuable to a wide range of applications. The objective of this technique is to extract data from Web sources. It allows you to interact with a website and extract its data through a combination of human effort and automation. The retrieved data can then be edited, formatted, structured, and stored using data management techniques. The importance of this technique has grown with the increase in data available on the Web: so-called Big Data. Given the ever-increasing volume of information produced, shared, and consumed online, it is essential that this technique be part of your skills portfolio. Acquiring these skills will allow you to collect this data efficiently, with limited human effort, and therefore to obtain very quickly a large amount of data ready to be analysed. The application of data management and analysis techniques will then enable you to understand complex social and economic phenomena.

From a practical point of view, this technique can help companies obtain and analyse a great deal of information on the activities of their competitors, enabling them to understand market challenges and opportunities. Nowadays, the increasing amount of available data makes this technique even more interesting. Moreover, it is easy to set up a data scraping pipeline, with a minimum of programming effort, and to meet a number of practical needs.

The integration of Stata and Python provides a powerful solution for effortless data scraping. This guide will introduce you to the process of data acquisition using various Python libraries, followed by analysis in Stata. By mastering code replication in Python and analysing the resulting datasets in Stata, you will be equipped to tackle the challenges commonly encountered in data scraping projects.
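As a flavour of that workflow, here is a minimal sketch of a do-file (the URL is hypothetical, and it assumes the pandas and lxml packages are installed and that the page contains an HTML table): Python grabs the table, writes it to a CSV file, and Stata reads it straight back in for analysis.

python:
import pandas as pd
# read_html() returns one DataFrame per HTML table found on the page
tables = pd.read_html("https://example.com/prices.html")  # hypothetical URL
tables[0].to_csv("scraped.csv", index=False)              # keep the first table
end

import delimited using "scraped.csv", clear
describe

For pages without a clean HTML table, libraries such as requests and BeautifulSoup can retrieve and parse the content before it is handed to Stata in the same way.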

Register now on our short course in Data Scraping using Stata and Python with Dr Francesco Lopes on May 24 - 25

Register Here

New Stata Cheat Sheet: Call Python from Stata

StataCorp has recently developed a new resource for the Python integration feature that was expanded in Stata 17—a cheat sheet that demonstrates how to call Python from Stata. The cheat sheet includes everything from setup to executing Python code in Stata.

What is a Stata Cheat Sheet?

The Stata Cheat Sheets provide any user, whether new or experienced, with a well-structured and helpful guide to some of the Stata basics. The cheat sheets cover topics from data analysis to plotting in Stata, and now calling Python from Stata!

The latest Stata Cheat Sheet demonstrates how to call Python from Stata. To learn more about Calling Python from Stata, type "help pystata module" in Stata's Command window.

Download the Python & Stata Cheat Sheet.

Use Python and Stata together:

Python integration was first introduced in Stata 16, where Python's extensive language features could be leveraged within Stata. Fast forward to Stata 17, where Stata can be invoked from a standalone Python environment via the pystata Python package. Learn more about using Python and Stata together.

Upgrade to Stata 17 today to experience the full power of Stata's newest features, including the Python and Stata functionalities.

2022 UK Stata Conference | Proceedings Available Now

Proceedings are now available for the 2022 UK Stata Conference, which took place in London, UK, on 8 & 9 September 2022.

What is the UK Stata Conference?

The UK Stata Conference is the longest-running Conference of its kind, with this year's event being the 28th edition. We are incredibly excited to welcome people back to our dedicated in-person event. This two-day international event provides Stata users from all over the world the opportunity to exchange ideas, experiences, and information on new applications of the software.

Experience what happens when new and long-time Stata users from across all disciplines gather to discuss real-world applications of Stata.

The Conference Proceedings are available now, where you can read the invited Stata presentations from Yulia Marchenko, Jeff Pitblado, and Asjad Naqvi on a diverse collection of topics, including:

  • Bayesian Multilevel Modeling
  • Custom Estimation Tables
  • Advanced Data Visualizations with Stata: Part III
  • And lots more.

View the Proceedings now.

What’s New in Stata – Mapping Developments Over the Years

How has Stata changed from edition to edition?

Timberlake have been the official distributors of Stata in the UK for over 30 years. Over this period, we have seen the powerful evolution of the software.

Stata word cloud

Stata 1.0 was officially released in January 1985.

"a small program that could not claim to cover all of even mainstream statistics, any more than its competitors did. It could more fairly be described as a regression package with data management features. - Nicholas J. Cox  A brief history of Stata. 

In April 2021, we saw the arrival of the latest edition, Stata 17, the most powerful package so far.

After 36 years of improvements to the user interface, statistical features, visualizations and much more, Stata has become a spearhead in the statistical world, providing statisticians globally with the means to enrich their data.

Stata 17

  • Tables
  • Bayesian econometrics: VAR, DSGE, IRF, dynamic forecasts, and panel-data models
  • Faster Stata
  • Difference-in-differences (DID) and DDD models
  • Interval-censored Cox model
  • PyStata—Python/Stata integration
  • Jupyter Notebook with Stata
  • Multivariate meta-analysis
  • Bayesian multilevel models: nonlinear, joint, SEM-like, and more
  • Treatment-effects lasso estimation

And many more »

Stata 16

  • Lasso
  • Truly reproducible reporting
  • Meta-analysis
  • Python integration
  • Bayesian analysis: multiple chains, Bayesian predictions, Gelman–Rubin convergence diagnostic
  • Choice models
  • Import from SAS and SPSS
  • Panel-data models for endogenous covariates, sample selection, and treatment
  • Nonparametric series regression
  • Multiple datasets in memory, Do-File Editor autocompletion and syntax highlighting, and Mac's Dark mode

And many more »

Upgrade to Stata 17 now!

Take a look at the features of Stata as they developed over the years, with StataCorp's overview of the highlights across all releases here.