Stata 18 is now available! Join us for a complimentary online session where we will review and demonstrate some key new features now available within Stata. You will gain hands-on practical insight into
Heterogeneous DID,
Lasso for Cox model,
Robust Inference for Linear Models, and
IV Quantile Regression.
This is a great opportunity to enhance your skills and stay up-to-date with the latest developments in Stata - statistics software for data science.
The Stata cheat sheets provide any user, whether new or experienced, with a well-structured and helpful guide to some of the Stata basics. Produced by data practitioners Dr Tim Essam and Dr Laura Hughes, the cheat sheets cover topics from data analysis to plotting in Stata.
Learn how to implement customizable tables in Stata with Chuck Huber (Director of Statistical Outreach at StataCorp). Throughout this helpful guide, Chuck expands on the functionality of the table command.
The table command is flexible for creating tables of many types—tabulations, tables of summary statistics, tables of regression results, and more. table can calculate summary statistics to display in the table. table can also include results from other Stata commands.
This guide consists of seven parts, from introducing the command to using custom styles and labels.
Today, I’m going to begin a series of blog posts about customizable tables in Stata. We expanded the functionality of the table command. We also developed an entirely new system that allows you to collect results from any Stata command, create custom table layouts and styles, save and use those layouts and styles, and export your tables to the most popular document formats. We even added a new manual to show you how to use this powerful and flexible system.
Before showing you how to create your own customizable tables, I want to share a few examples. I'll show you how to re-create these examples in future posts.
Table of statistical test results
Sometimes, we wish to report a formal hypothesis test for a group of variables. The table below reports the means of a group of continuous variables for participants with and without hypertension, the difference between the means, and the p-value for a t test.
Table for multiple regression models
We may also wish to create a table to compare the results of several regression models. The table below displays the odds ratios and standard errors for the covariates of three logistic regression models along with the AIC and BIC for each model.
Table for a single regression model
We may also wish to display the results of our final regression model. The table below displays the odds ratio, standard error, z score, p-value, and 95% confidence interval for each covariate in our final model.
You may prefer a different layout for your tables, and that is the point of this series of blog posts. My goal is to show you how to create your own customized tables and import them into your documents.
The data
Let’s begin by typing webuse nhanes2l to open a dataset that contains data from the National Health and Nutrition Examination Survey (NHANES), and let’s describe some of the variables we’ll be using.
. webuse nhanes2l
(Second National Health and Nutrition Examination Survey)
. describe age sex race height weight bmi highbp
> bpsystol bpdiast tcresult tgresult hdresult
Variable      Storage   Display    Value
    name         type    format    label      Variable label
------------------------------------------------------------------------------------------------------------------------
age              byte    %9.0g                Age (years)
sex              byte    %9.0g      sex       Sex
race             byte    %9.0g      race      Race
height          float    %9.0g                Height (cm)
weight          float    %9.0g                Weight (kg)
bmi             float    %9.0g                Body mass index (BMI)
highbp           byte    %8.0g              * High blood pressure
bpsystol          int    %9.0g                Systolic blood pressure
bpdiast           int    %9.0g                Diastolic blood pressure
tcresult          int    %9.0g                Serum cholesterol (mg/dL)
tgresult          int    %9.0g                Serum triglycerides (mg/dL)
hdresult          int    %9.0g                High density lipids (mg/dL)
This dataset contains demographic, anthropometric, and biological measures for participants in the United States. We will ignore the survey weights for now so that we can focus on the syntax for creating tables.
Introduction to the table command
The basic syntax of table is table (RowVars) (ColVars). The example below creates a table for the row variable highbp.
. table (highbp) ()
--------------------------------
                    |  Frequency
--------------------+-----------
High blood pressure |
  0                 |      5,975
  1                 |      4,376
  Total             |     10,351
--------------------------------
By default, the table displays the frequency for each category of highbp and the total frequency. The second set of empty parentheses in this example is not necessary because there is no column variable.
The example below creates a table for the column variable highbp. The first set of empty parentheses is necessary in this example so that table knows that highbp is a column variable.
. table () (highbp)
------------------------------------
          |   High blood pressure
          |      0       1     Total
----------+-------------------------
Frequency |  5,975   4,376    10,351
------------------------------------
The example below creates a cross-tabulation for the row variable sex and the column variable highbp. The row and column totals are included by default.
. table (sex) (highbp)
-----------------------------------
         |   High blood pressure
         |      0       1     Total
---------+-------------------------
Sex      |
  Male   |  2,611   2,304     4,915
  Female |  3,364   2,072     5,436
  Total  |  5,975   4,376    10,351
-----------------------------------
We can remove the row and column totals by including the nototals option.
. table (sex) (highbp), nototals
---------------------------------
         |  High blood pressure
         |      0       1
---------+-----------------------
Sex      |
  Male   |  2,611   2,304
  Female |  3,364   2,072
---------------------------------
We can also specify multiple row or column variables, or both. The example below displays frequencies for categories of sex nested within categories of highbp.
. table (highbp sex) (), nototals
--------------------------------
                    |  Frequency
--------------------+-----------
High blood pressure |
  0                 |
    Sex             |
      Male          |      2,611
      Female        |      3,364
  1                 |
    Sex             |
      Male          |      2,304
      Female        |      2,072
--------------------------------
Or we can display frequencies for categories of highbp nested within categories of sex as in the example below. The order of the variables in the parentheses determines the nesting structure in the table.
. table (sex highbp) (), nototals
------------------------------------
                        |  Frequency
------------------------+-----------
Sex                     |
  Male                  |
    High blood pressure |
      0                 |      2,611
      1                 |      2,304
  Female                |
    High blood pressure |
      0                 |      3,364
      1                 |      2,072
------------------------------------
We can specify similar nesting structures for multiple column variables. The example below displays frequencies for categories of sex nested within categories of highbp.
. table () (highbp sex), nototals
--------------------------------------------
          |       High blood pressure
          |        0               1
          |       Sex             Sex
          |  Male   Female   Male   Female
----------+---------------------------------
Frequency | 2,611    3,364  2,304    2,072
--------------------------------------------
Or we can display frequencies for categories of highbp nested within categories of sex as in the example below. Again, the order of the variables in the parentheses determines the nesting structure in the table.
. table () (sex highbp), nototals
----------------------------------------------------------
          |                     Sex
          |         Male                  Female
          |  High blood pressure    High blood pressure
          |      0       1              0       1
----------+-----------------------------------------------
Frequency |  2,611   2,304          3,364   2,072
----------------------------------------------------------
You can even specify three or more row or column variables. The example below displays frequencies for categories of diabetes nested within categories of sex, nested within categories of highbp.
. table (highbp sex diabetes) (), nototals
------------------------------------
                        |  Frequency
------------------------+-----------
High blood pressure     |
  0                     |
    Sex                 |
      Male              |
        Diabetes status |
          Not diabetic  |      2,533
          Diabetic      |         78
      Female            |
        Diabetes status |
          Not diabetic  |      3,262
          Diabetic      |        100
  1                     |
    Sex                 |
      Male              |
        Diabetes status |
          Not diabetic  |      2,165
          Diabetic      |        139
      Female            |
        Diabetes status |
          Not diabetic  |      1,890
          Diabetic      |        182
------------------------------------
The totals() option
We can include totals for a particular row or column variable by including the variable name in the totals() option. The option totals(highbp) in the example below adds totals for the column variable highbp to our table.
. table (sex) (highbp), totals(highbp)
---------------------------------
         |  High blood pressure
         |      0       1
---------+-----------------------
Sex      |
  Male   |  2,611   2,304
  Female |  3,364   2,072
  Total  |  5,975   4,376
---------------------------------
The option totals(sex) in the example below adds totals for the row variable sex to our table.
. table (sex) (highbp), totals(sex)
-----------------------------------
         |   High blood pressure
         |      0       1     Total
---------+-------------------------
Sex      |
  Male   |  2,611   2,304     4,915
  Female |  3,364   2,072     5,436
-----------------------------------
We can also request totals for a particular variable even when there are multiple row or column variables. The example below displays totals for the row variable highbp, even though there are two row variables in the table.
. table (sex highbp) (), totals(highbp)
------------------------------------
                        |  Frequency
------------------------+-----------
Sex                     |
  Male                  |
    High blood pressure |
      0                 |      2,611
      1                 |      2,304
  Female                |
    High blood pressure |
      0                 |      3,364
      1                 |      2,072
  Total                 |
    High blood pressure |
      0                 |      5,975
      1                 |      4,376
------------------------------------
The statistic() option
Frequencies are displayed by default, but you can specify other statistics with the statistic() option. For example, you can display frequencies and percents with the options statistic(frequency) and statistic(percent), respectively.
. table (sex) (highbp),
> statistic(frequency)
> statistic(percent)
> nototals
--------------------------------------
              |  High blood pressure
              |       0        1
--------------+-----------------------
Sex           |
  Male        |
    Frequency |   2,611    2,304
    Percent   |   25.22    22.26
  Female      |
    Frequency |   3,364    2,072
    Percent   |   32.50    20.02
--------------------------------------
We can also include the mean and standard deviation of age with the options statistic(mean age) and statistic(sd age), respectively.
. table (sex) (highbp),
> statistic(frequency)
> statistic(percent)
> statistic(mean age)
> statistic(sd age)
> nototals
-----------------------------------------------
                       |  High blood pressure
                       |        0          1
-----------------------+-----------------------
Sex                    |
  Male                 |
    Frequency          |    2,611      2,304
    Percent            |    25.22      22.26
    Mean               |  42.8625   52.59288
    Standard deviation |  16.9688   15.88326
  Female               |
    Frequency          |    3,364      2,072
    Percent            |    32.50      20.02
    Mean               | 41.62366   57.61921
    Standard deviation | 16.59921   13.25577
-----------------------------------------------
You can view a complete list of statistics for the statistic() option in the Stata manual.
The nformat() and sformat() options
We can use the nformat() option to specify the numerical display format for statistics in our table. In the example below, the option nformat(%9.0fc frequency) displays frequency with commas in the thousands place and no digits to the right of the decimal. The option nformat(%6.2f mean sd) displays the mean and standard deviation with two digits to the right of the decimal.
. table (sex) (highbp),
> statistic(frequency)
> statistic(percent)
> statistic(mean age)
> statistic(sd age)
> nototals
> nformat(%9.0fc frequency)
> nformat(%6.2f mean sd)
-----------------------------------------------
                       |  High blood pressure
                       |        0          1
-----------------------+-----------------------
Sex                    |
  Male                 |
    Frequency          |    2,611      2,304
    Percent            |    25.22      22.26
    Mean               |    42.86      52.59
    Standard deviation |    16.97      15.88
  Female               |
    Frequency          |    3,364      2,072
    Percent            |    32.50      20.02
    Mean               |    41.62      57.62
    Standard deviation |    16.60      13.26
-----------------------------------------------
We can use the sformat() option to add strings to the statistics in our table. In the example below, the option sformat("%s%%" percent) adds "%" to the statistic percent, and the option sformat("(%s)" sd) places parentheses around the standard deviation.
. table (sex) (highbp),
> statistic(frequency)
> statistic(percent)
> statistic(mean age)
> statistic(sd age)
> nototals
> nformat(%9.0fc frequency)
> nformat(%6.2f mean sd)
> sformat("%s%%" percent)
> sformat("(%s)" sd)
-----------------------------------------------
                       |  High blood pressure
                       |        0          1
-----------------------+-----------------------
Sex                    |
  Male                 |
    Frequency          |    2,611      2,304
    Percent            |   25.22%     22.26%
    Mean               |    42.86      52.59
    Standard deviation |  (16.97)    (15.88)
  Female               |
    Frequency          |    3,364      2,072
    Percent            |   32.50%     20.02%
    Mean               |    41.62      57.62
    Standard deviation |  (16.60)    (13.26)
-----------------------------------------------
The style() option
We can use the style() option to apply a predefined style to a table. In the example below, the option style(table-1) applies Stata’s predefined style table-1 to our table. This style changes the appearance of the row labels. You can view a complete list of Stata’s predefined styles in the manual, and I will show you how to create your own styles in a future blog post.
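For reference, the style option simply joins the option list of the previous example. This is a sketch of the command only (output omitted), combining the options used above with style(table-1):

```stata
. table (sex) (highbp),
>     statistic(frequency)
>     statistic(percent)
>     statistic(mean age)
>     statistic(sd age)
>     nototals
>     nformat(%9.0fc frequency)
>     nformat(%6.2f mean sd)
>     sformat("%s%%" percent)
>     sformat("(%s)" sd)
>     style(table-1)
```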
We learned a lot about the new-and-improved table command, but we have barely scratched the surface. We have learned how to create tables and use the nototals, totals(), statistic(), nformat(), sformat(), and style() options. There are many other options, and you can read about them in the manual. I’ll show you how to use collect to customize the appearance of your tables in my next post.
This blog post is written by Dr George Naufal, an assistant research scientist at the Public Policy Research Institute (PPRI) at Texas A&M University and a research fellow at the IZA Institute of Labor Economics.
Tables are in every report, article, and book chapter. Stata offers a way to create and customize tables using the commands table and collect.
We use the NLS data, which is based on the National Longitudinal Survey of Young Women aged 14-26 years in 1968. This is one of the many datasets used for training purposes. We first edit the labels of the variable collgrad:
label variable collgrad "College Graduate"
label define collgrad_label 0 "No" 1 "Yes"
label values collgrad collgrad_label
Then we use the command table to create a simple table of frequencies of those with a college graduate degree.

table () (collgrad) (), statistic(frequency)
The three sets of parentheses before the comma can be empty, contain variable names, or include a specific keyword.
If you would like to show the percent instead of frequencies, the code is:

table () (collgrad) (), statistic(percent)
Say now you would like the percent statistic to be a column rather than a row; the code becomes:

table (collgrad) (), statistic(percent)
Now, we would like to add a lot more detail to the table. We would like to show the frequency and percent of college graduates and show the mean and standard deviation of age and experience by college graduation status. The code becomes (including formatting options to show the mean and standard deviation to one decimal place):
table (var) (collgrad) (), statistic(frequency) statistic(percent) ///
    statistic(mean age exper) statistic(sd age exper) ///
    nformat(%9.1f percent) sformat("%s%%" percent) ///
    nformat(%9.1f mean) nformat(%9.1f sd)
We can edit the labels of age and exper to make the table cleaner and run the code again.

label variable age "Age (years)"
label variable exper "Total Work Experience (years)"
Say you would like to show the mean and standard deviation in separate sections (instead of under the variables): remove var from the first set of parentheses.
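The resulting command is the previous one with var removed from the first set of parentheses (a sketch; the statistic and format options are unchanged):

```stata
table () (collgrad) (), statistic(frequency) statistic(percent) ///
    statistic(mean age exper) statistic(sd age exper) ///
    nformat(%9.1f percent) sformat("%s%%" percent) ///
    nformat(%9.1f mean) nformat(%9.1f sd)
```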
You can export the table to any format (Word file, PDF, HTML, etc.):

collect export "…\table1.docx", as(docx) replace
One neat thing is that you can also create the table above using putdocx, combining it with the export code to produce the same table in a Word document.
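A minimal sketch of that putdocx approach, assuming the table command above has just been run (so its collection is current; putdocx collect places the current collection in the document):

```stata
putdocx begin
putdocx collect
putdocx save "…\table1.docx", replace
```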
Machine Learning (ML) is a rapidly developing field. There is enormous competition to be the inventor of a new method that gets widely adopted, and investors stand to make a fortune from a startup that implements a successful new analytical tool. I find it fascinating to watch the emerging hot topics. Recently, I noticed a trend. Hotshot VCs, take note! (But don't get too excited yet.)
First, let's look at some of the latest must-have techniques.
Time series
Me: Hello, old friend. What are you doing here?
Time series: Well, it turned out that not everything in life is a flat file of independent observations, so the ML people called me in.
Me: I thought you'd retired.
Time series: Ha ha ha! No.
H2O.ai, arguably the market leader in automated machine learning (autoML), have this to say right on the front page of their website:
Award-winning Automatic Machine Learning (AutoML) technology to solve the most challenging problems, including Time Series and Natural Language Processing.
Clearly, they think time series is one of the big selling points, or they wouldn't have put it there*. It even bumped the mighty Deep Learning off the Top Two. And perhaps the reason is that those highly flexible and non-linear modelling tools, like XGBoost, take no account of dependency in the data. Looking online for (for example) "XGBoost for market tick data", you'll find plenty of people who still believe you can push any data into a sufficiently flexible algorithm and get a useful model out. Elsewhere, you can find confusion between analysis of streaming data, where sliding windows of time define batches of data, and true time series models. I would argue that, to model data that are grounded in time and/or space, you should also leverage domain knowledge: that's a phrase we'll be hearing again soon.
Vendors working on the data storage and provisioning side have also been quick to offer a range of time-series-aware database products that scale to Big Data. Just search for "scalable time series database" and you'll find plenty. Five years ago it was a concept that you could shoehorn into a more generic type of database, but not a product in itself that could attract investment and get ML pulses racing.
It's hard not to chuckle as a statistician, seeing the ML folk discover autocorrelation 100 years on, but more constructively, this can help us predict where the next ML boom might come. Dependency does not just arise from time-located data, but also spatial data, networks (social or otherwise), heterogeneous data sources, and hierarchical structures. There is a lot of investment in geographical information systems and networks in the machine learning, AI and Big Data world, so it would be reasonable to see this appearing in the near future.
* - that's assuming they didn't tailor the front page for maximum appeal, having already categorised me under Statistics (I jest)
Uncertainty
For a while, ML algorithms became enormously popular on the strength of their predictions (point estimates). Then, unease set in about the lack of uncertainty in the outputs. Suppose you have to produce a categorical prediction. The algorithm returns the category with the best objective function, and that's the only information you have. Some decisions truly are binary (buy gold, sell zinc), while others are fuzzier, but even the binary ones can often be hedged.
In their book Computer Age Statistical Inference, Brad Efron and Trevor Hastie strike a conciliatory tone between statistics and ML, suggesting that most algorithms for data analysis start out with point estimates, and later, uncertainty is added. The EM algorithm is an example.
It was perhaps the rise of self-driving cars that made the absence of uncertainty too uncomfortable. Decisions do not always have to be taken by AI; they can be referred to the human driver (let's hope they haven't fallen asleep), but that requires a measure of uncertainty. This problem remains imperfectly solved, despite massive investment, and is likely to provide interesting job opportunities for some time to come.
Now, we can see efforts to add measures of uncertainty into ML parameter estimates and predictions. Often, those look a lot like 95% confidence or credible intervals, and there is a big advantage to adding in domain knowledge. Leading software TensorFlow Probability has this to say:
"The TensorFlow team built TFP for data scientists, statisticians, and ML researchers and practitioners who want to encode domain knowledge to understand data and make predictions."
Without statistical approaches (that is, mathematical models of the uncertainty, based on probability), adding uncertainty to ML requires re-fitting laborious models many times or adopting some uncomfortable approximation. Any alternative technique that provides a quantified uncertainty output alongside a point estimate, quickly and with well-established long-run properties, is going to be popular with bosses everywhere. Sounds like statistics?
Gaussian processes
These are a class of supervised learning models that use probability but are highly flexible. Taking a starting point from signal processing ideas of the mid-20th century, they envisage the variation in the data to arise from a random process. Between observations, the process moves according to some probability distribution, and Gaussian processes use the normal, or Gaussian, distribution for this, just like Brownian motion. There are also Dirichlet processes for categorical variables.
This necessitates an interesting inversion in conceiving of the estimation and inference problem. Instead of a small number of parameters deterministically leading to a large number of predictions, the data and predictions together are recast as a vector (or higher-dimensional tensor) of values that are subject to the parameters of the underlying process.
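In the usual notation, this says that the function values at any finite set of inputs are jointly multivariate normal, with the covariance supplied by a kernel function:

```latex
f \sim \mathcal{GP}\big(m(\cdot),\, k(\cdot,\cdot)\big)
\quad\Longrightarrow\quad
\big(f(x_1), \dots, f(x_n)\big) \sim \mathcal{N}(\mathbf{m}, \mathbf{K}),
\qquad \mathbf{m}_i = m(x_i),\; \mathbf{K}_{ij} = k(x_i, x_j)
```

The choice of kernel $k$ encodes assumptions about smoothness and correlation over time or space.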
The flexibility of Gaussian processes has seen them widely applied to time series and temporo-spatial problems in recent years. For autocorrelation over more than one variable, they are faster and far more robust than the conditionally autoregressive (CAR) models of yesteryear. However, this flexibility comes at a cost, and requires considerable expertise, which remains in extremely short supply. Gaussian processes are mentioned in many data science and ML courses, but usually only at a high level. Their successful implementation requires knowledge of how to choose from the esoteric multi-dimensional priors that they require, and how to tailor these to new settings. Statistical skills, in short!
Bayesian tuning parameter optimisation
There is no free lunch, in statistics as in ML. If you want to input x1, x2, x3, etc and get a prediction of y, you will need to constrain the space of possible models so that there are a number of parameters to estimate. Non-parametric models are sometimes framed as infinite-dimensional, but the essence is the same: choose your method and let the computer find the unknown parameters that can be combined to give an output. This applies to supervised and unsupervised learning equally, because we haven't said anything about the output y being in the dataset along with the x's.
Arising from computer science, many ML methods do not explicitly state a probabilistic model with parameters, but instead try to efficiently and flexibly find a way of converting x's into y. To get that right, there are tuning parameters that need to be set. (Often in ML, they are called hyperparameters, but this has a different meaning in Bayesian statistics, so I will try to avoid ambiguity.) This is generally done by human-in-the-loop experimentation, and more complex (hence flexible) procedures like neural networks can involve many such tuning parameters, which interact with one another. This gives rise to a notion of tuning the algorithm as art rather than science, but recently, there has been much interest in making the tuning part of the process more robust. You can see the problem, for the boss: the entire ML procedure that they have staked their reputation on depends on one geek who knows which buttons to push. That sort of situation keeps bosses awake at night.
Bayesian methods are the buzzword in tuning parameter optimisation. This holds out the promise of even including uncertainty about the tuning parameters in the final inference, although that is not what most ML people mean by Bayesian optimisation. Bayesian sampling algorithms are all about efficiently exploring parameter space to return the distribution of parameter values that fit the data. You can also apply this to explore tuning parameter space, to return the tuning parameters that lead to the best fit. The main difference is that there are only a few "observations" of tuning parameter values that have been tested; it is not a Big Data setting. In fact, it is generally our new friend Gaussian processes that are used as the tuning parameter model.
The ability to automate another part of the ML pipeline means ML engineer jobs being lost, but as always, there is a need for new skills, and those happen to be Bayesian sampling algorithms, probability, uncertainty and modelling small numbers of observations. If only there were a name for this new discipline...
Bayesian updating and windowing
Big Data might have passed the peak on the hype curve, but it remains alluring to many organisations and essential in some settings. Streaming data is a special case, where new data arrive too fast to be included by traditional means. The Covid-19 pandemic has thrown this into focus for many analysts. Epidemiological data, supply chains, and humanitarian aid have all had to be analysed urgently in a rapidly changing problem as data emerge.
The ability to update models rapidly as new data arrive will remain important for some time. That means analysing a new batch of data without having to re-run calculations on old data. In the streaming data paradigm, we might add the new data to a sliding window of time, and remove the oldest data from the window.
Here, Bayesian methods can help, by breaking down the posterior density of the parameters, for which we want estimates and uncertainty, into the product of the prior distribution and a sequence of likelihood functions for each batch.
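Concretely, for parameters $\theta$ and data batches $D_1, \dots, D_k$ (assuming the batches are conditionally independent given $\theta$), the factorisation is:

```latex
p(\theta \mid D_1, \dots, D_k)
\;\propto\;
p(\theta) \prod_{b=1}^{k} p(D_b \mid \theta)
\;\propto\;
p(\theta \mid D_1, \dots, D_{k-1})\, p(D_k \mid \theta)
```

so the posterior after the previous batch serves as the prior for the new one, and old data need not be revisited.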
Although this is alluded to in almost every introductory Bayesian class, it is rarely done because of the difficulty of defining a joint distribution for all unknowns together, and defining its hyperparameters from the posterior sample of the previous batch. Also, the success of the method is contingent on choosing a "right" form of joint distribution, and if you change your mind about it, you must return to the beginning again.
So, in practice, non-parametric updating methods are needed. These do not define a mathematical formula for the prior, but instead use the previous batch's posterior sample to estimate densities and gradients at any new values. This is a topic I have been actively involved in, and which shows promise for Big Data and streaming data settings. However, care has to be taken over the tuning parameters (those little devils again!) and data drift, requiring intensive expert input. It will continue to be a hot topic too, as the volume and velocity of data are only set to grow further.
Explainable AI
Parents everywhere warn their kids that some rowdy game is "fun until someone loses an eye". I have probably said it myself but was too tired at the time to remember. Fitting some complex predictive model to your data, launching it to the public and calling it "AI" is like that too. You're heading for promotion until it gets your employers bad PR because it is inadvertently biased against some group of people. There are many stories of this, and they are often painful to read.
The boss wants an assurance that your AI is "transparent" or "explainable". In other words, they want some best-practice procedure in place that they can cite should the worst happen. And as the boss is the one signing off the software purchase, you can be sure that ML vendors are climbing over each other to add explainability.
There are many ways to do this, which often involve either fitting simpler, interpretable models to the predictions in the locality of the point of interest, or comparing partly matched observations from the dataset. As an output, you can obtain measures of the importance of different "features" or input variables, and an idea of what predictions would be obtained by perturbing the data. Simpler models, matching, variable importance measures, goodness-of-fit stats, smaller data... I suspect you can see where this is going.
Conclusion
Statistics and ML have often been presented as disparate skill sets, with adherents at loggerheads over which is best. This sort of story-telling gets clicks, but is a gross misrepresentation. Statistics and machine learning are two names for the same thing; they represent two paths by which people have arrived at data analysis: from mathematics or from computer science.
There are some cultural differences, just as Leo Breiman observed in his 2001 paper "Statistical Modeling: The Two Cultures", but there is increasing cross-fertilisation of ideas and influences as time goes by. As there is a premium for both groups of people in finding new, effective methods for new problems, we might reasonably expect further convergence.
Once, there was a time of Venn diagrams. Stats and computer science were represented as essential components of data science (whatever that is), along with domain knowledge. There were many variants of these diagrams, increasingly complex and eagerly shared or stolen by bloggers everywhere. A big mistake was introduced when the intersection of the three components was changed from data science to data scientist. These diagrams underwent their own hype curve, inspiring and then disappointing bosses everywhere when they failed to recruit the Unicorn at the centre. But maybe we should think again about the intersection of these three influences. ML has been through a boom time, so to make the next set of advances, wouldn't we expect to see stats and domain knowledge catching up in innovation and in the job market?
This article was written by medical statistician and trainer Robert Grant.
Top tips from an academic and expert in the field, Dr. Malvina Marchese.
What is the secret to a great dissertation?
Writing an excellent dissertation requires a mixture of different skills. As an academic, I have supervised many MSc dissertations, and over the years I have found that these are the five things you need to deliver a first-class dissertation:
Pin Down the Research Question. Be very clear about what you are investigating so the reader knows what your final goal is from the outset and can concentrate on understanding your arguments.
Write a Captivating Introduction. You want to strongly demonstrate the importance of your analysis and highlight how your results contribute to the current literature. So ask yourself: do I really show that my hypothesis is supported by the data?
Make Your Data Speak. This is a top skill to show. Master an econometric software package and make sure that you present informative descriptive statistics and tables, hypothesis tests, and graphs of your data. Let the reader see the core of your information at a glance! Stata and EViews are great for this, with many options to run preliminary tests and to prepare great tables of your data, as well as many built-in datasets to help you construct your database.
Master the Econometrics. Linear regression? ARIMA? Panel model with fixed or random effects? Regime-switching models? MIDAS models? Choose the most appropriate model for your research question. Stata and EViews offer easy estimation and very informative output for all the models above and many more. Want help identifying the best model for you? Join our SOS Masters Dissertation help class this summer in Stata or EViews, where we will learn how to identify the best econometric model to support your findings and how to convince your reader that the model is robust, with a variety of easy-to-obtain post-estimation diagnostic tests.
Make Sure to Discuss the Interpretation of Your Findings. You've got the best model with Stata or EViews; now make sure to discuss the interpretation of your findings and how they support your hypothesis. Which tests should you comment on? Is the R-squared really so important? Is there more convincing information in your software output that you can draw on? We will learn how to interpret parameters and findings from a wide range of models to support your conclusions.
Join Dr Marchese for our upcoming SOS Masters dissertation masterclasses, with a focus on Stata, and learn how to score the first you're capable of, all whilst enjoying the write-up and the econometrics!
Over the first year of Covid-19 in the UK, the demand for data analysis boomed as never before. It seems that more was shared by government departments, agencies and other authoritative sources than ever before. Journalists took this up and news outlets competed to deliver the most up-to-date, meaningful and understandable data-driven content. And critically, the public engaged with data as never before. This transformation of public understanding and appetite for data is here to stay, which means higher standards for transparency around evidence and decision making.
In this article, Robert Grant considers what this will mean for public sector and NGO analysts and decision-makers.
Government
Open data and accountable decision-making are not new, but the public and organisational appetite for data is. This data (or statistics) helps us understand whether policy is justified, and plan for our own circumstances. That places demands on public sector and charitable data sources as never before. Politically, it is very unlikely that the cat will go back into the bag. It no longer seems trustworthy to ask those who are not privy to the data -- the public, companies and smaller public sector organisations -- to follow policy because some unspecified analysis happened on some unspecified data. Only three years ago, it was perfectly feasible for Brexit planning to be based on an expert analysis which the government refused to publish.
If this journey toward open data flows and understanding statistics within the timeframe of Covid is a microcosm of a longer-term trend, then what is new? Two rapid changes have appeared: a further degree of openness in national official statistics supporting policy decisions, and critical consumption, first by journalists, then by the public.
Open data has been an imperative of UK government since the Cabinet Office's white paper on the subject in 2012. Availability has steadily gone up and latency has come down. There is even a requirement to move from static publication formats to interactive, queryable APIs. This has given an opportunity to local organisations with the technical skills in house to create their own tools that draw on national government data. The Open Data Institute has been actively promoting this efficient reuse of data and hosts a variety of case studies on their website.
Before Covid, government data, however much it may have technically been "open", was not something that journalists invested effort into checking and investigating, let alone members of the public.
Over the course of 2020, we saw a gradual expansion of the detail supporting policy decisions, from national to regional to local statistics, and from point predictions to predictions with uncertainty to alternative parallel models. Interestingly, this was not enough for a parliamentary committee, which issued a public complaint about the government's lack of data sharing and information supporting policy.
Some topics, like positive test results, were also broken down by age groups to illustrate particular trends. This may have been an exciting new level of transparency, but it quickly became the new normal; when the same granularity was not provided for vaccination statistics, it attracted official complaint from the UK Statistics Authority. It seems that at least this official regulator of government statistics will not be content to return to pre-Covid levels of openness.
The public
However useful the local statistics were, it soon became apparent that infections spread along edges in social networks, which are largely not known to government, and spill over (also in the economic sense) geographical boundaries. The obvious next question for critical consumers of the stats is "what is MY risk?" There is usually some way in which it is obvious that each of us will differ from the averages for our nation or even neighbourhood, but it is not at all obvious just how much the numbers will change.
The public have grappled with several new concepts such as incidence, prevalence, exponential growth, and competing statistics for diagnostic test accuracy. These are well-known pitfalls for students in medical statistics and epidemiology, and the fact that the public are stumbling across them means that the general level of understanding is rising fast.
In those early days, when there was a lack of information, we also witnessed the phenomenon of "armchair epidemiologists": in the absence of authoritative forecasting early on, anyone with an Excel spreadsheet might produce a compelling-looking forecast, and be taken seriously by those who are searching frantically for any information.
Among the sins committed in this time were fitting normal probability density functions to cases, fitting polynomial curves to cases, and comparing countries on disease burden by counting cases (Monaco is doing really well!). It's easy to laugh at these errors in retrospect (and in possession of a degree in statistics), but each was adopted briefly by, let's just say, prominent organisations (there's nothing to be gained by naming and shaming after we have all learnt so much in a short time). In short, if accountable people who have the data don't communicate, someone else will. And if those in power don't have data either, they might just get pulled in by the allure of certainty.
David Spiegelhalter and Tim Harford, popular translators of complex analyses, were busier than ever explaining and critiquing the numbers. Often, a reframing of the same number brings a new insight: for example, moving from the number of Covid deaths (hard to contextualise) to the percentage of all deaths that were due to Covid.
"We are four weeks behind Italy" also came from the period (10 March 2020) when we had little information except for some confirmed cases by PCR test, which at the time was only being used on the most seriously ill people. But it had the advantage of narrative and referred to demonstrable, empirical events, and mobilised action in government and concern in the public.
Widespread dispute of, first, the Imperial epidemiological model (16 March 2020), which provided only a point prediction for any given timepoint, and later Patrick Vallance's "illustration" of exponential growth (21 Sep 2020), seem to show an intolerance of prognostication based only on theory without data, while predictions without uncertainty will almost inevitably be wrong. I think this is a new development in public discourse. It must be led by journalism and so, perhaps, comes out of a longer trend in data journalism, data visualisation, and numeracy in the media.
What's next?
The UK has an unusual system for data sharing from government: there is a policy of open data, and an independent statistics regulator. That makes it less likely that this recent trend will be reversed here, though it may be elsewhere. We might expect to see other parts of government, central and local, being held to the same standards, but it is not as simple a comparison as that.
Local government (including public health) have to work hard to build and maintain the infrastructure and systems needed for low-latency, high-quality data. Even where they succeed, local data is small data, which means more noise and more risk of identifying individuals.
Also, there are many aspects of policy-making that elude a simple number, notably, where competing interests have to be balanced. This is more the realm of economics than statistics, to develop utility models of the various harms, benefits and costs accruing to different parts of society in different ways. Even then, the politician is traditionally tasked with making a value judgement to synthesize the evidence.
Beyond the numbers, we have all been confronted with the fact that policy succeeds or fails purely on the extent of public understanding and support, or at least faith. Previously, faith was more the norm than critical querying of the statistics behind policy decisions, not least because the stats were hidden from view, or presented in confusing and, to be honest, boring formats.
Analysts and policy-makers need to be prepared to justify decisions more, whether that's in public health or elsewhere. You should expect your audience to be more critical, more quantitatively minded, and more curious than ever before. Covid-19 did that. But before you fear this new age of scrutiny, remember that they also appreciate your efforts more.
Robert Grant will be presenting the Introduction to Model Building Techniques with Stata, 23 June 2021. How do you know if your models are useful? Or maybe even wrong? This one-day course provides an introduction to the techniques used in model building.
Written by Chuck Huber (director of statistical outreach - StataCorp).
Using pip to install Python packages
Let’s begin by typing python query to verify that Python is installed on our system and that Stata is set up to use Python.
The results indicate that Stata is set up to use Python 3.8, so we are ready to install packages.
NumPy is a popular package that is described as “the fundamental package for scientific computing with Python”. Many other packages rely on NumPy's mathematical features, so let's begin by installing it. It is possible that NumPy is already installed on my system, and I can check by typing python which numpy in Stata.
NumPy is not found on my system, so I am going to install it. I am using Windows 10, so I will type shell in Stata to open a Windows Command Prompt.
Figure 1: Windows Command Prompt
shell will also open a terminal in Mac or Linux operating systems. Note that experienced Stata users often type ! rather than the word shell.
Next, I will use a program named pip to install NumPy. You can type pip -V in the Windows Command Prompt or terminal in Mac or Linux to see the version and location of your pip program.
Figure 2: pip version and location
The path for pip is the same as the path returned by python query above. You should verify this if you have multiple versions of Python installed on your system.
Next, type pip install numpy in the Command Prompt or terminal, and pip will download and install NumPy in the appropriate location on your system.
Figure 3: pip install numpy
The output tells us that NumPy was installed successfully.
We can verify that NumPy was installed successfully by again typing python which numpy
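Once python which finds the package, another quick sanity check (assuming the install really did succeed) is to run a couple of array operations, either inside a python: block in Stata or in plain Python:

```python
# Quick sanity check of NumPy's array mathematics after installation.
import numpy as np

a = np.array([1.0, 2.0, 3.0])
print(a.mean())       # arithmetic mean of the array
print((a * 2).sum())  # elementwise multiply, then sum
```

If either line raises ModuleNotFoundError, pip installed NumPy into a different Python than the one Stata is using; recheck the paths as described above.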
Let’s install three more packages that we will use in the future. Pandas is a popular Python package used for importing, exporting, and manipulating data. We can install it by typing pip install pandas in the Command Prompt.
Figure 4: pip install pandas
You can watch a video that demonstrates how to use pip to install Pandas on the Stata YouTube channel.
Matplotlib is a popular package that “is a comprehensive library for creating static, animated, and interactive visualizations in Python”. We can install it by typing pip install matplotlib in the Command Prompt.
Figure 5: pip install matplotlib
Scikit-learn is a popular package for machine learning. We can install it by typing pip install scikit-learn in the Command Prompt. (The package is installed as scikit-learn but imported in Python code as sklearn.)
Figure 6: pip install scikit-learn
Let’s use python which to verify that pandas, matplotlib, and scikit-learn are installed.
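The same verification can also be scripted outside Stata. This minimal sketch, using only Python's standard library, mirrors in spirit what python which does by asking whether each package can be found on the Python path:

```python
# Check whether packages are importable, similar in spirit to
# Stata's `python which`, using only the standard library.
import importlib.util

def is_installed(pkg: str) -> bool:
    """Return True if `pkg` can be found on the current Python path."""
    return importlib.util.find_spec(pkg) is not None

# Note: scikit-learn is imported under the name `sklearn`.
for pkg in ["numpy", "pandas", "matplotlib", "sklearn"]:
    print(f"{pkg}: {'found' if is_installed(pkg) else 'NOT found'}")
```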
Conclusion
We did it! We successfully installed four of the most popular Python packages using pip. You can use your Internet search engine to find hundreds of other Python packages and install them with pip.
Are you tired of spending countless hours navigating websites and taking extensive notes for your research? Say hello to data scraping - a game-changing technique that allows you to extract information from websites and transform it into a spreadsheet.
Web data scraping is one of the oldest techniques for extracting content from the Web, and it's valuable to a wide range of applications. The objective of this technique is to extract data from Web sources. It allows you to interact with a website and extract its data through the combined efforts of humans and automation. The retrieved data can be edited, formatted, structured and stored using data management techniques. The importance of this technique has grown with the increase in data available on the Web: so-called Big Data. Given the rising volume of information produced, shared, and consumed online, it is essential that this technique be part of your skills portfolio. Acquiring these skills will allow you to collect this data efficiently with limited human effort, and therefore to obtain, very quickly, a large amount of data ready to be analysed. The application of data management and analysis techniques will then enable you to understand complex social and economic phenomena.
From a practical point of view, this technique can help companies obtain and analyse a great deal of information on the activities of their competitors, enabling them to understand market challenges and opportunities. Nowadays, the increasing amount of available data makes this technique even more interesting. Moreover, it is easy to set up a data scraping pipeline, with a minimum of programming effort, and to meet a number of practical needs.
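To make the pipeline concrete, here is a minimal sketch using only Python's standard library: it parses an HTML table into rows that could then be written to CSV and analysed in Stata. The page content is a hard-coded stand-in for illustration; a real project would first fetch the page (e.g. with urllib or the requests library) and would often use a richer parser such as BeautifulSoup.

```python
# Parse an HTML table into rows of cell text using only the stdlib.
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collect the text of each <td>/<th> cell, one list per <tr>."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False
    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True
    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
        elif tag in ("td", "th"):
            self._in_cell = False
    def handle_data(self, data):
        if self._in_cell:
            self._row.append(data.strip())

# Stand-in for a downloaded page; in practice this string would come
# from an HTTP request.
html = "<table><tr><th>country</th><th>gdp</th></tr><tr><td>UK</td><td>3.1</td></tr></table>"
parser = TableParser()
parser.feed(html)
print(parser.rows)  # [['country', 'gdp'], ['UK', '3.1']]
```

From here, the rows can be saved with Python's csv module and loaded into Stata with import delimited for analysis.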
The integration of Stata and Python provides a powerful solution for effortless data scraping. This guide will introduce you to the process of data acquisition using various Python libraries, followed by analysis in Stata. By mastering code replication in Python and analysing the resulting datasets in Stata, you will be equipped to tackle the challenges commonly encountered in data scraping projects.
Register now on our short course in Data Scraping using Stata and Python with Dr Francesco Lopes on May 24 - 25
StataCorp has recently developed a new resource for the Python integration feature that was expanded in Stata 17: a cheat sheet that demonstrates how to call Python from Stata. The cheat sheet includes everything from setup to executing Python code in Stata.
What is a Stata Cheat Sheet?
The Stata Cheat Sheets provide any user, whether new or old, a well-structured and helpful guide to some of the Stata basics. The cheat sheet covers topics from data analysis to plotting in Stata, and now calling Python from Stata!
The latest Stata Cheat Sheet demonstrates how to call Python from Stata. To learn more about Calling Python from Stata, type "help pystata module" in Stata's Command window.
Python integration was first introduced in Stata 16, where Python's extensive language features could be leveraged within Stata. Fast forward to Stata 17, where Stata can be invoked from a standalone Python environment via the pystata Python package. Learn more about using Python and Stata together.
Upgrade to Stata 17 today to experience the full power of Stata's newest features, including the Python and Stata functionalities.
Proceedings are now available for the 2022 UK Stata Conference, which took place in London, UK, on 8 & 9 September 2022.
What is the UK Stata Conference?
The UK Stata Conference is the longest-running Conference of its kind, with this year's event being the 28th edition. We are incredibly excited to welcome people back to our dedicated in-person event. This two-day international event provides Stata users from all over the world the opportunity to exchange ideas, experiences, and information on new applications of the software.
Experience what happens when new and long-time Stata users from across all disciplines gather to discuss real-world applications of Stata.
The Conference Proceedings are available now, where you can read the invited Stata presentations from Yulia Marchenko, Jeff Pitblado and Asjad Naqvi on a diverse collection of topics.
Timberlake have been the official distributors of Stata in the UK for over 30 years. Over this period, we have seen the powerful evolution of the software.
Stata 1.0 was officially released in January 1985.
"a small program that could not claim to cover all of even mainstream statistics, any more than its competitors did. It could more fairly be described as a regression package with data management features. -Nicholas J. Cox A brief history of Stata.
In April 2021, we saw the arrival of the latest edition, Stata 17, the most powerful package so far.
After 36 years of improvements to the user interface, statistical features, visualizations and much more, Stata has become a spearhead in the statistical world, providing statisticians globally with the means to enrich their data.
Stata 17
Tables
Bayesian econometrics: VAR, DSGE, IRF, dynamic forecasts, and panel-data models
Faster Stata
Difference-in-differences (DID) and DDD models
Interval-censored Cox model
PyStata—Python/Stata integration
Jupyter Notebook with Stata
Multivariate meta-analysis
Bayesian multilevel models: nonlinear, joint, SEM-like, and more