Multiple datasets in memory

Highlights

  • Multiple datasets in memory simultaneously
  • Each dataset is stored in a frame
  • Frames are easy to use interactively
  • Frames are fully programmable, in both ado and Mata
  • Access data in frames from Java and Python

This is about changing the way you work.

Datasets in memory are stored in frames, and frames are named. When Stata launches, it creates a frame named default, but there is nothing special about it, and the name has no special or secret meaning. You can rename it.

You can create frames, and delete them, and rename them. The commands are

. frame create framename
. frame drop framename
. frame rename oldname newname

Stata will list the names of all the existing frames if you type

. frames dir

One of the frame names that frames dir lists will be the current frame. It is the frame that Stata commands assume that you want them to use. To find out the name of the current frame, type

. frame
  (current frame is default)

We are in the frame default. If we fit a regression, it would be fit on the data in default. Or we could change to another frame. We might type

. frame change myframe

Now if we fit a regression, it would be fit on the data in myframe.

So that is one way of working with frames. You can frame change, issue the Stata commands, and then frame change back.

Another way of working with frames is

. frame framename {
        stata_command
        stata_command
        .
        .
  }

and

. frame framename: one_stata_command

These commands run the Stata commands on the specified frame, and switch back to the original frame once they are finished.
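As a minimal sketch of the prefix form, the following uses two datasets shipped with Stata (auto and census). The frame name scratch is a hypothetical choice made for illustration. The only point is that the current frame is the same before and after the prefixed commands.

```stata
* Sketch only: the frame name "scratch" is hypothetical.
clear all                          // start from a single empty frame named default
sysuse auto                        // load auto into the current frame

frame create scratch
frame scratch: sysuse census       // load a second dataset into scratch
frame scratch: summarize pop       // runs on census, not on auto

frame                              // reports the current frame, which has not changed
describe, short                    // describes auto, the data in the current frame
```

The block form, frame scratch { ... }, behaves the same way when several commands need to run on the other frame.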

And the final way to work with frames is to link them. If a frame is linked to another, it can access the other frame's data without changing them. We will demonstrate that below.

Let's see it work

Here are five ways frames will change the way you work.

Example 1: Multitask.

You are working to finish your project when the phone rings. Something has to be handled right now. Here is what you do:

. frame create interruption

. frame change interruption

. use another_dataset

. do what needs doing

. frame change default

. frame drop interruption

Example 2: Use frames to perform tasks integral to your work.

You want to predict the income of men as if they were women and of women as if they were men. Frames provides yet another way you can do this. We are about to

  1. run a regression,
  2. change the data so that men are recorded as women and women as men,
  3. obtain predicted income on the changed data,
  4. and all the while not change the data.

Frames is how we will avoid changing the data.

. regress income i.sex##(i.ed c.age##c.age) i.occ

. frame copy default new

. frame new {
        replace sex = !sex       // reverse the sexes
        predict pincome
  }

. generate alt_income = _frval(new, pincome, _n)

. frame drop new

The generate command copied values from frame new by using the _frval() function. The argument _n specifies that observation 1 in new be copied to observation 1 in default, observation 2 in new to observation 2 in default, and so on.
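The same pattern can be tried end to end on the auto dataset shipped with Stata. This is only a sketch under assumptions: the model, the frame name flipped, and the variable names pprice and alt_price are illustrative choices, not part of the income example. Values are read from the other frame with the _frval() function, as it is named in Stata's function reference.

```stata
* Sketch only: model, frame name, and new variable names are hypothetical.
clear all
sysuse auto

* Fit a model in which the indicator of interest interacts with a covariate.
regress price i.foreign##c.weight

* Work on a copy so that the data in the current frame are never altered.
frame copy default flipped
frame flipped {
        replace foreign = !foreign     // swap domestic and foreign
        predict pprice                 // uses the estimates still in memory
}

* Bring the counterfactual prediction back, observation by observation.
generate alt_price = _frval(flipped, pprice, _n)
frame drop flipped

summarize price alt_price
```

Estimation results are not tied to a frame, which is why predict can be run inside the copied frame after the regression was fit in default.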

Example 3: Work with separate but related datasets simultaneously.

You have two files, persons.dta and counties.dta, that are related. The persons live in the counties. You can load the datasets into separate frames and link them.

. use persons

. frame create counties

. frame counties: use counties

. frlink m:1 countyid, frame(counties)

frlink links observations in the current frame to corresponding observations in the other frame. Variable countyid in persons.dta records the county in which each person lives. A variable of the same name in counties.dta records the county on which additional data are provided. The data were linked on countyid.

Assume counties contains a variable med_income containing each county's median income. Then you could type

. frget med_income, from(counties)

. regress income med_income educ age

The first command copies med_income from counties to the current frame. There are lots of issues in doing this, but they are handled automatically. Some individuals might live in counties that are not recorded in counties; for them, med_income is set to missing. Several individuals might live in the same county. And there may be counties in which no one in persons.dta lives. All of that is handled.
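To see those cases concretely, here is a small sketch with made-up data typed in by hand. The values, and the variable personid, are hypothetical; only countyid, income, med_income, and the frame name counties come from the example.

```stata
* Sketch only: all data values below are invented for illustration.
clear all

* Person-level data: two people share county 1; county 9 has no county record.
input personid countyid income
1 1 30000
2 1 45000
3 9 52000
end

* County-level data in its own frame: county 2 has no residents in the sample.
frame create counties
frame change counties
input countyid med_income
1 40000
2 38000
end
frame change default

* Link many persons to one county, then copy the county variable over.
frlink m:1 countyid, frame(counties)
frget med_income, from(counties)

list personid countyid income med_income
```

With no generate() option, frlink names the link variable after the frame, so from(counties) refers to that link variable.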

Example 4: Record results in another frame.

You can use one frame to record results from another. The frame create command, which we have used before, can also create new frames containing new variables. For instance,

. frame create newframename stat1 stat2

creates a new frame containing zero observations on variables named stat1 and stat2.

Another frame command,

. frame post framename (expression) (expression) ...

will add observations to an existing frame, filling in the variables with the values of the expressions.

Thus, we can use frame create to create a new frame ready to receive new observations, and we can use frame post to send the new observations we want to add. Here is an example of how we can put frame create and frame post to use.

How often will a sample of 100 draws from N(0,1) have a mean significantly different from 0 at the 5% level? Let's do 1,000 simulations.

. frame create results t p 

. forvalues i=1(1)1000 {
  2.         quietly set obs 100
  3.         quietly generate x = rnormal()
  4.         quietly ttest x=0
  5.         frame post results (r(t)) (r(p))
  6.         drop _all
  7. }

. frame results: count if p<=0.05
  43

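How often will a regressor drawn from N(0,1), and so unrelated to the outcome, produce a coefficient with |t|>2 in a regression? If the frame results from the previous simulation still exists, drop it first with frame drop results. Let's do 1,000 simulations: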

. sysuse auto
(1978 Automobile Data)

. frame create results b se

. forvalues i=1(1)1000 {
  2.         quietly generate x = rnormal()
  3.         quietly regress  mpg  x weight displ
  4.         frame post results (_b[x]) (_se[x])
  5.         drop x
  6. }

. frame results: count if abs(b/se) > 2
  54

Recording simulation results is one way you can use frame create and frame post. Here's another. We recently had a dataset with 2,000-plus variables in it, and we wanted to get its names organized and standardized. We started by creating a dataset of the variable names:

. frame create varnames str32 varname

. foreach name of varlist _all {
  2.         frame post varnames ("`name'")
  3. }

Now we had a dataset in frame varnames with 2,000-plus observations of variable varname. We looked at the dataset, sorted it, performed other shrewd transformations on it, and finally knew what we wanted to do. We started like this:

. frame change varnames
. rename varname oldname
. generate str32 newname = ""

Then, we copied some old names over to newname. We filled others in by hand. We even filled some of them in with programs we wrote. Finally, we reached the point where we had a new name for each original name.

Then, we used frames to change the names in the original data:

. frame change varnames

. local N = _N

. forvalues i=1(1)`N' {
  2.         local old = oldname[`i']
  3.         local new = newname[`i']
  4.         frame default: rename `old' `new'
  5. }

Then, we put the names in the order we had them in our dataset:

. local names = ""

. forvalues i=1(1)`N' {
  2.         local names = "`names' " + newname[`i']
  3. }

. frame default: order `names'

Example 5: Use frames to make your work easier.

Another frame feature is frame put for copying a subset of data from one frame to another. There are two variations. One copies a subset of the variables, and the other, a subset of the observations.

. frame put varlist, into(framename)

. frame put if expression, into(framename)

Here is how you might use them.

  1. You have hundreds of variables in your dataset. Right now, you want to look at only a few of them:

    . frame put city country gdp, into(subset)
    . frame change subset
    . stata_command
    . stata_command

    . frame change default
    . frame drop subset

  2. You have data for most cities and countries of the world. You want to analyze the data for Germany:

    . frame put if country=="Germany", into(subset)
    . frame change subset
    . stata_command
    . stata_command

    . frame change default
    . frame drop subset

We once had country data and wanted to perform country_analysis.do for each country separately, starting with Afghanistan and ending with Zimbabwe. We did the following and produced Afghanistan.log, Albania.log, Algeria.log, ... Zimbabwe.log.

. egen c = group(country)

. quietly summarize c

. local N_of_countries = r(max)

. forvalues i=1(1)`N_of_countries' {
  2.         frame put if c==`i', into(subset)
  3.         frame subset {
  4.                 local cntryname = country[1]
  5.                 log using "`cntryname'.log" 
  6.                 do country_analysis     
  7.                 log close             
  8.         }
  9.         frame drop subset
 10. }
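Both forms of frame put used in this example can be tried on the auto dataset shipped with Stata. This is a minimal sketch; the frame names fewvars and imports are hypothetical.

```stata
* Sketch only: the frame names "fewvars" and "imports" are hypothetical.
clear all
sysuse auto

* Copy a subset of the variables (all observations) into a new frame.
frame put make price mpg, into(fewvars)

* Copy a subset of the observations (all variables) into another new frame.
frame put if foreign == 1, into(imports)

frame fewvars: describe, short     // three variables, every car
frame imports: count               // foreign cars only

describe, short                    // the current frame is left untouched
```

When the subsets are no longer needed, frame drop fewvars and frame drop imports release the memory.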

Example 6: Make code run faster.

We said there were five ways frames will change the way you work, and yet here we are on number 6. We do not count this one because you do not have to change the way you work to experience the benefit.

The do- and ado-files that you have previously written that use preserve and restore will run faster if you use Stata/MP because it secretly uses frames in place of temporary files to preserve data. The speed-up is sometimes remarkable. We have old do- and ado-files that run 20 percent faster.

Timberlake Consultants