fabricatr is a package designed to help you imagine your data before you collect it. While many solutions exist for creating simulated datasets, fabricatr is specifically designed to make the creation of realistic social science datasets easy. In particular, we need to be able to imagine correlated data and hierarchical data. fabricatr is designed to integrate into a tidyverse workflow, and to allow users to imagine data from scratch or by modifying existing data.

fabricatr is a member of the DeclareDesign software suite, which also includes the R packages randomizr, estimatr, and DeclareDesign.

Basics

Using fabricatr begins with a call to the function fabricate(). fabricate() can be used to create single-level or hierarchical data. There are three main ways to call fabricate(): making a single-level dataset by specifying how many observations you would like; making a single-level dataset by importing data and optionally modifying it by creating new variables; and making a hierarchical dataset.

Single-level datasets from scratch

Making a single-level dataset begins with providing the argument N, a number representing the number of observations you wish to create, followed by a series of variable definitions. Variables can be defined using any function you have access to in R. fabricatr provides several simple functions for generating common types of data. These are covered below. Functions that create subsequent variables can rely on previously created variables, which ensures that variables can be correlated with one another:

library(fabricatr)
my_data <- fabricate(N = 5, Y = runif(N), Y2 = Y*5)
my_data
ID Y Y2
1 0.78 3.9
2 0.50 2.5
3 0.54 2.7
4 0.61 3.0
5 0.32 1.6

Single-level datasets using existing data

Instead of specifying the argument N, users can specify the argument data to import existing datasets. Once a dataset is imported, subsequent variables have access to N, representing the number of observations in the imported data. This makes it easy to augment existing data with simulations based on that data:

# This example makes use of the "quakes" dataset, built into R
# which describes earthquakes off the coast of Fiji. The "mag"
# variable contains the Richter magnitude of the earthquakes.

simulated_quake_data <- fabricate(data = quakes,
                                  fatalities = round(pmax(0, rnorm(N, mean = mag)) * 100),
                                  insurance_cost = fatalities * runif(N, 1000000, 2000000))
head(simulated_quake_data)
lat long depth mag stations fatalities insurance_cost
-20 182 562 4.8 41 611 1.0e+09
-21 181 650 4.2 15 290 3.0e+08
-26 184 42 5.4 43 427 7.8e+08
-18 182 626 4.1 19 470 5.2e+08
-20 182 649 4.0 11 456 7.2e+08
-20 184 195 4.0 12 390 7.0e+08

Notice that variable definitions can refer both to variables in the imported dataset and to newly created variables. Function calls can also be nested arbitrarily: the definition of fatalities nests rnorm() inside pmax() inside round().
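To see what the nested definition of fatalities computes, it can help to unpack it into separate steps. The following base-R sketch performs the same computation outside of fabricate(); the intermediate names raw_draw and non_negative are illustrative only and are not part of fabricatr.

```r
# Base-R sketch: unpack the nested call used to define `fatalities`.
# The intermediate variable names are hypothetical, chosen for illustration.
set.seed(1)                                   # make the random draws reproducible
N <- nrow(quakes)                             # inside fabricate(), N is supplied for you

raw_draw     <- rnorm(N, mean = quakes$mag)   # one normal draw per quake, centred on its magnitude
non_negative <- pmax(0, raw_draw)             # replace any negative draw with zero
fatalities   <- round(non_negative * 100)     # scale up and round to whole numbers

head(fatalities)
```

Each step operates on a vector of length N, which is why the whole expression can be written as a single nested call.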

Hierarchical data

The most powerful use of fabricatr is to create hierarchical (“nested”) data. In the example below, we create 5 countries, each of which has 10 provinces:

country_data <-
  fabricate(
    countries = level(N = 5, 
                      gdp_per_capita = runif(N, min=10000, max=50000),
                      life_expectancy = 50 + runif(N, 10, 20) + ((gdp_per_capita > 30000) * 10)),
    provinces = level(N = 10,
                      has_nat_resources = draw_discrete(x=0.3, N=N, type="bernoulli"),
                      has_manufacturing = draw_discrete(x=0.7, N=N, type="bernoulli"))
  )
head(country_data)
countries gdp_per_capita life_expectancy provinces has_nat_resources has_manufacturing
1 39,398 73 01 1 1
1 39,398 73 02 1 1
1 39,398 73 03 0 1
1 39,398 73 04 0 1
1 39,398 73 05 1 0
1 39,398 73 06 0 1

Several things can be observed in this example. First, fabricate() knows that your second level() call is nested under the first level of data. Each level gets its own ID variable, in addition to the variables you create. Second, the meaning of the variable N changes: during the level() call for countries, N is 5; during the level() call for provinces, N is 10. The resulting data therefore has 5 × 10 = 50 observations.
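The shape of the nested result can be reproduced by hand. The following base-R sketch is only an illustration of the structure described here, not fabricatr's implementation; the names n_countries, n_provinces, and nested are assumptions made for the example, and the province IDs are numbered 01 to 50 to resemble the output shown above.

```r
# Base-R sketch of the nested structure that the two level() calls describe.
# This illustrates the shape of the result; it is not how fabricatr works internally.
set.seed(1)

n_countries <- 5    # plays the role of N inside the countries level
n_provinces <- 10   # plays the role of N inside the provinces level (per country)

countries <- data.frame(
  countries      = as.character(seq_len(n_countries)),
  gdp_per_capita = runif(n_countries, min = 10000, max = 50000),
  stringsAsFactors = FALSE
)

# Repeat each country row once per province, then add province-level columns.
nested <- countries[rep(seq_len(n_countries), each = n_provinces), ]
rownames(nested) <- NULL
nested$provinces <- sprintf("%02d", seq_len(nrow(nested)))
nested$has_nat_resources <- rbinom(nrow(nested), size = 1, prob = 0.3)

nrow(nested)   # 50 rows: 5 countries x 10 provinces
```

Country-level values such as gdp_per_capita are constant within a country, while province-level values vary from row to row.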

Finally, the province-level variables are created using the draw_discrete() function. This is a function provided by fabricatr to make simulating discrete random variables simple. When you simulate your own data, you can use fabricatr’s functions, R’s built-ins, or any custom functions you wish. draw_discrete() is explained in our tutorial on variable generation using fabricatr.

Adding hierarchy to existing data

fabricatr is also able to import existing data and nest hierarchical data under it. This may be useful if, for example, you have existing country-level data but wish to simulate data at lower geographical levels for the purposes of an experiment you plan to conduct.

Imagine importing the country-province data simulated in the previous example. Because fabricate() returns a data frame, this simulated data can be re-imported into a subsequent fabricate call, just like external data can be.

citizen_data <- 
  fabricate(
    data = country_data,
    citizens = level(N=10,
                     salary = rnorm(N, 
                                    mean = gdp_per_capita + 
                                      has_nat_resources * 5000 + 
                                      has_manufacturing * 5000,
                                    sd = 10000)))
head(citizen_data)
countries gdp_per_capita life_expectancy provinces has_nat_resources has_manufacturing citizens salary
1 39,398 73 01 1 1 001 61,928
1 39,398 73 01 1 1 002 61,358
1 39,398 73 01 1 1 003 67,497
1 39,398 73 01 1 1 004 51,694
1 39,398 73 01 1 1 005 38,326
1 39,398 73 01 1 1 006 60,078

In this example, we add a third level of data; for each of our 50 country-province observations, we now have 10 citizen-level observations. Citizen-level covariates like salary can draw from both the country-level covariate and the province-level covariate.
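As a worked example of how the salary mean is assembled, the sketch below plugs in the values displayed for the first row of the output above (the displayed GDP figure is rounded, so the result is approximate). The name expected_salary is introduced here for illustration.

```r
# Worked example: the mean passed to rnorm() for salary, using the first row shown above.
# The GDP value is the rounded figure from the printed output.
gdp_per_capita    <- 39398
has_nat_resources <- 1
has_manufacturing <- 1

expected_salary <- gdp_per_capita +
  has_nat_resources * 5000 +
  has_manufacturing * 5000

expected_salary   # 49398
```

The six salaries shown above all fall within two standard deviations (sd = 10000) of this mean.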

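Notice that the syntax for adding a new nested level to existing data differs from the syntax for adding new variables to the original dataset: new variables are passed directly as arguments to fabricate(), whereas a new level is wrapped in a named level() call.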

Modifying existing levels

Suppose you have hierarchical data, and wish to simulate variables at a higher level of aggregation. For example, imagine you import a dataset containing citizens within countries, but you wish to simulate additional country-level variables. In fabricatr, you can do this using the level() command.

Let’s use our country-province data from earlier:

new_country_data <-
  fabricate(
    data = country_data,
    countries = level(avg_temp = runif(N, 30, 80))
  )

head(new_country_data)
countries provinces has_nat_resources has_manufacturing gdp_per_capita life_expectancy avg_temp
1 01 1 1 39,398 73 33
1 02 1 1 39,398 73 33
1 03 0 1 39,398 73 33
1 04 0 1 39,398 73 33
1 05 1 0 39,398 73 33
1 06 0 1 39,398 73 33

How does level() know whether to modify your data or add a new level? level() uses contextual information – if the name you provide to your level() call is already a field that exists in your data set, level() will treat this as a request to modify this level of data. If, on the other hand, you provide a name not used in the data set, level() will assume you mean to add nested data under the existing data.

We can observe that the new variable is created at the level of aggregation you chose – countries. Also, although N is not specified anywhere, level() knows how large N should be based on the number of countries it finds in the dataset. It is important, then, to ensure that the level() command is correctly assigned to the level of interest.
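The idea of drawing one value per country and sharing it across that country's rows can be sketched in base R. This is an illustration of the behaviour described here, not fabricatr's implementation; the toy data frame and the names toy and country_ids are made up for the example.

```r
# Base-R sketch of modifying an existing level: one draw per country,
# shared by every row that belongs to that country.
set.seed(1)

toy <- data.frame(
  countries = rep(c("1", "2", "3"), each = 4),
  provinces = sprintf("%02d", 1:12),
  stringsAsFactors = FALSE
)

country_ids <- unique(toy$countries)                 # N at this level is 3, not 12
avg_temp    <- runif(length(country_ids), 30, 80)    # one value per country
names(avg_temp) <- country_ids

toy$avg_temp <- unname(avg_temp[toy$countries])      # each province inherits its country's value
toy
```

If the draw were made at the wrong level, for instance once per row, provinces in the same country would end up with different "country-level" temperatures.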

We can also modify more than one level. Recalling our country-province-citizen data from above, the following process is possible:

new_citizen_data <-
  fabricate(
    data = citizen_data,
    countries = level(avg_temp = runif(N, 30, 80)),
    provinces = level(conflict_zone = draw_discrete(N, 
                                                    x=0.2 + has_nat_resources * 0.3,
                                                    type="binary"),
                      infant_mortality = runif(N, 0, 10) + 
                        conflict_zone * 10 + 
                        (avg_temp > 70) * 10),
    citizens = level(college_degree = draw_discrete(N, 
                                                    x=0.4 - (0.3 * conflict_zone), 
                                                    type="binary"))
  )

Before assessing what this tells us about level(), let’s consider what the simulation does. It creates a new country-level variable for average temperature. It then creates a province-level binary indicator for whether the province is an active conflict site. Provinces that have natural resources are more likely to be in conflict in this simulation, drawing on conclusions from the literature on “resource curses”. The infant mortality rate for the province depends both on the province-level data we have just generated and on country-level data: it is higher in high-temperature areas (reflecting literature on increased disease burden near the equator) and also higher in conflict zones. Citizens’ access to education is also random, but depends on whether they live in a conflict area.

There is a lot to learn from this example. First, it’s possible to modify multiple levels in a single call. Second, any new variable automatically propagates to the lower-level data: when you set an average temperature for a country, all provinces in that country, and all citizens of those provinces, receive the country’s value. Third, values created in one level() call can be used by later variables in the same call or in subsequent level() calls.

Again, we see the use of draw_discrete(). Using this function is covered in our tutorial on generating discrete random variables, linked below.
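The probabilities used in the example above are simple arithmetic on 0/1 indicators. The base-R sketch below shows the values they take. It assumes that a binary draw with probability x behaves like rbinom(n, 1, x), and for brevity it draws college_degree once per province rather than once per citizen; the input vector is made up for illustration.

```r
# Base-R sketch of the probabilities behind the binary draws in the example.
# Assumption: a "binary" draw with probability x behaves like rbinom(n, 1, x).
set.seed(1)

has_nat_resources <- c(0, 1, 0, 1, 1)                 # hypothetical province indicators

p_conflict    <- 0.2 + has_nat_resources * 0.3        # 0.2 without resources, 0.5 with
conflict_zone <- rbinom(length(p_conflict), size = 1, prob = p_conflict)

p_degree       <- 0.4 - 0.3 * conflict_zone           # 0.4 outside conflict zones, 0.1 inside
college_degree <- rbinom(length(p_degree), size = 1, prob = p_degree)

data.frame(has_nat_resources, p_conflict, conflict_zone, p_degree, college_degree)
```

Because each probability is built from variables created earlier, the resulting indicators are correlated with one another by construction.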

Next Steps

You’ve seen fabricatr’s ability to generate single-level and hierarchical data, which is enough to get you started on using the package. From here, you can learn about using draw_discrete() to generate discrete random variables, using fabricatr to bootstrap or resample hierarchical data, or advanced features.