fabricatr is a package designed to help you imagine your data before you collect it. While many solutions exist for creating simulated datasets, fabricatr is specifically designed to make the creation of realistic social science datasets easy. In particular, we need to be able to imagine correlated data and hierarchical data. fabricatr is designed to integrate into a tidyverse workflow, and to allow users to imagine data from scratch or by modifying existing data.
Using fabricatr begins by making a call to the function fabricate(). fabricate() can be used to create single-level or hierarchical data. There are three main ways to call fabricate(): making a single-level dataset by specifying how many observations you would like; making a single-level dataset by importing data and optionally modifying it by creating new variables; and making a hierarchical dataset.
Making a single-level dataset begins with providing the argument N, a number representing the number of observations you wish to create, followed by a series of variable definitions. Variables can be defined using any function you have access to in R. fabricatr provides several simple functions for generating common types of data. These are covered below. Functions that create subsequent variables can rely on previously created variables, which allows variables to be correlated with one another:
    library(fabricatr)

    my_data <- fabricate(N = 5, Y = runif(N), Y2 = Y * 5)
    my_data
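To see how building one variable from another produces correlated data, here is a slightly larger sketch. The variable names (education, income), the coefficients, and the seed are hypothetical choices made for illustration only; the noise terms come from base R's rnorm().

```r
library(fabricatr)

set.seed(42)  # arbitrary seed so the draw is reproducible

# Hypothetical variables: income is built from education plus noise,
# so the two columns are correlated by construction.
survey <- fabricate(
  N = 200,
  education = rnorm(N, mean = 12, sd = 3),
  income = 20000 + 2500 * education + rnorm(N, sd = 5000)
)

# Sample correlation between the two simulated variables
cor(survey$education, survey$income)
```

Shrinking the standard deviation of the noise term strengthens the correlation, and enlarging it weakens the correlation.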
Instead of specifying the argument N, users can specify the argument data to import existing datasets. Once a dataset is imported, subsequent variables have access to N, representing the number of observations in the imported data. This makes it easy to augment existing data with simulations based on that data:
    # This example makes use of the "quakes" dataset, built into R,
    # which describes earthquakes off the coast of Fiji. The "mag"
    # variable contains the Richter magnitude of the earthquakes.
    simulated_quake_data <- fabricate(
      data = quakes,
      fatalities = round(pmax(0, rnorm(N, mean = mag)) * 100),
      insurance_cost = fatalities * runif(N, 1000000, 2000000)
    )
    head(simulated_quake_data)
Notice that variable creation calls are able to make reference to both the variables in the imported data set, and newly created variables. Also, function calls can be arbitrarily nested – the variable fatalities uses several nested function calls.
The most powerful use of fabricatr is to create hierarchical (“nested”) data. In the example below, we create 5 countries, each of which has 10 provinces:
    country_data <- fabricate(
      countries = level(
        N = 5,
        gdp_per_capita = runif(N, min = 10000, max = 50000),
        life_expectancy = 50 + runif(N, 10, 20) + ((gdp_per_capita > 30000) * 10)
      ),
      provinces = level(
        N = 10,
        has_nat_resources = draw_discrete(x = 0.3, N = N, type = "bernoulli"),
        has_manufacturing = draw_discrete(x = 0.7, N = N, type = "bernoulli")
      )
    )
    head(country_data)
Several things can be observed in this example. First, fabricate() knows that your second level() command will be nested under the first level of data. Each level gets its own ID variable, in addition to the variables you create. Second, the meaning of the variable N changes. During the level() call for countries, N is 5. During the level() call for provinces, N is 10. And the resulting data, of course, has 50 observations.
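To make the shape of the nested result concrete, the following base R sketch builds a comparable country-province table by hand. It is not how fabricatr is implemented, and it does not reproduce fabricatr's ID format; the names n_countries, n_provinces, and by_hand are hypothetical and used only for illustration.

```r
set.seed(2024)  # arbitrary seed for reproducibility

n_countries <- 5
n_provinces <- 10  # provinces per country

# Top level: one row per country.
countries <- data.frame(
  countries = seq_len(n_countries),
  gdp_per_capita = runif(n_countries, min = 10000, max = 50000)
)

# Nesting: repeat each country row once per province, so country-level
# values are carried down to every province row.
by_hand <- countries[rep(seq_len(n_countries), each = n_provinces), ]
rownames(by_hand) <- NULL

# Lower level: one ID and one random draw per province row.
by_hand$provinces <- seq_len(nrow(by_hand))
by_hand$has_nat_resources <- rbinom(nrow(by_hand), size = 1, prob = 0.3)

nrow(by_hand)             # 5 countries x 10 provinces = 50 rows
table(by_hand$countries)  # 10 province rows per country
```

Every province row carries its country's gdp_per_capita, which is why lower-level variables can refer to higher-level ones.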
Finally, the province-level variables are created using the draw_discrete() function. This is a function provided by fabricatr to make simulating discrete random variables simple. When you simulate your own data, you can use fabricatr’s functions, R’s built-ins, or any custom functions you wish. draw_discrete() is explained in our tutorial on variable generation using fabricatr.
fabricatr is also able to import existing data and nest hierarchical data under it. This may be useful if, for example, you have existing country-level data but wish to simulate data at lower geographical levels for the purposes of an experiment you plan to conduct.
Imagine importing the country-province data simulated in the previous example. Because fabricate() returns a data frame, this simulated data can be re-imported into a subsequent fabricate() call, just like external data can be.
In this example, we add a third level of data; for each of our 50 country-province observations, we now have 10 citizen-level observations. Citizen-level covariates like salary can draw from both the country-level covariate and the province-level covariate.
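A sketch of what such a call might look like, following the level() and draw_discrete() syntax used elsewhere in this document, is given below. It assumes a fabricatr release that provides those functions as documented here. The salary formula, its coefficients, and its standard deviation are illustrative assumptions. The block first rebuilds country_data so that it runs on its own.

```r
library(fabricatr)

# Rebuild the country-province data from the earlier example so this
# block runs on its own.
country_data <- fabricate(
  countries = level(
    N = 5,
    gdp_per_capita = runif(N, min = 10000, max = 50000),
    life_expectancy = 50 + runif(N, 10, 20) + ((gdp_per_capita > 30000) * 10)
  ),
  provinces = level(
    N = 10,
    has_nat_resources = draw_discrete(x = 0.3, N = N, type = "bernoulli"),
    has_manufacturing = draw_discrete(x = 0.7, N = N, type = "bernoulli")
  )
)

# Import the simulated data and nest 10 citizens under each
# country-province row. The salary formula is an illustrative
# assumption: it draws on one country-level variable and two
# province-level variables.
citizen_data <- fabricate(
  data = country_data,
  citizens = level(
    N = 10,
    salary = rnorm(
      N,
      mean = gdp_per_capita + has_nat_resources * 5000 + has_manufacturing * 5000,
      sd = 10000
    )
  )
)

head(citizen_data)
```

With 50 country-province rows and 10 citizens under each, the result should contain 500 citizen-level rows.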
Notice that the syntax for adding a new nested level to existing data is different from the syntax for adding new variables to the original dataset.
Suppose you have hierarchical data, and wish to simulate variables at a higher level of aggregation. For example, imagine you import a dataset containing citizens within countries, but you wish to simulate additional country-level variables. In fabricatr, you can do this using the level() function.
Let’s use our country-province data from earlier:
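A minimal sketch of such a call, again following the level() syntax used in this document and assuming a fabricatr release that provides it, might look like the following. The avg_temp definition is borrowed from the multi-level example later in this document; new_country_data is a hypothetical name. The block rebuilds country_data first so that it runs on its own.

```r
library(fabricatr)

# Rebuild the country-province data from the earlier example so this
# block runs on its own.
country_data <- fabricate(
  countries = level(
    N = 5,
    gdp_per_capita = runif(N, min = 10000, max = 50000),
    life_expectancy = 50 + runif(N, 10, 20) + ((gdp_per_capita > 30000) * 10)
  ),
  provinces = level(
    N = 10,
    has_nat_resources = draw_discrete(x = 0.3, N = N, type = "bernoulli"),
    has_manufacturing = draw_discrete(x = 0.7, N = N, type = "bernoulli")
  )
)

# "countries" already exists as a column of country_data, so this
# level() call adds a variable at the country level. N is not supplied.
new_country_data <- fabricate(
  data = country_data,
  countries = level(avg_temp = runif(N, 30, 80))
)

head(new_country_data)
```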
How does level() know whether to modify your data or add a new level? level() uses contextual information: if the name you provide to your level() call is already a field that exists in your data set, level() will treat this as a request to modify this level of data. If, on the other hand, you provide a name not used in the data set, level() will assume you mean to add nested data under the existing data.
We can observe that the new variable is created at the level of aggregation you chose: countries. Also, although N is not specified anywhere, level() knows how large N should be based on the number of countries it finds in the dataset. It is important, then, to ensure that the level() command is correctly assigned to the level of interest.
We can also modify more than one level. Recalling our country-province-citizen data from above, the following process is possible:
    new_citizen_data <- fabricate(
      data = citizen_data,
      countries = level(avg_temp = runif(N, 30, 80)),
      provinces = level(
        conflict_zone = draw_discrete(N, x = 0.2 + has_nat_resources * 0.3, type = "binary"),
        infant_mortality = runif(N, 0, 10) + conflict_zone * 10 + (avg_temp > 70) * 10
      ),
      citizens = level(
        college_degree = draw_discrete(N, x = 0.4 - (0.3 * conflict_zone), type = "binary")
      )
    )
Before assessing what this tells us about level(), let’s consider what the simulated data does. It creates a new variable at the country level, for a country-level average temperature. Subsequently, it creates a province-level binary indicator for whether the province is an active conflict site. Provinces that have natural resources are more likely to be in conflict in this simulation, drawing on conclusions from literature on “resource curses”. The infant mortality rate for the province is able to depend both on province-level data we have just generated, and country-level data: it is higher in high-temperature areas (reflecting literature on increased disease burden near the equator) and also higher in conflict zones. Citizens’ access to education is also random, but depends on whether they live in a conflict area.
There are a lot of things to learn from this example. First, it’s possible to modify multiple levels. Any new variable created will automatically propagate to the lower-level data accordingly: by setting an average temperature for a country, all provinces, and all citizens of those provinces, have the value for the country. Second, values created from one level() call can be used in subsequent variables of the same call, or in subsequent calls.
You’ve seen fabricatr’s ability to generate single-level and hierarchical data, which is enough to get you started on using the package. From here, you can learn about using draw_discrete() to generate discrete random variables, using fabricatr to bootstrap or resample hierarchical data, or advanced features.