fabricatr is a package designed to help you imagine your data before you collect it. While many solutions exist for creating simulated datasets, fabricatr is specifically designed to make the creation of realistic social science datasets easy. In particular, we need to be able to imagine correlated data and hierarchical data. fabricatr is designed to integrate into a tidyverse workflow, and to allow users to imagine data from scratch or by modifying existing data.

fabricatr is a member of the DeclareDesign software suite that also includes the r packages randomizr, estimatr, and DeclareDesign.

# Simulating “resampling” from existing data.

One way to imagine new data is to take data you already have and resample it, ensuring that existing inter-correlations between variables are preserved, while generating new data or expanding the size of the dataset. fabricatr offers several options to simulate resampling.

# Bootstrapping

The simplest option in fabricatr is to “bootstrap” data. Taking data with N observations, the “bootstrap” resamples these observations with replacement and generates N new observations. Existing observations may be used zero times, once, or more than once. Bootstrapping is very simple with the resample_data() function:

survey_data = fabricate(N=10,
voted_republican = draw_binary(N=N, x=0.5))

survey_data_new = resample_data(survey_data)
survey_data
ID voted_republican
01 1
02 0
03 1
04 1
05 0
06 1
07 0
08 0
09 1
10 0

It is also possible to resample fewer or greater number of observations from your existing data. We can do this by specifying the argument N to resample_data(). Consider expanding a small dataset to allow for better imagination of larger data to be collected later.

large_survey_data = resample_data(survey_data, N=100)
nrow(large_survey_data)

100

# Resampling hierarchical data

One of the most powerful features of all of fabricatr is the ability to resample from hierarchical data at any or all levels. Doing so requires specifying which levels you will want to resample with the ID_labels argument. Unless otherwise specified, all units from levels below the resampled level will be kept. In our earlier country-province-citizen dataset, resampling countries will lead to all provinces and citizens of the selected country being carried forward. You can resample at multiple levels simultaneously.

Consider this example, which takes a dataset containing 2 cities of 3 citizens, and resamples it into a dataset of 3 cities, each containing 5 citizens.

my_data <-
fabricate(
cities = level(N = 2, elevation = runif(n = N, min = 1000, max = 2000)),
citizens = level(N = 3, age = runif(N, 18, 70))
)

my_data_2 <- resample_data(my_data,
N = c(3, 5),
ID_labels = c("cities", "citizens"))
my_data_2
cities elevation citizens age
2 1971 4 44
2 1971 6 45
2 1971 6 45
2 1971 5 19
2 1971 4 44
1 1493 3 26
1 1493 2 35
1 1493 2 35
1 1493 1 33
1 1493 3 26
2 1971 4 44
2 1971 5 19
2 1971 5 19
2 1971 6 45
2 1971 4 44

resample_data() will first select the cities to be resampled. Then, for each city, it will continue by selecting the citizens to be resampled. If a higher level unit is used more than once (for example, the same city being chosen twice), and a lower level is subsequently resampled, the choices of which units to keep for the lower level will differ for each copy of the higher level. In this example, if city 1 is chosen twice, then the sets of five citizens chosen for each copy of the city 1 will differ.