fabricatr is a package designed to help you imagine your data before you collect it. While many solutions exist for creating simulated datasets, fabricatr is specifically designed to make the creation of realistic social science datasets easy. In particular, we need to be able to imagine correlated data and hierarchical data.

fabricatr is a member of the DeclareDesign software suite, which includes the R packages randomizr, estimatr, and DeclareDesign. fabricatr plays well with the tidyverse.

# Basics

The workhorse function is fabricate. You provide a number to N, then a series of named expressions, each of which defines a variable. A nice feature is that you can use N as an argument to any of the functions you supply. Later variables can depend on values defined earlier, making the creation of correlated data easy.

library(fabricatr)
my_data <- fabricate(N = 5, Y = runif(N), Y2 = Y*5)
my_data
##   ID          Y        Y2
## 1  1 0.41562700 2.0781350
## 2  2 0.24783623 1.2391812
## 3  3 0.26713051 1.3356526
## 4  4 0.74410844 3.7205422
## 5  5 0.06653615 0.3326807
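
To make the point about correlated data concrete, here is a minimal sketch in which Y is built from X plus independent noise. The variable names, the coefficient 0.5, and the seed are illustrative assumptions, not part of the original example.

```r
library(fabricatr)

set.seed(42)  # arbitrary seed, only so the draw is reproducible

# X is defined first; Y refers to X, so the two columns are
# positively correlated by construction.
sim <- fabricate(
  N = 1000,
  X = rnorm(N),
  Y = 0.5 * X + rnorm(N)
)

# Sample correlation between the two fabricated variables
cor(sim$X, sim$Y)
```

Under these assumptions the population correlation is 0.5 / sqrt(1.25), roughly 0.45, so the sample value should land near that figure.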

# Hierarchical data

We can create hierarchical data through use of the level function. In the example below, we create 2 cities, each with an elevation. We then create 3 citizens per city, each with an income. There are three things to notice about this example:

1. The meaning of N changes. In the cities line, N means 2, the number of cities. In the citizens line, N means 3, the number of citizens per city.
2. Data created at the cities level are constant within cities: each city has its own elevation. Data created at the citizens level are not constant within cities.
3. Variables created at a lower level can depend on variables created at a higher level. Citizens' income depends on the elevation of their city.

my_data <-
  fabricate(
    cities = level(N = 2, elevation = runif(n = N, min = 1000, max = 2000)),
    citizens = level(N = 3, income = round(elevation * rnorm(n = N, mean = 5)))
  )
my_data
##   cities elevation citizens income
## 1      1  1112.361        1   4874
## 2      1  1112.361        2   4609
## 3      1  1112.361        3   4354
## 4      2  1736.349        4   8673
## 5      2  1736.349        5   5764
## 6      2  1736.349        6  10800
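
The nested structure in the output above can be reproduced by hand in base R, which may help clarify what "constant within cities" means. This is only an illustration of the resulting shape, not a description of how fabricatr works internally; the names n_cities, per_city, and by_hand are hypothetical.

```r
set.seed(1)  # arbitrary seed for reproducibility

n_cities <- 2
per_city <- 3

# Upper level: one row per city.
cities <- data.frame(
  cities = seq_len(n_cities),
  elevation = runif(n_cities, min = 1000, max = 2000)
)

# Lower level: repeat each city row once per citizen, so elevation
# is constant within a city, then draw income at the citizen level.
by_hand <- cities[rep(seq_len(n_cities), each = per_city), ]
by_hand$citizens <- seq_len(nrow(by_hand))
by_hand$income <- round(by_hand$elevation * rnorm(nrow(by_hand), mean = 5))
rownames(by_hand) <- NULL

by_hand
```

The level-based call does the same bookkeeping in one step, which is why N can mean something different on each line.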

# Bringing in your own data

An essential part of imagining your data before you collect it is the ability to build on the data you already have. One way to do this is to add new variables to an existing dataset.

A second way you may wish to use existing data is to bootstrap a new dataset from it, thereby preserving all the natural inter-correlations.

## Modifying existing data

If you have already conducted a baseline survey, you may wish to imagine how the endline may deviate from it. In this case, you will want to add new variables to your existing dataset. Notice that N in the definition of Y_post automatically refers to the number of rows in the dataset provided to the data argument.

baseline_survey <- fabricate(N = 5, Y_pre = rnorm(N))

my_endline <- fabricate(data = baseline_survey,
                        Y_post = Y_pre + rnorm(N))
my_endline
##   ID      Y_pre      Y_post
## 1  1 -1.1493787 -1.85048959
## 2  2  0.2465355  1.30343303
## 3  3  0.2900489  0.18682424
## 4  4 -0.5520475  0.09666843
## 5  5  0.3632573  2.53695339
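
The same pattern extends to several new variables at once. The sketch below adds a hypothetical treatment indicator Z and an endline outcome that depends on it; Z, the effect size 0.3, and the seed are assumptions made for illustration.

```r
library(fabricatr)

set.seed(7)  # arbitrary seed for reproducibility

baseline <- fabricate(N = 8, Y_pre = rnorm(N))

# N inside the definitions below is the number of rows in `baseline`.
endline <- fabricate(
  data = baseline,
  Z = rbinom(N, size = 1, prob = 0.5),   # hypothetical treatment indicator
  Y_post = Y_pre + 0.3 * Z + rnorm(N)    # 0.3 is an assumed effect size
)

# The existing rows are kept; only new columns are added.
nrow(endline) == nrow(baseline)
```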

## Bootstrapping

Suppose you wanted to bootstrap from your baseline survey.

bootstrapped_data <- resample_data(baseline_survey, N = 10)
bootstrapped_data
##    ID      Y_pre
## 1   1 -1.1493787
## 2   1 -1.1493787
## 3   2  0.2465355
## 4   3  0.2900489
## 5   2  0.2465355
## 6   2  0.2465355
## 7   1 -1.1493787
## 8   5  0.3632573
## 9   1 -1.1493787
## 10  3  0.2900489
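
For a flat dataset, bootstrapping amounts to drawing rows with replacement. The base R sketch below shows that idea on a small stand-in dataset; it is meant as a conceptual equivalent, not as the package's implementation, and the names baseline, idx, and boot are hypothetical.

```r
set.seed(3)  # arbitrary seed for reproducibility

# Stand-in for a baseline survey with 5 respondents
baseline <- data.frame(ID = 1:5, Y_pre = rnorm(5))

# Draw 10 row indices with replacement, then pull those rows.
idx <- sample(seq_len(nrow(baseline)), size = 10, replace = TRUE)
boot <- baseline[idx, ]
rownames(boot) <- NULL

boot
```

Because whole rows are drawn, every variable attached to a respondent travels with it, which is what preserves the correlations among columns.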

The real utility of this function comes when bootstrapping from hierarchical data. The example below takes a dataset that contains 2 cities, each with 3 citizens, then bootstraps to 3 cities, each with 5 citizens.

my_data <-
  fabricate(
    cities = level(N = 2, elevation = runif(n = N, min = 1000, max = 2000)),
    citizens = level(N = 3, income = round(elevation * rnorm(n = N, mean = 5)))
  )

my_data_2 <- resample_data(my_data, N = c(3, 5), ID_labels = c("cities", "citizens"))
my_data_2
##    cities elevation citizens income
## 1       1  1941.207        3  12747
## 2       1  1941.207        3  12747
## 3       1  1941.207        3  12747
## 4       1  1941.207        2   8524
## 5       1  1941.207        2   8524
## 6       1  1941.207        2   8524
## 7       1  1941.207        2   8524
## 8       1  1941.207        2   8524
## 9       1  1941.207        2   8524
## 10      1  1941.207        1   9949
## 11      1  1941.207        1   9949
## 12      1  1941.207        1   9949
## 13      1  1941.207        1   9949
## 14      1  1941.207        1   9949
## 15      1  1941.207        1   9949
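
One way to think about hierarchical bootstrapping is as a two-stage draw: first sample clusters with replacement, then sample units with replacement inside each drawn cluster. The base R sketch below follows that logic under stated assumptions. The helper resample_two_level and the toy values in pop are hypothetical, and the package's own procedure may differ in its details.

```r
set.seed(11)  # arbitrary seed for reproducibility

# Toy population: 2 cities with 3 citizens each (made-up values)
pop <- data.frame(
  cities = rep(1:2, each = 3),
  elevation = rep(c(1200, 1800), each = 3),
  citizens = 1:6,
  income = c(6000, 5500, 7000, 9000, 8500, 9500)
)

# Hypothetical helper: two-stage resampling with replacement.
resample_two_level <- function(data, n_clusters, n_units, cluster_var) {
  cluster_ids <- unique(data[[cluster_var]])
  # Stage 1: draw clusters with replacement.
  drawn <- cluster_ids[sample.int(length(cluster_ids), n_clusters, replace = TRUE)]
  # Stage 2: within each drawn cluster, draw units with replacement.
  pieces <- lapply(drawn, function(id) {
    rows <- which(data[[cluster_var]] == id)
    data[rows[sample.int(length(rows), n_units, replace = TRUE)], ]
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

boot <- resample_two_level(pop, n_clusters = 3, n_units = 5, cluster_var = "cities")
nrow(boot)  # 3 clusters x 5 units = 15 rows
```

City-level variables such as elevation stay attached to their city throughout, so the within-cluster structure survives the resampling.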

## Ns that vary

When making hierarchical data, you may not want every higher-level unit to contain the same number of lower-level units. In the example below, we want one city to have 2 citizens and the other to have 4:

my_data <-
  fabricate(
    cities = level(N = 2, elevation = runif(n = N, min = 1000, max = 2000)),
    citizens = level(N = c(2, 4), income = round(elevation * rnorm(n = N, mean = 5)))
  )
my_data
##   cities elevation citizens income
## 1      1  1268.314        1   6200
## 2      1  1268.314        2   5622
## 3      2  1561.877        3   8468
## 4      2  1561.877        4   4951
## 5      2  1561.877        5   6864
## 6      2  1561.877        6   9126

You can even have Ns that are determined by a function, enabling a random number of citizens per city:

my_data <-
  fabricate(
    cities = level(N = 2, elevation = runif(n = N, min = 1000, max = 2000)),
    citizens = level(
      N = sample(1:6, size = 2, replace = TRUE),
      income = round(elevation * rnorm(n = N, mean = 5))
    )
  )
my_data
##    cities elevation citizens income
## 1       1  1989.295       01  13104
## 2       1  1989.295       02   8905
## 3       1  1989.295       03  15421
## 4       1  1989.295       04  10432
## 5       2  1068.203       05   5931
## 6       2  1068.203       06   5902
## 7       2  1068.203       07   4814
## 8       2  1068.203       08   5204
## 9       2  1068.203       09   6509
## 10      2  1068.203       10   4036
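
A vector-valued N can be read as "one count per higher-level unit". The short base R sketch below shows how such a vector expands into a city index for each citizen row; n_per_city and city_index are hypothetical names, and the sketch illustrates the bookkeeping only.

```r
set.seed(5)  # arbitrary seed for reproducibility

# One randomly drawn citizen count per city (2 cities)
n_per_city <- sample(1:6, size = 2, replace = TRUE)

# Expand the counts into one city label per citizen row.
city_index <- rep(seq_along(n_per_city), times = n_per_city)

# The total number of citizen rows is the sum of the per-city counts.
length(city_index) == sum(n_per_city)
table(city_index)
```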

## Bringing in your own hierarchical data

Suppose you had existing hierarchical data, and you wanted to add variables that respected the levels.

my_baseline_data <-
  fabricate(
    cities = level(N = 2, elevation = runif(n = N, min = 1000, max = 2000)),
    citizens = level(N = 3, income = round(elevation * rnorm(n = N, mean = 5)))
  )

# add new variables at each level
my_data <-
  fabricate(data = my_baseline_data,
            cities = level(density = elevation / 2),
            citizens = level(wealth = income - 100))

my_data
##   citizens cities income elevation  density wealth
## 1        1      1   7303  1405.193 702.5966   7203
## 2        2      1   6772  1405.193 702.5966   6672
## 3        3      1   8510  1405.193 702.5966   8410
## 4        4      2   5089  1138.607 569.3035   4989
## 5        5      2   6965  1138.607 569.3035   6865
## 6        6      2   6243  1138.607 569.3035   6143

# Tidyverse integration

Because the functions in fabricatr take data and return data, they are easily slotted into a tidyverse workflow:

library(dplyr)

# letting higher levels depend on lower levels

my_data <-
  fabricate(
    cities = level(N = 2, elevation = runif(n = N, min = 1000, max = 2000)),
    citizens = level(N = c(2, 3), income = round(elevation * rnorm(n = N, mean = 5)))
  ) %>%
  group_by(cities) %>%
  mutate(pop = n())

my_data
## # A tibble: 5 x 5
## # Groups:   cities [2]
##   cities elevation citizens income   pop
##    <chr>     <dbl>    <chr>  <dbl> <int>
## 1      1   1842.06        1   9577     2
## 2      1   1842.06        2   7305     2
## 3      2   1783.04        3  11146     3
## 4      2   1783.04        4   9189     3
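## 5      2   1783.04        5   9696     3

fabricate can also receive data from a pipe, in which case the data frame on the left-hand side becomes its data argument:

my_data <-
  tibble(Y = sample(1:10, 2)) %>%
  fabricate(lower_level = level(N = 3, Y2 = Y + rnorm(N)))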
my_data
##    Y lower_level        Y2
## 1 10           1 10.495408
## 2 10           2  9.916366
## 3 10           3  9.998652
## 4  6           4  5.288414
## 5  6           5  6.462447
## 6  6           6  6.161634