datagen Introduction

I recently gave a talk at a meetup in London and posed the following set of questions to the audience:

“Who develops applications that talk to a database?”

70% of hands up.

“Who develops applications that talk to a database and tests with production amounts of data?”

60% of hands up.

“Who develops applications that talk to a database and tests with production amounts of data during development?”

7% of hands up.

Now this wasn’t a large audience… 7% of attendees represented about 1.75 people, so the sample size was pretty small. However, it was interesting to see that only a fraction of those developing database-backed applications were developing their code with production amounts of data in mind.

Towards the start of my career, before I’d started to test with large datasets, I was bitten in the way the 1.75 people in that audience will eventually be. My application needed a few indices and was generally a bit shit, but it wasn’t until it had reached a staging environment that I realised there was an issue. By then, a lot of performance oversights and assumptions were baked in, and it took some unwinding.

The best way to predict how your applications will behave when they’re handling production data sets is to not rely on predictions at all, and test with production data sets.

I find one of the least enjoyable (although weirdly satisfying) parts of developing a database-backed application is filling my dev databases with fake data. In the past, I’ve created custom data generator CLIs to fill related tables with information to benchmark against. I’ve spent more time on tools like this than I care to admit.

This is tedious and costly.

Enter datagen

I wrote datagen of an evening in April 2019. It was (and still is) hacky, but it gets the job done, and that job is to load a lot of data into dev/test databases quickly. It’s the tool I always wanted and needed but never had, hence all those wasted hours creating very specific tools that were good for one very specific application.


Here are the main random generators available with datagen. For more information on all of them, take a look at the repo:

{{string 5 10 "x-" "abcABC"}}

{{stringf "" 5 10 "abc" 10000 20000}}

{{int 5 10}}

{{date "2019-01-02" "2019-12-31" "" }}

{{float 0.0 1.0}}


{{set "a" "b" "c"}}

{{wset "a" 60 "b" 30 "c" 10}}

{{fset "animals.txt"}}

Create some tables

The following is a simple owners and pets example. An owner can have 0:* pets and a pet can have 1 owner. In this example, we’re restricting pets to being either a cat or a dog.


USE "sandbox";

CREATE TABLE "owner" (
    "id" UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    "email" STRING NOT NULL,
    "date_of_birth" TIMESTAMP NOT NULL
);

CREATE TABLE "pet" (
    "id" UUID DEFAULT gen_random_uuid(),
    "pid" UUID NOT NULL,
    "name" STRING NOT NULL,
    "type" STRING NOT NULL,
    PRIMARY KEY ("pid", "id"),
    CONSTRAINT type_v1 CHECK ("type" IN ('cat', 'dog'))
) INTERLEAVE IN PARENT "owner" ("pid");

Write a datagen load script

datagen uses Go’s text/template engine along with a bunch of custom helper functions to generate SQL with random data. Here’s the complete load script and I’ll follow with a line-by-line summary of what it’s doing.

-- NAME owner
-- REPEAT 1
-- Inserts 5 rows into the owner table in 1 statement.
insert into "owner" ("email", "date_of_birth") values
{{range $i, $e := ntimes 5 }}
	{{if $i}},{{end}}
	(
		'{{stringf "%s.%s" 3 7 "abcdefg" 3 7 "hijklmnop"}}',
		'{{date "1900-01-01" "now" ""}}'
	)
{{end}}
returning "id";

-- NAME pet
-- REPEAT 2
-- Inserts 10 rows into the pet table across 2 statements.
insert into "pet" ("pid", "name", "type") values
{{range $i, $e := ntimes 5 }}
	{{if $i}},{{end}}
	(
		'{{each "owner" "id" $i}}',
		'{{string 3 10 "" "abcdefghijklmnopqrstuvwxyz"}}',
		'{{wset "dog" 60 "cat" 40}}'
	)
{{end}};

datagen works in “blocks”, which are separated by empty lines. Owing to the empty line between the two insert statements, this load script has two blocks: one named “owner” and one named “pet”.
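The block-splitting idea can be sketched in a few lines of Go. Note that the function name, regular expressions, and structure here are illustrative only, not datagen’s actual internals: split on empty lines, then key each block by the name found in its “-- NAME” comment.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// splitBlocks splits a load script into blocks on empty lines and
// indexes each block by the name in its "-- NAME" comment.
// This is a sketch of the idea, not datagen's implementation.
func splitBlocks(script string) map[string]string {
	blocks := map[string]string{}
	name := regexp.MustCompile(`--\s*NAME\s+(\S+)`)
	for _, b := range regexp.MustCompile(`\n\s*\n`).Split(strings.TrimSpace(script), -1) {
		if m := name.FindStringSubmatch(b); m != nil {
			blocks[m[1]] = b
		}
	}
	return blocks
}

func main() {
	script := "-- NAME owner\ninsert into \"owner\" ...;\n\n-- NAME pet\ninsert into \"pet\" ...;"
	fmt.Println(len(splitBlocks(script))) // 2
}
```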

Here’s a breakdown of each line from the first block and a couple of interesting lines from the second:


-- REPEAT 1

Execute the block once.

-- NAME owner

The name of the block is “owner”. This allows the results from the block to be used in subsequent blocks.

insert into "owner" ("email", "date_of_birth") values

The start of a standard SQL insert statement.

{{range $i, $e := ntimes 5 }} ... {{end}}

A Go template range pipeline in conjunction with a custom helper called ntimes. ntimes takes an argument (in this case 5) and creates a slice of that many empty structs ([]struct{}), which occupy zero memory. The range pipeline then ranges over this slice and generates 5 copies of whatever it finds between the start of the pipeline and {{end}}, which in this case will be multi-row DML.

{{if $i}},{{end}}

This ensures that the multi-row DML being generated is comma-separated but only between items (e.g. “a,b,c” and not “a,b,c,” which would be invalid SQL).
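The ntimes and comma-separation pattern can be reproduced with plain text/template. This is a minimal sketch, in which the ntimes helper is reimplemented locally (it is not datagen’s own code):

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// render executes a tiny template using an ntimes helper like the one
// described above: a slice of n empty structs for range to walk over.
// {{if $i}} is false only for index 0, so commas appear between items.
func render() string {
	funcs := template.FuncMap{
		"ntimes": func(n int) []struct{} { return make([]struct{}, n) },
	}
	const script = `{{range $i, $e := ntimes 3}}{{if $i}},{{end}}(row {{$i}}){{end}}`
	t := template.Must(template.New("demo").Funcs(funcs).Parse(script))
	var b strings.Builder
	if err := t.Execute(&b, nil); err != nil {
		panic(err)
	}
	return b.String()
}

func main() {
	fmt.Println(render()) // (row 0),(row 1),(row 2)
}
```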

( ... )

The brackets used to contain the values being inserted.

'{{stringf "%s.%s" 3 7 "abcdefg" 3 7 "hijklmnop"}}',

stringf is a function which generates a random value based on a format string. In this example, we’re creating an email address with 2 generated parts that are separated by a full-stop. The first part (identified by the first “%s” verb) will generate between 3 and 7 characters from the set “abcdefg” and the second part (identified by the second “%s” verb) will generate between 3 and 7 characters from the set “hijklmnop”.

'{{date "1900-01-01" "now" ""}}'

date is a function which generates a random date within a range. In this example, we’re generating a random date between January 1st 1900 and the current date and time. The last argument (an empty string) denotes that we’re using datagen’s default date format, hence the generated dates appear in “2006-01-02” format. The default can be overridden with a command line argument. If you’d like to use a different date format for a single call, you can override the value as follows: {{date "19000101" "now" "20060102"}}.

Note that the dates generated will need to be recognised by the database, hence “2006-01-02” (Go’s reference date layout) was chosen as the default date format.

returning "id";

This instructs datagen to remember all ID values inserted into the database and persist them against the block’s name “owner” for use in subsequent blocks. To make more columns available to subsequent blocks, simply add more columns to the returning statement, separating them with commas.

Note that this only works with databases that support the “returning” statement.

'{{each "owner" "id" $i}}',

each is a function which references all rows inserted by a previous block. In this case, what we’re referencing is “each id returned by the owner block”. The “$i” argument supplies the current row index, so each generated pet references the owner rows in turn. If you only need one column value and don’t need to generate values for each of the rows created by a previous block, use ref. If you’d like to reference multiple columns from the rows generated by a previous block, use row.

'{{wset "dog" 60 "cat" 40}}'

wset is a function that will instruct datagen to select between a set of options with a weight attached to each item, making each more or less likely to be selected. You can use the set function if you’d like each item to be equally likely to be selected or fset if you’d like to store your enumeration in a file and reuse between calls.

Loading data

To execute the script, run datagen from the command line:

$ datagen -script in.sql -driver postgres -conn postgres://root@localhost:26257/sandbox?sslmode=disable

We’ve just inserted 5 rows into the owner table and 10 rows into the pet table. The each function ensures that every owner record is assigned pets, and because we’re inserting twice as many pet records as owners, each owner gets 2 pets.

select p.pid, count(*) from owner o join pet p on = p.pid group by p.pid;
pid count
1c26e929-cc55-43ba-8826-0cde1c05c45a 2
2f57581a-9dfc-4c27-a67f-046b00ada04c 2
55040a35-72e1-41a5-a83f-5fadb7d0afc6 2
7ceaaded-3aee-4881-bf39-aad8d6efdbdd 2
d8cd830f-1393-492e-877e-5d92bdcac83b 2

If the product of --REPEAT and ntimes is less than the total number of rows in a previous table, you’ll simply get fewer rows in the next table. In this example, --REPEAT is 1 and ntimes is 3 (1 x 3 = 3):

pid count
1c26e929-cc55-43ba-8826-0cde1c05c45a 1
2f57581a-9dfc-4c27-a67f-046b00ada04c 1
55040a35-72e1-41a5-a83f-5fadb7d0afc6 1

If the product of --REPEAT and ntimes is greater than the total number of rows in a previous table, you’ll get more rows in the next table. In this example, --REPEAT is 2 and ntimes is 7 (2 x 7 = 14):

pid count
095f9ff6-e9a4-48de-b86f-bcbad96466bb 3
6854cf04-dfb0-4163-8bfd-deb9404d6b34 3
c88df58a-3dc7-4b9c-9190-e36158e80d12 3
cdf17a11-88cc-43bc-a09d-114803c911dc 3
eae4672f-ec77-4036-8498-863b68ad5748 2

We could make this more random by using the ref or row functions:

'{{ref "owner" "id"}}'

'{{row "owner" "id" $i}}'

pid count
095f9ff6-e9a4-48de-b86f-bcbad96466bb 2
6854cf04-dfb0-4163-8bfd-deb9404d6b34 1
c88df58a-3dc7-4b9c-9190-e36158e80d12 4
cdf17a11-88cc-43bc-a09d-114803c911dc 3

Note that not all of the owners will have pets in this case, as ref looks for a random owner, rather than going through them one-by-one.

I’ll follow up with a deeper dive in my next post 👋