Synthetic Data for Official Statistics - A Starter Guide

Producers of official statistics face a difficult balancing act in managing access to the data they collect. They must protect the confidentiality of the individuals and businesses that provided the data, while under pressure to release ever more detailed datasets that give users greater analytical insight.

Traditionally, national statistical offices have given trusted users (such as academics) access to some microdata at the level of individuals or businesses, while publishing aggregate statistical tables for other users. This approach is imperfect: many users do not get the level of detail they are seeking, while vetting and managing trusted users is time-consuming and does not guarantee that they will never misuse or lose the data they access.

However, there is another way to give users analytical insight: synthetic data, which may be advantageous for certain use cases. Synthetic data can be simulated so that it has many of the same properties as the original dataset and allows the same results and insights to be derived, but with a much lower risk of revealing information about the individuals to whom the data relate.
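To make the idea concrete, here is a minimal sketch in Python of one very simple way to simulate synthetic records. It is an illustration under stated assumptions, not a method prescribed by this guide: the "original" dataset of (age, income) records is made up, the function names `fit_bivariate_normal` and `sample_synthetic` are hypothetical, and the choice of a bivariate normal model is ours. The sketch summarises the original data by its means, standard deviations and correlation, then draws new records from that summary alone.

```python
import math
import random
import statistics


def fit_bivariate_normal(rows):
    """Summarise two-column data by means, standard deviations and correlation."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    return mx, my, sx, sy, cov / (sx * sy)


def sample_synthetic(params, n, rng):
    """Draw n new records from a bivariate normal with the fitted parameters."""
    mx, my, sx, sy, rho = params
    out = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * z1
        # Mix z1 and z2 so that x and y have correlation rho.
        y = my + sy * (rho * z1 + math.sqrt(1 - rho ** 2) * z2)
        out.append((x, y))
    return out


rng = random.Random(0)

# Made-up stand-in for confidential microdata: (age, income) pairs.
original = []
for _ in range(2000):
    age = rng.uniform(20, 65)
    income = 15000 + 600 * age + rng.gauss(0, 5000)
    original.append((age, income))

params = fit_bivariate_normal(original)
synthetic = sample_synthetic(params, 2000, rng)
synthetic_params = fit_bivariate_normal(synthetic)

print("original  mean age, mean income, correlation:",
      round(params[0], 1), round(params[1]), round(params[4], 2))
print("synthetic mean age, mean income, correlation:",
      round(synthetic_params[0], 1), round(synthetic_params[1]),
      round(synthetic_params[4], 2))
```

In this toy example the synthetic records reproduce the averages and the age-income relationship of the original data, yet no synthetic row is copied from an original one. A model this simple also loses detail (for instance, the shape of the age distribution), and matching summary statistics does not by itself establish that disclosure risk is acceptably low.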

If you are involved in managing users’ access to official statistics, and would like to have another option for dealing with your data access dilemmas, this guide will give you what you need to get started.