Could synthetic data redefine the way we share statistical insights?

24 January 2023

UNECE is releasing guidance to help producers of official statistics to produce synthetic data.

Data create value by guiding the decisions of governments, businesses and individuals. This value is often described in terms of the insights contained within these data, which can, among other things, help to target resources where they are most needed, to evaluate public policies, or to plan business activities. More detailed data may contain more insights and be more valuable, depending on a user’s requirements.

Producers of statistics aim to fulfil the need for these insights, and wherever possible to give equal access to them. However, they face a serious constraint in facilitating access to their data: the need to protect the confidentiality of the individuals who supply those data, for example by filling in statistical surveys.

This constraint is reflected in the main products that producers of statistics offer to the users of their data.

One of these products is the aggregated summaries of the data that one typically finds on the website of a country’s statistical office, and which are referred to as statistical tables. These usually contain total or average values of a variable of interest, such as workers’ average salary, perhaps broken down by other variables, such as occupation type.

Providing aggregate values protects the identity of a particular person who supplied their data, but it also reduces the amount of detail provided to a data user. While such figures might be suitable for including in a written report, performing sophisticated analyses for deeper insight necessitates more detailed data.

The other product that statistical offices may provide is “microdata”, which contain detailed information at the level of specific individuals, and which are provided only to trusted researchers on a promise of confidentiality.

Microdata are particularly vulnerable to disclosing personal information: even if people’s names are removed from a dataset, it may be possible to establish their identities by comparing the dataset with other sources of data. Therefore, while microdata can contain far more insights than the aggregated summaries described above, fewer users have the opportunity to access and use them.

In short, statistics producers either limit the amount of data they give to users, or they limit the sorts of users who can access it, and in different ways this limits the insights (and analytical value) that can be supplied to data users overall. The growing hunger for detailed data is placing increasing pressure on producers of statistics.

Synthetic data: A possible solution

There is another, and largely unexplored, alternative for satisfying the needs of those users who require microdata: synthetic data.

This can be thought of as a way of detaching the data from the insights they contain and supplying these insights to researchers within an artificial data set that contains data about non-existent people. The word synthetic refers to the way these data are synthesised from real data using a model or algorithm. Clearly, such data would need to mimic the real data in such a way that conclusions derived from them would be the same as, or very similar to, those derived from real data.

This is an emerging topic, of increasing interest to producers of statistics, as synthetic data could potentially allow detailed microdata to be released safely, while preserving their analytical value for the end user.

The key word here is "potentially" because data synthesis must be performed carefully, giving consideration to the intended analytical use of those data. The method chosen to synthesise the data must maximise how similar their analytical properties are to the real data, while reducing to a negligible level the chances of identifying any real individual.

While synthetic data sets refer to non-existent people, their attributes cannot be too close to those of specific real people or else personal information could be deduced.
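One simple way to picture the "too close" problem is to measure, for each synthetic record, the distance to its nearest real record and to flag any that sit almost on top of a real person. The sketch below is only an illustration under stated assumptions: the records, the per-variable scales and the threshold are all invented, the function names are hypothetical, and a real disclosure-risk assessment would involve considerably more than this single check.

```python
import math


def closest_real_distance(synthetic_row, real_rows, scales):
    """Smallest scaled Euclidean distance from one synthetic record to any real one."""
    return min(
        math.sqrt(sum(((s - r) / k) ** 2 for s, r, k in zip(synthetic_row, real_row, scales)))
        for real_row in real_rows
    )


def flag_near_copies(synthetic_rows, real_rows, scales, threshold):
    """Return the synthetic records lying within `threshold` of some real record."""
    return [
        row for row in synthetic_rows
        if closest_real_distance(row, real_rows, scales) < threshold
    ]


# Hypothetical (age, salary) records, invented for illustration only.
real_rows = [(34, 52000), (51, 78000), (27, 31000)]
synthetic_rows = [(34, 52100), (45, 60000), (62, 40000)]
scales = (10, 10000)  # assumed: rough spread of each variable

flagged = flag_near_copies(synthetic_rows, real_rows, scales, threshold=0.1)
print(flagged)
```

Here the first synthetic record is flagged because it nearly duplicates a real one, while the other two are not. Where to set the threshold is a judgment call that depends on the data and on the level of risk considered acceptable.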

Striking this balance is not easy, and statistical organizations that wish to adopt synthetic data require guidance on best practices to follow, to help them decide which synthesis method is appropriate in a given implementation scenario, and to understand the pros and cons of each one.

It is for this reason that UNECE has published new guidance, which provides many of the main elements needed to start exploring the implementation of synthetic data, and which provides a foundation to motivate the further development of this field going forward.

What’s the current status of synthetic data in statistical organizations? What does the future hold?

Producers of statistics have been fairly cautious in their applications of synthetic data, and with good reason, given that the confidentiality of individuals’ data, and accuracy of any conclusions derived, are paramount.

Some, such as Statistics Canada and the Scottish Longitudinal Study, have supplied researchers with synthetic data for exploratory analysis, which is useful either to avoid travelling to a secure location where the real microdata are accessed, or else to commence research while waiting for authorization of a request to access real data. In both cases, it is mandated that final publication-quality results must be obtained by running the researchers’ models on the real microdata.

In another example, Statistics New Zealand has published synthetic microdata that mimic some of the data from its census, made available for purposes of teaching and learning, among others.

The limitations that these organizations place on the use of their synthetic data provide an automatic safeguard against any discrepancies between findings derived from synthetic versus real data. But is it possible to go further, and to release synthetic data with fewer restrictions on their use?

To some extent, this is already possible from a technical standpoint. There are a variety of different methods for synthesizing data, and some of them can do it in such a way that they will closely reproduce certain results from the real data. For example, it is possible to produce synthetic microdata of a country’s population such that the average age of that population is almost identical to what would be obtained if that calculation were performed using real census data. One could also preserve correlations between variables within that data set if the data user needed them.
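As a rough illustration of preserving an average and a correlation, the sketch below fits a two-variable normal distribution to some data and then draws entirely new records from it. It is a toy under stated assumptions and not the method recommended by any particular organization: the "real" ages and incomes are themselves simulated stand-ins, the function names are hypothetical, and nothing here addresses disclosure risk.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def synthesize(ages, incomes, n, rng):
    """Draw n artificial (age, income) records from a bivariate normal
    fitted to the input, so means, spreads and correlation carry over."""
    mu_a, sd_a = statistics.fmean(ages), statistics.stdev(ages)
    mu_i, sd_i = statistics.fmean(incomes), statistics.stdev(incomes)
    rho = pearson(ages, incomes)
    records = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        age = mu_a + sd_a * z1
        # Mixing z1 and z2 this way gives the pair correlation rho.
        income = mu_i + sd_i * (rho * z1 + math.sqrt(1 - rho ** 2) * z2)
        records.append((age, income))
    return records


rng = random.Random(0)
# Stand-in for confidential microdata: simulated values, not real people.
real_ages = [rng.uniform(18, 80) for _ in range(5000)]
real_incomes = [20000 + 500 * a + rng.gauss(0, 8000) for a in real_ages]

synthetic = synthesize(real_ages, real_incomes, 5000, rng)
syn_ages = [a for a, _ in synthetic]
syn_incomes = [i for _, i in synthetic]

# Compare the preserved properties: mean age, then age-income correlation.
print(round(statistics.fmean(real_ages), 1), round(statistics.fmean(syn_ages), 1))
print(round(pearson(real_ages, real_incomes), 2), round(pearson(syn_ages, syn_incomes), 2))
```

The two printed pairs should be close to each other, but only the properties that were modelled are preserved. The synthetic ages, for instance, follow a bell curve and can fall outside the 18 to 80 range of the input, so an analysis that depends on the shape of the age distribution would give different answers.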

However, this approach to producing synthetic data requires prior knowledge of how those data are going to be used, and which analytical properties to preserve, so may not be appropriate if the intention is to provide a data set for more general use.

The field of synthetic data, and the possibilities that it holds for statistical offices and broader data users, are continuously expanding. With the emergence of deep learning, a class of machine learning algorithms, new possibilities are arising for producing synthetic data from a broader range of data sources, such as satellite images or unstructured data.

More immediately (and perhaps more realistically) the impact of releasing synthetic data could be to level the playing field to allow advanced analytical research to be done by types of users beyond those who would normally be permitted to access real microdata (which has often been mainly those who work in academia, or government researchers who already have access to it).

Synthetic data could be used more by companies and private individuals to perform sophisticated analyses (such as machine learning), to obtain analytical value from those data. For example, one could imagine an entrepreneur doing so to identify business opportunities for a start-up company. Given the amount of resources used to collect statistical data, it is important to maximize the extent of their use.

The new UNECE publication, Synthetic Data for Official Statistics: A Starter Guide, incorporates contributions of experts from academia and the commercial world, as well as national experts from Australia, Canada, Germany, the Netherlands, New Zealand, Norway, the United Kingdom and the United States of America. It arose from a project led by Statistics Canada under the auspices of the High-Level Group for the Modernisation of Official Statistics, the modernization arm of UNECE’s Conference of European Statisticians.