
UNITED NATIONS ECONOMIC COMMISSION FOR EUROPE

CONFERENCE OF EUROPEAN STATISTICIANS

Expert Meeting on Statistical Data Confidentiality

26-28 September 2023, Wiesbaden

Title: Intruder testing – an empirical measure of the quality of Census 2021

England and Wales Disclosure Control methods

Author(s): Samantha Trace (Office for National Statistics), Dominic Nelson (Office for National Statistics)

e-mail [email protected]

Abstract

By law, the Office for National Statistics (ONS) must protect the confidentiality of respondents to

Census 2021. We protected the confidentiality of individuals' data in three ways: swapping records

between areas, applying a cell key method to each table, and applying disclosure rules in deciding

which tables could be published. To assess the effectiveness of these methods and provide assurance,

an intruder test was performed on Census 2021 data using a secure version of the outputs system. 51

intruders were recruited to attempt to identify individuals in the planned data outputs. 30 intruders took part and 81 claims were made; more than half of these claims (41/81) were incorrect or only partially correct. Further steps were taken to reduce the risks identified by the test, making the data from which the majority of these claims were made no longer accessible through the Create a Custom Dataset system. This gave

the Office for National Statistics evidence there was sufficient uncertainty in the data to meet the

standard required by legal guidance and we would meet our ethical duty to protect confidentiality.


1 Introduction

The Office for National Statistics (ONS) has legal obligations under the Statistics and Registration

Service Act (SRSA, 2007) Section 39 and the Data Protection Act (2018) that require the ONS not to

reveal the identity of, or private information about, an individual or organisation.

We have a pledge to respondents that the information will only be used for statistical purposes, so we

must look after and protect the information that is provided to us. Moreover, a disclosure breach

could lead to criminal proceedings against an individual who has released or authorised the release of

personal information, as defined under Section 39 of the SRSA.

The SRSA defines "personal information" as information that identifies a particular person if the

identity of that person:

• is specified in the information

• can be deduced from the information

• can be deduced from the information taken together with any other published information

Therefore, in order for data to be released, the risk of identifying individuals from it, potentially with

additional publicly available information, must be minimal.

Intruder testing is an empirical check that the measures applied to the data have made it sufficiently difficult to identify individuals within it. This involves recruiting ‘friendly intruders’ who emulate the actions of potential ‘real intruders’ upon the data.

The standard that needs to be met is suggested by the National Statistician’s Guidance, “the design

and selection of intruder scenarios should be informed by the means likely reasonably to be used to

identify an individual in the statistic”.

So, intruder tests are designed to measure what could be done with the means likely to be available to an opportunistic attacker; they do not have to cover every imaginable scenario, just the most probable.

The 2011 Census outputs were tested in this way, and the findings were useful in providing assurance that the disclosure control measures used on the data were adequate, and provided evidence as to what further steps should be taken to reduce disclosure risk. Other ad-hoc exercises have been undertaken by the ONS as required since, with the same purpose – to determine the level of identification risk in a dataset.

For Census 2021, new disclosure control methods were required for a new output system. In addition to the imputation of missing records, done to make the Census as representative as possible (which also adds doubt as to whether a particular record is ‘real’ or not), there were new measures in place to protect the data:

• Targeted Record Swapping – swapping households that are marked as unique in the data with

a similar record in the local area. The geographies were changed for between 7% and 10% of

households, and for between 2% and 5% of individuals in communal establishments.

• Cell Key Perturbation – this adds noise to the figures, making slight changes to cell counts (including zero cell counts) by a method which means that where the same records are presented in a cell, the published count should remain consistent. A typical dataset would have around 14% of cell counts perturbed by a small amount, and small counts were more likely to have been perturbed than large counts.


• Disclosure rules (in the Create a Custom Dataset system) – automated rules, including measures of how many small counts are in the table, that can stop data being released for an area. (A minimal illustrative sketch of the cell key and small-count rule ideas follows this list.)
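To make the last two measures more concrete, the following is a minimal illustrative sketch (in Python) of the cell key idea and a simple small-count rule. The way record keys are derived, the perturbation lookup table and the rule thresholds are all assumptions for illustration only; they are not the parameters or algorithms used for Census 2021.

import hashlib

def record_key(record_id: str, max_key: int = 256) -> int:
    # Assign each record a fixed pseudo-random key (assumed derivation, for illustration).
    return int(hashlib.sha256(record_id.encode()).hexdigest(), 16) % max_key

def perturbed_count(record_ids: list[str], ptable: dict[tuple[int, int], int],
                    max_key: int = 256) -> int:
    # Cell key perturbation: the noise depends only on the true count and the cell key
    # (the sum of record keys), so the same set of records always yields the same
    # published count, whichever table the cell appears in.
    true_count = len(record_ids)
    cell_key = sum(record_key(r, max_key) for r in record_ids) % max_key
    noise = ptable.get((true_count, cell_key), 0)  # illustrative perturbation lookup
    return max(true_count + noise, 0)

def passes_small_count_rule(cell_counts: list[int], small: int = 3,
                            max_small_share: float = 0.5) -> bool:
    # Illustrative disclosure rule: refuse the table for an area if too large a share
    # of its cells are small (1 to `small`) counts; thresholds are assumptions.
    small_cells = sum(1 for c in cell_counts if 0 < c <= small)
    return small_cells / len(cell_counts) <= max_small_share

# Example: the same three records always receive the same perturbation, and a table
# with many small counts would be refused.
ptable = {(3, 41): 1}  # made-up lookup entry mapping (count, cell key) -> noise
print(perturbed_count(["person-001", "person-007", "person-042"], ptable))
print(passes_small_count_rule([0, 1, 2, 12, 25, 40]))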

These methods were intended to combine as a ‘lighter touch’ approach, allowing some detail to be

possible at low level geography, whilst maintaining the usefulness of the data within the new Create a

Custom Dataset (CACD) system, and other census outputs. The CACD system allows users to create

their own multivariate datasets, so the rules are set to prevent the possibility of identifying a single

record and building up a list of potential attributes. The level of identification risk should still be minimal, whether public or private information is used.

2 The Intruder Test

2.1 Method

51 intruders, all ONS employees, were recruited. All had appropriate security clearances and

consented to an enhanced non-disclosure agreement. They were given training on how to use the

output system, and possible methods of working against our statistical disclosure controls. A safe area

of an approved file management system was set up, and they were given access to individualised

folders to record their findings and keep notes.

A version of the planned outputs system was created on a secure internal-access platform and loaded

with the usual resident database. This is the main basis for Census outputs as it includes all people

who are ‘usually resident’ at the enumeration address at the time of the census. This was also

programmed with all the current planned variables and classifications for those variables. A version of

the planned statistical disclosure rules was placed in this system, to auto-control outputs requested by

intruders, and deny access if the output did not pass these rules. The system had built-in perturbation, so it automatically created outputs with some values slightly changed.

The data placed in the system had targeted swapping already applied and imputed records present, just

as it would be when published. The main Census 2021 geographies were available in this system; the smallest geography used was Output Area (OA), an area with at least 100 persons in it, though more typically around 400 persons.

Intruders were given individual access to the system, encouraged to collaborate on a private Teams

channel, and to share resources such as web pages, hints and tips. An errors log was set up to record system issues. For each claim, intruders recorded the details, including the geography, variables and classifications used, the name and address of the individual claimed to have been found, and the confidence level in the identification as a percentage.

Claims were transcribed from the individual file folders to a single sheet that the checkers had access

to. These checkers were from a different team to ensure the data was fully firewalled from the

intruders, and no actual disclosure would result from the exercise.

The checkers had access to record level data, so could determine whether a claim was correct, partial,

or incorrect. A correct claim would match on name and approximate address. Inaccurate address

matches were counted as correct so long as they would have been within the geographical area used to

make the claim.

Inaccurate name matching was counted as incorrect. A partial match would be where a claim was made on a cell count of 1, but more records were actually in that cell and had been perturbed down to 1.
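The checking itself was a manual exercise; purely as a summary of the rules just described, a hypothetical scoring function might look like the sketch below (the field names and inputs are invented for illustration).

from dataclasses import dataclass

@dataclass
class Claim:
    # Hypothetical fields mirroring what was recorded and checked for each claim
    name_matches: bool               # does the named person match the record in the cell?
    address_within_claim_area: bool  # is the address within the geography used for the claim?
    apparent_cell_count: int         # the count the intruder saw (typically 1)
    true_cell_count: int             # the unperturbed count known only to the checkers

def score_claim(claim: Claim) -> str:
    # A sketch of the manual checking: incorrect if the name does not match (or the
    # address falls outside the claimed area); partial if the apparent '1' was a larger
    # cell perturbed down to 1; otherwise correct.
    if not (claim.name_matches and claim.address_within_claim_area):
        return "incorrect"
    if claim.apparent_cell_count == 1 and claim.true_cell_count > 1:
        return "partial"
    return "correct"

# Example: the named individual matches, but the apparent '1' was a cell of 3
# perturbed down to 1, so the claim is only partially correct.
print(score_claim(Claim(True, True, apparent_cell_count=1, true_cell_count=3)))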


2.2 Limitations

We had considered engaging a third party to take part in the test. However, we could not be sure of the start time, and few companies engaged in exactly this sort of testing could have gained security clearances in time, so it was deemed impractical to involve a third party in this exercise. Therefore, there may be some organisational biases in our exercise.

Although attempts were made to recruit people from more sparsely populated areas of England and

Wales, most people were still clustered geographically around ONS offices and reflected the socio-demographic mix of ONS staff rather than the general population.

Intruders also had to use their spare time around their regular work, and the exercise ran in August

when many took leave, although it took place over three weeks to allow more people to participate.

The dataset examined did not cover the full range of planned Census outputs. The final system includes not just Usual Residents, but also Usual Residents in Households and Communal Establishments, Households and Household Reference Persons. The Usual Resident dataset used was taken to be a

sufficient test of the general level of risk in the data.

2.3 Results

2.31 Claims

81 identification claims were made, excluding duplicates. These claims are where an intruder highlighted a ‘1’ cell count in a dataset, gave the details of this, and claimed they knew which person it related to. Two claims listed various methods of approaching the same identification; in these cases this was still counted as one claim, and measures such as cell count were taken from the first table stated.

40/81 or 49% of identification claims were correct (the intruder correctly named an individual in a cell).

8/81 or 10% of identification claims were partially correct (the intruder correctly named an individual in a cell of apparent size 1, but the true cell count was greater than 1 due to cell key perturbation, so the cell could have represented any of the people in it).

33/81 or 41% of identification claims were incorrect (the record marked in the cell did not relate to the individual named).

No attribute claims were made; an attribute claim is where an intruder claims to have found something new about a person through the data presented.

Of the initial 51, 12 dropped out, citing workload or holiday as reasons, and a further 9 filed no notes

and made no claims. Of the 30 intruders that took part, 6 (20%) did not make any claims. Reasons

cited included not being able to claim anything with certainty; some may also have lacked time to spend on the project.

2.32 Confidence


Figure 1: Confidence, correctness and number of claims

This histogram shows numbers of claims by the percentage confidence the intruder reported in the

claim, banded by whether they were correct, partially correct or incorrect.

Reported confidence in claims ranged from 7.5% to 100%. The mean confidence placed in a claim was 73.6%; the median was 80%.

2.33 Cell Counts and Correctness

The cell count is the number of cells (rows × columns) present in the table used to make the claim. A wide range of table sizes was used to inform claims (range 7–2100, mean 183, median 182).

Figure 2: Cell counts and correctness

The scatter plot plots the number of cells in the claim dataset against correctness, with claims rated by percentage correctness: partially correct claims are counted as 50% correct, fully correct claims as 100% correct. One outlier (cell count 2100) was removed. This shows a weak positive correlation (R² = 0.0986); with the outlier included, the relationship was zero. This could suggest that higher cell counts may increase the possibility of identification within limits, but very high cell counts may not.
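For reference, the quoted R² can be computed from claim-level (cell count, correctness) pairs with an ordinary least-squares fit; the sketch below shows one way to do this and to drop a large-table outlier. The function and its threshold argument are illustrative only, and the underlying claim-level data are not reproduced here.

import numpy as np

def r_squared(cell_counts, correctness, exclude_above=None):
    # R-squared of a simple linear fit of correctness (0, 50 or 100) on cell count,
    # optionally excluding tables with cell counts above a threshold (an assumed
    # way of removing the outlier mentioned above).
    x = np.asarray(cell_counts, dtype=float)
    y = np.asarray(correctness, dtype=float)
    if exclude_above is not None:
        keep = x <= exclude_above
        x, y = x[keep], y[keep]
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    return 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

# Usage (with the analyst's own claim-level data):
# r_squared(counts, scores)                      # all claims
# r_squared(counts, scores, exclude_above=2000)  # outlier removed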

2.34 Variables Used

To assess which variables were most likely to result in a claim, and which in a correct claim, the claims were coded to variable type. Any table constructed with a single classification making up the bulk of the cells would be coded to that variable; for example, any claim using single year of age, or single year of age plus another less detailed classification such as sex, was coded to ‘age’, while any claim using a 3-part country of birth classification, 10-part age and sex would be coded ‘multivariate’. Variables with only a few claims each, such as country_of_birth, were coded to ‘other’.
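The coding itself was a judgement-based, manual step; the sketch below is one loose, hypothetical way of expressing the rule in code, assuming each claim records the classifications used and their numbers of categories, and treating a classification as ‘detailed’ above an assumed category threshold.

def code_claim(classifications: dict[str, int], detailed_threshold: int = 20) -> str:
    # Code a claim to the single 'detailed' classification making the bulk of the
    # table's cells, or to 'multivariate' otherwise. The threshold is an assumed
    # operationalisation of 'detailed', not the definition used in the exercise.
    detailed = [name for name, n_categories in classifications.items()
                if n_categories >= detailed_threshold]
    if len(detailed) == 1:
        return detailed[0]
    return "multivariate"

# Single year of age (about 100 categories) with sex codes to 'age', while a 3-part
# country of birth classification with 10-part age and sex codes to 'multivariate'.
print(code_claim({"age": 100, "sex": 2}))
print(code_claim({"country_of_birth": 3, "age": 10, "sex": 2}))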

Table 1: Number of claims by variables used in the datasets those claims came from

Variable       Number of Claims   Number of Correct Claims   % Claims that were correct
Age            35                 21                         60%
Multivariate   28                 12                         43%
Occupation      9                  2                         22%
Other           8                  5                         63%

The table shows that claims where age was the main component had the highest number of claims and the highest number of correct claims. Multivariate tables were less than 50% likely to yield a correct claim, and occupation was unlikely to result in a correct claim. The main source of correct claims in the ‘Other’ category was claims using country_of_birth.

2.35 Geography

Table 2: Number and correctness of claims by Geography used in datasets

Geography   Number of Claims   Number of Correct Claims   % of Claims that were correct   Mean % confidence   Mean cell count
OA          67                 34                         51%                             75                  142
LSOA         9                  5                         56%                             73                  248
MSOA         5                  1                         20%                             47                  610

The largest geography used for any claim was Middle layer Super Output Area (MSOA). Output Area (OA) was the main area of risk, with the bulk of claims (65/81 or 80%) being made using OA datasets. It was also the focus of correct claims (34/40 or 85%). There were few claims at MSOA, and only one correct claim. Lower layer Super Output Area (LSOA) again was used in few claims, and though the majority of these were correct, with such a small sample it cannot be concluded that claims at this level would generally be more or less likely to be correct.

Table 3: Subject of the Disclosure Claim

Subject of claim        Number of Claims   Correct Claims   % of Claims correct
Family and friends      59                 25               42%
People from news/web    16                 11               68%
Self-identification      6                  4               67%

Many of those known about through news or online articles were centenarians, identified through age

and location.

2.36 Other

Intruders were also given access to ‘fixed’ tables as csv files; although at least 7 intruders used them, there were no correct claims from these.

Qualitative evidence suggested the intruders found the new flexible outputs system very easy to use (rated 4.3 out of 5 by the 15 intruders surveyed), and short times were typically recorded to arrive at a claim (5-30 minutes), though it is hard to calculate the total time taken per claim accurately, as time spent logged in could not be taken as an indication of time spent on this project.

Intruder feedback suggested that the disclosure rules built into the system were working as intended: when intruders tried to obtain a cell value of 1 at lower geographies, the rules prevented this by denying the data.

3 Discussion

The overall results show that over half of identification claims were incorrect or only partially correct. However, unlike other

intruder testing exercises carried out previously by ONS, intruders were fairly unlikely to make claims

where they had low confidence. Almost all claims were made with a confidence of 60% or greater.

Generally, the higher the confidence an intruder placed in a claim, the more likely it was to be correct. Although this relationship was statistically significant, it was not strong, and a substantial portion of claims made with over 90% confidence were still incorrect or partially correct (35%, or 13/37).

The exercise on 2011 census data saw a drop off in percentage correctness at very high confidence

claims which was not seen here. Possibly, the ease of using the system may have made all intruders

more confident, and meant intruders went for easier identifications rather than putting forward ones they were less sure of.


The method used for this exercise did not allow us to know whether an identification was wrong due

to swapping, or other reasons – only if it was perturbed and therefore a ‘partial’. Therefore, it is hard

to evaluate the success of swapping as a single method from this evidence.

Cell counts of tables present an unclear picture, as no clear correlation was found between table size (in cell count) and correctness. With smaller tables it may be easier to be sure where a person is represented, whereas a larger table makes it more likely that a small count will be available on which to base an identification claim. It seems more detailed classifications may add risk in some circumstances, but this is dependent on geography.

There were no claims at any geography higher than MSOA. It is likely that an intruder would have far

more confidence over a claim at lower geographies since they may have considerable knowledge as to

who lived in an OA with which they are familiar, but far more uncertainty as the geography level

increases. Observing a cell count of 1 in an OA may convince them that the person they know is the

only one with that combination of attributes. They might have less certainty at MSOA that the 1

corresponds to the subject of the claim given the lower likelihood of familiarity with the individuals in

the population, as well as ‘noise’ introduced by error, imputation, record swapping and the cell key

method.

The high level of claims and correct claims at OA makes this the main area of risk to address in

planned outputs. Claims made at OA also had the highest level of confidence with an average of 75%

confidence expressed in the claims. The variables used for these claims were consistent with the

general picture, that is, age was a main variable used for identifications, followed by other detailed

classifications such as occupation and country_of_birth. Multivariate tables made the basis for 22 of

the OA claims, of which most were incorrect or partially correct (13/22 or 59%), which demonstrates

that the protections did well at protecting multivariate data as they were designed to do.

Whilst most of the claims at LSOA were correct (5/9 or 56%), this was a small sample and could equally have been majority incorrect with one fewer correct claim. However, some of the claims made at OA could equally have been made at LSOA, as LSOAs are small enough to make small counts prevalent, and intruders might have a moderate level of familiarity with most residents within a typically sized LSOA (around 1,600 people). The level of confidence in LSOA claims was not much less than that

shown in claims made from OA level tables (73% confidence in LSOA, 75% in OA claims). A

majority of LSOA claims (5/9 or 56%) were based in multivariate tables, though a minority of these

were correct (2/5 or 40%). The mean cell count of tables used for claims at LSOA was consequently

much higher.

There was little risk of a correct claim (only 1/6 or 17%) from an MSOA table, so this supported

earlier evaluations of the data that looked only at the sparsity of the likely tables, and restricted fixed-

table outputs of detailed univariates to MSOA geography. The cell counts used for MSOA tables were

higher on average, which is unsurprising given the higher population (typically 7000) that would have

to be divided among the classifications to obtain a cell count of ‘1’ on which to base an identification. The level of confidence was also significantly lower, at an average of 47%.

That age was shown as a specific risk should be noted; however, some of these claims were made using already publicly available information on centenarians, so arguably the disclosure came from those sources, not the output. That said, many claims also identified people who happened to be the only one of that age in their area, so single year of age at Output Area geography has been shown as a specific risk to mitigate.


The variables used for correct claims support current thinking that more ‘definite’ variables are more disclosive; that is, age and country_of_birth are both variables that are likely to be reported consistently by the person filling in the Census.

Claims based upon occupation, on the other hand, were very unlikely to be correct, which may be due to uncertainty about how the question was interpreted by the person answering, and how their answer was coded by the automated processing system.

Multivariate claims are also less likely to be correct, possibly because increasing the number of

variables increases the chances an answer would not have been given or been recorded the way the

intruder guessed. The level of risk in these detailed univariates was still limited to smaller sized

geography, so there is no evidence from this test to restrict the use of these variables at MSOA or

higher geography.

In terms of the variables that relate to special category data there was no evidence that variables such

as health, disability, ethnicity, religion, sexual_orientation and gender_identity, all of which were

included in the test, were at significant risk of correct identification claims. This may be due to the

protections put in place for these, and the less definite nature of these variables. Though we know 7

intruders tried to use the sexual_orientation and gender_identity datasets, these were made available

separately through .csv files which may have made them harder to access. In the final outputs they

would not be available below MSOA, so this intruder testing exercise seems to support that decision

in terms of sufficient protection for that data.

The test was conducted pragmatically, and therefore recruited people with more statistical awareness

and knowledge of the data than would be found in the general population, as they were ONS

employees. This may be taken as a slightly over-stringent test, as it may over-estimate the risks from intruder attempts made by the public.

4 Conclusion

The standard to be met to fulfil legal requirements is that claims should not be made with both confidence and correctness. The level of risk in the currently planned outputs found by this exercise would meet these legal definitions of safety, and additional steps were taken to decrease this risk further.

In response to the findings, the rules in the table builder were altered to restrict the availability of

detailed classifications at lower geography, and one more detailed topic summary was replaced with a

classification with fewer categories that consequently posed less risk. The majority of claims made

here would not be possible to make using the actual output system.

Perturbation, swapping, the disclosure rules and general level of doubt in the data together were

shown to be effective at preventing correct identifications.

Awareness of perturbation and swapping did not appear to result in lower levels of intruders’ confidence in making claims, so this alone cannot be relied upon to meet the legal standards. Further

steps were also taken to ensure LSOA level data was protected by restriction of the level of detail

available at this geography.


The evidence seen here, with lower risk at MSOA, supports the decision to limit the geography of

usual residents in communal establishments and households to MSOA, even though those datasets

were not included in the test.

The CACD system has been launched since this test took place, and sees some 900,000 interactions

per month (ONS data), demonstrating the usefulness of Census data delivered in a flexible and

immediate format. If this system is to be employed for a wider range of statistical products, further

intruder testing should be considered as a means of measuring and mitigating disclosure risk in those

datasets.

Intruder testing is a highly useful exercise for data providers to employ, where the level of risk

presented by a dataset is in doubt. It gives evidence on the likely level of risk, where that risk lies, and

can inform appropriate action to mitigate those risks.


Intruder Testing

Census 2021 England and Wales

Risk and Utility in the Create a Custom Dataset System

Sam Trace


Background

• Key Census 2021 White Paper promise ‘Every person’s identity will be protected, not only through secure handling and storage of their data, but also by ensuring that our statistical publications do not identify individuals’

• Since 2011, there has been exponential growth in information publicly available about individuals

• There is an all-new customisable system for Census 2021

• Census 2021 has new methods protecting the data


Statistical Disclosure Control (SDC) methods

• Targeted Record Swapping – identifying people and Households that stand

out in the data, swapping them with a similar record in a nearby area.

• Cell Key Perturbation - this adds noise to the figures, making slight

changes to cell counts

• Disclosure rules – automated rule-based checks run by the system, which

decide if there is a low enough disclosure risk to allow the release of a

dataset.

How do we check these have done enough?


Intruder Testing

• Intruder testing is where ‘friendly’ intruders try to identify people in the data to check the risk level

• Census 2011 outputs were intruder tested before release

• It is a practical check to see if the methods worked

• The point of the exercise is to try and find out if it is possible to identify individuals in the data


Legal Standard for outputs

• There must be ‘sufficient uncertainty’ about any identification from a small count

• Identifications made with publicly available information in combination with the data are included

• Testers do not need to be specialist hackers

• Methods must cover the ‘means likely reasonably to be used’


Method

• Recruit intruders – ONS people only

• Consent intruders

• Train them and advise of the disclosure control methods

• Get the data on a secure pre-release system

• Intruders try to identify individuals in the data

• Collate results including feedback

• Analyse in Excel


Results

• 51 Intruders recruited

• 30 confirmed as working on the project

• 24 intruders made claims

• 81 Claims made (excluding duplicates)


Claims

Correct: 49%; Incorrect: 41%; Partial: 10%


Confidence and Correctness


Variables Used

Variable     Correct   All   % Correct
Age          21        35    60%
Multi        12        29    41%
Occupation    2         8    25%
Other         5         9    56%
Total        40        81    49%


Cell Counts

Cell count   Correct   All   % Correct
0-49          2         9    22%
50-99         3         9    33%
100-149       9        20    45%
150-199       6         9    67%
200+         20        34    59%
Total        40        81


Options

Remove detailed classifications from the Create Your Own Dataset system

• Loss of useful classifications at higher geography

• There may be other classifications not tried that also pose a risk

Limit max number of Cells

• Loss of useful functionality at higher geography

Specify Max cells specific to geography for univariates

• Would prevent the main risk


Limit max cells by Geography?

• The majority of datasets used for claims and correct claims would not be available

• Might need to apply to LSOA too as some OA claims could equally have been successful at LSOA

• MSOA claims were already likely to be unsuccessful


Conclusions

• Detail available at low geography was a risk that was addressed in the live release system

• Some variables carry higher risk than others

• Changes to rules effectively blocked the main risks identified

• Automated rules in the Create a Custom dataset system worked to make claims harder to arrive at


Actions

• Limit detail available at low geography

• Keep detailed topic summaries at MSOA level geography

• Releases could take place as planned


User Experience

We asked the intruders their opinions of the new system


Ease of use

Survey ratings covered: Choosing Variables; Choosing Classifications; Data was clear.

Speakers

Samantha Trace

Methodologist

Statistical Disclosure Control

Office for National Statistics





UNITED NATIONS ECONOMIC COMMISSION FOR EUROPE

CONFERENCE OF EUROPEAN STATISTICIANS

Expert Meeting on Statistical Data Confidentiality

26-28 September 2023, Wiesbaden

Smoothing the way for secure data access using synthetic data

Authors: Richard Welpton and Emily Oliver, Economic and Social Research Council (ESRC), UK

[email protected] [email protected]

Abstract

In the UK, sensitive and potentially disclosive data (including survey and government-owned administrative data) are kept securely and safely in de-identified form and are only accessible to accredited researchers through Secure Data Environments (SDEs). Using these data for research has enormous potential, although access can be constrained by the need for researchers to understand enough about these complex datasets to submit a viable project proposal, tensioned against the resources required for data owners to assess every application to use them, and for data guardians to answer questions from researchers about the data. Researchers need to be very invested to engage: they can’t see the data in advance of applying for it; can’t test it to see if it will answer their research question; it can take a long time to get hold of; and when they do, it might not contain what they need. It is also burdensome for the SDE, as the researcher needs to spend a lot of time in the SDE exploring and preparing data ready for analysis. The resource costs to both researcher and SDE can be considerable.

Low-fidelity synthetic data can be an effective tool to improve the researcher journey because it can lower the barriers to understanding the data before giving researchers access to the real data. As well as accessing it for training purposes, researchers can use it for exploratory analysis to determine if the real data includes the variables they need. In turn, this can help support researchers to improve the quality of their applications for funding and data access; and to develop and test their code while they are waiting for access to the real data. Researchers can continue to develop their code outside of the SDE, therefore minimising the time and resources spent inside the environment. In the UK, only a small number of data services provide access to synthetic data, despite the development of numerous methods for creating synthetic data in the last decade or so.
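As an illustration of what ‘low-fidelity’ can mean in practice, the sketch below (a generic example in Python, not the ADR UK-funded tool referred to later) builds synthetic records by sampling each variable independently from its marginal distribution. The output keeps variable names, categories and rough univariate frequencies, but none of the real records or the relationships between variables; in practice further protections, such as suppressing rare categories, would also be applied before release.

import pandas as pd

def low_fidelity_synthetic(df: pd.DataFrame, n_rows: int, seed: int = 0) -> pd.DataFrame:
    # Sample each column independently from its observed marginal distribution, so the
    # synthetic data preserve structure and approximate univariate frequencies but not
    # the joint relationships needed for real analysis.
    synthetic = {}
    for offset, column in enumerate(df.columns):
        synthetic[column] = (df[column]
                             .sample(n=n_rows, replace=True, random_state=seed + offset)
                             .reset_index(drop=True))
    return pd.DataFrame(synthetic)

# Usage: given a de-identified extract `real_df` held inside the SDE, a low-fidelity
# copy for exploration and code development might be produced as:
# synthetic_df = low_fidelity_synthetic(real_df, n_rows=1000)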

The Economic and Social Research Council (ESRC, the UK funding council for social and economic research) has invested in a programme of work to support the creation and routine supply of low-fidelity synthetic data, in order to support data access for research and improve the efficiency of SDEs. This has been done largely through ESRC’s Administrative Data Research UK (ADR UK) programme. They have:

• Conducted an in-depth study of the concerns and myths held by government data owners surrounding synthetic data production and use;

• Funded the creation of a Python Notebook tool to create synthetic data easily, at low cost and minimal risk, which has been tested and approved by government departments;

• Formed a position statement across its UK partnership setting the vision for synthetic data within its wider remit and mission;

• Embarked on a significant project to explore the utility and use cases of different approaches to synthetic data creation and to evaluate the efficacy of different models, to provide recommendations for how synthetic data production can be achieved at scale whilst remaining acceptable to data owners;

• Developed a public dialogue on the acceptability of synthetic data, and public understanding of it and its uses, to increase trust and confidence in its development for research for public good.

This session will describe the secure data landscape within which synthetic data sits in the UK and explain the approach taken by ESRC and ADR UK to use it as a catalyst for better quality applications for funding and data access, and a smoother researcher journey. We will demonstrate the effectiveness of providing access to low-fidelity data by describing how it makes the researcher journey for accessing and using data in an SDE more productive, while simultaneously reducing the burden on data custodians and maintaining confidentiality.

1 Introduction

Considerable progress has been achieved to improve access to sensitive data for research, particularly in the UK. For example, the Office for National Statistics (ONS) launched the Virtual Microdata Laboratory (VML) in the mid-2000s (later to become the SRS – the Secure Research Service). In 2011, the UK Data Archive established the Secure Data Service (now UK Data Service Secure Lab). ESRC’s ADR UK programme, a partnership between government and academic groups across all four UK nations, creates linked datasets from administrative sources, making these available to researchers through four Trusted Research Environments (TREs): SAIL databank (ADR Wales); NISRA (ADR Northern Ireland); eDRIS/Research Data Scotland (ADR Scotland); and ONS Secure Research Service (ADR England). These are all examples of Secure Data Environments (SDEs), also known as Trusted Research Environments (TREs).

These facilities have become commonplace across the health and social science research sectors because they offer a robust approach to accessing sensitive data. They reassure data owners that data they are responsible for, on behalf of the public, can be accessed safely (mitigating risk to individuals in the data) according to the principles of the Five Safes Framework1. SDEs are now considered the default option as far as access to sensitive data is concerned.

SDEs enable a range of data sources to be accessed securely. Consequently, researchers can better explain a range of health, social and economic phenomena. Examples of these data include:

• Business survey microdata available in the SRS and Secure Lab (these are sensitive because of the difficulty of anonymising the data while keeping enough utility in the data to undertake research).

• Detailed versions of social survey data, also available in the SRS and Secure Lab, where additional detail such as very low-level geographies or occupation codes, not available in downloadable versions of the data, offers new research insights.

• Administrative datasets that ADR UK has supported UK and devolved governments to make available through its network of four SDEs. These data are de-identified, but are not suitable for download because of their sensitivity, and offer utility for researchers.

• Health data, such as cancer registration data and records from primary and secondary care services, accessible to researchers through organisations such as NHS England and other SDEs.

• Linked health and administrative datasets, which are also now becoming available to researchers through ADR UK's network of SDEs.

The UK benefits from a legal climate that permits use of such data for research purposes; but culturally the use of the data described above continues to provide ethical and public perception challenges. Concerns about the misuse of data are understandably a constant feature of public debate in this area. This underlines the important role that SDEs have in maintaining the social licence to use these data for research in the public good. When managed through the Five Safes Framework, secure access to these data through an SDE provides assurance that such access leads to safe use in the public good.


Despite the SDE solution, it should be pointed out that the cost of setting up and operating an SDE is high. Unlike distribution of data, secure access to data through an SDE requires:

• A technological solution (controlling access to researchers, data and projects, coupled with computational processing power)

• An auditable information governance and assurance framework

• Expert staff (technology, research, data management, statistical disclosure control, etc.)

An SDE can only support as many data sources, researchers and research projects as its technology and staff capacity can allow. For example, between 2007 and 2010, the VML could support about 12 researchers accessing the facility simultaneously (the number of physical desks available at the offices where researchers could sit to visit the facility). When the Secure Data Service was launched in 2011, it could allow 40 researchers to remotely access the service at any one time; this was increased to 150 recently, following funding from ADR UK to expand and improve the service.

Other capacity constraints remain:

Inputs: procedures that researchers must navigate to access data in an SDE often require the researcher to explain in detail how they will use the data to address their research hypothesis. While metadata and documentation can help (when available), researchers often cannot describe accurately how they will use the data until they actually have access to the data. This creates uncertainty and can lengthen the application process.

Quality and completeness of information: occasionally, researchers who have spent considerable time gaining approvals for access to data discover that the data are not suitable for their research when they finally acquire access: a significant opportunity cost for them (and the data owner and SDE that support their access).

Outputs: In an SDE, researchers need to have their research outputs checked for potential disclosure before being released, a process known as statistical disclosure control. This is largely a manual process: SDE staff receive and process these requests. The ability to support researchers can be constrained simply by the number of staff available to service these requests.

Throughput: Much research involves exploring data and methods before a research question is answered. This iterative process relies on computing power to process data. In practice, little of this processing effort leads to a direct research output (for example, it may take several iterations to estimate a research model that yields statistical results that a researcher decides to publish). Yet depending on the technical architecture of the SDE, researchers may be competing for available compute resource, such as CPU/GPU memory.

One solution to address these constraints is to simply invest more money into SDEs, so more staff can be recruited, and more computational capacity can be sourced, etc. Despite such efforts in recent years, the demand to access these data sources continues to grow. SDEs are unlikely to be able to scale up to keep pace with this demand indefinitely.

This paper describes the potential of synthetic data to reduce these bottlenecks. We provide a vision whereby synthetic versions of sensitive data are routinely produced to:

• Enable researchers to assess data before making an application to access them, making sure they are the right data to support their research and helping them accurately justify their use of the data when applying for access.

• Support the iterative process of research methodology and execution outside of the SDE, thereby reducing demand on SDE computational resources and on staff time to undertake statistical disclosure control (accepting the latter may be automated or partially automated in the future; a simplified illustration of such an automated check follows below).
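Purely to illustrate what partial automation of output checking might look like, and not describing any SDE's actual rules, the sketch below applies a simple minimum-frequency rule to a table of counts. The threshold of 10 and the column names are arbitrary assumptions for the example.

```python
import pandas as pd

def flag_small_cells(freq_table: pd.DataFrame, threshold: int = 10) -> pd.DataFrame:
    """Suppress cells in a table of counts that fall below a minimum frequency rule.

    The threshold is illustrative only; real disclosure rules are richer
    (dominance checks, differencing, class disclosure, etc.) and context specific.
    """
    return freq_table.mask(freq_table < threshold, other="suppressed")

# Hypothetical usage on a researcher's output table:
# counts = researcher_output.pivot_table(index="region", columns="age_band",
#                                        values="person_id", aggfunc="count")
# print(flag_small_cells(counts))
```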


The next section outlines in more detail the challenges that researchers experience. We proceed by explaining ADR UK’s efforts to pilot the generation of synthetic data, and then describe how these synthetic data can support researchers and enable SDEs to work more efficiently given their limited resources, resulting in improved outcomes for researchers and the policy world they support.

2 Challenges for researchers

Using sensitive data for research, such as administrative data, has huge potential, not least because there is so much of it. Administrative data, by its very nature, includes everyone. The datasets are enormous and complex, rich with potential for discovering insights about behaviours, trends, implications and consequences for individuals, communities and the policies and services they depend upon. By linking datasets and combining survey and administrative data, these insights can be even deeper, and the things they can tell us can be transformational.

Access approvals can be slow to gain, particularly for linked administrative datasets, because typically each data owner (that is, the government department, local authority or other public body) will want to approve requests. For the researcher, this is dependent upon:

• Knowing what data they want to access – including the dataset, the variables, and even within the variable, the period of time they want to consider. Generally, a data owner will not want to give permission for a researcher to access any data they do not need to answer their specific question (the principle of only providing the minimum data necessary to address the research question). If the data has good, accessible documentation (metadata, user guide, etc.) this could be possible. Otherwise, they might need to rely on access to an expert who has used the data before and knows it well. The researcher needs to be specific and accurate in their request, but knowing enough about the data to do this before they make the request is not always possible.

• Getting a response from the data owner: this is dependent on the data owner having adequate resources in place to respond to data requests. The data owner needs staff who know and understand the data, who also have the time and remit to respond to these queries. If the data is deemed particularly useful by researchers and/or it does not have good and accessible documentation, the data owner might be inundated with requests for it. During times of political turmoil, such as during and post-elections, industrial action or national crises, processing data access application queries might be deprioritised.

Although a researcher can apply to be accredited to access secure data, it is generally only once the data owner has indicated approval that the researcher can apply through more predictable channels: applying to the relevant SDE and getting confirmation from research approval panels.

Dr Paul Calcraft2 has described the process of applying to access linked administrative data in the UK as trying to buy a second-hand car without being able to see it or test drive it first: does it have all its parts, is anything missing, does it do what you think it will do, are there any quirks you should know about? A researcher cannot see the data before applying for it, and cannot test it to see if it will answer the research question. Gaining access can be lengthy, without any certainty that the data will contain the information needed. Figure 1 sets out the process researchers need to follow to access secure data in England.


Figure 1: Process of access to secure data in England

3 Synthetic data as a solution

One solution for researchers is to bypass much of the system for accessing secure data by instead accessing a version which is not real data and therefore does not need to be held securely. At the very least, using a synthetic version of the data to find out whether it is worth embarking on a protracted process to access the real data could be valuable. In this section we describe how this prospect should be considered.

3.1 Types of synthetic data and their potential

The utility of synthetic data for different applications is, of course, central to the question of its potential. High-fidelity synthetic data, which mimics the original data and preserves the statistical relationships between variables, could reduce costs and complexities for the researcher, as it could also allow for analyses which are extremely close to those done on the real data. However, the use of such high-fidelity synthetic data does come with a degree of risk for the data owner, particularly if findings from such data are misinterpreted, or if it is 'passed off' as real data.

Low fidelity synthetic data can, on the other hand, significantly reduce, if not remove, the risks for data owners, as analyses of the data would not generate meaningful results. It can also provide the researcher with easy access to a dataset which can be used to prepare code, test code, become familiar with the format of the data and learn how it can be used. It can also be used for training purposes, to raise awareness about the data.

In the UK, only a small number of data services provide access to any synthetic data, despite the development of numerous methods for creating it in the last decade or so. Making the production of low-fidelity synthetic datasets more common could be beneficial to researchers and data managers alike. However, public perception of synthetic data is currently unclear, and failing to address it alongside other considerations could be reputationally damaging.

For the purposes of this paper, we have described SDEs as 'remote access' solutions, in which the researcher can access and 'see' the data they have applied to access in order to undertake their research. Another approach is the 'remote execution' model, where a researcher develops statistical programming code using synthetic data, then submits their code to be run remotely on the real data. Statistical outputs are then returned to the researcher, subject to a statistical disclosure control check. Recent developments have included Application Programming Interfaces (APIs) to automate this process (such as DataShield and OpenSAFELY). Remote execution relies heavily on accurate synthetic data to ensure that researchers can submit accurate statistical programming code; otherwise it may fail, to the frustration and delay of the researcher.

3.2 Developments

In 2020, ADR UK commissioned the Behavioural Insights Team (BIT) to undertake an investigation into the attitudes to, and appetite for, the provision of synthetic data by government departments. The intention was to understand the concerns and barriers with a view to tackling these head on in a more informed way. The study identified technical considerations, risk aversion and lack of knowledge, the use of advanced privacy-preserving technologies, and the need for a better understanding of public attitudes to synthetic data alongside clearer communication as the key influencing factors. The results of the study are set out in the project report, Accelerating public policy research with synthetic data, and led to recommendations to:

• Encourage the use and sharing of low-fidelity synthetic data to support rapid discovery of whether the dataset is appropriate for answering the research question; to develop and test code before full access is available; reducing delays in the process, including the amount of time needed to be spent in a secure environment;

• Expand the use of synthetic data for training so that researchers can be exposed to relevant idiosyncratic datasets earlier, thus improving their efficiency on live projects;

• Develop a cross-government repository of synthetic data for restricted access without a specific project proposal to allow for better design and more refined project proposals, and for this to be fed by a semi-automated pipeline to routinely generate low-fidelity synthetic data.

The study was followed up with the development of a synthetic data generation tool in the form of a prototype Python notebook which could be used by government analysts or researchers to generate low-fidelity synthetic datasets quickly and easily. It creates a version of the data that follows the structure and some of the patterns found in the real data. As such, it is plausible and represents the data as a whole. At the same time, because it does not preserve statistical relationships between columns, it reveals very little - if anything - about any individual in the dataset. The tool has now been extensively tested and is available for use. Users need Python (preferably Python 3), two common Python libraries (NumPy and pandas), and a software tool for viewing, editing, and running Python notebooks such as VSCode or Jupyter.
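The published notebook is the authoritative implementation; purely as an illustration of what 'low fidelity' means in practice, the sketch below generates a synthetic table by sampling each column independently from its empirical distribution, so marginal patterns are retained while relationships between columns are deliberately broken. Function and file names are our own and are not those used by the BIT tool.

```python
import numpy as np
import pandas as pd

def make_low_fidelity_synthetic(real: pd.DataFrame, n_rows=None, seed=0) -> pd.DataFrame:
    """Sample every column independently from its observed values.

    The result looks structurally like the real data (same columns, plausible
    values, similar category frequencies) but preserves no relationships
    between columns, so it reveals very little about any individual.
    """
    rng = np.random.default_rng(seed)
    n_rows = len(real) if n_rows is None else n_rows
    synthetic = {
        column: rng.choice(real[column].dropna().to_numpy(), size=n_rows, replace=True)
        for column in real.columns
    }
    return pd.DataFrame(synthetic)

# Hypothetical usage (file names are placeholders):
# real = pd.read_csv("deposited_extract.csv")
# make_low_fidelity_synthetic(real).to_csv("synthetic_extract.csv", index=False)
```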

The BIT developers also produced a user guide which provides clear, step-by-step instructions, including how to ensure your system can run it. It guides the user through methods to run the cells in the notebook, explains how output files can be saved, and even tells you how to check that the notebook has worked. There is a useful section on troubleshooting as well as further information for more advanced users.

To visualise the benefits of synthetic data for researchers using the process set out in Figure 1, we have indicated in Figure 2 where the efficiencies could lie.


Figure 2: Proposed efficiencies in the process when access to synthetic data is added

Of course, low-fidelity synthetic data is not a silver bullet. There will be instances where higher fidelity synthetic data is both more appropriate and more useful. In an ADR UK-led workshop at the International Population Data Linkage Network (IPDLN) conference in 2022, where different approaches to creating synthetic data were discussed, participants agreed that the value of different tools was entirely reliant on the end utility of the synthetic dataset2. Partners from ADR UK have taken their own approaches to developing synthetic data according to need and appetite in the devolved nations of Wales, Scotland and Northern Ireland and these have recently been published as an Interim Position Statement on Synthetic Data. It sets out ADR UK’s vision for synthetic data and frames it in the wider context of its remit and mission. The statement is intentionally ‘interim’ because of the dynamic nature of this topic and our growing understanding of issues and opportunities associated with it.

3.3 Putting synthetic data into practice

While the case for the provision and use of synthetic data is powerful, data owners remain cautious, and we need to find effective ways of engaging the public in discussions about the creation of synthetic data. As such, we are a long way from seeing synthetic data operationalised to the point where trusted research environments can produce it routinely and facilitate access to it at scale. There is also a lack of evidence to support decisions among data owners and data services about how the governance around this might best be implemented. Data owners and services need real-world use case studies on costs and benefits to inform more systematic approaches to the creation and sharing of synthetic data.

To inform future practice, ESRC and ADR UK are opening a joint research call to fund individuals and teams to explore how the potential of synthetic data can be harnessed at scale. Recipients of these grants will evaluate the current uptake, utility and governance of synthetic versions of datasets held in SDEs, including the benefits, costs and challenges to researchers, data owners and the SDEs themselves. They will also support a qualitative study of public understanding of and attitudes to synthetic data. The results of these funded projects will collaboratively inform a report and recommendations for how synthetic data production and provision can be achieved at scale and with the trust and support of stakeholders, including the public.

4 Discussion: Challenges and opportunities

The use of synthetic data provides an opportunity to reduce demand for SDE access, as the analysis needed to complete projects within an SDE could be carried out more quickly. Our desire is that SDEs operate as efficiently as possible, and synthetic data, in our opinion, offers a way to improve that efficiency, in the following ways:

It can enable researchers to make much more accurate data access applications. A benefit of this is that researchers will have more certainty that the data they are interested in accessing will support their research. Synthetic data should reduce the number of researchers who apply to access data and are set up by the SDE to access it, but then realise the data cannot support their research after all.

Researchers ought to be able to construct a significant amount of their statistical programming code outside of the SDE; and only use the SDE to refine and run the code on real data. This means they spend less time logged into the SDE and less time using compute resources for iterative coding.

If the use of synthetic data did create more opportunities to train and engage researchers in accessing sensitive data within an SDE, improve the quality of applications to access the data held, and also improve the efficiency of how SDEs operate, this may all drive up the use of these data for research in the public good.

The process of producing useful synthetic data requires time, skills and customisation, although much of the process can also be automated3. There are further challenges to address, including:

• Deciding which organisation is best placed to produce the synthetic data. The data-owning organisation, or the organisation running the SDE?

• Should Digital Object Identifiers and other techniques be adopted to monitor version control and use of the synthetic data?

• What training and guidance should be made available to ensure that researchers do not inadvertently try to publish statistical findings that have been drawn from the synthetic version of the data, instead of the real data?

• How do we engage the public in discussions about the creation of synthetic data?

5 Conclusions

Synthetic data provides opportunities to smooth the researcher journey to access sensitive data via an SDE and to reduce the burden on the data owners and SDEs supporting researchers who request such access. However, few use cases exist in the literature that evaluate the benefits and costs to stakeholders (researchers, data owners and SDEs), which is hindering scaled production and routine use. Evidence of public understanding and acceptance is also limited. The other barriers described in this paper are not insurmountable, and addressing them could, in the long run, reduce costs for stakeholders if automated systems were put in place. The benefits of synthetic data are becoming clearer as more research is funded using secure data, the complexity of new, linked datasets increases, computational power increases and data science skills become better recognised for research across disciplines. For access to secure data to keep up with demand, synthetic data is a strong enabler and an important consideration for progress.


References:

1. Ritchie, F. (2008). Secure access to confidential microdata: four years of the Virtual Microdata Laboratory. Economic and Labour Market Review, vol 2, No. 5.

2. ADR UK. Approaches to creating synthetic data: Workshop at IPDLN conference 2022.

3. Nowok, B., Raab, G.M., and Dibben, C. (2017). 'Providing Bespoke Synthetic Data for the UK Longitudinal Studies and Other Sensitive Data with the Synthpop Package for R 1'. Statistical Journal of the IAOS 33/3: 785–796. DOI: 10.3233/SJI-150153.

Respondent centric survey design and data collection – the Transformed Labour Force Survey - Colin Beavan-Seymour, Maria Tortoriello and Sabina Kastberg (Office for National Statistics, United Kingdom)


Respondent Centric Survey Design and Data Collection – Transformed Labour Force Survey

UNECE Expert Meeting 2023

Maria Tortoriello Principal Social Researcher

Colin Beavan-Seymour Principal Social Researcher


Talk outline

Part 1 – Survey Design

• What is the purpose of the Transformed Labour Force Survey?

• Survey Design – sample, collection modes

• Return rates

Part 2 – Implementation of an Adaptive Survey Design

• Why use an Adaptive Survey Design?

• How was it developed?

• How was it implemented?

• Initial findings


Part 1 – Survey Design

Colin Beavan-Seymour


What is the Transformed Labour Force Survey?

• A new survey which will collect data on key labour market measures

• Developed with a respondent centric approach

• Qualitative and quantitative research

• Online first

• A rationalisation and redevelopment / rethink of how to measure core labour market concepts

• Extensive qualitative research with members of the public, interviewers and data users

The journey so far…

2017 – Tests 1 & 2: online response rates; engagement strategies

2018 – Test 3: mixed mode (online & F2F); statistical outcomes

2019 – Test 4: online attrition test; response rates across 3 waves

2020 – TLFS Beta: online only, in response to the pandemic

2022 – Addition of telephone: online & telephone collection

2022/23 – Knock-to-nudge: using an Adaptive Survey Design

Sample Design

[Slide diagram: Transformed Labour Force Survey sample design – households flow from TLFS Wave 1 through TLFS Waves 2 to 5, alongside allocation to the Opinions Survey and other social surveys; sample sizes of 140,000 households and 40,000 households are shown.]

What data did this give us?

• A return rate (complete returns & partials) of around 37.5% - a great start!

• However, we were still seeing similar biases in the responding sample that other voluntary surveys in the UK were experiencing, despite the online mode and user-centric design:

• A large proportion of respondents were over 55, many over 65 – fewer respondents of working age, more economically inactive

• A majority of respondents owned their homes, many without a mortgage or loan

• Respondents with a white ethnic background comprised the vast majority of the data, with under-representation of other ethnic backgrounds

• The vast majority of data was from the online mode – only a small percentage was from telephone collections

• The 2018 test indicated that interviewers visiting households can increase response from under-represented areas

• But… with a large-scale survey of over 500,000 a year… how can we increase the quality of the data collection but keep the cost of the operation down?

Part 2 – Adaptive Survey Design

Maria Tortoriello

What is an Adaptive Survey Design (ASD)?

In November 2022 we implemented an ASD for the TLFS.

• What is an ASD?

➢ Dividing a sample into smaller groups that have similar characteristics (segmentation)

➢ Applying alternative survey design features for different groups: modes, materials, incentives

➢ Objective is to improve targeted survey outcomes: reduce bias, reduce costs

Why use an Adaptive Survey Design?

• TLFS data collection strategy same for all sampled addresses = no adaptive survey design

• Experiencing differential non-response bias which affects estimates

• Statistical processing enables weighting of sample to account for some bias, but confidence in estimates would only improve with higher quality input data

• Next step for TLFS was to introduce additional modes - face-to-face follow-up

• One size does not fit all!

• ASD allows you to target the right respondents in the right way, rather than targeting all respondents in the same way = more efficient use of field resources

How was the Adaptive Survey Design developed?

• Closely followed work of Statistics Netherlands (Schouten, B. et al.)

• A key objective of ASD is to divide the sample into strata in order to define targeted protocols for each of the strata

• A logistic regression model was applied to historical TLFS data to identify auxiliary variables strongly associated with response, to formulate the ASD strata

• Variables considered were Index of Multiple Deprivation (IMD), Urban/Rural Classification, Country of Birth, Age & Ethnicity (limited by available data)

• Derived and examined CV, R-indicators and partial R-indicators to identify the variables and categories of variables driving variation in response propensities (a minimal illustration of this propensity modelling follows below)

• Strongest predictors of response:

• Age (<45)

• Urban/Rural Classification (Urban)

• Index of Multiple Deprivation (IMD deciles 1-4)
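As a rough illustration of the kind of calculation involved, and not the ONS production code, the sketch below fits a response-propensity model with scikit-learn and summarises the variation in the estimated propensities using the R-indicator of Schouten et al. (one minus twice the standard deviation of the propensities) and the coefficient of variation. The column names are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

def propensity_summary(frame, auxiliary_cols, response_col="responded"):
    """Fit a logistic response-propensity model and summarise propensity variation."""
    X = pd.get_dummies(frame[auxiliary_cols], drop_first=True)   # categorical auxiliaries
    y = frame[response_col].astype(int)                          # 1 = returned the survey
    rho = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]
    s = rho.std(ddof=1)                                          # spread of propensities
    return {
        "R_indicator": 1 - 2 * s,      # closer to 1 = more representative response
        "CV": s / rho.mean(),          # coefficient of variation of propensities
        "mean_propensity": rho.mean(),
    }

# Hypothetical usage on historical response data:
# summary = propensity_summary(history, ["imd_decile", "urban_rural", "age_band", "ethnicity"])
```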


ASD: Iteration 1

➢ STRATA 1 = urban, less deprived areas, 45+

➢ STRATA 2 = urban, more deprived areas, 16-44

➢ STRATA 3 = urban, less deprived areas, 16-44

➢ STRATA 4 = urban, more deprived areas, 45+

➢ STRATA 5 = non-urban, more deprived areas, 16-44

➢ STRATA 6 = non-urban, more deprived areas, 45+

➢ STRATA 7 = non-urban, less deprived areas, 16-44

➢ STRATA 8 = non-urban, less deprived areas, 45+

(High priority strata are highlighted on the slide; an illustrative sketch of the stratum assignment follows.)
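Purely as an illustration, and not the production allocation rule, the function below maps a sampled case to one of the eight strata, assuming that "more deprived" means IMD deciles 1-4 and that the age split is 16-44 versus 45+, as the slides suggest.

```python
def assign_stratum(age: int, urban: bool, imd_decile: int) -> int:
    """Map a sampled case to one of the eight ASD strata (illustrative only)."""
    younger = 16 <= age <= 44          # assumed age split: 16-44 vs 45+
    more_deprived = imd_decile <= 4    # assumed definition: IMD deciles 1-4
    if urban:
        if more_deprived:
            return 2 if younger else 4
        return 3 if younger else 1
    if more_deprived:
        return 5 if younger else 6
    return 7 if younger else 8

# assign_stratum(age=37, urban=True, imd_decile=2)  -> 2 (urban, more deprived, 16-44)
```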

• Potential to include numerous interventions in the ASD (e.g. mode, incentive, materials)

• Keeping it simple with 1 intervention = 'Knock to Nudge' (KtN) follow-up

• ASD will target KtN data collection at under-represented strata based on response propensities, in order to reduce the variation in response propensities for a selected set of auxiliary variables

• This will ensure that data collection resources are used in the most efficient way whilst increasing response from historically under-represented population groups

ASD Optimisation approach

• We are following a structured 'trial and error' approach to optimising our ASD

• The optimum solution is unknown and experimental testing is needed

• Start with a simple design that can be accommodated using existing systems

• Document, evaluate, learn, extend…

• Grow – add features to the ASD as technical and admin systems improve over time

Early results

• ASD evaluation project - ongoing

• Operational evaluation – evaluating optimal set-up of KtN

o Optimal number of visits = 2/3

o Best days to make contact: Monday, Tuesday, Sunday

o Best time of day to make contact: between 3pm and 8pm

o KtN not working as well in the London and North West regions

• Data quality evaluation

• Improving variability in response across strata

• Small improvements in representativity of data

o Statistically significant increase in response from 'hard to reach' groups

(Results based on the first 'full' knock-to-nudge month.)

Thank you for listening!

Any questions?

Contact details:

[email protected]

[email protected]


Survey Research and Development Principles: 11 value statements that facilitate Respondent Centred Design - Laura Wilson (Office for National Statistics, United Kingdom)


Rethinking Data Collection

Survey Research and Development Principles: 11 value statements that facilitate Respondent Centred Design

Laura Wilson, Principal Researcher and Data Collection Expert, UK Government Data Quality Hub – [email protected]

12 June 2023

UNECE Expert Meeting on Statistical Data Collection

Design Principles

What exactly are they and why are they needed?


Design principles…

• Value statements

• Help us to be Respondent Centred

• Define good design

• Provide clear and practical recommendations for all to follow

• Educational aid

• Support change, consistency and decision making


ONS' Design Principles

11 Survey Strategy Research and Development Principles


Be different when you need to be

Principle 1

Take an optimode and adaptive approach to design

Principle 2

Evidence informs decision making

Principle 3

Data users lead the way

Principle 4

Respondents have the answers

Principle 5

Everyone counts

Principle 6

Trust, roles and responsibilities

Principle 7

It’s our job to make things simpler

Principle 8

Follow, reuse, and refresh

Principle 9

Iterate, learn, and share

Principle 10

Think about the whole service and solve problems as a whole

Principle 11

https://analysisfunction.civilservice.gov.uk/policy-store/office-for-national-statistics-ons-survey-strategy-research-and-development-principles-ssrdp/

Be different when you need to be

Principle 1


Take an optimode and adaptive approach to design

Principle 2


Evidence informs decision making

Principle 3


Data users lead the way

Principle 4


Respondents have the answers

Principle 5

Respondent Centred Design Framework (RCDF): https://analysisfunction.civilservice.gov.uk/policy-store/a-user-centred-design-approach-to-surveys/

Everyone counts

Principle 6


Trust, roles and responsibilities

Principle 7


It’s our responsibility to make things simpler

Principle 8


Follow, reuse, and refresh

Principle 9


Iterate, learn, and share

Principle 10


Think about the whole service and solve problems as a whole

Principle 11


Thank you – questions? Laura Wilson

[email protected]


2023 abstract – UNECE Expert Meeting on Statistical Data Collection 'Rethinking Data Collection', online (12-14 June 2023)

Title:

Survey Research and Development Principles: 11 value statements that facilitate Respondent Centred Design.

Speaker:

Laura Wilson

Abstract: To successfully achieve the paradigm shift to a state where respondents are central and integral to survey design, we first need to know the values that underpin that new state. This is where design principles step in – they are value statements that set the standards and ways of working for all to follow. They are used to support change, consistency and decision making within teams and across organisations.

Design principles foster a common understanding of what it takes to make a survey respondent centred and they define what good design looks like. Having clear and practical recommendations for research and development teams to follow means that they are more likely to design successful surveys. They can also be used as an educational tool with stakeholders and staff to share and help explain the ethos and future vision.

At ONS, we’ve created 11 Survey Research and Development Principles for the new ONS Survey Strategy. They are:

1. Be different when you need to be
2. Take an optimode and an adaptive approach to design
3. Evidence informs our decisions, not assumptions
4. Data users lead the way
5. Respondents have the answers
6. Everyone counts
7. Trust, roles and responsibilities
8. Achieving simplicity is on us
9. Follow, reuse and refresh
10. Iterate, learn and share
11. Think whole service and solve whole problems

These will be used by all teams creating surveys at ONS. During this talk I will step through them and share how they help to facilitate Respondent Centred Design.

___________________________________________________________

Paper:

The ONS Survey Strategy Research and Development Principles

Be different when you need to be

When we find something that works, for example a letter template or a question pattern, we use it widely. We:

• follow harmonised standards to improve the quality and comparability of our data across government
• use consistency to build legitimacy and brand recognition
• use tried and tested products to improve our ways of working and help us all achieve our goals

But, we also allow ourselves to take a different approach when our evidence shows we need to. This prevents us from complicating the respondent journey, which could compromise user needs. We always aim for consistency and not uniformity.

ONS Survey Strategy Delivery Principles

This SSRDP links to the following ONS Survey Strategy Delivery Principles:

• aligns with other initiatives across the organisation
• products should reduce complexity and burden that exists in the system

Take an optimode and an adaptive approach to design

We develop each product, for example a letter or questionnaire, in the best way for the mode, or modes, it is administered in. We also develop each product for its mode-specific users. This is known as "optimode" design. By designing in this way, we can help respondents give us the data we need by reducing respondent burden. It also creates products that are more user friendly to our internal users, including interviewers and call centre staff.

We tailor each product to the medium it uses and the specific needs of the users in that mode. This helps us get the data we need and improves data quality.

We use adaptive web design during development which allows the layout to adapt to the screen size appropriately. We design for mobile screens first, and then larger ones. That’s because it helps challenge us to think about the minimum content needed. We justify each piece of content being added, and refer to user needs, user stories and user journeys to do so.

ONS Survey Strategy Delivery Principles

This SSRDP links to the ONS Survey Strategy Delivery Principles that “products should reduce complexity and burden that exists in the system”.

Evidence informs our decisions, not assumptions

The designs of our surveys and their products are based on evidence. We:

• do not make assumptions about our users' needs
• do not make a design decision if there is no evidence to support and inform it
• avoid assumption-led design as this will lead us to produce the wrong thing

Evidence and insights can be gathered from many sources. For example, we could complete some research with respondents or explore existing data to inform our next actions.

ONS Survey Strategy Delivery Principles

This SSRDP links to the ONS Survey Strategy Delivery Principles that “decisions are backed up by evidence”.

Data users lead the way

Our surveys meet the data users' needs because our design journey starts with them. We invest time with our users to learn about their data intention. This includes understanding how they intend to use and analyse the data.

We let our data users lead the way by providing the concepts to be investigated, but they do not design the content itself. We avoid getting data users to design the content because the designs will not be respondent centred.

Once we understand our users’ needs we use this information, alongside the respondent needs, to inform the design of the respondent centred survey products.

ONS Survey Strategy Delivery Principles

This SSRDP links to the following ONS Survey Strategy Delivery Principles:

• clear ownership and accountability
• decisions at right level
• decisions are backed up by evidence

Respondents have the answers

We listen to our respondents and know what they need. This means we design the right thing. We do not make design decisions based on our assumptions, personal views, and biases. Instead, we carry out research to learn about respondent mental models and needs. We explore the cognition and usability of our household and business survey products through testing. We involve interviewers, call centre staff, survey processors and operational staff in the development of products to incorporate their needs and their insights on respondents. This could include insights about issues with an existing questionnaire, for example.

We always ensure we learn about what respondents need, rather than what they want. We use our analysis of respondent needs to develop assets such as respondent journeys and stories which inform the design of survey products.

We follow the Respondent Centred Design Framework to ensure we design based on needs to create respondent centred products.

ONS Survey Strategy Delivery Principles

This SSRDP links to the following ONS Survey Strategy Delivery Principles:

• decisions are backed up by evidence
• decisions at right level
• products should reduce complexity and burden that exists in the system

Everyone counts

Our statistics reflect the experiences of everyone in our society. This means all our surveys are designed to be inclusive and compliant with accessibility legislation.

We think about all types of respondents from the start because we want everyone to be able to take part in our surveys. We increase response and representation in our data by removing barriers to interaction and participation created through exclusionary design. We follow harmonised standards to ensure our survey questions are inclusive and that we collect representative data.

Inclusive and accessible designs reduce burden for all respondents, not just those with additional needs or disabilities. For example, we aim to design each product to meet the average reading age of the UK. This makes our products easier to understand which improves the overall respondent experience. Inclusive and accessible designs improve the quality of our data and build trust in our statistics.

These ways of working also apply to products that are developed for internal ONS users that are part of running a successful survey. This includes interviewers, call centre staff, survey processors, and operational staff. This provides equal opportunity to our workforce by ensuring everyone can use our products.

ONS Survey Strategy Delivery Principles

This SSRDP links to the following ONS Survey Strategy Delivery Principles:

• products should reduce complexity and burden that exists in the system
• aligns with other initiatives across the organisation
• contribution to survey and ONS strategic outcomes is clear

Trust, roles, and responsibilities

We trust and involve the right people at the right time at every phase of a project. We define clear roles and responsibilities, which helps us run a successful survey and achieve our goals. We are transparent about the design process with stakeholders and involve them in the development journey.

Everyone involved in a project clearly understands their purpose and expected contribution at every point of designing and developing a survey. They know where their role begins and ends, which helps ensure people with the right skills for the job are assigned to the right part of the design and development process. This allows the organisation to fully benefit from the investment made to employ and train these people, who are experts in their roles. It also avoids products being influenced and designed by the wrong people at the wrong time, which can lead to the wrong thing being built. For example, data users are responsible for providing their data needs and analytical requirements to the research and design teams. The research and design teams then fulfil their role in the process, which is to conduct the research to develop the appropriate designs to meet user needs. The roles are clear: the data users do not dictate the design of the questions, as the research and design teams are trusted and skilled to produce the right product to meet their needs.

ONS Survey Strategy Delivery Principles

This SSRDP links to the following ONS Survey Strategy Delivery Principles:

• clear ownership and accountability
• decisions at right level

It's our responsibility to make things simpler

We have a responsibility to make our surveys easy to understand and use. We do the hard work to make our surveys simple, which removes that burden from our staff and respondents.

We prioritise the respondent experience because we know that without doing so, we risk us not achieving our goals. We develop surveys that meet the needs of respondents and data users by investing time and resources into the early research, design, and testing phases of a project. We monitor respondent burden and use the insights to inform decision making.

We develop surveys that do not rely upon staff intervention and lengthy help and guidance to get the data we need. Instead, they are clear and highly usable on their own, without the need for much additional support or advice. Through good design we empower our respondents to take part in our surveys and provide us the data we need. We only add additional help where research shows that further support is needed.

ONS Survey Strategy Delivery Principles

This SSRDP links to the following ONS Survey Strategy Delivery Principles:

• products should reduce complexity and burden that exists in the system
• aligns with other initiatives across the organisation
• contribution to survey and ONS strategic outcomes is clear

Follow, reuse, and refresh

We follow best practice and standards in the design and development of our surveys. This ensures our surveys are high quality, modern, and sustainable.

When we have found something that works, we make it reusable and shareable instead of starting from the beginning of the development process every time. Our harmonised standards are good examples of this. This approach ensures others can benefit from the investment made in developing that product. It also avoids duplication of effort and spending of public money on creating the same thing.

Sometimes we may need to take a different approach. It is important to remember that each survey is different and may need bespoke products or solutions. The decision to do something different and stray from best practice and standards is always based on evidence and respondent needs, not assumptions, personal views, or biases. For example, harmonised standards are used as the starting point, but they can be adapted to meet the needs of specific surveys.

We continuously refresh our knowledge and understanding of best practice and standards, which allows us to constantly add to our evidence base. We look to the research of others around the world to inform our work, but we keep in mind the importance of country context.

We refresh our surveys and carry out continuous improvement to our content to ensure they remain relevant.

We use administrative data or other data sources, where available, to reduce survey length, respondent burden and operational demands while improving processing and the quality of statistical outputs.

ONS Survey Strategy Delivery Principles

This SSRDP links to the following ONS Survey Strategy Delivery Principles:

• aligns with other initiatives across the organisation
• products should reduce complexity and burden that exists in the system
• proposals can be sustainably resourced and financed
• decisions are backed up by evidence

Iterate, learn, and share

We take an agile approach to developing our surveys, which helps us avoid the risk of building the wrong thing and finding that out too late. We always test our survey products and processes with respondent users before releasing them. We iterate and refine them based on research insights and not assumptions, which ensures we develop something that meets the users' needs.

We are transparent about what does not work and we abandon these things when our research shows they are not suitable. We then work to find an alternative solution. We always test our survey products along the full end-to-end respondent journey to ensure we are providing our respondents with highly usable and coherent products.

Sometimes we may find a problem with a questionnaire after the live phase. There are several ways we might find this out, for example through interviewer feedback, respondent feedback, or by looking at the amount of imputation needed. When this happens, we flexibly adapt and improve the questionnaire rather than needing to run big re-development projects.

We involve topic experts from inside and outside of ONS to support with:

• the design of our surveys and their products
• how our surveys are run

We share our insights and learnings widely with others internally and externally. We abandon what does not work in favour of finding something that does, and we remember that discovering something does not work is a valid insight. We share prototypes and progress widely to gather feedback from people with different areas of expertise to create better products for our users.

We recognise the importance of bringing our stakeholders on our development journey to ensure successful survey design. This is why we involve them at all stages of a project.

ONS Survey Strategy Delivery Principles

This SSRDP links to the following ONS Survey Strategy Delivery Principles:

• decisions are backed up by evidence
• decisions at right level

Think about the whole service and solve problems as a whole

We see our surveys and the data they generate as an ecosystem. We are aware of the interdependencies of each phase in the data lifecycle and how our decisions at the beginning and throughout affect the quality of the data we collect and produce.

When a survey uses multiple modes for data collection, we do not view or develop each mode separately. Instead, we think about all aspects together and consider them as one so we can create successful surveys that meet respondent needs. When there is a problem, we think about how it affects the whole respondent journey and data lifecycle. We then work to fix the problem accordingly.

We use intelligence from paradata, processing and analysis of data collection to help the future design of our surveys. For example, we learn about where we are doing lots of imputation as this may mean that we need to review the questions and concepts for clarity.

We think holistically about the design and appearance of our products to maximise response and data quality. We ensure all products respondents interact with are consistent in tone and appearance. This helps build trust and build a brand identity.

ONS Survey Strategy Delivery Principles

This SSRDP links to the following ONS Survey Strategy Delivery Principles:

• products should reduce complexity and burden that exists in the system
• contribution to survey and ONS strategic outcomes is clear
• aligns with other initiatives across the organisation

Retaining participants in UK’s COVID-19 Infection Survey (CIS) - Kathryn Littleboy (Office for National Statistics, United Kingdom)


Retaining participants in the UK COVID-19 Infection Survey (CIS)

Kathryn Littleboy CIS Research and Development lead

Office for National Statistics

12 June 2023

Background – The COVID-19 Infection Survey

What is the COVID-19 Infection Survey (CIS)?

• The CIS was considered the gold-standard COVID-19 surveillance survey, stood up at pace in April 2020.

• 448k active participants by October 2021 – a nationally representative sample of UK households

• 3,000 study workers travelling door to door to take swab samples, blood samples and complete a questionnaire on the doorstep, with each household member every month.

Why was participant retention important?

• The longitudinal sample has been valuable in understanding the impact of vaccines over time and infection rates of symptomatic and asymptomatic individuals.

• The CIS is the only longitudinal COVID-19 study so is an extremely valuable asset

• It costs significantly more to recruit new participants to replenish sample (extra print and postage costs)

The problem statement: we needed to significantly reduce the budget due to reduced funding:

• Reduce the sample size but maintain representativity

• Reduce compensation from 21st March (from £25 to £20)

• Move to a more cost-effective remote approach, which would require participants to 'opt in'

How did we identify problems for our participants?

Gathering feedback from participants through:

• Feedback from study workers about pain points (15 qualitative interviews and general feedback)

• Learning from key participant complaints/queries to the call centre

• Findings from the quantitative user feedback survey about general experience >30k responses

• Findings from 13 in-depth qualitative participant interviews

We found:

• Lack of flexibility in when they take part was reducing response rates – especially since the return to work

• Study workers were paraphrasing the questionnaire, reducing consistency

• Awareness of the huge costs and environmental impact of study workers travelling

• Feeling rushed to take samples whilst someone waits on their doorstep

How did we improve the experience to retain participants?

Removal of study workers and the face-to-face design, moving to a digital-first design:

• Be more flexible around when participants take part (14 day window)

• Reduce CO2 emissions (from 10k tonnes per year)

• Be more cost-efficient, by removing the travel time and costs of (circa 3,000) SWs going door to door

• Reduce risk of infection for both participants and SWs through remote data collection

• Reduce interviewer bias by questionnaires being completed primarily online

• Removing time pressure to conduct their biosample tests quickly whilst someone waits

Conducted vast amounts of user research to shape the new design to user needs, make it accessible, and to gather ongoing feedback to iterate and improve:

• 75 qualitative interviews – feedback about questionnaire wording, design, and materials

• 2,356 responses to quantitative feedback surveys to test favourability of different concepts

• Over 20,000 free text responses from the main CIS questionnaire to explain specific issues

How did we ensure success during the transition?

• Used behavioural insights to improve participant materials
Used a more personal tone, emphasising individual contribution, importance and relevance.

• Demonstrated commitment to improvements
By allowing participants to report issues/feedback at the bottom of each page of the questionnaire on an ongoing basis.

• Implemented reminder emails
To increase initial opt-in and monthly response rates.

• Highlighted the value of their participation
By sharing the latest study findings with participants in a quarterly newsletter.

• Continued to collect feedback through a quantitative survey in a quarterly newsletter
To measure and compare satisfaction pre and post transition and on an ongoing basis (>162k responses).

How did we measure success?

• Retained approximately 90% of our sample
- February 2022, before voucher reduction = 441,266 active participants
- July 2022, post transition = 397,354 active participants

• Measuring changes in representativity of our sample:
Participants providing samples through remote data collection and study worker home visits share generally very similar profiles, both when unadjusted and when adjusted, with the majority of differences between these population samples being below 1 percentage point.

Age (absolute difference in percentage points)
Category:    2 to 7  8 to 11  12 to 15  16 to 24  25 to 34  35 to 49  50 to 59  60 to 64  65 to 69  70 to 74  75 to 79  80 and above
Unadjusted:  0.0     0.3      0.4       0.1       -0.8      -0.4      0.0       0.5       0.4       0.2       -0.1      -0.8
Adjusted:    -0.3    0.1      0.3       0.0       -0.1      0.0       -0.3      0.2       0.0       0.4       0.1       -0.4

Sex (absolute difference in percentage points)
Category:    Male   Female
Unadjusted:  0.2    -0.2
Adjusted:    0.0    0.0

Ethnicity (absolute difference in percentage points)
Category:    White  Asian  Black  Other  Mixed
Unadjusted:  0.3    0.0    -0.1   0.0    -0.1
Adjusted:    0.9    -0.4   -0.3   -0.1   -0.2

The latest challenge for retention

• The CIS has been paused whilst UKHSA review their approach to

surveillance, and consider future funding.

• Any further surveillance would need to further reduce costs:

– smaller sample size (but still needs to be representative)

– removal of compensation – concerning impact on representativity

• If we stop collecting data from this valuable longitudinal cohort:

– likely disengagement of the sample

– a data gap, reducing the overall value of the dataset in future

How are we retaining participants right now?

• The ONS has funded a temporary questionnaire-only study, the COVID-19 and Respiratory Infections Survey (CRIS), running from April to June 2023 whilst decisions around the future of the CIS are ongoing.

• Setting up a separate questionnaire-only study meant that we could proceed at pace through the National Statistician’s Data Ethics Advisory Committee (NSDEC), without the need to secure a new Chief Investigator and resubmit a new medical ethics protocol.

• The problems CRIS is trying to solve:

– Prevent a large data gap

– Maintain participant engagement from the valuable longitudinal cohort

– Make the questionnaire more relevant to participants and analytically

– Continue to monitor symptoms, understand the impact of all respiratory infections on daily life, absence from work and use of medical services, and understand the prevalence of long-COVID

– Test, iterate and improve the new questionnaire in new in-house technology

– Move from household to individual sampling to maintain representativity despite a smaller sample

What next?

• The CRIS survey has been running since 11 April. Most participants have completed their first questionnaire, and some are onto their second.

• We are collecting feedback and implementing improvements or adding them to a backlog.

• The first analysis, including results about representativity and sample size, will be published in July 2023.

• We are awaiting decisions on continuation or closedown of the study at the end of June.

Any Questions? Contact name: Kathryn Littleboy

Email: [email protected]



Imputing indices, United Kingdom


Imputing indices

The situation

• Using a 25-month window, GEKS-T, mean splice on published series

• Problem: how to impute elementary aggregates (EAs) when there is a lack of matched products?

• We examine aggregate-level imputation

• May be preferred to imputing prices by ensuring a “neutral” impact on aggregates

The plan

• Simulate the effect of imputation on scanner data (a minimal sketch of this simulation follows below)

• Drop 10% of EAs at random (different seeds give different sets of dropped EAs)

• Impute each dropped EA using the consumption segment (CS) monthly rate

• Calculate residuals = (COICOP5 index with 10% imputed values – original COICOP5 index)
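A minimal pandas sketch of the drop-and-impute step in this simulation, assuming an illustrative month-on-month growth table `ea_growth` with columns `ea_id`, `cs_id` (consumption segment) and `month`; the real exercise works on GEKS-T elementary aggregates feeding COICOP5 indices, so this only shows the mechanics of dropping 10% of EAs and filling them with the CS monthly rate.

```python
import numpy as np
import pandas as pd

def impute_dropped_eas(ea_growth: pd.DataFrame, drop_frac: float = 0.1, seed: int = 0) -> pd.DataFrame:
    """Drop a random fraction of EAs and impute their month-on-month growth
    with the mean growth of the remaining EAs in the same consumption segment."""
    rng = np.random.default_rng(seed)
    eas = ea_growth["ea_id"].unique()
    dropped = rng.choice(eas, size=int(len(eas) * drop_frac), replace=False)

    kept = ea_growth[~ea_growth["ea_id"].isin(dropped)]
    cs_rate = (
        kept.groupby(["cs_id", "month"], as_index=False)["growth"]
            .mean()
            .rename(columns={"growth": "cs_growth"})
    )

    imputed = ea_growth.merge(cs_rate, on=["cs_id", "month"], how="left")
    imputed["growth"] = np.where(
        imputed["ea_id"].isin(dropped), imputed["cs_growth"], imputed["growth"]
    )
    return imputed.drop(columns="cs_growth")

def residual(original: pd.DataFrame, imputed: pd.DataFrame) -> pd.Series:
    """Residual = aggregate index built from the imputed data minus the original
    aggregate index (unweighted here for brevity, base = 100)."""
    def to_index(df: pd.DataFrame) -> pd.Series:
        monthly = df.groupby("month")["growth"].mean().sort_index()
        return 100 * (1 + monthly).cumprod()
    return to_index(imputed) - to_index(original)
```

Repeating this for each seed gives one set of residuals per seed, which can then be summarised as the boxplots described in the results.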

The results

• Eleven different seeds, each giving a residual boxplot

• Means are generally close to 0

• Slight downward bias in the imputed indices

[Figure: boxplot of residuals for each seed; x-axis: different seeds, y-axis: residual (base index = 100)]

Developing reproducible analytical pipelines for the transformation of consumer price statistics: rail fares, UK



Developing reproducible analytical pipelines for the transformation of consumer price statistics: rail fares

Authors: Matthew Price and Diogo Marques, Office for National Statistics

Paper submitted to the Meeting of the Group of Experts on Consumer Price Indices, June 2023

Abstract

At the Office for National Statistics, we are transforming our consumer price statistics by introducing alternative data

sources such as scanner data. This paper discusses the process of developing a new production system that can

integrate these data as part of our consumer price index production.

Firstly, we will discuss the choices made for the infrastructure of the project, including choice of platform and

language, and how this has been designed to aid research as well as ensuring an efficient and user-friendly

production system.

We then discuss how best practice guidance for reproducible analytical pipelines (for example, code structure,

testing, version control and code deployment) have been implemented in the project.

Finally, we focus on more specific areas within the end-to-end system. This includes the steps taken for data

engineering, including the standardisation of data. We will also cover the choices we have made to implement these

within our existing production round, including the steps needed on an annual basis to reset the basket and allow for

the introduction of new consumption segments.

1 Overview

At the Office for National Statistics (ONS), we are currently undertaking an ambitious programme of

transformation across our consumer price statistics, including identifying new data sources, improving methods,

developing systems and establishing new processes. These new data sources allow us to measure inflation from an

improved coverage of price quotes, introduce weights at a lower level of aggregation than before, and allow for

automated data acquisition.

This year, the Consumer price inflation, UK: February 2023 release was the first time our headline measure of

inflation included these new data sources, using expanded data on rail fares. Over the next couple of years, we are

looking to develop the use of these new data sources to include additional categories such as used cars and

groceries, as well as expanding our new system to include the processing of the existing data collection. These

existing data will continue to be used when they cannot be replaced by scanner or web-scraped data, such as for

small independent shops and service providers who do not have a website.

In designing our production systems, we needed to ensure that they were flexible enough so that future categories

of data could be incorporated into the design with minimal additional development. In this paper, we will focus on

how the system works for rail fares but still with an eye to future development. In particular, this includes:

• The team structure and skills required for the transformation project

• The key considerations for the project’s infrastructure and how the platform is designed to facilitate all

aspects of development, research, and production

• The steps taken in engineering supplier data to produce standardised datasets

• A technical summary of how the pipelines used to produce indices are structured

• The end-to-end processes for producing rail fares indices in production

• How coding best practice is implemented into the project to ensure quality systems

2 Team structure for the Consumer Prices Transformation project

The Consumer Prices Transformation project is a multidisciplinary project within ONS, comprising distinct teams

who are experts in their areas.

The owners of this project are the Prices Division within ONS, who oversee the production of the UK’s consumer,

business, and house price statistics. The main project management hub sits within Prices as well as several teams

dedicated to the transformation project. The Data Transformation team is responsible for acquiring and managing

the new alternative data sources. This team is also leading on building in-house web scraping capability. The

Methods Transformation team is responsible for all aspects of methods research in the transformation project,

including the publication of our research updates. The teams are largely resourced by statisticians/methodologists,

with some additional specialism in data science. The Business Change team then works with the CPI Production

teams to implement the agreed systems into the current ongoing monthly production of the consumer price

inflation measures.

Outside of the Prices division, there are other teams working on delivering the technical side of the project. Some of

the key roles are:

• Business analysts who spec out the various requirements from Prices for both the underlying infrastructure

and more general functionality

• Methodologists who specialise in index number theory and support the Prices Methods team in their

research

• IT architects who design the overall system based on the Prices use case and ensure it fits in with the wider

ONS technology base

• Data engineers who manage and build data processing pipelines to ingest and standardise data for

downstream production processes and wider research, build data validation tools and manage part of the

data strategy of the project

• A data acquisitions team who, along with the Data Transformation team in Prices, manage the relationship

with the data supplier and engage with new ones

• Data science/software engineers who are responsible for building the data processing pipelines based on the

requirements specified by Prices

• Testers who ensure that the final development meets the specified requirements

Together, these teams ensure that the final system meets Prices requirements and is developed in line with other

ONS infrastructure to enable sustainable delivery in future.

3 Technical decisions for the project

3.1 Identifying key requirements for the platform

ONS’ cloud strategy follows the Government Digital Service (GDS) Cloud Strategy. The GDS Cloud Strategy promotes

a “Cloud First” policy, whereby public sector organisations should consider and fully evaluate potential cloud

solutions before considering any other options. This strategy strongly advocates the use of infrastructure as a service

(IaaS) and platform as a service (PaaS) offerings from cloud providers.

For the UK ONS Consumer Prices Transformation project, there were several key factors that needed to be

considered when determining the most appropriate cloud platform to host the project on. Alternative data sources

are inherently large data sources and therefore the project needed access to a scalable data storage solution that

allows resilient and efficient storage and tools to query big data. To enable big data processing using Apache Spark,

we also needed to have a distributed computing system that allows the use of more complex data transformations

(such as applying complex methodologies to data) to enrich and aggregate data in a reproducible manner (i.e.

through code that follows the RAP principles).

For the production teams, there is also a need to have dashboards in place to cover processes such as validating the

data being delivered to ONS and visualising the calculated indices. Additionally, it was identified that interactive web

applications would be required to enable accessible user input on enriching and validating certain data sources (such

as in manually labelling, flagging, and amending observations in various data sources).

The final requirement was to have the functionality for researchers to perform ad-hoc analysis and data

manipulations to enable ongoing transformation activity for future data categories alongside live production.

3.2 Use of environments to separate workstreams

To allow the stable processing of the production indices while allowing continued development, the project makes

use of several environments to separate the different stages of work. The configuration for each environment is

managed through code. This means that it is trivial to ensure that each environment has the exact same

configuration of resources with the only difference being the access permission groups being tailored to the

sensitivity of data present.

There are four types of environments within the project as outlined in Table 1.

1) Develop, where the users have complete freedom over all aspects of their individual sandboxes which can be created and destroyed at will. Because of this, this environment is unique in that it is not managed by the infrastructure code. This space is where the development of pipelines occurs, and all data used is synthetic.

2) Test, the first environment where all the separately developed systems should always work together. Here the test team will perform their tests to ensure that the systems and platform meet the agreed business requirements. Again, all data used in this space is synthetic.

3) Pre-Production, the first environment that contains live data and is identical to Production. This is to allow further end-to-end testing like in the Test environment, but now with the real data.

4) Production, the environment where all delivered data (whether it is being used in production or research) is

ingested into the platform and the data processing pipelines run automatically to produce the production

price indices. As this environment contains the up-to-date classifications and mappers, it is also where the

research team do their work exploring the data and investigating new methodology. Permission groups in

this environment are configured so that the researchers are not able to see the current month’s data for

production data sources, nor be able to interact with any of the outputs from the various production data

processing pipelines.

Table 1 Summary of the different environments used.

Develop – Sandbox where teams have full access to explore, test, and do their work. Main users: software engineers, data engineers, infrastructure engineers. Data used: synthetic. Stability: not stable.

Test – Test environment where all systems can work together, allowing testing of individual and multiple systems. Main users: testers. Data used: synthetic. Stability: stable.

Pre-production – Where testing on live data can occur to ensure changes will be stable before moving into production. Main users: business change team. Data used: production (data duplicated from the prod environment). Stability: very stable.

Production – Where production occurs via automated scheduling of pipelines; also where research on data and methods can occur. Main users: production team, research team. Data used: production (where all new data is ingested; permissions set to prevent researchers from seeing “current month” data for production datasets). Stability: most stable.

This structure for the environments creates an intuitive development cycle for new methodologies and pipelines

(Figure 1). First researchers do an initial investigation on a given data source in their enclave within the production

environment. From their analysis and recommendations, they work with the business architects to write out new

requirements to pass to the various development teams. These are then actioned and implemented by the relevant

team in the developer sandboxes. Once that has undergone the appropriate review, new or updated pipelines are

released into the test environment, where they are end-to-end tested. Success here results in runs in the pre-

production environment, and if successful the new pipelines will be promoted into the production environment as

part of the annual process (see Section 6).

Figure 1 Development workflow across environments

4 Data Engineering

To prepare the data sources for use in producing indices, they undergo several data engineering stages: data

ingestion, validation, standardisation, and engineering for general use. The data ingestion stage brings the data into

the cloud project, performs virus checks and decryption where applicable, and moves the data into an output bucket

in both Production and Pre-production environments. Once the data enters the environment, the workflow

orchestration tool Airflow will start and trigger the data validation pipeline. If none of the validation checks fail, the data

engineering pipeline is initiated and it standardises and appends the latest data to a table. The end result of the data

engineering work is to create staged data ready for the data processing pipelines (section 5). All engineering steps

have extensive logging and will also send appropriate notifications to external communication tools so that users can

action any alerts.

4.1 Ingestion

The ingest process uses Apache NiFi, a tool for automating the movement of data between disparate systems, for

ingestion into both our cloud and on-prem systems. Following that, an in-house IT solution is used that does data

integrity checks with supplied manifests, decryption, de-compression and virus scan.

4.2 Data quality – validation pipeline

We have built an in-house data validation package using Python and Spark to replace what were previously weekly

manual data checks. This package has been built off the back of the need to automate manual data checks and

standardise them using tested and scrutinised methods.

The validation checks can either be set to warning or error level. If a validation check raises an error the pipeline

stops the process and notifies the data team instantly of the issue, minimising delays in requesting a redelivery from

the supplier. A warning can be used instead of an error to keep an eye on trends or notify the production team of a

change in the data that may require an action, e.g. with rail fares, a new value in the station column could mean the

station mapper that links stations to regions may need revising. However, warnings do not stop the process from

completing.

We considered a range of different tools when building this and used PySpark because it allows us to create one

method that can scale to work on any data size: in our case, both the smaller regular deliveries and datasets of over 10TB, without the need to build and test separate methods. The validation checks include

schema, min/max row count and expenditure and other more nuanced checks.
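The in-house validation package is not public, but a minimal PySpark sketch of the warning/error pattern described above might look as follows; the check names, thresholds and file path are illustrative only.

```python
from pyspark.sql import DataFrame, SparkSession

class ValidationError(Exception):
    """Raised when an error-level check fails, stopping the pipeline run."""

def check_row_count(df: DataFrame, min_rows: int, max_rows: int, level: str = "error") -> None:
    """Row-count check: 'error' stops processing, 'warning' only notifies."""
    n = df.count()
    if not min_rows <= n <= max_rows:
        message = f"Row count {n} outside expected range [{min_rows}, {max_rows}]"
        if level == "error":
            raise ValidationError(message)  # in practice this also triggers a redelivery request
        print(f"WARNING: {message}")        # in practice this is sent to the production team

def check_new_values(df: DataFrame, column: str, known_values: set) -> None:
    """Warning-level check: flag unseen values (e.g. a new station) so mappers can be revised."""
    observed = {row[column] for row in df.select(column).distinct().collect()}
    unseen = observed - known_values
    if unseen:
        print(f"WARNING: new values in '{column}': {sorted(unseen)}")

if __name__ == "__main__":
    spark = SparkSession.builder.getOrCreate()
    staged = spark.read.parquet("path/to/latest_delivery")  # illustrative path
    check_row_count(staged, min_rows=1_000, max_rows=50_000_000)
    check_new_values(staged, "origin_station", known_values={"LONDON PADDINGTON", "READING"})
```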

4.3 Standardisation

The main data engineering pipelines read and standardise the data from the raw files creating a continuous time

series in the most accessible format as a table, which has both the data and column names in the format expected

by downstream systems. The pipeline also performs some secondary engineering when the raw data doesn’t contain

all the variables needed to produce the indices. It creates new columns that are either derived by applying rules to

existing columns or mapping data from other sources, for example, an external mapper to assign train stations to

Government Office Regions (GORs). Finally, the pipeline also adds junk columns, which are Boolean columns used to

filter out observations. While this can be seen as a data cleaning step, it doesn’t actively remove data but allows the

user to choose to remove these columns as part of the downstream processes (see data processing pipelines). The

junk columns each apply a separate rule so filtering rules can easily be added/removed, for example, whether to

keep transactions that pertain to "business only" travel (that are out of scope for CPI).
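As an illustration of the standardisation and junk-column pattern (not the actual production code), with invented column and mapper names:

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def standardise_rail_fares(raw: DataFrame, station_to_gor: DataFrame) -> DataFrame:
    """Rename supplier columns to the names expected downstream, enrich with a
    region mapper, and add Boolean 'junk' columns for optional filtering."""
    df = (
        raw.withColumnRenamed("orig_stn", "origin_station")
           .withColumnRenamed("tkt_class", "ticket_class")
    )

    # Enrich: map each origin station to its Government Office Region (GOR)
    df = df.join(station_to_gor, on="origin_station", how="left")

    # Junk columns: flags that downstream pipelines may choose to filter on,
    # rather than removing the rows here
    df = df.withColumn("junk_business_only", F.col("ticket_class") == "BUSINESS")
    df = df.withColumn("junk_non_positive_expenditure", F.col("expenditure") <= 0)
    return df
```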

Further reading around some of the data engineering rules applied can be found in our previous article on using

transaction-level rail fares data to transform consumer price statistics.

4.4 Data quality – insights dashboard

To ensure data users always have the best understanding of the data that they're using, we have built a dashboard

with visualisations that provide an up-to-date overview of key metrics and more niche insights that could aid

research, for example, expenditure trends (nationally and by region) and null rates in the data. Providing these

visualisations upfront has many benefits: it provides powerful insights achieved through complex SQL queries to

users that don’t have advanced SQL knowledge. It also saves multiple data users running similar queries on terabytes

of data, saving time and computing resources. Powering the visualisations we have a set of materialised views that

contain the SQL queries needed for the visualisations, either in full or to pre-aggregate the data from the terabytes

large tables to a size that's more manageable for the visualisation tool. Both the views and dashboard are refreshed

within a few minutes of new data being engineered and appended onto the main table.

4.5 Resilient and efficient storage

To ensure downstream processes use their resources and data most efficiently, all tables are partitioned, and we

apply other methods such as clustering where needed. ONS back up all raw data in a separate service and all tables

have a type of version control that allows any table to be rolled back to a previous snapshot, or a deleted table to be

reinstated within an agreed window of time.
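For example, appending a standardised delivery to a month-partitioned table in PySpark might look like the following sketch; the table and column names are illustrative.

```python
from pyspark.sql import DataFrame

def append_to_staged_table(df: DataFrame, table_name: str = "prices.staged_rail_fares") -> None:
    """Append the latest standardised delivery, partitioned by month so that
    downstream reads only scan the partitions they need."""
    df.write.mode("append").partitionBy("month").saveAsTable(table_name)
```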

5 Data processing pipelines

The data processing pipelines transform the staged data sources (or one of the derived tables

detailed in Table 2) to further enrich, clean and aggregate the data to produce a full set of indices from the

elementary aggregate level to the overall headline inflation rate. More information about this process can be found

in Section 6: Producing indices for production.

5.1 Pipeline code structure

The data processing pipelines serve dual purposes of being used by the researchers to inform decisions on what

methodologies to implement in production, as well as the pipelines used in production. To allow this flexibility each

pipeline is controlled by a configuration file which contains settings for all options in each pipeline. For example, in

the data cleaning pipeline, the configuration file allows the turning on of several different outlier methods and

associated parameters. In the elementary indices pipeline, over 20 permutations of index methods can be applied to

the data. Researchers can explore a wide parameter space at will to make the most informed decisions by changing

the values in this file at run time. In production, the configuration files are locked down to only the chosen methods

for publication with no option to change the values.

Each data processing pipeline is designed to the same template. Following this template allows us to quickly setup

new systems when required for new methodologies or data sources. Furthermore, it means developers and end

users can familiarise themselves with a new pipeline based on their knowledge of other pipelines in the project.

The general structure of a pipeline is to have a “main” script that operates as follows (a minimal illustrative sketch follows the numbered steps):

1. Take a minimum amount of command line inputs (file paths for a “user” and “backend” configuration file,

the name of the user running the code [for audit purposes], and the desired level of logging [as code

contains substantial debug logs that can be used to diagnose issues]).

2. Read in and validate the “user” configuration file which contains options such as what data source to use,

the date range of the data to use, which methods to apply etc.

3. Read in and validate the “backend” configuration file which contains information that would not need

changing between pipeline runs such as the table paths for inputs and outputs or the stratification for a data

source.

4. Read in all the data needed for the pipeline (using the information from the various configuration files).

5. Pass all relevant configuration parameters as well as the data through to the “core” pipeline code that will

perform all the data transformations of the pipeline (e.g. any pre-processing, applying any methodologies,

performing any aggregations).

6. Add a unique run identification to the output from the previous step.

7. Append the output to the appropriate table (where the unique run ids will allow distinguishing data between

runs).

8. Append key information (e.g. the version of the pipeline, the user and timestamp for the submitted run and

the configuration files used) about the run to a separate pipeline run log table to allow auditing of pipeline

runs.
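A stripped-down sketch of such a “main” script is shown below; the `pipeline` helper modules, argument names and YAML configuration format are hypothetical stand-ins for the project's real interfaces.

```python
import argparse
import logging
import uuid

import yaml
from pyspark.sql import functions as F

from pipeline import core, io_utils, validation  # hypothetical project modules

def main() -> None:
    # 1. Minimal command line inputs
    parser = argparse.ArgumentParser()
    parser.add_argument("--user-config", required=True)
    parser.add_argument("--backend-config", required=True)
    parser.add_argument("--run-user", required=True)   # recorded for audit purposes
    parser.add_argument("--log-level", default="INFO")
    args = parser.parse_args()
    logging.basicConfig(level=args.log_level)

    # 2-3. Read in and validate the user and backend configuration files
    with open(args.user_config) as f:
        user_cfg = yaml.safe_load(f)
    with open(args.backend_config) as f:
        backend_cfg = yaml.safe_load(f)
    validation.validate_configs(user_cfg, backend_cfg)

    # 4. Read in all input data listed in the configuration files
    input_data = io_utils.read_inputs(user_cfg, backend_cfg)

    # 5. Core pipeline code: pre-processing, methods, aggregation (no I/O in here)
    output = core.run(input_data, user_cfg)

    # 6. Tag the output with a unique run id
    run_id = str(uuid.uuid4())
    output = output.withColumn("run_id", F.lit(run_id))

    # 7-8. Append the output and an audit record of the run
    io_utils.append_to_output_table(output, backend_cfg)
    io_utils.log_run(run_id, args.run_user, user_cfg, backend_cfg)

if __name__ == "__main__":
    main()
```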

Developing the pipeline code in such a manner also has the added benefit of separating all the methodological parts

of the code from the input and output code. This is of importance, as given all the code is written in open-source

languages, the only parts of the code that would not be immediately portable to a different cloud platform are the

parts that perform the input and output functions. By placing these at the very start and end of the code only, the

systems can be readily ported to another platform without having to disentangle these I/O processes from deep

within the code base. It also means that retesting a pipeline in such a situation could be performed with ease as the

underlying methods’ code will not need altering to account for any code changes required for the different cloud

platform.

5.2 Pipeline tables

To have a clear audit trail for all the data generated by the pipelines, several tables are generated during the

production of the indices (as discussed in section 6). A summary of these tables and where they are created can be

found in Table 2.

Table 2 Overview of main data tables produced by systems

Staged – The standardised form of the data delivered by suppliers. Produced by: staging of raw data files delivered to ONS by the data engineering pipelines.

Cleaned – The data after applying any pre-processing, with additional columns denoting whether a row is identified as being an outlier or not based on the chosen data cleaning methods. Produced by: data cleaning pipeline.

Invalid strata – A mapper that denotes, for each elementary stratum, whether an index is calculable at the start of the annual round. Produced by: invalid strata pipeline.

Processed – The aggregated monthly price and quantity observation for each unique product. This is the data used to calculate the elementary aggregate indices. Produced by: elementary indices pipeline.

Elementary aggregate indices – The monthly price indices for each elementary aggregate. Produced by: elementary indices pipeline.

Aggregated indices – The full set of indices from elementary aggregate, aggregated all the way up to COICOP1. Includes additional information on growth rates and contributions. Produced by: index aggregation pipeline.

Storing the data like this allows deeper interrogation by researchers or production users if anything is identified in

the final output. As an example, if there was a sudden drop in a particular strata’s index, all the data that fed into

that value can be explored from the initial data as provided by the supplier, to the outcome of the data cleaning

methods on the data, through to the final monthly price and quantity observations for each product in the strata.

6 Producing indices for publication

Given we will restrict the number of consumption segments to only include those that have representative items in

the traditional collection, we will also need to continue the process of annually updating our basket of goods and

services to include new consumption segments to reflect changing consumption patterns. The production system

has been designed to be run in an annual round, which initialises the data for subsequent monthly rounds, which are

used to produce the monthly outputs.

The workflow outlined in Figure 2 primarily relates to the processing of the rail fares alternative data sources but

forms the standard workflow that will be applied to future data categories (e.g. used cars or groceries), and allows us

to integrate alternative data sources more readily with traditional data sources. More detail on our hierarchies and

how we are integrating new and traditional data sources is discussed in Integration of alternative data into consumer

price statistics: the UK approach.

Figure 2 Overview of pipeline workflow for the annual and monthly rounds.

6.1 Annual processing

We will only introduce new consumption segments (or new alternative data source retailers) where we already

have a 25-month period of data so we can consistently use a 25-month window for all alternative data based

elementary aggregate indices. This means that every year the consumption segments will essentially be reset, the

data re-classified to the new consumption segments over a historic 25-month period, and then the new index for the

following 13 months would splice onto this recalculated historic index.

The annual round therefore initialises a 25-month historic series based on the classification system in the current

year. For example, given a January 2024 base, the system would initialise the data from January 2022 through to

January 2024. This is important if any classifications have changed or any new consumption segments have been

added to, or removed from, the CPI basket of goods and services.

A 25-month period of data are first cleaned (data cleaning pipeline – see “Outlier detection for rail fares and second-

hand cars dynamic price data” for details on our chosen data cleaning methodology for rail fares) before it is decided

whether a stratum index can be appropriately calculated based on the presence of bilateral pairs of price relatives

(invalid strata pipeline). If no bilateral pair exists in the January of the current year, this is considered an invalid

stratum as it’s unlikely that an index will be calculable in the following months. This is output as a mapper file that

can be reviewed, monitored, and amended if necessary.

The data cleaning pipeline is then run again, removing any data within the 25-month historic series that pertain to an

invalid stratum. Now our data should only contain products from which we would want to calculate an index.

The cleaned data are then used to produce elementary aggregate indices (elementary indices pipeline) for the initial

25-month window that will be used to splice onto during the monthly round. For rail fares we use the GEKS- Törnqvist index method (see “Introducing multilateral index methods into consumer price statistics” for a technical

explainer).
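A compact pandas sketch of the GEKS-Törnqvist calculation for a single elementary aggregate, assuming one row per product and month with `price` and `expenditure` columns; the production implementation runs in PySpark and adds the splicing step, which is omitted here.

```python
import numpy as np
import pandas as pd

def tornqvist(df: pd.DataFrame, a: str, b: str) -> float:
    """Bilateral Törnqvist index between months a and b over matched products."""
    pa = df[df["month"] == a].set_index("product_id")
    pb = df[df["month"] == b].set_index("product_id")
    matched = pa.index.intersection(pb.index)
    if len(matched) == 0:
        return np.nan  # no bilateral pair: index not calculable from these two months
    pa, pb = pa.loc[matched], pb.loc[matched]
    sa = pa["expenditure"] / pa["expenditure"].sum()
    sb = pb["expenditure"] / pb["expenditure"].sum()
    return float(np.exp(((sa + sb) / 2 * np.log(pb["price"] / pa["price"])).sum()))

def geks_tornqvist(df: pd.DataFrame, window: list) -> pd.Series:
    """GEKS-Törnqvist index for each month in the window (first month = 100):
    the geometric mean of bilateral comparisons chained through every link month."""
    base = window[0]
    out = {}
    for t in window:
        links = [tornqvist(df, base, link) * tornqvist(df, link, t) for link in window]
        out[t] = 100 * float(np.exp(np.mean(np.log(links))))
    return pd.Series(out)
```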

The processed data (output from the elementary indices pipeline) are then used in the annual outputs pipeline to

calculate the stratum-level weights for the year (when created from the data directly – some stratum-level weights

are derived offline using existing methods and ingested to be used alongside the other weights).

The aggregation pipeline can then use the output from the elementary indices pipeline and the annual outputs

pipeline, along with weights, to aggregate the indices and impute any remaining missing indices in the 25-month

historic window.

Now we have initialised our historic back series for all the data and produced the required weights, we can use this

information in the monthly round.

6.2 Monthly processing

The first step in the monthly round process is where the data cleaning pipeline cleans the latest month of staged

data using the new staged data and the cleaned back series. Next, the elementary indices pipeline uses the cleaned

data to produce a new index based on the most recent month of data along with 24 historic months of data from the

‘processed’ data output in the previous month. This produces new elementary indices with a 25-month window i.e.

the monthly round in February 2024 would initially produce an index from February 2022 to February 2024. These

are then spliced onto the series’ produced in the index aggregation pipeline from the annual round (in February) or

the index aggregation module from the monthly round (in the remaining months for that year), to ensure no

revisions occur.

These elementary indices are then passed on to the aggregation pipeline to aggregate to higher levels, using the

previously constructed weights. Indices are all re-referenced to Jan=100 to align with, and allow aggregation with,

traditional data sources. Any missing index values are imputed. For the UK, missing stratum indices are imputed

based on the consumption segment level index. Missing consumption segment level indices are imputed based on

the COICOP 4 level index. Missing COICOP indices are imputed based on their nearest parent. More detail on our

aggregation and imputation methods can be found in: Introducing alternative data into consumer price statistics:

aggregation and weights.
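The re-referencing and hierarchy-based imputation can be illustrated with a simple pandas sketch; the movement-based fill below is a simplified illustration rather than the exact published methodology, which is described in the aggregation and weights article referenced above.

```python
import pandas as pd

def re_reference(index_series: pd.Series, base_month: str) -> pd.Series:
    """Re-reference an index series so that the January base month equals 100,
    allowing aggregation alongside traditionally collected indices."""
    return 100 * index_series / index_series.loc[base_month]

def impute_from_parent(child: pd.Series, parent: pd.Series) -> pd.Series:
    """Fill a missing child-level index (e.g. a stratum) by carrying forward the
    movement of its parent index (e.g. the consumption segment)."""
    filled = child.copy()
    months = list(filled.index)
    for prev, month in zip(months, months[1:]):
        if pd.isna(filled.loc[month]):
            filled.loc[month] = filled.loc[prev] * parent.loc[month] / parent.loc[prev]
    return filled
```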

6.3 Research capabilities

Alongside the processes for creating the publication indices, researchers need the ability to mimic the production

cycle in order to perform analysis on the impact of proposed changes to the output inflation numbers. Reasons to

perform such analysis include: to see the effect if different methodologies are used, to investigate the incorporation

of a new alternative data source or category, or to check the effects of reclassifying data. The split permissions set in

the production environment mean that researchers can access all the data they need to perform this analysis

without interfering with the production tables in a space where they have full control over what additional data and

methods they utilise.

7 Implementing best practices to quality assure systems

The data engineering and data processing pipelines built for this project have been developed in line with the

Reproducible Analytical Pipelines (RAP) strategy defined by the UK Government Analysis Function (GAF). This

strategy outlines the software engineering and infrastructure best practices to ensure that pipelines are

reproducible, auditable, efficient, and high quality. The following sections detail how the UK ONS Consumer Prices

Transformation project meets and exceeds the minimum standard for a RAP.

7.1 Implementing RAP for the code

7.1.1 Minimise manual steps

Due to the size of the data being used there is naturally little opportunity for tools like spreadsheets, which would

involve manual steps, to be present in the workflows. The only area where manual steps are involved is in the

creation of mapper files that modify behaviour within the pipeline. For example, we may use a mapper to define the

link from lower-level ONS-specific aggregates (described as consumption segments) to COICOP5. This is a manual

process as these mappings can be revised on an annual basis as part of the annual basket refresh (see section 6). To

minimise any unintended consequences, the processes for creating these files are documented and these inputs are

validated when loaded into the platform to ensure any issues are caught. Using mappers also allows us to avoid

annually updating hard-coded references, reducing the risk of errors by having to update the code directly.

7.1.2 Built using open-source software

All the systems are written using the Python language. Due to the size of the data being processed, the pipelines are

also built to leverage distributed computing using Apache Spark and its Python API, PySpark.

7.1.3 Quality assurance processes with peer review

Business analysts work with the Prices methods and business change teams to define new functionality as

requirements with clear acceptance criteria, captured via Jira tickets. Each requirement is then coded by the relevant

data or software engineer, and the code is then peer reviewed by other developers within the teams. Doing this

allows both quality assurance and dissemination of technical knowledge across teams.

7.1.4 Use of version control software

All code is kept in git repositories on the GitHub platform.

The transformation work requires research and impact analysis to be carried out using a breadth of methods before

selecting the final publication method. Because of this, during development new methods are added via "Research"

releases of the pipeline. Once the methods have been determined and accepted by stakeholders, those methods will

be formally tested via "Release candidate" releases, which after passing will be promoted to "Production" releases.

This workflow is managed via the use of Jira tickets to ensure traceability in what functionality is available in each

release.

Table 3 Overview of releases used for the Consumer Prices Transformation project

Production (annual) – Versions of the pipeline that have been fully tested for all methods required to produce publication outputs. These can only be changed in exceptional circumstances due to the annual production round of producing the indices.

Release candidate (annual) – Versions created prior to the release of a new production version of a pipeline, as the pipeline must undergo formal testing before it can be signed off as ready to enter production. To enable this, these release candidates of the code are produced for use by the testing team.

Research (monthly) – Versions of the pipeline that have undergone testing by the developers, but not the formal test process that the production releases undergo, as the methods introduced may not be used in production. These versions are produced every month between annual rounds for a pipeline that is in active development.

Bug patch (ad hoc) – Versions that are created only when required due to bugs being identified in a Production or Research version of a pipeline. For Production bug patches, the code must undergo another round of formal testing by the test team. Research bug patches only undergo standard developer testing.

This development of code is managed through a GitFlow workflow. Using the GitFlow branching strategy aids the

development of "Production" and "Research" releases simultaneously (see Figure 3) and creates a dedicated space

for functional testing to be conducted. Using git tags to "tag" pipeline releases across these branches, new versions

can be released to the processing platform via CI/CD (see section 7.1.9).

Figure 3 Example of the GitFlow workflow.

7.1.5 Open source of code and data used

Currently, the code is not publicly available due to the confidential information that must be present in some of the

code. There is however the intention to release code packages of the various methodologies in the future. The UK

data used in calculating our consumer price inflation statistics are commercially sensitive and cannot be shared.

7.1.6 Follow existing good practice guidance set by department

The Consumer Prices Transformation project builds on the best practice guidance set within ONS, but as the first

transformation project of its kind in the organisation, it is also setting best practices. To ensure the quality of the

choices being made on the project we have engaged with the GAF team to provide external assurance.

7.1.7 Fully documented code

All the project code is documented within the pipelines, with this supplemented by documentation covering a range

of aspects of the project from platform design, data dictionaries to code release patterns kept in a central project

wiki hosted on the ONS intranet.

7.1.8 Testing

The pipelines and systems are all thoroughly tested in several ways. For each pipeline, all the code undergoes:

• Unit testing – ensuring that each individual function behaves as expected. These tests are typically

constructed by the developer.

• Component testing – ensuring that each “module” of code behaves as expected. A module can be thought of

as an entire method which is made up of several functions. In the instance where a method is a single

function the component and unit test would be the same thing. The data used for these tests is provided by

the appropriate methodologist.

• Module integration testing – this ensures that the sequence of modules interact with each other as

expected. An example would be seeing that a pipeline can run end to end (but not confirming if the results

are valid).

• Functional testing – this confirms that each pipeline and its output meets the defined requirements as

written. These tests confirm for example, that for a given input data source and input parameters the

pipeline produces the correct expected output (including negative testing).

• Integration testing – confirming that a sequence of pipelines all run together. For this project, as the output

of one pipeline feeds directly into another, these tests would confirm the entire end-to-end process from

new data being received through to the production of indices.

• Acceptance testing – the final checks to ensure that the entire processes meet the needs of every defined

business scenario. For example, “new data are delivered, are new indices produced and alerts sent” or “bad

data are delivered, are the systems stopped and alerts sent”.

Several of these test stages are automated using the continuous integration tools.
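As an illustration of the unit-test level, a self-contained pytest example in the style described above; the helper function and figures are invented for the example rather than taken from the project code.

```python
# test_re_reference.py
import pandas as pd
import pytest

def re_reference(index_series: pd.Series, base_month: str) -> pd.Series:
    """Toy helper under test: scale an index series so that base_month = 100."""
    return 100 * index_series / index_series.loc[base_month]

def test_re_reference_sets_base_month_to_100():
    series = pd.Series({"2024-01": 105.0, "2024-02": 107.1})
    result = re_reference(series, base_month="2024-01")
    assert result.loc["2024-01"] == pytest.approx(100.0)
    assert result.loc["2024-02"] == pytest.approx(102.0)
```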

7.1.9 Continuous Integration and Continuous Deployment (CI/CD)

Using CI tools we automate the checking of unit, component and integration tests for every change made to each

codebase prior to being incorporated into the pipeline. These checks also cover ensuring consistent code style and

code documentation. The CD tools for the project ensure that the packaging and deployment of code to the

production platform is automated in a reproducible manner. The deployment of code is triggered by the presence of

tagged releases of the code base.

7.1.10 System logging

All pipelines utilise logging across a range of severities (i.e. debug, info, warning, error) to allow robust

traceability of processes occurring. Information logging includes information such as how many rows are removed by

a filtering command to allow quick diagnosis of data issues if more rows are dropped than expected, or the schemas

of data being read or written to tables. Debug statements are switched off in systems by default but can be turned

on in the case of runs that error to allow better diagnosis of issues. Warnings are used to alert for behaviour which

while not terminal is not expected behaviour, whilst errors are reported when issues occur that prevent a pipeline

from proceeding further. These logs are captured both within the cloud platform, but also relevant alerts are sent to

the production team to inform them of successes and failures of the systems.
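An illustrative sketch of this style of logging around a PySpark filtering step; the logger name and the warning threshold are examples only.

```python
import logging

from pyspark.sql import DataFrame

logger = logging.getLogger("prices.pipeline")  # illustrative logger name

def drop_junk_rows(df: DataFrame, junk_column: str) -> DataFrame:
    """Remove rows flagged by a junk column, logging how many were dropped so
    data issues can be diagnosed quickly if more rows disappear than expected."""
    before = df.count()
    filtered = df.filter(~df[junk_column])
    removed = before - filtered.count()
    logger.info("Dropped %d rows flagged by '%s' (%d remaining)", removed, junk_column, before - removed)
    if before and removed > 0.5 * before:
        logger.warning("More than half of the rows were removed by '%s'; check the staged data", junk_column)
    return filtered
```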

7.2 Implementing RAP for the platform

7.2.1 Automated pipelines

The project makes use of Apache Airflow to manage workflow orchestration in several ways, minimising user input and error in the normal production of monthly indices. Event-driven workflows are in place to handle the

staging into tables of new data when new data files are ingested onto the cloud platform. Time based workflows are

used to handle the automated scheduling of the production of indices by passing the staged data through the several

data processing pipelines required for producing indices.
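A minimal example of a time-based Airflow DAG chaining the data processing pipelines in sequence; the DAG id, schedule and callables are illustrative rather than the project's actual orchestration code.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def run_data_cleaning():
    ...  # would submit the data cleaning pipeline

def run_elementary_indices():
    ...  # would submit the elementary indices pipeline

def run_index_aggregation():
    ...  # would submit the index aggregation pipeline

with DAG(
    dag_id="rail_fares_monthly_round",
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 6 2 * *",  # example: 06:00 on the 2nd of each month
    catchup=False,
) as dag:
    clean = PythonOperator(task_id="data_cleaning", python_callable=run_data_cleaning)
    elementary = PythonOperator(task_id="elementary_indices", python_callable=run_elementary_indices)
    aggregate = PythonOperator(task_id="index_aggregation", python_callable=run_index_aggregation)

    clean >> elementary >> aggregate
```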

7.2.2 Reproducible infrastructure

The entire cloud platform infrastructure is prescribed using Terraform. By doing this all aspects of the cloud platform

from the tools and datasets to user permissions are prescribed in code and can be easily audited, reproduced or

expanded.

8 Future developments

The roadmap for the transformation of UK consumer price statistics has several data categories in active

development. These data categories will reuse the workflow outlined in this paper for rail fares to produce their own indices. Due to the modular nature of the approach we have taken in our systems design, it

will be trivial to add extra pipelines if needed for a data category at any point in the data journey. For instance,

research has already identified the need for pipelines to perform automated classification for the clothing data

category, or for identifying potential relaunched products in the grocery data category.

Developing reproducible analytical pipelines for the transformation of consumer price statistics: rail fares, UK


Developing reproducible analytical pipelines for the transformation of consumer price statistics: rail fares

Matthew Price Technical Lead

8 June 2023

Continuous programme of improvements for consumer

price statistics over several years beginning with rail fares

Aims:

• Obtaining robust sources of alternative data

(scanner/web-scraped data)

• Researching methodologies to most effectively

incorporate the data

• Developing statistical systems for existing and new

data and methods

• Embedding new systems and processes

Primarily, new data will help us to inform the narrative

around what is driving inflation for our users

Transforming UK consumer price statistics

[Chart: CPIH division weights, highlighting the new alternative data sources – rental prices (~24% of CPIH), grocery scanner data (13% of CPIH), and Rail Delivery Group (transaction) and Auto Trader (web provided) data (2.3% of CPIH)]

ONS and UK Government platform strategy

• UK Government Cloud Strategy is “Cloud First” and

“Cloud Agnostic”

• Aim to use “Infrastructure as a Service” and “Platform as

a Service”

• ONS utilise a range of in-house and cloud platforms

Key requirements for platform

• Secure

• Scalable data storage

• Distributed, scalable compute system

• Dashboard capabilities

• Ability to host web applications

• Interactive research space

Develop – Sandbox environments where teams have full access to explore, test, and do their work. Main users: software engineers, data engineers, infrastructure engineers. Data used: synthetic. Stability: not stable.

Test – Test environment where all systems can work together allowing testing of individual and multiple systems. Main users: testers. Data used: synthetic. Stability: stable.

Pre-production – Environment where testing on live data can occur to ensure changes will be stable before moving into production. Main users: business change team. Data used: production (data duplicated from prod environment). Stability: very stable.

Production – Environment where production occurs via automated scheduling of pipelines. Also, where research on data and methods can occur. Main users: production team, research team. Data used: production (where all new data is ingested, permissions set to prevent researchers seeing “current month” data for production datasets). Stability: most stable.

Platform environment strategy

Platform environment strategy

Also aids the development of

new data sources, methods

and pipelines

Data engineering

• Data sources are delivered as files in specific style for

each supplier

• Data engineering stages data prior to processing by:

• Virus scanning and ingesting data

• Validating data against known metrics

• Enrich data by applying appropriate mappers

• Applying standardisation to each source

Data processing pipelines

• Several distinct pipelines make up the full system

• e.g. rail fares: data cleaning > elementary indices > aggregation

• Each pipeline follows the same code structure

• Controlled by a user and backend configuration file

• Each pipeline produces output for straightforward audit

of data journey

Producing indices for production

• Production works with an “annual

update”

• Annual round in February initialises data

back series and calculates weights for

next 12 months

• Monthly round (Feb – Jan) updates back

series and produces the new indices

RAP – Reproducible analytical pipelines

• UK Government developed best practice principles

for analytical systems

• Guidelines aim to:

• improve the quality of the analysis

• increase trust in the analysis by producers, their

managers and users

• create a more efficient process

• improve business continuity and knowledge management

RAP for code

• Minimise manual steps

• Use open source software

• Peer reviewed

• Uses version control

• Open sourced code and data

• Follows department good

practice

• Well documented

• Tested

• Uses CI/CD

• Appropriate logging

RAP for platform

• Automated pipelines

• Restricted access to production

• Reproducible infrastructure

Future developments timeline

• Multiyear transformation project

• Systems design template will allow

scaling out of systems easily

• Rolling out new categories every year

Thank you