Machine learning paves the way for modern, efficient statistical production

15 April 2021

International collaboration facilitated by UNECE is helping statistical organizations around the world move towards producing essential statistics in innovative ways based on machine learning (ML) and artificial intelligence (AI).

The Machine Learning 2021 Group, led by the United Kingdom’s Data Science Campus in collaboration with UNECE, is driving efforts to take ideas about machine learning and artificial intelligence from the realms of science fiction towards a reality that can contribute to cheaper, faster and more accurate statistics for crucial decision-making. A year-long initiative encompassing research, knowledge-sharing and capacity development is aiming to put ML front and centre in international efforts to modernize the way that statistics are produced.

ML is already part of our day-to-day lives

ML and AI might seem like buzzwords, jargon terms relevant only to the world of tech geeks in the software industry. But we live in a digital world, and the truth is that tools based on ML and AI are all around us: the news apps we scroll through as we begin our day; the spam filters that keep junk emails at bay in our work inboxes; the social media that connect us with family and friends; the movie and television streaming sites with which we relax at the end of our day. From dawn until dusk, our days are touched by the power of this new technology.

A data revolution

ML has successfully entered every corner of our lives, not only thanks to increases in computing power and advances in methods, but also due to a key phenomenon of recent years, known as the data revolution. The amount of digital data being created is growing at an unprecedented rate, as more and more services become digitalized and knowledge is increasingly being put on the web. Machine learning depends on data: computers are fed data and instructed to look for patterns, so the more data there is, the more scope there is for machines to identify these patterns, or ‘learn’ from the information they are fed.

ML in official statistics: a cautious approach needed

The process of extracting patterns from data in ML is not so different from the core business of national statistical offices (NSOs), which traditionally process data from surveys, registers and administrative sources to produce the official statistics we depend upon, such as GDP, employment rates and demographic figures. Indeed, many of the techniques used in ML have their roots in the statistical methods that NSOs have been using for decades. Expanding the scope of NSOs’ work to include these new techniques offers the potential to speed up processes that currently take a lot of time or human intervention, as well as the possibility of reducing the response burden on people asked to answer lots of questions in surveys. For example, ML can be used to classify the jobs people hold and the industries they work in based on their answers to open-ended survey questions, an approach being piloted in Canada, Mexico, Serbia and Iceland.

Coding and classification are essential to ensure data gathered from people or businesses are comparable nationally and internationally—but these processes are resource-intensive, often involving human experts reading responses and assigning codes to them. However, with ML this process can largely be automated. Human experts first work on a small subset of the entire dataset. ML is then used to classify the rest by learning from the pattern of the experts’ work. Such automatic coding leads to faster release of figures, making them more valuable to end users.
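The workflow described above can be illustrated with a minimal sketch. It assumes a simple naive Bayes text classifier written with the Python standard library only; the category codes, the example responses and the function names (`train`, `classify`) are hypothetical and are not taken from any of the pilot studies, which may use quite different methods.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase a free-text response and split it into words."""
    return text.lower().split()

def train(labelled):
    """Learn word frequencies per code from expert-coded (text, code) pairs."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocab = set()
    for text, code in labelled:
        tokens = tokenize(text)
        word_counts[code].update(tokens)
        class_counts[code] += 1
        vocab.update(tokens)
    return word_counts, class_counts, vocab

def classify(text, model):
    """Return the most probable code for an uncoded response."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_code, best_score = None, float("-inf")
    for code in sorted(class_counts):
        # Log prior: share of expert-coded responses carrying this code.
        score = math.log(class_counts[code] / total_docs)
        total_words = sum(word_counts[code].values())
        for token in tokenize(text):
            if token not in vocab:
                continue  # words never seen in training carry no evidence
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[code][token] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_code, best_score = code, score
    return best_code

# Hypothetical subset coded by human experts.
expert_coded = [
    ("primary school teacher", "EDUCATION"),
    ("teaches maths at secondary school", "EDUCATION"),
    ("hospital nurse", "HEALTH"),
    ("nurse in a care home", "HEALTH"),
    ("bricklayer on building sites", "CONSTRUCTION"),
    ("site carpenter for house building", "CONSTRUCTION"),
]

model = train(expert_coded)

# The remaining, uncoded responses are classified automatically.
for response in ["school teacher of maths", "nurse at the hospital"]:
    print(response, "->", classify(response, model))
```

In practice an NSO would train on far more expert-coded responses and would typically route low-confidence predictions back to human coders rather than accepting every automatic code.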

In spite of the great promise it offers, the use of ML for official statistics requires a very cautious approach. NSOs operate in a unique way which differs from private sector companies. The Fundamental Principles of Official Statistics require that their statistics be produced in a scientifically sound, reliable, transparent and reproducible manner. New technologies and techniques must be harnessed in ways that maintain public trust. And while NSOs should of course experiment with new ideas to try and streamline their processes and improve their products, they are not as free as others are to try out new ideas and simply drop them if the results start to go awry: once they publish figures as official statistics, people depend on their accuracy and continuity.

UNECE helps statistical organizations around the world to advance the use of ML

UNECE’s High Level Group for the Modernisation of Official Statistics (HLG-MOS) is at the forefront of global efforts to modernize official statistics. In 2019 the group took up the challenge of investigating how ML could be harnessed to help NSOs improve the production of official statistics. The resulting Machine Learning Project, which concluded in December 2020, involved over 120 people from 23 countries. After conducting 19 pilot studies over two years, the project’s final report concludes that ML is not just a buzzword or a passing phase for official statistics; the potential it holds is genuine and huge.

The project members recognized, though, that on the flip side of these benefits there are legitimate concerns. Rigorous quality standards are a cornerstone of official statistics, and those produced with the help of ML can be no exception. Quality, in statistical terms, doesn’t just mean accuracy but also covers other aspects such as timeliness and cost-effectiveness. The new techniques of ML create even more dimensions against which quality must be assessed, such as explainability and reproducibility. Many of the fears expressed by the public about the ethics of ML stem from the intense complexity of algorithms, sometimes thought of as a ‘black box’ since it can be so hard to see what is going on inside. If users cannot understand how decisions are made by ML, they may not feel confident in the outcomes. Indeed, not only the end users but also the NSOs using ML need to be able to understand the contents of the ‘black box’, to ensure that ML does not ‘make correct decisions for the wrong reasons’ or perpetuate accidental bias in datasets. The HLG-MOS project developed a Quality Framework for Statistical Algorithms, aiming to guide NSOs through this minefield of quality assurance when using ML.

International collaboration to ensure the promise is fulfilled

There is still a long way to go before the potential of ML for modernizing official statistics is truly harnessed. Even after successful pilot projects in many UNECE countries, the ML-based solutions have not necessarily been adopted as an integral part of the NSOs’ business. The barriers to successful integration, such as out-of-date IT systems and a siloed working culture, were investigated as part of the project. By coordinating this worldwide network of experts, UNECE’s HLG-MOS has begun to foster a shared understanding of these barriers and possible solutions. In 2021 the effort continues through a range of work packages. The United Kingdom is leading a strand on ethical use of ML in statistics; the IMF, Mexico, Sweden and others will focus on integrating ML into production; Finland will explore issues of quality in the datasets on which ML algorithms are trained; and Mexico will lead continued endeavours to establish an internationally agreed quality framework for ML in official statistics.
