
Introduction of Research in Japan Regarding AD/ADAS

NATIONAL TRAFFIC SAFETY and ENVIRONMENT LABORATORY

(NTSEL)

Item 11. Exchange of views on guidelines and relevant national activities.

Submitted by the expert from Japan
Informal document GRVA-18-35
18th GRVA, 22-26 January 2024
Provisional agenda item 11

1

Introduction of NTSEL

[Organization chart: UN-ECE (WP29), MLIT*, JASIC**, NTSEL]

*MLIT: Ministry of Land, Infrastructure, Transport, and Tourism
**JASIC: Japan Automobile Standards Internationalization Center

-Missions of NTSEL-

◆ Comprehensively address various motor vehicle-related issues
• Prevent the circulation of vehicles not compliant with regulations via type approval tests
• Respond to recalls faster and more reliably via recall-related technical verification of motor vehicles
• Support the government's policymaking and regulation development relating to safety and the environment via tests and studies

◆ Support local transportation systems
• Provide technical support for technical evaluation and standard development for transportation systems via tests and studies

◆ Ensure international coordination
• Provide technical support for the promotion of Japanese automotive technology as part of international regulations
• Provide technical support for the promotion of Japanese railway technology as part of international standards

2

Introduction and contents

<Contents of this presentation>
1. Indoor VR Testing System
2. Artificial Rainfall Device
3. Negligence Requirements Based on Case Analysis

<Research topics> To achieve safe & convenient mobility with AD/ADAS:
• Validation methods
 - to achieve comprehensiveness with equivalence and reproducibility
 - to validate the robustness of the system
• Social acceptance
 - to be equivalent to or safer than human drivers

3

1. Indoor VR testing system <Research question and objective>

How can comprehensiveness be ensured while guaranteeing equivalence and reproducibility?
⇒ Investigate the possibilities and challenges of indoor validation methods

• With an actual vehicle (at a proving ground / in the real world): comprehensiveness and reproducibility are the issues.
• Without an actual vehicle (via simulation): equivalence is the issue; it must be clarified to what extent the validity of the simulation can be ensured.

Proposed method:
・address the issues of both approaches
・achieve efficient and reliable validation

4

1. Indoor VR testing system -System configuration-

<Vehicle on Dyno> ⇄ <Simulation> / <Emulation systems>
(Gas/brake inputs, velocity, and torque are exchanged between the vehicle on the dynamometer and the simulation.)

Software:
a. Vehicle model → estimates ego-vehicle behavior
b. Scenarios → control the relation with targets (e.g., cut-out, cut-in)
c. Sensor models → convert information on targets into sensor-readable signals

Hardware:
-Camera emulator- Emulation by a display showing a different picture to each lens of the stereo camera.
-Radar emulator- Emulation with simulated reflected millimeter waves from an antenna. Perception information is converted into signals, e.g. distance → time delay, relative velocity → frequency shift, size → gain.
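The conversions listed for the radar emulator follow standard radar relations, so they can be made concrete in a few lines. A minimal sketch, assuming a 77 GHz automotive radar carrier (an illustrative value; the slides do not state the emulator's actual parameters, and the size → gain mapping is omitted because it depends on unspecified target modeling):

```python
# Sketch of the emulator's signal conversions: distance -> round-trip time
# delay and relative velocity -> Doppler frequency shift. The carrier
# frequency is an assumed illustrative value, not an NTSEL specification.
C = 299_792_458.0  # speed of light [m/s]

def radar_echo_parameters(distance_m: float, rel_velocity_ms: float,
                          carrier_hz: float = 77e9) -> tuple[float, float]:
    time_delay_s = 2.0 * distance_m / C                      # round trip
    doppler_shift_hz = 2.0 * rel_velocity_ms * carrier_hz / C
    return time_delay_s, doppler_shift_hz

delay, shift = radar_echo_parameters(distance_m=50.0, rel_velocity_ms=-10.0)
print(f"delay: {delay * 1e9:.1f} ns, Doppler shift: {shift / 1e3:.2f} kHz")
```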

5

2. Artificial Rainfall Device

• Constant rainfall over a section two lanes wide and 30 m long
• Precipitation rates of 20-100 mm/h (in steps of 20 mm/h)
• Vehicles can run at up to 130 km/h
• ADAS functions can activate

<Fog> <Rain> [Photographs: rain/fog stands and dynamometers]

[Chart: amount of raindrops by speed [m/s] and diameter [mm]]

6

2. Artificial Rainfall Device -Validation of raindrops-

Measured drop speed vs. diameter, compared between natural and reproduced rainfall at 20, 51 and 100 mm/h:

• Concentrated areas (red) lie on the theoretical Gunn-Kinzer approximation curve → close to natural rainfall

[Plots: -Natural- vs. -Reproduced- speed [m/s] over diameter [mm] at <20 mm/h>, <51 mm/h> and <100 mm/h>]
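The Gunn-Kinzer curve itself is empirical, but closed-form fits to it exist. A minimal sketch using the Atlas et al. (1973) exponential approximation to the Gunn-Kinzer data (our choice for illustration; the slides do not say which parameterization NTSEL plots):

```python
import numpy as np

# Atlas et al. (1973) fit to the Gunn-Kinzer terminal velocity data:
# v(D) = 9.65 - 10.3 * exp(-0.6 * D), with D in mm and v in m/s.
def terminal_velocity(diameter_mm):
    return 9.65 - 10.3 * np.exp(-0.6 * np.asarray(diameter_mm))

# Reproduced drops whose (diameter, speed) pairs fall near this curve are
# close to natural rainfall, which is what the red concentrations show.
for d in (0.5, 1.0, 2.0, 4.0):
    print(f"D = {d:.1f} mm -> v = {terminal_velocity(d):.2f} m/s")
```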

2. Artificial Rainfall Device

7

-Validation of on-vehicle ADAS system-

• The ADAS system activated and ran at almost the set velocity
• Recognized the target vehicle and both lane lines
• Later lost the target vehicle and the right-side lane line
• Ran slower than the set velocity
→ The system deactivated afterwards

8

3. Negligence Requirements Based on Case Analysis

<Research question and objective>
What are the requirements for human drivers to be criminally punished?
⇒ Organize the requirements for human drivers as the norm for AD via analysis of traffic accident precedents in Japan
⇒ Consider the boundaries of criminal penalties

<Basis of negligence: breach of the duty of care>
• Foreseeability - Where/when should human drivers recognize the triggers that lead to danger?
• Preventability - Can human drivers prevent accidents from the above trigger?

The requirement in the Framework Document (ECE-TRANS-WP29-2019-34-rev.1e): "… shall not cause any traffic accidents resulting in injury or death that are reasonably foreseeable and preventable."

9

3. Negligence Requirements Based on Case Analysis

[Diagram: remaining distance vs. distance to stop; Collision 1 → not guilty, Collision 2 → guilty]

Foreseeability -The trigger of the obligation to prevent a crash-
➢ Deeply related to the context
➢ Switching of the context

<Jumping out of a pedestrian> Can see the pedestrian → Trigger!
<Cutting in of a vehicle> Becomes dangerous at a certain moment… → Trigger?

<Normal context> No pedestrians on the highway → can drive fast
<Switched context> (e.g., by a road sign) Drive with the obligation to foresee the presence of pedestrians or a collision with them

10

3. Negligence Requirements Based on Case Analysis

[Diagram: remaining distance vs. distance to stop; Collision 1 → not guilty, Collision 2 → guilty]

Preventability -Whether accidents can be prevented from the trigger point-

➢ With braking: free-run distance + braking distance
• Depends on velocity, vehicle spec, road condition, etc. → possible to estimate preventability with braking (see the sketch below)

<Cognitive reaction time> 0.75-0.8 s for drivers in an actual traffic environment without psychological readiness.

Cognitive reaction times by traffic situation (lamp response / stepping response):

| Traffic situation | Samples | Median [s] | Average [s] | Deviation [s] |
|---|---|---|---|---|
| Sudden brake | 341 | 0.69 | 0.85 | 0.29 |
| Cutting in | 56 | 0.77 | 0.85 | 0.40 |
| Crossing | 232 | 0.71 | 0.83 | 0.33 |
| Jumping out | 195 | 0.79 | 0.88 | 0.37 |

[Box plot: cognitive reaction time per situation with min, 25th percentile, 75th percentile and max]
The deviation is especially large for cutting in; the perception of danger may vary.

➢ With steering
• Not obligated when the accident is unpreventable by braking
• Allowed as a choice of prevention
• Considered on the basis of emergency evacuation
• Foreseeability of new hazards due to steering is important, and behavior gets complicated → difficult to estimate preventability with steering in general
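As noted above, preventability by braking reduces to comparing the remaining distance at the trigger point with the stopping distance. A minimal sketch, assuming an illustrative friction coefficient (not given in the slides) and the 0.75-0.8 s cognitive reaction time reported on the next slide:

```python
G = 9.81  # gravitational acceleration [m/s^2]

def stopping_distance(speed_kmh: float, reaction_s: float = 0.75,
                      friction: float = 0.7) -> float:
    """Free-run distance during the reaction time plus braking distance."""
    v = speed_kmh / 3.6                      # km/h -> m/s
    return v * reaction_s + v**2 / (2 * friction * G)

def preventable_by_braking(remaining_m: float, speed_kmh: float) -> bool:
    # Guilty/not-guilty hinges on whether the vehicle could stop in time.
    return remaining_m >= stopping_distance(speed_kmh)

print(f"{stopping_distance(60):.1f} m")                       # ~32.7 m at 60 km/h
print(preventable_by_braking(remaining_m=25, speed_kmh=60))   # False
```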

11

3. Negligence Requirements Based on Case Analysis -Cognitive reaction time under real traffic conditions-

■ Definition
Outbreak of danger → perception → decision → reaction → brake activation. For example, when a vehicle cuts in: the driver perceives the danger (delay in perception), decides that emergency braking is required and how to avoid (delay in decision), then releases the gas pedal, transfers the foot to the brake pedal and kicks the brake pedal (delay in reaction). The cognitive reaction time runs from the outbreak of danger to the time braking is enabled.

■ Example
*These data were incidentally captured by driving recorders with a brake trigger under actual traffic conditions. The situations and their factors were not created or manipulated on purpose.

Equipment
➢ More than 300 vehicles participated
➢ Data were acquired over a period of 7 months
→ More than 1,000 valid data points were acquired

◼ Model: CS-41FH (CELLSTAR)
◼ Frame rate: 30 fps
◼ Cameras: front (2 Mpixel) and footage (1 Mpixel with IR LED)

<Major findings>
0.75-0.8 s (on average) for drivers in an actual traffic environment without psychological readiness

12

3. Negligence Requirements Based on Case Analysis

[Diagram: remaining distance vs. distance to stop; Collision 1 → not guilty, Collision 2 → guilty]

Foreseeability / Preventability
➢ Standard of negligence as evaluation criteria
• General person
 - Is the context common or not?
 - Is it generally preventable or not?
• The said person
 - Did he/she get to know the context?
 - Is it preventable for him/her?
Ex. Professional taxi driver: because he passes through that drinking district at the same time every day, he should foresee that it is not surprising to find drunk people sleeping on the street late at night on Friday.

➢ Competent and careful (C&C) human driver? (Research topic going forward)
→ Define the trigger point and preventability at competent and careful levels.

<Driver behavior observation> Measured driving behavior on public roads with an eye tracking device (View Tracker III).
-Major findings-
• Competent implies excellent driving ability
• Careful implies that the ability to recognize contextual triggers is competent (hypothesis)

13

Areas of concern and potential directions for future research

1. Indoor VR Testing System

• Application as an official validation method in the context of type approval is still premature

2. Artificial Rainfall Device

• Reproduction of more precise rain, e.g. travel wind, splash, etc.
• Reproduction of stable fog

3. Negligence Requirements Based on Case Analysis

• Clarification and organization of trigger points as foreseeability
• Consideration of the foreseeability of new hazards due to steering

Thank you very much for your kind attention.


A Case Study of Output Checking in Japan, National Statistics Center

Keywords: output checking, checking rules, on-site use of official statistical microdata, case study (Japan)


UNITED NATIONS ECONOMIC COMMISSION FOR EUROPE

CONFERENCE OF EUROPEAN STATISTICIANS

Expert Meeting on Statistical Data Confidentiality

26-28 September 2023, Wiesbaden

A Case Study of Output Checking in Japan

Yutaka Abe and Kazuhiro Minami (National Statistics Center, Institute of Statistical Mathematics)

[email protected], [email protected]

Abstract

In Japan, the National Statistics Center has been responsible for checking the output of on-site use of official statistical microdata since its launch in 2019. We have accumulated experience in output checking as we examine how to apply checking rules to the various outputs produced by researchers in different research fields. We have also identified the need to add new rules. In this paper, we present our experiences in output checking as a case study from Japan and describe the new rules for quantiles that we plan to introduce.


1 Introduction

In 2019, the Japanese Statistical Act was revised and on-site use of official statistical microdata was launched. In Japan, microdata of official statistics has traditionally been provided on physical media such as DVDs. However, because it is not possible to check the outputs created by users after the media are handed over, the purpose of use and the outputs the users will create are confirmed in advance, and only the limited variables necessary to create those outputs are provided. In addition, Japan has a decentralized statistical system, which means that each ministry conducts its own statistical surveys. Hence users need to apply to multiple ministries, one for each statistical survey they need for research purposes. For on-site use, by contrast, we only confirm the outline of the research method and the image of the outputs that users will create before the start of use, provide all variables of the microdata, and instead check each output that the users create. Furthermore, each ministry outsources its administration to the National Statistics Center as the central contact point for on-site use procedures. Thus, the National Statistics Center checks all the outputs produced through on-site use, and also reviews the rules for output checking as appropriate. The rest of the paper is organized as follows. In Section 2, we introduce the Japanese on-site use institution for microdata of official statistics. In Section 3, we present our experience with output checking as a case study; in Section 4, we discuss a new output checking rule for quartiles that we are considering introducing. Section 5 describes future work.

2 On-site Use for the Microdata of Official Statistics in Japan

2.1 On-site Use Institution in Japan

As the importance of EBPM has been recognized in Japan in recent years, the Statistics Act was revised in 2019 to make microdata of official statistics available for policymaking and academic research. In Japan, the microdata of official statistics had traditionally been provided only to public institutions and to researchers subsidized by public institutions. To promote further use of the microdata, the requirements for use were expanded so that, for example, faculty members affiliated with universities in Japan can use the microdata even if they are not subsidized by public institutions. On the other hand, since the microdata contains much confidential information about the survey individuals, provision of the microdata via on-site use was also started so that it could be provided securely. In particular, microdata under the use requirements added in the 2019 revision is provided only via on-site use. Since on-site use allows for post-checking of the output, when a request to use the microdata is submitted, we confirm only the outline of the research method and the image of the outputs that users will create. This is intended to shorten the time needed to start on-site use and to allow for explorative and creative analysis with all variables. In addition, a portal site (in Japanese) [1] is open to the public for on-site use procedures. Japanese output checking rules had been under consideration even before on-site use was officially launched, as presented at the Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality held in September 2017 [2]. Following this consideration, the current output checking rules are based on the following principles: each individual value is confidential; statistical values must be calculated from the individual data values of 10 or more units (principle of 10 units); and mathematical models such as regression coefficients must retain 10 or more degrees of freedom (principle of 10 degrees of freedom). Moreover, a dominance rule for statistical surveys covering establishments and optional rules to prevent group disclosure for sensitive variables have been established.

2.2 Overview of On-Site Systems in Japan

Figure 1 shows an overview of the Japanese on-site system. As of July 1, 2023, there are 21 on-site facilities in Japan, 19 of which are located at universities. Thin-client PCs are installed at each on-site facility, and users connect from these thin-client PCs to the central virtual PC server to use the survey information on their individual virtual desktops. For a secure connection, the communication lines of the on-site system use SINET, an academic backbone network for universities and research institutions built and operated by the National Institute of Informatics, and are disconnected from the general Internet.


For security purposes, we keep records of who enters and exits the on-site facility and when, and we monitor and record user activity inside the facility with monitoring cameras. Thin-client PCs are set up so that external disks and USB flash drives cannot be used. Therefore, users cannot directly take out the outputs of their research and analysis. We retrieve the outputs from the virtual PC on behalf of the user, check the confidentiality of the outputs, and send them to the user if they are safe.

Figure 1 Secure microdata access service in On-site facilities

3 A Case Study of Output Checking in Japan

The National Statistics Center checks all of the outputs from on-site use based on predefined output checking rules. The output checking rules are defined for commonly used output formats; however, in the past five years of output checking there have been cases in which the safety of outputs could not be checked simply by formally applying the rules, and it was necessary to understand the contents of the outputs and conduct checks beyond the rules, or to change the output format to enable checks within the rules. In this section, we present case studies of output checking conducted at the National Statistics Center that were instructive in our work. Note that the values in the case studies are not actual.

3.1 Case of Statistical Tables Containing Hidden Attributes

A user created Table 1 at the on-site facility. The user analyzed the number of women in non-regular employment (part-time, etc.) and created a table of the number of employees by employment status and age, both for females and males combined and for females only.


Table 1 Number of employees by gender, employment status and age

| Gender | Employment Status | Under 24 y.o. | 25-29 y.o. | ⋯ |
|---|---|---|---|---|
| Female & Male | All Status | | | |
| Female & Male | Non-Regular | | | |
| Female | All Status | | | |
| Female | Non-Regular | | | |

However, upon review, it was not sufficient to simply apply the existing rules for frequency tables to this table. From the difference between the female & male rows and the female rows, the male frequencies can be calculated, and from the difference between all employment statuses and non-regular employment, the frequencies of regular employment can be calculated. As shown in Table 2, since the statistical table had to be checked including these hidden attributes, we asked the user to create a complete table as explanatory material, and we used the rules for frequency tables to check the safety of the complete table.

Table 2 Statistical tables containing hidden attributes

| Gender | Employment Status | Under 24 y.o. | 25-29 y.o. | ⋯ |
|---|---|---|---|---|
| Female & Male | All Status | | | |
| Female & Male | Regular | | | |
| Female & Male | Non-Regular | | | |
| Female | All Status | | | |
| Female | Regular | | | |
| Female | Non-Regular | | | |
| Male | All Status | | | |
| Male | Regular | | | |
| Male | Non-Regular | | | |
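The risk this case illustrates is a simple differencing attack. A minimal sketch with made-up counts (the tables above are published without values):

```python
# Differencing sketch: the omitted "Male" and "Regular" rows of Table 1
# are recoverable from the published rows. All counts here are made up.
total_all     = {"under 24": 120, "25-29": 150}   # Female & Male, All Status
total_nonreg  = {"under 24":  70, "25-29":  60}   # Female & Male, Non-Regular
female_all    = {"under 24":  90, "25-29": 100}
female_nonreg = {"under 24":  60, "25-29":  45}

for age in total_all:
    male_all    = total_all[age] - female_all[age]
    male_nonreg = total_nonreg[age] - female_nonreg[age]
    female_reg  = female_all[age] - female_nonreg[age]
    print(f"{age}: male/all={male_all}, male/non-regular={male_nonreg}, "
          f"female/regular={female_reg}")
```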

3.2 Case of Statistics on Totals and Breakdowns

A user created Table 3 at the on-site facility. In order to compare manufacturing establishments with establishments in other industries in an area, the user created a table of the establishments' frequency and the means and standard deviations of the amount of sales.


Table 3 Statistical tables on manufacturing and other industries (before suppression)

| | Freq. | Mean | S.D. |
|---|---|---|---|
| All Industries | 50 | 7,074.4 | 14,373.7 |
| Manufacturing | 10 | 13,946.4 | 31,992.8 |
| Other Industries | 40 | 5,356.4 | 2,870.9 |

Our output checking rules specify the (1, 70) and (2, 85) dominance rules for means calculated from statistical surveys covering establishments. In this table, the manufacturing mean did not satisfy the (1, 70) rule and its standard deviation had fewer than 10 degrees of freedom, so the user suppressed the two corresponding cells as shown in Table 4.

Table 4 Statistical tables on manufacturing and other industries (after suppression by user)

| | Freq. | Mean | S.D. |
|---|---|---|---|
| All Industries | 50 | 7,074.4 | 14,373.7 |
| Manufacturing | 10 | X | X |
| Other Industries | 40 | 5,356.4 | 2,870.9 |

However, as shown in Table 5, the suppressed mean for manufacturing can easily be recalculated as follows:

$$m_1 = \frac{nm - n_2 m_2}{n_1}$$

We also examined whether the standard deviation could be recalculated, and found that it can be, as follows:

$$s_1 = \sqrt{\frac{(n-1)s^2 - (n_2-1)s_2^2 - n_1(m_1 - m)^2 - n_2(m_2 - m)^2}{n_1 - 1}}$$

where $n$, $m$, $s$ denote the frequency, mean and standard deviation for all industries, $n_1$, $m_1$, $s_1$ those for manufacturing, and $n_2$, $m_2$, $s_2$ those for the other industries.

Table 5 Recalculation of mean and standard deviation

| | Freq. | Mean | S.D. |
|---|---|---|---|
| All Industries | $n$ | $m$ | $s$ |
| Manufacturing | $n_1$ | $m_1$ | $s_1$ |
| Other Industries | $n_2$ | $m_2$ | $s_2$ |

Therefore, we coordinated with the user and finally decided to suppress the cells as shown in Table 6 so that they could not be recalculated.

Table 6 Statistical tables on manufacturing and other industries (after final suppression)

| | Freq. | Mean | S.D. |
|---|---|---|---|
| All Industries | 50 | 7,074.4 | 14,373.7 |
| Manufacturing | X | X | X |
| Other Industries | X | 5,356.4 | 2,870.9 |
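The recalculation above can be verified numerically with the (illustrative) values of Table 3; a minimal sketch:

```python
from math import sqrt

# n, m, s: all industries; n2, m2, s2: other industries (Table 3 values,
# which the paper notes are not actual figures).
n, m, s = 50, 7074.4, 14373.7
n2, m2, s2 = 40, 5356.4, 2870.9
n1 = n - n2                      # manufacturing frequency: 50 - 40 = 10

m1 = (n * m - n2 * m2) / n1      # recovered manufacturing mean
s1 = sqrt(((n - 1) * s**2 - (n2 - 1) * s2**2
           - n1 * (m1 - m)**2 - n2 * (m2 - m)**2) / (n1 - 1))
print(f"mean: {m1:.1f}, S.D.: {s1:.1f}")  # -> the suppressed 13946.4, 31992.8
```

This also makes clear why Table 6 additionally suppresses a frequency cell: without one of the subgroup frequencies, $n_1$ can no longer be derived from the total.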


3.3 Case of a Decision Tree

A user created Figure 2 at the on-site facility. This is a decision tree, created in R, for the conditions under which an employee buys stock. There is no output checking rule for decision trees, nor is it possible in this figure to check the size of the sample from which it was created or the frequencies at each node. However, we thought that if we could ascertain such information, we could check it by applying the existing rules for frequency tables.

Figure 2 Decision tree regarding whether or not an employee buys stock

Therefore, we devised an R script that converts the structure of the decision tree into a frequency table, asked the user to create Table 7, and checked it using the rules for frequency tables.


Table 7 Conversion of the decision tree into a frequency table

| Node | Buy | Not buy | Total |
|---|---|---|---|
| Total | 4033 | 11497 | 15530 |
| Company scale over 1000 employees | 1917 | 747 | 2664 |
| Age 35-59 years old | 1579 | 380 | 1959 |
| Age under 34 or over 60 years old | 338 | 367 | 705 |
| Social insurance over 1482.5 yen | 311 | 246 | 557 |
| Social insurance under 1482.5 yen | 27 | 121 | 148 |
| Company scale under 1000 employees | 2116 | 10750 | 12866 |
| Company scale 30-999 employees | 1772 | 4873 | 6645 |
| Resident tax over 5562.3 yen | 1341 | 2320 | 3661 |
| Industry: manufacture | 547 | 469 | 1016 |
| Residence: wooden | 362 | 191 | 553 |
| Residence: other materials | 185 | 278 | 463 |
| Industry: other | 794 | 1851 | 2645 |
| Resident tax under 5562.3 yen | 431 | 2553 | 2984 |
| Company scale under 30 employees | 344 | 5877 | 6221 |
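The conversion itself was done with an R script; the sketch below shows the same idea in Python with scikit-learn (an assumption for illustration, not the script or data actually used). Each node of a fitted tree becomes one row of class counts to which the frequency-table rules can be applied:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# decision_path marks, for every sample, each node it passes through,
# so internal nodes and leaves all get a frequency row.
path = tree.decision_path(X)
for node in range(tree.tree_.node_count):
    members = path[:, node].toarray().ravel().astype(bool)
    counts = np.bincount(y[members], minlength=len(tree.classes_))
    print(f"node {node}: {counts.tolist()} total={counts.sum()}")
```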

4 Consideration of Output Checking Rules for Quantiles

4.1 Previous Case Studies

Since the Japanese statistical system is decentralized, each ministry owns its microdata of official statistics. The National Statistics Center is the central contact point for on-site services; however, if there are no predetermined output checking rules for an output a user has created, we need to discuss the checking method with the ministries that own the microdata, which increases the time required to provide the output. Therefore, for commonly used output formats we should determine output checking rules in advance to reduce that time. The current output checking rules require that, in principle, outputs be created from 10 or more units. The median and quartiles are widely used in descriptive statistics; however, they do not satisfy the principle of 10 units because their values are either the value of a single individual or values obtained from two individuals' values. On the other hand, since it is generally difficult to accurately know the rankings of all survey individuals, we thought it would be possible to establish median and quartile rules under appropriate assumptions, and we first collected information on previous cases.


4.1.1 Case Studies of the Data without Boundaries Project

The guidelines of the Data without Boundaries project [3] state the following principles for checking percentiles:

1. If the rank ordering of firms is known or guessable, the percentile cannot be released.
2. If the variance around the percentile is low, there is the possibility of group disclosure.
3. If the variance around the percentile is very large, the identity of the percentile respondent might be guessable.

Regarding 1, it is possible that the top rankings are known or can be inferred; in general, however, if the data size is large enough, it is assumed to be difficult to accurately determine the rankings of individuals near the median and quartiles. Regarding 2, we consider establishing an optional rule to prevent group disclosure that would apply to sensitive variables. Regarding 3, as shown in Figure 3, if the frequencies around the median are small, there is a risk that an attacker with knowledge of the distribution will guess that the frequencies around the median are very small, and rules are needed to prevent this.

Figure 3 Bimodal distribution with a small number of values around the median

4.1.2 Case Studies of the UK Data Service

The UK Data Service handbook [4] introduces the rounding suppression method. In this method, the number of digits used to round the median or quartile value and every individual value is increased until the frequency of individuals with the same rounded value as the rounded median or quartile value is 10 or greater. For example, Table 8 shows that if the median and every individual value are rounded to the nearest integer, there are 10 units with the same rounded value as the rounded median. For the first and third quartile values, rounding to the nearest ten yields 10 or more units with the same rounded value as the rounded quartiles.

Table 8 Example of rounding suppression

| | First quartile | Median | Third quartile |
|---|---|---|---|
| True value | 3804.9 | 5503.7 | 7983.6 |
| Rounded value | 3800 | 5504 | 7980 |
| Freq. of individuals with the same rounded value | 62 | 10 | 35 |

Verification of this suppression method showed that if the variance around the median is very large, the frequencies of identical rounded values do not reach 10 even when the number of rounding digits is increased up to the upper limit, and the median is not provided; otherwise, the value can be concealed, although the number of rounding digits required varies.
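A minimal sketch of the rounding suppression procedure, assuming a simple digit schedule from one decimal place up to the nearest 100,000 (the handbook does not prescribe specific limits):

```python
import numpy as np

def rounding_suppression(values, threshold=10):
    """Coarsen rounding until >= threshold units share the rounded median."""
    median = np.median(values)
    for decimals in range(1, -6, -1):   # round to 0.1, 1, 10, ..., 100000
        rounded_median = np.round(median, decimals)
        same = int(np.sum(np.round(values, decimals) == rounded_median))
        if same >= threshold:
            return rounded_median, same
    return None, 0                      # suppress: value cannot be released

rng = np.random.default_rng(0)
print(rounding_suppression(rng.normal(5500, 800, size=1000)))
```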

4.2 Results of the Consideration

After further consideration based on the information gathered, we are planning to set two rules in addition to the UK Data Service's rounding suppression method. One is a rule that the frequency of the group for which the median and quartile values are calculated must be at least 40, in order to ensure that the frequency of survey individuals with values below the first quartile or above the third quartile is at least 10. The other is an optional rule for sensitive variables that requires some degree of dispersion around the median to prevent group disclosure. Specifically, we are considering the rule that the interquartile range, i.e. the difference between the third quartile and the first quartile, must be more than 30% of the median.
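The two proposed rules are easy to state as a combined check; a minimal sketch with the threshold values stated above (the absolute value around the median is our defensive addition for illustration):

```python
import numpy as np

def quantile_release_ok(values, min_units=40, iqr_ratio=0.30):
    """Check the proposed frequency and interquartile range thresholds."""
    values = np.asarray(values)
    if values.size < min_units:                 # frequency threshold rule
        return False
    q1, med, q3 = np.quantile(values, [0.25, 0.5, 0.75])
    return (q3 - q1) > iqr_ratio * abs(med)     # IQR threshold rule

rng = np.random.default_rng(1)
print(quantile_release_ok(rng.lognormal(10, 0.5, size=200)))  # True: dispersed
print(quantile_release_ok(np.full(200, 100.0)))               # False: no spread
```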

5 Future Work

On-site use in Japan began in 2019. Over the past five years, on-site use of the microdata of official statistics has mainly been in the fields of economics and sociology, although use in other fields, such as medical research, has been increasing. As use in new fields grows, the number of outputs in new formats is expected to increase, and there is also concern about outputs whose contents are extremely difficult to understand, such as machine learning models. Therefore, we will continue to record cases in our operations and will review the output checking rules appropriately, drawing also on case studies from other countries.

References

[1] National Statistics Center, "Using microdata of official statistics (in Japanese)," https://www.e-stat.go.jp/microdata/data-use/on-site.

[2] R. Kikuchi and K. Minami, "On-site service and safe output checking in Japan," in Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality, Skopje, North Macedonia, 2017.

[3] Data without Boundaries project, "Guidelines for the checking of output," https://ec.europa.eu/eurostat/cros/system/files/dwb_standalone-document_output-checking-guidelines.pdf.

[4] UK Data Service, "Handbook on Statistical Disclosure Control for Outputs," https://ukdataservice.ac.uk/app/uploads/thf_datareport_aw_web.pdf.

Reliable statistics that live up to trust

National Statistics Center

A Case Study of Output Checking in Japan

Yutaka Abe National Statistics Center

Kazuhiro Minami The Institute of Statistical Mathematics, National Statistics Center

UNECE Expert Meeting on Statistical Data Confidentiality 2023

2023/9/27

Introduction of National Statistics Center (NSTAC)

We, NSTAC, have 3 missions:
(Mission I) Produce statistics
(Mission II) Utilize statistics
(Mission III) Support statistics

Ensuring the credibility of statistics based on reliable techniques

2

Statistical System of Japan

• Japan has a decentralized statistical system (multiple government ministries have their own statistical surveys)

| Ministries | Survey Titles |
|---|---|
| Cabinet Secretariat | Basic Survey on Human Connection |
| Cabinet Office | Annual Survey of Corporate Behavior, etc. |
| Children and Families Agency | Survey on the Living of Children, etc. |
| Ministry of Internal Affairs and Communications | Population Census, Economic Census, Labour Force Survey, Survey on Time Use and Leisure Activities, etc. |
| Ministry of Finance | Financial Statements Statistics of Corporations by Industry |
| Ministry of Education, Culture, Sports, Science and Technology | School Basic Survey, School Teachers Survey, etc. |
| Ministry of Health, Labour and Welfare | Vital Statistics, Basic Survey on Wage Structure, National Health and Nutrition Survey, etc. |
| Ministry of Agriculture, Forestry and Fisheries | Census of Fisheries, Statistics on Marine Fishery Production |
| Ministry of Economy, Trade and Industry | Basic Survey of Japanese Business Structure and Activities, Census of Manufacture, etc. |
| Ministry of Land, Infrastructure, Transport and Tourism | Statistics on Building Construction Started, Consumption Trend Survey for Foreigners Visiting Japan, etc. |
| Ministry of the Environment | Survey of industrial waste generation and treatment, etc. |

List of the ministries providing the microdata via on-site use

3

Application of the Microdata via DVD

4

[Flow: researchers apply to each ministry, and the ministry provides the microdata on DVD]

Application of the Microdata via On-site Use (Since 2019)

5

[Flow: researchers apply to NSTAC, which is entrusted by the ministries; the ministries entrust their microdata to NSTAC; NSTAC issues an account for on-site use, through which researchers access the microdata]

6

Overview of On-Site System in Japan

7

Map of On-Site Facilities in Japan

⚫ 19 universities
⚫ 2 public offices (as of 1st Sep. 2023)
[Map showing the on-site facilities, the Statistical Data Utilization Center and the National Statistics Center]

8

Output Checking in Japan (1/2)

In the on-site system, the user performs analysis and produces outputs, together with a checking form and an explanation of the outputs. NSTAC carries out output checking and masking, inquiring with the user if needed, and returns the confirmed outputs. Before publication, the outputs are rechecked, and the reconfirmed outputs are sent to the user. Secure outputs are delivered by e-mail.

The output checking rules are defined for commonly used output formats and published as part of the on-site use manual [1].

Output Checking in Japan (2/2)

9

• If an output format is not covered by the output checking rules, we first evaluate whether we can handle it with the guiding principles. (Case studies of checking based on the principles are described in the paper.)

• If there are no checking rules for the output, we need to discuss checking methods with the ministry that owns the survey, which requires extra time before the output can be provided.

• For new output formats we frequently encounter, we need to revise the manual and add new rules, to avoid inquiries to the ministry that owns the survey.

Consideration of Output Checking Rules for Quartiles

10

• The median and quartiles are widely used in descriptive statistics, etc.

• They do not satisfy the principle of 10 units: their values are calculated from 1 or 2 individuals.

• We had considered Japanese output checking rules before on-site use was launched [2], but could not find explicit rules for the median and quartiles.

• It is generally difficult to accurately infer the rankings of all survey individuals.

• It would therefore be possible to establish median and quartile rules by setting proper assumptions.

Case Studies of Data without Boundaries (DwB) Project (1/5)

11

Eurostat (2014, August). Guidelines for the checking of output based on microdata research [3].

Case Studies of DwB Project (2/5)

12

P.17

T1. If the rank ordering of firms is known or guessable, the percentile cannot be released.

T2. If the variance around the percentile is low, there is the possibility of group disclosure.

T3. If the variance around the percentile is very large, the identity of the percentile respondent might be guessable.

Case Studies of DwB Project (3/5)

13

T1. If the rank ordering of firms is known or guessable, the percentile cannot be released.

Our Assumption: It is possible that the top rankings are known or can be inferred; however, in general, if the data size is large enough, it is difficult to accurately determine the rankings of individuals near the median and quartiles.

Case Studies of DwB Project (4/5)

14

T2. If the variance around the percentile is low, there is the possibility of group disclosure.

→ We should introduce an additional rule to prevent group disclosure that would apply to sensitive variables, as we do for sums and means.

Example of group disclosure in a frequency table by region and income:

| | 0-1 million (yen) | 1-2 million | 2-3 million | 3 million- | Sum |
|---|---|---|---|---|---|
| Region 1 | 20 | 20 | 30 | 25 | 95 |
| Region 2 | 125 | 5 | 3 | 0 | 133 |
| Region 3 | 30 | 30 | 30 | 43 | 133 |
| Sum | 175 | 55 | 63 | 68 | 361 |

We can estimate with high probability that the income of a resident of Region 2 is 0-1 million yen.
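A minimal sketch of an automated check for this pattern, flagging any row whose frequency is concentrated in a single cell (the 90% threshold is an illustrative assumption, not an NSTAC rule):

```python
import numpy as np

table = np.array([[ 20, 20, 30, 25],    # Region 1
                  [125,  5,  3,  0],    # Region 2
                  [ 30, 30, 30, 43]])   # Region 3

shares = table / table.sum(axis=1, keepdims=True)
for region, row in enumerate(shares, start=1):
    if row.max() > 0.9:                 # one income band dominates the row
        print(f"Region {region}: group disclosure risk "
              f"({row.max():.0%} of residents in one band)")
```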

Case Studies of DwB Project (5/5)

15

T3. Consider the case where the variance around the percentile is very large: an attacker who knows the distribution will guess that the frequencies around the median are very small.

[Figure: bimodal distribution with the median between the two modes]

Case Studies of UK Data Service (1/2)

16

UK Data Service (2019, July). Handbook on Statistical Disclosure Control for Outputs [4].

Case Studies of UK Data Service (2/2)

17

"Rounding Suppression" (pp. 33-34)

• Increase the number of digits used to round the median or quartile value and every individual value until the frequency of individuals with the same rounded value as the rounded median or quartile value is 10 or greater.

| | 1st Quartile | Median | 3rd Quartile |
|---|---|---|---|
| True value | 3804.9 | 5503.7 | 7983.6 |
| Rounded value | 3800 | 5504 | 7980 |
| Freq. of individuals with the same rounded value | 62 | 10 | 35 |

Experiment of Rounding Suppression (1/3)

18

| | 1st Quartile | Median | 3rd Quartile |
|---|---|---|---|
| True value | 100178.4 | 196825.0 | 299731.9 |
| Rounded value | 100200 | 200000 | 299700 |
| Freq. of individuals with the same rounded value | 32 | 228 | 36 |

Example with randomly generated bimodal data 1

Experiment of Rounding Suppression (2/3)

19

| | 1st Quartile | Median | 3rd Quartile |
|---|---|---|---|
| True value | 100136.5 | 550353.1 | 999868.4 |
| Rounded value | 100100 | NA | 999900 |
| Freq. of individuals with the same rounded value | 39 | 0 | 32 |

Example with randomly generated bimodal data 2

Experiment of Rounding Suppression (3/3)

20

T3. If the variance around the percentile is very large, the identity of the percentile respondent might be guessable.

Solution: When the variance around the median or quartile is very large, rounding suppression prevents the value from being published.

Experiment with skewed data (1/2)

21

Synthetic data of the National Survey of Family Income, Consumption and Wealth 2009 (sample size 45,811)

| Item | Mean (yen) | Skewness |
|---|---|---|
| Yearly income | 6401900 | 2.12 |
| Living expenditure | 298373.7 | 3.01 |
| Food | 68740.2 | 1.26 |
| Housing | 16127.9 | 12.25 |
| Fuel, light and water charges | 19421.0 | 1.36 |
| Furniture and household utensils | 9374.0 | 7.63 |
| Clothes and footwear | 12054.8 | 6.41 |
| Medical care | 13281.0 | 7.17 |
| Transportation and communication | 44692.4 | 7.99 |
| Education | 15014.9 | 14.59 |
| Reading and recreation | 31099.3 | 4.59 |
| Other living expenditure | 68568.3 | 5.94 |

[Histograms: living expenditure and education]

Experiment with skewed data (2/2)

22

| Item | 1st Quartile | Rounded 1st Q. | Change rate | Median | Rounded Median | Change rate | 3rd Quartile | Rounded 3rd Q. | Change rate |
|---|---|---|---|---|---|---|---|---|---|
| Yearly income | 3804.9 | 3800 | 0.13% | 5503.7 | 5504 | 0.01% | 7983.6 | 7980 | 0.05% |
| Living expenditure | 194216.5 | 194200 | 0.01% | 260255.8 | 260300 | 0.02% | 354990.9 | 355000 | 0.00% |
| Food | 48104.4 | 48100 | 0.01% | 63497.8 | 63500 | 0.00% | 83502.0 | 83500 | 0.00% |
| Housing | 659.3 | 660 | 0.11% | 2876.9 | 2880 | 0.11% | 17461.6 | 17500 | 0.22% |
| Fuel, light and water charges | 13589.7 | 13590 | 0.00% | 17905.0 | 17900 | 0.03% | 23544.7 | 23540 | 0.02% |
| Furniture and household utensils | 2877.6 | 2880 | 0.08% | 5613.9 | 5610 | 0.07% | 11057.8 | 11060 | 0.02% |
| Clothes and footwear | 3893.7 | 3890 | 0.09% | 7686.1 | 7690 | 0.05% | 14515.6 | 14500 | 0.11% |
| Medical care | 3849.0 | 3850 | 0.03% | 7687.3 | 7690 | 0.03% | 15434.6 | 15430 | 0.03% |
| Transportation and communication | 10372.4 | 10400 | 0.27% | 23464.7 | 23460 | 0.02% | 48760.9 | 48800 | 0.08% |
| Education | 0.0 | 0 | 0.00% | 1761.2 | 1760 | 0.07% | 13535.3 | 13500 | 0.26% |
| Reading and recreation | 12182.3 | 12180 | 0.02% | 21707.7 | 21710 | 0.01% | 38310.9 | 38300 | 0.03% |
| Other living expenditure | 24793.2 | 24800 | 0.03% | 44634.2 | 44600 | 0.08% | 80909.8 | 80900 | 0.01% |

Change rate := | True value - Rounded value | / True value

Frequency Threshold Rule (1/2)

23

T1. If the rank ordering of firms is known or guessable, the percentile cannot be released.

Our Assumption: It is possible that the top rankings are known or can be inferred; however, in general, if the data size is large enough, it is difficult to accurately determine the rankings of individuals near the median and quartiles.

Our Solution: Ensure that neither the 1st nor the 3rd quartile falls within the range of the top or bottom rankings.

Frequency Threshold Rule (2/2)

24

• The frequency of the group for which the median and quartile values are calculated must be at least 40 (assuming that an attacker can infer the top or bottom 10 rankings).

[Diagram: Min. … 1st Q. … Median … 3rd Q. … Max., with more than 10 individuals below the 1st quartile and more than 10 above the 3rd quartile]

Interquartile Range Threshold Rule (1/2)

25

T2. If the variance around the percentile is low, there is the possibility of group disclosure.

→ We should introduce an additional rule to prevent group disclosure that would apply to sensitive variables, as we do for sums and means.

Our Solution: Introduce an additional rule that requires the sample to have some degree of dispersion.

Interquartile Range Threshold Rule (2/2)

26

• The interquartile range must be more than 30% of the median.

[Diagram: Min. … 1st Q. … Median … 3rd Q. … Max., with interquartile range > 30% of the median]

Conclusion

27

• The rounding suppression method of the UK Data Service is simple, yet satisfies the principle of 10 units and copes with T3 in DwB.

• To cope with T1 and T2 in DwB, we introduce the frequency threshold rule and the additional interquartile range threshold rule.

References

[1] National Statistics Center, "Using microdata of official statistics (in Japanese)," https://www.e-stat.go.jp/microdata/data-use/on-site.

[2] R. Kikuchi and K. Minami, "On-site service and safe output checking in Japan," in Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality, Skopje, North Macedonia, 2017.

[3] Data without Boundaries project, "Guidelines for the checking of output," https://ec.europa.eu/eurostat/cros/system/files/dwb_standalone-document_output-checking-guidelines.pdf.

[4] UK Data Service, "Handbook on Statistical Disclosure Control for Outputs," https://ukdataservice.ac.uk/app/uploads/thf_datareport_aw_web.pdf.

28


The Potential of Differential Privacy Applied to Detailed Statistical Tables Created Using Microdata from the Japanese Population Census , Chuo University, Japan

Keywords: privacy protection methods, differential privacy, perturbative methods, additive noise, data swapping



UNITED NATIONS ECONOMIC COMMISSION FOR EUROPE

CONFERENCE OF EUROPEAN STATISTICIANS

Expert Meeting on Statistical Data Confidentiality

26-28 September 2023, Wiesbaden

The Potential of Differential Privacy Applied to Detailed Statistical Tables Created Using Microdata from the Japanese Population Census

Shinsuke Ito (Chuo University, Japan)*, Masayuki Terada (NTT DOCOMO, INC., Japan)**, Shunsuke Kato (Statistics Bureau of Japan)***

* [email protected], ** [email protected], *** [email protected]

Abstract

In numerous countries, perturbative methods are used as a privacy protection method for official statistics. The U.S. Census Bureau has studied the applicability of perturbation based on differential privacy for official statistics, and empirically investigated the mechanism of differential privacy for the publication of statistical tables created based on data from the 2020 Census. In particular, the U.S. Census Bureau has examined the applicability of differential privacy for 2010 Census data in order to create and publish statistical tables for smaller geographical areas and as a protection against "database reconstruction attacks".

Several empirical studies on the effectiveness of perturbative methods such as additive noise, data swapping and PRAM for Japanese official microdata have been conducted (e.g. Ito et al. (2018)). Other studies have investigated the possibility of adapting differential privacy for detailed geographical data from the Japanese Population Census, and examined the potential of differential privacy as an anonymization method for Japanese statistical data (Ito and Terada (2019) and Ito et al. (2020)).

When discussing future directions for the creation and publication of statistical tables, it is important to consider the potential of differential privacy. Towards this objective, this paper conducts a comparative study of the effectiveness of differential privacy for Japanese Population Census data while taking into account the actual situation regarding the application of differential privacy to official statistics in other countries. Specifically, this research conducts a comparative analysis of data usability for various differential privacy methods (with PRAM as a traditional disclosure limitation method) for statistical tables at different geographical levels created using individual data from the 2015 Japanese Population Census.


1 Introduction

Recent international trends in privacy-protecting techniques applied to official statistics include the active use of perturbative methods. For example, the U.S. Census Bureau (hereinafter, "Census Bureau") has investigated the applicability of perturbative methods based on the methodology of differential privacy, which was originally developed in the field of computer science. In particular, the Census Bureau has examined ways to create statistical tables that use differential privacy for the 2020 United States Census. In addition, for creating and publishing tract/block-level Census tables, the Census Bureau has explored the practicality of using differential privacy as a way to prevent "database reconstruction attacks" (Abowd (2018), Garfinkel et al. (2019), Garfinkel (2022)), in which perpetrators attempt to identify personal information by combining multiple published statistical tables.

These international trends suggest that the application of differential privacy techniques is potentially effective not only in creating and publishing statistical tables, but also in constructing synthetic data.

Several empirical studies on the effectiveness of perturbative methods such as additive noise, data swapping and PRAM for Japanese official microdata have been conducted in Japan (e.g. Ito et al. (2018)). Other studies have investigated the possibility of adapting differential privacy for detailed geographical data from the Japanese Population Census, and examined the potential of differential privacy as an anonymization method for Japanese statistical data (Ito and Terada (2019) and Ito et al. (2020)). Therefore, investigating the applicability of differential privacy techniques to Japanese official statistics is worthwhile not only from the standpoint of discussing the future creation and publication of official statistical tables, but also from that of the future direction of secondary use of official statistics.

In this context, this paper explores the applicability of differential privacy to data from the Japanese Population Census (hereinafter, "Population Census") not only by empirically demonstrating the characteristics of data obtained under various perturbative methods, but also by comparing and examining the utility of these methods for Japanese official statistical data.

2 Application of Differential Privacy to Census Statistics

2.1 Definition and Interpretation of Differential Privacy

Differential privacy is a privacy protection framework aimed at achieving comprehensive (ad omnia) data security against arbitrary attacks, including unknown attacks. A consistent safety index ($\epsilon \ge 0$) is quantitatively provided for various privacy protection methods (Dwork, 2007)¹. The lower the value of the index, the higher the level of privacy protection.

If the privacy loss from a privacy protection method $M$ is guaranteed to be less than or equal to $\epsilon$, $M$ is said to satisfy $\epsilon$-differential privacy, which is defined more rigorously below.

Definition 1: For any adjacent databases $D_1$ and $D_2$ ($D_1, D_2 \in \mathcal{D}$), the randomization function $M: \mathcal{D} \to R$ satisfies $\epsilon$-differential privacy if the following inequality holds for any subspace $S$ of the output space $R$ of $M$ ($S \subseteq R$):

$$\Pr[M(D_1) \in S] \le e^{\epsilon} \cdot \Pr[M(D_2) \in S].$$

An intuitive interpretation of this definition is that if the result of applying $M$ to a database $D_1$ containing data on individual A is indistinguishable from the result of applying $M$ to a database $D_2$ not containing data on that individual, then the output of $M$ does not violate the privacy of individual A; and the lower the value of $\epsilon$, the more indistinguishable the former result is from the latter, meaning that greater security is provided by the privacy protection method $M$. Put differently, $\epsilon$ is an index indicating how much privacy is lost due to the output of $M$. For this reason, $\epsilon$ is also referred to as privacy loss or the privacy loss budget.

As Definition 1 shows, differential privacy does not refer to a particular privacy protection method, but rather is a framework for defining the level of data security provided. A specific method that protects privacy ($M$ in Definition 1) based on differential privacy is called a mechanism. The Laplace mechanism (described below) is a typical mechanism for achieving differential privacy. Some traditional statistical disclosure control (SDC) methods (anonymization methods), such as PRAM (Post RAndomization Method), are also known to provide data security based on differential privacy.

¹ For a deterministic method, $\epsilon \to \infty$ (no security provided).

2.2 Challenges in Applying Differential Privacy to Japanese Population Census Data

There are various types of mechanisms that achieve differential privacy, and their suitability for a given dataset depends on the nature of the dataset, the type of desired data output (the type of queries), and other factors. Therefore, not all mechanisms are suitable for statistical tables from the Japanese Population Census. Moreover, even where some mechanisms are suitable, using the wrong mechanism would significantly reduce the utility of the output statistics.

As an example, applying the Laplace mechanism, a well-known mechanism that satisfies differential privacy, to a contingency table is quite easy. Specifically, a random number drawn from a Laplace distribution centered on zero (Laplace noise) is added independently to each cell (even if its value is 0) in the contingency table².
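A minimal sketch of this cell-wise Laplace mechanism (for pure count queries the global sensitivity is 1; the epsilon value is illustrative):

```python
import numpy as np

def laplace_mechanism(table, epsilon, sensitivity=1.0, seed=None):
    """Add i.i.d. Laplace(0, sensitivity/epsilon) noise to every cell."""
    rng = np.random.default_rng(seed)
    return table + rng.laplace(0.0, sensitivity / epsilon, size=table.shape)

counts = np.array([[12.0, 0.0, 3.0],
                   [ 0.0, 7.0, 1.0]])
print(laplace_mechanism(counts, epsilon=1.0, seed=0))
# Illustrates the three issues discussed next: negative cells appear,
# zero cells become non-zero, and noise accumulates in partial sums.
```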

It is known, however, that simply applying the Laplace mechanism to a large-scale contingency table, such as those from the Population Census, causes the practical issues listed below and reduces the utility of the statistics (Terada et al. (2015), Ito and Terada (2020)).

1. Deviation from the nonnegativity constraint: After applying the Laplace mechanism, the output may contain many negative values that are unlikely to be observed in actual census data.
2. Loss of sparseness: Since almost all cells that initially were zero receive non-zero values, applying the mechanism to large sparse data, such as a Population Census dataset, significantly increases the volume of output data.
3. Loss of accuracy in partial sums: Since noise is added to the value of each cell, the error resulting from summing the values of multiple cells (e.g., to obtain the total population of a certain region) becomes large, reducing the accuracy of the partial sum.

The Laplace mechanism satisfies differential privacy by adding Laplace noise to each cell as described above. Since this noise can take a negative value (half of the probability mass is negative), if the Laplace mechanism is applied to population cells with a value of zero or a small value, the cells can end up with negative values. Such deviation from the nonnegativity constraint makes the data unnatural, and also means that conventional analytical methods and tools, which assume nonnegative input values, cannot be used. Therefore, violation of the nonnegativity constraint is hard to accept from a practical standpoint. It can easily be remedied by simply replacing negative cell values with zero; this operation also increases the number of zero cells and thereby alleviates the problem of increased data volume. However, this simple "zeroing out" of negative values causes a large error (overbias) in partial sums.

Contingency tables with detailed geographical divisions, such as small area statistics from the Population Census, often contain many zero-valued cells (i.e., they are highly sparse). Since sparse data are often stored in a form that omits zero-valued cells for memory efficiency, data volume is roughly proportional to the number of non-zero cells (not to the raw number of cells). However, the noise added by the Laplace mechanism is almost never exactly zero, so virtually all values in the output of the Laplace mechanism are non-zero, which significantly increases the volume of the data.

Regarding the loss of accuracy in partial sums: in practical applications of small area statistics, such as trade area analysis, what matters is the sum of the values of multiple cells within the scope set for a given analysis (e.g., the trade area of a retailer), not the value of a cell representing the smallest geographical unit. With the Laplace mechanism, the noise added to a partial sum is the sum of the noise added to the relevant cells, and its variance increases with the number of relevant cells. In other words, as the range of cells used for a partial sum expands, the error of the partial sum increases.

For statistical tables consisting only of integers, such as those from the Population Census, the geometric mechanism, which is based on random numbers from a two-sided geometric distribution (a discretized form of the Laplace distribution), can be used instead of the Laplace mechanism. While the geometric mechanism has the feature that its output always consists of integers, its other properties are roughly the same as those of the Laplace mechanism, and the discussion above applies directly to the geometric mechanism.

² The Laplace distribution is also called the double exponential distribution. The scale of the distribution depends on the value of the privacy loss budget $\epsilon$ and on the so-called global sensitivity, whose value is determined by the query type.


2.3 Methods for Achieving Differential Privacy Applicable to the Japanese Population Census

As an approach to solving the problems discussed above, a method based on the wavelet transform has been shown to be useful for mesh population statistics. The Privelet method (Xiao et al. (2011)) introduces the Haar wavelet transform in the noise injection process so that the noise of neighboring cells offsets one another, thereby increasing the accuracy of partial sums over a continuous domain. However, the price of using the Privelet method is that it requires the injection of larger amounts of noise than the Laplace mechanism. In addition, the Privelet method solves neither the problem of deviation from the nonnegativity constraint nor the problem of loss of sparseness. Though the former problem can be addressed by zeroing out the cells with negative values, as in the case of the Laplace mechanism, the technique is then still subject to the overbias problem.

Terada et al. (2015) propose a method based on Morton order mapping and the wavelet transform with nonnegative refinement (hereinafter, the "nonnegative wavelet method"). In this method, noise is injected over the wavelet space as in the Privelet method. For two-dimensional data such as mesh statistics, the wavelet transform is applied after conversion to one-dimensional data via Morton order mapping (a type of locality-preserving mapping), which prevents an increase in the magnitude of noise associated with the use of multidimensional wavelets. Also, by applying an inverse wavelet transform while correcting coefficient values to prevent the output from deviating from the nonnegativity constraint (i.e., while performing nonnegative refinement), the nonnegative wavelet method produces population data that satisfies the nonnegativity constraint and guarantees differential privacy. Population data obtained through this method has the characteristics that the accuracy of partial sums can be controlled based on the properties of the wavelet transform, and that the sparseness of the data can be restored in the process of nonnegative refinement. In other words, this method can be expected to solve the three aforementioned problems simultaneously.

An empirical experiment (Ito and Terada (2020)), in which the nonnegative wavelet method was applied to mesh statistics from the 2010 Population Census, indeed shows that the method solves the three problems discussed above. Also, Ito et al. (2020) show that for mesh statistics, the nonnegative wavelet method is more useful than the top-down construction approach with constrained optimization, which is discussed below. For these reasons, the nonnegative wavelet method is considered an effective method for applying differential privacy to mesh statistics from the Population Census. However, it is not evident how it could be applied to other statistical tables.

Another method is based on constrained optimization. Specifically, noise is injected via the Laplace mech-

anism or a geometric mechanism, and optimization is performed subject to a total-number constraint and

nonnegative constraint; and the solution becomes the output.

Lee et al. (2015) propose an algorithm for such constrained optimization that uses ADMM (the alternating direction method of multipliers), a type of numerical optimization method. For the 2020 U.S. Census, a constrained optimization method was implemented using the commercial solver Gurobi. It should be noted that constrained optimization involves high computational costs. For example, it took 21 seconds for the method of Lee et al. (2015) to be applied to a dataset containing 4,096 cells. For the 2020 U.S. Census, a large-scale system was constructed based on Amazon EMR (Elastic MapReduce, a distributed computing system) offered on Amazon Web Services, a commercial cloud computing infrastructure, which uses up to 100 high-performance machines (each with 96 virtual CPUs and 768 GB of RAM).

Terada et al. (2017) present a high-speed method for large-scale data, taking advantage of the fact that an

optimization problem with a total-number constraint and nonnegative constraint can be reduced to the problem

of projection onto a canonical simplex in a multidimensional vector space. This method is shown to process data

with 100,000 cells in 12.6 ms on a typical laptop computer at that time, which makes it a suitable method for

Population Census statistics and other large-scale statistics.
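As an illustration of the reduction just described, the following sketch implements the classic sort-and-threshold Euclidean projection onto the scaled canonical simplex, the well-known O(n log n) algorithm. Whether Terada et al. (2017) use exactly this variant is not stated here, so treat the sketch as an assumption.

```python
import numpy as np

def project_to_simplex(v, total):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = total},
    via the classic sort-and-threshold algorithm (O(n log n))."""
    u = np.sort(v)[::-1]                      # sort descending
    css = np.cumsum(u)
    ks = np.arange(1, len(v) + 1)
    # largest k such that u_k - (cumsum_k - total)/k is still positive
    k = ks[u - (css - total) / ks > 0][-1]
    tau = (css[k - 1] - total) / k            # shared threshold
    return np.maximum(v - tau, 0.0)

noisy = np.array([4.0, -1.2, 0.3, 7.5])       # e.g. Laplace-noised counts
x = project_to_simplex(noisy, total=10.0)
print(x, x.sum())                             # nonnegative, sums to 10
```

Note how the projection both restores nonnegativity and tends to return small noisy values to exactly zero, which is how sparseness can be recovered.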

Two approaches to data construction are considered in this study. In one approach, one of the aforemen-

tioned methods is applied to a statistical table from the Population Census as follows: in general, the method is

applied to only the population of the smallest geographical unit (the basic unit district, in the case of the Japanese

Population Census); and the resulting district-level population data are summed to obtain the population at the

municipal or prefectural level in a bottom-up manner. Another approach works in a top-down, recursive manner

as follows: one of the aforementioned methods is applied to the prefecture-level population data with the total

national population setting the total-number constraint in order to obtain privacy-protected prefecture-level pop-

ulation data; privacy-protected municipality-level population data is then obtained with the prefecture-level pop-

ulation setting the total-number constraint; and so on. In this paper, the former is called the bottom-up data con-

struction approach, and the latter is called the top-down data construction approach. The top-down algorithm

(TDA) used for the 2020 U.S. Census is an algorithm based on the top-down data construction approach.
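A toy sketch of one top-down step under these conventions: each parent's noisy child counts are projected so that they are nonnegative and sum to the parent's already-protected total. The function and variable names are hypothetical, and the real TDA is far more elaborate.

```python
import numpy as np

rng = np.random.default_rng(2)

def project_to_simplex(v, total):
    # Same sort-and-threshold projection as in the earlier sketch.
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    ks = np.arange(1, len(v) + 1)
    k = ks[u - (css - total) / ks > 0][-1]
    return np.maximum(v - (css[k - 1] - total) / k, 0.0)

def top_down_step(children_by_parent, parent_totals, epsilon):
    """One toy top-down step: perturb each parent's child counts with
    Laplace noise, then force them to be nonnegative and to sum to the
    already-protected parent total. Recursing over prefecture ->
    municipality -> town/village -> basic unit district repeats this
    step at each level."""
    out = {}
    for parent, counts in children_by_parent.items():
        noisy = counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)
        out[parent] = project_to_simplex(noisy, parent_totals[parent])
    return out

children = {"pref_A": np.array([120.0, 30.0, 0.0]),
            "pref_B": np.array([5.0, 45.0])}
totals = {"pref_A": 150.0, "pref_B": 50.0}    # protected at the level above
print(top_down_step(children, totals, epsilon=1.0))
```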


In both the bottom-up and top-down data construction approaches, constrained optimization satisfies the

nonnegative constraint, and data sparseness is expected to be restored. In addition, the top-down data construction

approach is expected to solve the problem of lost accuracy in partial sums.

Both data construction approaches are applicable not only to mesh statistics, but also to other statistics.

However, there have been few empirical studies that apply either approach to non-mesh statistics from the Japa-

nese Population Census. Therefore, the practicality and quantitative properties of these approaches are unclear.

In theory, the top-down approach is expected to produce better output than the Laplace mechanism in terms of

the accuracy of partial sums, but the extent of the superiority has not been quantitatively clarified.

3 Applying Differential Privacy to the 2015 Japanese Population Census Data

This study performs a comparative experiment concerning the application of differential privacy to Popula-

tion Census data. This section describes the data and procedures used in the experiment and presents its results.

3.1 Data Used in the Experiment

The experiment is based on three small area statistics with different aggregation categories that were created

from the 2015 Population Census individual data (full data). For each aggregate data table, the focus is on popu-

lation size; the level of aggregation is at the basic unit district (minimum geographic district) level; and the three

aggregation categories considered are “all”, “males and females”, and “males and females in 5-year age groups”.

In other words, the experiment is based on three aggregate data tables for (1) basic unit district-level total popu-

lation, (2) basic unit district-level population by gender, and (3) basic unit district-level population by gender and

5-year age group. These tables are hereinafter referred to as Aggregation Tables 1, 2, and 3, respectively.

3.2 Experimental Methods

This empirical study aims to gain knowledge relevant to the aforementioned three research questions by implementing various differential privacy methods on Aggregation Tables 1 to 3 and evaluating the utility of the resulting data. The specific procedures for implementing the methods and the indices used for comparative evaluation are explained below.

PRAM, the Laplace mechanism, the top-down data construction method, and the bottom-up data construction method, which are discussed in Section 2, were used as methods for achieving differential privacy. While methods based on the wavelet transform are effective for mesh statistics, they were excluded from consideration as they are difficult to apply to data tables with other geographical aggregation levels. As mentioned earlier, the output of the Laplace mechanism can include negative population values (which violate the nonnegative constraint); these were zeroed out as a post-processing adjustment. The constrained optimization method used with the top-down and bottom-up data construction approaches was the one proposed by Terada et al. (2017), which is based on projection onto a canonical simplex. To apply the top-down approach to

Aggregation Tables 2 and 3, the results for the (attribute-based) aggregation categories were merged. For example,

in the case of applying the top-down approach to Aggregation Table 2, the results of applying the top-down

approach to the male population data and the results of applying this approach to the female population data were

merged into one table, and this table was treated as the result of applying the top-down approach to Aggregation

Table 2.

In this experiment, the highest geographical level was not the national level, but the prefectural level. This

is due to limitations of the computational environment available at the on-site data access facility used for the

experiment. Therefore, instead of creating an aggregate data table for Japan as a whole and applying each method

to that table, a total of 47 aggregate data tables were created for all the prefectures, each method was applied to

each of those tables independently, and the resulting tables were then merged to calculate evaluation indices

(discussed below)3.

The aforementioned four methods were applied with each of the following nine values for the privacy loss budget (ε) set for the experiment: 0.1, 0.2, 0.7, 1.0, 1.1, 5, 10, 20, and 100. The values 0.7 and 1.1 were chosen as approximations of ln 2 and ln 3, respectively, which are frequently used as values of the privacy loss budget.

3 Ideally, in the top-down data construction approach, for example, the national level should be above the prefectural level. In this experiment, however, the prefecture level is the highest of the four geographical levels considered, followed by the municipality level, the town/village level, and the basic unit district level.

Though PRAM and the top-down data construction approach can be configured to allocate different privacy loss budgets to different geographical levels or attribute categories, in this experiment, each value of the privacy loss budget was evenly allocated.

Utility of the data was evaluated for population data at the basic unit district level (the most detailed popu-

lation data), and for population data at higher geographical levels (partial sums) in consideration of real-world

use of aggregate data. Specifically, the errors in the prefecture-level, municipality-level, town/village-level, and

basic unit district-level population data were quantitatively compared. The mean absolute error (MAE) was used

as an error index and was calculated for each of the three aggregation tables, the four methods applied, the nine values of the privacy loss budget, and the four geographical levels used for partial sums.
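As a sketch of how such an evaluation can be computed, the snippet below aggregates true and protected district-level counts to coarser levels with pandas and takes the MAE. All column names and data values are hypothetical.

```python
import pandas as pd

def mae_by_level(df, level_cols, true_col="true", prot_col="protected"):
    """Mean absolute error of partial sums at one geographic level.
    `df` holds one row per basic unit district; `level_cols` selects the
    grouping (e.g. ["prefecture"] or ["prefecture", "municipality"])."""
    sums = df.groupby(level_cols)[[true_col, prot_col]].sum()
    return (sums[true_col] - sums[prot_col]).abs().mean()

# Hypothetical toy data: two districts in each of two municipalities.
df = pd.DataFrame({
    "prefecture":   ["A", "A", "A", "A"],
    "municipality": ["m1", "m1", "m2", "m2"],
    "district":     ["d1", "d2", "d3", "d4"],
    "true":         [10, 0, 3, 7],
    "protected":    [11.2, 0.0, 2.1, 8.3],
})
for cols in (["prefecture"], ["prefecture", "municipality"],
             ["prefecture", "municipality", "district"]):
    print(cols, round(mae_by_level(df, cols), 3))
```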

3.3 Experimental Results

Tables 1 to 3 show the evaluation results. In each table, (a) PRAM, (b) Laplace, (c) BottomUp, and (d)

TopDown refer to PRAM, the Laplace mechanism (plus zeroing out of negative values), the bottom-up data

construction method, and the top-down data construction method, respectively. Tables 1, 2, and 3 summarize the

evaluation results for Aggregation Tables 1, 2, and 3, respectively.

It should be noted that in Table 1, the errors for the prefecture-level population data are zero for the three

methods other than the Laplace mechanism. These three methods (PRAM, the bottom-up data construction

method, and the top-down data construction method) have the characteristic that the total number of records in

the input data is preserved in the output data (which is referred to as satisfaction of the total-number constraint).

As mentioned above, the prefectural level was the highest geographical level in this experiment (due to limitations

of the computational environment). Therefore, when a method that satisfies the total-number constraint is applied

to Aggregation Table 1 (which does not have attribute-based aggregation categories), the errors for the population

data at the highest geographical level will be zero (because of the constraint). In other words, the result that errors

at the prefectural level in Table 1 are zero is attributable to the conditions of this experiment, and the same result

would not be obtained if the highest geographical level were the national level (instead, errors for total population

at the national level would become zero).

Attention should also be paid to the interpretation of the result for PRAM for the basic unit district level,

especially for small values of the privacy loss budget. For example, in Table 3, the errors introduced by PRAM into the basic unit district-level data vary very little across different values of the privacy loss budget (ε = 0.1 to 20). This reflects the fact that the degree of perturbation required by PRAM for a given privacy loss budget is so great that the output population data essentially follow a uniform distribution. In other words, since the basic

unit district-level aggregate data table used for Table 3 is quite sparse, with most of the values being either 0 or

1, even if the output data follows a random uniform distribution, at first glance its accuracy does not appear

undesirable at the basic unit district level. However, this is merely false accuracy and is not statistically meaningful.

Similarly, the errors caused by PRAM at the municipality level and the town/village level shown in Table 3 reveal

that the accumulation of errors greatly degrades the accuracy of the partial sums and there is significant degrada-

tion of the characteristics of the original aggregate data table.

4 Discussion

The experimental results show that the errors vary significantly across the differential privacy methods applied and also across the different values of the privacy loss budget (ε) considered. As discussed in Section 2, for a given privacy loss budget, differential privacy guarantees the same level of privacy protection regardless of the

differential privacy method used, but the utility of the resulting data depends on the method and the use of the

data. The results obtained in the experiment support this description. It is therefore necessary, in discussing the

practicality of applying differential privacy and the utility of the output data, to use various differential privacy

methods and different values of the privacy loss budget, as in this experiment, and to examine the results quanti-

tatively.

Table 1: Evaluation results for Aggregation Table 1 (basic unit district-level total population). All values are mean absolute errors (MAE).

ε      Method         Prefecture   Municipality   Town/Village   Basic Unit District
0.1    (a) PRAM             0.00       14408.44         520.10         48.65
0.1    (b) Laplace      98607.22        2490.96          83.50         17.60
0.1    (c) BottomUp         0.00         855.53          72.06         17.38
0.1    (d) TopDown          0.00          79.09          73.35         49.83
0.2    (a) PRAM             0.00       14399.53         520.26         48.64
0.2    (b) Laplace      30844.00         817.25          39.15          9.23
0.2    (c) BottomUp         0.00         367.38          36.86          9.21
0.2    (d) TopDown          0.00          41.12          37.78         30.42
0.7    (a) PRAM             0.00       14401.57         520.09         48.65
0.7    (b) Laplace       5433.01         157.62          11.06          2.75
0.7    (c) BottomUp         0.00          97.37          10.81          2.75
0.7    (d) TopDown          0.00          11.62          11.26         10.42
1      (a) PRAM             0.00       14406.80         520.17         48.65
1      (b) Laplace       3483.00         104.64           7.65          1.92
1      (c) BottomUp         0.00          65.78           7.52          1.91
1      (d) TopDown          0.00           7.84           7.83          7.38
1.1    (a) PRAM             0.00       14400.56         520.19         48.64
1.1    (b) Laplace       3117.37          94.48           6.96          1.74
1.1    (c) BottomUp         0.00          57.97           6.84          1.74
1.1    (d) TopDown          0.00           7.13           7.12          6.73
5      (a) PRAM             0.00       14351.58         518.16         48.47
5      (b) Laplace        609.34          19.08           1.54          0.39
5      (c) BottomUp         0.00          13.10           1.51          0.38
5      (d) TopDown          0.00           1.60           1.57          1.52
10     (a) PRAM             0.00       10215.78         359.65         34.59
10     (b) Laplace        314.63           9.65           0.77          0.19
10     (c) BottomUp         0.00           6.43           0.76          0.19
10     (d) TopDown          0.00           0.84           0.79          0.76
20     (a) PRAM             0.00           3.53           0.23          0.02
20     (b) Laplace        152.12           4.80           0.38          0.10
20     (c) BottomUp         0.00           3.15           0.38          0.10
20     (d) TopDown          0.00           0.42           0.39          0.38
100    (a) PRAM             0.00           0.00           0.00          0.00
100    (b) Laplace         30.92           0.96           0.08          0.02
100    (c) BottomUp         0.00           0.64           0.08          0.02
100    (d) TopDown          0.00           0.08           0.08          0.08

Table 2: Evaluation results for Aggregation Table 2 (basic unit district-level total population by gender). All values are MAE.

ε      Method         Prefecture   Municipality   Town/Village   Basic Unit District
0.1    (a) PRAM         35301.70        7277.54         261.95         24.80
0.1    (b) Laplace     158482.08        3940.62          96.57         16.10
0.1    (c) BottomUp      4800.07         977.40          68.02         15.55
0.1    (d) TopDown         60.08          79.80          69.48         36.40
0.2    (a) PRAM         34432.49        7273.24         261.97         24.81
0.2    (b) Laplace      50430.20        1271.48          41.92          8.77
0.2    (c) BottomUp      2081.00         443.80          36.11          8.68
0.2    (d) TopDown         24.09          39.48          36.56         24.83
0.7    (a) PRAM         29972.11        7260.04         261.72         24.79
0.7    (b) Laplace       7016.17         193.09          11.17          2.72
0.7    (c) BottomUp       432.39         100.55          10.83          2.71
0.7    (d) TopDown          9.27          11.83          11.16          9.67
1      (a) PRAM         27463.17        7255.24         261.68         24.80
1      (b) Laplace       4239.02         120.71           7.69          1.90
1      (c) BottomUp       310.38          68.29           7.50          1.89
1      (d) TopDown          7.14           8.10           7.77          7.00
1.1    (a) PRAM         26432.21        7247.65         261.63         24.80
1.1    (b) Laplace       3728.69         107.08           7.00          1.73
1.1    (c) BottomUp       271.13          61.45           6.84          1.73
1.1    (d) TopDown          5.63           7.30           7.07          6.43
5      (a) PRAM          5492.60        7215.53         261.27         24.78
5      (b) Laplace        653.28          19.92           1.54          0.38
5      (c) BottomUp        59.69          12.58           1.51          0.38
5      (d) TopDown          1.19           1.58           1.57          1.51
10     (a) PRAM           530.02        7197.73         260.39         24.70
10     (b) Laplace        315.57           9.95           0.77          0.19
10     (c) BottomUp        26.27           6.56           0.76          0.19
10     (d) TopDown          0.54           0.82           0.78          0.76
20     (a) PRAM             7.51        5115.43         180.82         17.74
20     (b) Laplace        159.90           4.92           0.39          0.10
20     (c) BottomUp        14.20           3.23           0.38          0.10
20     (d) TopDown          0.25           0.40           0.39          0.38
100    (a) PRAM             0.00           0.00           0.00          0.00
100    (b) Laplace         32.34           1.00           0.08          0.02
100    (c) BottomUp         2.49           0.64           0.08          0.02
100    (d) TopDown          0.08           0.08           0.08          0.08

Table 3: Evaluation results for Aggregation Table 3 (basic unit district-level total population by gender and 5-year age group). All values are MAE.

ε      Method         Prefecture   Municipality   Town/Village   Basic Unit District
0.1    (a) PRAM         15899.43         587.27          18.50          2.05
0.1    (b) Laplace     376314.96        9323.57         173.38         10.77
0.1    (c) BottomUp     14008.77         544.73          25.62          3.23
0.1    (d) TopDown         81.64          76.50          30.75          3.44
0.2    (a) PRAM         15874.93         586.70          18.49          2.05
0.2    (b) Laplace     175973.88        4359.93          81.69          5.69
0.2    (c) BottomUp     11884.21         447.52          19.48          2.80
0.2    (d) TopDown         41.25          39.09          20.72          3.28
0.7    (a) PRAM         15703.82         584.13          18.46          2.05
0.7    (b) Laplace      40261.21         997.68          19.59          1.90
0.7    (c) BottomUp      5618.71         203.17           8.85          1.55
0.7    (d) TopDown         11.51          11.42           8.38          2.66
1      (a) PRAM         15573.72         582.26          18.44          2.05
1      (b) Laplace      25349.47         628.32          12.71          1.38
1      (c) BottomUp      4077.82         146.72           6.56          1.20
1      (d) TopDown          7.94           7.93           6.20          2.37
1.1    (a) PRAM         15540.75         581.12          18.43          2.05
1.1    (b) Laplace      22484.80         557.35          11.37          1.27
1.1    (c) BottomUp      3725.80         133.89           6.05          1.12
1.1    (d) TopDown          7.26           7.28           5.73          2.29
5      (a) PRAM         12827.43         541.94          17.99          2.04
5      (b) Laplace       3506.22          87.20           2.10          0.31
5      (c) BottomUp       795.49          28.69           1.50          0.30
5      (d) TopDown          1.61           1.60           1.45          0.96
10     (a) PRAM          6345.74         469.39          17.20          2.02
10     (b) Laplace       1684.23          41.92           1.03          0.16
10     (c) BottomUp       395.63          14.25           0.76          0.15
10     (d) TopDown          0.77           0.80           0.74          0.55
20     (a) PRAM           356.10         433.32          16.58          1.98
20     (b) Laplace        839.77          20.90           0.52          0.08
20     (c) BottomUp       197.79           7.14           0.38          0.08
20     (d) TopDown          0.40           0.40           0.38          0.29
100    (a) PRAM             0.00           0.00           0.00          0.00
100    (b) Laplace        167.76           4.17           0.10          0.02
100    (c) BottomUp        39.47           1.42           0.08          0.02
100    (d) TopDown          0.08           0.08           0.08          0.06

The experimental results also show that the characteristics of the errors associated with data for the smallest geographical unit (the basic unit district) and the characteristics of the errors associated with the partial sums for larger geographical units (e.g., municipalities) are not necessarily the same. If the errors at the basic unit district

level are taken as indices of the utility of the relevant data, then the output data from the bottom-up data con-

struction method and the Laplace mechanism seem superior4. However, for partial sums at the municipality level

and town/village level, the errors tend to be larger for both the bottom-up method and the Laplace mechanism,

and the tendency is particularly noticeable with the Laplace mechanism. The reason is that zeroing out negative

values to satisfy the nonnegative constraint in the experimental implementation of the Laplace mechanism intro-

duces small positive biases throughout all of the output data5. These biases rarely show up as large errors for the

basic unit district-level output data; however, they can cause serious overestimations in the output data for higher

geographical levels.

The top-down data construction method is inferior to the bottom-up data construction method in terms of

the errors at the basic unit district level. However, for the top-down method, the errors are not accumulated at

higher geographical levels. In other words, the errors are largely unchanged across different geographical levels

and indicate high levels of data utility. Specifically, the results for a privacy loss budget (ε = 20), which is close

to the value used for the 2020 U.S. Census, show that for every type of output data based on a single variable or

a combination of variables, the errors are smaller than 1 for the partial sums at all geographical levels, which

indicates relatively high levels of data utility. However, in comparing the output data, it should be noted that the

number of cells in an aggregate data table increases as the number of variables increases, or as more detailed

categories are used, and that the errors for partial sums based on an aggregate data table with a large number of

cells will tend to be large.

Comparing PRAM with the other methods from this viewpoint shows that, under a given set of conditions,

in most cases PRAM is significantly inferior in terms of privacy protection efficiency. For small values of the

privacy loss budget, the results of applying PRAM at the basic unit district level seem to be superior to the results

of other methods. However, as discussed in Section 3, this is attributable to false accuracy and does not have real

meaning.

In addition, while the data utility associated with the methods other than PRAM tends to improve as the value of ε increases, this tendency is hardly seen for PRAM. This result implies that the privacy protection efficiency of PRAM is far worse than that of the other methods. Perturbation in PRAM is performed as follows: the value of a certain attribute in the individual data remains unchanged with a given probability ρ or is randomly changed with probability 1 − ρ. For PRAM to achieve differential privacy, the value of ρ must be determined as a function of ε. However, the calculated value of ρ is often close to zero unless the value of ε is very large. In other words, even if privacy protection is sacrificed by moderately increasing the value of ε, ρ remains close to zero, which does not lead to improved data utility. This suggests that PRAM can achieve differential privacy but entails quite low privacy protection efficiency, and that the use of other methods should be considered.
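This behavior is easy to reproduce for one common PRAM variant (keep the true category with probability ρ, otherwise redraw uniformly from all k categories). For that variant, ε-DP requires ρ = (e^ε − 1)/(e^ε − 1 + k), so ρ stays near zero for moderate ε when k is large. The paper's actual transition matrix may differ, so treat this as an assumption-laden sketch.

```python
import math

def pram_keep_prob(epsilon, k):
    """Retention probability rho for a simple PRAM scheme that keeps the
    true category with probability rho and otherwise redraws it uniformly
    from all k categories, tuned so the scheme satisfies epsilon-DP.
    (One common variant; the paper's exact scheme may differ.)"""
    return (math.exp(epsilon) - 1.0) / (math.exp(epsilon) - 1.0 + k)

# With k = 42 categories (say, 2 genders x 21 five-year age groups):
for eps in (0.1, 1.0, 5.0, 20.0):
    print(f"epsilon={eps:5.1f}  rho={pram_keep_prob(eps, k=42):.4f}")
```

Even at ε = 1, ρ is only about 0.04 here, i.e., almost every record is randomized, which matches the near-uniform output behavior observed in the experiment.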

What method, then, is suitable for applying differential privacy to Japanese Population Census statistics? As

mentioned before, satisfying the nonnegative constraint is a problem for a simple Laplace mechanism. Even if

an attempt is made to satisfy the constraint by zeroing out negative values as in this experiment, it is still difficult

to obtain a practically usable aggregate data table because of the large overestimation bias affecting partial sums.

Also, PRAM clearly fails to achieve both a reasonable level of privacy protection and data utility.

Regarding the bottom-up data construction method and the top-down data construction method, the errors

at the basic unit district level show that the bottom-up method provides higher data utility, but its errors for partial

sums increase as the range of cells used for the partial sums expands, which indicates decreasing data utility. In

contrast, for the top-down method the errors for partial sums remain small. Therefore, when partial sums are

calculated for a higher geographical level, the degree of data utility is maintained.

In summary, judging from the errors for the output data at the basic unit district level, data utility is relatively

high for the bottom-up method, but data utility deteriorates for partial sums since the errors associated with them

tend to increase significantly. For the top-down method, errors at the basic unit district level are larger than those

for the bottom-up method; however, if the level of privacy protection is properly set, the utility of the various output data considered in this study is maintained, even when the effect on partial sums is taken into account.

4 In Table 2, PRAM seems to be the best for small values of ε, but this observation is attributable to false accuracy and does not have real meaning, as discussed in the previous section.

5 In other words, without zeroing out negative values, errors would not expand significantly, but, at the same time, the nonnegative constraint would not be satisfied. For this reason, the resulting tables could contain a large number of negative population values.


It should be noted, however, that the variables used in this empirical experiment are limited, and that only

the errors in output data, including partial sums, in relevant cells are used as indices for evaluating data utility

and no other statistics are used for evaluation purposes. Further investigation into the applicability of various

methods to the Population Census data should be carried out using a variety of statistics for analysis.

5 Conclusion

To explore the applicability of differential privacy to Population Census statistics, this paper evaluated the

utility of statistical tables for different geographical levels which were created using individual data from the

Population Census and by applying various differential privacy methods. In addition, a comparison was made

with the conventional anonymization method PRAM. The results of this study show that in applying differential

privacy to Japanese Population Census data, the top-down data construction method yields a higher level of data

utility than the other methods. As described above, the U.S. Census Bureau has adopted TDA for creating and publishing detailed statistical tables for the 2020 Census. The results of this study are therefore consistent with the

Census Bureau’s choice of technique used in its statistical work. This suggests that given a hierarchical geograph-

ical structure, reasonable results from the standpoint of data utility can be obtained by top-down, consistent allo-

cation of the noise generated based on differential privacy to the cells of a statistical table (rather than injecting noise into the cells at the basic unit district level and performing aggregation for higher geographical levels by summing the values of the lower-level cells).

This experimental study was the first attempt to apply differential privacy to detailed statistical tables (a

basic unit district-level statistical table) from the Japanese Population Census. This study’s investigation was

based on basic unit district-level cross tables with age and gender categories. Since the Population Census collects

demographic data (beyond age and gender), employment data (including employment status, industry, and occu-

pation), and residential data, future studies can potentially examine the data utility resulting from applying a

differential privacy method to statistical tables created with these other variables used for aggregation categories.

Our future research agenda also includes further investigation into the effectiveness of differential privacy based

on aggregate data tables created with various Population Census variables.

6 Note

This paper is the revised version of Ito et al. (2023), which was published in Japanese.

7 References

Abowd, J. M. (2018) Staring-down the database reconstruction theorem, Joint Statistical Meetings, Vancouver,

BC, Canada.

Dwork, C. (2007) “An Ad Omnia Approach to Defining and Achieving Private Data Analysis”, Proc. 1st intl.

conf. Privacy, security, and trust in KDD, pp. 1-13.

Garfinkel, S., Abowd, J. M., and Martindale, C. (2019) “Understanding Database Reconstruction Attacks on Public Data”, Communications of the ACM, Vol. 62, No. 3, ACM, pp. 46-53.

Garfinkel, S. (2022) “Differential Privacy and the 2020 US Census”, MIT Case Studies in Social and Ethical

Responsibilities of Computing, Winter 2022.

Ito, S., Yoshitake, T., Kikuchi, R., Akutsu, F. (2018) “Comparative Study of the Effectiveness of Perturbative

Methods for Creating Official Microdata in Japan”, Josep Domingo-Ferrer and Francisco Montes (eds.)

Privacy in Statistical Databases: UNESCO Chair in Data Privacy, International Conference, PSD

2018, Valencia, Spain, September 26–28, 2018, Proceedings (Lecture Notes in Computer Science),

Springer, pp.200-214.

Ito, S. and Terada, M. (2019) The potential of anonymization method for creating detailed geographical data in

Japan, Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality, pp. 1–14.

Ito, S. and Terada, M. (2020) “An Evaluation of Anonymization Methods for Creating Detailed Geographical

Data” (in Japanese), Journal of the Japan Statistical Society, Vol. 50, No. 1, pp.139-166.


Ito, S., Miura, T., Akatsuka, H., and Terada, M. (2020) “Differential Privacy and Its Applicability for Official

Statistics in Japan – A Comparative Study Using Small Area Data from the Japanese Population Cen-

sus”, Josep Domingo-Ferrer and Krishnamurty Muralidhar (eds.) Privacy in Statistical Databases:

UNESCO Chair in Data Privacy, International Conference, PSD 2020, Tarragona, Spain, September

23–25, 2020, Proceedings (Lecture Notes in Computer Science), Springer, pp.337-352.

Ito, S., Terada, M., and Kato, S. (2023) “An Empirical Study on the Effectiveness of Perturbative Methods Applied

to Japanese Population Census Data” (in Japanese) Research Paper, No.58, pp.1-26.

Lee, J., Wang, Y., and Kifer, D. (2015) “Maximum Likelihood Postprocessing for Differential Privacy under

Consistency Constraints.” Proc. 21st ACM SIGKDD intl. conf. Knowledge Discovery and Data Mining

(KDD ’15), pp. 635–644.

Terada, M., Suzuki, R., Yamaguchi, T., and Hongo, S. (2015) “On Publishing Large Tabular Data with Differ-

ential Privacy” (in Japanese), Transactions of the Information Processing Society of Japan, Vol. 56,

No. 9, pp. 1801-1816.

Terada, M., Yamaguchi, T., and Hongo, S. (2017), “On Releasing Anonymized Microdata with Differential

Privacy” (in Japanese), Transactions of the Information Processing Society of Japan, Vol. 58, No. 9,

pp. 1483-1500.

Xiao, X., Wang, G., and Gehrke, J. (2011) “Differential Privacy via Wavelet Transforms”, IEEE Transactions on Knowledge and Data Engineering, Vol. 23, No. 8, pp. 1200-1214.

The Potential of Differential Privacy Applied to Detailed Statistical

Tables Created Using Microdata from the Japanese Population Census

Shinsuke Ito, Chuo University, Japan

Masayuki Terada, NTT DOCOMO, Inc.

Shunsuke Kato, Statistics Bureau of Japan

1. Introduction

2. Application of Differential Privacy to Census Statistics

3. Applying Differential Privacy to the 2015 Japanese Population Census Data

4. Discussion

5. Conclusion and Outlook

1. Introduction


・Recent international trends in privacy-protecting techniques applied to official

statistics include the active use of perturbative methods: The U.S. Census

Bureau has investigated the applicability of perturbative methods based on the

methodology of differential privacy.

・Several empirical studies on the effectiveness of perturbative methods for

Japanese official microdata have been conducted in Japan (e.g., Ito et al. (2018)).

Other studies have investigated the possibility of adapting differential privacy for

detailed geographical data from the Japanese Population Census (Ito and Terada

(2019) and Ito et al. (2020)).



・ Investigating the applicability of differential privacy techniques to Japanese official

statistics is worthwhile from the standpoint of discussing the future creation and

publication of official statistical tables and the future direction of secondary use of

official statistics.

・This paper explores the applicability of differential privacy to data from the

Japanese Population Census not only by empirically demonstrating the

characteristics of data obtained under various perturbative methods, but also by

examining the utility of these methods for Japanese official statistical data.

2. Application of Differential Privacy to Census Statistics

・Differential privacy is a privacy protection framework aimed at achieving

comprehensive (ad omnia) data security against arbitrary attacks including

unknown attacks.

・Differential privacy does not refer to a certain privacy protection method, but

rather is a framework for defining the level of data security provided.
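For reference, the standard formal definition behind these bullets (standard notation, not from the slides themselves): a randomized mechanism M satisfies ε-differential privacy if, for all pairs of datasets D and D′ differing in a single record and every measurable set S of outputs,

```latex
\Pr[\mathcal{M}(D)\in S]\ \le\ e^{\varepsilon}\,\Pr[\mathcal{M}(D')\in S].
```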

・ The Laplace mechanism is a typical mechanism for achieving differential

privacy. Some traditional statistical disclosure control methods such as PRAM

(Post RAndomization Methods) are also known to provide data security based

on differential privacy.



・A random number from the Laplace distribution centered on zero (Laplace

noise) can be added independently to each cell (even if its value is 0) in the

contingency table.
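A minimal numpy sketch of this mechanism, under the assumption that each person contributes to exactly one cell (so the L1 sensitivity of the table is 1); the table values are made up.

```python
import numpy as np

rng = np.random.default_rng(3)

def laplace_mechanism(table, epsilon, sensitivity=1.0):
    """Add independent Laplace(0, sensitivity/epsilon) noise to every cell,
    including cells whose true count is 0. Assumes each person contributes
    to exactly one cell, so the L1 sensitivity of the table is 1."""
    return table + rng.laplace(scale=sensitivity / epsilon, size=table.shape)

counts = np.array([[12, 0, 3],
                   [0, 7, 1]])        # a tiny contingency table
print(laplace_mechanism(counts, epsilon=1.0))
```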

・Not all mechanisms are suitable for statistical tables from the Japanese

Population Census. Even if some mechanisms are suitable, using the wrong

mechanism would significantly reduce the utility of the output statistics.

・Simply applying the Laplace mechanism to a large-scale contingency table,

such as those from the Population Census, causes the issues listed below and

reduces the utility of statistics (Terada et al. (2015), Ito and Terada (2020)).


(1) Deviation from the nonnegative constraint

(2) Loss of sparseness

(3) Loss of accuracy in partial sums



・A method based on wavelet transform has been shown to be useful for mesh

population statistics. The Privelet method (Xiao et al. (2011)) introduces the Haar

wavelet transform in the process of noise injection so that the noises of neighboring

cells offset one another, and thereby increases the accuracy of the partial sum for a

continuous domain.

・Terada et al. (2015) propose a method based on the Morton order mapping and

the Wavelet transform with nonnegative refinement (hereinafter, “nonnegative

wavelet method”). In this method, noise is injected over the wavelet space as in the

Privelet method. By applying an inverse wavelet transform while correcting coefficient

values to prevent the output from deviating from the nonnegative constraint, the

nonnegative wavelet method produces population data that satisfies the nonnegative

constraint and guarantees differential privacy.

・An empirical experiment (Ito and Terada (2020)), in which the nonnegative wavelet

method is applied to mesh statistics from the 2010 Population Census, shows that the

nonnegative wavelet method solves the three problems discussed above.



・In order to solve the above problems (deviation from the nonnegative constraint, loss of sparseness, and loss of accuracy in partial sums), it is useful to apply a constrained optimization method, which searches for the nearest vector satisfying a total-number constraint and a nonnegative constraint, to population statistics other than mesh statistics when achieving differential privacy.

・There are two types of approach to applying constrained optimization: the bottom-up data construction approach and the top-down data construction approach.

(1) Bottom-up data construction approach: the method is applied to only the

population of the smallest geographical unit (the basic unit district, in the case

of the Japanese Population Census); and the resulting district-level population

data are summed to obtain the population at the municipal or prefectural level

in a bottom-up manner.

(2) Top-down data construction approach: the method is applied to the prefecture-level population data with the total national population setting the total-number constraint, in order to obtain privacy-protected prefecture-level population data; the method is then applied recursively at the municipality, town/village, and basic unit district levels. This is the same approach as was used for the 2020 U.S. Census data.


3. Applying Differential Privacy to the 2015 Japanese Population Census Data

3.1 Data Used in the Experiment

・The experiment is based on three small area statistics (Aggregation Tables 1 to 3) with different aggregation categories that were created from individual data from the

2015 Population Census.

・The experiment is based on three aggregate data tables for

(1) basic unit district-level total population (Aggregation Table 1)

(2) basic unit district-level population by gender (Aggregation Table 2)

(3) basic unit district-level population by gender and 5-year age group (Aggregation

Table 3).

・ For each aggregate data table, the focus is on population size; the level of aggregation is at

the basic unit district (minimum geographic district) level; and the three aggregation

categories considered are “all”, “males and females”, and “males and females in 5-year

age groups”.


3.2 Experimental Methods

・This empirical study aims to gain knowledge by implementing various differential privacy methods on Aggregation Tables 1 to 3, and

evaluates the utility of the data.

・(1) PRAM, (2) Laplace mechanism, (3) top-down data construction method,

and (4) bottom-up data construction method were used as methods for

achieving differential privacy.

・Output of the Laplace mechanism can include negative population values. These

were zeroed out as a post-processing adjustment. The constraint optimization

method used with the top-down data construction approach or the bottom-up data

construction approach was the one proposed by Terada et al. (2017).



・Four methods were applied with each of the following eight values for the privacy loss budget (ε) set for the experiment: 0.1, 0.2, 0.7, 1.0, 1.1, 5, 10, and 20. (The values 0.7 and 1.1 were chosen as approximations of ln 2 and ln 3, respectively, which are frequently used as values of the privacy loss budget.)

・Though PRAM and the top-down data construction approach can be configured to

allocate different privacy loss budgets to different geographical levels or attribute

categories, in this experiment, each value of the privacy loss budget was evenly

allocated.

・Utility of the data was evaluated for population data at the basic unit district level, and

for population data at higher geographical levels (partial sums). Specifically, the errors

in the prefecture-level, municipality-level, town/village-level, and basic unit

district-level population data were quantitatively compared.

・The mean absolute error (MAE) was used as an error index in this study.



Table 1: Evaluation Results for Aggregation Table 1: Basic Unit District-level Total Population


The errors for the prefecture-level population data are zero for the three methods other than the Laplace mechanism. PRAM, the bottom-

up data construction method, and the top-down data construction method have the characteristic that the total number of records in the

input data is preserved in the output data.

ε      Method         Prefecture   Municipality   Town/Village   Basic Unit District
0.1    (a) PRAM             0.00       14408.44         520.10        48.65
0.1    (b) Laplace      98607.22        2490.96          83.50        17.60
0.1    (c) BottomUp         0.00         855.53          72.06        17.38
0.1    (d) TopDown          0.00          79.09          73.35        49.83
0.2    (a) PRAM             0.00       14399.53         520.26        48.64
0.2    (b) Laplace      30844.00         817.25          39.15         9.23
0.2    (c) BottomUp         0.00         367.38          36.86         9.21
0.2    (d) TopDown          0.00          41.12          37.78        30.42
0.7    (a) PRAM             0.00       14401.57         520.09        48.65
0.7    (b) Laplace       5433.01         157.62          11.06         2.75
0.7    (c) BottomUp         0.00          97.37          10.81         2.75
0.7    (d) TopDown          0.00          11.62          11.26        10.42
1      (a) PRAM             0.00       14406.80         520.17        48.65
1      (b) Laplace       3483.00         104.64           7.65         1.92
1      (c) BottomUp         0.00          65.78           7.52         1.91
1      (d) TopDown          0.00           7.84           7.83         7.38
5      (a) PRAM             0.00       14351.58         518.16        48.47
5      (b) Laplace        609.34          19.08           1.54         0.39
5      (c) BottomUp         0.00          13.10           1.51         0.38
5      (d) TopDown          0.00           1.60           1.57         1.52
10     (a) PRAM             0.00       10215.78         359.65        34.59
10     (b) Laplace        314.63           9.65           0.77         0.19
10     (c) BottomUp         0.00           6.43           0.76         0.19
10     (d) TopDown          0.00           0.84           0.79         0.76
20     (a) PRAM             0.00           3.53           0.23         0.02
20     (b) Laplace        152.12           4.80           0.38         0.10
20     (c) BottomUp         0.00           3.15           0.38         0.10
20     (d) TopDown          0.00           0.42           0.39         0.38


Table 2: Evaluation Results for Aggregation Table 2: Basic Unit District-level

Total Population by Gender

ε      Method         Prefecture   Municipality   Town/Village   Basic Unit District
0.1    (a) PRAM         35301.70        7277.54         261.95        24.80
0.1    (b) Laplace     158482.08        3940.62          96.57        16.10
0.1    (c) BottomUp      4800.07         977.40          68.02        15.55
0.1    (d) TopDown         60.08          79.80          69.48        36.40
0.2    (a) PRAM         34432.49        7273.24         261.97        24.81
0.2    (b) Laplace      50430.20        1271.48          41.92         8.77
0.2    (c) BottomUp      2081.00         443.80          36.11         8.68
0.2    (d) TopDown         24.09          39.48          36.56        24.83
0.7    (a) PRAM         29972.11        7260.04         261.72        24.79
0.7    (b) Laplace       7016.17         193.09          11.17         2.72
0.7    (c) BottomUp       432.39         100.55          10.83         2.71
0.7    (d) TopDown          9.27          11.83          11.16         9.67
1      (a) PRAM         27463.17        7255.24         261.68        24.80
1      (b) Laplace       4239.02         120.71           7.69         1.90
1      (c) BottomUp       310.38          68.29           7.50         1.89
1      (d) TopDown          7.14           8.10           7.77         7.00
5      (a) PRAM          5492.60        7215.53         261.27        24.78
5      (b) Laplace        653.28          19.92           1.54         0.38
5      (c) BottomUp        59.69          12.58           1.51         0.38
5      (d) TopDown          1.19           1.58           1.57         1.51
10     (a) PRAM           530.02        7197.73         260.39        24.70
10     (b) Laplace        315.57           9.95           0.77         0.19
10     (c) BottomUp        26.27           6.56           0.76         0.19
10     (d) TopDown          0.54           0.82           0.78         0.76
20     (a) PRAM             7.51        5115.43         180.82        17.74
20     (b) Laplace        159.90           4.92           0.39         0.10
20     (c) BottomUp        14.20           3.23           0.38         0.10
20     (d) TopDown          0.25           0.40           0.39         0.38


Table 3: Evaluation Results for Aggregation Table 3: Basic Unit District-level

Total Population by Gender and 5-year Age Group

The errors introduced by PRAM into the basic unit district-level data vary very little across different values of the privacy loss budget. The MAE values calculated for PRAM at ε = 0.1 and 0.2 are smaller than those for the other three methods.

ε      Method         Prefecture   Municipality   Town/Village   Basic Unit District
0.1    (a) PRAM         15899.43         587.27          18.50         2.05
0.1    (b) Laplace     376314.96        9323.57         173.38        10.77
0.1    (c) BottomUp     14008.77         544.73          25.62         3.23
0.1    (d) TopDown         81.64          76.50          30.75         3.44
0.2    (a) PRAM         15874.93         586.70          18.49         2.05
0.2    (b) Laplace     175973.88        4359.93          81.69         5.69
0.2    (c) BottomUp     11884.21         447.52          19.48         2.80
0.2    (d) TopDown         41.25          39.09          20.72         3.28
0.7    (a) PRAM         15703.82         584.13          18.46         2.05
0.7    (b) Laplace      40261.21         997.68          19.59         1.90
0.7    (c) BottomUp      5618.71         203.17           8.85         1.55
0.7    (d) TopDown         11.51          11.42           8.38         2.66
1      (a) PRAM         15573.72         582.26          18.44         2.05
1      (b) Laplace      25349.47         628.32          12.71         1.38
1      (c) BottomUp      4077.82         146.72           6.56         1.20
1      (d) TopDown          7.94           7.93           6.20         2.37

3.3 Experimental Results

・For Tables 1 to 3, (a) PRAM, (b) Laplace, (c) BottomUp, and (d) TopDown refer to

PRAM, the Laplace mechanism (plus zeroing out of negative values), the bottom-

up data construction method, and the top-down data construction method, respectively.

・The result that errors at the prefectural level in Table 1 are zero is attributable to the

conditions of this experiment, and the same result would not be obtained if the

highest geographical level were the national level (instead, errors for total population

at the national level would become zero).

・In the case of PRAM, the basic unit district-level aggregate data table used for

Table 3 is quite sparse. Therefore, at first glance its accuracy does not appear

undesirable at the basic unit district level. However, this is a false accuracy and not

statistically meaningful. In fact, the errors caused by PRAM at the municipality level

and the town/village level shown in Table 3 reveal that the accumulation of errors

greatly degrades the accuracy of the partial sums and there is significant

degradation of the characteristics of the original aggregate data table.

4. Discussion

・For a given privacy loss budget, differential privacy guarantees the same level of

privacy protection regardless of the differential privacy method used, but the utility of

the resulting data depends on the method and the use of the data.

・If the errors at the basic unit district level are taken as indices of the utility of the

relevant data, then the output data from the bottom-up data construction method and

the Laplace mechanism seem superior.

・For partial sums at the municipality level and town/village level, the errors tend

to be larger for both the bottom-up method and the Laplace mechanism, and the

tendency is particularly noticeable with the Laplace mechanism.

・The top-down data construction method is inferior to the bottom-up data construction

method in terms of the errors at the basic unit district level. However, for the top-

down method, the errors are not accumulated at higher geographical levels.


・Satisfying the nonnegative constraint is a problem for a simple Laplace

mechanism. Even if an attempt is made to satisfy the constraint by zeroing out

negative values as in this experiment, it is still difficult to obtain a practically

usable aggregate data table because of the large overestimation bias affecting

partial sums.

・PRAM clearly fails to achieve both a reasonable level of privacy protection and

data utility. Under a given set of conditions, in most cases PRAM is significantly

inferior in terms of privacy protection efficiency. For small values of the privacy

loss budget, the results of applying PRAM at the basic unit district level seem to be

superior to the results of other methods. However, this is attributable to false

accuracy.

・Regarding the bottom-up data construction method and the top-down data

construction method, the errors at the basic unit district level show that the bottom-up

method provides higher data utility, but its errors for partial sums increase as the

range of cells used for the partial sums expands, which indicates decreasing

data utility.


・For the top-down method, the errors for partial sums remain small. Therefore,

when partial sums are calculated for higher geographical levels, the degree of data

utility is maintained.

・Judging from the errors for the output data at the basic unit district level, data utility is

relatively high for the bottom-up method, but deteriorates for partial sums since

the errors associated with them tend to increase significantly.

For the top-down method, errors at the basic unit district level are larger than for the

bottom-up method, and if the level of privacy protection is properly set, the utility of

different output data considered in this study is maintained, even when the effect

on partial sums is taken into account.


5. Conclusion and Outlook

(1) This paper evaluates the utility of statistical tables for different geographical

levels which were created using individual data from the Population Census and

by applying various differential privacy methods.

(2) The results of this study show that in applying differential privacy to Japanese

Population Census data, the top-down data construction method yields a

higher level of data utility than the other methods.

(3) This study also suggests that given a hierarchical geographical structure,

reasonable results from the standpoint of data utility can be obtained by

top-down, consistent allocation of the noise generated based on differential

privacy to the cells of a statistical table.

(4) Our future research agenda also includes further investigation into the

effectiveness of differential privacy based on aggregate data tables created with

various Population Census variables.



New Hedonic Quality Adjustment Method using Sparse Estimation, Japan


New Hedonic Quality Adjustment Method using Sparse Estimation Sahoko Furuta* [email protected] Yudai Hatayama** [email protected] Atsushi Kawakami*** [email protected] Yusuke Oh** [email protected]

No.21-E-8 July 2021

Bank of Japan 2-1-1 Nihonbashi-Hongokucho, Chuo-ku, Tokyo 103-0021, Japan

* Research and Statistics Department ** Research and Statistics Department (currently, Financial System and Bank Examination Department) *** Research and Statistics Department (currently, International Department)

Papers in the Bank of Japan Working Paper Series are circulated in order to stimulate discussion and comments. Views expressed are those of authors and do not necessarily reflect those of the Bank. If you have any comment or question on the working paper series, please contact each author. When making a copy or reproduction of the content for commercial purposes, please contact the Public Relations Department ([email protected]) at the Bank in advance to request permission. When making a copy or reproduction, the source, Bank of Japan Working Paper Series, should explicitly be credited.

Bank of Japan Working Paper Series


New Hedonic Quality Adjustment Method using Sparse Estimation*

Sahoko Furuta,† Yudai Hatayama,‡ Atsushi Kawakami,§ Yusuke Oh∗∗

July 2021

Abstract

In the application of the hedonic quality adjustment method to the price index, multicollinearity and the omitted variable bias arise as practical issues. This study proposes the new hedonic quality adjustment method using ‘sparse estimation’ in order to overcome these problems. The new method deals with these problems by ensuring two properties: the ‘grouped effect’ that gives robustness for multicollinearity and the ‘oracle property’ that provides the appropriate variable selection and asymptotically unbiased estimators. We conduct an empirical analysis applying the new method to the producer price index of passenger cars in Japan. In comparison with the conventional standard estimation method, the new method brings the following benefits: 1) a significant increase in the number of variables in the regression model; 2) an improvement in the fit of the regression model to actual prices; and 3) reduced overestimation of the product quality improvements due to the omitted variable bias. These results suggest the possible improvement in the accuracy of the price index while enhancing the usefulness of the hedonic quality adjustment method.

JEL Classification: C43, E31, C52 Keywords: Price Index, Quality Adjustment, Hedonic Regression Model, Multicollinearity,

Omitted Variable Bias, Sparse Estimation, Adaptive Elastic Net

* The authors thank Naohito Abe, Fumio Funaoka, Yukinobu Kitamura, Chihiro Shimizu, Shigenori Shiratsuka, and the staff of the Bank of Japan for their valuable comments. We also thank Yuto Ishikuro, Marina Eguchi, Taiki Kubo, and Kotaro Shinma for their cooperation in data calculations. All remaining errors are our own. The views expressed in this study are those of the authors and do not necessarily reflect the official views of the Bank of Japan. † Research and Statistics Department, Bank of Japan (E-mail: [email protected]) ‡ Research and Statistics Department, (currently, Financial Systems and Bank Examination Department), Bank of Japan (E-mail: [email protected]) § Research and Statistics Department, (currently, International Department), Bank of Japan (E-mail: [email protected]) ∗∗ Research and Statistics Department, (currently, Financial Systems and Bank Examination Department), Bank of Japan (E-mail: [email protected])


1. Introduction

The hedonic quality adjustment method is one of the quality adjustment methods for the

price index.1 As the price index indicates ‘pure’ price changes of products over time, it is

essential to adjust for differences in quality between old and new products in response to

the renewal of products in the market. In the hedonic approach, based on the assumption

that the quality of a product can be represented by the accumulation of its individual

characteristics, we decompose the difference in the observed prices between old and new

products into a quality change and a pure price change using the regression model which

estimates the relationship between characteristics and prices. The hedonic quality

adjustment method has two main advantages: 1) it can objectively evaluate the quality

changes of products using data and statistical methods rather than the subjective

judgement by the authorities; and 2) even if there are various changes in characteristics

of products, it can comprehensively evaluate the effects of these changes on product

prices. Therefore, the hedonic approach has been applied to the compilation of the

consumer price index and the producer price index in many countries.

However, there are some issues in applying the hedonic quality adjustment method

in practice.2 First, in the regression model, if the characteristics of the products are highly

correlated, the problem of multicollinearity on the explanatory variables is likely to occur,

and the estimated parameters for the variables may become unstable. In addition, the

parameters of the variables included in the regression model can be biased due to the

omitted variables when it is difficult to obtain all the characteristic data of the products.

Furthermore, considering that the relationship between characteristics and prices is not

always linear, in the hedonic approach we often estimate non-linear models. However, it

is known that the problems of multicollinearity and omitted variable bias can be more

serious as the functional form for the model becomes more complex.3

Although these issues of the hedonic quality adjustment method have long been known, practically sufficient solutions have not been available until now. Therefore, in this study, we attempt to overcome these problems by improving the estimation method. Specifically, we propose a method to deal with the problems of multicollinearity and omitted variable bias using ‘sparse estimation’ as an estimation method for the hedonic regression model.

1 For a representative study of the hedonic approach, see Shiratsuka (1998).
2 For the practical issues of the hedonic approach, see Triplett (2006).
3 See Cropper et al. (1988) for details.

Sparse estimation has gone through a process of improvement in statistics over a long

time, and is often used in many academic fields, such as machine learning, in recent years.

True to its meaning, ‘sparse’ estimation selects only meaningful explanatory variables

from a large number of candidates, and estimates the parameters of the other variables to

be exactly zero. Because of this property, in comparison with the conventional estimation

method used in the hedonic model—for example, the ordinary least squares (OLS)—the

new method with sparse estimation has the advantage that it can automatically select

variables in the model. In particular, among sparse estimation methods, the adaptive

elastic net (AEN), which is used in the new estimation method proposed in this study, is

superior in that it has two desirable properties (Zou and Zhang (2009)): the ‘grouped effect’ that enhances robustness against multicollinearity, and the ‘oracle property’ that ensures

appropriate variable selection and asymptotically unbiased estimators. In this paper, we

show that these properties of the AEN can help to solve the above-mentioned problems

of the hedonic approach. Although there have been empirical analyses using AEN in various fields in recent years, to our knowledge there is no previous study applying AEN to the hedonic regression model.
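As a rough illustration only: scikit-learn has no adaptive elastic net estimator, but the adaptive step can be approximated with the standard reweighting trick (fit an elastic net, build weights |β̂|^γ, rescale the features, refit, and map the coefficients back). This sketch is not Zou and Zhang's (2009) exact AEN procedure, and all tuning values and data below are placeholder assumptions.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

def adaptive_elastic_net(X, y, alpha=0.1, l1_ratio=0.5, gamma=1.0):
    """Two-stage sketch of the adaptive elastic net: fit an ordinary
    elastic net, build adaptive weights |beta|^gamma, rescale the
    features by those weights, refit, and map the coefficients back.
    (An approximation of Zou and Zhang (2009); their estimator also
    rescales the final coefficients by an additional factor.)"""
    Xs = StandardScaler().fit_transform(X)
    first = ElasticNet(alpha=alpha, l1_ratio=l1_ratio).fit(Xs, y)
    w = np.abs(first.coef_) ** gamma + 1e-8    # avoid exact zeros
    second = ElasticNet(alpha=alpha, l1_ratio=l1_ratio).fit(Xs * w, y)
    return second.coef_ * w                    # coefficients in the scaled space

# Hypothetical use: columns of X are product characteristics, y is log price.
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 10))
y = X[:, 0] * 2.0 - X[:, 3] + rng.normal(scale=0.1, size=200)
print(np.round(adaptive_elastic_net(X, y), 2))
```

Scaling a feature by its initial coefficient magnitude is equivalent to penalizing its coefficient with weight |β̂|^(−γ), so weakly supported variables are penalized harder and pruned, which is the mechanism behind the oracle-type variable selection.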

The results of the analysis in this paper are as follows. We perform an empirical

analysis applying the new method to passenger car prices in the Corporate Goods Price

Index (CGPI) in Japan, which mostly corresponds to the producer price index, compiled

by the Research and Statistics Department of the Bank of Japan. As a result, compared

with the conventional estimation method used in the hedonic approach for the CGPI, the

new method using AEN brings the following benefits. First, the number of variables incorporated into the regression model increases significantly, and

this leads to an expansion of the characteristics that can be taken into account in the

quality adjustment. Second, the fit of the regression model improves not only for the

sample prices during the estimation period, but also after the estimation. In addition, we


confirmed that the estimated parameters are more stable under changes in the estimation period than those of the conventional method. Third, when we examine the effect of the change in the

estimation method on the actual price index, it is confirmed that the rate of decline of the

price index estimated by the new method becomes more gradual than that of the

conventional method. This fact suggests that the conventional method may overestimate

the quality improvement rate due to the omitted variable bias, while the new estimation method could solve this problem. These results suggest that the new method can

contribute to improvement in the accuracy of the price index while enhancing the

usefulness of the hedonic quality adjustment method.

The rest of this paper is organized as follows. Section 2 provides an overview of the conventional hedonic regression model and its problems. Section 3 explains the new

hedonic quality adjustment method using sparse estimation and its properties. Section 4

shows the results of empirical analysis applying the new method to the producer price of

passenger cars in Japan. Section 5 summarizes the paper.

2. Conventional method and issues

2-1. Conventional method

In this section, we provide an overview of the conventional method used in the hedonic

quality adjustment, taking the CGPI compiled by the Research and Statistics Department

of the Bank of Japan as an example. In the hedonic approach, a regression analysis is

performed using the prices of the products as the dependent variables and the data

representing the characteristics of products as the explanatory variables. Then, the

estimated parameters are applied for the quality adjustment between new and old products.

In the regression procedure, although we have to assume some specific functional form

for the hedonic model, from the perspective of economic theory, it is known that there are

no a priori restrictions on this form.4 Since there are innumerable functional form

4 The hedonic function is theoretically described as an envelope with respect to a bid function for a characteristic through consumer's utility maximization and an offer function derived from producer's profit maximization, in a perfectly competitive market where all characteristics can be selected continuously. Therefore, there are no a priori restrictions on the functional form. See Rosen (1974) for details.


candidates for the estimation, in practice, it is necessary to choose a proper functional

form in terms of goodness of fit and consistency of the estimated parameters, e.g.,

significance, sign, and so on.5 Moreover, it is necessary to consider non-linearity because the relationship between product prices and characteristics is not always linear. To take this non-linearity into account, previous research has proposed using a regression model with the Box-Cox transformation of variables, defined as follows.6

Box-Cox transformation

$$x^{(\lambda)} = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda} & (\lambda \neq 0) \\[4pt] \log x & (\lambda = 0) \end{cases} \quad (1)$$

$\lambda$ in the above is the Box-Cox parameter, a coefficient that determines the degree of non-linearity of the function. The conventional hedonic regression model with Box-Cox transformed terms is as follows.

Conventional hedonic regression model

$$y_i^{(\lambda_0)} = \beta_0 + \sum_{cv=1}^{p_{cv}} \beta_{cv}\, x_{cv,i}^{(\lambda_{cv})} + \sum_{dv=1}^{p_{dv}} \beta_{dv}\, x_{dv,i} \quad (2)$$

$y_i$: theoretical price, $x_{cv,i}$: continuous variable, $x_{dv,i}$: dummy variable,
$\beta_0$: constant term, $\beta_{cv}$: coefficient on a continuous variable,
$\beta_{dv}$: coefficient on a dummy variable,
$\lambda_0$: Box-Cox parameter for the theoretical price, $\lambda_{cv}$: Box-Cox parameter for a continuous variable,
$p_{cv}$: number of continuous variables, $p_{dv}$: number of dummy variables

According to the values of $\lambda_0$ and $\lambda_{cv}$, the above formula is classified as follows:

(a) Log-Linear model, when both the dependent and explanatory variables are log-linear ($\lambda_0 = \lambda_{cv} = 0$)

(b) Semi Log-Linear model, when only the dependent variable is log-linear ($\lambda_0 = 0,\ \lambda_{cv} = 1$)

(c) Linear model, when both the dependent and explanatory variables are linear ($\lambda_0 = \lambda_{cv} = 1$)

(d) Semi Box-Cox model, when only the dependent variable undergoes the Box-Cox transformation ($\lambda_{cv} = 1$)

(e) Double Box-Cox model, when both the dependent and explanatory variables undergo the Box-Cox transformation

These five regression models should be tested when selecting the functional form. It is known that, in hedonic regression, the Double Box-Cox model is selected in many cases as a result of such a test.7

5 Shiratsuka (1997) states that the criteria for function selection in hedonic methods should include goodness of fit and coherence of the parameters as well as value interpretability and estimation burden.
6 For more details on the Box-Cox transformation, see Box and Cox (1964). In addition, Halvorsen and Pollakowski (1981) advocate utilizing the Box-Cox transformation as a general functional form for the hedonic model and performing the likelihood ratio test to select a specific functional form.
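As a concrete illustration of the transformation in (1), the following minimal Python sketch uses scipy with hypothetical data values; setting $\lambda = 0$ reproduces the logarithmic case in (a), while $\lambda = 1$ gives the linear case up to a constant.

    # Minimal sketch of the Box-Cox transformation in (1); the data values are hypothetical.
    import numpy as np
    from scipy.special import boxcox  # returns (x**lmbda - 1) / lmbda, or log(x) when lmbda == 0

    x = np.array([1200.0, 1500.0, 1800.0])  # e.g., hypothetical engine displacements (cc)
    for lmbda in (0.0, 0.5, 1.0):
        print(lmbda, boxcox(x, lmbda))      # lmbda = 0: log x; lmbda = 1: x - 1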

2-2. Issues (i): Multicollinearity

One of the issues that the conventional hedonic estimation method is likely to face is

multicollinearity. Multicollinearity refers to a state in which there is a high

intercorrelation among explanatory variables in a regression model. Multicollinearity

makes it difficult to identify the effects of variables and estimate the parameters accurately.

As a result, the parameters of the variables that are supposed to have an important effect

on the dependent variable become insignificant.

It is known that the hedonic regression model is prone to the problem of

multicollinearity. Taking passenger cars as an example, the total length and the weight of the car body are highly correlated (Chart 1). In

the dataset of items to which the hedonic quality adjustment method is applied, there are

correlations among many variables, which are not limited to those that inevitably arise

from technologically-based relationships such as the example of total length and weight.

This is because companies maintain multiple product lines in different price ranges as a marketing strategy: high-end products are equipped with many functions, while low-end products carry only the minimum necessary. As a result, correlations are likely to occur between variables that do not necessarily have a strong technologically-based relationship, for example, between maximum power output and whether there is a power-controlled seat.8

7 Triplett (2006) mentions that statistical tests are more likely to reject the linear and log-linear models than the Box-Cox model. Actually, most of the hedonic regression models used for quality adjustment on the CGPI in Japan are Double Box-Cox models.

There are two major approaches to deal with multicollinearity. The first is to perform

principal component analysis beforehand and use some of the obtained principal

components as explanatory variables on the regression of hedonic function. In Shiratsuka

(1995), the hedonic regression for passenger cars is performed with the principal

component added as an explanatory variable. It notes that improvements in the coefficient

of determination of the regression equation are marginal and that it is difficult to interpret

the estimated parameters because the effect of each characteristic on the principal components varies over time. On that basis, it concludes that while principal component analysis is useful for identifying important characteristics, using the principal components as explanatory variables is not always an appropriate way to deal with multicollinearity.

Therefore, a second, simpler approach is widely used in practice. This method excludes

one of the correlated variables from the equation (stepwise method). In other words, if an

effect of multicollinearity is suspected from the estimation results, it can be avoided to a

certain extent by reestimating without the variables that may be the cause. However, as

in the passenger car example above, it is not always easy to select the variables properly

under a strong correlation between characteristics. This imposes an inevitable burden of repeating the estimation until a plausible result is obtained.
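To make this stepwise practice concrete, the following Python sketch (with hypothetical data and variable names) detects a highly correlated pair and re-estimates OLS after dropping one of the two.

    # Hypothetical sketch of the stepwise practice: detect a correlated pair, then refit without one variable.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 200
    length = rng.normal(4.5, 0.3, n)                 # hypothetical total length (m)
    weight = 300 * length + rng.normal(0, 20, n)     # body weight (kg), strongly tied to length
    price = 50 * length + 0.1 * weight + rng.normal(0, 5, n)

    print(np.corrcoef(length, weight)[0, 1])         # close to 1: multicollinearity suspected

    X_full = np.column_stack([np.ones(n), length, weight])
    beta_full, *_ = np.linalg.lstsq(X_full, price, rcond=None)   # parameters hard to identify
    X_drop = np.column_stack([np.ones(n), length])               # drop 'weight' and re-estimate
    beta_drop, *_ = np.linalg.lstsq(X_drop, price, rcond=None)
    print(beta_full, beta_drop)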

2-3. Issues (ii): Omitted variable bias

The second issue that conventional hedonic estimation method is likely to face is bias in

the parameters caused by omitted variables. An omitted variable is a variable that is not

included in the regression model, although it is highly relevant to the explained variable.

8 Triplett (2006) distinguishes between “multicollinearity in the universe” due to technologically-based correlation like length and weight and “multicollinearity in the sample” due to the correlation of functions depending on the grades of products.

8

In the hedonic method in price statistics, this is the case when the regression model does

not include characteristics and performance that would have affected the price of the

product.

There are two types of situations in which omitted variables occur in the hedonic

method: (a) the case that occurs at the stage of data set construction; and (b) the case that

occurs as a result of variable selection. In the case of (a), the problem arises from the fact

that some characteristics have an impact on price but cannot be observed. For example, it is inherently difficult to include in the regression model characteristics that we cannot quantify well, such as product design, style, and brand value. We can deal with this problem only partially by using dummy variables that identify the manufacturer as a proxy

variable. Besides, if a new function emerges due to technological innovation, it is

necessary to wait to incorporate variables into the regression model until a product with

that function has penetrated the market to a certain extent. In the case of (b), the problem

arises from the inadvertent inclusion of variables with only a slight impact on prices and the exclusion of variables that truly affect prices, in circumstances where multicollinearity forces us to select only a limited number of variables.

Due to the presence of omitted variables, the parameters of the variables selected in

the estimated regression model are distorted. Whether this distortion causes an upward or a downward bias in the price index is determined by the relative rates of quality improvement of the omitted and employed variables. For

example, if the omitted variable has a significant improvement over the employed

variable, the quality improvement is underestimated, resulting in an upward bias in the

price index. Conversely, if the employed variable with a distorted parameter improves

significantly in quality while the omitted variable improves only slightly, there is a

downward bias in price index as a result of overestimating quality improvement. Triplett

(2006) applies the hedonic method to PC prices and finds that in the presence of omitted

variables, a downward bias of about -0.2% to -1.0% arises in the price index over a five-

month period. He provides the contextual background that employed variables such as

processing speed and memory size may have improved faster than the omitted variables.

Sawyer and So (2018) also estimate how much the rate of price decline of microprocessors derived from hedonic regression differs among possible subsets of the regressors. They show that the rate of price decline (on average over four years) when only

one variable is employed is up to -45.11%, lower than -8.77% when all characteristics are

employed, due to the omitted variable bias. Such previous research suggests that the

presence of an omitted variable in the hedonic regression model may lead to a downward

bias in the price index (an overestimation of the rate of quality improvement).
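The direction of this bias can be verified in a small simulation. The Python sketch below uses entirely hypothetical data (it does not reproduce the datasets of the studies above): omitting one of two correlated regressors loads its effect onto the retained parameter.

    # Hypothetical simulation of omitted variable bias: dropping x2 distorts the coefficient on x1.
    import numpy as np

    rng = np.random.default_rng(1)
    n = 1000
    x1 = rng.normal(0, 1, n)
    x2 = 0.8 * x1 + rng.normal(0, 0.6, n)                 # correlated with x1
    y = 1.0 * x1 + 2.0 * x2 + rng.normal(0, 0.5, n)       # true model uses both variables

    b_full, *_ = np.linalg.lstsq(np.column_stack([x1, x2]), y, rcond=None)
    b_omit, *_ = np.linalg.lstsq(x1[:, None], y, rcond=None)
    print(b_full)   # close to the true values [1.0, 2.0]
    print(b_omit)   # roughly 1.0 + 2.0 * 0.8 = 2.6: the omitted effect is loaded onto x1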

It is known that the omitted variable bias becomes more severe for complex functional forms. Cropper et al. (1988) state that it is appropriate to select simpler functions in the hedonic estimation of real estate prices when there may be omitted variables. Particularly for functional forms with Box-Cox transformed terms, there is a risk of extreme values of the Box-Cox parameters depending on the subset of explanatory variables.9 Adopting a distorted functional form causes 'overfitting': the fit to the dataset used for the estimation is good, but the fit to new products that come after the estimation is poor. Therefore, it is necessary to repeat the estimation, changing the subset of the variables each time, so that the Box-Cox parameters do not take excessively large values. We sometimes observe that the hedonic estimation results change greatly after re-estimation. For example, the Box-Cox parameter of passenger cars (minivans) in the CGPI changed from 3.4 to almost zero, that is, close to the logarithmic form (Chart 2). These changes may suggest parameter instability due to the presence of omitted variables.

2-4. Issues (iii): Interactions between characteristics

An additional issue faced in applying the hedonic model is the issue of ‘interactions’

between characteristics. The hedonic regression model is often performed under the

assumption that the parameters for characteristics are the same among the products, but

in practice, the assumption is not always valid. For example, there may be an interaction

where a quality improvement in one characteristic increases the impact of quality

improvement of another characteristic on price, or products that we treat as the same may, in fact, be classified into more detailed categories.

9 Graves et al. (1988) also estimate the hedonic regression model for real estate values with various subsets of variables and various functional forms. They note that within the functional forms including the Box-Cox transformed terms, the choice of specification greatly affected the estimation results.

To deal with these interactions, it is useful to introduce a cross term for the variables

in the regression. This allows us to capture the situation in which the impact of each

characteristic on price depends on the state of another characteristic. However, it is

difficult in practice to employ all of the cross terms when estimating because there is a

huge number of potential combinations of cross term variables. The more cross terms

employed, the higher the correlation between the explanatory variables, potentially

leading to multicollinearity, and possibly making the parameters more unstable. As a

result, the conventional hedonic regression model has been limited in the employment of

cross terms. However, given the omitted variable bias stated above, there is a risk that the

parameters of the variables and cross terms may be biased when estimating without the

important cross terms. We consider that the issue of interactions has remained unresolved in the hedonic regression, as addressing it faces both multicollinearity and omitted variable bias, as described above.

3. New hedonic quality adjustment method using sparse estimation

3-1. Sparse estimation

This section will explain the hedonic regression model using sparse estimation. Sparse

estimation selects only the meaningful variables from many candidates of explanatory

variables and gives zero coefficients to the rest of the variables (called ‘sparsity’). Sparse

estimation performs variable selection and coefficient estimation at the same time under

sparsity. This method has an advantage over the conventional one using OLS estimation

in which it can automatically derive a stable and well fitted model. Sparse estimation has

been used in various fields of empirical analysis, not only in economics. In this section,

we explain how this type of method is useful in dealing with the issues of hedonic

regression for price statistics (multicollinearity and omitted variable bias).10

10 Sparse estimation is also useful when analyzing observational data; for example, it was used in the world's first black hole imaging by the international project (The Event Horizon Telescope Collaboration (2019)). In addition, in the field of geographic information science, sparse estimation is used in quantifying inter-regional heterogeneity. For example, Jin and Lee (2020) estimate housing prices with a spatial vector autoregression model using sparse estimation, and Wheeler (2009) suggests adopting sparse estimation in a geographically weighted regression model.


Many methods of sparse estimation have been proposed to date, starting with the

"Lasso" (least absolute shrinkage and selection operator) proposed by Tibshirani (1996).

The new estimation method proposed in this study employs an adaptive elastic net (AEN),

which enjoys two desirable properties: the ‘group effect’ that gives robustness for

multicollinearity and the ‘oracle property’ that ensures the adequacy of variable selection

and estimated coefficients.11 To our knowledge, there have been no studies applying AEN

to the hedonic regression model.12 In the following, we will provide an overview of sparse

estimation and the above two properties in turn.

First, we outline how sparsity is satisfied using Lasso, a typical sparse estimation method. Lasso estimates $\boldsymbol{\beta}$ by minimizing the sum of squared errors plus the $L_1$ norm (the sum of absolute values) of $\boldsymbol{\beta}$ as a regularization term.13

Lasso

$$\hat{\boldsymbol{\beta}}(\mathrm{Lasso}) = \underset{\boldsymbol{\beta}}{\operatorname{argmin}} \left\{ |\boldsymbol{Y} - \boldsymbol{X}\boldsymbol{\beta}|^2 + \lambda \sum_{c=1}^{p} |\beta_c| \right\} \quad (3)$$

$\lambda > 0$: regularization parameter (a relatively smaller number of variables is selected if $\lambda$ is large)

A traditional method that can deal with multicollinearity is ridge regression.14 It is the same as Lasso in that it minimizes the sum of squared errors plus a regularization term, but ridge regression uses the $L_2$ norm (the sum of squares) of $\boldsymbol{\beta}$ as the regularization term. This leads to a key difference: Lasso satisfies sparsity, while ridge regression does not.

11 For details of the estimation method and each of the properties, see Zou and Zhang (2009).
12 There are some studies using Lasso, among sparse estimation methods, for the hedonic regression model. For example, Zafar and Himpens (2019) apply Lasso to analyze webscraped laptop prices and characteristics and compare the result with other estimation methods which consider nonlinearity.
13 We centralize the dependent variable and standardize the explanatory variables. That is, for the number of observations $n$, we set $\frac{1}{n}\sum_{i=1}^{n} y_i = 0$, $\frac{1}{n}\sum_{i=1}^{n} x_{c,i} = 0$, and $\frac{1}{n}\sum_{i=1}^{n} x_{c,i}^2 = 1$.

14 For details, see Hoerl and Kennard (1970).

Ridge regression

$$\hat{\boldsymbol{\beta}}(\mathrm{Ridge}) = \underset{\boldsymbol{\beta}}{\operatorname{argmin}} \left\{ |\boldsymbol{Y} - \boldsymbol{X}\boldsymbol{\beta}|^2 + \lambda \sum_{c=1}^{p} \beta_c^2 \right\} \quad (4)$$

$\lambda > 0$: regularization parameter (coefficients are estimated to be smaller if $\lambda$ is large)
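This contrast between (3) and (4) can be reproduced with standard libraries. The Python sketch below uses hypothetical data: Lasso sets some coefficients exactly to zero, while ridge regression only shrinks them.

    # Hypothetical illustration: Lasso yields exact zeros (sparsity); ridge regression only shrinks.
    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.default_rng(2)
    n, p = 200, 10
    X = rng.normal(0, 1, (n, p))
    beta_true = np.array([3.0, -2.0] + [0.0] * (p - 2))   # only the first two variables matter
    y = X @ beta_true + rng.normal(0, 0.5, n)

    lasso = Lasso(alpha=0.1).fit(X, y)
    ridge = Ridge(alpha=1.0).fit(X, y)
    print(np.round(lasso.coef_, 2))   # trailing coefficients are exactly 0
    print(np.round(ridge.coef_, 2))   # trailing coefficients are small but non-zero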

Chart 3 shows intuitively how the difference in regularization terms does or does not yield sparsity, in line with the discussion in Tibshirani (1996). If there are two variables, from the Karush-Kuhn-Tucker condition, $\hat{\boldsymbol{\beta}}(\mathrm{Lasso})$ and $\hat{\boldsymbol{\beta}}(\mathrm{Ridge})$ can be transformed into the formulas described in Chart 3. On the plane spanned by the $\beta_1$-axis and the $\beta_2$-axis, the sum of squared errors traces an ellipse centered on $\hat{\boldsymbol{\beta}}(\mathrm{OLS})$. The constraint corresponding to each regularization term is a rhombus for Lasso and a circle for ridge regression. Under these conditions, $\boldsymbol{\beta}$ is derived from the tangent point of the sum of squared errors (ellipse) and the constraint (rhombus or circle). For Lasso, the constraint is a rhombus, and the two regions are likely to touch at a corner; that is, a corner solution is likely to be selected. In this case, one of the parameters is estimated to be exactly zero and the variable is selected automatically. Ridge regression, by contrast, is not prone to automatic variable selection: because the constraint region is a circle, the ellipse and the constraint are unlikely to touch at any particular point, making it unlikely that a parameter will be estimated at exactly zero.

3-2. Group effect

For Lasso, the results of variable selection are known to be unstable in data with strong multicollinearity. For example, suppose that the true values of the parameters for two variables are $\beta_1^*$ and $\beta_2^*$. As an extreme example, if the values of these two variables are exactly the same, then the solution to the optimization by Lasso is not uniquely determined because there are innumerable solutions as follows.

$$\hat{\boldsymbol{\beta}}(\mathrm{Lasso}) = \begin{pmatrix} s\,(\beta_1^* + \beta_2^*) \\ (1-s)\,(\beta_1^* + \beta_2^*) \end{pmatrix} \quad \text{for any } s \in [0,1] \quad (5)$$


Similarly, when there are two highly correlated variables, Lasso's variable selection is greatly affected by slight changes in the data, and the set of variables entering the regression is not stable.

Since the variables in a hedonic regression model are often highly correlated, it is necessary to adopt a sparse estimation method that is robust under the multicollinearity conditions described above. One typical sparse estimation method with this property is the elastic net (EN). EN estimates $\boldsymbol{\beta}$ by minimizing the sum of squared errors plus both the $L_2$ norm and the $L_1$ norm of $\boldsymbol{\beta}$ as regularization terms.15 This enables EN to combine the advantages of both Lasso and ridge regression: variable selection and robustness to multicollinearity.

Elastic Net (EN)

$$\hat{\boldsymbol{\beta}}(\mathrm{EN}) = \left(1 + \frac{\lambda_2}{n}\right) \left[ \underset{\boldsymbol{\beta}}{\operatorname{argmin}} \left\{ |\boldsymbol{Y} - \boldsymbol{X}\boldsymbol{\beta}|^2 + \lambda_2 \sum_{c=1}^{p} \beta_c^2 + \lambda_1 \sum_{c=1}^{p} |\beta_c| \right\} \right] \quad (6)$$

$\lambda_2 > 0$: $L_2$ norm regularization parameter, $\lambda_1 > 0$: $L_1$ norm regularization parameter,
$n$: number of observations

The robustness of EN to multicollinearity is called the 'group effect'. The group effect is a property that makes the difference between the coefficients on two variables smaller when the correlation between the variables is higher.16 As an extreme case, if the values of two variables are exactly the same, EN estimates the parameters on those two variables as exactly equal, as follows. This allows for stable variable selection and parameter estimation even in situations where, under multicollinearity, it is difficult to discern from the data which variables surely have an impact on price.

$$\hat{\boldsymbol{\beta}}(\mathrm{EN}) = \begin{pmatrix} \tfrac{1}{2}(\beta_1^* + \beta_2^*) \\[2pt] \tfrac{1}{2}(\beta_1^* + \beta_2^*) \end{pmatrix} \quad (7)$$
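The group effect in (7) can also be observed numerically. In the hedged Python sketch below (hypothetical data; note that scikit-learn parametrizes $\lambda_1$ and $\lambda_2$ jointly through alpha and l1_ratio), duplicating a column leaves the elastic net with two nearly equal coefficients, whereas Lasso may load the whole weight on one copy.

    # Hypothetical check of the group effect: identical regressors get (nearly) equal EN coefficients.
    import numpy as np
    from sklearn.linear_model import ElasticNet, Lasso

    rng = np.random.default_rng(3)
    n = 300
    x = rng.normal(0, 1, n)
    X = np.column_stack([x, x, rng.normal(0, 1, n)])   # the first two columns are identical
    y = 2.0 * x + rng.normal(0, 0.3, n)

    en = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
    lasso = Lasso(alpha=0.1).fit(X, y)
    print(np.round(en.coef_, 2))      # the first two coefficients are roughly equal (group effect)
    print(np.round(lasso.coef_, 2))   # Lasso may assign the weight to only one of the two copies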

15 For details, see Zou and Hastie (2005).
16 To be more specific, the maximum absolute value of the difference between the parameters is directly proportional to $\sqrt{1-\rho}$, where the sample correlation $\rho$ is greater than zero.


3-3. Oracle property

Another property that must be satisfied by the estimator derived from sparse estimation is the 'oracle property'. Specifically, with the true coefficient $\boldsymbol{\beta}^*$, an estimator $\hat{\boldsymbol{\beta}}$ is defined to have the oracle property when it satisfies the following two conditions.

Oracle property

(1) Variable selection consistency

$$\lim_{n \to \infty} P\left(\hat{\beta}_c = 0\right) = 1 \quad \text{with } \beta_c^* = 0$$

(2) Asymptotic normality of the non-zero coefficients

$$\frac{\hat{\beta}_c - \beta_c^*}{\sigma(\hat{\beta}_c)} \xrightarrow{d} N(0,1) \quad (n \to \infty), \quad \text{with } \beta_c^* \neq 0$$

$\sigma^2(\hat{\beta}_c)$: asymptotic variance of the estimator

Of the two conditions above, (1) ‘variable selection consistency’ means that the

estimator of the coefficient satisfies consistency for a variable whose true coefficient is

zero. The ‘asymptotic normality of the non-zero coefficients’ in (2) means that for

variables whose true coefficients are non-zero, the estimation error on those coefficients

follows an asymptotic normal distribution.

The oracle property is an important property that asymptotically guarantees the

appropriateness of both the ‘variable selection’ and the ‘coefficient estimation’ that sparse

estimation simultaneously performs. However, depending on the data, Lasso and EN are known not to satisfy the oracle property, no matter how properly the regularization

parameters are chosen. Therefore, we adopt the following adaptive elastic net (AEN) as

a new estimation method for the hedonic regression model, which satisfies the oracle

property in sparse estimation.

Adaptive elastic net (AEN)

$$\hat{\boldsymbol{\beta}}(\mathrm{AEN}) = \left(1 + \frac{\lambda_2}{n}\right) \left[ \underset{\boldsymbol{\beta}}{\operatorname{argmin}} \left\{ |\boldsymbol{Y} - \boldsymbol{X}\boldsymbol{\beta}|^2 + \lambda_2 \sum_{c=1}^{p} \beta_c^2 + \lambda_1^* \sum_{c=1}^{p} \hat{w}_c\, |\beta_c| \right\} \right] \quad (8)$$

$$\hat{w}_c = \left( \left| \hat{\beta}_c(\mathrm{EN}) \right| \right)^{-\gamma}$$

$\lambda_1^* > 0$: $L_1$ norm regularization parameter (2nd stage),
$\hat{w}_c > 0$: adaptive weight, $\gamma > 0$: adaptive parameter
(a larger $\gamma$ imposes larger penalties corresponding to the absolute value of the coefficient)

The AEN estimation is performed in two stages. At the first stage, we estimate the

coefficients with EN. Then, EN is performed again after the regularization term of the &#x1d43f;&#x1d43f;1

norm is adjusted for each variable to impose greater penalties for variables with small

absolute values of the coefficients.17 This two-step estimation allows us to enjoy the oracle property with almost no dependence on the properties of the dataset.
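Since common libraries do not ship AEN directly, the two-stage procedure can be sketched with the standard rescaling trick for adaptive penalties. The Python sketch below is a simplified stand-in for (8), not the exact implementation of this study: the $(1 + \lambda_2/n)$ rescaling and the intercept handling are omitted, and dividing a column by its weight also weights the $L_2$ term, which deviates slightly from (8).

    # Minimal two-stage AEN sketch via feature rescaling; a simplified stand-in for (8).
    import numpy as np
    from sklearn.linear_model import ElasticNet

    def aen_fit(X, y, alpha1=0.1, alpha2=0.1, gamma=1.0, l1_ratio=0.5, eps=1e-8):
        # Stage 1: a plain elastic net gives preliminary coefficients.
        beta1 = ElasticNet(alpha=alpha1, l1_ratio=l1_ratio).fit(X, y).coef_
        keep = np.abs(beta1) > eps                 # drop zero-estimated variables (cf. footnote 17)
        w = np.abs(beta1[keep]) ** (-gamma)        # adaptive weights w_c = |beta_c(EN)|^(-gamma)
        # Stage 2: elastic net on rescaled columns; dividing x_c by w_c imposes the weighted L1 penalty.
        beta2 = ElasticNet(alpha=alpha2, l1_ratio=l1_ratio).fit(X[:, keep] / w, y).coef_ / w
        out = np.zeros(X.shape[1])
        out[keep] = beta2
        return out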

Chart 4 provides an intuitive explanation of why AEN satisfies the oracle property, referring to the discussion in Zou (2006). Here, we artificially generate the matrix $\boldsymbol{X}$ of the explanatory variables and the vector $\boldsymbol{\varepsilon}$ of the disturbance terms, then calculate the vector $\boldsymbol{Y}$ of the dependent variable based on the true model ($\boldsymbol{Y} = \boldsymbol{X}\boldsymbol{\beta}^* + \boldsymbol{\varepsilon}$). We then check how OLS, Lasso, and AEN estimate $\boldsymbol{\beta}$ when $\boldsymbol{Y}$ and $\boldsymbol{X}$ are the observed values. The true coefficient $\beta^*$ is on the horizontal axis, and the estimates $\hat{\beta}(\mathrm{OLS})$, $\hat{\beta}(\mathrm{Lasso})$, and $\hat{\beta}(\mathrm{AEN})$ are plotted on the vertical axis. First, for Lasso, we see that $\hat{\beta} = 0$ when $|\beta^*| < \lambda$, so it clearly satisfies sparsity. On the other hand, when $|\beta^*| \geq \lambda$, $\hat{\beta}$ is estimated to be smaller in absolute value than the true value $\beta^*$ by $\lambda$. In other words, the regularization parameter $\lambda$ and the conditions of the oracle property are in a trade-off: as $\lambda$ increases, it becomes easier to estimate zero coefficients and satisfy variable selection consistency, while the estimates shrink by $\lambda$ in absolute value, making it difficult to satisfy asymptotic normality for the non-zero coefficients.

In contrast, for AEN, when $|\beta^*|$ is small, a large penalty is imposed based on the small first-stage estimate of the coefficient, and $\hat{\beta} = 0$ is derived. On the other hand, when $|\beta^*|$ is large, $\hat{\beta}$ approaches $\beta^*$ asymptotically due to the lower penalty. Thus, by adjusting the penalties according to the first-stage estimates, $\hat{\beta}$ is likely to be estimated as zero when the coefficient is small, while the shrinkage

17 At the second stage of the EN estimation, we drop the variables whose parameters are estimated to be zero at the first stage.


of the estimates in absolute value is minimized when the coefficient is large. This makes

it easier to satisfy the two conditions of the oracle property.

3-4. Selection of functional form

In the new estimation method proposed in this study, the hedonic function is formulated

as a quadratic polynomial. AEN determines which terms should be included in the

regression model and performs both variable selection and functional form selection at

the same time. Since all cross terms are considered, it is possible to include the interaction effects in the regression model, unlike in the conventional method.18 The reason we limit the degree of the polynomial to two is to prevent the overfitting caused by higher-degree terms, which sometimes occurs in the Box-Cox method.19
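The candidate design matrix for this quadratic polynomial can be generated mechanically. The Python sketch below (hypothetical inputs) builds all linear, squared, and cross terms of degree two, after which linearly dependent columns would be dropped as noted in footnote 18.

    # Hypothetical sketch: build the full quadratic candidate set (linear, squared, and cross terms).
    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(4)
    X = rng.normal(0, 1, (100, 3))                  # e.g., three standardized characteristics
    poly = PolynomialFeatures(degree=2, include_bias=False)
    X_quad = poly.fit_transform(X)                  # 9 candidate terms for 3 inputs
    print(poly.get_feature_names_out())             # ['x0', 'x1', 'x2', 'x0^2', 'x0 x1', ...]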

According to the above, the hedonic regression model with the new method is

estimated as follows.

Hedonic regression model using AEN

$$Y_i \equiv \log y_i$$

$$Y_i = \hat{\beta}_{00} + \sum_{c=1}^{p} \hat{\beta}_{0c}\, x_{c,i} + \sum_{c=1}^{p} \hat{\beta}_{cc}\, x_{c,i}^2 + \sum_{d > c \geq 1} \hat{\beta}_{cd}\, x_{c,i}\, x_{d,i} \quad (9)$$

where

$$\hat{\boldsymbol{\beta}} = \left(1 + \frac{\lambda_2}{n}\right) \left[ \underset{\boldsymbol{\beta}}{\operatorname{argmin}} \left\{ |\boldsymbol{Y} - \boldsymbol{X}\boldsymbol{\beta}|^2 + \lambda_2 \sum_{d \geq c \geq 0} \beta_{cd}^2 + \lambda_1^* \sum_{d \geq c \geq 0} \hat{w}_{cd}\, |\beta_{cd}| \right\} \right]$$

$$\hat{w}_{cd} = \left( \left| \hat{\beta}_{cd}^{\,1st} \right| \right)^{-\gamma}$$

$$\hat{\boldsymbol{\beta}}^{1st} = \left(1 + \frac{\lambda_2}{n}\right) \left[ \underset{\boldsymbol{\beta}}{\operatorname{argmin}} \left\{ |\boldsymbol{Y} - \boldsymbol{X}\boldsymbol{\beta}|^2 + \lambda_2 \sum_{d \geq c \geq 0} \beta_{cd}^2 + \lambda_1 \sum_{d \geq c \geq 0} |\beta_{cd}| \right\} \right]$$

$y_i$: theoretical price, $x_{c,i}$: explanatory variable, $\hat{\beta}_{cd}$: coefficient on $x_{c,i} x_{d,i}$,
$p$: number of candidate explanatory variables, $n$: number of observations,
$\lambda_1 > 0$: $L_1$ norm regularization parameter (1st stage),
$\lambda_1^* > 0$: $L_1$ norm regularization parameter (2nd stage),
$\lambda_2 > 0$: $L_2$ norm regularization parameter,
$\gamma > 0$: adaptive parameter, $\hat{w}_{cd} > 0$: adaptive weight

18 If we calculate the cross terms for all subsets of variables, perfect multicollinearity occurs in most cases. Therefore, in this study, we drop some variables which are linearly dependent on other variables before the estimation.
19 Another issue in the estimation using AEN is the setting of the hyperparameters ($\lambda_1$, $\lambda_1^*$, $\lambda_2$, $\gamma$). From the several methods of setting the parameters mentioned in Zou and Zhang (2009), in this study we select $K$-fold cross validation, which is often used in the field of machine learning. We split the dataset into $K$ groups, take one group as test data and the remaining $K-1$ groups as training data, fit a model on the training data, and evaluate it on the test data. We can evaluate the model by repeating this procedure $K$ times while resampling the groups. When selecting an appropriate value for $K$, we have to pay attention to a trade-off between the bias in the coefficients, which affects estimation accuracy, and the variance due to differences in training data. We choose $K = 10$ in our analysis, which is a commonly used value.
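For reference, the 10-fold cross validation described in footnote 19 can be sketched with standard tooling; the grid values in the Python sketch below are illustrative, not the ones used in this study.

    # Illustrative 10-fold cross validation over elastic-net hyperparameters (cf. footnote 19).
    import numpy as np
    from sklearn.linear_model import ElasticNet
    from sklearn.model_selection import GridSearchCV

    rng = np.random.default_rng(5)
    X = rng.normal(0, 1, (300, 20))
    y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, 300)

    grid = {"alpha": [0.01, 0.1, 1.0], "l1_ratio": [0.2, 0.5, 0.8]}  # hypothetical grid
    search = GridSearchCV(ElasticNet(max_iter=10000), grid, cv=10,
                          scoring="neg_mean_squared_error")
    search.fit(X, y)
    print(search.best_params_)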

4. Empirical analysis using new estimation method

4-1. Dataset for estimation

In this section, we apply the new hedonic regression model using AEN to passenger cars

in Japan and discuss its properties.

We use the same data for estimation as used in the hedonic regression for CGPI in

Japan compiled by the Research and Statistics Department of the Bank of Japan.

Specifically, retail price data are taken from the Goo-net by the PROTO CORPORATION

and average discounts are taken from the Monthly Car Magazine JIKAYOSHA by the

Naigai Publishing Corp. Price data on passenger cars are compiled from the retail prices and average discounts. The period examined is from the 3rd quarter of 2016 to the 2nd quarter of 2018, and the number of observations is 940.

The product specification data are basically taken from the Goo-net as well, but other

important specifications unlisted in the database are taken from the specification sheet of

each passenger car. The characteristics and performance used are shown in Chart 5. The

data contains about 20 continuous variables measuring quantitative characteristics and

about 100 dummy variables measuring qualitative characteristics.20 The large number of

20 This includes vehicle configuration dummy, brand dummy and time dummy besides characteristics.


variables reflects the complicated characteristics of passenger cars, and how to select appropriate variables is particularly challenging for such complicated products. As stated

in the previous section, the method using sparse estimation is superior in that it selects

variables automatically, and this advantage is expected to be especially great when

adjusting the quality of products with many quality characteristics, such as passenger

cars.

4-2. Comparison of new and old estimation results

Here, we show the results of applying the conventional estimation method and the new

method using AEN. First, the result of the conventional estimation method is shown in

Chart 6. As explained in Section 2, the conventional hedonic regression model is

performed with the Box-Cox transformed term and the double Box-Cox model is selected

based on the results of likelihood ratio test. Among the explanatory variables, only room

space, fuel efficiency × equivalent inertia weight, and maximum output were selected for

continuous variables. Note that here we employ dummy variables for each vehicle

configuration to account for the difference of the impact of characteristics on the price.

For example, the room space was not significant for sedans and wagons but was

significant only for minivans. Dummy variables were significant for powertrain (e.g.,

4WD, RWD), interior and exterior equipment (e.g., leather seats, LED headlamps), and brand (dummies for each automaker).

Next, the results from the new hedonic method using AEN are shown in Chart 7. As

mentioned earlier, sparse estimation, such as AEN, can estimate with a large number of

explanatory variables and perform both the ‘variable selection’ and the ‘coefficient

estimation’ simultaneously. In this study, we limit the order of non-linearity in the

equation to second and employ many cross terms to account for the presence of

interactions between variables. As a result, compared to the conventional method, a large

number of variables are employed and many cross terms are captured in the regression


model.21,22

However, the regression models in the conventional and new methods have very different functional forms, and the parameters derived from the estimation cannot be

simply compared. Therefore, we calculate the contribution of each variable to the

theoretical price as follows and compare the results.

$$\pi_l^{func} = \frac{y^{func}(\bar{x}_l + \Delta x_l,\ \bar{\boldsymbol{x}}_{-l}) - y^{func}(\bar{x}_l,\ \bar{\boldsymbol{x}}_{-l})}{y^{func}(\bar{x}_l,\ \bar{\boldsymbol{x}}_{-l})} \times 100 \quad (10)$$

$$\log y^{AEN}(\bar{x}_l,\ \bar{\boldsymbol{x}}_{-l}) = \hat{\beta}_{00} + \sum_{c=1}^{p} \hat{\beta}_{0c}\, \bar{x}_c + \sum_{c=1}^{p} \hat{\beta}_{cc}\, \bar{x}_c^2 + \sum_{d > c \geq 1} \hat{\beta}_{cd}\, \bar{x}_c \bar{x}_d \quad (11)$$

$$y^{Box\text{-}Cox}(\bar{x}_l,\ \bar{\boldsymbol{x}}_{-l})^{(\lambda_0)} = \hat{\beta}_0 + \sum_{cv=1}^{p_{cv}} \hat{\beta}_{cv}\, \bar{x}_{cv}^{(\lambda_{cv})} + \sum_{dv=1}^{p_{dv}} \hat{\beta}_{dv}\, \bar{x}_{dv} \quad (12)$$

$\pi_l^{func}$: contribution rate of $x_l$ for $func = AEN$ or $Box\text{-}Cox$,
$\bar{x}_l$: average of explanatory variable $l$,
$\Delta x_l$: standard deviation for a continuous variable, or 1 for a dummy variable $l$
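Computationally, the contribution rate in (10) is a simple perturbation of one variable around the sample means. The Python sketch below illustrates this with a hypothetical fitted log-price function standing in for (11).

    # Hypothetical sketch of the contribution rate in (10) for a log-price model such as (11).
    import numpy as np

    def contribution_rate(predict_log_price, x_bar, l, dx_l):
        # predict_log_price: a fitted function returning log(y) for a characteristics vector (assumed).
        x_pert = x_bar.copy()
        x_pert[l] += dx_l                            # one s.d. (continuous) or one unit (dummy)
        y0 = np.exp(predict_log_price(x_bar))
        y1 = np.exp(predict_log_price(x_pert))
        return (y1 - y0) / y0 * 100.0

    # Toy usage with a hypothetical fitted model:
    beta0, beta = 0.5, np.array([0.3, 0.1])
    f = lambda x: beta0 + beta @ x                   # stand-in for the AEN polynomial in (11)
    print(contribution_rate(f, np.array([1.0, 2.0]), l=0, dx_l=0.5))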

Chart 8 shows the estimation results of the contribution to passenger car prices of

the continuous and dummy variables employed in Chart 6 and 7 (including cross terms).

Specifically, we show the rate of change in the theoretical price $\pi_l^{func}$ due to a one standard deviation increase in a continuous variable or a one unit increase in a dummy variable, for a hypothetical sample in which all variables are set at their mean values over the sample period.

First, with the continuous variables, more variables are employed in the new method

than in the conventional one. The number of adopted variables increased from just two in

the conventional method to nine in the new method. This can be interpreted as improved

applicability of the hedonic quality adjustment using the new method. Next, with the

21 Some papers consider residual bootstrapping method to test the significance of estimators in AEN, for example Chatterjee and Lahiri (2013), but there is still no consensus. 22 For AEN estimation, the same number of variables as degrees of freedom can be employed in the model at most. However, if the number of samples is large enough and the degree of freedom is high, we can avoid extremely complex model and calculation burden by restricting the number of employed variables. In this study, we confine the maximum number of variables to 140 for 939 degrees of freedom, although we confirm that the improvement in fit is limited at even larger number of variables.


dummy variables, the number of the variables employed in the new method also increased

significantly compared to the conventional method. Notably, the new method captures more dummy variables measuring characteristics than the conventional method, while the contribution of brand dummies to price shrinks. This means that quality which was previously captured as a manufacturer-specific factor can now be captured as specific product characteristics by individual variables. In practice, the quality adjustment for passenger cars is often applied to cases of model changes occurring within the same brand. If the theoretical price is estimated mainly on brand dummies, there is no room to apply the quality adjustment as long as the manufacturer does not change, even though quality improvement does in fact occur through the model change. Therefore, not relying on brand dummies may provide significant benefits in the quality adjustment.

So far we have compared the estimated parameters. To compare the performance of the new method with that of the conventional method, we also need to compare the fit of the regression models. To make this visible, Chart 9 shows the mean squared error of the two regression models calculated for products released in each quarter, using the recent dataset. With the new method, the error is reduced in every quarter compared to the conventional method, confirming that the estimation accuracy is improved. In particular, the new method reduces the error not only during the estimation period but also for samples after the estimation period. Since we usually apply the quality adjustment to products that appear in the market after the estimation period, this improvement in out-of-sample fit is important.

In addition, when applying the hedonic method in practice, it is necessary to periodically re-estimate the regression model. We also confirmed that the estimation results of the new method change only modestly when the sample period of the dataset is shifted (see the Appendix for details). Such an enhancement in the time-stability of the estimation results may also improve the applicability of the hedonic quality adjustment.

4-3. Impact of new estimation on the price index

Here we see how the introduction of the new hedonic regression model using AEN affects


the price index. Specifically, we estimate how the price index would have changed if the old and new hedonic quality adjustments had been applied to all the sample price replacements (the replacement of the surveyed product due to the end of life of the old product or a change in the representativeness of products in the market) that occurred after 2017 for the PPI "Standard Passenger Cars (Gasoline Cars)".

We then compare the results of the new and old methods with the officially released

index of PPI to examine whether the results are plausible. In practice in the compilation

of the CGPI, even if the hedonic quality adjustment method is available for a product, the

Research and Statistics Department of the Bank of Japan would choose the most

appropriate method, based mainly on a plausibility check of an estimated quality

improvement with a surveyed company and a comparison with estimates of other quality adjustment methods such as the production cost method. In other words, if applying the conventional hedonic quality adjustment is not judged to be appropriate from a practical viewpoint, it is not applied. Therefore, comparing the price index

when the new method is applied to all the sample price replacements with the released

index of PPI, which is compiled based on practical judgement, we can generally assess

whether the new method accurately estimates the rate of quality improvement.
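As background on how such an adjustment enters the index at a sample price replacement, a common textbook formulation divides the observed price relative by the estimated quality ratio of the theoretical prices; the Python sketch below is a stylized illustration, not necessarily the exact compilation procedure of the CGPI.

    # Stylized sketch of a hedonic quality adjustment at a sample price replacement.
    def quality_adjusted_relative(p_new, p_old, y_hat_new, y_hat_old):
        quality_ratio = y_hat_new / y_hat_old        # quality ratio implied by the theoretical prices
        return (p_new / quality_ratio) / p_old       # pure price change net of quality improvement

    # Example: the new model costs 5% more, but the hedonic model implies 8% higher quality,
    # so the quality-adjusted price relative falls below one.
    print(quality_adjusted_relative(p_new=2.10e6, p_old=2.00e6, y_hat_new=1.08, y_hat_old=1.00))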

Chart 10 shows the results of the calculation. The dashed line in the chart is an

estimated price index when the conventional hedonic quality adjustment is applied to all

the sample price replacements. We can see a somewhat larger decline in the price index under the conventional hedonic regression model. On the other hand, for the solid line, where the new hedonic method using AEN is applied, the price index shows a more gradual decline than under the conventional method. Thus, the difference in the estimation

results between the old and new methods indicates a quantitatively non-negligible impact

on the price index of passenger cars.

Chart 10 also shows the officially released index of PPI as a dotted line, and the trend

is more similar to the new method with the AEN than to the conventional method. The

results show that if the old and new hedonic quality adjustments are applied to all the sample price replacements, using the old method would risk overestimating the rate of quality improvement, resulting in an excessive decline in the price index, whereas the new


method may be able to estimate the rate of quality improvement more accurately in

general.

These results are consistent with the findings of the previous studies on omitted variable bias described in Section 2. In the conventional method, the limited number of explanatory variables due to multicollinearity is more likely to cause omitted variable bias, which leads to distortions in the parameters of the variables in the hedonic model and appears to overestimate the rate of quality improvement. On the other hand, in the

AEN estimation, the increase in the number of explanatory variables is likely to reduce

omitted variable bias, and the small distortion of the parameters results in an accurate

calculation of the quality improvement rate. This is reflected in the differences in the price

index.

5. Final Remarks

In this study, we survey the issues of the hedonic regression model and then explain the

details of the new estimation method using sparse estimation and its results. The new

estimation method proposed in this study employs an adaptive elastic net (AEN), which enjoys two desirable properties: the 'group effect' that gives robustness to multicollinearity and the 'oracle property' that ensures the adequacy of variable selection and the asymptotic unbiasedness of coefficients. It has the potential to overcome the practical issues of the hedonic regression model. In fact, the empirical analysis of passenger car

prices in Japan in this study shows that the new method using AEN brings improvements in terms of: 1) a significant increase in the number of adopted variables; 2) an improvement in fit; and 3) a reduction of the omitted variable bias. In particular, when the new estimation method is applied instead of the conventional one, the price index of passenger cars shows a more moderate decline, and the method reduces the risk of overestimating the quality improvement rate due to the omitted variable bias present in the conventional method. This change is expected to make the hedonic quality adjustment more accurate and improve its applicability when sample price replacements occur.

Section 1, the hedonic method has strengths in evaluating quality objectively based on

data and statistical methods, and it can accommodate even a large number of changes in characteristics between new and old products. The increased usability of the hedonic

regression model with these strengths is expected to make the price index more accurate.

In this study, we used passenger cars as an example, but the method proposed is

based on the versatile approaches of 'sparse estimation' and 'polynomial regression', which are also applicable to other products. In applying the hedonic regression

approach, we have to gather the data and construct the model for regression, considering

the characteristics of each product sufficiently. The issues pointed out in this study are

generally common to all products, and the new method which intends to overcome such

issues could improve the performance of hedonic methods in a variety of products. Also,

because of the versatile approach, we can flexibly customize the method corresponding

to advances in statistical methods research and practical requirements. For example,

whether the estimation accuracy and parameter stability can be improved by applying

more advanced sparse estimation, or whether the generalization performance can be

further enhanced by using more advanced cross-validation methods in hyperparameter

setting, are some of the remaining issues. In addition, if the estimation accuracy required

in practice is not always high, an alternative approach that emphasizes interpretability for

the hedonic model can be fully envisioned while maintaining the framework of the new

estimation method. For example, we can select a simpler functional form or variable composition by setting a lower ceiling on the number of variables employed in the model, and we can limit the number of variables entering cross terms from the outset.

This study focuses on dealing with the issues of multicollinearity and omitted variable bias by applying sparse estimation to the hedonic regression model; however, there are a number of other issues surrounding the hedonic approach. For example, the method of gathering the dataset is an important issue that is also related to omitted variable bias. As the adage 'garbage in, garbage out' suggests, it is important to maintain the quality of the dataset for estimation by accurately grasping the technological innovation of the products and adopting variables related to new characteristics as necessary. In the field of the hedonic approach, how to utilize recent advanced information processing technology, such as big data analysis, in gathering the dataset is

also under study.23 The use of large datasets is expected to become easier in the future. Under these circumstances, the estimation method proposed in this study is highly efficient, as it can automatically construct a well-performing model by extracting the necessary information even from a large dataset. We expect further

utilization of the new method proposed in this study for empirical research and statistical

practice in the future.

23 For the research on hedonic regression model with the web scraping data, see Zafar and Himpens (2019) or Efthymiou and Antoniou (2013).


References

Bonaldi, P., Hortaçsu, A., and Kastl, J., "An Empirical Analysis of Funding Costs

Spillovers in the Euro-Zone with Application to Systemic Risk," NBER Working

Paper, No. 21462, National Bureau of Economic Research, 2015.

Box, G. E. P. and Cox, D. R., "An Analysis of Transformations," Journal of the Royal

Statistical Society Series B, Vol. 26, pp. 211-252, 1964.

Chatterjee, A. and Lahiri, S. N., "Rates of Convergence of the Adaptive LASSO

Estimators to the Oracle Distribution and Higher Order Refinements by the

Bootstrap," The Annals of Statistics, Vol. 41(3), pp. 1232-1259, 2013.

Cropper, M., Deck, L. B., and McConnell, K. E., "On the Choice of Functional Form for

Hedonic Price Functions," The Review of Economics and Statistics, Vol. 70(4), pp.

668-675, 1988.

Efthymiou, D. and Antoniou, C., "How Do Transport Infrastructure and Policies Affect

House Prices and Rents? Evidence from Athens, Greece," Transportation

Research Part A, Vol. 52, pp. 1-22, 2013.

The Event Horizon Telescope Collaboration, "First M87 Event Horizon Telescope Results.

IV. Imaging the Central Supermassive Black Hole," The Astrophysical Journal

Letters, Vol. 875(1), 2019.

Graves, P., Murdoch, J. C., Thayer, M. A., and Waldman, D., "The Robustness of Hedonic

Price Estimation: Urban Air Quality," Land Economics, Vol. 64(3), pp. 220-233,

1988.

Halvorsen, R. and Pollakowski, H. O., "Choice of Functional Form for Hedonic Price

Equations," Journal of Urban Economics, Vol. 10(1), pp. 37-49, 1981.

Hirakata, N., "The Time Variation of the Hedonic Regression Model and Its Effect on the

Price Index: A Case of Personal Computers in Japan," Bank of Japan Working

Paper Series, No. 05-J-1, 2005 (in Japanese).

Hoerl, A. E. and Kennard, R. W., "Ridge Regression: Biased Estimation for

Nonorthogonal Problems," Technometrics, Vol. 12, pp. 55-67, 1970.


Jin, C., and Lee, G., "Exploring spatiotemporal dynamics in a housing market using the

spatial vector autoregressive Lasso: A case study of Seoul, Korea," Transactions

in GIS, Vol. 24(1), pp. 27-43, 2020.

Pakes, A., "A Reconsideration of Hedonic Price Indexes with an Application to PC's,"

American Economic Review, Vol. 93(5), pp. 1578-1596, 2003.

Rosen, S., "Hedonic Prices and Implicit Markets: Product Differentiation in Pure

Competition," Journal of Political Economy, Vol. 82(1), pp. 34-55, 1974.

Sawyer, S. D. and So, A., "A New Approach for Quality-Adjusting PPI Microprocessors,"

Monthly Labor Review, Bureau of Labor Statistics, 2018.

Shiratsuka, S., "Automobile Prices and Quality Changes: A Hedonic Price Analysis of

Japanese Automobile Market," Monetary and Economic Studies, Vol. 13(2), pp.

1-44, 1995.

Shiratsuka, S., "Measuring Quality Changes using Hedonic Approach: Theoretical

framework and its application to empirical research," IMES discussion paper

series, No. 97-J-6, Bank of Japan, 1997 (in Japanese).

Shiratsuka, S., An Economic Analysis of Pricing, University of Tokyo Press, 1998 (in Japanese).

Tibshirani, R., "Regression Shrinkage and Selection via the Lasso," Journal of the Royal

Statistical Society Series B, Vol. 58, pp. 267-288, 1996.

Triplett, J. E., Handbook on Hedonic Indexes and Quality Adjustments in Price Indexes:

Special Application to Information Technology Products, OECD Publishing, 2006.

Wheeler, D. C., "Simultaneous coefficient penalization and model selection in

geographically weighted regression: the geographically weighted LASSO,"

Environment and Planning A, Vol. 41, pp. 722–742, 2009.

Zafar, J. D. and Himpens, S., "Webscraping Laptop Prices to Estimate Hedonic Models

and Extensions to Other Predictive Methods," presented at the 16th meeting of the

Ottawa Group on Price Indices, Rio de Janeiro, 2019.

Zou, H., "The Adaptive Lasso and Its Oracle Properties," Journal of the American


Statistical Association, Vol. 101, pp. 1418-1429, 2006.

Zou, H. and Hastie, T., "Regularization and Variable Selection via the Elastic Net,"

Journal of the Royal Statistical Society Series B, Vol. 67, pp. 301-320, 2005.

Zou, H. and Zhang, H. H., "On the Adaptive Elastic-Net with a Diverging Number of

Parameters," The Annals of Statistics, Vol. 37(4), pp. 1733-1751, 2009.

28

Appendix: Time-stability of the hedonic regression model

It is widely known that the hedonic regression model is unstable, because the relationship between characteristics and prices may change over time under the influence of technological advances, changes in consumer preferences, and other factors. For example, Pakes (2003), estimating a regression model for personal computers, points out that when the price of a microprocessor falls significantly due to technological innovation, the equation for personal computers equipped with that microprocessor may also change, and shows that the estimated parameters can in fact change significantly. In order to measure these changes properly, it is necessary to re-estimate the regression model periodically and to adapt flexibly to changes in the functional form and the subset of variables.

Here, as in the main text, we analyze the stability of the parameters by running the regression on different sample periods. This analysis is conducted for passenger cars, with a sample one year older than that used in the main text (the period examined is from the 3rd quarter of 2015 to the 2nd quarter of 2017, and the number of observations is 1,188). The result of the conventional estimation method is shown in Appendix Chart 1, and that of the new method using AEN is shown in Appendix Chart 2. In the following, we estimate how much the change in the sample period affects the estimation results for the old and new methods.

First, we summarize how the parameters change when the sample is one year older. We calculate how much the contribution of each variable (calculated by the same procedure as in Chart 8) changes due to the replacement with the older sample. The results for both the old and new methods are shown in Appendix Chart 3.24

The difference in the contribution rate of variable $l$:

$$\pi_l^{func,\,OLD} - \pi_l^{func,\,NEW} \tag{A1}$$

$$\pi_l^{func,\,smpl} = \frac{y^{func,\,smpl}\!\left(\bar{x}_l + \Delta x_l,\ \bar{\boldsymbol{x}}_{-l}\right) - y^{func,\,smpl}\!\left(\bar{x}_l,\ \bar{\boldsymbol{x}}_{-l}\right)}{y^{func,\,smpl}\!\left(\bar{x}_l,\ \bar{\boldsymbol{x}}_{-l}\right)} \times 100 \tag{A2}$$

$\pi_l^{func,\,smpl}$: contribution rate of $x_l$ (func = AEN or Box-Cox, smpl = NEW or OLD)
$y^{func,\,smpl}$: theoretical price (func = AEN or Box-Cox, smpl = NEW or OLD)

24 For variables adopted in only one of the two models, the contribution rate in the other model is taken as zero.
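To make (A1)-(A2) concrete, here is a minimal Python sketch of the calculation; the `predict_price` functions, the toy log-linear form, and all coefficient values are illustrative assumptions, not the paper's estimated models.

```python
import numpy as np

def contribution_rate(predict_price, x_bar, l, delta_l):
    """Equation (A2): % change in the theoretical price when variable l
    moves from its sample mean by delta_l, others held at their means."""
    x_plus = x_bar.copy()
    x_plus[l] += delta_l
    base = predict_price(x_bar)
    return (predict_price(x_plus) - base) / base * 100.0

# Toy stand-ins for the models fitted on the NEW and OLD samples.
rng = np.random.default_rng(0)
beta_new = rng.normal(0.0, 0.05, 5)
beta_old = beta_new + rng.normal(0.0, 0.01, 5)   # slightly different fit
x_bar = np.ones(5)
f_new = lambda x: np.exp(12.0 + beta_new @ x)
f_old = lambda x: np.exp(12.0 + beta_old @ x)

for l in range(5):
    pi_new = contribution_rate(f_new, x_bar, l, delta_l=1.0)
    pi_old = contribution_rate(f_old, x_bar, l, delta_l=1.0)
    # Equation (A1): difference in contribution rate between sample periods
    print(f"variable {l}: {pi_old - pi_new:+.2f} % points")
```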


As shown in Appendix Chart 3, the differences in parameters between datasets fall within ±5 percentage points for the new method using AEN, but within ±20 percentage points for the old method. This result suggests that each parameter is more stable under the new method than under the old one.

However, the stability of the parameters for each variable does not immediately imply stability of the quality adjustment results. Even if the changes in individual parameters are small, the rate of quality change (the rate of change in the theoretical price) can be large when the parameter changes share the same sign. The opposite can also hold: where variables are correlated, a parameter increase for some variables may well be offset by a parameter decrease for others. Therefore, when evaluating the stability of the quality adjustment results, we have to pay attention to how the rate of quality change implied by the theoretical prices differs. Based on these points, Hirakata (2005) estimates the

regression model for desktop computers in several functional forms, and then compares

the results of quality adjustment using regression models with different sample periods to

analyze how the price index can change. In the following, we follow Hirakata's (2005)

method and compare the stability of the rate of quality change between the old and new

methods, taking a sample price replacement in a passenger car as an example.

First, for the data set used in Chart 9, we extract samples classified as sedans or wagons and group them by release date quarterly.25 Then, we build a hypothetical sample product in which all variables are set at the mean value over the group for each quarter. Finally, we calculate the quality improvement rate due to the sample price replacement between these hypothetical products for the new and conventional methods.26, 27 By comparing the quality improvement rates obtained in this way between the original sample and the old sample, it is possible to comprehensively consider the impact of

changes in the regression model on the price index, taking into account the correlation between variables.

25 The period examined is from the 3rd quarter of 2016 to the 2nd quarter of 2019. The 2nd quarter of 2017, the 1st quarter of 2018 and the 1st quarter of 2019 are not subject to the estimation because there are no sedans or wagons in those quarters. 26 When calculating the quality change rate, the brand dummy and time dummy are fixed to a single value, not the average. 27 In total, 144 (= 36 (9C2) × 2 methods (AEN or Box-Cox) × 2 datasets (NEW or OLD)) different quality improvement rates are calculated.

$$\Pi^{func,\,smpl} = \frac{y^{func,\,smpl}\!\left(\bar{\boldsymbol{x}}'\right) - y^{func,\,smpl}\!\left(\bar{\boldsymbol{x}}\right)}{y^{func,\,smpl}\!\left(\bar{\boldsymbol{x}}\right)} \times 100 \tag{A3}$$

$\Pi^{func,\,smpl}$: the rate of quality change (func = AEN or Box-Cox, smpl = NEW or OLD)
$y^{func,\,smpl}$: theoretical price (func = AEN or Box-Cox, smpl = NEW or OLD)
$\bar{\boldsymbol{x}}$: specification of the hypothetical sample (before the model change)
$\bar{\boldsymbol{x}}'$: specification of the hypothetical sample (after the model change)

In Appendix Chart 4, we compare the rate of quality change for a hypothetical model change between the old and new methods, where the horizontal axis shows the estimate with the original sample, $\Pi^{func,\,NEW}$, and the vertical axis shows that with the older sample, $\Pi^{func,\,OLD}$. The scatter plots for the new method are roughly distributed around the diagonal, while those for the conventional method lie far from it. This suggests that the deviation in the rate of quality change caused by the change in the estimation period is smaller when the new method is applied. In order to evaluate this point quantitatively, the deviation (absolute value) of the rate of quality change with the change in the estimation period, calculated as follows, is shown for the new and conventional methods in Appendix Chart 5.

$$\text{Deviation in the rate of quality change} = \left| \Pi^{func,\,OLD} - \Pi^{func,\,NEW} \right| \tag{A4}$$

$$\Pi^{func,\,smpl} = \frac{y^{func,\,smpl}\!\left(\bar{\boldsymbol{x}}'\right) - y^{func,\,smpl}\!\left(\bar{\boldsymbol{x}}\right)}{y^{func,\,smpl}\!\left(\bar{\boldsymbol{x}}\right)} \times 100 \tag{A5}$$

(notation as in (A3))
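A matching sketch for (A3)-(A5), again with invented coefficients standing in for the models estimated on the original and the one-year-older samples:

```python
import numpy as np

def quality_change_rate(predict_price, x_before, x_after):
    """Equations (A3)/(A5): % change in the theoretical price between two
    hypothetical products with quarter-average specifications."""
    base = predict_price(x_before)
    return (predict_price(x_after) - base) / base * 100.0

beta_new = np.array([0.020, -0.010, 0.030])   # illustrative only
beta_old = np.array([0.025, -0.012, 0.028])
f_new = lambda x: np.exp(12.0 + beta_new @ x)
f_old = lambda x: np.exp(12.0 + beta_old @ x)

x_before = np.array([1.0, 2.0, 0.5])   # average specification, launch quarter
x_after = np.array([1.2, 1.8, 0.7])    # average specification, later quarter

pi_new = quality_change_rate(f_new, x_before, x_after)
pi_old = quality_change_rate(f_old, x_before, x_after)
print(f"deviation (A4): {abs(pi_old - pi_new):.2f} % points")
```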

The deviation of the quality improvement rates under the new method is roughly half that of the conventional method on average over the entire period. This indicates that the application of AEN has increased the stability of the estimation results. It is also consistent with the finding, confirmed in the main text, that the new method fits well out of sample as well as in sample. This suggests that, under the new method, the estimation error in the quality improvement rate tends to remain relatively small, even when the relationship between price and characteristics changes over time.

In addition, in applying the quality adjustment, the first priority is to evaluate whether product quality improves or deteriorates. In this regard, Appendix Chart 4 shows that, for some model-change samples (circled in red), the sign of the quality change rate under the conventional method clearly differs depending on the estimation period. This suggests that, as the regression model becomes obsolete over time, the hedonic model may wrongly indicate that quality has deteriorated (improved) even though it has in fact improved (deteriorated). Under the new method, on the other hand, there are few cases where the sign of the quality change rate reverses between the estimation models. This improvement from the introduction of AEN could also increase the applicability of hedonic quality adjustment.


Chart 1

Correlation coefficients of variables for passenger cars

SC L W H WT FE MO MT RS NG

SC 1.000

L 0.420 1.000

W 0.223 0.849 1.000

H 0.787 0.317 0.251 1.000

WT 0.521 0.885 0.844 0.553 1.000

FE -0.252 -0.501 -0.618 -0.268 -0.563 1.000

MO 0.017 0.628 0.696 0.018 0.665 -0.661 1.000

MT 0.008 0.599 0.711 -0.006 0.618 -0.535 0.812 1.000

RS -0.051 0.625 0.785 0.004 0.584 -0.489 0.645 0.711 1.000

NG -0.058 0.198 0.152 -0.081 0.213 -0.080 0.409 0.319 0.159 1.000

SC: Seating Capacity, L: Length, W: Width, H: Height, WT: Weight, FE: Fuel Efficiency,

MO: Maximum Output, MT: Maximum Torque, RS: Rim Size, NG: Number of Gears


Chart 2

Change in the functional form (Maximum output of Minivans)

[Two scatter plots of samples and fitted values for the maximum output of minivans (ps) against price (10 thous. yen): as of the October 2017 estimate the Box-Cox parameter is ≈3.4 and the fitted values rise rapidly out of sample; after re-estimation in October 2018 the Box-Cox parameter is ≈0.0 (almost log), a significant change in the functional form, and the fit deviates from the new samples.]

Chart 3

Schema of sparse estimation

Lasso:
$$\operatorname*{argmin}_{\beta_1,\beta_2} \sum_{i=1}^{n} \left( Y_i - \beta_1 X_{1,i} - \beta_2 X_{2,i} \right)^2 \quad \text{s.t.} \quad \left|\beta_1\right| + \left|\beta_2\right| \le s$$

Ridge regression:
$$\operatorname*{argmin}_{\beta_1,\beta_2} \sum_{i=1}^{n} \left( Y_i - \beta_1 X_{1,i} - \beta_2 X_{2,i} \right)^2 \quad \text{s.t.} \quad \beta_1^2 + \beta_2^2 \le s^2$$

$s > 0$: in one-to-one correspondence with $\lambda$

[Diagrams in the $(\beta_1, \beta_2)$ plane: the Lasso constraint (a rhombus of size $s$) intersects the error contours around $\hat{\boldsymbol{\beta}}^{(OLS)}$ at a corner, placing $\hat{\boldsymbol{\beta}}^{(Lasso)}$ on an axis, while the Ridge constraint (a circle) places $\hat{\boldsymbol{\beta}}^{(Ridge)}$ off the axes.]

Chart 4

Statistical properties of AEN

Notes: 1. The estimated $\hat\beta$ are plotted for each estimation method using artificial data, where $X$ is a $120{,}200 \times 601$ design matrix, $\beta_i^* = -3 + 0.01i$ $(i = 0,\dots,600)$, and $\boldsymbol{Y} = \boldsymbol{X}\boldsymbol{\beta}^* + \boldsymbol{\varepsilon}$ $(\boldsymbol{\varepsilon} \sim N(\boldsymbol{0}, \boldsymbol{I}))$. The mean is set to 0 and the standard deviation to 1 for all columns of $X$.
2. $\lambda = 0.5$ for Lasso, and $\lambda_1 = \lambda_1^* = 0.2$, $\lambda_2 = 0.001$, $\gamma = 0.5$ for AEN.

[Scatter plot of $\hat\beta$ against $\beta^*$ over $[-3, 3]$ for OLS, Lasso, and AEN, with the Lasso thresholds marked at $\pm\lambda$.]

Chart 5

Candidate variables for passenger car

Seating Capacity (person) ETC Rear View Camera

Length (mm) Navigation System Side View Camera

Width (mm) DVD Player Front View Camera

Height (mm) Blu-ray Player Surround View Camera

Weight (kg) AM/FM Radio AFS

Wheelbase (mm) USB Input Hill Start Assist

Minimum Turning Radius (m) No Idling Cold Climate Version

Fuel Efficiency (JC08 mode, km/l) Full Auto Air Conditioner Rain Sensor

Fuel Tank Capacity (l) Dual Zone Air Conditioner Anti-Theft System

Maximum Output (ps) Front Dual Zone Air Conditioner Sedan

Maximum Torque (kg∙m) Driver's Seat Heater Wagon

Number of Cylinders (#) Driving Position Memory System Coupe

Total Displacement (cc) Split-Folding Rear Seat Convertible

Rim Size (inch) Front Power Seat Minivan

Tire Width (mm) Passenger's Power Seat SUV

Tire Flatness (%) Rear Power Seat Hatchback

Number of Gears (#) Leather Seat Domestic Car A

Indoor Space (m3) Leather Steering Domestic Car B

Diesel Telescopic Steering Device Domestic Car C

Hybrid Steering Controller Domestic Car D

Plug-In Hybrid Wood Panel Domestic Car E

Unleaded Premium Gasoline Aluminum Wheel Domestic Car F

Turbo LED Headlamp Domestic Car G

Supercharger LED Fog Lamp Domestic Car H

Twin-Turbo Front Fog Lamp Domestic Car I

Flat Engine Rear Fog Lamp Imported Car A

FF Xenon Headlamp Imported Car B

FR Projector Headlamp Imported Car C

Full-Time 4WD LSD Imported Car D

Part-Time 4WD Cruise Control 2016Q3

AT ACC 2016Q4

MT ACC (No speed limitation) 2017Q1

CVT Clearance Sonar 2017Q2

Front Spoiler LDWS 2017Q3

Rear Spoiler LKAS 2017Q4

Rear Window Wiper Traction Control 2018Q1

Sunroof Unintended Start Prevention 2018Q2

Glasstop AEBS

Privacy Glass Brake Assist

Side Airbag Parking Assist

List of Candidate Variables

(Column categories: Continuous Variables; Spec Dummy Variables; Body-Type Dummy Variables; Manufacturer Dummy Variables; Time Dummy Variables)

Chart 6

Estimation result with conventional method

Estimated Model Box-Cox Parameter of Dependent Variable -0.280 Intercept 3,472.763 ***

Sedans & Station Wagons -- Box-Cox Parameter -- Minivans 1.360E-05 ***

Box-Cox Parameter 3.400 Sedans & Station Wagons 2.543E-09 ***

Box-Cox Parameter 1.372 Minivans 1.606E-09 ***

Box-Cox Parameter 1.455 SUVs 6.841E-09 ***

Box-Cox Parameter 1.330 Hatchbacks 7.152E-18 ***

Box-Cox Parameter 3.351 Sedans & Station Wagons 2.846E-04 ***

Box-Cox Parameter 0.647 Minivans 0.007 ***

Box-Cox Parameter 6.240E-06 SUVs 5.880E-06 ***

Box-Cox Parameter 1.337 Hatchbacks 0.008 ***

Box-Cox Parameter 3.621E-06 Dummy Variables

Car Configuration Minivans -1,162.565 ***

SUVs 0.006 ***

Hatchbacks -2,306.764 ***

Motor Hybrid Vehicles -- Plug-in Hybrid Electric Vehicles --

Powertrain AWD (Full time or Part time) 0.002 ***

FR (Front-engine, rear-wheel-drive) 0.002 ***

Standard Equipment Leather Seats 0.001 ***

Side Airbags 4.504E-04 **

Power Seats 0.002 ***

Aluminum Wheel 0.002 ***

LED Headlamp 0.001 ***

Privacy Glass -- Limited Slip Differential (LSD) 0.002 ***

Advanced Emergency Braking System (AEBS) -- Adaptive Cruise Control (ACC) -- Adaptive Cruise Control (ACC) <No speed limitation> 0.001 ***

Lane Departure Warning System (LDWS) 0.001 ***

Adaptive Front-Lighting System (AFS) 0.001 ***

Parking Assist 0.001 ***

Brand Brand A -0.002 ***

Brand B -0.003 ***

Brand C -- Brand D -- Brand E -0.001 ***

Brand F 0.004 ***

Brand G 0.003 ***

Brand H -- Brand I 0.006 ***

Brand J 0.008 ***

Brand K 0.006 ***

R-squared 0.957 Adjusted R-squared 0.956 Standard Error of Regression 0.002 Mean of Dependent Variable 3.509

1,155 (from 3Q 2016 to 2Q 2018)

Tests for Double Box-Cox Model (H1: Double Box-Cox)

H0: Semi Box-Cox (λi=1) 85.560 ***

H0: Log Linear (λ0=λi=0) 273.705 ***

H0: Semi Log Linear (λ0=0,λi=1) 130.257 ***

H0: Linear (λ0=λi=1) 1,905.192 ***

Source: Bank of Japan. Notes: 1. The equivalent inertia weight of a vehicle is measured as its curb weight plus an additional 110 kg set on the chassis dynamometer while measuring its fuel efficiency under the JC08 emission test cycle. 2. In addition to the explanatory variables listed above, the model includes release period dummy variables.

Number of Observations (release period)

Double Box-Cox Model

Room Space (㎥)

Fuel Efficiency JC08 (km/l) ×Equivalent Inertia Weight (kg)

Horsepower (PS)


Chart 7

Estimation result with new method

Notes: The sample period is from the 3rd quarter of 2016 to the 2nd quarter of 2018. Volume=Length×Width×Height.

Population Density=seating capacity÷(Length×Width).

Hyperparameters

λ1 0.013

λ1 * 1.970E-05

λ2 1.000E-05

γ 0.5

Explanatory Variables

Constant 12.939

Imported Car C 0.244

Imported Car A 0.204

Supercharger 0.119

Navigation System 0.142

Rear Power Seat 0.053

Aluminum Wheel 0.032

LDWS 0.007

Blu-ray Player 0.057

Population Density (person/m2) -0.055

Curb Weight(kg)/Volume(m3) 0.001

Rim Size (inch) 0.001

FF -0.025

Curb Weight(kg)/Volume(m3)×: quadratic term 2.550E-05

Length(m)×Width(m)×: quadratic term 0.009

Maximum Output (ps)/Weight(kg)×: quadratic term 4.854

Domestic Car G×2017Q4 -0.003

Domestic Car G×Front Fog Lamp -0.034

Domestic Car G×Front Spoiler -0.056

Domestic Car E×2018Q2 0.248

Domestic Car E×Leather Seat 0.018

Domestic Car E×AFS 0.032

Domestic Car E×Maximum Output (ps)/Weight(kg) 0.865

Domestic Car D×2016Q4 0.350

Domestic Car D×2017Q3 -0.012

Domestic Car D×Hatchback -0.003

Domestic Car D×Minivan -0.135

Domestic Car D×Height (mm) -4.036E-05

Imported Car B×Maximum Torque (kg∙m) 0.003

Imported Car B×Full Auto Air Conditioner 0.117

Imported Car B×Rim Size (inch) 0.005

Domestic Car F×2017Q3 -0.130

Domestic Car F×Hybrid -0.061

Domestic Car C×CVT -0.135

Domestic Car B×CVT -0.053

Domestic Car B×Xenon Headlamp 0.088

Imported Car A×2017Q3 0.073

2016Q4×Maximum Torque (kg∙m) -0.003

Explanatory Variables

2016Q4×Leather Seat -0.031

2017Q3×Height (mm) -1.943E-05

2017Q4×CVT -0.038

2018Q1×Front Fog Lamp 0.005

2018Q2×FF -0.037

Coupe×Maximum Output (ps)/Weight(kg) 0.223

Hatchback×Maximum Output (ps)/Weight(kg) -0.315

Height (mm)×Hybrid 4.580E-05

Height (mm)×Front Fog Lamp 1.029E-05

Height (mm)×Dual Zone Air Conditioner 3.248E-05

Height (mm)×Sunroof 1.827E-06

Height (mm)×Driver's Seat Heater 4.776E-06

Height (mm)×Driving Position Memory System 2.691E-05

Height (mm)×ACC (No speed limitation) 1.741E-05

Height (mm)×Curb Weight(kg)/Volume(m3) 2.412E-06

Fuel Efficiency (JC08 mode, km/l)×Maximum Torque (kg∙m) 1.906E-05

Fuel Efficiency (JC08 mode, km/l)×Front Spoiler 0.001

Fuel Efficiency (JC08 mode, km/l)×Navigation System 2.649E-04

Fuel Efficiency (JC08 mode, km/l)×Dual Zone Air Conditioner 0.001

Fuel Efficiency (JC08 mode, km/l)×Leather Steering 0.001

Fuel Efficiency (JC08 mode, km/l)×Sunroof 9.915E-05

Fuel Efficiency (JC08 mode, km/l)×Driver's Seat Heater 0.001

Maximum Torque (kg∙m)×Rear Spoiler 0.001

Maximum Torque (kg∙m)×Maximum Output (ps)/Weight(kg) 2.818E-04

Unleaded Premium Gasoline×Front Spoiler 0.024

Unleaded Premium Gasoline×Rear Spoiler 0.002

MT×LED Headlamp 0.019

Number of Gears (#)×Rim Size (inch) 0.001

LSD×Leather Steering 0.015

LSD×Leather Seat 0.049

Cruise Control×Curb Weight(kg)/Volume(m3) 3.507E-04

Leather Seat×Length(m)×Width(m) 0.004

LED Headlamp×Curb Weight(kg)/Volume(m3) 4.970E-04


Chart 8

Contribution of variables to the theoretical price

1. Continuous variables

2. Dummy variables

Notes: In addition to the variables listed above, the model includes dummy variables for car configuration and release period.

[Bar charts of the contribution of each variable to the theoretical price (%), comparing the new method (AEN) with the conventional method. Panel 1 covers continuous variables (Maximum Output; Fuel Efficiency × Equivalent Inertia Weight; Curb Weight / Volume; Area; Height; Population Density; Power-to-Weight Ratio; Maximum Torque (kg·m); Fuel Efficiency; Rim Size; Number of Gears). Panel 2 covers dummy variables, including drive type (FR, 4WD, FF, MT, CVT), brands (Domestic Cars A-G, Imported Cars A-C), and equipment (Side Airbags, Power Seat, Parking Assist, Leather Seat, Aluminium Wheel, LED Headlamp, LSD, ACC (no speed limitation), LDWS, AFS, Hybrid Vehicles, Unleaded Premium Gasoline, Supercharger, spoilers, sunroof, navigation, Blu-ray, air conditioning, seat heater, Driving Position Memory System, leather steering, fog lamps, Xenon headlamp, cruise control).]

Chart 9

Comparison of fit between old and new methods

[Line chart of MSE by release period (16Q3-19Q2) for the new method (AEN) and the conventional method, with the in-sample period distinguished from the out-of-sample period.]

Chart 10

Estimated price index by old and new methods PPI “Standard Passenger Cars (Gasoline Cars)”

[Line chart (CY2015=100, Jan. 2017-Jul. 2019): the estimated index adjusted by the conventional method, the estimated index adjusted by the new method (AEN), and the officially released index.]

Appendix Chart 1

Estimation result with conventional method(old sample)

Estimated Model Box-Cox Parameter of Dependent Variable 0.150 Intercept 1,664.131 ***

Sedans & Station Wagons 2.433 ***

Box-Cox Parameter 1.066 Minivans 0.039 ***

Box-Cox Parameter 2.770 Sedans & Station Wagons 9.512E-08 ***

Box-Cox Parameter 1.637 Minivans 1.754E-09 ***

Box-Cox Parameter 2.097 SUVs 3.147 ***

Box-Cox Parameter 0.002 Hatchbacks 1.951E-26 ***

Box-Cox Parameter 5.773 Sedans & Station Wagons 5.993 ***

Box-Cox Parameter 0.003 Minivans 3.825E-07 ***

Box-Cox Parameter 3.384 SUVs 3.243 ***

Box-Cox Parameter 0.040 Hatchbacks 4.408 ***

Box-Cox Parameter 0.018 Dummy Variables

Car Configuration Minivans 2,275.621 ***

SUVs 819.239 ***

Hatchbacks 2,019.161 ***

Motor Hybrid Vehicles 0.393 ***

Plug-in Hybrid Electric Vehicles 2.137 ***

Powertrain AWD (Full time or Part time) 0.846 ***

FR (Front-engine, rear-wheel-drive) -- Standard Equipment

Leather Seats 1.003 ***

Side Airbags 0.559 ***

Power Seats 0.869 ***

Aluminum Wheel -- LED Headlamp -- Privacy Glass 0.782 ***

Limited Slip Differential (LSD) 0.630 ***

Advanced Emergency Braking System (AEBS) 0.358 ***

Adaptive Cruise Control (ACC) 0.405 ***

Adaptive Cruise Control (ACC) <No speed limitation> -- Lane Departure Warning System (LDWS) 0.184 **

Adaptive Front-Lighting System (AFS) 0.624 ***

Parking Assist -- Brand

Brand A -1.557 ***

Brand B -1.353 ***

Brand C -0.523 ***

Brand D -1.803 ***

Brand E -1.237 ***

Brand F 2.648 ***

Brand G -0.896 ***

Brand H -0.611 ***

Brand I 2.987 ***

Brand J 4.723 ***

Brand K 3.825 ***

R-squared 0.962 Adjusted R-squared 0.961 Standard Error of Regression 0.792 Mean of Dependent Variable 52.810

994 (from 3Q 2015 to 2Q 2017)

Tests for Double Box-Cox Model (H1: Double Box-Cox)

H0: Semi Box-Cox (λi=1) 220.310 ***

H0: Log Linear (λ0=λi=0) 158.589 ***

H0: Semi Log Linear (λ0=0,λi=1) 238.038 ***

H0: Linear (λ0=λi=1) 641.781 ***

Source: Bank of Japan. Notes: 1. The equivalent inertia weight of a vehicle is measured as its curb weight plus an additional 110 kg set on the chassis dynamometer while measuring its fuel efficiency under the JC08 emission test cycle. 2. In addition to the explanatory variables listed above, the model includes release period dummy variables.

Number of Observations (release period)

Double Box-Cox Model

Room Space (㎥)

Fuel Efficiency JC08 (km/l) ×Equivalent Inertia Weight (kg)

Horsepower (PS)


Appendix Chart 2

Estimation result with new method(old sample)

Notes: The sample period is from the 3rd quarter of 2015 to the 2nd quarter of 2017. Volume=Length×Width×Height.

Population Density=seating capacity÷(Length×Width).

Hyperparameters

λ1 0.017

λ1 * 2.068E-05

λ2 0.010

γ 0.5

Explanatory Variables

Constant 12.879

Population Density (person/m2) -0.192

Curb Weight(kg)/Volume(m3) 0.002

Length(m)×Width(m) 0.032

Navigation System 0.039

Leather Seat 1.432E-05

Rear Power Seat 0.091

Length(m)×Width(m)×: quadratic term 0.002

Imported Car C×Height (mm) 1.123E-05

Imported Car C×Maximum Output (ps)/Weight(kg) 2.093

Domestic Car G×CVT -0.025

Domestic Car G×Aluminum Wheel -0.067

Domestic Car E×Maximum Output (ps)/Weight(kg) 1.211

Domestic Car E×Driver's Seat Heater 0.014

Domestic Car D×2016Q3 -0.034

Domestic Car D×2016Q4 0.203

Domestic Car D×Minivan -0.019

Domestic Car D×Population Density (person/m2) -0.114

Imported Car B×Maximum Output (ps)/Weight(kg) 1.016

Imported Car B×Maximum Torque (kg∙m) 0.002

Imported Car B×Full Auto Air Conditioner 0.137

Imported Car B×Wood Panel 0.018

Imported Car B×Privacy Glass 0.011

Domestic Car F×2016Q1 -0.068

Domestic Car F×Fuel Efficiency (JC08 mode, km/l) -0.001

Domestic Car C×CVT -0.146

Domestic Car B×2015Q4 -0.124

Domestic Car B×Hatchback -0.074

Imported Car A×Leather Steering 0.145

2016Q3×Cruise Control 0.009

2016Q3×Wood Panel 0.012

2016Q3×ETC 0.056

2016Q3×Driver's Seat Heater 0.019

Hatchback×Population Density (person/m2) -0.047

Minivan×Hybrid 0.043

Minivan×Driver's Seat Heater 0.029

Population Density (person/m2)×Front Dual Zone Air Conditioner 0.027

Curb Weight(kg)/Volume(m3)×Length(m)×Width(m) 2.605E-04

Explanatory Variables

Curb Weight(kg)/Volume(m3)×Height (mm) 2.436E-06

Curb Weight(kg)/Volume(m3)×Rim Size (inch) 9.914E-05

Curb Weight(kg)/Volume(m3)×Telescopic Steering Device 1.396E-04

Curb Weight(kg)/Volume(m3)×Anti-Theft System 3.299E-04

Curb Weight(kg)/Volume(m3)×Blu-ray Player 3.530E-04

Length(m)×Width(m)×Rim Size (inch) 0.002

Height (mm)×Leather Seat 3.468E-05

Height (mm)×LED Headlamp 3.315E-05

Height (mm)×LDWS 4.677E-06

Fuel Efficiency (JC08 mode, km/l)×Hybrid 0.001

Fuel Efficiency (JC08 mode, km/l)×Navigation System 0.002

Fuel Efficiency (JC08 mode, km/l)×Dual Zone Air Conditioner 0.002

Fuel Efficiency (JC08 mode, km/l)×Cruise Control 0.001

Fuel Efficiency (JC08 mode, km/l)×Leather Steering 0.001

Fuel Efficiency (JC08 mode, km/l)×Front Power Seat 0.001

Fuel Efficiency (JC08 mode, km/l)×Driver's Seat Heater 0.001

Hybrid×Rain Sensor 0.013

Maximum Output (ps)/Weight(kg)×Maximum Torque (kg∙m) 0.007

Maximum Output (ps)/Weight(kg)×Unleaded Premium Gasoline 0.167

Maximum Output (ps)/Weight(kg)×FF -0.312

Maximum Output (ps)/Weight(kg)×Navigation System 0.065

Maximum Output (ps)/Weight(kg)×Dual Zone Air Conditioner 0.143

Maximum Output (ps)/Weight(kg)×Wood Panel 0.268

Maximum Output (ps)/Weight(kg)×Aluminum Wheel 0.591

Maximum Output (ps)/Weight(kg)×Side Airbag 0.193

Maximum Output (ps)/Weight(kg)×Anti-Theft System 0.009

Maximum Output (ps)/Weight(kg)×Blu-ray Player 0.010

Front Fog Lamp×AEBS 0.014

Aluminum Wheel×AEBS 0.006


Appendix Chart 3

Change in contribution rate of each variable between sample periods

Notes: 1. We build hypothetical sample prices where all variables are set at the mean value for sedans and wagons. For these samples, we calculate the rate of change in the theoretical price due to a one-standard-deviation increase in continuous variables, or a one-unit increase in dummy variables, with the regression models derived from the original and the older dataset. Here, we show the difference between the contribution rates for each variable from the original and older datasets. Dummy variables for body type and release period are not subject to the calculation.
2. The numbers of variables subject to the calculation are 55 for the new method (AEN) and 31 for the conventional method. For variables adopted in only one model, the contribution rate in the other model is taken as zero.

3. In the box plots, values below "the first quartile − 1.5 × the interquartile range" or above "the third quartile + 1.5 × the interquartile range" are plotted as outliers.

[Box plots of the change in contribution rate between sample periods, in percentage points, for each method.]

Appendix Chart 4

Change in quality change rate between sample periods

Notes: Assume that a model change occurs from a hypothetical product with average specifications that is launched in one

quarter to a hypothetical product built in the same way for another quarter. We calculate the rate of quality change (the

rate of change in theoretical price) with the regression models derived from each original and older dataset and plot

the combinations.

[Two scatter plots (both axes from −60% to 120%): the rate of quality change from the new method with original samples against that with older samples, and likewise for the conventional method.]

Appendix Chart 5

Deviation in quality change rate between sample periods

Notes: Assume that a model change occurs from a hypothetical product with average specifications that is launched in one

quarter to a hypothetical product built in the same way for another quarter. We calculate the rate of quality change (the

rate of change in theoretical price) with the regression models derived from each original and older dataset and list the

absolute value of the difference.

New method (AEN) (Old model \ New model)

16Q4 17Q1 17Q3 17Q4 18Q2 18Q3 18Q4 19Q2

16Q3 4.6 1.8 3.3 4.3 0.1 3.8 6.6 2.9

16Q4 - 5.3 6.2 7.0 3.9 2.1 9.5 6.1

17Q1 - - 2.3 3.6 2.6 8.4 6.3 1.6

17Q3 - - - 1.7 6.9 15.8 4.9 1.2

17Q4 - - - - 9.8 20.3 3.1 3.3

18Q2 - - - - - 4.1 7.5 3.3

18Q3 - - - - - - 6.5 3.8

18Q4 - - - - - - - 6.2

Average: 5.3

Conventional method (Old model \ New model)

16Q4 17Q1 17Q3 17Q4 18Q2 18Q3 18Q4 19Q2

16Q3 4.6 8.3 9.2 7.8 6.8 1.2 6.8 20.4

16Q4 - 11.6 11.9 10.7 2.5 6.2 9.8 23.6

17Q1 - - 2.4 0.7 18.8 12.0 1.0 12.7

17Q3 - - - 1.9 25.6 18.5 4.1 11.2

17Q4 - - - - 22.8 15.2 2.0 13.6

18Q2 - - - - - 9.9 12.6 27.8

18Q3 - - - - - - 4.9 16.1

18Q4 - - - - - - - 15.3

Average: 10.8

Revision of 2020-Base CPI Weights under the COVID-19 Pandemic, Japan


NISHIJO Nanami, SHIBATA Takuya

Revision of 2020-Base CPI Weights under the COVID-19 Pandemic

Introduction

 The CPI in Japan is calculated using the fixed-base method (Laspeyres formula), and the base period of the index is revised every five years, in years with 0 or 5 as the final digit.

 This poster introduces the series of processes from the considerations made to determine new weights under the conditions of the COVID-19 pandemic to the validation of the current weights in the interim year.

1. Estimation of New Weights with 3 Methods for the 2020-Base Revision

 The 2020-base weights would normally be computed using the results of the 2020 Family Income and Expenditure Survey, etc.

 However, the consumption structure of households changed significantly due to the outbreak of COVID-19 infections in 2019, and it became necessary to take this impact into account in the 2020-base revision.

 In this revision, we tried the following three methods to estimate new weights (per 10,000) for the 10 major groups:
(1) Consumption expenditures in 2020 (the conventional method);
(2) Average of consumption expenditures for multiple years (2019 and 2020); and
(3) Outlier-processed consumption expenditures in 2020*
*Processed mechanically using the pre-adjustment function in the seasonal adjustment software X-12-ARIMA

2. Validation of the Current Weights in 2022 for the 2020-Base CPI

 In 2022, an interim-year review of the 2020-base CPI was conducted from the following two perspectives:
- To examine goods and services showing rapid growth or decline for reflection in the index items
- To compare the current fixed-base Laspeyres index, which uses average weights for 2019 and 2020, with the chain Laspeyres index

2.1 Examination of Goods and Services Showing Rapid Growth or Decline in Consumption for Reflection in the Index Items

 In Japan, the items used in the CPI are revised in the base revision, and 582 items are used for the 2020-base calculation.

 Goods and services showing rapid growth or decline in consumption are quickly reflected in the index as needed in the interim year, before the next base revision.

 Examination of data from the Family Income and Expenditure Survey, which is used to determine the weights, showed that all of the items with a large increase or decrease in expenditure had been affected by the unusual circumstances caused by the COVID-19 pandemic.

 Since it was not certain whether their growth or decline would continue going forward, it was decided not to add or eliminate index items in the interim review.

2.2 Comparing the Fixed-Base Laspeyres Index (Current Index) and the Chain Laspeyres Index

 There was no major difference between the fixed-base index and the chain index when comparing trends in YoY changes in both indices (the maximum difference is ±0.2 percentage point in YoY changes after 2021).

 Given that it is unclear at this time how the impact of the COVID-19 pandemic and the spending structure of households will change in the future, it was concluded to continue the use of the current weights in calculating the fixed-base index for the 2020-base CPI.

Conclusion

 The current weights, calculated using the average consumption expenditures of households in 2019 and 2020, were examined for their validity in the interim year as described above and found to be unproblematic.

 As a result, the decision was made to continue the use of the current weights until the 2025-base revision (scheduled to be conducted in 2026).

Table 3. Goods and services showing decrease in consumption

Income and expenditure item (corresponding index item) | 2018 | 2019 | 2020 | 2021
1. Charges for package tours to overseas | 46.3 | 47.9 | 6.8 | 0.2
2. Neckties | 1.4 | 1.2 | 1.0 | 0.8
3. Women's stockings | 1.6 | 1.5 | 1.1 | 0.9
4. Suitcases | 2.9 | 2.6 | 1.1 | 0.9
Note: Values are ratios per 10,000. (Source) Yearly average results for households of two persons or more in the Family Income and Expenditure Survey

Table 2. Goods and services showing increase in consumption

Items showing rapid increase in consumption (values are ratios per 10,000 in the order 2018 → 2019 → 2020 → 2021; the values of items with asterisks were not calculated in 2019)
1. Not adopted for the 2020-base index items: Wet wipes* (1.5 → __ → 3.8 → 2.8); Sterilization and disinfection solution (for household use)* (0.8 → __ → 2.6 → 1.2); Hand sanitizer* (0.2 → __ → 3.9 → 1.2); Medical thermometer* (0.3 → __ → 1.6 → 0.9)
2. Adopted for the 2020-base index items: "Chu-hi", liquor with soda & fruit, & cocktail (9.6 → 10.7 → 14.9 → 15.5); Masks* (3.2 → __ → 26.5 → 16.4); Game software etc. (4.5 → 4.5 → 7.1 → 6.6)
(Source) Yearly average results for households of two persons or more in the Family Income and Expenditure Survey

Table 1. Estimation of weights for the 10 major groups (per 10,000)

Group | Released values, 2019 | (1) 2020 | (2) Avg. for 2019 and 2020 | (3) Outlier-processed (values in parentheses are percentage changes relative to 2019)
Food | 2,628 | 2,711 (3.2) | 2,670 (1.6) | 2,734 (4.0)
Housing | 2,012 | 2,102 (4.5) | 2,051 (1.9) | 2,085 (3.6)
Fuel, light & water charges | 689 | 712 (3.3) | 699 (1.5) | 706 (2.5)
Furniture & household utensils | 373 | 420 (12.6) | 396 (6.2) | 406 (8.8)
Clothes & footwear | 376 | 319 (−15.2) | 348 (−7.4) | 337 (−10.4)
Medical care | 461 | 485 (5.2) | 472 (2.4) | 480 (4.1)
Transportation & communication | 1,547 | 1,476 (−4.6) | 1,516 (−2.0) | 1,466 (−5.2)
Education | 302 | 316 (4.6) | 311 (3.0) | 310 (2.6)
Culture & recreation | 995 | 863 (−13.3) | 926 (−6.9) | 879 (−11.7)
Miscellaneous | 617 | 595 (−3.6) | 610 (−1.1) | 596 (−3.4)

When experts and economists in Japan were asked for their opinions regarding these estimations, the method “(2) Average for 2019 and 2020” received the most approvals. It was therefore adopted, and the 2020-base revision including this weighting was implemented in August 2021.

[Results of the Estimation] “(1) 2020” showed sharp fluctuations compared to the previous year for many classifications, but in “(2) Average for 2019 and 2020” and “(3) Outlier-processed,” the estimated weights were generally adjusted to mitigate the fluctuations in (1).
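A minimal sketch of the arithmetic behind method (2): average two years of group expenditures and renormalize to a per-10,000 basis. The expenditure levels below are invented and only three groups are shown.

```python
# Method (2) mechanics: per-10,000 weights from the 2019-2020 average.
# The expenditure figures are made up for illustration.
expenditure_2019 = {"Food": 80_461, "Housing": 61_600, "All other groups": 164_100}
expenditure_2020 = {"Food": 79_700, "Housing": 61_800, "All other groups": 152_500}

avg = {g: (expenditure_2019[g] + expenditure_2020[g]) / 2 for g in expenditure_2019}
total = sum(avg.values())
weights = {g: round(10_000 * v / total) for g, v in avg.items()}
print(weights)   # per-10,000 weights based on the two-year average
```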

Figure 1. Trends of YoY changes in the indices
[Three line charts (%, monthly, Jan. 2020-Jul. 2022): All items; All items, less fresh food; All items, less fresh food and energy; each comparing the chain-linked index with the fixed-base index.]

Figure 2. Difference in YoY changes (fixed-base – chain)
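As a rough illustration of the comparison behind section 2.2 and Figures 1-2, the sketch below computes a fixed-base Laspeyres index and an index chain-linked with updated weights; all prices and weights are invented.

```python
import numpy as np

p0 = np.array([100.0, 100.0, 100.0])          # base period prices
prices = np.array([[102.0, 99.0, 101.0],       # period 1
                   [105.0, 97.0, 103.0]])      # period 2
w_base = np.array([0.5, 0.3, 0.2])             # base period expenditure shares
w_new = np.array([0.45, 0.35, 0.2])            # updated shares for the chain link

# Fixed-base Laspeyres: base weights applied to price relatives throughout.
fixed = (prices / p0) @ w_base * 100

# Chain: link period 2 onto period 1 using the updated weights.
link1 = (prices[0] / p0) @ w_base
link2 = (prices[1] / prices[0]) @ w_new
chain = np.array([link1, link1 * link2]) * 100

print("fixed-base:", fixed.round(2))
print("chain:     ", chain.round(2))
```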

New Hedonic Quality Adjustment Method using Sparse Estimation, Japan


Sahoko Furuta, Bank of Japan

Research and Statistics Department

New Hedonic Quality Adjustment Method using Sparse Estimation


Introduction

 The hedonic estimation generally has issues with multicollinearity and the omitted variable bias. These lead to low estimation accuracy and a large estimation burden in practice.

 To overcome these problems, we introduce a new estimation method using "sparse estimation" as a way to automatically select the meaningful variables from a large number of candidates.

 The new method brings three benefits: 1. A significant increase in the number of variables in the model; 2. An improvement in the fit of the model to actual prices; 3. A reduction of the over-estimation of quality improvements due to the omitted variable bias.


1. Motivations


What is Hedonic Quality Adjustment?

 The Bank of Japan applies the hedonic quality adjustment method in the compilation of the Price Indexes to eliminate the effect of products' quality changes.

 When a product turnover occurs, the observed price difference between new and old products is decomposed into (a) the difference due to a quality change and (b) the difference due to a pure price fluctuation, which is called quality adjustment.

 In the hedonic method, the relationship between product quality and price is statistically regressed with a large amount of data. This method is not only highly objective, but also applicable to various changes in characteristics of products.

The observed price difference between new and old products = (a) the price difference due to a quality change (estimated by the hedonic regression model) + (b) the price difference due to a pure price fluctuation. Only part (b) is reflected in the Price Indexes.
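A minimal numerical sketch of this decomposition, with made-up prices (not actual index data); the hedonic model is assumed to supply the theoretical prices:

```python
# All figures below are invented for illustration.
old_price, new_price = 200_000, 212_000              # observed prices (yen)
old_theoretical, new_theoretical = 198_000, 207_900  # hedonic predictions

# (a) quality change, estimated by the hedonic regression model
quality_ratio = new_theoretical / old_theoretical
# (b) pure price fluctuation: the only part reflected in the Price Indexes
pure_price_ratio = (new_price / old_price) / quality_ratio

print(f"quality change:         {100 * (quality_ratio - 1):+.1f}%")
print(f"pure price fluctuation: {100 * (pure_price_ratio - 1):+.1f}%")
```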


Overview of Conventional method

 Given the non-linear relationship between the price and the characteristics of a product, the hedonic regression model often has both linear parts and non-linear parts via Box-Cox transformed terms.

$$y_i^{\lambda_0} = \beta_0 + \sum_{k=1}^{p_d} \beta_{d_k} x_{d_k,i} + \sum_{j=1}^{p_c} \beta_{c_j} x_{c_j,i}^{\lambda_j}$$

$y_i$: theoretical price; $x_{c_j,i}$: continuous variable; $x_{d_k,i}$: dummy variable; $\beta_0$: constant term; $\beta_{c_j}$: coefficient on a continuous variable; $\beta_{d_k}$: coefficient on a dummy variable; $\lambda_0$: Box-Cox parameter for the theoretical price; $\lambda_j$: Box-Cox parameter for a continuous variable; $p_c$: number of continuous variables; $p_d$: number of dummy variables
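A minimal sketch of evaluating a theoretical price from this functional form; every parameter value below is an illustrative assumption, not an estimate from the model:

```python
import numpy as np

def theoretical_price(x_cont, x_dummy, b0, b_cont, b_dummy, lam0, lam):
    """y_i from the form above:
    y**lam0 = b0 + sum_k b_d[k]*x_d[k] + sum_j b_c[j]*x_c[j]**lam[j]."""
    rhs = b0 + b_dummy @ x_dummy + b_cont @ (x_cont ** lam)
    return rhs ** (1.0 / lam0)

y = theoretical_price(
    x_cont=np.array([150.0, 20.0]),   # e.g. horsepower, fuel efficiency
    x_dummy=np.array([1.0, 0.0]),     # e.g. leather seats, LED headlamp
    b0=5.0,
    b_cont=np.array([0.02, 0.05]),
    b_dummy=np.array([0.3, 0.2]),
    lam0=0.5,
    lam=np.array([0.8, 1.2]),
)
print(f"theoretical price: {y:.1f}")
```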

Issues of Conventional method


Accuracy of estimation
• Multicollinearity
• The omitted variables bias
• These problems are likely to arise when the characteristics of the products are highly correlated. They disturb accurate estimation of the parameters.

Burden of estimation
• Repeating the estimation while changing the set of variables (excluding variables that cause multicollinearity and including the meaningful ones) to obtain good results.


Accuracy of estimation (1)

 Estimated parameters on variables may become unstable due to the problem of multicollinearity and the omitted variables bias.

 Multicollinearity refers to a state in which the variables are highly inter-correlated. It makes it difficult to identify the price effects of individual variables, and it may also cause the omitted variables bias through variable selection based on statistical significance. As a result, the parameters are not estimated accurately.

 It is known that these problems can become more serious as the model takes a more complex functional form to deal with the non-linear effects of price-determining characteristics.


Accuracy of estimation (2)

 A distorted functional form leads to a problem called "overfitting": the model may give quite poor estimates for new products (i.e., out of sample).

[Two scatter plots of samples and fitted values for the maximum output of mini-vans (hp) against price (10 thous. yen): as of the October 2017 estimation the Box-Cox parameter is ≈3.4 and the fitted values rise rapidly in the out-of-sample region; after re-estimation in October 2018 the Box-Cox parameter is ≈0.0 (almost log), a significant change in the functional form, and the fit deviates from new samples.]

Burden of estimation

 As mentioned, a model with a complex functional form may suffer from the problems of multicollinearity and the omitted variables bias.

 Then, a slight change in the sample or the regressors often leads to quite different estimation results in each re-estimation. Such discontinuity in the estimates is highly problematic in practice.

⇒ We have to repeat the estimation, changing the set of variables each time, until we obtain a better, acceptable result.

 This problem is serious in the estimation for "passenger cars", where there are many candidate variables and they are highly correlated.


2. New Method using Sparse Estimation


Sparse Estimation (1)

 Sparse estimation has the property of selecting the meaningful variables from a large number of candidates and giving zero coefficients to the rest ("sparsity"). It can perform "variable selection" and "coefficient estimation" at the same time and can automatically derive a stable, well-fitted model.

 The new estimation method proposed in this study employs an Adaptive Elastic Net (AEN), which enjoys two desirable properties;

1. "Group Effect" that gives robustness for multicollinearity

2. "Oracle Property" that ensures the adequacy of variable selection and estimated coefficients.

11

Sparse Estimation (2)

 For example, Lasso, a typical sparse estimation method, estimates $\boldsymbol{\beta}$ by minimizing a loss function: the sum of squared errors plus a regularization term (the $L_1$ norm of $\boldsymbol{\beta}$).

 Lasso has a loss function similar to Ridge's, but differs in that it has sparsity.

Lasso:
$$\operatorname*{argmin}_{\boldsymbol{\beta}} \left\{ \left\| \boldsymbol{Y} - \boldsymbol{X}\boldsymbol{\beta} \right\|^2 + \lambda \sum_{j=1}^{p} \left| \beta_j \right| \right\}$$

Ridge:
$$\operatorname*{argmin}_{\boldsymbol{\beta}} \left\{ \left\| \boldsymbol{Y} - \boldsymbol{X}\boldsymbol{\beta} \right\|^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}$$

$\lambda > 0$: regularization parameter (relatively fewer variables are selected when $\lambda$ is large)
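The contrast in sparsity is easy to see on synthetic data. The sketch below uses scikit-learn as one possible implementation (not necessarily what the study used), with invented data in which only a few variables matter:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 200, 20
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=n)   # two highly correlated columns
beta_true = np.zeros(p)
beta_true[:3] = [1.0, 1.0, -0.5]                # only three variables matter
y = X @ beta_true + 0.5 * rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("non-zero coefficients, Lasso:", int(np.sum(lasso.coef_ != 0)))  # sparse
print("non-zero coefficients, Ridge:", int(np.sum(ridge.coef_ != 0)))  # all p
```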

Sparse Estimation (3)

 In the bivariate model, $\boldsymbol{\beta}$ is derived from the intersection of the contour lines of the sum of squared errors and the constraint region.

 Lasso gives $\boldsymbol{\beta}$ at a corner of the rhombus-shaped constraint, so one coefficient is estimated to be exactly zero.

Lasso:
$$\operatorname*{argmin}_{\beta_1,\beta_2} \sum_{i=1}^{n} \left( Y_i - \beta_1 X_{1,i} - \beta_2 X_{2,i} \right)^2 \quad \text{s.t.} \quad \left|\beta_1\right| + \left|\beta_2\right| \le s$$

Ridge:
$$\operatorname*{argmin}_{\beta_1,\beta_2} \sum_{i=1}^{n} \left( Y_i - \beta_1 X_{1,i} - \beta_2 X_{2,i} \right)^2 \quad \text{s.t.} \quad \beta_1^2 + \beta_2^2 \le s^2$$

$s > 0$: in one-to-one correspondence with $\lambda$

[Diagrams in the $(\beta_1, \beta_2)$ plane: the Lasso constraint (a rhombus of size $s$) intersects the error contours around $\hat{\boldsymbol{\beta}}^{OLS}$ at a corner, giving $\hat{\boldsymbol{\beta}}^{Lasso}$ on an axis, while the Ridge constraint (a circle) gives $\hat{\boldsymbol{\beta}}^{Ridge}$ off the axes.]

Adaptive Elastic Net (1)

 AEN can be interpreted as the combination of the Lasso and the Ridge.

 It has "group effect" and "oracle property."

[Table mapping each regularization method to its properties: Lasso has Sparsity; Elastic Net adds the Group Effect; Adaptive Elastic Net additionally has the Oracle Property; Ridge has the Group Effect but not Sparsity.]

Adaptive Elastic Net (2): Group Effect

 For Lasso, the results of variable selection are known to be unstable when the data have strong multicollinearity.

 A typical method to overcome this problem is the "Elastic Net (EN)."

 The robustness of EN for multicollinearity is called "group effect". It is a property that gives similar coefficients on variables when the correlation between them is high.

$$\hat{\boldsymbol{\beta}}^{EN} = \left(1 + \frac{\lambda_2}{n}\right) \operatorname*{argmin}_{\boldsymbol{\beta}} \left\{ \left\| \boldsymbol{Y} - \boldsymbol{X}\boldsymbol{\beta} \right\|^2 + \lambda_2 \sum_{j=1}^{p} \beta_j^2 + \lambda_1 \sum_{j=1}^{p} \left| \beta_j \right| \right\}$$

$\lambda_2 > 0$: $L_2$ norm regularization parameter; $\lambda_1 > 0$: $L_1$ norm regularization parameter; $n$: number of observations

Adaptive Elastic Net (3): Oracle Property

 The "oracle property" is known as a property that asymptotically guarantees the appropriateness of both the "variable selection" and the "coefficient estimation".

When $\boldsymbol{\beta}^*$ is the true coefficient vector, the estimator $\hat{\boldsymbol{\beta}}$ satisfies the following:

(1) Variable selection consistency:
$$\lim_{n\to\infty} P\!\left(\hat\beta_j = 0\right) = 1 \quad \text{when } \beta_j^* = 0$$

(2) Asymptotic normality of the non-zero coefficients:
$$\frac{\hat\beta_j - \beta_j^*}{\sigma(\hat\beta_j)} \xrightarrow{d} N(0, 1) \quad \text{as } n\to\infty, \text{ when } \beta_j^* \ne 0$$

$\sigma^2(\hat\beta_j)$: asymptotic variance of the estimator

Adaptive Elastic Net (4)

 We employ AEN as a new estimation method for hedonic regression model.

 The AEN estimation is performed in two stages. At the first stage, we estimate the coefficients with EN. Then, EN is performed again to impose greater penalties for variables with small absolute values of the coefficients.

$$\hat{\boldsymbol{\beta}}^{AEN} = \left(1 + \frac{\lambda_2}{n}\right) \operatorname*{argmin}_{\boldsymbol{\beta}} \left\{ \left\| \boldsymbol{Y} - \boldsymbol{X}\boldsymbol{\beta} \right\|^2 + \lambda_2 \sum_{j=1}^{p} \beta_j^2 + \lambda_1^* \sum_{j=1}^{p} \hat{w}_j \left| \beta_j \right| \right\}, \qquad \hat{w}_j = \left| \hat\beta_j^{(EN)} \right|^{-\gamma}$$

$\lambda_1^* > 0$: $L_1$ norm regularization parameter (2nd stage); $\hat{w}_j > 0$: adaptive weight; $\gamma > 0$: adaptive parameter (a larger $\gamma$ imposes a larger penalty according to the absolute value of the first-stage coefficient)
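A minimal sketch of the two-stage procedure, assuming scikit-learn's ElasticNet and absorbing the adaptive weights into the columns of X (a standard rescaling trick for adaptive penalties). The mapping of (λ1, λ2) onto scikit-learn's (alpha, l1_ratio) is only approximate, the (1 + λ2/n) factor is omitted, and all hyperparameter values are illustrative.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def adaptive_elastic_net(X, y, lam1=0.1, lam1_star=0.05, lam2=0.01, gamma=0.5):
    """Stage 1: Elastic Net fit. Stage 2: refit with adaptive weights
    w_j = |beta_j(EN)|**(-gamma), absorbed by rescaling each column."""
    en = ElasticNet(alpha=lam1 + lam2, l1_ratio=lam1 / (lam1 + lam2)).fit(X, y)
    with np.errstate(divide="ignore"):
        w = np.abs(en.coef_) ** (-gamma)
    w[~np.isfinite(w)] = 1e10          # zero first-stage coefficients: huge penalty
    Xw = X / w                          # penalizing w_j*|b_j| == plain L1 on (w_j*b_j)
    en2 = ElasticNet(alpha=lam1_star + lam2,
                     l1_ratio=lam1_star / (lam1_star + lam2)).fit(Xw, y)
    return en2.coef_ / w                # undo the rescaling

# Illustrative use on synthetic data:
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 30))
beta = np.zeros(30)
beta[:4] = [1.0, -0.8, 0.5, 0.3]
y = X @ beta + 0.3 * rng.normal(size=300)
print("selected variables:", np.flatnonzero(adaptive_elastic_net(X, y)))
```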


3. Estimation Results


Continuous variables in the model

 We apply new and previous hedonic regression models to passenger cars in Japan and compare those results.

 The number of continuous variables in the regression models increases and this is accompanied by a reduction in dependence on just a few specific variables.

Note: Bar charts indicate the rates of change in theoretical price due to one unit increase in variables where all variables of a product are set at sample means.

[Bar chart of contributions to the theoretical price (%), previous vs. new method, for continuous variables: Maximum Output; Fuel Efficiency × Equivalent Inertia Weight; Curb Weight / Volume; Area; Power-to-Weight Ratio; Height; Fuel Efficiency; Number of Gears; Rim Size; Maximum Torque (kg·m); Population Density.]

Dummy variables in the model

 As a result of the increased number of characteristics, the new regression model reduces its reliance on manufacturer dummies (control variables).

Note: Bar charts indicate the rates of change in theoretical price due to one unit increase in variables where all variables of a product are set at sample means.

[Bar charts of contributions to the theoretical price (%), previous vs. new method, for equipment dummies (LSD, Aluminum Wheel, Power Seat, LED Headlamp, Leather Seats, ACC (no speed limitation), AFS, LDWS, Parking Assist, Side Airbags, Navigation, Dual Air Conditioning, Blu-ray, Power Seat (Rear), Cruise Control, DPMS, Seat Heater (Driver), Front Spoiler, Rear Spoiler, Leather Steering, Front Fog Lamps, Full Automatic Air Conditioning, Unleaded Premium Gasoline) and for brand dummies (Brands A-J).]

Fit of the model

 The fit (mean squared error) of the regression models to actual prices improves under the new estimation method for both the in-sample and out-of-sample periods.

 Since the quality adjustment is generally applied to products released after the estimation, the improvement in the out-of-sample fit implies an increase in the practical usefulness of the hedonic quality adjustment method.
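The comparison can be reproduced schematically as follows, with synthetic data standing in for the passenger-car sample and ordinary least squares standing in for the fitted hedonic models; only the in-sample/out-of-sample mechanics are illustrated:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 10))
y = 0.5 * X[:, 0] - 0.3 * X[:, 1] + 0.2 * rng.normal(size=500)  # toy log price

X_in, y_in = X[:400], y[:400]     # releases in the estimation window
X_out, y_out = X[400:], y[400:]   # releases after the estimation window

model = LinearRegression().fit(X_in, y_in)
print("in-sample MSE:     ", mean_squared_error(y_in, model.predict(X_in)))
print("out-of-sample MSE: ", mean_squared_error(y_out, model.predict(X_out)))
```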

[Bar chart of the MSE of log price, in sample (2016/Q3-2018/Q2) and out of sample (2018/Q3-2019/Q2), previous vs. new method.]

Estimated Price Index

 The estimated price index of "standard passenger cars (gasoline cars)" in the PPI, which is retrospectively calculated by applying the new hedonic estimation method to all quality adjustments, shows similar developments to the published price index.

 On the other hand, the previous method highlights the risk of over-estimating the rate of quality improvement as it shows an excessive decline in the price.

[Line chart (CY2015=100, Jan. 2017-Jul. 2019): the published index, the estimated index adjusted by the new method, and the estimated index adjusted by the previous method.]

Recent Estimation Results

 In recent years, passenger car quality has changed greatly as the market has moved from gasoline cars to electric cars.

 The new method worked well, and some features related to electric motors were adopted in recent estimations, allowing a more accurate evaluation of passenger car quality in this market situation.

Note: Bar charts indicate the rates of change in theoretical price due to one unit increase in variables where all variables of a product are set at sample means.

23

4. Conclusion

24

Conclusion

• The new estimation method using "sparse estimation"

1. significantly mitigates the problems of omitted variables and multicollinearity,

2. improves estimation accuracy and reduces the estimation burden, and

3. possibly improves the accuracy of the price index.

• The proposed method can automatically build a well-performing model by extracting all the necessary information even from a large dataset (e.g. 1,500 samples and 100 candidate variables in the passenger-car dataset), and it can be expected to support the effective use of big data for price statistics.
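As a hedged illustration of how such a sparse estimator behaves, the sketch below fits a cross-validated Lasso to synthetic data of the scale cited above (1,500 samples, 100 candidate variables) and reports in-sample and out-of-sample MSE, mirroring the fit comparison on slide 21. The synthetic data and the use of scikit-learn's LassoCV are assumptions for illustration, not the actual estimator or dataset behind these results.

```python
# Illustrative sketch only: Lasso ("sparse estimation") on synthetic hedonic data.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n, p = 1500, 100                      # scale cited in the conclusion
X = rng.normal(size=(n, p))           # candidate characteristics (synthetic)
beta = np.zeros(p)
beta[:8] = rng.uniform(0.05, 0.3, 8)  # only a few characteristics truly matter
y = X @ beta + rng.normal(scale=0.1, size=n)  # log "theoretical" price

# hold out the last fifth to mimic an out-of-sample period
X_in, X_out, y_in, y_out = X[:1200], X[1200:], y[:1200], y[1200:]

scaler = StandardScaler().fit(X_in)
model = LassoCV(cv=5).fit(scaler.transform(X_in), y_in)  # penalty chosen by CV

print("variables kept:", int((model.coef_ != 0).sum()))
print("in-sample MSE:", mean_squared_error(y_in, model.predict(scaler.transform(X_in))))
print("out-of-sample MSE:", mean_squared_error(y_out, model.predict(scaler.transform(X_out))))
```

The Lasso penalty shrinks the coefficients of uninformative candidates exactly to zero, which is what lets the model consider many candidate variables without the multicollinearity and manual variable-selection burden of the previous method.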



Expanding the use of Big Data for CPI in Japan

Seitaro Tanimichi, Takuya Shibata Statistics Bureau, Japan

Meeting of the Group of Experts on Consumer Price Indices, 7-9 June 2023, Geneva, Switzerland

Outline

• Background
• Web scraping data: hotel charges
• Scanner data
• Study for further expansion

2

Background

2000-base: Scanner data for “desktop computers” and “laptop computers”

2005-base: Added scanner data for “cameras”

2010-base: Included scanner data of "tablet computers" in "laptop computers"

2015-base: Separated "tablet computers" from "laptop computers"

2020-base: Web scraping data for “hotel charges” “airplane fares” “charges for package tours to overseas”

Scanner data for “video recorders”, “PC printers” and “TV sets”

3

Use of web Scraping data : hotel charges

• Capturing the price trend of internet sales appropriately grasped the price trend of hotel charges
• Web scraping can stably collect prices from each travel booking website
• A huge number of internet sales prices were accurately reflected in the indices

• A questionnaire survey examined trends in purchasing methods, time to make reservations, accommodation plans, selection of collection websites, etc.

Also:
• Price collection and index production by web scraping were conducted on a trial basis
• The result was compared with the index from conventional price surveys

Web scraping contributes to the improvement of indices

4

Web Scraping (hotel charges) : Price collection sites

N = 2,448

Reservation method \ Reservation time | Within a week | One to three weeks before | One month or more before | Unknown | Total
Called hotels directly | 3% | 4% | 5% | 1% | 13%
Website of hotels | 2% | 7% | 12% | 1% | 21%
Travel booking site | 7% | 21% | 29% | 2% | 59%
Over the counter | 0% | 1% | 2% | 0% | 3%
Others | 0% | 0% | 1% | 0% | 1%
Unknown | 0% | 0% | 1% | 2% | 3%
Total | 12% | 33% | 50% | 6% | 100%

5

Web Scraping (hotel charges) : Accommodation plan

N = 2,448

Meal type \ Room type | Western-style rooms | Japanese-style rooms | Japanese-Western style rooms | Others | Total
No meals | 24% | 4% | 1% | 1% | 29%
With breakfast | 24% | 3% | 1% | 0% | 29%
With breakfast and dinner | 11% | 22% | 7% | 0% | 40%
Breakfast, lunch and dinner included | 1% | 1% | 0% | 0% | 2%
Others | 0% | 0% | 0% | 0% | 0%
Total | 60% | 30% | 9% | 1% | 100%

6

Web Scraping (hotel charges) : Price collection time

• Prices are collected, in principle, at the beginning of the month, two months before the accommodation date

• When collecting one month before the accommodation date, some sites showed abnormally high average prices for some accommodations compared with the two-month-prior collection, because low-priced plans could not be collected owing to full occupancy

Long-term web scraping conducted between August 2017 and March 2018 (for 30 accommodation facilities)

• Prices for about 10% of accommodations four months ahead and about half of accommodations six months ahead were not listed on the booking website

• Seasonal limit on the advanced reservation, a gap at the time of change of the fiscal year

7

Web Scraping (hotel charges) : Accommodation Facility

• About 400 representative accommodation facilities are selected

• While price collection by web scraping does not face the upper limit on the number of target facilities imposed by resource constraints, unrestricted access to websites to obtain prices is not possible in light of the load on the websites.
⇒ It is necessary to set an appropriate number of target facilities.

• In the pilot study, the standard error rate of the average price almost stopped decreasing and leveled off once the number of facilities exceeded 400

8

Web Scraping (hotel charges) : Calculation of indices

(1) Exclusion of outliers

$Y_{s,a,b,c} = \log(P_{s,a,b,c})$

where $P_{s,a,b,c}$ is the price on booking website $s$, for accommodation date $a$, accommodation facility $b$ and plan $c$.

$\bar{Y}_{s,a,b} = \frac{1}{N_{s,a,b}} \sum_{c=1}^{N_{s,a,b}} Y_{s,a,b,c}$

$\sigma_{s,a,b} = \sqrt{\frac{1}{N_{s,a,b}-1} \sum_{c=1}^{N_{s,a,b}} \left( Y_{s,a,b,c} - \bar{Y}_{s,a,b} \right)^2}$

$Y_{s,a,b,c}$ is considered an outlier if $\left| Y_{s,a,b,c} - \bar{Y}_{s,a,b} \right| > 3\sigma_{s,a,b}$

• Using a two-month data set for the current month ($t$) and the previous month ($t-1$), the price indices are calculated according to the following procedures (1) to (4)

9

Web Scraping (hotel charges) : Calculation of indices

(2) Creation of a data table

• Average prices for each booking website ($s$), accommodation date ($a$) and accommodation facility ($b$) are calculated, and a data table with these as attributes is created:

$\bar{Y}'_{s,a,b} = \frac{1}{N'_{s,a,b}} \sum_{c=1}^{N'_{s,a,b}} Y_{s,a,b,c}$

(3) Missing value imputation: next slide

(4) Calculation of index

• The data set after imputation is used to calculate geometric average prices for the current month ($t$) and the previous month ($t-1$), respectively:

$\bar{P}_t = \left( \prod_{s,a,b} P_{t,s,a,b} \right)^{1/N_t} = \exp\left( \frac{1}{N_t} \sum_{s,a,b} \log P_{t,s,a,b} \right) = \exp\left( \frac{1}{N_t} \sum_{s,a,b} \bar{Y}'_{t,s,a,b} \right)$

• The price relative is multiplied by the price index for the previous month to calculate the price index for the current month:

$I_t = I_{t-1} \times \bar{P}_t / \bar{P}_{t-1}$

10

Web Scraping (hotel charges) : Missing value imputation

Data table with missing values:

Accommodation date (Xa) | Booking site (Xs) | Facility (Xb) | Log average price (y)
2018/12/1 | A | X | 9.51
2018/12/1 | A | Y | 9.61
2018/12/1 | A | Z | 9.75
2018/12/1 | B | X | (missing)
2018/12/1 | B | Y | (missing)
2018/12/1 | B | Z | (missing)
2018/12/1 | C | X | 9.58
2018/12/1 | C | Y | 9.69
2018/12/1 | C | Z | 9.85
2018/12/2 | A | X | 9.65
2018/12/2 | A | Y | 9.66
2018/12/2 | A | Z | (missing)
2018/12/2 | B | X | 9.49
… | … | … | …

Observed rows used to fit the regression:

Accommodation date (Xa) | Booking site (Xs) | Facility (Xb) | Log average price (y)
2018/12/1 | A | X | 9.51
2018/12/1 | A | Y | 9.61
2018/12/1 | A | Z | 9.75
2018/12/1 | C | X | 9.58
2018/12/1 | C | Y | 9.69
2018/12/1 | C | Z | 9.85
2018/12/2 | A | X | 9.65
2018/12/2 | A | Y | 9.66
2018/12/2 | B | X | 9.49
… | … | … | …

$\bar{Y}'_{s,a,b} = \alpha + \boldsymbol{\beta}_a \cdot \boldsymbol{x}_a + \boldsymbol{\beta}_s \cdot \boldsymbol{x}_s + \boldsymbol{\beta}_b \cdot \boldsymbol{x}_b + \varepsilon$

11

Use of web Scraping data : hotel charges

12

Use of web Scraping data : hotel charges

 | 2015-base method (field collection) | 2020-base method (web scraping)
Collection conditions | Prices on Friday and Saturday of the week including the 5th of every month | Prices of 1st to 31st of every month, purchased two months in advance of accommodation
Number of collected prices | 640 | About 1 million

13

[Chart note: the 2020-base index in July reflects the impact of the "Go To Travel" policy that started in late July]

Use of web Scraping data : hotel charges

14

Use of scanner data

Specifications | Examples
Release month | Year, Month
Tuner shape | Separate type, Integrated type, None
Screen size | 3-inch type to 75-inch type
Number of pixels displayed | 1366x768, 1920x1080, 3840x2160, etc.
D connector | D4x1, D5x1, None
PC input | D-Sub, None
Communication terminal | LAN, None
Card slot | SDXC, None
HDD capacity | 0 GB to 2,000 GB
Internet | Capable, Incapable
Wireless function | IEEE802.11a/n, None
Audio output | 10W+10W, 3W+3W, 5W+5W, etc.
HDMI connector | 0 to 4
Link function | Available, Unavailable
Drive speed | Constant speed, Double speed
Recording media | HDD (external), HDD (internal/external)
High-definition capable | 4K/2K, 8K, High-definition, Full high-definition, Incapable
Hybrid cast | Capable, Incapable

• TV sets → hedonic model
• PC printers, video recorders → fixed-specification method

15

Use of scanner data

 | 2015 Base (field collection) | 2020 Base (scanner data)
Collection time and price | Price on any one of Wednesday, Thursday or Friday of the week including the 12th of each month | Prices from 1st to 31st of each month
Item | Video recorders / PC printers / TV sets | Video recorders / PC printers / TV sets
Number of collected product models | 6 / 1 / 8 | 23 / 46 / 600
Number of stores for collection | 186 / 172 / 186 | About 2,600 / About 2,600 / About 2,600
Number of collected prices | 186 / 172 / 186 | About 30,000 / About 80,000 / About 240,000

16

Use of Scanner data

17

Study for further expansion of the use of big data

• It is necessary to accelerate the use of big data for the CPI
• The items under consideration include white goods, foods, medical supplies, daily necessities, and clothing
• For clothing, we are considering web scraping to collect prices for items such as one-piece dresses, slacks, and children's trousers
• As web scraping data for clothing contains a large number of related products, it is necessary to extract equivalent products from them
• Since the necessary codes and names are often not present, it is difficult to filter products mechanically (and not practical to extract them manually)
⇒ Currently studying the construction of a machine learning model for automatically classifying products based on product descriptions (about 100 to 400 words) and image information

18

Mascots of the Statistics Bureau of Japan: "Census-Kun" (Master Census) and "Mirai-chan" (Miss Future)

Thank you

19 Seitaro Tanimichi Takuya Shibata [email protected] [email protected]




Expanding the use of Big Data for CPI in Japan

Seitaro Tanimichi, Takuya Shibata

Statistics Bureau of Japan

May 2023

Prepared for the Meeting of the Group of Experts on Consumer Price Indices

UNECE, June 2023, Geneva

Summary

The Statistics Bureau of Japan (SBJ) has been utilizing big data to calculate the consumer price index

(CPI) and has greatly expanded the scope since the 2020-base year.

In the 2015-base year, the index was calculated using scanner data for four items: “personal computers

(laptop)”, “personal computers (desktop)”, “tablet computers” and “cameras”. From the 2020-base, three

items, “video recorders”, “PC printers” and “TV sets” were added to the index using scanner data.

The SBJ has been conducting experimental studies and pilot tests for the use of web scraping since 2015,

and from the 2020-base, began actually producing indices for travel services (“airplane fares”, “hotel charges”

and “charges for package tours to overseas”).

By expanding coverage, the use of big data has made it possible to produce more appropriate indices, with

the number of prices increased significantly compared to previous field surveys, and to reduce the burden on

local governments and price collectors.

This paper introduces a comparison of the 2020-base results using big data and the 2015-base results

using field surveys for the same items, as well as the current status of studies aimed at expanding the use of

big data.

1. Introduction

In the 2020-base revision of CPIs in Japan, the use of scanner data was expanded and internet sales prices

by web scraping were newly adopted. In order to expand the use of big data, in light of the increase in online

shopping in recent years and the development of information-gathering technology, around 2015 the SBJ

started specific studies on the use of scanner data and the collection of online sales prices by web scraping.

For the items to be adopted, we narrowed down the candidates by comparing the index created from the trial

collection data with the current index and the percentage of online purchases. As a result, it was decided to

expand the use of scanner data in recreational durable goods, and for travel services (airplane fares, hotel

charges, and charges for package tours to overseas), and to shift from previous price surveys to collection of

online sales prices using web scraping.

In addition to confirming that there were no legal problems such as copyright with web scraping, we

requested the cooperation of site operators, improved the collection timing, and began operation in January

2020. Since August 2021, the SBJ has published indices calculated by expanding the use of such big data.

In this paper, we present the verification of the production of indices for items by using big data in the

2020-base and the status of studies toward the further use of big data in the 2025-base.

History of expanded use of big data in base revision of CPI

2000-base Used scanner data for “personal computers (desktop)” and “personal computers (laptop)”

2005-base Added scanner data for “cameras”

2010-base Included the prices from scanner data for "tablet computers" in "personal computers (laptop)"

2015-base Separated "tablet computers" from "personal computers (laptop)"

2020-base Used scanner data for “video recorders”, “PC printers” and “TV sets”

Used web scraping data for “airplane fares”, “hotel charges” and “charges for package tours

to overseas”

2. Details of studies and calculation methods of price indices using big data

(1) Use of web scraping data: example of “hotel charges”

In considering the use of web scraping for hotel charges, we conducted a questionnaire survey to

examine trends in purchasing methods, time to make reservations, accommodation plans, selection of

collection sites, etc. We also conducted price collection and index production by web scraping on a trial

basis, and compared it with the index by conventional price surveys. As a result,

・ The largest number of reservations were made via the Internet, and capturing the price trend of

internet sales appropriately grasped the price trend of hotel charges.

・ We confirmed that web scraping can stably collect internet reservation prices from each travel

booking website.

・ We had a prospect of a huge number of internet sales prices being accurately reflected in the indices,

including quality adjustment, and it is expected that web scraping collecting daily prices contributes

to the improvement of accuracy of indices.

Therefore, we decided to use the internet sales prices.

(Price collection sites)

According to the questionnaire results, the largest number of people used travel booking websites

rather than websites of hotels. So, based on the status of the transaction volume handled by major travel

agencies, travel agencies of booking websites with the highest share of the transaction volume are

selected for web scraping collection of prices. In addition, as web scraping requires individual settings

based on each website structure, it is practical and efficient to collect from a comprehensive booking site,

in which we can collect many prices from the same site.

Table 1: Reservation time and method (results of the questionnaire)

N = 2,448

Reservation method \ Reservation time | Within a week | One to three weeks before | One month or more before | Unknown | Total
Called hotels directly | 3% | 4% | 5% | 1% | 13%
Website of hotels | 2% | 7% | 12% | 1% | 21%
Travel booking site | 7% | 21% | 29% | 2% | 59%
Over the counter | 0% | 1% | 2% | 0% | 3%
Others | 0% | 0% | 1% | 0% | 1%
Unknown | 0% | 0% | 1% | 2% | 3%
Total | 12% | 33% | 50% | 6% | 100%

(Accommodation plans and price collection time)

Depending on the release timing of accommodation plans at travel agencies and the timing of

consumers’ purchases, daily prices in each month of ryokan (Japanese-style inns), Japanese-style rooms,

of one night with two meals plans and of hotels, Western-style rooms, of one night with breakfast are

used. Plans with extremely high (or extremely low owing to a sale) prices relative to typical hotel charges

are excluded during process of excluding outliers.

Table 2: Cross table of room types and meal types (results of the questionnaire)

N = 2,448

Meal type \ Room type | Western-style rooms | Japanese-style rooms | Japanese-Western style rooms | Others | Total
No meals | 24% | 4% | 1% | 1% | 29%
With breakfast | 24% | 3% | 1% | 0% | 29%
With breakfast and dinner | 11% | 22% | 7% | 0% | 40%
Breakfast, lunch and dinner included | 1% | 1% | 0% | 0% | 2%
Others | 0% | 0% | 0% | 0% | 0%
Total | 60% | 30% | 9% | 1% | 100%

As for price collection time, in principle, prices are collected at the beginning of the month, two months

before the accommodation date. This is because, in the web scraping collection results obtained during

the pilot study, the collection results one month before the accommodation date of some sites showed

that the average price of some accommodations was abnormally high compared to that of the two-month

prior collection due to the inability to collect low-priced plans because of full occupancy.

In addition, according to the results of long-term web scraping conducted between August 2017 and

March 2018, limited to 30 accommodation facilities, the following trends were observed in the number

of facilities where prices could be collected, and it was also found that there was a seasonal limit on

advanced reservation. (Table 3)

・ Prices for about 10% of accommodations four months ahead and about half of accommodations six

months ahead were not listed on the booking site. Therefore, it was not possible to collect prices.

・ Especially before November, prices from the following April (shaded cells) are posted considerably

less than before, and there is a gap in the status of prices posted on the site at the time of change of the

fiscal year.

Table 3: Number of accommodation facilities capable of price collection (N = 30)

Collection month | 1 mo | 2 mo | 3 mo | 4 mo | 5 mo | 6 mo | 7 mo | 8 mo | 9 mo | 10 mo | 11 mo ahead
2017 Aug | 30 | 29 | 29 | 28 | 25 | 18 | 14 | 2 | 2 | 2 | 1
2017 Sep | 30 | 30 | 29 | 26 | 23 | 16 | 4 | 2 | 2 | 1 | 1
2017 Oct | 30 | 30 | 30 | 27 | 22 | 7 | 3 | 2 | 1 | 1 | 1
2017 Nov | 30 | 30 | 29 | 26 | 17 | 10 | 5 | 4 | 2 | 2 | 1
2017 Dec | 30 | 29 | 28 | 24 | 22 | 14 | 7 | 5 | 5 | 3 | 3
2018 Jan | 29 | 29 | 27 | 26 | 26 | 14 | 9 | 6 | 5 | 5 | 5
2018 Feb | 29 | 28 | 28 | 27 | 26 | 18 | 12 | 5 | 5 | 5 | 3
2018 Mar | 29 | 29 | 28 | 27 | 26 | 17 | 10 | 6 | 6 | 3 | 2
Average | 30 | 29 | 29 | 26 | 23 | 14 | 8 | 4 | 4 | 3 | 2
Collection percentage | 100% | 99% | 96% | 89% | 79% | 48% | 27% | 14% | 12% | 9% | 7%

(Columns give the number of months between the collection month and the reservation month; the shading mentioned in the text marks reservations from the following April onward.)

(Accommodation facilities)

Based on the number of guests and facility scale of capacity by travel destination (prefecture) in the

Overnight Travel Statistics Survey (official statistics by Japan Tourism Agency), about 400 representative

accommodations are selected.

Price collection by web scraping does not require consideration of the upper limit of the number of

target facilities caused by resource constraints. However, unrestricted access to websites to obtain

Internet sales prices is not possible in light of the load on the site. Therefore, it is necessary to set an

appropriate number of target facilities.

In the pilot study, the standard error rate of the geometric average price was calculated using the

experimentally collected data table, and the effect on the price index was taken into account. As a result,

the number of facilities was set at 400, since the standard error rate for the increase in the number of

facilities almost stopped decreasing and leveled off when the number of facilities exceeded 400.

(Calculation method of indices)

Using a two-month data set for the current month (&#x1d461;&#x1d461;) and the previous month (&#x1d461;&#x1d461; − 1), the price indices

are calculated according to the following procedures (1) to (4).

(1) Exclusions of outliers

In price collection, as all plans that match the conditions are collected, extremely high or low prices

may be collected. Plans in such price range have large quality differences from other prices and may

have temporarily lower prices, such as with a limited-time sale. Thus it is considered appropriate to

exclude them as outliers when producing price indices. Therefore, the following procedure is adopted

to exclude outliers.

(a) Define the individual prices as $P_{s,a,b,c}$ by booking website ($s$), accommodation date ($a$), accommodation facility ($b$) and plan ($c$), and convert them to logarithms.

$Y_{s,a,b,c} = \log(P_{s,a,b,c})$

(b) Calculate average prices and standard deviations by booking website, accommodation date and accommodation facility ($N_{s,a,b}$ is the number of plans).

$\bar{Y}_{s,a,b} = \frac{1}{N_{s,a,b}} \sum_{c=1}^{N_{s,a,b}} Y_{s,a,b,c}$

$\sigma_{s,a,b} = \sqrt{\frac{1}{N_{s,a,b}-1} \sum_{c=1}^{N_{s,a,b}} \left( Y_{s,a,b,c} - \bar{Y}_{s,a,b} \right)^2}$

(c) Any individual price whose absolute deviation from the average price exceeds three times the standard deviation for its reservation site, accommodation date and accommodation facility is considered an outlier.

$\left| Y_{s,a,b,c} - \bar{Y}_{s,a,b} \right| > 3\sigma_{s,a,b}$
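A minimal sketch of this outlier screen, assuming a pandas DataFrame with hypothetical columns "site", "date", "facility" and "price"; it is an illustration consistent with steps (a) to (c), not the production code.

```python
import numpy as np
import pandas as pd

def drop_outliers(df: pd.DataFrame) -> pd.DataFrame:
    df = df.assign(y=np.log(df["price"]))                  # (a) log prices
    grp = df.groupby(["site", "date", "facility"])["y"]
    mean = grp.transform("mean")                           # (b) per-cell mean...
    sd = grp.transform("std")                              # ...and std (N-1 denominator)
    keep = ((df["y"] - mean).abs() <= 3 * sd) | sd.isna()  # (c) 3-sigma rule
    return df[keep]  # cells with a single plan (std undefined) are kept as-is
```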

(2) Creation of a data table

For individual prices excluding outliers, average prices for each booking website, accommodation date, and accommodation facility are calculated, and a data table with these as attributes is created ($N'_{s,a,b}$ is the number of prices excluding outliers).

$\bar{Y}'_{s,a,b} = \frac{1}{N'_{s,a,b}} \sum_{c=1}^{N'_{s,a,b}} Y_{s,a,b,c}$

(3) Missing value imputation

After data cleaning, if no individual prices are displayed on the site for a given combination of reservation date and accommodation, the average value under that search condition cannot be calculated, which causes missing values in the data table. If missing values are simply ignored when calculating average prices, differences in missingness by day of the week can make the missingness less random, biasing the average price. Attention should also be paid to the stage at which imputation is performed, because the result of the index calculation may change depending on the order in which averages are computed. Therefore, a method of estimating and imputing missing values from a regression analysis of the observed values (regression imputation) is used.

As the index calculation assumes a monthly chain-linking method, performing the regression analysis on a data set covering two consecutive months allows the same regression coefficients to adjust for month-to-month variation in average prices due to the entry and exit of accommodations, such as those newly collected in the current month or those that no longer accept reservations from the current month.

(a) Using the data table aggregated in (2), regression analysis is performed with the price $\bar{Y}'_{s,a,b}$ as the explained variable and reservation site, accommodation date, and accommodation facility as explanatory variables (dummy variables).

$\bar{Y}'_{s,a,b} = \alpha + \boldsymbol{\beta}_s \cdot \boldsymbol{x}_s + \boldsymbol{\beta}_a \cdot \boldsymbol{x}_a + \boldsymbol{\beta}_b \cdot \boldsymbol{x}_b + \varepsilon$

Explanatory variables:
Reservation site: $\boldsymbol{x}_s = (x_{s,1}, \ldots, x_{s,S-1})$, where $S$ is the number of booking websites
Accommodation date: $\boldsymbol{x}_a = (x_{a,1}, \ldots, x_{a,A-1})$, where $A$ is the total number of days in the current and previous months
Accommodation facility: $\boldsymbol{x}_b = (x_{b,1}, \ldots, x_{b,B-1})$, where $B$ is the number of accommodation facilities

(b) Based on the estimated regression model, for the combinations of booking website, accommodation date, and accommodation facility with missing prices, estimated values $\hat{y}_{\mathrm{mis}}$ are calculated from the attribute information (booking website: $\boldsymbol{x}_s'$, accommodation date: $\boldsymbol{x}_a'$, accommodation facility: $\boldsymbol{x}_b'$) and substituted as imputed values.

$\hat{y}_{\mathrm{mis}} = \hat{\alpha} + \hat{\boldsymbol{\beta}}_s \cdot \boldsymbol{x}_s' + \hat{\boldsymbol{\beta}}_a \cdot \boldsymbol{x}_a' + \hat{\boldsymbol{\beta}}_b \cdot \boldsymbol{x}_b'$
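The dummy-variable regression and the substitution of fitted values can be sketched as follows, assuming the averaged data table is a DataFrame with hypothetical columns "site", "date", "facility" and "y" (NaN where missing); it also presumes every attribute level occurs at least once among the observed rows.

```python
import pandas as pd
import statsmodels.formula.api as smf

def impute_missing(tbl: pd.DataFrame) -> pd.DataFrame:
    # (a) OLS on the observed rows, with dummies for each attribute level
    fit = smf.ols("y ~ C(site) + C(date) + C(facility)",
                  data=tbl.dropna(subset=["y"])).fit()
    out = tbl.copy()
    missing = out["y"].isna()
    # (b) substitute fitted values for the missing combinations
    out.loc[missing, "y"] = fit.predict(out.loc[missing])
    return out
```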

(4) The data set after imputation is used to calculate the geometric average prices for the current month ($t$) and the previous month ($t-1$), respectively. The price relative is then multiplied by the price index for the previous month to calculate the price index for the current month.

$\bar{P}_t = \left( \prod_{s,a,b} P_{t,s,a,b} \right)^{1/N_t} = \exp\left( \frac{1}{N_t} \sum_{s,a,b} \log P_{t,s,a,b} \right) = \exp\left( \frac{1}{N_t} \sum_{s,a,b} \bar{Y}'_{t,s,a,b} \right)$

$I_t = I_{t-1} \times \bar{P}_t / \bar{P}_{t-1}$
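Under the same assumed table layout, step (4) reduces to a mean of log prices per month followed by monthly chaining:

```python
import numpy as np
import pandas as pd

def chained_index(index_prev: float,
                  tbl_t: pd.DataFrame, tbl_t1: pd.DataFrame) -> float:
    p_t = np.exp(tbl_t["y"].mean())    # geometric average price, month t
    p_t1 = np.exp(tbl_t1["y"].mean())  # geometric average price, month t-1
    return index_prev * p_t / p_t1
```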

Figure 1 shows the calculation results of the verification. Imputing missing values keeps the index stable, because it adjusts for differences in month-by-month average prices caused by differences in the covered facilities. To examine seasonality, we compared the index after imputation

with the average value for four years of published values from 2015 to 2018 and found that the index

after imputation generally captured seasonal movements. In addition, the index in August was lower than

the published value because the published value in 2018 largely increased owing to the effect of a

calendar, but reflecting daily prices by web scraping removes the temporary effect of the relationship

between survey date and a calendar. Conversely, the indices in December and January were higher than

the published values, but this divergence was caused by the fact that the published values did not reflect

prices during the busy period of year-end and New Year holidays, while the calculation values did. Thus

calculation results are considered to reflect the actual condition.

Figure 1: Index calculation results

(2) Use of scanner data: examples of “TV sets”

Until the 2015-base, the price index of “TV sets” for the CPIs was calculated using prices collected

through the specification designation method in the Retail Price Survey. However, while high-quality

TVs with higher resolution and larger screens are becoming more prevalent, there is demand for

conventional TVs due to the increasing number of single-person households and other factors, leading to

greater diversification. To reflect these trends in the indices, we examined index creation using the

hedonics method, which utilizes scanner data, as a method to create indices that do not rely on the

specification designation method.

The following scanner data were used in the validation for the 2020-base revision.

・ Period: Monthly data from October 2017 to March 2018

・ Type: Liquid crystal display TV (not including organic EL TV)

・ Region: Whole of country (about 2,500 outlets), including online shops

・ Data size: Approximately 750 models, Unit sales: Approximately 220,000/month average

・ Average unit price and sales quantities by model (total of outlet sales and online sales)

・ Characteristics of each model, such as screen size and number of pixels displayed

Specifications | Examples
Release month | Year, Month
Tuner shape | Separate type, Integrated type, None
Screen size | 3-inch type to 75-inch type
Number of pixels displayed | 1366x768, 1920x1080, 3840x2160, etc.
D connector | D4x1, D5x1, None
PC input | D-Sub, None
Communication terminal | LAN, None
Card slot | SDXC, None
HDD capacity | 0 GB to 2,000 GB
Internet | Capable, Incapable
Wireless function | IEEE802.11a/n, None
Audio output | 10W+10W, 3W+3W, 5W+5W, etc.
HDMI connector | 0 to 4
Link function | Available, Unavailable
Drive speed | Constant speed, Double speed
Recording media | HDD (external), HDD (internal/external)
High-definition capable | 4K/2K, 8K, High-definition, Full high-definition, Incapable
Hybrid cast | Capable, Incapable

In terms of the product cycle, when observing the market share by release month from the scanner

data as of March 2018, product models released in September 2017 still held about 30% of the market

share in March 2018, more than half a year after launch, while models released within one year of launch

held about 80%, those within one year and a half held 90%, and those within two years held almost

100%. In time series, the share of models released within a year and a half ranged from 80% to 90%, and

the share of models released within two years transitioned at 95% or more, indicating that the product

cycle is short compared to the frequency of base revisions of CPI (five years). It is conceivable that a

long period of time after launch may result in a significant difference in quality from the new model, or a

price drop greater than the difference in quality. For this reason, models after 24 months have passed

since the launch are excluded from the analysis.
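A small sketch of that screening step, assuming hypothetical scanner-data columns "month" and "release_month" (both datetime64):

```python
import pandas as pd

def within_24_months(df: pd.DataFrame) -> pd.DataFrame:
    # age of each model, in whole months since its release month
    age = ((df["month"].dt.year - df["release_month"].dt.year) * 12
           + (df["month"].dt.month - df["release_month"].dt.month))
    return df[age <= 24]
```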

The regression model is set up as a semi-logarithmic regression model with the average unit price as

an explained variable and with various characteristics such as specifications as explanatory variables.

The explanatory variables were selected by the stepwise method from the characteristic values using

scanner data of March 2018. For the month-over-month estimation, data from two consecutive months

are pooled and analyzed using a regression model weighted by sales quantities to estimate the price

relative between the two time points of which quality differences were adjusted.

As a result of the estimation, the result of the month-over-month estimation between November 2017

and March 2018 showed that the adjusted coefficient of determination adjusted for degrees of freedom

remained stable over 0.95 in all the periods, indicating that its applicability to the hedonic regression

model is good.

Figure 2 shows a comparison between the 2015-base CPI and the results of the month-over-month

provisional calculation by the hedonic price index. Although there are differences in product models and

price levels between the current CPI based on the specification designation method and the hedonic price

index based on scanner data, the month-over-month provisional calculation values based on the hedonic

price index show a difference of 0.4 to 4.7 points from the current CPI. As a result of the calculation, it

was thought that the hedonic regression model using scanner data would enable stable quality adjustment

and contribute to improving the accuracy of statistics, and therefore scanner data was used for TV sets in

the 2020-base revision.

Figure 2: Comparison of the 2015-base CPI and calculation values

For PC printers and video recorders, a fixed-specification method is used rather than a hedonic regression model. This reflects the following characteristics: these items have a long new-product cycle, little difference in quality between old and new products, prices that can be explained by a small number of specifications, and small weights.

3. Comparison of results using big data (the 2020-base) with results from field surveys (the 2015-

base)

(1) Web scraping

For items using web scraping from the 2020-base, price collection conditions and the number of

collected prices were compared with those of the 2015-base as shown in the table below, and the number

of collected prices has increased significantly. Item Hotel charges Base 2015 Base 2020 Base

Collection conditions (main)

Prices on Friday and Saturday of the week including the 5th of every month

Prices of 1st to 31st of every month purchased two months in advance of accommodation

Number of collected prices 640 About 1 million

Item Airplane fares Base 2015 Base 2020 Base

Collection conditions (main)

One flight each by adopted section and airline

All flights by adopted section and airline

Number of collected prices 2,604 About 2.5 million

Item Charges for package tours to overseas Base 2015 Base 2020 Base

Collection conditions (main)

One flight by adopted city and travel company

All flights by adopted city and travel company

Number of collected prices 372 About 200,000

With regard to hotel charges, from January 2020 to July 2021, a comparison of the price index in the

2020-base for these items with the price index in the 2015-base (converted value as 2020 year = 100)

yielded the following results.

The 2015-base index fell sharply in August 2020. On the other hand, the decline in the 2020-base index over the same period was more gradual. This is because the impact

of the government’s travel assistance program (reduction of hotel charges), which began in late July, was

reflected from July in the 2020-base index, whereas the index of 2015, which only covered prices for a

specific two days in early every month, did not show the impact of the program in July but reflected it

from the following August. Web scraping has made it possible for policy effects to be reflected in the index

in a timely manner.

In addition, the difference in the movements of the two indices from November to December 2020 may

also be affected by the difference in the scope of accommodation dates covered and the timing of price

collection. In the index of 2015, which only covers prices for a specific two days, the calendar around the

survey date has affected the indices, but the introduction of web scraping has made it possible to cover all

days of accommodation, which has made it possible to produce more stable indices.

With regard to travel services to which web scraping is introduced, it has become possible to produce

more stable and appropriate indices by expanding coverage in general. “Hotel charges” were excluded

from the price collection surveys conducted, which contributed to reducing the burden on collectors and

local government officials.

(2) Scanner data

The table below compares the collection time of prices and the number of collected prices for items that use scanner data from the 2020-base with those in the 2015-base; the number of collected prices increased considerably.

[Line chart: hotel charges price index, 2020-base vs 2015-base (converted value); vertical axis 70.0 to 130.0.]

 | 2015 Base (field collection) | 2020 Base (scanner data)
Collection time and price | Price on any one of Wednesday, Thursday or Friday of the week including the 12th of each month | Prices from 1st to 31st of each month
Item | Video recorders / PC printers / TV sets | Video recorders / PC printers / TV sets
Number of collected product models | 6 / 1 / 8 | 23 / 46 / 600
Number of stores for collection | 186 / 172 / 186 | About 2,600 / About 2,600 / About 2,600
Number of collected prices | 186 / 172 / 186 | About 30,000 / About 80,000 / About 240,000

When comparing the price index in the 2020-base for these items with the price index in the 2015-base

(converted value as 2020 year = 100) from January 2020 to July 2021, the following results were obtained.

・ TV sets (hedonics method)

While the prices of some specific product models are collected for the index of 2015, the 2020-base

index covers all models (including online sales) included in the scanner data, so that the price trend

after quality adjustment can be captured by the specification information. Specifically, the 2015-base

index shows a downward trend from the spring of 2020 until the end of the year, while the 2020-base

index shows an upward trend. The movement of the 2020-base index is also in line with the

presumption that demand for televisions at home increased during this period, along with increased

time at home.

・ PC printers (fixed specification method)

As the 2015-base index collects the price of only one specific product model, the index depends solely on that model, whose price increased in September 2020. On the other hand, the 2020-

base index can capture models whose prices have increased since around May 2020 because multiple

models that fell under the selected specifications (including online sales) are included. Specifically, the

movement of the 2020-base index is consistent with the presumption that since the spring of 2020, the

demand for PC printers at home increased owing to the spread of remote working and classes to prevent

the spread of COVID-19.

Based on the above, we believe that more appropriate index production has become possible for

recreational durable goods for which scanner data is newly used by the expansion of coverage and quality

adjustment using specification information. In addition, items for which the survey method was switched

to price collection by scanner data are excluded from the scope of surveys by enumerators, and this

contributes to reducing the burden on prefectures and enumerators.

4. Study to expand the use of big data

In light of the expansion of online sales, improvement of information-gathering technology, and further

deterioration of the field survey environment, it is necessary to accelerate the use of big data for the CPI.

Therefore, we will continue to study to make use of big data. In doing so, it is necessary to take into

consideration newly occurring costs and issues, as well as the division of roles between field collection and

prefectural surveys, and to prioritize areas that are expected to be cost-effective against budgetary

constraints.

The items under consideration include white goods, foods, medical supplies, daily necessities and

clothing. Of these, data for some items of white goods have already been shifted to scanner data, but it is

expected that the extension to electric rice-cookers and microwave ovens will contribute to reducing the

field survey burden on enumerators in the future. Scanner data is also expected to be used for food, medical

supplies and daily necessities. On the other hand, in the case of foods, for example, there is no scanner data

for prepared food. Therefore, the use of scanner data for some items may not substantially reduce the

burden on enumerators.

For clothing, we are considering web scraping to collect prices for items such as one-piece dresses,

slacks and children’s trousers, in light of the growing size of the online sales market and the percentage of

purchases. As web scraping data for clothing contains a large number of related products in addition to the

clothing being sought, it is necessary to extract equivalent products from these products, but since the

necessary codes and names are often not present, it is difficult to filter them mechanically and it is not

practical to extract them manually. Therefore, we are currently studying the construction of a machine

learning model that automatically classifies products into equivalent products based on product descriptions

(about 100 to 400 words) and image information.

To date, as for analyses using text information, we are verifying methods such as logistic regression,

gradient boosting (Light GBM), and kernel SVM as models for classifying materials (cotton, chemical

fiber, etc.), lengths (full length, short, etc.), seasons (spring/summer, fall/winter, etc.), and patterns (plain,

floral, etc.). We are also verifying methods for analysis using image information such as ResNet and

EfficientNet.

Although these methods can ensure a certain level of classification accuracy, practical applications

require reducing the amount of images and shortening the computation time because of the large data

capacity of images, and increasing the number of companies targeted for web scraping to secure a share of

sales.

5. Conclusion

This paper introduced the expansion of the use of big data in the 2020-base revision. The use of big data

has contributed to improving statistical accuracy by expanding coverage and reducing the burden on

prefectures and enumerators. We will continue to conduct wide-ranging studies for accuracy improvement

of the CPI and efficient price collection.

Expanding the use of Big Data for CPI in Japan

The Statistics Bureau of Japan (SBJ) has been utilizing big data to calculate the consumer price index (CPI) and has greatly expanded the scope since the 2020-base year. In the 2015-base year, the index was calculated using scanner data for four items: “personal computers (laptop)”, “personal computers (desktop)”, “tablet computers” and “cameras”. From the 2020-base, three items, “video recorders”, “PC printers” and “TV sets” were added to the index using scanner data.

Languages and translations
English

Expanding the use of Big Data for CPI in Japan

Seitaro Tanimichi, Takuya Shibata

Statistics Bureau of Japan

May 2023

Prepared for the Meeting of the Group of Experts on Consumer Price Indices

UNECE, June 2023, Geneva

Summary

The Statistics Bureau of Japan (SBJ) has been utilizing big data to calculate the consumer price index

(CPI) and has greatly expanded the scope since the 2020-base year.

In the 2015-base year, the index was calculated using scanner data for four items: “personal computers

(laptop)”, “personal computers (desktop)”, “tablet computers” and “cameras”. From the 2020-base, three

items, “video recorders”, “PC printers” and “TV sets” were added to the index using scanner data.

The SBJ has been conducting experimental studies and pilot tests for the use of web scraping since 2015,

and from the 2020-base, began actually producing indices for travel services (“airplane fares”, “hotel charges”

and “charges for package tours to overseas”).

By expanding coverage, the use of big data has made it possible to produce more appropriate indices, with the number of collected prices increased significantly compared with previous field surveys, and to reduce the burden on local governments and price collectors.

This paper introduces a comparison of the 2020-base results using big data and the 2015-base results

using field surveys for the same items, as well as the current status of studies aimed at expanding the use of

big data.

1. Introduction

In the 2020-base revision of CPIs in Japan, the use of scanner data was expanded and internet sales prices

by web scraping were newly adopted. In order to expand the use of big data, in light of the increase in online

shopping in recent years and the development of information-gathering technology, around 2015 the SBJ

started specific studies on the use of scanner data and the collection of online sales prices by web scraping.

For the items to be adopted, we narrowed down the candidates by comparing indices created from trial collection data with the current indices and by considering the percentage of online purchases. As a result, it was decided to expand the use of scanner data for recreational durable goods and, for travel services (airplane fares, hotel charges, and charges for package tours to overseas), to shift from the previous price surveys to the collection of online sales prices by web scraping.

In addition to confirming that there were no legal problems such as copyright with web scraping, we

requested the cooperation of site operators, improved the collection timing, and began operation in January

2020. Since August 2021, the SBJ has published indices calculated by expanding the use of such big data.

In this paper, we present the verification of the production of indices for items by using big data in the

2020-base and the status of studies toward the further use of big data in the 2025-base.

History of expanded use of big data in base revisions of the CPI

2000-base: Used scanner data for “personal computers (desktop)” and “personal computers (laptop)”

2005-base: Added scanner data for “cameras”

2010-base: Included prices collected by scanner data for “tablet computers” in “personal computers (laptop)”

2015-base: Separated “tablet computers” from “personal computers (laptop)”

2020-base: Used scanner data for “video recorders”, “PC printers” and “TV sets”; used web scraping data for “airplane fares”, “hotel charges” and “charges for package tours to overseas”

2. Details of studies and calculation methods of price indices using big data

(1) Use of web scraping data: example of “hotel charges”

In considering the use of web scraping for hotel charges, we conducted a questionnaire survey to examine trends in purchasing methods, reservation timing, accommodation plans, the selection of collection sites, and so on. We also conducted price collection and index production by web scraping on a trial basis and compared the result with the index from conventional price surveys. As a result,

・ The largest number of reservations were made via the Internet, so capturing the price trend of internet sales appropriately captures the price trend of hotel charges.

・ We confirmed that web scraping can stably collect internet reservation prices from each travel

booking website.

・ We expected that the huge number of internet sales prices could be accurately reflected in the indices, including quality adjustment, and that collecting daily prices by web scraping would contribute to improving the accuracy of the indices.

Therefore, we decided to use the internet sales prices.

(Price collection sites)

According to the questionnaire results, the largest number of people used travel booking websites rather than the websites of hotels. Therefore, based on the transaction volumes handled by major travel agencies, the booking websites of the travel agencies with the highest shares of transaction volume were selected for price collection by web scraping. In addition, as web scraping requires individual settings based on each website's structure, it is practical and efficient to collect from a comprehensive booking site, from which many prices can be collected at once.

Table 1: Reservation time and method (results of the questionnaire)

(Rows: reservation method; columns: reservation time; N = 2,448)

Reservation method        Within a week   One to three    One month or    Unknown   Total
                                          weeks before    more before
Called hotels directly          3%              4%              5%           1%       13%
Website of hotels               2%              7%             12%           1%       21%
Travel booking site             7%             21%             29%           2%       59%
Over the counter                0%              1%              2%           0%        3%
Others                          0%              0%              1%           0%        1%
Unknown                         0%              0%              1%           2%        3%
Total                          12%             33%             50%           6%      100%

(Accommodation plans and price collection time)

Taking into account the release timing of accommodation plans at travel agencies and the timing of consumers' purchases, daily prices in each month are used for two plan types: Japanese-style rooms at ryokan (Japanese-style inns) with one night and two meals, and Western-style rooms at hotels with one night and breakfast. Plans whose prices are extremely high (or extremely low owing to a sale) relative to typical hotel charges are excluded in the outlier-exclusion process.

Table 2: Cross table of room types and meal types (results of the questionnaire)

(Rows: meal type; columns: room type; N = 2,448)

Meal type                              Western-   Japanese-   Japanese-Western   Others   Total
                                       style      style       style
No meals                                  24%         4%             1%             1%      29%
With breakfast                            24%         3%             1%             0%      29%
With breakfast and dinner                 11%        22%             7%             0%      40%
Breakfast, lunch and dinner included       1%         1%             0%             0%       2%
Others                                     0%         0%             0%             0%       0%
Total                                     60%        30%             9%             1%     100%

As for the price collection time, in principle, prices are collected at the beginning of the month two months before the accommodation date. This is because, in the web scraping results obtained during the pilot study, collection one month before the accommodation date on some sites yielded abnormally high average prices for some accommodations compared with collection two months before, since low-priced plans could not be collected owing to full occupancy.

In addition, according to the results of long-term web scraping conducted between August 2017 and March 2018, limited to 30 accommodation facilities, the following trends were observed in the number of facilities whose prices could be collected, and it was also found that there is a seasonal limit on advance reservations (Table 3).

・ Prices four months ahead were not listed on the booking site for about 10% of accommodations, and prices six months ahead for about half of them, so those prices could not be collected.

・ Especially before November, prices for the following April onward were posted considerably less often than for nearer months (see Table 3), showing a gap in the prices posted on the site around the change of the fiscal year.

Table 3: Number of accommodation facilities capable of price collection (N = 30)

Collection month     Reservation month (months ahead of collection month)
                     1     2     3     4     5     6     7     8     9     10    11
2017 Aug             30    29    29    28    25    18    14    2     2     2     1
2017 Sep             30    30    29    26    23    16    4     2     2     1     1
2017 Oct             30    30    30    27    22    7     3     2     1     1     1
2017 Nov             30    30    29    26    17    10    5     4     2     2     1
2017 Dec             30    29    28    24    22    14    7     5     5     3     3
2018 Jan             29    29    27    26    26    14    9     6     5     5     5
2018 Feb             29    28    28    27    26    18    12    5     5     5     3
2018 Mar             29    29    28    27    26    17    10    6     6     3     2
Average              30    29    29    26    23    14    8     4     4     3     2
Collection rate      100%  99%   96%   89%   79%   48%   27%   14%   12%   9%    7%

(Accommodation facilities)

Based on the number of guests and facility capacity by travel destination (prefecture) in the Overnight Travel Statistics Survey (official statistics by the Japan Tourism Agency), about 400 representative accommodations are selected.

Price collection by web scraping is not subject to the upper limit on the number of target facilities that resource constraints impose on field surveys. However, unrestricted access to websites to obtain internet sales prices is not acceptable in light of the load on the sites. Therefore, it is necessary to set an appropriate number of target facilities.

In the pilot study, the standard error rate of the geometric average price was calculated using the experimentally collected data table, taking into account the effect on the price index. As a result, the number of facilities was set at 400, since the decline in the standard error rate as facilities were added almost stopped and leveled off once the number of facilities exceeded 400.
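As an illustration only, the following Python sketch shows how such a check might look; the prices are synthetic stand-ins for the experimentally collected data table, and the helper name is hypothetical rather than the actual computation used.

```python
import numpy as np

def geo_mean_se_rate(log_prices: np.ndarray) -> float:
    # The standard error of the mean of log prices approximates the
    # relative standard error (standard error rate) of the geometric mean.
    return float(np.std(log_prices, ddof=1) / np.sqrt(len(log_prices)))

rng = np.random.default_rng(0)
collected = rng.normal(loc=9.2, scale=0.5, size=5000)  # synthetic log prices

# The standard error rate levels off as the number of facilities grows.
for n_facilities in (100, 200, 400, 800):
    sample = rng.choice(collected, size=n_facilities, replace=False)
    print(n_facilities, round(geo_mean_se_rate(sample), 4))
```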

(Calculation method of indices)

Using a two-month data set for the current month ($t$) and the previous month ($t-1$), the price indices are calculated according to the following procedures (1) to (4).

(1) Exclusion of outliers

In price collection, as all plans that match the conditions are collected, extremely high or low prices may be collected. Plans in such price ranges differ greatly in quality from the other plans, or may be only temporarily cheap, as with a limited-time sale, so it is considered appropriate to exclude them as outliers when producing price indices. The following procedure is therefore adopted.

(a) Define the individual prices as $P_{s,a,b,c}$ by booking website ($s$), accommodation date ($a$), accommodation facility ($b$) and plan ($c$), and convert them to logarithms:

$$Y_{s,a,b,c} = \log\left(P_{s,a,b,c}\right)$$

(b) Calculate average prices and standard deviations by booking website, accommodation date and accommodation facility ($N_{s,a,b}$ is the number of plans):

$$\bar{Y}_{s,a,b} = \frac{1}{N_{s,a,b}} \sum_{c=1}^{N_{s,a,b}} Y_{s,a,b,c}, \qquad \sigma_{s,a,b} = \sqrt{\frac{1}{N_{s,a,b}-1} \sum_{c=1}^{N_{s,a,b}} \left(Y_{s,a,b,c} - \bar{Y}_{s,a,b}\right)^{2}}$$

(c) Any individual price that differs from the average by more than three times the standard deviation for its booking website, accommodation date and accommodation facility is treated as an outlier:

$$\left|Y_{s,a,b,c} - \bar{Y}_{s,a,b}\right| > 3\sigma_{s,a,b}$$
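A minimal sketch of steps (a) to (c), assuming a long-format pandas table with one row per collected plan price; the column names and the synthetic prices are illustrative, not the production implementation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# One (site, stay date, facility) cell with 30 ordinary plan prices
# plus one extreme sale price appended at the end.
prices = np.append(rng.normal(12_000, 800, size=30).round(), 1_000.0)
df = pd.DataFrame({"site": "A", "stay_date": "2023-06-01",
                   "facility": "inn1", "price": prices})

df["y"] = np.log(df["price"])                      # (a) Y = log(P)
grp = df.groupby(["site", "stay_date", "facility"])["y"]
mean = grp.transform("mean")                       # (b) group average of logs
std = grp.transform("std")                         # (b) group standard deviation

clean = df[(df["y"] - mean).abs() <= 3 * std]      # (c) drop |Y - Ybar| > 3 sigma
print(len(df) - len(clean), "outlier(s) removed")
```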

(2) Creation of a data table

For the individual prices excluding outliers, average prices for each booking website, accommodation date and accommodation facility are calculated, and a data table with these as attributes is created ($N'_{s,a,b}$ is the number of prices excluding outliers):

$$\bar{Y}'_{s,a,b} = \frac{1}{N'_{s,a,b}} \sum_{c=1}^{N'_{s,a,b}} Y_{s,a,b,c}$$

(3) Missing value imputation

Even after data cleaning, if no individual prices are displayed on the site when searching with a given reservation date and accommodation, the average price under that search condition cannot be calculated, which causes missing values in the data table. If missing values were simply ignored when calculating the average price, missingness that differs by day of the week would no longer be random, biasing the average price. Attention must also be paid to the stage at which imputation is performed, because the result of the index calculation may change depending on the order in which averages are calculated. Therefore, a method of estimating and imputing missing values by regression analysis on the data set of actually measured values (regression imputation) is adopted.

As the index calculation assumes a monthly chain-linking method, performing the regression analysis on a data set for two consecutive months allows the same regression coefficients to adjust for the month-to-month variation in average prices caused by the entry and exit of accommodations, such as facilities newly collected in the current month or facilities that no longer accept reservations from the current month.

(a) Using the data table aggregated in (2), regression analysis is performed with the price $\bar{Y}'_{s,a,b}$ as the explained variable and booking website, accommodation date and accommodation facility as explanatory variables (dummy variables):

$$\bar{Y}'_{s,a,b} = \alpha + \boldsymbol{\beta}_{s} \cdot \boldsymbol{x}_{s} + \boldsymbol{\beta}_{a} \cdot \boldsymbol{x}_{a} + \boldsymbol{\beta}_{b} \cdot \boldsymbol{x}_{b} + \varepsilon$$

Explanatory variables:

Booking website: $\boldsymbol{x}_{s} = (x_{s,1}, \ldots, x_{s,S-1})$, where $S$ is the number of booking websites

Accommodation date: $\boldsymbol{x}_{a} = (x_{a,1}, \ldots, x_{a,A-1})$, where $A$ is the total number of days in the current and previous months

Accommodation facility: $\boldsymbol{x}_{b} = (x_{b,1}, \ldots, x_{b,B-1})$, where $B$ is the number of accommodation facilities

(b) Based on the estimated regression model, for the combinations of booking website, accommodation date and accommodation facility whose prices are missing, estimated prices $\hat{y}_{\mathrm{mis}}$ are calculated from the attribute information (booking website: $\boldsymbol{x}'_{s}$, accommodation date: $\boldsymbol{x}'_{a}$, accommodation facility: $\boldsymbol{x}'_{b}$) and substituted as imputed values:

$$\hat{y}_{\mathrm{mis}} = \hat{\alpha} + \hat{\boldsymbol{\beta}}_{s} \cdot \boldsymbol{x}'_{s} + \hat{\boldsymbol{\beta}}_{a} \cdot \boldsymbol{x}'_{a} + \hat{\boldsymbol{\beta}}_{b} \cdot \boldsymbol{x}'_{b}$$
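The regression imputation can be sketched as follows, assuming the data table of average log prices sits in a pandas DataFrame and the dummy coding is left to the formula interface; the cell labels and effect sizes are invented for illustration.

```python
import itertools
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)

# Hypothetical table of average log prices: 2 sites x 4 stay dates x
# 2 facilities, with two cells missing (e.g. fully booked).
cells = list(itertools.product(["siteA", "siteB"],
                               ["d1", "d2", "d3", "d4"],
                               ["f1", "f2"]))
obs = pd.DataFrame(cells, columns=["site", "stay_date", "facility"])
obs["y"] = (9.3 + 0.1 * (obs["site"] == "siteB")
            + 0.2 * (obs["facility"] == "f2")
            + rng.normal(0, 0.01, len(obs)))
obs.loc[[3, 10], "y"] = np.nan

# (a) regression on dummy variables for site, stay date and facility
model = smf.ols("y ~ C(site) + C(stay_date) + C(facility)",
                data=obs.dropna()).fit()

# (b) impute the missing cells from their attribute information
miss = obs["y"].isna()
obs.loc[miss, "y"] = model.predict(obs[miss])
```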

(4) The data set after imputation is used to calculate the geometric average prices for the current month ($t$) and the previous month ($t-1$). The resulting price relative is multiplied by the price index for the previous month to obtain the price index for the current month:

$$P_{t} = \left( \prod_{s,a,b} P_{t,s,a,b} \right)^{1/N_{t}} = \exp\left( \frac{1}{N_{t}} \sum_{s,a,b} \log\left(P_{t,s,a,b}\right) \right) = \exp\left( \frac{1}{N_{t}} \sum_{s,a,b} \bar{Y}'_{t,s,a,b} \right)$$

$$I_{t} = I_{t-1} \times \frac{P_{t}}{P_{t-1}}$$
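Step (4) then reduces to a geometric mean and a chain link, as in this small sketch with invented log-price arrays for the two months.

```python
import numpy as np

def geometric_average(log_prices: np.ndarray) -> float:
    # P_t = exp((1 / N_t) * sum of log prices), i.e. the geometric mean
    return float(np.exp(np.mean(log_prices)))

y_prev = np.array([9.30, 9.45, 9.50, 9.40])   # imputed logs, month t-1
y_curr = np.array([9.35, 9.50, 9.52, 9.48])   # imputed logs, month t

i_prev = 100.0                                # I_{t-1}
i_curr = i_prev * geometric_average(y_curr) / geometric_average(y_prev)
print(round(i_curr, 2))                       # I_t = I_{t-1} * P_t / P_{t-1}
```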

Figure 1 shows the calculation results of the verification. With missing values imputed, the index remains stable because the imputation adjusts for the differences in month-by-month average prices caused by differences in the set of facilities. To examine seasonality, we compared the index after imputation with the average of the published values for the four years from 2015 to 2018 and found that it generally captured seasonal movements. The index in August was lower than the published value because the published value in 2018 rose sharply owing to a calendar effect; reflecting daily prices collected by web scraping removes such temporary effects of the relationship between the survey date and the calendar. Conversely, the indices in December and January were higher than the published values, but this divergence arose because the published values did not reflect prices during the busy year-end and New Year holiday period, whereas the calculated values did. The calculation results are thus considered to reflect actual conditions.

Figure 1: Index calculation results

(2) Use of scanner data: examples of “TV sets”

Until the 2015-base, the price index of “TV sets” for the CPI was calculated using prices collected through the specification designation method in the Retail Price Survey. However, while high-quality TVs with higher resolution and larger screens are becoming more prevalent, demand for conventional TVs persists owing to the increasing number of single-person households and other factors, so the product range has diversified. To reflect these trends in the indices, we examined index creation by the hedonic method using scanner data, which does not rely on the specification designation method.

The following scanner data were used in the validation for the 2020-base revision.

・ Period: Monthly data from October 2017 to March 2018

・ Type: Liquid crystal display TV (not including organic EL TV)

・ Region: Whole of country (about 2,500 outlets), including online shops

・ Data size: Approximately 750 models, Unit sales: Approximately 220,000/month average

・ Average unit price and sales quantities by model (total of outlet sales and online sales)

・ Characteristics of each model, such as screen size and number of pixels displayed

Specification                 Examples
Release month                 Year, Month
Tuner shape                   Separate type, Integrated type, None
Screen size                   3-inch type to 75-inch type
Number of pixels displayed    1366x768, 1920x1080, 3840x2160, etc.
D connector                   D4x1, D5x1, None
PC input                      D-Sub, None
Communication terminal        LAN, None
Card slot                     SDXC, None
HDD capacity                  0 GB to 2,000 GB
Internet                      Capable, Incapable
Wireless function             IEEE802.11a/n, None
Audio output                  10W+10W, 3W+3W, 5W+5W, etc.
HDMI connector                0 to 4
Link function                 Available, Unavailable
Drive speed                   Constant speed, Double speed
Recording media               HDD (external), HDD (internal/external)
High-definition capable       4K/2K, 8K, High-definition, Full high-definition, Incapable
Hybrid cast                   Capable, Incapable

In terms of the product cycle, observing the market share by release month in the scanner data as of March 2018, product models released in September 2017 still held about 30% of the market share in March 2018, more than half a year after launch; models released within one year of launch held about 80%, those within a year and a half about 90%, and those within two years almost 100%. Over the time series, the share of models released within a year and a half ranged from 80% to 90%, and the share of models released within two years remained at 95% or more, indicating that the product cycle is short compared with the five-year frequency of CPI base revisions. A long period after launch may mean a significant quality difference from new models, or a price drop greater than the quality difference. For this reason, models launched more than 24 months earlier are excluded from the analysis.

The regression model is a semi-logarithmic model with the average unit price as the explained variable and various characteristics such as specifications as explanatory variables. The explanatory variables were selected by the stepwise method from the characteristic values in the scanner data of March 2018. For the month-over-month estimation, data from two consecutive months are pooled and analyzed in a regression model weighted by sales quantities to estimate the quality-adjusted price relative between the two time points.
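One common way to obtain such a quality-adjusted price relative from pooled two-month data is a time-dummy regression; the sketch below shows that structure with a handful of invented characteristics and values, and is not the actual model or variable set used for the CPI.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical pooled scanner data for two consecutive months.
df = pd.DataFrame({
    "unit_price": [52_000, 51_000, 88_000, 86_500, 118_000, 119_500],
    "qty":        [900, 850, 400, 450, 120, 130],   # sales quantities
    "month_t":    [0, 1, 0, 1, 0, 1],               # 0 = previous, 1 = current
    "size_inch":  [32, 32, 43, 43, 55, 55],
    "is_4k":      [0, 0, 1, 1, 1, 1],
})
df["log_p"] = np.log(df["unit_price"])

# Semi-log model with a time dummy, weighted by sales quantities (WLS)
fit = smf.wls("log_p ~ month_t + size_inch + is_4k",
              data=df, weights=df["qty"]).fit()

# exp of the time-dummy coefficient is the quality-adjusted price relative
print(f"month-over-month relative: {np.exp(fit.params['month_t']):.4f}")
```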

In the month-over-month estimations between November 2017 and March 2018, the coefficient of determination adjusted for degrees of freedom remained stably above 0.95 in all periods, indicating that the hedonic regression model fits the data well.

Figure 2 shows a comparison between the 2015-base CPI and the provisional month-over-month calculations by the hedonic price index. Although product models and price levels differ between the current CPI based on the specification designation method and the hedonic price index based on scanner data, the provisional month-over-month values based on the hedonic price index differ from the current CPI by 0.4 to 4.7 points. From these results, it was judged that a hedonic regression model using scanner data would enable stable quality adjustment and contribute to improving the accuracy of the statistics, and scanner data was therefore adopted for TV sets in the 2020-base revision.

Figure 2: Comparison of the 2015-base CPI and calculation values

For PC printers and video recorders, a fixed-specification method is used instead of a hedonic regression model. This reflects the following characteristics: these items have a long new-product cycle, there is little quality difference between old and new products, their prices can be explained by a small number of specifications, and the items have small weights.

3. Comparison of results using big data (the 2020-base) with results from field surveys (the 2015-

base)

(1) Web scraping

For items using web scraping from the 2020-base, the price collection conditions and the number of collected prices were compared with those of the 2015-base as shown below; the number of collected prices has increased significantly.

Hotel charges
  Collection conditions (main): 2015-base: prices on Friday and Saturday of the week including the 5th of every month; 2020-base: prices for the 1st to 31st of every month, purchased two months in advance of the accommodation date
  Number of collected prices: 2015-base: 640; 2020-base: about 1 million

Airplane fares
  Collection conditions (main): 2015-base: one flight each by adopted section and airline; 2020-base: all flights by adopted section and airline
  Number of collected prices: 2015-base: 2,604; 2020-base: about 2.5 million

Charges for package tours to overseas
  Collection conditions (main): 2015-base: one flight by adopted city and travel company; 2020-base: all flights by adopted city and travel company
  Number of collected prices: 2015-base: 372; 2020-base: about 200,000

With regard to hotel charges, a comparison of the 2020-base price index with the 2015-base price index (converted value as 2020 year = 100) from January 2020 to July 2021 yielded the following results.

The 2015-base index fell sharply in August 2020, whereas the 2020-base index declined more gradually over the same period. This is because the impact of the government's travel assistance program (reduction of hotel charges), which began in late July, was reflected from July in the 2020-base index, whereas the 2015-base index, which covered prices for only two specific days early in each month, did not show the impact in July and reflected it only from the following August. Web scraping has thus made it possible for policy effects to be reflected in the index in a timely manner.

In addition, the difference in the movements of the two indices from November to December 2020 may also reflect differences in the range of accommodation dates covered and in the timing of price collection. In the 2015-base index, which covers prices for only two specific days, the calendar around the survey date affected the indices; the introduction of web scraping has made it possible to cover all accommodation dates and thus to produce more stable indices.

With regard to the travel services to which web scraping was introduced, expanding coverage has generally made it possible to produce more stable and appropriate indices. “Hotel charges” were removed from the field price collection surveys, which contributed to reducing the burden on price collectors and local government officials.

[Figure: “Hotel charges” index, 2020-base and 2015-base (converted values)]

(2) Scanner data

The table below compares the price collection times and the number of collected prices for items using scanner data from the 2020-base with those in the 2015-base; the number of collected prices increased considerably.


2015-base
  Collection time and price: price on any one of Wednesday, Thursday or Friday of the week including the 12th of each month
  Number of collected product models: video recorders 6, PC printers 1, TV sets 8
  Number of stores for collection: video recorders 186, PC printers 172, TV sets 186
  Number of collected prices: video recorders 186, PC printers 172, TV sets 186

2020-base
  Collection time and price: prices from the 1st to 31st of each month
  Number of collected product models: video recorders 23, PC printers 46, TV sets 600
  Number of stores for collection: about 2,600 for each item
  Number of collected prices: video recorders about 30,000, PC printers about 80,000, TV sets about 240,000

When comparing the price index in the 2020-base for these items with the price index in the 2015-base

(converted value as 2020 year = 100) from January 2020 to July 2021, the following results were obtained.

・ TV sets (hedonic method)

While the 2015-base index collected the prices of a few specific product models, the 2020-base index covers all models (including online sales) in the scanner data, so the price trend after quality adjustment using the specification information can be captured. Specifically, the 2015-base index shows a downward trend from the spring of 2020 until the end of the year, while the 2020-base index shows an upward trend. The movement of the 2020-base index is also in line with the presumption that demand for televisions at home increased during this period, along with increased time at home.

・ PC printers (fixed specification method)

As the 2015-base index collected the price of only one specific product model, the index moved only when that model's price increased, in September 2020. On the other hand, the 2020-base index captured models whose prices had been rising since around May 2020, because it includes multiple models (including online sales) that fall under the selected specifications. The movement of the 2020-base index is consistent with the presumption that, from the spring of 2020, demand for PC printers at home increased owing to the spread of remote working and remote classes to prevent the spread of COVID-19.

Based on the above, we believe that the newly adopted scanner data has made more appropriate index production possible for recreational durable goods, through expanded coverage and quality adjustment using specification information. In addition, items whose survey method was switched to price collection by scanner data are excluded from the scope of surveys by enumerators, which contributes to reducing the burden on prefectures and enumerators.

4. Study to expand the use of big data

In light of the expansion of online sales, improvements in information-gathering technology, and the further deterioration of the field survey environment, it is necessary to accelerate the use of big data for the CPI, and we will therefore continue studying how to make use of it. In doing so, it is necessary to take into consideration newly arising costs and issues, as well as the division of roles between field collection and prefectural surveys, and to prioritize areas that are expected to be cost-effective under budgetary constraints.

The items under consideration include white goods, foods, medical supplies, daily necessities and clothing. Of these, some white goods items have already been shifted to scanner data, and extending this to electric rice cookers and microwave ovens is expected to further reduce the field survey burden on enumerators. Scanner data is also expected to be used for foods, medical supplies and daily necessities. On the other hand, in the case of foods, for example, there is no scanner data for prepared food, so the use of scanner data for some items may not substantially reduce the burden on enumerators.

For clothing, we are considering web scraping to collect prices for items such as one-piece dresses, slacks and children's trousers, in light of the growing size of the online sales market and the percentage of purchases made online. Web scraping data for clothing contains a large number of related products in addition to the clothing being sought, so equivalent products must be extracted from them; however, the necessary codes and names are often absent, making mechanical filtering difficult, and manual extraction is impractical. Therefore, we are currently studying the construction of a machine learning model that automatically classifies products into equivalent products based on product descriptions (about 100 to 400 words) and image information.

To date, for analyses using text information, we are verifying methods such as logistic regression, gradient boosting (LightGBM) and kernel SVM as models for classifying materials (cotton, chemical fiber, etc.), lengths (full length, short, etc.), seasons (spring/summer, fall/winter, etc.) and patterns (plain, floral, etc.). For analyses using image information, we are verifying methods such as ResNet and EfficientNet.
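As a minimal sketch of the text side, the following trains one of the verified model types (logistic regression) on TF-IDF features; the product descriptions and labels are invented, and this does not reproduce the models actually under verification.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented product descriptions labeled with one attribute (material);
# in practice a separate classifier would be trained per attribute.
texts = [
    "soft cotton one-piece dress with floral print",
    "100% cotton children's trousers, plain",
    "polyester slacks, quick-drying chemical fiber blend",
    "stretch chemical fiber leggings for fall and winter",
]
labels = ["cotton", "cotton", "chemical fiber", "chemical fiber"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(texts, labels)
print(clf.predict(["plain cotton shirt dress"]))
```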

Although these methods achieve a certain level of classification accuracy, practical application requires reducing the volume of image data and shortening computation time, given the large size of images, and increasing the number of companies targeted for web scraping to secure sufficient coverage of sales.

5. Conclusion

This paper introduced the expansion of the use of big data in the 2020-base revision. The use of big data has contributed to improving statistical accuracy by expanding coverage, and has reduced the burden on prefectures and enumerators. We will continue wide-ranging studies to improve the accuracy of the CPI and the efficiency of price collection.