March 2019 · Number 7

Working with Administrative Tax Data: A How-to-Get-Started Guide

Anne Brockmeyer

Introduction

Administrative tax data at the taxpayer or transaction level (henceforth tax data for simplicity) are a great resource for studying taxpayer behavior, assessing responses to changes in tax policy and administration, and deriving lessons for optimal tax policy design (see Pomeranz & Vila-Belda (forthcoming), and Slemrod 2017 for reviews of the emerging literature). From an operational perspective, tax data can be used to prepare technical assistance and investment projects (e.g. identify the most important tax offices from a revenue perspective and assess compliance gaps), monitor implementation of projects (e.g. record the number of declarations filed electronically in response to IT investments), and evaluate project-supported policy reforms (e.g. estimate the increase in tax revenue achieved through a tax rate change). Tax data can also be used to monitor/evaluate non-tax development projects (e.g. the effect of micro loans on business growth) and study a multitude of other questions, e.g. related to intergenerational mobility, firm production networks, or who becomes an inventor.

Governments are increasingly open to providing access to tax data and to collaborating in the analysis of these data. This goes hand in hand with an increasing desire by development organizations and researchers to make use of such data for policy analysis. This note has two objectives. First, it provides a primer on what tax data is, describing the different types of tax data, modes of accessing tax data and briefly reviewing some key upsides and downsides of working with tax data. Second, the note provides practical advice on how to get started in working with tax data – from building a first contact with the Tax Authority (TA) to pitching a project and formulating a data request. I conclude with a brief discussion on publishing studies using tax data. The note draws on my PhD research and my experience at the World Bank, where I work with tax data from various countries, including on a new pilot project between different World Bank units (the Macro, Trade and Investment Global Practice, the Research Department and the Global Tax Team) which investigates what can be learned from comparing micro tax data across countries.

The note is intended both for staff in development partner organizations (be it multilateral, bilateral or NGOs) working on tax, or desiring to use tax data for their work, as well as for graduate students and junior researchers hoping to conduct policy-relevant research using tax data. For a more academic guide to collaborating with TAs, and insights from a survey with 70 researchers working with administrative tax data, see Pomeranz and Vila-Belda (forthcoming).

1) A Primer on Tax Data

What do we mean by administrative tax data? This section provides an overview of the different types of tax data, the modes of accessing tax data and some upsides and downsides of tax data compared to other types of data.

Different Types of Tax Data

Tax data are collected by the TA in the process of exercising its functions – collecting government revenue. I focus here on the micro-level (non-aggregated) version of this data and provide below a list of the broad categories of tax data. For an illustration with data from Costa Rica, see the context and data sections in Brockmeyer et al 2019 and Brockmeyer & Hernandez 2018.

• Tax Register: The register contains the list of all registered taxpayers at a point in time, with the unique identifier, name, geographic location (region or precise address), and (for firms) sector and legal form. Using registration and deregistration records (which are usually based on a specific form available on the TA’s website), it is possible to reconstruct the register for previous points in time. There may be separate registers for firms and individual taxpayers.¹
• Taxpayer self-assessment declarations: These are the declarations that taxpayers submit in regular intervals (e.g. monthly for the value-added tax, annually for the corporate income tax and the personal income tax, sometimes quarterly for simplified tax regimes). The self-assessment declarations contain information about the tax base (income/wealth/consumption/value-added), deductions and exemptions, the tax liability, and tax payment. Note that the tax payment is sometimes recorded on a separate payment form and not on the self-assessment declaration.
• Informative declarations and withholding declarations: The TA collects information from third parties (i.e. agents other than the taxpayers) about the taxpayers’ transactions. This third-party information may be submitted to the TA by public procurement agencies, by financial intermediaries (such as credit/debit card processing companies) that report taxpayers’ financial transactions (e.g. retailers’ sales through card machines), or in the form of VAT annexes in which firms need to detail their transactions with other firms. Some of these third parties are also designated as withholding agents, remitting a small fraction of the transaction amount they process as advance tax payment for the transaction partner (Velayudhan & Slemrod 2018 show that 85% of liabilities are remitted in this way). The informative and withholding declarations are often uniquely detailed (e.g. providing firm-firm-day transaction level information), but rarely provide information about the type of goods/services exchanged.
• Customs data: The customs authority maintains import and export records. The customs authority is usually an entity distinct from the TA. Although the two should collaborate closely so that the TA can cross-check taxpayers’ self-assessment declarations with customs records, this is not always the case. Still, it is worthwhile checking if the TA has customs data, or whether that data can be obtained from the customs authority directly. Similarly, public procurement agencies may have data sharing agreements with the TA.
• Process and HR data: The TA maintains records of its internal processes, such as tax audits (e.g. list of taxpayers selected for audit, allocation to tax officers, audit outcomes). Closely related to this is information on the TA staff (spell data on work history), remuneration and bonuses.

¹ The property cadaster is a special type of tax register, usually maintained not by the central government but by local governments.

All data except the HR data would have unique taxpayer identifiers and period identifiers (which can be month/fiscal year/quarter) that allow merging different datasets. The self-assessment, customs and informative declarations usually have monotonically increasing form numbers, which allow sorting and eliminating multiple filings per taxpayer-period. As the tax identifiers might be unique to the TA, it is not always possible to merge tax data with data from other government agencies (unless one of the two agencies can translate the identifier in “its” data into the identifier in the other data, or a merge can be done based on names and/or addresses). If a merge at the micro-level is not possible, a second-best approach is to merge semi-aggregated data at the sector and/or region level. But generally, the more disaggregated the data, the better.

In any case, the data should be de-identified securely but systematically, so that the panel structure of the data is preserved, and the same de-identification algorithm applies to all datasets, so that different datasets can be merged.
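The steps described above (eliminating multiple filings per taxpayer-period, de-identifying systematically, and merging datasets on the taxpayer identifier) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure of any particular TA: the field names (taxpayer_id, period, form_number, tax_liability, sector), the records and the key are all made up; a keyed hash (HMAC-SHA256) is just one possible way to obtain stable pseudonyms; and keeping the filing with the highest form number assumes that the latest filing supersedes earlier ones, which should be confirmed with the TA.

```python
import hashlib
import hmac

# Hypothetical secret key. In practice it would be generated and kept by the TA.
SECRET_KEY = b"key-held-by-the-TA"


def deidentify(taxpayer_id, key=SECRET_KEY):
    """Return a stable pseudonym: the same ID and key always give the same output."""
    digest = hmac.new(key, taxpayer_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def keep_latest_filing(declarations):
    """Keep one row per (taxpayer, period): the one with the highest form number.

    Assumption: a higher form number means a later filing that replaces earlier ones.
    """
    latest = {}
    for row in declarations:
        key = (row["taxpayer_id"], row["period"])
        if key not in latest or row["form_number"] > latest[key]["form_number"]:
            latest[key] = row
    return list(latest.values())


def merge_sector(declarations, register):
    """Attach the sector from the register to each declaration (left join on ID)."""
    sector_by_id = {r["taxpayer_id"]: r["sector"] for r in register}
    return [{**d, "sector": sector_by_id.get(d["taxpayer_id"])} for d in declarations]


# Made-up records for illustration only.
declarations = [
    {"taxpayer_id": "A001", "period": "2018", "form_number": 101, "tax_liability": 500},
    {"taxpayer_id": "A001", "period": "2018", "form_number": 157, "tax_liability": 450},
    {"taxpayer_id": "B002", "period": "2018", "form_number": 120, "tax_liability": 900},
]
register = [
    {"taxpayer_id": "A001", "sector": "retail"},
    {"taxpayer_id": "B002", "sector": "construction"},
]

# The same de-identification is applied to every dataset, so they remain mergeable.
decl_deid = [{**d, "taxpayer_id": deidentify(d["taxpayer_id"])} for d in declarations]
reg_deid = [{**r, "taxpayer_id": deidentify(r["taxpayer_id"])} for r in register]

panel = merge_sector(keep_latest_filing(decl_deid), reg_deid)
for row in panel:
    print(row)
```

Because the pseudonym depends only on the original identifier and the key, a taxpayer keeps the same pseudonym across years and across datasets, which is what preserves the panel structure. An unkeyed hash would be easier to reverse when identifiers are short and follow a known format, which is why the sketch uses a key.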
Modes of Accessing Tax Data

Different countries provide access to their tax data in different ways. From least to most restrictive, these are the options I have encountered:

1. The data is available online (believe it or not, this actually happens in some Scandinavian countries; Mexico publishes anonymized data > click on SAT mas abierto > Datos Anonimizados > Declaraciones anuales de personas morales).
2. The TA extracts and hands over the data to specific individuals/institutions under a Memorandum of Understanding (this is how researchers work with data from Senegal and Pakistan).
3. The TA extracts de-identified data for specific institutions under a Memorandum of Understanding (MoU), requiring that the data be considered confidential, with restricted access on a secure computer outside of the TA premises but regulated by a data security plan which, among other provisions, requires that the computer is not connected to the internet (e.g. some state governments in Brazil have provided data access this way).
4. The TA provides remote access to the data to selected/screened individuals via a secure server (this is theoretically possible, but I have not seen an example).
5. The TA provides access to the data onsite (e.g. at the Datalab at the UK TA (HMRC) and at the Ecuador TA). In this case, external partners can either work onsite (possibly via a research assistant) or work with a TA staff member who closely collaborates with the research team and runs do-files/scripts which are partly prepared remotely on simulated data.

An MoU may specify how and for what purpose the data can be used, detail procedures for safe handling of data, and any conditions on results publication (e.g. results must be such that no data point published is based on fewer than X observations, results need to be discussed with TA prior to publication). Data can be analyzed in STATA or R, depending on the mode of access and software available at the TA’s data lab. Given potential unavailability of STATA at government offices and the free availability of R, junior researchers should probably invest in R.

Upsides and Downsides of Tax Data

It is good to be mindful of a few characteristics of tax data when preparing to work with them.

Upsides

• Tax data contain the universe of the formal sector. Unlike survey data, they do not suffer from selective non-reporting at the top of the income distribution. Unlike census data, they contain very detailed information.
• The data is collected at high frequency and low marginal cost.
• Most types of tax data are now collected electronically, which minimizes errors in tax filing (e.g. through internal consistency checks in tax filing software) and data processing.
• As the data is the product of actual economic processes, it measures variables with high precision, unlike survey data in which respondents provide ballpark figures as their response has no meaningful consequences for themselves.

Downsides

• The fact that the data is directly economically relevant for those people who provide the data (mostly taxpayers or their transaction partners) also means that the data is not necessarily good in capturing real economic outcomes. Self-assessment declarations in particular capture reported outcomes. Informative declarations are more likely to capture real outcomes, as the reporting agent has less incentive to misreport and is often more tightly monitored.
• Tax data can be poor on demographic information on individuals and households unless they can be merged with other government data (studies in Denmark and Sweden have exploited the ability to merge across various types of administrative data).
• Various types of documentation are available to understand tax data (tax returns, manuals on how to file tax returns, tax laws and decrees that explain the tax system), but unlike survey/census data, tax data is not collected for primarily analytical purposes, and is thus not accompanied by researcher-friendly variable descriptions and codebooks. Understanding the data requires knowledge of the relevant language and a regular exchange with the TA on variable definitions, administrative practices and legislation.

2) How to Access Tax Data

Once a policy question that requires access to tax data is identified, some diplomacy and entrepreneurial spirit is usually necessary to make the project a reality. This note is focused on the technical aspects of accessing tax data rather than the conceptual questions studied, but a sample list of tax policy questions and data required to study them is provided in Table 1. Staff in development partner organizations working on tax will have come across many of these questions. Graduate students or junior researchers should replace “policy question” by “research question”, which would be derived from an understanding of the existing literature, remaining knowledge gaps and important questions that policy makers in low and middle-income countries face.

Below, I describe how to build a connection with a TA, make a successful project pitch, walk the tightropes of institutional politics, and prepare a data request.

Building a Connection

This section proposes some practical steps for building a connection with a TA and developing a joint project based on tax data. This is primarily intended for junior researchers.² Staff in development partner organizations will know which steps they can skip or have already completed.

² Also see Glennerster (2014, 2015) on how to create research partnerships with practitioners.

• Find a context/country that you know well or have some connection to, ideally one that is not yet over-researched, to avoid overlap and potential conflicts of interest with other research teams. Or, work as Research Assistant for a more senior researcher to gain experience working with tax data, and potentially identify a spin-off topic for your own research.
• Identify a contact person or local champion in the TA. Ideally, someone more senior who has a relationship of trust with the TA would introduce you. Hopefully your first contact person will soon loop some of her colleagues into the dialogue.
• Spend time to understand
   a) the country’s tax system, its particularities, and policy challenges [read World Bank and IMF reports, the country’s tax laws, reports by the TA and Ministry of Finance, press releases and discussions on government websites, and above all, talk to people];
   b) the data structure (here, I mean to try to understand the tax return forms and other data formats – postpone direct discussions about the data for a bit);
   c) then link a+b to the broader policy/research questions to study, derive a more specific formulation of the question and hypothesis, and define a methodological approach (which can be quasi-experimental, i.e. exploiting historical data and policy variation, a randomized field experiment, structural model estimation, or a combination of those).
• After many “learning meetings” and some informal discussions about the project, pitch your topic and methodological approach to your government counterpart. This can be in person or via videoconference. Then continue the dialogue by integrating feedback, adjusting your project to fit realities on the ground and policy needs, and, if needed, re-pitch your project (ideally at increasingly higher levels of the administration) until you have agreement on the project. Then start discussing the data request and associated logistics formally (you might have touched on this topic previously in an informal way).
• To make your project a reality, you either need high-level political buy-in (e.g. from the Director of the TA or a Deputy Minister of Finance – they can then request their technical staff to collaborate), or agreement from both the tax intelligence/IT department (which handles the data) and the technical staff who are experts on the topic the project focuses on (unless the latter are hierarchically superior to the data guardians and can request the data). The data vs policy split often corresponds to a TA vs Ministry of Finance split. Generally, it is ideal to have contacts in various parts of the TA and the Ministry of Finance, to ask background questions on legislation, administrative practice, get a second opinion on a topic etc.

In addition to the above, staff in development partner organizations could leverage synergies with technical assistance work, budget support lending (e.g. lending can support an MoU for data access, or a new data confidentiality law that stipulates ways for accessing data for research purposes), or investment projects (e.g. to support the establishment of a data warehouse or datalab, or otherwise improve the TA’s data infrastructure).

Ingredients for a Successful Project Pitch

Obviously, the key ingredient is a policy-relevant project which allows the government to improve on or learn something they would otherwise not be able to achieve, and which is in line with the TA’s mandate and actual policy challenges. Needless to say, the project should also strike the balance between being innovative yet realistically feasible.

In addition, it helps to weave into your pitch (e.g. slides, write-up) some of the following:

• Evidence that other countries or other government institutions in the same country provide access to their data for analytical purposes.
• Examples of other projects using tax data that are policy-relevant and have a positive policy impact. This can be evidence from work by other people (maybe a mix of well-known work by senior researchers and work by people more similar to your background). Ideally choose topics/methods similar to the ones you are pitching.
   o The more specific you can be about the policy impact of the example project(s), the better (e.g. sustained changes in the TA practices informed by the project, or sustained improvement in tax revenue or distributional fairness).
• If applicable, evidence of your own track record in working with TAs/with tax administrative data in other contexts, and of your policy impact.

In terms of capacity building, local ownership and quality of the project (and potentially also ease of accessing data), it can be a good idea to identify not just a local champion but an actual co-author (Juliana Londoño-Vélez and Pierre Bachas have fared very well with this).

Walking the Tightropes of Institutional Politics

This is tricky terrain, and the politics of each country and TA are different, but here are a few thoughts (again, more targeted to graduate students than development bureaucrats, who would have heard such advice before):

• Be mindful of “your” country’s development status and geopolitics when providing examples of successful collaborations in your pitch. Most governments like to be compared with “aspirational” peers, i.e. slightly more developed countries.
• Be mindful of internal politics in government agencies. External partners can have convening power (and the required innocence and independence) to bring competing government departments to the same table. Yet they might also inadvertently exacerbate internal challenges, e.g. by designing a field experiment that requires the collaboration of rival departments.
• Be mindful of the career concerns of your government counterparts. Depending on their position, career history and aspirations, some staff in the TA will have stronger incentives to collaborate with external partners and generate innovative findings than other staff. You can try to help ensure that the internal champions in your project get the visibility they deserve for their work, but also be careful about implicating them if part of the project turns out to be controversial.

Formulating a Data Request

Once the government counterparts have agreed to the project, a formal data request letter can be prepared. It would likely contain the following elements:

• Reiterate the (verbally agreed upon) purpose of the data use (e.g. policy question to be studied) and discuss the benefits of the study.
• Detail the data needed:
   o Type of dataset (e.g. annual corporate income tax declarations; if applicable, quarterly income tax declarations; all payment records related to the corporate income tax);
   o Sample (e.g. all corporations and unincorporated firms (i.e. self-employed individuals) in all tax offices);
   o Period covered (e.g. 2010-2018);
   o Identifiers: unique taxpayer ID (de-identified), tax year, declaration submission date, declaration number;
   o Variables: list the line items/boxes on the tax return that are needed (if the tax return isn’t too long, it’s often easiest to request all variables, which makes the request easier to deal with for the person extracting the data, prevents issues due to errors in variable selection, and limits the need for follow-up requests to add variables);
   o Any additional variables that need to be merged into the data (e.g. sector codes for all firms from the tax register).
• Specify the mode of access (see Modes of Accessing Tax Data above) if agreed upon or propose further discussions about the logistics.
• Whether or not you want to request an explicit ex-ante permission to be able to publish whatever results you find is a sensitive and context-specific question, but it is important to convey (at least implicitly) that you are not intending to prepare a top-secret report but rather a public good.

Once the data has been accessed, it will likely yield interesting results. Until then, it is important to maintain a regular exchange with the TA during the data analysis, communicate intermediate results, seek feedback and consult on the final dissemination strategy and policy discussion surrounding the results. After all, improving policy design – either directly or indirectly, by improving our knowledge of tax systems – should be a key objective of the analysis.

3) Publishing Findings Based on Tax Data

Whether or not governments should be consulted before publication of the results is a question up for debate. For staff in development partner organizations, it will usually be necessary to discuss research findings and dissemination strategies with the government before publishing. Academic researchers, however, might prefer to merely present research findings, as they worry that a consultation (or worse, a legal agreement) which gives government a veto right over the publication can harm the independence and objectivity of the study.

I have not encountered or heard of a situation in which a study was withdrawn, substantially altered or misrepresented due to government intervention. On the contrary, a strong ownership of the study’s findings improves the chances that the study attracts the attention of policy makers – in the country in question as well as in peer countries and practitioner fora – and leads to positive change. Besides, requesting data from a government without promising consultations on the publication of results risks thwarting from the outset some studies that would be of high public interest yet potentially controversial.

As for publication in academic journals, work with tax data presents the additional challenge that the data is highly confidential and cannot be published, which is contrary to editorial policy in most academic journals. Journals are generally happy to waive the data publication requirement for tax data, but this waiver must be requested at the time of submission to the journal. Researchers must also provide all replication codes and an explanation of how access to the data could be requested by other researchers desiring to replicate the findings in a published paper. There are ongoing discussions about the non-replicability problem for studies using confidential administrative data, so rules might become stricter in the future.

About the author(s): Anne Brockmeyer, Senior Economist, World Bank’s Macroeconomics, Trade, and Investment Global Practice
abrockmeyer@worldbank.org

Table 1: Typical Tax Policy Questions and Data Required to Answer Them

1
   Policy question: Effect of a change in the rate/base of tax X (on reported tax base/payment); optimal rate/base for tax X
   Required data: Taxpayer declarations for tax X, for some period around the tax policy change
   Other requirements to answer the question: Some variation in the applicability of the policy change (e.g. rate change applied only to certain types of taxpayers). Depending on whether optimal refers to revenue-maximizing or welfare-maximizing, answering this question may require estimates of other parameters, e.g. liquidity constraints

2
   Policy question: Effect of tax enforcement intervention/policy on reported tax base/payment
   Required data: Administrative data that measures the aspects of compliance most likely affected by enforcement, directly and indirectly – e.g. through spillovers (can be tax declarations, registration forms, third-party reports)
   Other requirements to answer the question: Information on the timing and targeting of enforcement (e.g. list of taxpayers audited and audit date); information on how targeting was decided (e.g. random targeting, targeting based on some cutoff rule such as specific risk level)

3
   Policy question: Effect of tax policy or tax administrative practice on real outcomes (e.g. firm growth)
   Required data: Data under 1. or 2. above + ideally survey data which provides a measure of real outcomes less susceptible to misreporting
   Other requirements to answer the question: Randomized or natural experiment which varies tax policy or tax administrative practice

4
   Policy question: Distributional properties of tax X (income or wealth taxes)
   Required data: Taxpayer declarations for tax X, or household survey data containing information on tax X

5
   Policy question: Distributional properties of tax X (consumption tax)
   Required data: Consumption survey data (cf. Bachas et al 2019)