Enabling Synthetic Data adoption in regulated domains

Giorgio Visani, Giacomo Graffi, Mattia Alfero, Enrico Bagli, Davide Capuzzo, and Federico Chesani

Abstract of Enabling Synthetic Data adoption in regulated domains

The switch from a Model-Centric to a Data-Centric mindset is putting emphasis on data and its quality rather than on algorithms, bringing forward new challenges. In particular, the sensitive nature of the information in highly regulated scenarios needs to be accounted for. Specific approaches to address the privacy issue have been developed, known as Privacy Enhancing Technologies. However, they frequently cause a loss of information, creating a crucial trade-off between data quality and privacy. A clever way to bypass this conundrum relies on Synthetic Data: data obtained from a generative process that learns the properties of the real data. Both Academia and Industry have realized the importance of evaluating synthetic data quality: without all-round reliable metrics, the innovative data generation task has no proper objective function to maximize. Despite that, the topic remains under-explored. For this reason, we systematically catalog the important traits of synthetic data quality and privacy, and devise a specific methodology to test them. The result is DAISYnt (aDoption of Artificial Intelligence SYnthesis): a comprehensive suite of advanced tests, which sets a de facto standard for synthetic data evaluation. As a practical use-case, a variety of generative algorithms have been trained on real-world Credit Bureau Data. The best model has been assessed using DAISYnt on the different synthetic replicas. Further potential uses include, among others, auditing and fine-tuning of generative models or ensuring the high quality of a given synthetic dataset. From a prescriptive viewpoint, DAISYnt may eventually pave the way to synthetic data adoption in highly regulated domains, ranging from Finance to Healthcare, through Insurance and Education.

INTRODUCTION to Enabling Synthetic Data adoption in regulated domains

Critical aspects of a valuable dataset are data quality and privacy. The former is stressed in the Data-Centric mindset pioneered by Andrew Ng, while the latter is required by regulations such as the GDPR and, in the U.S., FERPA and HIPAA, which cover educational and medical data privacy respectively. Privacy Enhancing Technologies already help protect sensitive data, at the cost of some information loss. In fact, privacy and data quality behave as two antagonistic features. A clever way to potentially avoid this conflict relies on Synthetic Data: data obtained from a generative process that learns the properties of the real data.

The quest for valuable synthetic data is highly relevant in regulated domains such as Finance and Healthcare, where it may enable several use-cases, including:
1) enforcing privacy protection,
2) facilitating data sharing among companies and towards the research community,
3) tackling class imbalance (e.g., fraud detection),
4) increasing the amount of data for prediction models.

Despite that, the assessment of synthetic data quality and privacy remains an under-explored, although vital, topic. While a few taxonomies and tests have been proposed, we see the need for a decisive improvement.

In this paper we tackle the open question of how to evaluate the quality and privacy of tabular synthetic data. First, we systematically group its most important features into three concepts: Statistical Similarity, Data Utility and Privacy. To measure these notions, we devise appropriate state-of-the-art tests yielding a numeric value in the range [0, 1], where higher metrics imply better performance. The final result is DAISYnt (aDoption of Artificial Intelligence SYnthesis): a comprehensive and easy-to-use test suite that sets a de facto standard for synthetic data evaluation. As a practical use-case, a variety of generative algorithms have been trained on real-world Credit Bureau Data. The best model has been assessed using DAISYnt on the different synthetic replicas. Further possible DAISYnt applications include auditing and fine-tuning of the models or ensuring the high quality of a given synthetic dataset. In the following, Section 2 contains the taxonomy and literature review. Section 3 is dedicated to general-purpose tests, while Sections 4, 5 and 6 concern distribution similarity, data utility and privacy tests, respectively. Section 7 applies DAISYnt to Credit Scoring data, while Section 8 discusses its implications and future perspectives. The methodological sections contain DAISYnt graphs and results on the Adult dataset from the UCI repository.
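To make the idea of a bounded test score where higher means better more concrete, the following is a minimal sketch of one possible statistical similarity check for a single numeric column. It is not taken from DAISYnt: the function name `ks_similarity` and the choice of one minus the two-sample Kolmogorov-Smirnov statistic are assumptions made purely for illustration.

```python
from bisect import bisect_right
import random


def ks_similarity(real, synth):
    """Hypothetical score: 1 - two-sample KS statistic, bounded between 0 and 1.

    A value of 1 means the two empirical distributions coincide,
    a value of 0 means the samples do not overlap at all.
    """
    real_sorted = sorted(real)
    synth_sorted = sorted(synth)
    n, m = len(real_sorted), len(synth_sorted)
    if n == 0 or m == 0:
        raise ValueError("both samples must be non-empty")

    # Empirical CDFs are step functions, so the largest gap between them
    # is reached at one of the observed values.
    ks = 0.0
    for x in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect_right(real_sorted, x) / n
        cdf_synth = bisect_right(synth_sorted, x) / m
        ks = max(ks, abs(cdf_real - cdf_synth))
    return 1.0 - ks


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the illustration is repeatable
    real = [rng.gauss(0, 1) for _ in range(1000)]
    faithful = [rng.gauss(0, 1) for _ in range(1000)]  # same distribution
    shifted = [rng.gauss(3, 1) for _ in range(1000)]   # clearly different
    print("faithful replica:", ks_similarity(real, faithful))
    print("shifted replica: ", ks_similarity(real, shifted))
```

A score of this kind only compares one marginal distribution at a time; it says nothing about relationships between columns, data utility or privacy, which is why a full evaluation needs several complementary tests.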
