Generative artificial intelligence is undoubtedly the hype topic of 2023. No company, product, event, or author can avoid addressing it and its most famous representative: ChatGPT.
Launched in November 2022, OpenAI’s chatbot has captured the public’s attention with its extraordinary ability to generate coherent and contextually relevant texts, reaching 100 million users in just two months and sparking heated debates about the implications, potential, and challenges posed by this technology.
One of the most promising applications of generative models is also one of the least known and discussed: the production of synthetic data.
Plausible and Therefore Useful Data
We are accustomed to seeing generative AI models tested in seemingly creative fields: creating new images, new videos, new texts, or new music. However, what generative models like ChatGPT or Midjourney actually do is create something statistically similar to what already exists: something plausible with respect to the data they were trained on.
The capabilities of generative AI are therefore extremely well suited to overcoming one of the obstacles that many companies face when they want to launch Data Science projects: the need for a sufficiently large set of plausible data.
In this case, generative AI is used to produce data whose statistical properties are consistent with those observed in the original dataset, without any record corresponding to real, observed data.
An Example in Luxury Retail
Let’s imagine, for example, that two luxury goods companies wish to train an Artificial Intelligence model to identify and predict the types of customers who purchase goods from both companies. The available dataset has a relatively low number of rows, as is to be expected in an industry that, by definition, caters to an elite clientele. Furthermore, it is crucial for both companies to ensure the privacy of their customers and the confidentiality of corporate information. In this case, simple pseudonymization would not be considered sufficient, while complete anonymization would risk reducing the significance of the information by removing the relationships between its various elements.
A generative model solves this problem by synthesizing new data whose statistical properties are similar to those of the existing data, but in larger quantities (i.e., data augmentation). It also transforms the data into a form that is completely devoid of personal information. Analyses can then be conducted, or predictive models trained, on this new dataset.
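As a rough sketch of the idea (not the actual approach of any specific vendor or of the scenario above), the example below fits a simple Gaussian-copula-style model to a small, purely numeric customer table and samples new rows from it. The column names and figures are invented for illustration; a real project would rely on a dedicated synthetic-data library and would also handle categorical fields, validation, and privacy testing.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)

# Toy "real" dataset, invented for illustration:
# columns = annual spend (EUR), number of purchases, customer age.
real = np.column_stack([
    rng.lognormal(mean=9.0, sigma=0.5, size=200),
    rng.poisson(lam=4, size=200) + 1.0,
    rng.normal(loc=45, scale=12, size=200),
])

def fit_copula_correlation(data):
    """Rank-transform each column, map the ranks to normal scores,
    and estimate the correlation matrix of those scores."""
    n = data.shape[0]
    ranks = np.argsort(np.argsort(data, axis=0), axis=0)
    u = (ranks + 0.5) / n                  # uniform scores in (0, 1)
    z = norm.ppf(u)                        # normal scores
    return np.corrcoef(z, rowvar=False)

def sample_synthetic(data, corr, n_synth, rng):
    """Draw correlated normal scores and map them back through the
    empirical quantiles of each original column, so that both the
    marginal distributions and the correlations are roughly preserved."""
    d = data.shape[1]
    z = rng.multivariate_normal(np.zeros(d), corr, size=n_synth)
    u = norm.cdf(z)
    return np.column_stack([np.quantile(data[:, j], u[:, j]) for j in range(d)])

corr = fit_copula_correlation(real)
synthetic = sample_synthetic(real, corr, n_synth=1000, rng=rng)  # more rows than the original
print(synthetic[:3])
```

Because every synthetic row is drawn from the fitted model rather than copied from the table, no row corresponds to a specific real customer, although a serious project would still verify this with dedicated privacy metrics.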
Synthetic Data: From Research to Business
Synthetic data are not a novelty: the need to generate datasets with specific characteristics, and the idea of doing so through models, are long established in the scientific field. However, the quality of the generative models developed over the last decade and the recent regulations introduced to protect personal data have given this practice new impetus in the corporate sphere.
A practice that used to concern mainly researchers who had to test or develop advanced data analysis systems is now ready to solve business problems.
When Are Synthetic Data Useful?
Scarcity
The ability to generate synthetic data in a business context becomes crucial when the available data is quantitatively scarce — for example, because the subject of the analysis is by definition a rare event within the dataset, as can happen in fraud detection.
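As a minimal illustration of what augmenting a rare class can look like, the sketch below uses interpolation-based oversampling in the spirit of SMOTE rather than a deep generative model; the "fraud" rows and feature values are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def interpolate_minority(minority_rows, n_new, k=5, rng=rng):
    """Create new synthetic minority-class rows by interpolating between
    each real minority row and one of its k nearest minority neighbours
    (the idea behind SMOTE-style oversampling)."""
    n = len(minority_rows)
    # Pairwise distances within the minority class.
    d = np.linalg.norm(minority_rows[:, None, :] - minority_rows[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    neighbours = np.argsort(d, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)                    # pick a real row
    nbr = neighbours[base, rng.integers(0, k, size=n_new)]   # pick one of its neighbours
    lam = rng.random((n_new, 1))                             # interpolation weight
    return minority_rows[base] + lam * (minority_rows[nbr] - minority_rows[base])

# Toy example: 30 "fraud" rows with two numeric features, augmented to 300.
fraud = rng.normal(loc=[5.0, 2.0], scale=0.3, size=(30, 2))
synthetic_fraud = interpolate_minority(fraud, n_new=300)
```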
Privacy
Recourse to synthetic data is also useful when a dataset includes sensitive data (i.e., data subject to privacy, confidentiality, or security restrictions) that cannot simply be omitted without compromising the effectiveness of the entire dataset.
Constraints and Potential
These constraints can make some analyses unachievable or ineffective, hinder the exchange of information between different areas of the same organization, and impede the development of machine learning models for solving business problems (e.g., Churn Prevention, Propensity, or Anomaly Detection).
The utility of synthetic data for privacy protection has not escaped the attention of the European legislator, which expressly mentions synthetic data in the AI Act as a way to train models without resorting to data from real individuals.
The Trade-off Between Privacy, Fidelity, and Utility
An important characteristic of synthetic data is its similarity to the original data, a property quantified using fidelity metrics. A synthetic dataset with high fidelity generally also offers high utility: because it closely resembles the original data, it can be used for the same purposes. However, maximizing fidelity and utility can lead to violations of privacy requirements, since synthetic records that are too close to the originals may reveal information about real individuals.
The generation of synthetic data is characterized by tension among the three above-mentioned dimensions. Every time a dataset is generated, it is necessary to assess how to balance these dimensions based on the purpose of the project and the starting data. In this way, one can obtain useful, high-quality data that also meets confidentiality and privacy requirements.
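As a hedged illustration of how this balance might be checked in practice, the sketch below computes a per-column Kolmogorov-Smirnov statistic as a rough fidelity measure and the distance from each synthetic row to its nearest real row as a very naive privacy indicator. Real evaluations use much richer metric suites; the function and variable names here are invented, and the commented usage assumes the `real` and `synthetic` arrays from the earlier sketch.

```python
import numpy as np
from scipy.stats import ks_2samp

def fidelity_report(real, synthetic, column_names):
    """Two-sample KS statistic per column: 0 means the marginal
    distributions match exactly, values near 1 mean they are very different."""
    for j, name in enumerate(column_names):
        stat = ks_2samp(real[:, j], synthetic[:, j]).statistic
        print(f"{name:>15}: KS = {stat:.3f}")

def nearest_real_distance(real, synthetic):
    """Euclidean distance from each synthetic row to its closest real row.
    Distances close to zero suggest the synthetic data may be copying
    (and therefore leaking) real records."""
    mu, sigma = real.mean(axis=0), real.std(axis=0)
    r = (real - mu) / sigma                # standardize so no feature dominates
    s = (synthetic - mu) / sigma
    dists = np.sqrt(((s[:, None, :] - r[None, :, :]) ** 2).sum(axis=2))
    return dists.min(axis=1)

# Example usage with the arrays from the previous sketch:
# fidelity_report(real, synthetic, ["annual_spend", "n_purchases", "age"])
# print("Closest synthetic-to-real distance:", nearest_real_distance(real, synthetic).min())
```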