by Alys Woodward, Senior Director Analyst at Gartner
A major problem with AI development today is the burden involved in obtaining real-world data and labeling it. In fact, data availability was cited as one of the top five barriers to implementing generative AI (GenAI) in a Gartner survey of 644 organizations conducted in the fourth quarter of 2023.
Synthetic data can help solve this problem. With orders of magnitude less privacy risk than real data, synthetic data can open a range of opportunities to train machine learning models and analyze data that would not be available if real data were the only option.
However, it’s important to understand how synthetic data can overcome privacy, compliance and data anonymization challenges, as well as the issues impeding its widespread adoption.
Addressing privacy challenges
Synthetic data helps organizations address privacy challenges while training their AI, machine learning (ML), or computer vision (CV) models.
Synthetic data can bridge information silos by acting as a substitute for real data and not revealing sensitive information, such as personal details and intellectual property. Since synthetic datasets maintain statistical properties that closely resemble the original data, they can produce precise training and testing data that is crucial for model development.
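As a rough illustration of what "maintaining statistical properties" can mean for tabular data, the sketch below fits the means, standard deviations and correlation of two numeric columns, then samples new rows from a bivariate normal distribution with those parameters. The column names, the toy values and the Gaussian assumption are all hypothetical; production generators use far richer models.

```python
import math
import random
import statistics

def fit_gaussian(xs, ys):
    """Estimate means, standard deviations and Pearson correlation of two columns."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return mx, my, sx, sy, cov / (sx * sy)

def sample_synthetic(params, n, rng):
    """Draw n synthetic rows from a bivariate normal with the fitted parameters."""
    mx, my, sx, sy, rho = params
    residual = math.sqrt(max(0.0, 1.0 - rho * rho))  # guard against rounding
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * z1
        y = my + sy * (rho * z1 + residual * z2)  # correlate y with x
        rows.append((x, y))
    return rows

# Hypothetical "real" columns: age in years, income in thousands.
ages = [23, 35, 41, 52, 29, 60, 47, 33]
incomes = [31, 48, 55, 72, 39, 80, 63, 45]

real_params = fit_gaussian(ages, incomes)
synthetic = sample_synthetic(real_params, 5000, random.Random(0))
synth_params = fit_gaussian([r[0] for r in synthetic], [r[1] for r in synthetic])
```

Comparing `real_params` with `synth_params` shows how closely the synthetic rows track the original summary statistics, even though no synthetic row corresponds to a real individual.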
Training computer vision models often requires a large and diverse set of labeled data to build highly accurate models. Obtaining and using real data for this purpose can be challenging, especially when it involves personally identifiable information (PII).
Two common use cases that require PII are ID verification and advanced driver assistance systems (ADAS), which monitor movements and actions in the driver’s area. In these situations, synthetic data can be useful for generating a range of facial expressions, skin colors and textures, as well as additional objects like hats, masks, and sunglasses. ADAS also requires AI to be trained for low-light conditions, such as driving in the dark.
Mitigating challenges associated with data anonymization
Efforts to manually anonymize and de-identify datasets (that is, to remove information that links a data record to a specific individual) are often time-consuming, labor-intensive and prone to errors.
Ultimately, this can delay projects and lengthen the iteration cycle time for development of ML algorithms and models. Synthetic data can overcome many of these pitfalls by providing faster, cheaper and easier access to data that is similar to the original source, suitable for use and protective of privacy.
Furthermore, if manually anonymized data is combined with other publicly available data sources, there’s a risk it could inadvertently reveal information that leads to re-identification, thus breaching data privacy. Leaders can use techniques such as differential privacy to ensure that any synthetic data generated from real data carries a very low risk of re-identification.
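The re-identification risk described above can be sketched as a simple join. In the hypothetical example below, a "de-identified" table with names removed is matched against a public list on three quasi-identifiers (ZIP code, birth year and sex). All records, field names and function names are invented for illustration.

```python
# Hypothetical "de-identified" records: names removed, quasi-identifiers kept.
deidentified = [
    {"zip": "02138", "birth_year": 1984, "sex": "F", "diagnosis": "asthma"},
    {"zip": "02138", "birth_year": 1990, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "02139", "birth_year": 1984, "sex": "F", "diagnosis": "migraine"},
]

# Hypothetical public list (for example, a voter roll) that includes names.
public = [
    {"name": "A. Example", "zip": "02138", "birth_year": 1984, "sex": "F"},
    {"name": "B. Sample", "zip": "02139", "birth_year": 1971, "sex": "M"},
]

QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")

def key(record):
    """Build the join key from the quasi-identifier fields."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)

def link(deidentified_rows, public_rows):
    """Return (name, diagnosis) pairs where exactly one public entry matches."""
    index = {}
    for person in public_rows:
        index.setdefault(key(person), []).append(person)
    matches = []
    for row in deidentified_rows:
        candidates = index.get(key(row), [])
        if len(candidates) == 1:  # an unambiguous match re-identifies the row
            matches.append((candidates[0]["name"], row["diagnosis"]))
    return matches

reidentified = link(deidentified, public)
```

Here `reidentified` contains one named individual paired with a diagnosis, even though the "de-identified" table never stored a name.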
Challenges hindering widespread adoption
Creating a synthetic tabular dataset involves striking a balance between privacy and utility, ensuring the data remains useful and accurately represents the original dataset. If the utility is too high, privacy may be compromised, especially for unique or distinctive records, as the synthetic dataset could be matched with other data sources.
Conversely, methods to enhance privacy, such as disconnecting certain attributes or introducing ‘noise’ via differential privacy, can inherently diminish the dataset’s utility.
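The privacy-utility trade-off can be made concrete with the Laplace mechanism, a standard way to achieve differential privacy for numeric queries. The sketch below releases a noisy count with noise scale sensitivity/epsilon, so a smaller epsilon (stronger privacy) yields a larger average error. The true count, the epsilon values and the function names are illustrative assumptions, and this is not a hardened implementation.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_count(true_count, epsilon, rng, sensitivity=1.0):
    """Release a count via the Laplace mechanism (a count query has sensitivity 1)."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)

def mean_abs_error(true_count, epsilon, trials, rng):
    """Average absolute error of the noisy count over repeated releases."""
    total = 0.0
    for _ in range(trials):
        total += abs(noisy_count(true_count, epsilon, rng) - true_count)
    return total / trials

rng = random.Random(42)
# Smaller epsilon means stronger privacy and, on average, noisier answers.
errors = {eps: mean_abs_error(1000, eps, 5000, rng) for eps in (0.1, 1.0, 10.0)}
```

The average error in `errors` shrinks as epsilon grows, which is the utility gained by relaxing the privacy guarantee.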
Over the past decades of data management, the low quality of transaction data has been an ongoing challenge. For example, call center agents might fail to enter a full address or other customer information, and this missing data can prevent analysis. To counteract this, IT organizations have needed to educate business users on how important good data quality is to both applications and analytics. “Garbage in means garbage out” was the commonly accepted principle.
However, this principle now shapes people’s attitudes toward synthetic data: many believe it must be inferior because it’s not real data, which delays adoption. In reality, synthetic data can be better than real data, not in how it represents the current world, but in how it can train AI models to work with the ideal or future world.
A synthetic dataset mirrors the original dataset. Therefore, if the original does not include unusual occurrences or “edge cases,” these won’t appear in the synthetic dataset either. This is particularly important for image and video synthetic data in areas like autonomous driving, where many hours of driving footage are used to train the AI. Unusual situations like emergency vehicles, driving in snow or animals on the road may be missing from that footage and need to be created deliberately.
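To see why edge cases have to be added on purpose, consider a deliberately naive generator that resamples scene labels according to their observed frequencies. The labels, counts and the 5% target share below are hypothetical; real image and video pipelines would render or simulate such scenes rather than resample labels.

```python
import random
from collections import Counter

# Hypothetical scene labels from recorded driving footage; no edge cases observed.
original = ["clear_day"] * 700 + ["rain"] * 200 + ["night"] * 100

EDGE_CASES = ["emergency_vehicle", "snow", "animal_on_road"]

def resample(labels, n, rng):
    """Naive generator: draw labels by their frequency in the original data."""
    return rng.choices(labels, k=n)

def inject_edge_cases(labels, edge_cases, share, rng):
    """Replace a fraction `share` of rows with explicitly specified edge cases."""
    result = list(labels)
    n_edge = round(len(result) * share)
    for i in rng.sample(range(len(result)), n_edge):
        result[i] = rng.choice(edge_cases)
    return result

rng = random.Random(7)
synthetic = resample(original, 1000, rng)  # mirrors the original only
augmented = inject_edge_cases(synthetic, EDGE_CASES, 0.05, rng)

mirrored_counts = Counter(synthetic)
augmented_counts = Counter(augmented)
```

The mirrored sample can never contain a label that the original lacks, whereas the augmented sample includes the edge cases at whatever share the developer chooses.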