
Synthetic data alone is not enough

Published on November 4, 2025
Stairs leading up to a ceiling with cellular tiles, representing synthetic data

Synthetic data has been hailed as a privacy cure-all. But is it?

Privacy-preserving synthetic data promises organizations a way to innovate while minimizing the exposure of real personal information. Yet despite growing adoption, challenges remain: only 58% of organizations have completed a formal AI risk assessment, leaving gaps in privacy, anonymization, and regulatory compliance. 

The rise of generative AI has accelerated interest in synthetic data, as businesses seek ways to safely leverage machine learning while navigating complex privacy regulations. High-profile data breaches and regulatory investigations in recent years have made it clear that relying solely on synthetic data is not enough to mitigate risk. 

Governments, research institutions, and enterprises alike are experimenting with synthetic datasets to protect sensitive information, but regulators continue to emphasize that residual privacy risks must be carefully managed. This makes understanding synthetic data privacy risks and the legal landscape critical for organizations aiming to innovate responsibly.

What is synthetic data, and how does it relate to privacy?

Synthetic data is artificially generated information designed to mimic the statistical properties of real datasets. It can take many forms, from tabular records and text to images and complex sequences, depending on the intended use. 

Synthetic data is created using algorithms that learn patterns from original data, then produce new records that retain the essential statistical distribution of the source while avoiding direct replication of individual records. Popular generation methods include generative adversarial networks (GANs), variational autoencoders (VAEs), and other machine learning–based approaches, which allow data scientists to produce fully synthetic data or partially synthetic datasets tailored to specific analytical needs.
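To make the workflow concrete, here is a minimal sketch, assuming a purely numeric table and using a simple Gaussian mixture as a stand-in for heavier generators such as GANs or VAEs (the file names are hypothetical):

```python
# Minimal sketch: learn the joint distribution of a real table and
# sample brand-new rows from it. GaussianMixture stands in for heavier
# generators such as GANs or VAEs; "real_records.csv" is hypothetical
# and assumed to contain only numeric columns.
import pandas as pd
from sklearn.mixture import GaussianMixture

real = pd.read_csv("real_records.csv")

model = GaussianMixture(n_components=10, random_state=0)
model.fit(real.values)  # learn the statistical structure of the source

samples, _ = model.sample(n_samples=len(real))  # draw artificial rows
synthetic = pd.DataFrame(samples, columns=real.columns)
synthetic.to_csv("synthetic_records.csv", index=False)
```

Production systems add handling for categorical columns, mixed types, and privacy-aware training, but the core loop is the same: learn the source distribution, then sample new records from it.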

Because synthetic data does not contain actual personal information, it is often described as privacy-preserving synthetic data. By replacing real personal data with artificially generated counterparts, it reduces direct exposure risks and supports data minimization, helping organizations comply with data protection obligations. This approach is particularly useful in sectors such as healthcare or finance, where sensitive information must be safeguarded but still analyzed for research or operational insights.

However, the privacy benefits of synthetic data are frequently misunderstood. While it lowers the likelihood of exposing individual identities, it does not guarantee anonymity. Models can inadvertently retain patterns from the original data, creating subtle correlations that could allow re-identification in poorly generated datasets. Residual privacy risks exist, meaning synthetic data must be handled carefully and, ideally, combined with other privacy-enhancing technologies (PETs) to fully mitigate potential leakage.

Even with these precautions, there are real-world implications. Recent studies indicate that synthetic datasets, if not properly generated or validated, can still reveal patterns that may compromise privacy, highlighting that synthetic data privacy risks are real and must be managed proactively. 

Nevertheless, when used correctly, synthetic data allows organizations to conduct meaningful analysis, train machine learning models, and generate data for research or policy purposes, all while maintaining a strong focus on data protection and analytical value.

How is synthetic data being used?

Synthetic data is increasingly deployed across industries where data protection and data access need to be balanced:

  • In healthcare, it enables researchers to study patient trends and develop predictive models without exposing sensitive medical records. 
  • Academic institutions use synthetic datasets to teach machine learning or conduct experiments while respecting privacy.
  • Policy organizations leverage synthetic data to model public behavior and inform decision-making without relying on real-world personal information.

Beyond access benefits, synthetic data helps address data scarcity, allowing organizations to augment training datasets for artificial intelligence and deep learning algorithms. Recent initiatives in urban planning and public policy have demonstrated that synthetic census or survey datasets can provide actionable insights while preserving citizen privacy. 

However, there are important ethical and technical considerations:

  • Synthetic datasets can inadvertently reproduce biases present in the source data, potentially affecting model fairness.
  • Maintaining data integrity is critical, as synthetic approximations may not fully capture real-world variability.
  • Ensuring reproducibility of analyses can be challenging when synthetic data differs from the underlying real datasets.

These considerations highlight the need for careful validation and responsible use.
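As a first screen for that kind of validation, here is a minimal sketch that compares each numeric column of the real and synthetic tables with a two-sample Kolmogorov-Smirnov test (file names are hypothetical; column-wise tests ignore joint structure, so this catches obvious drift rather than certifying fidelity):

```python
# Sketch of a basic fidelity screen: compare each numeric column of the
# real and synthetic tables with a two-sample Kolmogorov-Smirnov test.
# A low p-value flags columns whose synthetic distribution drifts from
# the source. File names are hypothetical.
import pandas as pd
from scipy.stats import ks_2samp

real = pd.read_csv("real_records.csv")
synthetic = pd.read_csv("synthetic_records.csv")

for col in real.select_dtypes("number").columns:
    stat, p_value = ks_2samp(real[col], synthetic[col])
    verdict = "ok" if p_value > 0.05 else "drift"
    print(f"{col}: KS={stat:.3f}, p={p_value:.3f} -> {verdict}")
```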

Is synthetic data GDPR compliant?

AI-generated synthetic data is not automatically anonymous, and using AI-based synthetic data generators does not by itself fulfill the GDPR's strict anonymization requirements. While synthetic data can reduce exposure to real personal data, it does not eliminate privacy risks on its own. Organizations must understand that synthetic data privacy risks remain unless the data generation process and downstream usage are carefully managed.

Under the General Data Protection Regulation (GDPR), true anonymization is defined in Recital 26: data is only considered anonymized if individuals cannot be identified by any means reasonably likely to be used. 

Synthetic datasets often fail this test. Even when synthetic records do not contain exact personal identifiers, machine learning models can encode patterns from the original data. These traces may allow an adversary to infer information about individuals indirectly, meaning the data cannot be considered fully anonymous under EU law.

Residual privacy risks also persist. If the model memorizes sensitive information from the source data, synthetic outputs could inadvertently reveal aspects of real individuals. This is particularly relevant in privacy-sensitive sectors where unique data points may stand out. Without proper safeguards, organizations may unintentionally expose sensitive data, creating compliance challenges and potential liability under data protection laws.

Moreover, synthetic data is often better described as pseudonymized rather than anonymized. Pseudonymization reduces the direct identifiability of individuals but still qualifies as personal data under GDPR. As a result, processing pseudonymized synthetic datasets still requires a valid legal basis, adherence to data minimization principles, and implementation of appropriate technical and organizational measures.

Despite these limitations, synthetic data can play a critical role in compliance when combined with privacy-enhancing technologies (PETs). Techniques such as confidential computing, federated learning, or differential privacy can strengthen protections, mitigate residual risks, and maintain the analytical value of synthetic datasets. By integrating synthetic data with PETs, organizations can safely conduct research, share insights, and train machine learning models while staying aligned with GDPR requirements and supporting responsible data stewardship.

Synthetic data ≠ anonymization

Synthetic data is often marketed as a shortcut to anonymization, but anonymization is a legal standard rather than a statistical one. Simply generating artificial records does not automatically remove the risk of identifying individuals. Even high-quality synthetic datasets can retain patterns from the original data, which may be exploited to infer sensitive information.

Why synthetic data isn’t inherently anonymous:

  • Models can memorize and reproduce unique data points from the source.
  • Statistical patterns may indirectly reveal individual identities.
  • Re-identification is possible through linkage with other datasets.
  • Synthetic outputs may still reflect sensitive attributes like age, location, or health conditions.

For example, healthcare researchers have shown that synthetic patient datasets, if not carefully validated, can sometimes be reverse-engineered to reveal traits of real patients even when no original records are included. This demonstrates that synthetic data reduces exposure risk but does not guarantee privacy, reinforcing the need to combine it with other measures to ensure robust protection.
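One crude way to screen for this kind of leakage is a nearest-neighbor distance check: if synthetic rows sit far closer to individual real records than real records sit to one another, the generator may have memorized people rather than patterns. A minimal sketch, assuming numeric data and hypothetical file names (the 0.1 threshold is an illustrative choice, and passing the check is not a formal privacy guarantee):

```python
# Heuristic memorization screen: flag synthetic rows that lie much
# closer to some real record than real records lie to one another.
# File names and the 0.1 factor are illustrative assumptions; passing
# this check is NOT a formal privacy guarantee.
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

real = pd.read_csv("real_records.csv").values
synthetic = pd.read_csv("synthetic_records.csv").values

# Typical spacing between real records (nearest neighbor, excluding self).
real_nn = NearestNeighbors(n_neighbors=2).fit(real)
real_dist, _ = real_nn.kneighbors(real)
baseline = np.median(real_dist[:, 1])

# Distance from each synthetic row to its closest real record.
synth_dist, _ = real_nn.kneighbors(synthetic, n_neighbors=1)

suspects = (synth_dist.ravel() < 0.1 * baseline).sum()
print(f"{suspects} synthetic rows sit suspiciously close to a real record")
```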

Integrating synthetic data with privacy-enhancing technologies

To address synthetic data’s inability to fully guarantee privacy or regulatory compliance, organizations increasingly combine synthetic data with privacy-enhancing technologies (PETs) that strengthen protections while maintaining analytical value.

  • Differential privacy adds carefully calibrated noise to datasets, ensuring that individual-level information cannot be inferred from aggregate outputs (a minimal sketch follows this list).
  • Federated learning allows machine learning models to be trained across multiple decentralized datasets without moving sensitive data from its source. 
  • Homomorphic encryption enables computations on encrypted data, producing results without ever exposing the underlying information.
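Here is the minimal sketch promised above: the Laplace mechanism applied to a counting query. A count has sensitivity 1, since adding or removing one person changes it by at most 1, so Laplace noise with scale 1/ε yields ε-differential privacy (the epsilon value and the example count are illustrative):

```python
# Minimal Laplace mechanism for a counting query. A count has
# sensitivity 1, so noise drawn from Laplace(scale=1/epsilon) gives
# epsilon-differential privacy. Epsilon and the count are illustrative.
import numpy as np

def dp_count(true_count: int, epsilon: float) -> float:
    rng = np.random.default_rng()
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Example: release how many patients in a cohort share a condition.
print(dp_count(true_count=412, epsilon=1.0))
```

Smaller epsilon values mean stronger privacy and noisier answers; the same calibration idea extends to means, histograms, and even model training.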

By integrating these techniques with synthetic data, organizations can overcome the typical trade-offs between privacy and utility. Synthetic datasets can be safely generated and shared while maintaining statistical relevance for research, predictive modeling, or policy analysis. This multi-layered approach not only reduces synthetic data privacy risks but also sets the stage for more advanced solutions, like confidential computing, which protects data even while it is being processed.

How confidential computing strengthens synthetic data privacy

Confidential computing provides an additional layer of security that goes beyond other PETs. Unlike federated learning or differential privacy, confidential computing protects data while in use, ensuring that computations occur within a secure, encrypted environment that even the infrastructure provider cannot access.

When combined with synthetic data, confidential computing allows organizations to generate, collaborate on, and analyze datasets without exposing sensitive information. This combination addresses the main limitations of synthetic data, including residual re-identification risk and GDPR compliance concerns. Data can be processed collaboratively across multiple stakeholders — for example, in a healthcare scenario, hospitals and researchers — without compromising confidentiality.

Decentriq’s data clean room platform leverages confidential computing to enable secure multi-party analytics. Synthetic datasets are generated within secure enclaves, meaning analysts or partners can extract meaningful insights without ever accessing real personal data. This approach preserves analytical value, reduces synthetic data privacy risks, and ensures adherence to data protection laws. Read more about data clean rooms here.

By uniting synthetic data with PETs and, more specifically, confidential computing, organizations achieve what is often called the “best of all three worlds”: the flexibility and utility of synthetic data, the privacy assurances of PETs, and the robust security of confidential computing. Decentriq’s solution empowers safe, GDPR-compliant collaboration, allowing teams to innovate with sensitive data confidently and efficiently.

Toward privacy that’s real, not synthetic

Synthetic data is valuable, but true privacy requires more than artificial records. The combination of privacy-preserving synthetic data, PETs, and confidential computing ensures that organizations can collaborate, analyze, and innovate without exposing sensitive information.

Learn how Decentriq enables real privacy through Custom Collaborations and explore our executive brief for deeper insights into synthetic data and privacy-enhancing technologies.

This is an updated version of a 2022 article by Nikolas Molyndris.
