Generative AI and Data Privacy: The Challenge of PII Use in Training Data Sets

Jun 11th '24

The rapid advancement of generative AI models, such as large language models and image generators, has ushered in a new era of technological capabilities. However, this progress also raises significant data privacy compliance concerns when personally identifiable information (PII) is used to train these powerful AI systems.


Generative AI and data privacy

At the core of this issue is a fundamental tension among three things: the vast amounts of data required to train generative AI models; the rights of individual data subjects, including the right to have their PII deleted; and the principles of data minimization and purpose limitation enshrined in many data privacy regulations, such as the EU’s GDPR, California’s CCPA/CPRA, and a growing number of new state data privacy laws.


Generative AI models are trained on massive datasets, often containing billions or trillions of data points, to learn the patterns and relationships that allow them to generate human-like text, images, or other outputs. While this training data can come from various sources, including publicly available ones, it is not uncommon for organizations to incorporate PII, such as personal names, addresses, or other identifiable information, into their training sets.


Using PII in AI training sets raises several concerns from a data privacy perspective. First, it may violate the principle of purpose limitation if the personal data was initially collected for a purpose other than training AI models. Additionally, it could conflict with the data minimization principle, which encourages organizations to limit the collection and processing of personal data to what is strictly necessary.


  • Data privacy law rights

Moreover, many data privacy laws grant individuals the right to access, rectify, or delete their personal data held by organizations. However, when PII is deeply embedded within the parameters of a trained generative AI model, it becomes incredibly challenging, if not impossible, to isolate and remove that specific data point without significantly degrading the model’s performance.


This predicament poses a significant compliance challenge for organizations operating under data privacy regulations. Failing to effectively remove requested PII from trained AI models could be considered a violation, potentially leading to substantial fines, legal actions, and reputational damage.


To mitigate these risks, organizations should explore techniques such as synthetic data generation, differential privacy, and federated learning, which can help obscure or anonymize individual PII, or keep it out of a centralized training set, while preserving the overall patterns and distributions in the training data.
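To make the differential privacy idea more concrete, here is a minimal sketch of the Laplace mechanism applied to a simple count. It is an illustration only: the function names, the sample records, and the epsilon value are assumptions made for this example, and training a generative model with differential privacy typically involves adding noise during training rather than to a single count.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of matching records.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical records, used only for illustration.
rng = random.Random(0)
records = [{"city": "Denver"}, {"city": "Austin"}, {"city": "Denver"}]
noisy = private_count(records, lambda r: r["city"] == "Denver", epsilon=0.5, rng=rng)
print(noisy)
```

A smaller epsilon adds more noise and gives stronger privacy at the cost of accuracy, which is the same trade-off organizations face when applying these methods to training data.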


A best practice is to maintain detailed data lineage and documentation of training data sources, preprocessing steps, and model versions so that PII can be identified and managed within AI systems. Additionally, companies should consider updating their consent requests to explicitly cover the use of individual PII in generative AI training data sets.
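As a rough illustration of what such lineage tracking could look like, the sketch below records which data subjects appear in which datasets and which datasets fed which model versions. The class, method names, dataset names, and subject IDs are all hypothetical; a real system would sit on top of a data catalog or database.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set


@dataclass
class LineageRegistry:
    """Hypothetical in-memory registry linking data subjects to model versions."""

    # dataset name -> IDs of data subjects whose PII it contains
    dataset_subjects: Dict[str, Set[str]] = field(default_factory=dict)
    # model version -> names of the datasets it was trained on
    model_datasets: Dict[str, Set[str]] = field(default_factory=dict)

    def record_dataset(self, dataset: str, subject_ids: Iterable[str]) -> None:
        self.dataset_subjects.setdefault(dataset, set()).update(subject_ids)

    def record_model(self, model_version: str, datasets: Iterable[str]) -> None:
        self.model_datasets.setdefault(model_version, set()).update(datasets)

    def models_affected_by(self, subject_id: str) -> List[str]:
        """List model versions trained on any dataset containing this subject."""
        touched = {
            name
            for name, subjects in self.dataset_subjects.items()
            if subject_id in subjects
        }
        return sorted(
            version
            for version, datasets in self.model_datasets.items()
            if datasets & touched
        )


registry = LineageRegistry()
registry.record_dataset("support_tickets_2023", {"subject-17", "subject-42"})
registry.record_dataset("public_web_crawl", {"subject-99"})
registry.record_model("chat-model-v1", {"support_tickets_2023"})
registry.record_model("chat-model-v2", {"support_tickets_2023", "public_web_crawl"})

print(registry.models_affected_by("subject-99"))  # ['chat-model-v2']
```

With records like these, a deletion request can at least be mapped to the model versions it touches, even though removing the data from the trained model itself remains the harder problem.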


Furthermore, using PII in generative AI training raises broader ethical and legal concerns beyond compliance with data deletion requests. Organizations must carefully evaluate the potential risks, such as algorithmic bias, discrimination, and privacy violations, and implement appropriate safeguards and governance processes.


As generative AI continues to advance and permeate various industries, regulatory bodies and policymakers may need to provide additional guidance or legal frameworks explicitly addressing the challenges posed by AI systems and personal data. Striking the right balance between innovation and data privacy protection will be crucial for fostering public trust and enabling the responsible development of these transformative technologies.


Implications of generative AI models and the new data privacy laws

If a data subject exercises their right to have their personally identifiable information (PII) deleted under various data privacy laws like the GDPR, CCPA/CPRA, or the numerous other state data privacy laws, and that PII was used in training a generative AI model, there could be significant challenges in fully complying with the deletion request.


Key implications and considerations include:


Difficulty in Isolating and Removing Specific PII: Generative AI models, especially large language models and image generators, are trained on massive datasets containing vast numbers of data points. Identifying and removing the specific PII of an individual data subject from an already trained model can be extremely difficult, if not impossible, because the influence of each training record is spread across the model’s parameters.

Potential Model Degradation: If the PII of a data subject is successfully identified and removed from the training data, retraining the generative AI model without that data could degrade its performance and accuracy and change the model’s conclusions and recommendations. The extent of this degradation would depend on the size of the overall training dataset and the significance of the removed data point.

Compliance Challenges: Failing to effectively remove the requested PII from the trained AI model could be considered a violation of data privacy regulations, potentially leading to fines, legal actions, and damage to the organization’s reputation.

Synthetic Data and Differential Privacy: One potential solution could be using synthetic data generation techniques or differential privacy methods during the initial training process. These approaches could help obscure or anonymize the individual PII while preserving the overall patterns and distributions in the data, making it easier to comply with deletion requests without significantly impacting the AI model’s performance.

Data Lineage and Documentation: Maintaining detailed data lineage and documentation of the training data sources, preprocessing steps, and model versioning could help organizations better identify and manage PII within their AI systems, facilitating compliance with data subject rights.

Ethical and Legal Considerations: Using PII in generative AI training raises ethical and legal concerns beyond compliance with data deletion requests.
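The synthetic data approach listed above can be sketched in a very simplified form. The example below drops direct identifiers and samples new rows from the per-column value frequencies of the original data. The column names and records are invented for illustration, and this naive method is an assumption-laden toy: it ignores correlations between columns and does not by itself guarantee privacy, since rare values can still point to an individual.

```python
import random
from collections import Counter


def fit_marginals(rows, columns):
    """Count how often each value appears in each selected column."""
    return {col: Counter(row[col] for row in rows) for col in columns}


def sample_synthetic(marginals, n, rng):
    """Draw n synthetic rows, sampling each column independently."""
    synthetic = []
    for _ in range(n):
        row = {}
        for col, counts in marginals.items():
            values = list(counts.keys())
            weights = list(counts.values())
            row[col] = rng.choices(values, weights=weights, k=1)[0]
        synthetic.append(row)
    return synthetic


# Hypothetical source records containing a direct identifier ("name").
real_rows = [
    {"name": "A. Example", "age_band": "30-39", "state": "CO"},
    {"name": "B. Example", "age_band": "40-49", "state": "CA"},
    {"name": "C. Example", "age_band": "30-39", "state": "CA"},
]

# Only non-identifying columns are modeled; "name" never reaches the output.
marginals = fit_marginals(real_rows, ["age_band", "state"])
synthetic_rows = sample_synthetic(marginals, n=5, rng=random.Random(1))
print(synthetic_rows)
```

Production-grade synthetic data tools model the joint distribution and are often combined with differential privacy, but the basic goal is the same: train on records that resemble the originals without belonging to any real person.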


  • One last question:

If a specific data subject requests deletion of their PII and that PII was used for generative AI content creation, would all the associated content generation output from that model also need to be deleted and the model retrained with the new training set?


I have not reviewed all of the new AI laws emerging worldwide, but so far, I have not seen these questions or concerns raised.


Ultimately, addressing data subject deletion requests for generative AI models trained on PII may require a combination of technical solutions, robust data governance practices, and, potentially, new regulatory guidance or legal frameworks that specifically address the challenges posed by AI systems that use PII.



Source: Smarsh


About Bill Tolson

Bill Tolson is President of Tolson Communications LLC, an advisory and consulting firm. He has 25-plus years in the archiving, information governance, data privacy, data security, and eDiscovery industries. Bill has held executive leadership positions in a wide range of high technology organizations, from consulting firms and technology startups to multinationals. Companies include Contoural, Hewlett Packard, StorageTek, Iomega, Hitachi Data Systems, Recommind, Actiance, and Archive360, where he was the Vice President of Global Compliance and eDiscovery for seven years.


Bill is a frequent speaker at legal and information governance industry events and has authored four eBooks, including Email Archiving for Dummies, Cloud Archiving for Dummies, The Bartenders Guide to eDiscovery, and the Know IT All’s Guide to eDiscovery. Bill has also authored 60-plus industry articles and hundreds of blogs, and has hosted 37 podcasts with industry pundits, subject matter experts, state legislators, and attorneys.


About Smarsh

Smarsh® is the recognized global leader in electronic communications archiving solutions for regulated organizations. Smarsh provides innovative capture, archiving, e-discovery, and supervision solutions across the industry’s widest breadth of communication channels.


Scalable for organizations of all sizes, the Smarsh platform provides customers with compliance built on confidence. It enables them to strategically future-proof as new communication channels are adopted, and to realize more insight and value from the data in their archive. Customers strengthen their compliance and e-discovery initiatives and benefit from the productive use of email, social media, mobile/text messaging, instant messaging and collaboration, web, and voice channels.


Smarsh serves a global client base that spans the top banks in North America and Europe, along with leading brokerage firms, insurers, and registered investment advisors. Smarsh also enables state and local government agencies to meet their public records and e-discovery requirements.



