Copyright and training data: what AI product teams can responsibly say
Copyright in AI is a three-layer problem. First, whether the training data itself infringes copyright. Second, whether the model’s outputs can reproduce copyrighted material. Third, whether users who submit copyrighted content to an AI service retain their rights.
Each layer has different legal status, different levels of uncertainty, and different things you can responsibly say to customers without overclaiming.
Quick answer
The responsible position is straightforward: acknowledge that copyright issues in AI are unsettled, be transparent about your indemnity position, and do not claim that your outputs are definitively free of third-party rights. If you are building a product for businesses, your customers’ procurement teams will expect a documented position on all three layers.
Layer one: training data
The first layer is whether the data used to train a model was properly licensed. This is mostly a provider problem, not a user problem — you are not the one who trained the model.
But it affects you in two ways:
- Provider indemnity. Some providers offer intellectual property indemnity, meaning they will defend you if a copyright claim arises from their training data. Others do not. The presence or absence of indemnity is a procurement signal, not a guarantee.
- Training data disclosure. Most providers do not publish a complete list of training data. Without this, you cannot independently verify whether the data was properly sourced. You are relying on the provider’s statements.
What you can responsibly say: “Our model provider states that training data was sourced according to its published policies. We cannot independently verify the complete training data set.”
Layer two: outputs and similarity
The second layer is whether the model can generate outputs that reproduce copyrighted material from its training data. This is possible, particularly for:
- Code that mirrors open-source repositories;
- Lyrics, quotes or prose from well-known works;
- Character or brand names in specific contexts;
- Images that reconstruct training data (in image-generation models).
The risk varies by model size, training data composition, and how common the copyrighted material is in the training set. Small, domain-specific models generally carry lower reproduction risk than large, web-scale models, but a model of any size can reproduce material that appears many times in its training data.
What you can responsibly say: “Outputs are generated probabilistically and are not guaranteed to be free of third-party rights. We recommend reviewing outputs before public use, particularly to check for verbatim reproduction of known works.”
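One way to support that review step is a rough verbatim-overlap check against reference texts you already hold. The sketch below is an illustration only: the function names, the 8-word window and the 0.2 threshold are assumptions, not values taken from any provider or standard. It can only flag overlap with texts you supply, so a low score does not show that an output is free of third-party rights.

```python
import re


def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams found in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def verbatim_overlap(output: str, reference: str, n: int = 8) -> float:
    """Fraction of the output's n-grams that also appear in the reference."""
    out_grams = word_ngrams(output, n)
    if not out_grams:
        # Output is shorter than n words, so there is nothing to compare.
        return 0.0
    shared = out_grams & word_ngrams(reference, n)
    return len(shared) / len(out_grams)


def needs_review(output: str, references: list[str],
                 n: int = 8, threshold: float = 0.2) -> bool:
    """Flag an output for human review if it overlaps any reference text.

    The threshold is an assumed starting point and should be tuned.
    """
    return any(verbatim_overlap(output, ref, n) >= threshold
               for ref in references)
```

A flagged output is a prompt for a person to look at it, not a finding of infringement. Shorter windows catch more paraphrase-level overlap but produce more false positives on common phrases.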
Layer three: user input rights
The third layer is what rights users retain when they submit content to an AI service. This is covered by terms of service, but the key questions are:
- Does the provider claim a license to user content?
- Can the provider use user content for training?
- Does the user retain ownership of their inputs and outputs?
Most major providers now state that users retain ownership of their inputs and that API content is not used for training by default. But the details matter, and consumer-facing products often have different terms from API services.
What you can responsibly say: “You retain ownership of your inputs and outputs. Our provider’s terms state that API content is not used for training by default. Full terms are available at [link].”
The indemnity question
Some providers offer legal indemnity for copyright claims:
- OpenAI offers both a “Copyright Shield” for ChatGPT Enterprise and API customers, and a broader IP indemnity for certain API use cases.
- Anthropic offers IP indemnity for enterprise API customers with a signed agreement.
- Google offers IP indemnity through Google Cloud contractual protections.
- Mistral does not offer general IP indemnity but may negotiate terms in enterprise agreements.
Indemnity covers specific scenarios and has exclusions. It is a contractual protection, not a blanket guarantee. The terms should be reviewed by a lawyer, not taken at face value from a marketing page.
What teams get wrong
- assuming “indemnity” means “we are fully protected” (it does not — exclusions apply);
- claiming that outputs are “original” as if that were a legal determination;
- ignoring the user input rights question and discovering later that the provider claims a broad license to user content;
- treating the copyright question as settled when it is actively being litigated in multiple jurisdictions;
- saying “we do not train on your data” without distinguishing between “do not use for training” and “do not store at all.”
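The last point in the list above is easier to get right if training use and storage are recorded as two separate facts. A minimal sketch, assuming a hypothetical `DataHandlingTerms` record that you fill in by hand from your provider's actual terms; the class, field names and wording are illustrative and do not correspond to any provider's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataHandlingTerms:
    """Hypothetical record of what a provider's terms say (not a real API)."""
    used_for_training: bool        # is submitted content used to train models?
    retention_days: Optional[int]  # None = not confirmed, 0 = not stored


def customer_statement(terms: DataHandlingTerms) -> str:
    """Build a statement that keeps training use and storage separate."""
    if terms.used_for_training:
        training = "Submitted content may be used for training."
    else:
        training = "Submitted content is not used for training."

    if terms.retention_days is None:
        storage = "The retention period has not been confirmed."
    elif terms.retention_days == 0:
        storage = "Submitted content is not stored."
    else:
        storage = (f"Submitted content may be stored for up to "
                   f"{terms.retention_days} days.")
    return f"{training} {storage}"
```

For example, `customer_statement(DataHandlingTerms(False, 30))` produces a statement that says content is not used for training but may be stored for up to 30 days, which is the distinction a bare “we do not train on your data” hides.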
Practical decision check
- Do you know whether your provider offers IP indemnity, and what it covers?
- Have you reviewed the provider’s training data disclosure?
- Do your users retain rights to their inputs and outputs?
- Do you have a documented position on output similarity risk?
- Have you consulted a lawyer qualified in your jurisdiction?
If the answer to the last question is no, that is the priority. This page is operational guidance, not legal advice.
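The checklist above can also be kept as a small structured record so that open items are visible in a launch review. This is a sketch under assumptions: the `CopyrightReadiness` class and its field names are hypothetical, chosen to mirror the five questions, and the only logic is that legal review is listed first when it is outstanding.

```python
from dataclasses import dataclass, fields


@dataclass
class CopyrightReadiness:
    """Hypothetical record mirroring the five checklist questions."""
    indemnity_scope_known: bool
    training_disclosure_reviewed: bool
    user_rights_confirmed: bool
    similarity_position_documented: bool
    lawyer_consulted: bool


def open_items(check: CopyrightReadiness) -> list[str]:
    """Return unanswered items, with legal review first if outstanding."""
    missing = [f.name for f in fields(check) if not getattr(check, f.name)]
    # Stable sort: "lawyer_consulted" gets key False and moves to the front;
    # the remaining items keep their checklist order.
    missing.sort(key=lambda name: name != "lawyer_consulted")
    return missing
```

An empty list means every question has a recorded answer; it says nothing about whether those answers are legally sufficient.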
Methodology and sources
Check date: 2026-05-24
What was checked: Published provider terms of service, data processing agreements, indemnity documentation, and regulatory guidance. Court filings and legal commentary were reviewed for the evolving legal landscape summary.
What the sources were used for: Building the three-layer framework and characterising the current state of provider protections and disclosures.
Assumptions and limits: Copyright law is unsettled and varies by jurisdiction. Provider positions change. This is not legal advice and should not be treated as such.
Change log
- 2026-05-24: first draft built from the llm-editor-approved brief, with a three-layer framework for understanding AI copyright risk.
Source list
- OpenAI Copyright Shield — https://openai.com/blog/copyright-shield-for-chatgpt-enterprise-and-the-api
- Anthropic IP indemnity — https://www.anthropic.com/news/intellectual-property-protection
- Google Cloud IP indemnity — https://cloud.google.com/blog/products/ai-machine-learning/google-cloud-ai-ip-indemnity
- UK IPO guidance on AI and copyright — https://www.gov.uk/government/consultations/ai-and-copyright-consultation
- US Copyright Office AI guidance — https://www.copyright.gov/ai/