Generative AI – Addressing Copyright
When it comes to the interaction of AI and IP rights, bar a flurry of activity surrounding the courts' rulings in the Thaler (DABUS) case (see here) and the Court of Appeal's ruling on the potential exclusion from patentability of artificial neural networks in the Emotional Perception case, most attention has focused on copyright issues. There are three main potentially thorny issues, all of which have been extensively covered by the mainstream media.
As a quick recap, the issues are whether:
- the way foundation models (FM) are trained using works from the internet infringes the copyright in the works of content creators such as authors, artists and software developers
- the outputs of FM infringe the copyright of content creators
- AI generated works are protectable.
The problem with training data
Copyright is a right that, in the UK and EU, subsists automatically when certain requirements are met. To establish infringement, a claimant must show that the defendant copied the whole of the copyright work, or a part of it that is regarded as 'substantial'. Both proof of copying from the copyright work and similarity between the works are required.
Content creators such as news providers, authors, visual content agencies and other creative professionals allege that their work is being unlawfully used to train AI models. Some use of this material is expressly authorised, for example, in July 2023 Associated Press announced that OpenAI had taken a licence of part of its text archive. However, the main thrust of the allegations by content creators is that millions of texts, parts of texts and other literary material and images have been scraped from publicly available websites without consent. This scraped content used as an input to train and develop AI models is alleged to infringe their copyright and often their database rights.
The case of Getty Images (US) Inc v Stability AI Ltd is the most prominent case making these kinds of allegations in the UK (there is also a corresponding US action). Setting aside the arguments on territorial extent raised in that case (i.e. whether the training and development of Stable Diffusion took place within the UK or in another jurisdiction), the allegations of copyright and database right infringement relevant here are that Stability AI:
- has downloaded and stored Getty Images' copyright works (necessary for encoding the content and other steps in the training process) on servers or computers in the UK during the development and training of Stable Diffusion
- infringed the communication to the public right by making Stable Diffusion available in the UK, where Stable Diffusion provides the means using text and/or image prompts to generate synthetic images that reproduce the whole or a substantial part of the copyright works.
Getty alleges that Stable Diffusion was trained using subsets of the LAION-5B dataset, a dataset comprising 5.85 billion CLIP-filtered (Contrastive Language-Image Pre-training) image-text pairs, created by scraping links to photographs and videos, together with associated captions, from the web, including from Pinterest, WordPress-hosted blogs, SmugMug, Blogspot, Flickr, Wikimedia, Tumblr and the Getty Images websites. The LAION-5B dataset comprises around 5 billion links. The LAION subsets together comprise approximately 3 billion image-text pairs from the LAION-5B dataset. At the time of filing its claim, Getty had identified around 12 million links in the LAION subsets to content on the Getty Images websites. In response to Getty's claim, Stability AI filed an unsuccessful application for summary judgment to dispose of certain key aspects of the case. It has since submitted its written defence to the court, denying liability.
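The link-identification exercise described above — counting how many entries in a LAION-style dataset of image-text pairs point at a given publisher's domain — can be sketched in a few lines of Python. The records below are invented for illustration; they are not drawn from LAION-5B or from Getty's pleadings.

```python
from urllib.parse import urlparse

# Illustrative LAION-style records: (image URL, caption) pairs.
# These URLs are made up for the example.
records = [
    ("https://media.gettyimages.com/photos/a.jpg", "A city skyline at dusk"),
    ("https://live.staticflickr.com/123/b.jpg", "A dog on a beach"),
    ("https://media.gettyimages.com/photos/c.jpg", "A mountain lake"),
]

def count_links_for_domain(pairs, domain):
    """Count records whose image URL resolves to the given domain
    (or a subdomain of it)."""
    return sum(
        1
        for url, _ in pairs
        if (urlparse(url).hostname or "").endswith(domain)
    )

print(count_links_for_domain(records, "gettyimages.com"))  # 2
```

At the scale of the real dataset this kind of analysis would be run over billions of rows, but the principle — matching link hostnames against a rights holder's domains — is the same.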
The training and use of FMs have resulted in intense debate on the infringement questions and on the adequacy of legislation and/or guidance on licensing. Other similar ongoing legal actions (all US based) include:
- the New York Times action against OpenAI and Microsoft in the US, for unlawful use of journalistic (including behind paywall) content to train LLMs
- Arkansas news organisation Helena World Chronicle's class action against Google and Alphabet, alleging that "unlawful tying" arrangements have been extended and amplified by Google's introduction of Bard in 2023, which it alleges was trained on content from publishers including Helena World Chronicle
- Thomson Reuters' action against ROSS Intelligence based on allegations of unlawful copying of content from its legal-research platform Westlaw to train a competing artificial intelligence-based platform
- a class action filed against OpenAI by the Authors Guild and some big-name authors including George RR Martin, John Grisham, and Jodi Picoult, alleging that the training of ChatGPT infringed the copyright in the authors’ works of fiction.
One of the issues for publishers and content creators is that they are not being rewarded for the use of their content to train AI models. Use of LLMs such as ChatGPT also disrupts their business model: consumers who previously searched for content via a search engine were directed to publications on the publishers' websites, where the traffic attracted revenue through digital advertising. A search for digital content on many AI systems instead produces a direct response that stays within the LLM or image generation platform, even though that response may draw on the same content that would have appeared in the search engine's results.
In December 2023, OpenAI provided written evidence to a UK committee inquiry into large language models, including an explanation of its position on the use of copyright protected works in LLM training data. It explained that its LLMs, including the models that power ChatGPT, are developed using three primary sources of training data: (1) information that is publicly accessible on the internet, (2) information licensed from third parties (such as Associated Press), and (3) information from users or their human trainers. OpenAI acknowledged that because "copyright today covers virtually every sort of human expression – including blog posts, photographs, forum posts, scraps of software code, and government documents – it would be impossible to train today’s leading AI models without using copyrighted materials". OpenAI stressed that it was for creators to exclude their content from AI training and that it has provided a way to disallow OpenAI's “GPTBot” web crawler from accessing a site, as well as an opt-out process for creators who want to exclude their images from future DALL∙E training datasets. It also mentioned its partnerships with publishers like Associated Press.
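The GPTBot opt-out referred to above works through a site's robots.txt file. As a sketch of the effect of such a directive, the snippet below uses Python's standard urllib.robotparser; the robots.txt contents and URLs are illustrative, not OpenAI's published guidance verbatim.

```python
from urllib import robotparser

# Illustrative robots.txt a publisher might serve to block OpenAI's
# GPTBot crawler while leaving the site open to other crawlers.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# GPTBot is disallowed everywhere; an ordinary crawler is not.
print(rp.can_fetch("GPTBot", "https://example.com/articles/1"))        # False
print(rp.can_fetch("SomeOtherBot", "https://example.com/articles/1"))  # True
```

Note that robots.txt is a voluntary convention: it signals the site owner's wishes to crawlers but does not technically prevent scraping, which is partly why the adequacy of opt-out mechanisms remains contested.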
In January 2024, in what might be interpreted as the beginning of a shift by AI providers, OpenAI's CEO Sam Altman said at the World Economic Forum in Davos that OpenAI was open to doing deals with publishers and that there is a need for "new economic models" between publishers and generative AI providers. Licensing deals now appear to be becoming more prevalent, and some – for example between Condé Nast and OpenAI, and between academic publishers such as Wiley and GenAI providers – have been highlighted in press reports since the summer.
When licensing negotiations break down there is a risk of legal action being taken, such as that reportedly being taken by Mumsnet against OpenAI. In its complaint against OpenAI, Mumsnet claims "scraping without permission is an explicit breach of our terms of use, which clearly state that no part of the site may be distributed, scraped or copied for any purpose without our express approval." As the first case involving allegations of unlawful website scraping by AI developers in the UK, the outcome will be important in developing the law in this area – if the case reaches trial.
Training data issue resolution – UK government
Since our March 2023 Generative AI and intellectual property rights piece covering the UK's position and the reforms relating to a proposed commercial text and data mining (TDM) exception, there has been no significant legal development on TDM in the UK. In January 2024, the Culture, Media and Sport Committee confirmed that the government was no longer proceeding with its original proposal for a broad copyright exception for TDM.
Work had apparently commenced on the voluntary code of practice (promised by the Intellectual Property Office "by summer 2023") intended to provide guidance supporting AI firms in accessing copyright protected works as inputs to their models and to provide protections (e.g. watermarking) for generated output, but it has not materialised.
In the previous government's February 2024 response to its consultation on the 2023 AI whitepaper, it acknowledged that the stalemate between AI companies and rights holders on the voluntary code of practice led the IPO to return the task of producing the code to the Department for Science, Innovation and Technology (DSIT). DSIT and DCMS then re-engaged with the AI and rights holder sectors, without resolution of this complex global issue.
The Labour government's current position is that it is working with a range of stakeholders and aims to resolve the deadlock by the end of 2024.
In the EU
The EU AI Act entered into force on 1 August 2024 and is largely applicable from August 2026. It requires general-purpose AI (GPAI) systems such as ChatGPT, and the GPAI models they are built on (such as OpenAI's GPT-4), to adhere to transparency requirements. These include: drawing up technical documentation explaining how the model has been trained, how it performs, how it should be used and its energy use; complying with EU copyright law (in particular, obtaining authorisation from content owners, or enabling them to opt out of text and data mining of their content, as provided for under the EU DSM Copyright Directive); and disseminating "sufficiently detailed" summaries of the content used for training GPAI, including its provenance and curation methods. Codes of practice will be provided to help providers demonstrate compliance with these obligations until a harmonised standard is published.
Notably, on the question of GPAI model providers identifying and respecting opt out rights, this will be done using methods including "state of the art" technologies and: "Any provider placing a general-purpose AI model on the Union market should comply with this obligation, regardless of the jurisdiction in which the copyright-relevant acts underpinning the training of those general-purpose AI models take place. This is necessary to ensure a level playing field among providers of general-purpose AI models where no provider should be able to gain a competitive advantage in the EU market by applying lower copyright standards than those provided in the Union." (EU AI Act, Recital 106)
The detailed summaries should be comprehensive enough to allow rights holders to be able to exercise and enforce their rights, for example by listing the main data collections or sets that went into training the model, such as large private or public databases or data archives, and by providing a narrative explanation about other data sources used. The EU AI office, responsible for the implementation and enforcement of the EU AI Act, will provide a template.
The EU AI Act obligations requiring providers of GPAI models to comply with EU copyright law apply from August 2025 for new models (and from August 2027 for GPAI models placed on the market before 2 August 2025). However, the exception for text and data mining provided for under the EU DSM Copyright Directive, which allows content owners to opt out of bulk scraping of their online content, is proving challenging to apply in practice. There is currently no standard protocol for a machine-readable "opt out" or for expressly reserving rights.
To assist with this issue, the European Commission is currently working on developing a licensing market for the training data used to train AI systems like ChatGPT. Some copyright owners such as Sony Music, the Society of Authors and the Creators’ Rights Alliance are pre-empting the EU AI Act (while providing a warning within the UK) by publicly reserving their rights in relation to text and data mining via a statement on their website and/or in a letter to various companies including AI developers.
The output of FMs – works created by users
As well as the training data related copyright infringement risks explained above, the outputs of FMs such as ChatGPT or Midjourney, generated as the result of user prompts, may also infringe the copyright in third party original works.
For example, if you are the author of an artwork and find that a markedly similar copy has been generated by a user of an FM without your permission, you will have to make your case on copyright infringement. As a first step, proof of copying of features from the protected work is required. The question is then whether what has been taken constitutes the whole or a substantial part of the copyright work. A challenge with user generated works will be showing that the output was derived from the original copyright protected work: did the AI provider include it in its training data, was it introduced during the fine tuning process, or did a user supply it in a prompt? The EU AI Act goes some way towards addressing this (see above) by allowing copyright owners to see whether their work is contained in a particular dataset. The UK has so far been silent on this issue, but transparency requirements may be coming.
In this scenario, users of FMs (and/or AI providers) face potential liability for copyright infringement. Such claims may be low value and challenging for rights holders to prove, so the risk may be low, but it is nevertheless real for AI users and providers. Consequently, a number of key players (Microsoft, Google and OpenAI) now offer to indemnify certain (mainly enterprise) users who are subsequently sued for copyright infringement. Microsoft's Customer Copyright Commitment states that if a third party sues a commercial customer for copyright infringement for using Microsoft’s Copilots or the output they generate, Microsoft will defend the customer and pay the amount of any adverse judgments or settlements resulting from the proceedings, as long as the customer has used the guardrails and content filters built into its products. OpenAI's "Copyright Shield" promises to step in and defend its customers, and pay the costs incurred, if they face claims of copyright infringement; this applies to generally available features of ChatGPT Enterprise and OpenAI's developer platform. Note that some of these indemnities may include carve-outs and liability caps.
Protection for the outputs of AI FM models
Most public facing generative AI models are accessed via a platform or website and are therefore subject to website terms and conditions. OpenAI's terms for ChatGPT state: "Ownership of Content. As between you and OpenAI, and to the extent permitted by applicable law, you (a) retain your ownership rights in Input and (b) own the Output. We hereby assign to you all our right, title, and interest, if any, in and to Output."
What is actually being assigned is an important consideration for businesses and individuals. For example, use of FMs appears to be high in the advertising sector. If you, as a user, have produced marketing materials with the assistance of an FM, you are likely to want to prevent their unauthorised use by third parties as a normal part of your business' brand/content protection strategy. This would not normally be problematic if the materials were created without FM assistance: the copyright would likely belong to the company concerned as the employer of the employee author. However, most jurisdictions, including the UK, extend copyright protection only to works created by human authors, so if the work is solely computer generated there may be a subsistence issue. This is because authorship and ownership of copyright are tied to the concept of "originality"; protection is only extended to works categorised as "original literary, dramatic, musical or artistic works". The work may, of course, be attributed to the developer of the FM where the user's role is confined to a single simple prompt and the FM has been fine tuned to produce marketing materials; in that situation there are likely to be terms that assign the developer's rights in the works to the end user.
At this point the apparently prescient provisions of the Copyright, Designs and Patents Act 1988 (CDPA 1988) granting protection to computer-generated works (CGWs) may be considered. Section 9(3) states that the author of a CGW is the person by whom "the arrangements necessary for the creation of the work are undertaken". The problem with this section relates to the date of the Act: 1988. What the legislators may have had in mind at the time was something like the use of computers as digital aids in cartography. Now, however, the section is being applied to GenAI models.
However, since 1988 there have been developments in the law of "originality". Under the current test, to be original a work must be "the author's own intellectual creation", whereby the author has been "able to express their creative abilities in the production of the work by making free and creative choices so as to stamp the work created with their personal touch…" That definition is not very CGW/AI friendly. Where works are created by entering prompts into a GenAI system (i.e. using it merely as a tool), there is room to apply the "author's own intellectual creation" originality test. However, literary, dramatic, musical or artistic CGWs are more problematic under this test where a work has no human author. Therefore, in order to claim authorship and ownership, emphasising the human element may be the best approach until clarification is provided by the UK government or the courts. The position is not clear cut, though, and if you are creating content for a client, the Ts & Cs relied on historically for human authored work may not be effective in transferring absolute ownership.
In November 2023, a Chinese court found that an AI generated image, created using Stable Diffusion, satisfied the requirement of "originality" and was capable of copyright protection. The Beijing Internet Court found that the image had been created (using AI as a tool) in a way that reflected the ingenuity and original intellectual investment of a human being. In February 2023, in a US case concerning authorship of the images contained within Kristina Kashtanova's work Zarya of the Dawn, the US Copyright Office took a different approach. The images were developed using the generative AI tool Midjourney. By its own description, Midjourney does not interpret prompts as specific instructions to create a particular expressive result (Midjourney does not understand grammar, sentence structure, or words like humans do); it instead converts words and phrases “into smaller pieces, called tokens, that can be compared to its training data” and then uses them to generate an image. The US Copyright Office decided that the images claimed were not original works of authorship protected by copyright because they were produced by a machine or mere mechanical process operating randomly or automatically, without any creative input or intervention from a human author (the designer modifying the images produced by the AI model using subsequent prompts and inputs was not sufficient to fulfil the requirement for human creativity). They were therefore removed from the US Copyright Office register as not copyrightable. Because of the significant distance between what a user may direct Midjourney to create and the visual material Midjourney actually produces, the US Copyright Office found that Midjourney users lack sufficient control over generated images to be treated as the “master mind” behind them.
How might these issues impact those developing and interacting with FM?
This is a complex area that is tricky to navigate in a commercial setting, given that the UK and many other jurisdictions have yet to reach a settled position and provide guidance. However, it is worth keeping up to date on, and in mind, the following live issues:
- the risk surrounding the use of data sets
- that there may be a need to disclose the contents of data sets under the EU AI Act and the UK framework
- who owns FM outputs? Is an AI output as protectable as a human created work?