insAIde #35 [Legal & Policy]: Could the New York Times win its case?
An analysis of the case that could change everything
👀 Some updates before we begin.
🇮🇹 🔜 🇬🇧 insAIde is being updated. We have decided to switch to an English-language newsletter. In the coming days we will update all the articles published so far, adding an English translation at the beginning and an English title. The work on AI is so extensive and global that we want to share it with as many readers as possible.
Therefore, if you are an English-speaking reader, please subscribe too!
Enjoy the read!
What about ©️?
On 27 December 2023, The New York Times filed a lawsuit against OpenAI and Microsoft in federal district court in New York for copyright infringement.
The lawsuit is only the latest in a long series concerning copyright, brought mostly by writers and artists, alongside the measures taken by various authorities to protect privacy, including the intervention of the Italian Garante in March 2023, a case in which our law firm, PANETTA, was called in to advise. But data protection is only one side of the coin; equally relevant, and equally nebulous, are the issues relating to copyright.
The complaint drafted by the New York Times' lawyers, 69 pages long, opens with an introduction on the importance of journalism for democracy, and on how OpenAI and Microsoft, by using millions of Times articles without consent and with an economic impact on the newspaper's business model, endanger the service that the NYT has been providing to society for over 170 years.
📰 The position of the Times
The NYT's position is very clear. Quality journalism needs to be funded, and for the NYT this happens through subscriptions (10 million subscribers), advertising (reaching 50-100 million users weekly), and commissions earned from products recommended on its Wirecutter site. For the NYT, OpenAI and Microsoft enriched themselves by exploiting the newspaper's massive archive without permission and without any financial recognition.
Among the disputed points, the NYT claims to be a highly prominent source in the Common Crawl dataset used by OpenAI to train ChatGPT, representing the most represented proprietary source after the US patent database and Wikipedia, totalling 66 million documents. For the newspaper, this prominence should be recognised economically.
The second point concerns both the unauthorised reproduction of its content and the creation of content that substitutes for the Times, with an associated economic loss.
The newspaper argues that ChatGPT allows users to obtain detailed summaries of Times articles and to dispense with reading them in the newspaper, thus depriving it of important revenue.
This result, it contends, is quite different from that of search engines, which present a very short snippet of a few words but redirect the reader to the original article.
In other cases, however, ChatGPT reproduced substantial portions of Times articles word for word. In one reported example, a user asks ChatGPT to show the paragraphs of an article protected by the paywall, and ChatGPT complies.
The NYT then complains that when ChatGPT shows an article from Wirecutter, its publication featuring reviews of consumer goods, the provided link through which the NYT earns a commission when a reader buys one of these goods online is missing, thus eliminating a possible source of revenue for the NYT.
Lastly, the NYT complains that so-called 'hallucinations', incorrect answers presented as true, led ChatGPT to attribute to the NYT content that it never produced, sometimes mixing information from an original article with invented information. The result is to attribute inaccurate content to the Times, damaging its reputation.
The Times is therefore asking for damages, an injunction requiring OpenAI and Microsoft to stop using its content, and the destruction of the large language models and datasets built on the unauthorised exploitation of the Times' content.
🇺🇸 ©️ What the American doctrine of Fair Use says
Until now, all providers of large language models, such as OpenAI, have relied on the fair use doctrine to use billions of pieces of online content, without permission, to train their systems.
Unlike Italian copyright law, where exceptions are listed exhaustively, American copyright takes a more open approach through fair use, a doctrine born in the 19th century and codified in the 1976 Copyright Act. Under fair use, a judge, applying a set of criteria, may find that the unauthorised use of a copyrighted work is not unlawful.
The judge would have to assess, on a case-by-case basis, firstly the purpose and character of the use, whether of a commercial nature or for educational and non-profit purposes. "Moreover, 'transformative' uses are more likely to be considered fair. Transformative uses are those that add something new, with an additional purpose or a different character, and do not replace the original use of the work." Using, for example, a short excerpt of a film in a YouTube video of commentary, criticism, or parody of the film may be considered fair use.
The second factor to take into account is the nature of the copyrighted work. The more creative the work, the more difficult it will be to prove fair use. Conversely, the use of news content is more likely to qualify as fair use.
One must then look at the amount and substantiality of the portion used in relation to the copyrighted work as a whole. Going back to the example of the criticism video on YouTube: how long is the excerpt I am using? A few seconds or several minutes? That said, courts have in some cases found even the use of a small excerpt unfair because it constituted the 'heart' of the work itself.
Last but not least, the effect of the use on the potential market for, or value of, the copyrighted work needs to be assessed. Posting all the highlights of a film on YouTube, even without showing it in its entirety, would deter potential filmgoers from going to see it, resulting in financial losses for the producer.
There is, however, no mathematical rule that can be applied with certainty and each case is context-specific.
TRAINING
🇺🇸 Would fair use apply here?
Limited to the training phase, a balancing of the four factors would make us lean towards an affirmative answer. Although millions of articles from the Times have been used, the primary purpose is to 'teach' the system how to compose sentences that are correct in syntax and meaning, and hopefully also true. There is thus a transformative use of the original work, not an interest in copying and distributing it.
🇪🇺 The European Copyright Directive (in force) and the AI Act (coming soon)
But what do current, and soon forthcoming, European rules say on this point?
As far as the training phase is concerned, the EU Copyright Directive 2019/790 provides in Article 3(1) for the legality of "reproductions and extractions made by research organisations and cultural heritage institutions in order to carry out, for the purposes of scientific research, text and data mining of works or other subject matter to which they have lawful access." But OpenAI has long since ceased to be a mere research organisation, having abandoned the non-profit nature with which it was conceived. The directive does, however, provide in Article 4(3) an exception for private operators as well, with the limitation, this time, that the use "has not been expressly reserved by their rightholders in an appropriate manner, such as machine-readable means in the case of content made publicly available online."
As Recital 18 of the directive explains, while research organisations can extract data and information from sources to which they have lawful access, even those covered by copyright, without constraints and without having to pay financial compensation, private operators must stop where the rights holder decides not to agree. This reservation can be expressed, at a technical level, by instructing crawlers via robots.txt not to perform data extraction, or, for instance, by making the prohibition explicit in the terms and conditions.
From a technical point of view, OpenAI (for ChatGPT) and Google (for Bard) have facilitated this opt-out by allowing publishers to deny access to their crawlers, which the New York Times did for both companies by blocking both the GPTBot and Google-Extended bots, as pointed out by Bruno Saetta in Valigia Blu.
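As a minimal sketch of what this opt-out looks like in practice, a publisher can add rules like the following to its site's robots.txt file; GPTBot and Google-Extended are the user-agent tokens documented by OpenAI and Google, respectively:

```txt
# Block OpenAI's web crawler used to collect training data
User-agent: GPTBot
Disallow: /

# Block Google's AI-training token (does not affect Search indexing)
User-agent: Google-Extended
Disallow: /
```

Note that Google-Extended is not a separate crawler but a control token honoured by Google's existing crawlers, so blocking it excludes a site's content from AI training without removing it from search results.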
In Italy, for example, almost all newspapers display the words 'all rights reserved' at the bottom of each article, constituting a de facto opt-out. Were European legislation to apply, this wording was missing from the NYT's articles, and until the prohibition was made explicit in the terms and conditions or in some other identifiable way, nothing could be contested.
With the upcoming AI Act, then, to increase the level of transparency, large language model providers will have to "draw up and make publicly available a sufficiently detailed summary about the content used for training of the general-purpose AI model, according to a template provided by the AI Office”. This summary should, for instance, list the main collections or datasets that were used for training the model, such as private or public large databases or data archives, and provide a narrative explanation of other data sources used.
This will make it easier for rights holders to identify possible copyright infringements if they have denied the possibility of free data extraction.
PUBLISHED ARTICLES
Regarding the possibility of reading Times articles in their entirety, this may be one of those errors that belong to the continuous process of learning the limits and possibilities of these systems. Companies operating in this field, even with all their efforts and imagination, have no precedents or policies to anticipate every possible misuse of an LLM. In this sense, the AI Act will provide some indications, which however cannot foresee all possible outputs. It is therefore important to maintain an ongoing dialogue with stakeholders and authorities in order to correct those misuses that have not yet been identified.
In the case of the NYT, the attached evidence shows that parts of the newspaper were reproduced in their entirety, without any licence. But it would also be necessary to prove the extent of this 'abuse' in order to quantify the damage more precisely. If, on the other hand, it were OpenAI's intention to give access to excerpts of newspaper articles, the NYT could make the fourth fair use factor, the effect on the original market, weigh heavily: readers could obtain detailed summaries of NYT articles, directly affecting its business.
In Europe, on this point, the Copyright Directive requires under Article 15 a licence for the publication of journalistic extracts, which today also applies to snippets, the previews generated on social networks when a link to an article is shared. Similar legal solutions have spread in other countries as well, at the request of publishers to legislators, in order to offset the advertising losses suffered when Google and Meta entered the market.
In the present case, the NYT, unable to reach an economic licensing agreement with OpenAI and Microsoft for the use of its articles, took legal action. While the NYT reports that it has always reached agreements with large technology companies such as Google, Meta and Apple for the use of its content, it should be remembered that OpenAI has reached similar agreements with other large publishing groups, such as Axel Springer (publisher of Politico, Business Insider, Bild and Welt) and the Associated Press. According to The Information, as reported by The Verge, such agreements would amount to between USD 1 and 5 million per year, while Apple, the latest entrant in the race for generative AI, would offer much higher sums in exchange for broader licensing agreements, a proposal the publishers may not yet be ready to accept.
🤖 OpenAI's position
In its statement of 8 January, OpenAI reported that it was taken by surprise by the lawsuit, given that the last communication about a collaboration was only ten days earlier. The company also reports that cases of 'regurgitation', i.e. when ChatGPT reproduces entire chunks of its training data, are rare, and that its commitment is to bring them to zero. On the point most relevant to the case, according to OpenAI, the examples brought by the NYT were carefully selected and elicited through several attempts, referring to dated articles whose content had been taken up by other sources on the web and was thus also available through other channels.
The case will nevertheless have an impact on the world of AI
It is a case that will nonetheless have an impact on the development of present and future LLMs, since one of the demands is the deletion of models trained on datasets containing the NYT's articles.
If, for mere training, American fair use and Article 4 of the Copyright Directive should relieve OpenAI and Microsoft of liability, for the use of the articles in the output phase corrections will be necessary, or, as is already done with other partners for more extensive uses, the acquisition of licences.
In reality, as we have also seen in the area of data protection, many 'old' rules are perfectly suited to regulating these problems, provided we are open to applying them to new cases, which by their nature are much more unpredictable.
If you understand Italian, Vincenzo Tiani discussed the case in the Podcast “Actually”, with Riccardo Haupt and Riccardo Bassetto.
⏰ That's all for us, see you insAIde, next Tuesday, at 08:00.
Rocco Panetta , Federico Sartore , Vincenzo Tiani, LL.M. , Davide Montanaro , Gabriele Franco