
TDM Exceptions and Copyright: A German Court Decision

Is the creation of a dataset that can be used to train generative AI systems (GenAI systems) an infringement of copyright? According to a recent decision of a German court, the activity may be covered by the text and data mining (TDM) exceptions in Articles 3 and 4 of the Digital Single Market Directive (EU) 2019/790 (the DSM Directive). The decision is an important one. However, it leaves many questions unanswered, including whether using such a dataset to train a GenAI system would be infringing. Given the uncertainties surrounding how copyright laws apply to the training of AI systems, the decision provides at least some useful guidance on how EU copyright law will treat the creation of datasets that may be used for AI training.

Facts

The decision arose from a suit brought by photographer Robert Kneschke against LAION e. V. (Hamburg Regional Court, File Number: 310 O 227/23, 27.09.2024, unofficial English translation here). He claimed that LAION’s scraping of his photo from a stock photo website, to create a dataset to be used for AI training, infringed his copyright in the photo.

LAION created a dataset of image-text pairs. It is a tabular document that contains hyperlinks to images or image files publicly accessible on the Internet as well as other information related to the respective images, including an image description (also called alternative text), which provides information about the content of the image in text form. The dataset includes 5.85 billion such image-text pairs. LAION relied on an existing dataset of a random cross-section of images found on the Internet, which contained the respective URLs along with a textual description of the respective image content. It extracted the URLs of the images from this dataset and downloaded the images from their respective storage locations. The images were analyzed by software to determine whether the description of the image content in the preexisting dataset actually corresponded to the content visible in the image. Where the two matched, the metadata, including the URL of the image’s storage location and the image description, was extracted and included in the LAION dataset.
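The filtering steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not LAION's actual code: the names `Candidate`, `build_dataset`, `toy_fetch` and `toy_score`, the threshold value, and the in-memory `FAKE_WEB` are all hypothetical, and the toy scoring function merely stands in for the image-analysis software referred to in the decision.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass(frozen=True)
class Candidate:
    url: str       # storage location of the image
    alt_text: str  # description taken from the pre-existing dataset


def build_dataset(
    candidates: Iterable[Candidate],
    fetch_image: Callable[[str], bytes],
    match_score: Callable[[bytes, str], float],
    threshold: float,
) -> Iterator[Candidate]:
    """Yield the metadata of pairs whose image matches its description."""
    for cand in candidates:
        image = fetch_image(cand.url)              # download (the reproduction)
        score = match_score(image, cand.alt_text)  # automated analysis
        if score >= threshold:
            yield cand  # only URL and text are kept, not the image bytes


# Hypothetical stand-ins so the sketch runs without network access.
FAKE_WEB = {
    "https://example.com/a.jpg": b"dog beach",
    "https://example.com/b.jpg": b"city skyline",
}


def toy_fetch(url: str) -> bytes:
    return FAKE_WEB[url]


def toy_score(image: bytes, text: str) -> float:
    # Pretend the image bytes are content tags; score = share of tags in the text.
    tags = set(image.decode().split())
    words = set(text.lower().split())
    return len(tags & words) / len(tags) if tags else 0.0


candidates = [
    Candidate("https://example.com/a.jpg", "A dog on the beach"),
    Candidate("https://example.com/b.jpg", "Fresh fruit on a table"),
]
kept = list(build_dataset(candidates, toy_fetch, toy_score, threshold=0.5))
```

In this toy run only the first pair survives, because the second description does not correspond to its image. The point of the sketch is the shape of the process: each image is copied and analyzed, but the resulting dataset holds only the link and the description.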

As part of the TDM process, the plaintiff’s disputed image was captured, downloaded, analyzed, and incorporated into the dataset. A preview image (likely a thumbnail) of a watermarked image file was downloaded from a file posted on a publicly accessible website of a photo agency.

The agency website displayed the following restrictions:

“RESTRICTIONS:

YOU MAY NOT:…

Use automated programs, applets, bots or the like to access the XXX.com website or any content thereon for any purpose, including, by way of example only, downloading content, indexing, scraping, or caching any content on the website.”

The plaintiff alleged that the copying of the photo was a reproduction contrary to the German Copyright Act (UrhG). There was no dispute that the TDM activities implicated the reproduction right. However, the defendant claimed, among other things, that the activities were covered by two exceptions for TDM activities, namely the exceptions in § 44b and § 60d of the UrhG. These exceptions were enacted to implement the TDM exceptions in Articles 3 and 4 of the DSM Directive. § 44b implemented Article 4, while § 60d implemented Article 3.[1]

The TDM Exceptions in the DSM Directive

Article 4 of the DSM Directive requires Member States to provide for an exception “for reproductions and extractions of lawfully accessible works and other subject matter for the purposes of text and data mining”. The reproductions and extractions may be retained for as long as is necessary for the purposes of text and data mining. This TDM exception only applies “on condition that the use of works and other subject matter” “has not been expressly reserved by their rightholders in an appropriate manner, such as machine-readable means in the case of content made publicly available online”.

Article 3 of the DSM Directive requires Member States to provide an exception “for reproductions and extractions made by research organisations and cultural heritage institutions in order to carry out, for the purposes of scientific research, text and data mining of works or other subject matter to which they have lawful access”.  Copies of works or other subject matter must be stored with an appropriate level of security “and may be retained for the purposes of scientific research, including for the verification of research results”.

Under the DSM Directive the term “text and data mining” is defined to mean “any automated analytical technique aimed at analysing text and data in digital form in order to generate information which includes but is not limited to patterns, trends and correlations”.

The interpretation of the TDM exceptions

Article 4 of the DSM Directive

The German court first examined whether the defendant could rely on the exception in § 44b. The court expressed the opinion that the defendant could likely show that the creation of the dataset was carried out for the purposes of text and data mining. However, it concluded that it was doubtful that the defendant would succeed on this ground due to a valid reservation within the meaning of § 44b.

According to the court, the exception in the German copyright law defined “text and data mining” as “the automated analysis of individual or multiple digital or digitized works to extract information, particularly about patterns, trends, and correlations.” The court held that it was unnecessary to decide the further question of whether the training of AI systems was covered by the exception. However, the creation of the LAION dataset was covered.

“The defendant undertook the reproduction action for the purpose of extracting information about “correlations” within the literal meaning of § 44b para. 1 UrhG. The defendant downloaded the disputed photograph from its original storage location in order to compare the image content with the image description already stored in the text using an available software application — evidently the XXX application from XXX.

This analysis of the image file for comparison with a pre-existing image description undoubtedly constitutes an analysis for the purpose of extracting information about “correlations” (namely, the question of whether there is agreement or disagreement between images and image descriptions). The plaintiff did not dispute that the defendant analyzed the images included in the XXX dataset in this manner.”

The court rejected the argument that the TDM exception did not apply on the basis that web scraping for the creation of datasets to train AI systems was not contemplated at the time the exception was enacted. It also rejected the argument that the exception could not apply because the dataset could be used for training AI systems. The latter conclusion was premised, in part, on Article 53 of the EU AIA (aka the EU AI Act), which “unequivocally” demonstrated the EU legislator’s intent that “providers of general-purpose AI models are required to implement a strategy, in particular, to identify and comply with a rights reservation asserted under Article 4 (3) of the DSM Directive”, as well as on the legislative history of Germany’s implementation of the DSM Directive.

The court also rejected the plaintiff’s argument that the exception was to be more narrowly construed to avoid violating the three-step test enshrined in Article 5 (5) of the InfoSoc Directive.[2]

Despite its view that the TDM activities at issue fell within the wording of the exception, the court doubted that the defendant could meet the condition that the plaintiff had not validly opted out of the TDM exception.

The court found that the image file was lawfully accessed. The “preview file” was freely available for download from the image agency’s website. However, in the provisional view of the court, the agency’s general website notice was sufficient to constitute a validly declared reservation. In so ruling, the court gave a generous reading to how the opt-out could be exercised. In particular:

  • The opt out right could be manifested by the copyright holder or “subsequent rights holders, whether they are legal successors or holders of derivative rights from the original author”. This included the agency which held “sublicensable usage rights to the original image”.
  • The opt out did not have to be specific to the photo reproduced. A restriction in the agency’s general terms was sufficient.
  • The opt out could be effective even if expressed in a natural language. That was sufficient to comply with the “machine readability” requirement. In this regard, the court seemed to accept that “state-of-the-art technologies” already include AI applications capable of comprehending text written in natural language.[3]
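To make the contrast between opt-out formats concrete, the sketch below checks a reservation in two ways: a `robots.txt`-style rule parsed with Python's standard `urllib.robotparser`, and a crude keyword scan of natural-language terms. It is illustrative only. The function names, the crawler name `ExampleCrawler`, the URLs and the keyword pattern are assumptions, and the keyword scan is far simpler than the language-comprehending AI applications the court appeared to have in mind; nothing here reflects a legal test.

```python
import re
from urllib.robotparser import RobotFileParser


def robots_allows(robots_lines: list[str], user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt lines permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules supplied as a list of lines
    return parser.can_fetch(user_agent, url)


# Hypothetical heuristic: words that hint at a reservation against automated use.
RESERVATION_HINTS = re.compile(
    r"\b(bots?|scraping|automated programs?)\b", re.IGNORECASE
)


def terms_suggest_reservation(terms_text: str) -> bool:
    """Very rough check for a natural-language restriction on automated access."""
    return RESERVATION_HINTS.search(terms_text) is not None


blocking_robots = ["User-agent: *", "Disallow: /"]
photo_url = "https://example.com/photo.jpg"

allowed_when_blocked = robots_allows(blocking_robots, "ExampleCrawler", photo_url)
allowed_when_silent = robots_allows([], "ExampleCrawler", photo_url)

site_terms = (
    "YOU MAY NOT: Use automated programs, applets, bots or the like "
    "to access the website or any content thereon for any purpose."
)
terms_flagged = terms_suggest_reservation(site_terms)
```

A site whose `robots.txt` is silent would pass the first check even though its terms of use contain a restriction like the one quoted earlier, which is why the court's acceptance of natural-language reservations matters in practice for anyone building a crawler.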

Article 3 of the DSM Directive

The court then examined whether the defendant could rely on the exception in § 60d UrhG. The court held it could.

In accordance with Article 3 of the DSM Directive, reproductions made by research organizations to carry out TDM for the purposes of scientific research are permitted. The court, having already analyzed the meaning of text and data mining, concluded that the creation of the dataset fell within that definition.

The court went on to find that the “scientific purpose” requirement was also met.

Scientific research generally refers to the methodical and systematic pursuit of new knowledge…The term “scientific research” is not to be understood narrowly, as it already considers the methodical and systematic “pursuit” of new knowledge to be sufficient. It does not only encompass the steps directly linked to the generation of new insights; rather, it suffices if the step in question is aimed at achieving a (later) knowledge gain, as is the case with many data collections, which must first be carried out to later draw empirical conclusions. Specifically, the term “scientific research” does not require a successful research outcome.

Thus, contrary to the plaintiff’s view, the creation of a dataset of the type in question, which can serve as a basis for training AI systems, can certainly be considered scientific research in the above-mentioned sense. Although the creation of the dataset itself may not yet be associated with a knowledge gain, it is a fundamental step aimed at using the dataset for the purpose of later knowledge acquisition. Such an objective can be affirmed in the present case. It is sufficient that the dataset was—undisputedly—published for free and thus made available, particularly to researchers working in the field of artificial neural networks. Whether the dataset is also used by commercial enterprises for training or further development of their AI systems, as the plaintiff claims regarding the services … is irrelevant, because even research conducted by commercial enterprises is still research—although not privileged as such under §§ 60c ff. UrhG.

Therefore, the disputed question between the parties as to whether the defendant, beyond the creation of such datasets, also engages in scientific research in the form of developing its own AI models is not relevant in this context.

The court also found that the defendant’s TDM activities were not for a commercial purpose.

Comments on the LAION decision

The decision of the Hamburg court is an important one. It is the first EU decision to interpret the requirements of Articles 3 and 4 of the DSM Directive. The decision provides guidance on how the definition of “text and data mining” is to be construed. It also provides guidance on how the opt-out in Article 4 of the DSM Directive may be construed, although the decision on that point is only provisional and not binding. The decision is also important as it suggests that Article 53 of the EU AIA is applicable to Article 4 TDM activities that are carried out to create datasets that may (or may not) be used for training AI systems. For more on the interrelationship between the DSM Directive and the EU AIA, see Barry Sookman, Understanding the AIA Copyright Provisions in the EU Artificial Intelligence Act.

The decision still leaves many questions unanswered. For example, it did not address whether making available a dataset that is lawfully created by relying on a TDM exception for subsequent training of a GenAI system would also be covered by the TDM exception or whether any such uses could still be infringing. It also did not expressly confirm that training of a GenAI system would be covered by a TDM exception in light of Article 53 of the EU AIA.[4] The provisional view that opt-outs under Article 4 of the DSM Directive can be expressed in natural language rather than in other formats such as robots.txt, ai.txt, or other communication standards also leaves open many questions.[5]

This article was first published on barrysookman.com

___


[1] The defendant also claimed the benefit of an exception for making temporary reproductions in § 44a UrhG. According to this provision, temporary reproductions are permitted if they are fleeting or incidental and form an integral and essential part of a technical process, and their sole purpose is to enable a transmission in a network between third parties by an intermediary or a lawful use of a work or other protected subject matter, and the reproduction has no independent economic significance. The court held that these conditions were not met, relying on the decision of the CJEU in Case C-5/08 – Infopaq/Danske Dagblades Forening.

[2] “The reproduction relevant under copyright law in this case is limited to the purpose of analyzing the image files for their correspondence to a pre-existing image description, followed by incorporation into a dataset. There is no indication, nor has the plaintiff claimed, that this use would impair the potential exploitation of the respective works.

While the dataset created in this way may subsequently be used to train artificial neural networks, and the resulting AI-generated content may compete with works created by (human) authors, this alone does not justify viewing the mere creation of training datasets as an impairment of the exploitation rights of works within the meaning of Article 5 (5) of the InfoSoc Directive. This must apply, if only for the reason that considering merely future technological developments—which cannot yet be fully foreseen—does not allow for a legally certain distinction between permissible and impermissible uses (see similarly above (b)).

Since, based on current technological developments, it can never be ruled out with certainty that insights gained through text and data mining will be used to train artificial neural networks that may then compete with human authors, the opposing view would ultimately require a complete prohibition of text and data mining within the meaning of § 44b UrhG. However, such a complete nullification of the exception rule would obviously contradict the legislative intent and, therefore, cannot represent a viable interpretation.”

[3] “However, there are indications that the defendant already had access to suitable technology. According to the defendant’s own submissions, the analysis conducted in the context of creating the dataset XXX, in the form of comparing image content with pre-existing image descriptions, clearly required semantic recognition of these image descriptions by the software used. In this context, there is much to suggest that—particularly for the defendant—systems were already available in 2021 that were capable of recognizing a reservation of use formulated in natural language in an automated manner.”

[4] Some of these questions are raised in the blog post by Eleonora Rosati, The German LAION decision: A problematic understanding of the scope of the TDM copyright exceptions and the transition from TDM to AI training, and in Ronak Kalhor-Witzel, German Court Says Non-Commercial AI Training Data Meets Scientific Research Exception to Copyright Infringement.

[5] See, Kristina Ehle, and Yesim Tuzun, To Scrape or Not to Scrape? First Court Decision on the EU Copyright Exception for Text and Data Mining in Germany.
