Core Copyright Issues in AI Large Model Training

baoshi.rao

The rapid rise of AIGC large models has reshaped the creative logic of the content industry and empowered people's work and lives, while also raising new challenges for the copyright system. Compared with the attribution of copyright in AI-generated content and the allocation of rights in it, the training phase of large models raises its own questions, which have drawn widespread attention: which uses of copyrighted works does training involve, what infringement risks arise in the process, and how can existing overseas approaches inform a sound exemption mechanism for large model training from the perspective of industrial development? As countries worldwide compete in the AIGC field, these questions urgently need study and discussion in order to reduce the uncertainty facing technological progress and industrial development.

On one hand, the issue of large model training arises at the very beginning of the generative AI lifecycle. If it is not properly addressed, the development of AIGC large models will remain under a persistent cloud of possible infringement. In terms of industry practice and technical principles, the way generative AI systems currently use massive content data for model training can be roughly abstracted into two steps. First, massive content data is obtained by purchasing databases, crawling public sources, and similar means, and after certain transformations is stored on the relevant servers. Second, the content data is analyzed and processed to identify patterns, trends, and correlations, which are converted into large model parameters for use in subsequent content generation.

On the other hand, current copyright disputes in the generative AI field mostly concern the unauthorized use of copyrighted works during the model training phase. According to incomplete statistics, from November 2022 to October 2023, the U.S. District Court for the Northern District of California alone accepted 10 cases in which copyright holders sued AIGC developers such as Stability AI, OpenAI, Meta, and Alphabet for unauthorized use of copyrighted works in model training. In June 2023, disputes also arose in China's online education and training industry over the unauthorized use of third-party platform content data for large model training.

First, the legal basis for authorization is unclear, and the specific exclusive rights that would need to be licensed require further examination. On the surface, AIGC model training resembles the "thinking, absorption, and re-creation" process of natural persons reading literary works or appreciating artworks, which does not directly correspond to any existing exclusive right under copyright. [1] It is important to note that a model's learning and imitation of artistic styles is not an issue regulated by copyright law. Artistic styles should remain freely accessible to the public, as this relates to freedom of expression and the development of creative industries. Even if this behavior were brought under copyright regulation, copyright holders would face practical difficulties in enforcement, because AIGC model training is essentially an internal, non-explicit use of works by machines. This makes it hard for copyright holders to detect infringement, provide evidence, and compare content.

Second, the feasibility of authorization is questionable, with issues such as scale, unclear subjects, and operational mechanisms. AIGC model training involves numerous works from diverse sources with varying ownership. If prior authorization were required: on one hand, protected works would need to be accurately identified and extracted from massive datasets; on the other hand, each copyright holder would need to be located for negotiation and payment of varying licensing fees. This process would be lengthy, complex, and extremely difficult to implement.

Third, the value of authorization needs evaluation, as requiring it may lead to negative effects such as "overfitting," "chilling effects," and "model bias." In practice, any measure that limits the scale or availability of training data may produce an ironic and unintended consequence: it increases the probability that a model simply reproduces content from its training materials. Additionally, high licensing costs and uncertainty about infringement risks could create a "chilling effect" on AI technology and industry development, and insufficient data quantity and quality could lead to "model bias." [2]

From the technical workflow and fundamental principles of AIGC model training, when discussing copyright issues at this stage, we're essentially examining how copyright law should view three behaviors: "work acquisition," "work storage," and "work analysis." Currently, only "work storage" clearly falls under copyright law's "reproduction right," while whether the more crucial "work analysis" behavior should be regulated remains debatable.

In the 'work acquisition' phase, the key question is the legality of how content data is acquired: whether obtaining database content or publicly available online content is lawful, and whether it involves actions such as damaging computer information systems, circumventing anti-scraping protections, or breaching API usage agreements. 'Work acquisition' and 'work storage' are treated separately in discussing copyright issues in the model training phase because, as technology advances, methods such as 'cloud computing' and 'federated learning' may allow a model to be trained through direct contact with content data, without any need to store it.

From a copyright law perspective, 'acquiring works' or accessing them is akin to browsing web pages online or reading books offline. Merely accessing works without subsequent dissemination or utilization is unlikely to trigger copyright infringement liability. The core copyright issue at this stage primarily involves whether there are actions that circumvent the 'technical protection measures' of the works. According to China's Copyright Law, violating the provisions on technical protection measures for works also constitutes infringement. Even if the use of the work falls under 'fair use,' circumventing technical measures to access the work may still lead to infringement liability if such circumvention does not comply with the exemptions under Article 50 of the Copyright Law.

It can be said that, in the model training phase, the storage of content data falls within the scope of the 'reproduction right' under copyright law, which is relatively uncontroversial. However, it is worth noting that with the application of new technologies in content production and dissemination, we must also consider whether there is an over-isolation of the 'reproduction right.' Reproduction is often merely a preparatory act for the 'primary utilization of works.' Without subsequent acts such as distribution, broadcasting, or dissemination via information networks—which are regulated by copyright law—there is no actual infringement harm, and copyright holders cannot determine whether their works have been utilized.

Today, whether the evolution of information technology and business models should allow for a certain degree of 'freedom to reproduce,' similar to the 'cache freedom' created with the advent of the 'safe harbor' system, remains a topic for further discussion. In the 2013 Supreme People's Court's Top Ten Intellectual Property Cases—'Wang Xin (Mianmian) v. Google et al. (Book Search Case)'—the core dispute was whether the initial 'reproduction' constituted a separate infringement (as held by the Beijing First Intermediate Court) or could be absorbed by subsequent fair use (as held by the Beijing High Court).

Specifically, in the first-instance judgment in the Google book search case, the Beijing First Intermediate Court explained why "copying behavior" should be assessed separately in early "text and data mining" cases. On one hand, there is actual harm: copying a work for "use" purposes does not give the public access to the copy, but it lets the copier use the work without purchasing a legal copy, potentially affecting sales of legitimate copies. On the other hand, there is potential harm, which typically stems from copying aimed at "disseminating works" (e.g., distribution, broadcasting, or online transmission). In the context of current AIGC model training, however, the picture is different. First, if the training data is obtained by legal means, actual harm may not be a major concern. Second, potential harm remains debatable, as it is unclear whether content processing in model training constitutes copyright-regulated behavior that could harm copyright holders, a point discussed in detail later.

There is uncertainty about which copyright rights apply to the internal content processing in models, with no clear consensus in theory or practice. Some argue that "content processing" falls under the scope of "adaptation rights" in copyright law. However, adaptation rights refer to creating new works from existing ones, whereas processing data to generate parameters reflecting patterns, trends, and correlations does not involve creating new works, making it difficult to align with "adaptation rights." Others suggest that since current copyright law lacks specific rights for "content processing," it could be regulated under "catch-all clauses."

Another view holds that such behavior does not fall under copyright regulation. Currently, AIGC model training involves two main types of "content processing": In "text-to-image" models like Stable Diffusion, artistic styles, feelings, and inspirations are extracted from existing images and stored as model parameters. In "text-to-text" models like GPT, statistical "autoregressive principles" are used to learn probabilities and patterns of word combinations from vast amounts of existing works, internalizing them as model parameters.
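As a rough illustration of the second type of processing, the sketch below trains a toy "bigram" model that records only how often one word follows another. It is a deliberately simplified stand-in, not how GPT-style models are actually built: real systems use neural networks over tokens rather than lookup tables, and the function name `train_bigram_model` and the three-sentence corpus are hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_model(corpus):
    """Learn next-word probabilities from a list of sentences.

    The returned "parameters" are conditional probabilities
    P(next_word | current_word), not copies of the sentences.
    """
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        # Count each adjacent (current word, next word) pair.
        for current_word, next_word in zip(words, words[1:]):
            counts[current_word][next_word] += 1

    model = {}
    for current_word, followers in counts.items():
        total = sum(followers.values())
        # Normalize counts into probabilities that sum to 1.
        model[current_word] = {w: n / total for w, n in followers.items()}
    return model


# Hypothetical toy corpus standing in for "vast amounts of existing works".
toy_corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]

model = train_bigram_model(toy_corpus)
print(model["sat"])  # every occurrence of "sat" is followed by "on"
print(model["the"])  # probabilities spread over cat, mat, dog, rug
```

What the model retains is a table of word-transition probabilities aggregated across all sentences. Note, however, that with a corpus this small the table can still be used to regenerate sentences close to the originals, which is the same "overfitting" concern raised earlier about restricting the scale of training data.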

Copyright law follows the fundamental logic of the 'idea-expression dichotomy,' emphasizing that 'it does not protect the ideas of natural persons, only their external expressions of those ideas.' For the first type of 'work processing' behavior, it essentially involves only the analysis and learning of artistic styles at the conceptual level of a work. Therefore, the object of this behavior does not fall under the scope of copyright protection, and the behavior itself is not subject to copyright regulation. For the second type of 'work processing' behavior, it involves only the statistical learning of word combination probabilities in a work, not the use or display of expressive content, and thus does not constitute a copyright-infringing act.

The current wave of AI-generated content (AIGC) transformation began with the release of ChatGPT in late November 2022, less than a year ago. National copyright legislation has yet to make targeted adjustments. However, during the earlier stages of weak AI, some countries explored copyright law reforms to exempt AIGC platforms from liability during model training. These approaches can be broadly categorized into three models: the EU's 'Text and Data Mining' model, Japan's 'Non-Appreciative Use of Works' model, and the US's 'Four-Factor Analysis + Transformative Use' model.

As early as September 2016, when the European Commission proposed amendments to copyright law to adapt to the digital economy, 'Text and Data Mining' (TDM) became a focal point. The EU noted that new technologies enable automated computational analysis of digital information, such as text, sound, images, or data. TDM allows for the processing of vast amounts of information to uncover new knowledge and trends. However, TDM often involves large volumes of copyrighted content. To provide legal certainty and encourage innovation, the EU introduced limitations or exceptions for copying and extracting works or other protected materials.

On March 26, 2019, the EU's final 'Directive on Copyright in the Digital Single Market' established Article 3 ('Text and Data Mining for Scientific Research') and Article 4 ('Text and Data Mining Without Purpose Restrictions') under 'Title II: Measures to Adapt Exceptions and Limitations to the Digital and Cross-Border Environment.' The specifics are illustrated in the image below:

Overall, the vast majority of AIGC model training activities currently fall under commercial utilization, qualifying only for the 'text and data mining without purpose restriction' exemption under Article 4. This provision adopts a mechanism similar to 'implied license + opt-out' for text and data mining, with three key points requiring attention.

First, the core exemption of this provision concerns the 'reproduction of works' during text and data mining. The EU's Digital Single Market Copyright Directive legislative background clarifies that reproductions and extractions for text and data mining (where 'extraction' refers to database rights equivalent to copyright 'reproduction') must be performed on legally accessed works or materials, especially when such technical processing doesn't meet existing exemptions for 'temporary reproductions' (i.e., caching under safe harbor provisions). This validates our analysis in Part II regarding copyright infringement risks during model training - unauthorized training primarily risks violating reproduction rights under EU legal logic.

Second, the exemption requires 'lawful access to the works or content being trained on.' The EU specifies this exception only applies when the exempted party has lawful access, including publicly available online content where rights holders haven't reserved these rights appropriately. Previously, since valuable databases were often paywalled, this exception didn't substantially reduce licensing burdens. However, with generative AI like ChatGPT primarily training on publicly available data (Common Crawl, Wikipedia, etc.), the exemption's value becomes significant.

Third, the exemption applies only when copyright holders haven't 'reserved text and data mining rights appropriately.' The EU emphasizes rights holders should be able to implement measures protecting such reservations. Per the Directive's legislative background, 'appropriate reservation' means: for online content, rights must be reserved through machine-readable means including anti-crawling technical protections; for offline physical publications, reservations may use contracts or declarations. Simply put, unless copyright holders proactively reserve rights through technical means or specifically notify platforms about training prohibitions, model training can proceed without authorization or compensation.
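To make the idea of a "machine-readable" reservation concrete, the sketch below shows how a crawler collecting training data might check a site's robots.txt before fetching a page. This is only one possible reading: the Directive does not prescribe a specific protocol, treating robots.txt as the reservation signal is an assumption made here for illustration, and the crawler names `ExampleTrainingBot` and `ExampleSearchBot` and the helper `may_collect` are hypothetical. The code uses Python's standard `urllib.robotparser` module and parses an inline string, so it makes no network requests.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt in which the site owner blocks one training
# crawler entirely while leaving the site open to all other agents.
ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Allow: /
"""


def may_collect(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


url = "https://example.com/articles/1"
print(may_collect(ROBOTS_TXT, "ExampleTrainingBot", url))  # False
print(may_collect(ROBOTS_TXT, "ExampleSearchBot", url))    # True
```

Under an "implied license + opt-out" design, a developer relying on the exemption would skip any content for which such a check fails. Whether a given signal actually counts as an "appropriate reservation" is a legal question that this sketch does not answer.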

In 2018, Japan amended its Copyright Law, introducing a new fair use provision in Article 30-4 for 'utilization not intended to appreciate the original value of the work.' According to the Agency for Cultural Affairs, this amendment generally expanded copyright limitations to encourage innovation in response to the Fourth Industrial Revolution represented by AI, IoT, and big data. Notably, the latest amendment to Japan's Copyright Law passed the House of Councillors on May 17, 2023, without modifying Article 30-4, suggesting that legislators believe this provision adequately addresses the copyright challenges posed by generative AI.

Under this provision, the use of works during AIGC model training can qualify as 'non-appreciative utilization' under two of its listed cases: 'use for information analysis' and the catch-all 'use in computer processing where the work's expression is not perceived or recognized by humans.' Therefore, as long as the use during model training does not 'unreasonably prejudice the interests of copyright owners considering the nature, purpose, and manner of use,' it will likely qualify for exemption under Article 30-4.

Key aspects of Japan's 'non-appreciative utilization' exemption include:

1. It essentially corresponds to 'non-expressive use' of works, clarifying a category of non-infringing behaviors rather than providing infringement exemptions. The specific cases listed involve information analysis and machine processing that don't involve public dissemination of the work's expressive content, thus not constituting copyright-relevant use.

2. The exemption isn't limited to 'storage of works' but covers 'any manner of utilizing works within necessary limits.' This avoids disputes about specific behaviors during AIGC model training, such as whether processing constitutes copyright-relevant use. Japan's approach provides clearer expectations for model developers.

3. The provision includes a limiting proviso: 'However, this shall not apply if the type, purpose, or method of use of the work unfairly prejudices the interests of the copyright owner.' As previously mentioned, given how current AIGC models generate content, it remains questionable whether model training falls under the exclusive rights regulated by copyright law at all, so it does not significantly conflict with the normal exercise of copyright holders' rights. Moreover, since model training fundamentally involves 'non-expressive use' of prior works and learns styles and concepts at an abstract level, it does not substantially replace the original market for the dissemination and utilization of such works.

In May 2023, the Japanese government clarified its stance on model training under copyright law: it will not extend copyright protection to content used in AIGC model training. Keiko Nagaoka, Japan's Minister of Education, Culture, Sports, Science and Technology, stated that Japanese law does not protect copyrighted materials used in AIGC training datasets, effectively permitting the use of copyrighted works for model training, whether for non-profit or commercial purposes, and whether through reproduction or other means. This stance partially confirms that the exemption under Article 30-4 of Japan's Copyright Act for 'use not intended to appreciate the original value of the work' can be applied to current AIGC model training practices.

On May 17, 2023, a subcommittee of the U.S. House Judiciary Committee held a hearing titled 'Artificial Intelligence and Intellectual Property: Part I, Interoperability of AI and Copyright Law,' where former U.S. Copyright Office General Counsel Sy Damle remarked: 'Any attempt to mandate licensing fees for training content would either bankrupt the U.S. AI industry, eliminating our competitiveness on the global stage, or drive leading AI companies out of the country.' The U.S. has become a global hub for AI research and development largely due to its uniquely broad and flexible fair use provisions in copyright law, which are believed to remain applicable to AIGC models. On this view, these models extract abstract concepts and patterns from billions of training data points and create entirely new content that differs from and does not infringe upon existing works.

U.S. copyright law defines the fair use doctrine through an 'illustrative enumeration + general requirements' approach, offering significant flexibility. Section 107 specifies that in determining whether the use of a work constitutes fair use, the following factors must be considered: (1) the purpose and character of the use; (2) the nature of the copyrighted work; (3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and (4) the effect of the use upon the potential market for or value of the copyrighted work. These are collectively known as the 'four-factor test' for fair use.

Initially, 'commercial use' was treated in the U.S. as presumptively outside the scope of fair use, since profiting from others' works without compensation was deemed contrary to general principles of fairness. In subsequent judicial practice, however, U.S. courts gradually developed the 'transformative use' rule from the 'purpose and character of the use' factor. As a result, 'commercial use' is no longer the key determinant of fair use; the focus is instead on how 'transformative' the new use is. For example, in the high-profile 'Google Books' case (Authors Guild v. Google), the U.S. Court of Appeals for the Second Circuit ruled that Google's digitization of books to provide search with limited snippet display did not substitute for the original books in the market and met the requirements of transformative use.

The highly flexible 'four-factor test' and 'transformative use' doctrine grant U.S. courts considerable discretion in determining fair use on a case-by-case basis, opening the possibility for justifying the use of copyrighted works in AI model training. For ChatGPT-like products, the use of works during model training exhibits strong transformative purpose—specifically, 'the utilization of the work does not disseminate its original expression to the public.' The new generation of AI-generated content (AIGC) mechanisms essentially 'learn the probabilities of word arrangements in prior works or recreate styles and patterns at the conceptual level.' Thus, AIGC outputs rarely involve the replication of complete works (or even fragments), resulting in lower risks of 'infringing dissemination' and higher degrees of 'transformative use' compared to cases like Google Books.

Currently, U.S. administrative and judicial authorities have not provided a definitive stance on whether the 'four-factor test' applies to copyright use in model training. Notably, however, Israel, which also adopts the 'four-factor analysis' for fair use, has indicated that machine learning may qualify for copyright law exemptions. In January 2023, Israel's Ministry of Justice issued an opinion supporting the use of copyrighted works for machine learning. Section 19 of Israel's Copyright Law, modeled after Section 107 of the U.S. Copyright Act, suggests that the 'four-factor analysis' framework can encompass AI model training. However, the Israeli Ministry clarified that exemptions do not apply to 'training models exclusively on a specific author's works,' as this would create significant market substitution effects. Additionally, the opinion states that exemptions are limited to the training phase and do not extend to content output, where direct infringement may occur.

From the perspective of global copyright legislation practices, granting generative AI development platforms certain liability exemptions during the model training phase through 'limitations and exceptions to rights' is an emerging trend. Currently, China's existing Copyright Law has not effectively addressed copyright utilization issues in the model training stage, requiring consideration of the legitimacy of establishing new copyright liability exemption mechanisms based on the 'three-step test' legislative standard.

Article 24 of China's Copyright Law stipulates specific scenarios for 'fair use' (utilizing works without the copyright owner's permission or payment), which roughly include 'personal use,' 'appropriate citation,' and 'use for teaching and scientific research.' The applicability of 'personal use' is strictly limited by purpose, whereas current AIGC models ultimately serve commercial purposes for unspecified entities, making it difficult to fit within this provision. 'Appropriate citation' requires the purpose of 'introducing, commenting on, or explaining a work' or 'illustrating a certain issue,' which the commercial applications of AIGC models clearly do not fit. 'Use for teaching and scientific research' limits the use of works to 'classroom teaching or scientific research' and emphasizes 'small-scale reproduction,' a requirement that the current large-scale copying and utilization of works by AIGC models cannot meet.

Although the 2021 revised Copyright Law added 'general requirements' and a 'catch-all clause' to the 'fair use' provisions, the 'catch-all clause' is semi-open—'other circumstances stipulated by laws and administrative regulations'—and cannot be directly applied by courts in judicial practice based on 'general requirements' and specific cases. Therefore, whether AIGC model training can qualify for 'fair use' exemptions remains to be clarified through subsequent revisions to the Copyright Law, the Copyright Implementation Regulations, and related legislation.

Additionally, China's Copyright Law provisions on 'statutory licenses' are relatively scattered, primarily covering four categories: 'journal reprints,' 'performances by artistic groups,' 'sound recording production,' and 'use of published works by radio and television stations to produce programs.' These differ significantly from model training activities and are difficult to apply.

From an institutional objective perspective, copyright law not only protects copyright holders but also serves higher-level public interests such as promoting the sharing of cultural knowledge and advancing content dissemination technologies. Therefore, international agreements like the Berne Convention, TRIPs Agreement, and WIPO Copyright Treaty allow member states to impose limitations and exceptions to copyright, provided they meet the three-step test: (1) the exceptions must be confined to special cases, (2) they must not conflict with the normal exploitation of the work, and (3) they must not unreasonably prejudice the legitimate interests of the copyright holder. The three-step test is also the legislative standard that countries should follow when establishing copyright limitations and exceptions. If AIGC model training is to be included in China's copyright law under the "limitations and exceptions" system, it must comply with this requirement.

In the three-step test, the first step—"confined to special cases"—is a principle-based provision aimed at preventing overly broad limitations that could harm copyright holders. The core criteria for judgment lie in the second step ("not conflict with the normal exploitation of the work") and the third step ("not unreasonably prejudice the legitimate interests of the copyright holder"). On one hand, these two criteria are highly abstract, and there is currently no unified consensus in legislation, judicial practice, or theoretical discussions. On the other hand, these criteria are difficult to completely separate in practice, as actions that affect the normal exploitation of a work often also harm the interests of the copyright holder. The former focuses on "behavioral judgment," while the latter centers on "outcome judgment."

Generally, the criteria of "not conflict with the normal exploitation of the work" and "not unreasonably prejudice the legitimate interests of the copyright holder" can be summarized into three standards. First, whether the specific use falls within the scope of the copyright holder's normal exercise of rights—i.e., whether the copyright holder could have regulated the behavior through normal licensing and derived benefits from it. Second, whether the specific use has a significant substitution effect on the dissemination and market utilization of the work. Third, the balance between the impact of the specific use on the copyright holder's market interests and its contribution to the public interest.

First, does authorizing one's own works for model training constitute a foreseeable normal use scenario for copyright holders? If the answer is affirmative, then exempting unauthorized model training would violate the requirement of 'not conflicting with the normal use of the work.' Although in practice, copyright holders worldwide are already attempting to demand payment from large model platforms—such as Reddit announcing fees for content use by OpenAI and Google—the core 'processing' of works during model training does not clearly fall under the statutory rights of copyright holders. As such, AIGC model training of copyrighted content is not legally recognized as a clear 'normal use scenario.'

Second, does unauthorized model training create a substitution effect on the potential market of the trained works? Generative AI, as the name suggests, is designed for content creation, whether in text-to-text (e.g., ChatGPT) or text-to-image (e.g., Midjourney) domains. However, since generative AI outputs rarely reproduce the entirety or even fragments of the trained works—often producing statistical 'word combinations' or 'single-character references'—it does not significantly substitute the market for the original works. Instead, it intensifies competition in related content markets. An exception arises when training relies solely on a single author's or artist's works, raising concerns about intentional market substitution.

Third, how does unauthorized model training balance market impact and public interest? This is inherently a process of value judgment and interest balancing. The rapid development of AIGC holds immense potential for societal impact, with some likening its significance to the advent of personal computers or even a new 'industrial revolution.' While the actual impact on copyright holders and markets remains to be assessed, overemphasizing paid licensing for training content could severely hinder AI industry progress. For instance, South Korea's 2023 'New Growth 4.0 Plan' explicitly advocates revising copyright laws to permit the use of copyrighted works in data analysis to advance large-scale AI development.

Currently, adopting a 'statutory licensing model' for AIGC model training presents a series of insurmountable challenges in practice. Given the unique nature of AIGC model training behavior, a 'fair use' model with certain restrictions is more appropriate. Under the premise of clarifying its commercial purpose, copyright holders should be granted 'the right to reserve model training in an appropriate manner,' thereby achieving a more logically comprehensive and balanced design of specific rules. To establish a 'fair use' mechanism for copyright in the AIGC era, the following issues need to be considered:

Focus 1: The Scope of Liability Exemption Mechanisms
From a practical standpoint, the liability exemption should not be limited to model training for non-commercial purposes.

Focus 2: The Prerequisites for Liability Exemption Mechanisms
It is essential to clarify that the prerequisite is 'granting copyright holders the right to reserve model training in an appropriate manner.'

Focus 3: The Core Conditions of Liability Exemption Mechanisms
On one hand, the uses of works covered by the AIGC model training exemption should include 'reproduction.' On the other hand, the exemption must be strictly limited to the purpose of model training and should not extend to other dissemination activities protected and regulated by current copyright laws.