The new AI tools can paraphrase entire texts almost perfectly, and they do it so well that it is very hard to tell whether a piece of content is AI plagiarism or a human creation. This practice shades into copyright infringement and has already produced several cases in journalism. Failing to cite sources means plagiarizing the hard work of publishers and journalists.
The main problem is that it is increasingly difficult to distinguish content produced by humans from content generated by artificial intelligence, which makes identifying potential plagiarism complex. At the legal level, it is also not yet clear whether articles rewritten by AI constitute original content, and therefore whether the practice falls under the positive label of efficient aggregation or the negative one of hyper-efficient plagiarism. OpenAI, together with Bing Chat, is accused by the New York Times of having used the paper's proprietary content to train its machine learning models. While awaiting the court's ruling, OpenAI states that it prohibits plagiarism in its ChatGPT models, without, however, giving any detailed description of how it intends to enforce that. In 2018, the Italian Supreme Court had already ruled that plagiarism of another's work arises both in cases of counterfeiting of the protected work and in cases of evolutionary plagiarism, that is, a reworking of the original that lacks sufficient creative autonomy. AI reworkings could fall under this second case, in which event the holder of the original work could demand payment of reproduction rights and compensation for damages. Some argue that AI today is limited to aggregating existing content but will in the future be capable of generating original content on its own, albeit built on a knowledge base "taken without permission."
As journalist Winston Cho points out in his investigation for The Hollywood Reporter, however, the path is uphill for the aggrieved party today. Proving that one has been plagiarized by an AI, even prospectively with an eye to the future, is not simple, both because the matter is new for lawmakers and because "it is necessary to show an example of output substantially similar to one's own work to have a case that can survive dismissal," as Jason Bloom, president of Haynes Boone, the intellectual property firm consulted by the NYT, puts it. The training datasets of LLMs are largely black boxes, and in most of the lawsuits filed so far the plaintiffs have not been able to state with certainty that their works were included in the AI models. The authors who, alongside the NYT, have sued OpenAI, for example, can only point to ChatGPT's ability to generate summaries and in-depth analyses of the themes of their novels as evidence that the company used their books. The problem, therefore, is not so much the imitation of style as the upstream scraping: collecting data without the authors' authorization amounts to improper and unauthorized use.
What, then, is the solution? The most important companies in the sector are working to develop systems to:
- automatically embed a non-removable watermark in newly generated content (one such scheme is sketched after this list);
- use other tools to clarify which content is produced by artificial intelligence and which is not;
- find new ways to preserve the integrity of information and respect the real authors of content.
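On the first point, one scheme discussed in the research literature (the "green-list" watermark proposed for LLM decoding) biases the model at each step toward a pseudo-randomly chosen subset of the vocabulary, so that anyone who knows the scheme can later measure how often that subset was used. The sketch below only illustrates the principle, with a toy vocabulary and a random generator standing in for a real model; it is not the method of any specific vendor named in this article.

```python
import hashlib
import random

# Toy vocabulary standing in for an LLM's token set.
VOCAB = ["the", "press", "model", "content", "rights", "source", "data",
         "report", "news", "article", "author", "claim", "review", "text"]


def green_list(prev_token: str, fraction: float = 0.5) -> set[str]:
    """Pseudo-randomly partition the vocabulary based on the previous token."""
    # Seeding with a hash of the previous token makes the partition
    # reproducible by anyone who knows the scheme.
    seed = int(hashlib.sha256(prev_token.encode()).hexdigest(), 16)
    rng = random.Random(seed)
    shuffled = sorted(VOCAB)
    rng.shuffle(shuffled)
    return set(shuffled[: int(len(VOCAB) * fraction)])


def generate(length: int = 200, seed: int = 0) -> list[str]:
    """Stand-in generator: picks tokens at random but prefers the green list."""
    rng = random.Random(seed)
    tokens = ["the"]
    for _ in range(length):
        greens = sorted(green_list(tokens[-1]))
        # With 90% probability sample from the green list, otherwise from the
        # whole vocabulary -- a crude analogue of boosting green-token logits.
        pool = greens if rng.random() < 0.9 else VOCAB
        tokens.append(rng.choice(pool))
    return tokens


def green_rate(tokens: list[str]) -> float:
    """Fraction of tokens that fall in their predecessor's green list.

    Expected around 0.5 for unwatermarked text, noticeably higher otherwise.
    """
    hits = sum(tok in green_list(prev) for prev, tok in zip(tokens, tokens[1:]))
    return hits / (len(tokens) - 1)


if __name__ == "__main__":
    watermarked = generate()
    plain_rng = random.Random(1)
    plain = [plain_rng.choice(VOCAB) for _ in range(200)]
    print(f"green-token rate, watermarked text: {green_rate(watermarked):.2f}")  # close to 0.95
    print(f"green-token rate, unmarked text:    {green_rate(plain):.2f}")        # close to 0.5
```

In the published scheme the bias is applied to the model's logits rather than by sampling from a hand-made pool, and detection relies on a statistical test on the green-token count rather than a raw rate, but the principle is the same: the watermark lives in the statistics of word choice, not in any visible marker.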
Maenoox offers valuable help here: by registering content in an NFT with the guarantee of blockchain immutability, the content becomes easy to recover from a forensic perspective, demonstrating with certainty both the attribution of ownership and any unauthorized use.
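How Maenoox implements this is not detailed here; as a rough illustration, assuming the registration boils down to hashing the content and anchoring the digest as NFT metadata, the forensic check could look like the following sketch (all names and fields are hypothetical).

```python
import hashlib
import json
from datetime import datetime, timezone


def register(content: str, author: str) -> dict:
    """Build the record to be anchored on-chain, e.g. as NFT metadata.

    Only the SHA-256 digest needs to be published: the original text stays
    with the author, while anyone holding a copy can recompute the digest
    and compare it with the immutable record.
    """
    return {
        "content_sha256": hashlib.sha256(content.encode("utf-8")).hexdigest(),
        "author": author,
        "registered_at": datetime.now(timezone.utc).isoformat(),
    }


def matches(content: str, record: dict) -> bool:
    """Forensic check: does a disputed text match the registered original?"""
    return hashlib.sha256(content.encode("utf-8")).hexdigest() == record["content_sha256"]


if __name__ == "__main__":
    article = "Full text of the original article..."
    record = register(article, author="Example Newsroom")
    print(json.dumps(record, indent=2))          # metadata to anchor on-chain
    print(matches(article, record))              # True: the registered original
    print(matches(article + " edited", record))  # False: an altered copy
```

An exact hash of this kind proves priority and detects verbatim copies; plagiarism by paraphrase still has to be argued through the kind of output-similarity evidence discussed above.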