AI Large Models Enter a New Phase of Attack and Defense
-
The essence of an AI large model is training on massive amounts of data until it understands and masters various data types. The text responses, images, videos, and music you see are essentially outputs the model constructs, via its algorithms, from subsets of that data.
In reality, our brains operate similarly when answering questions, but we possess far more powerful fuzzy computing capabilities. We can even break free of existing knowledge and analyze problems from entirely new perspectives; in other words, we can create something out of nothing.
However, we often arrive at completely wrong or contradictory answers because of misconceptions, distorted memories, and other factors. What about AI models? They are no different: once their databases become contaminated, they may confidently give entirely incorrect answers while believing them to be true.
As data copyright disputes around AI large models intensify, traps targeting these models are spreading online. They work by inserting specially crafted data that corrupts an AI's database and forces it to produce completely erroneous responses. The ultimate goal is to compel developers to roll back the affected data versions and stop scraping the websites where the trapped data originates, thereby protecting those sites' own copyrighted content.
There’s a fitting term for this behavior—"poison pill."
Those following the AI large model field may recall a recent incident where a domestic tech company lost tens of billions in market value in a single day. The cause was contamination of its AI large model, which led to the generation of an article contradicting mainstream values. After a parent discovered and shared it online, the incident garnered widespread attention.
It's worth noting that some argue the article wasn't generated by AI but was inadvertently included in the AI's database during web scraping and later surfaced in applications. Regardless of the reason, one obvious fact remains: AI still has significant deficiencies in its ability to discern good from bad.
Early in the rise of AI large models, people asked: "If we feed an AI harmful data, can we turn it into a bad actor?" The answer is undoubtedly yes. In one experiment, an AI was set loose on the anonymous forum 4chan to learn from user interactions. By the end of training, the developers had an AI that supported Nazism, racism, and ethnic cleansing, and excelled at hurling vicious insults.
This outcome shocked even the developers, and it demonstrates how unchecked training data can lead to severe errors in an AI's cognition and responses. Consequently, mainstream AI models now incorporate multiple correction and filtering mechanisms to keep harmful information from contaminating their databases.
However, compared with text data, which is relatively easy to screen, "poison pills" hidden in visual data such as paintings are far more covert and effective. A hacker team developed specialized poisoning tools that embed subtle feature-level patterns into seemingly normal artworks, tricking an AI into misclassifying them. By repeatedly polluting the data pool this way, they can completely disrupt the AI's recognition capabilities.
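The actual tools described above are not public, so the details are unknown; the following is only a minimal, generic sketch of the underlying idea, a small pixel-level perturbation optimized against a hypothetical surrogate PyTorch classifier `model` so that an image of one thing reads as another. Every parameter here (the budget `eps`, step count, surrogate model) is an illustrative assumption, not a description of any real poisoning tool.

```python
import torch
import torch.nn.functional as F

def poison_image(model, image, target_label, eps=8 / 255, steps=40, step_size=1 / 255):
    """Nudge `image` (a [1, 3, H, W] tensor in [0, 1]) toward `target_label` with a
    perturbation bounded by `eps`, so the result still looks normal to a human."""
    model.eval()
    delta = torch.zeros_like(image, requires_grad=True)
    target = torch.tensor([target_label])
    for _ in range(steps):
        logits = model(image + delta)
        # Minimise the loss for the *wrong* label, e.g. make "dog" pixels score as "cat".
        loss = F.cross_entropy(logits, target)
        loss.backward()
        with torch.no_grad():
            delta -= step_size * delta.grad.sign()      # step toward the wrong label
            delta.clamp_(-eps, eps)                     # keep the change imperceptible
            delta.add_(image).clamp_(0, 1).sub_(image)  # stay inside the valid pixel range
        delta.grad.zero_()
    return (image + delta).detach()
```

The design point is the bound on `delta`: the image looks unchanged to a human curator, but a model ingesting many such images starts associating the wrong label with the wrong visual features.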
A contaminated AI will give completely wrong answers to drawing requests. For example, if you ask it to draw a dog, after a brief wait, you might get a cat, a cow, or something entirely different - but definitely not a dog.
As pollution data increases, AI-generated images become increasingly abstract. When they eventually degrade into meaningless lines, the AI's database is essentially rendered useless. To restore normal functionality, the only option is version rollback, reverting to a state before the issues arose.
However, identifying the exact point of data pollution is a time-consuming and labor-intensive task. This not only wastes the training data from that period but also increases training costs and reduces efficiency. Artists are using this method to protect their copyrights and pressure AI companies to avoid works marked with 'do not scrape' labels.
If poison pills were only used in works explicitly marked as 'do not scrape,' this would merely be a copyright dispute, and most netizens might side with the artists. However, developers soon discovered that many works without such labels also contained poison pills, continuously polluting AI databases. Identifying these poison pills within vast training datasets is extremely difficult, directly affecting the training speed of AI art models.
How to prevent poison pill pollution has become a critical issue that all major AI models must address carefully.
So how can contamination be prevented? Developers have devised many countermeasures, such as stricter data review systems that filter out potentially problematic data, even at the cost of reduced training efficiency. This approach is not entirely effective, however, because poison pills become better concealed as scrutiny increases.
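As a rough sketch of where such a review gate sits, the snippet below scores each candidate sample with a detector and quarantines anything suspicious before it reaches training. The `detector` callable, the threshold, and the quarantine policy are placeholders for illustration, not any vendor's actual pipeline; the extra pass over every sample is exactly where the efficiency cost mentioned above comes from.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    path: str
    score: float = 0.0

def review_batch(samples, detector, threshold=0.8):
    """Score every candidate with `detector` (higher = more suspicious) and split
    the batch into data allowed into training and data held back for manual audit."""
    accepted, quarantined = [], []
    for sample in samples:
        sample.score = detector(sample.path)
        (quarantined if sample.score >= threshold else accepted).append(sample)
    return accepted, quarantined
```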
Through specialized algorithms, hacker teams continuously iterate and update poisoning tools, disguising these 'poison pills' as normal data to bypass AI security mechanisms and infiltrate core datasets. Perhaps only 1 out of 10 poison pills successfully evades detection, but their generation speed is extremely fast. It only takes a few dozen poison pills to corrupt a database, and once the number reaches hundreds, the AI's understanding of certain concepts can be completely distorted.
Additionally, the AI's own learning capability can be turned into a countermeasure. By labeling disguised poison pills and feeding them to the model repeatedly, developers can teach it to recognize data with those characteristics as "toxic," so that it generalizes and picks out harmful samples from vast datasets.
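In practice this amounts to ordinary supervised training of a filter on labeled clean-versus-poisoned examples. The sketch below assumes a small PyTorch backbone ending in a two-way head and a dataset of (image, label) pairs; the architecture and hyperparameters are assumptions chosen only to show the shape of the idea.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train_poison_detector(backbone: nn.Module, dataset, epochs=3, lr=1e-4):
    """Fine-tune `backbone` (any model ending in a 2-way head) on a `dataset`
    yielding (image_tensor, label) pairs, where label 1 means 'poisoned'."""
    loader = DataLoader(dataset, batch_size=32, shuffle=True)
    optimizer = torch.optim.AdamW(backbone.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    backbone.train()
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(backbone(images), labels)  # learn clean vs. poisoned
            loss.backward()
            optimizer.step()
    return backbone
```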
Of course, some covert, non-public poisoning tools cannot be countered this way. In such cases, developers must conduct regular security audits to verify and remove malicious data while enhancing the model's ability to respond to such threats based on the characteristics of the malicious data.
However, these methods are not highly efficient, requiring developers to constantly monitor and update models. Is there a better solution? Yes, but it demands more effort and cost—such as AI fusion models.
Simply put, this involves merging multiple AI models into a matrix. Before outputting data, the models exchange information to review the content, ensuring its validity through cross-verification. Given the low probability of multiple AI models being contaminated simultaneously, this method offers the highest effectiveness and efficiency.
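A minimal sketch of that cross-verification logic is below, assuming each model is a callable that answers a prompt and `agree()` is some caller-defined similarity check between answers; real fusion systems are far more elaborate, but the majority-quorum idea is the core.

```python
def cross_verified_answer(models, prompt, agree, quorum=None):
    """Ask every model the same question and release an answer only if at least
    `quorum` of them agree; return None to flag possible contamination.
    `agree(a, b)` is a caller-supplied similarity check between two answers."""
    quorum = quorum or (len(models) // 2 + 1)
    answers = [model(prompt) for model in models]
    for candidate in answers:
        supporters = sum(1 for other in answers if agree(candidate, other))
        if supporters >= quorum:
            return candidate
    return None
```

Because a poison pill would have to slip past most of the models at once to win the vote, a single contaminated member is outvoted rather than passed through to users.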
However, integrating multiple AI models depends heavily on developers' technical expertise and significantly increases system complexity and computational cost, a burden that many unprofitable or small-to-medium-sized AI teams cannot bear. The method is therefore mostly found in the model matrices of large enterprises, where the expense is negligible compared with the value of guaranteeing correct output, or at least of avoiding obvious errors.
It can be said that AI model training today is no longer just about competing in data scale and algorithm architecture—error correction and anti-interference capabilities have also become crucial metrics. As large AI models are increasingly applied and their user base grows, ensuring that AI does not make mistakes when answering questions has become critical. Given the highly sensitive and cautious investment market climate today, a small mistake could easily lead to losses worth billions, which is no joke.