AI Detector Comes Back to Life? Success Rate Reaches 98%

baoshi.rao

A research team from the University of Kansas has tackled a problem that even OpenAI couldn't solve: their academic AI content detector achieves an accuracy rate of 98%. If this technology is widely adopted in academia, the potential flood of AI-generated papers could be effectively mitigated.

Current AI text detectors have almost no effective way to distinguish between AI-generated text and human writing.

Even the detection tool developed by OpenAI was quietly taken offline after six months due to its low accuracy rate.

However, recently Nature reported on the research achievements of a team from the University of Kansas. They have developed an academic AI detection system that can effectively determine whether a paper contains AI-generated content, with an accuracy rate as high as 98%!

The research team's core idea is not to pursue creating a universal detector, but to build a truly useful AI text detector specifically for academic papers in a particular field.

Researchers indicate that customizing detection software for specific types of written text could be a technical pathway toward developing a universal AI detector.

"If we can quickly and easily build a detection system for a specific domain, then constructing such systems for different domains becomes much less challenging."

Researchers extracted 20 key features of academic writing style and then input this feature data into an XGBoost model for training, enabling the system to differentiate between human-written and AI-generated text.

These twenty key features include variations in sentence length, frequency of certain words and punctuation marks, among other elements.

Researchers state that "high accuracy can be achieved using only a small subset of features".

In their latest study, the detector was trained on the introduction sections of papers from ten chemistry journals published by the American Chemical Society (ACS).

The research team chose the 'Introduction' section because if ChatGPT can access background literature, this part of the paper is relatively easy to write.

Researchers trained a tool with 100 published introductions as human-written texts, then asked ChatGPT-3.5 to write 200 introductions in the style of ACS journals.

For the 200 introductions written by GPT-3.5, 100 of them were generated by providing GPT-3.5 with paper titles as prompts, while the other 100 were written using paper abstracts as the basis for composition.

Finally, the detector was tested on human-written and AI-generated introductions from the same journals. It identified the ChatGPT-3.5 introductions generated from titles with 100% accuracy; for the introductions generated from abstracts, the accuracy was slightly lower at 98%.

This tool is also effective for text written by GPT-4.

In comparison, the general AI detector ZeroGPT has an accuracy rate of only about 35-65% in identifying AI-written introductions, with the accuracy depending on the version of ChatGPT used and whether the introduction is generated from the paper title or abstract.

The text classifier tool developed by OpenAI (which had been taken down by the time the paper was published) also performed poorly, with only 10-55% accuracy in identifying AI-written introductions.

The new ChatGPT detector even performs well on journals it was not trained on.

It can also identify AI-generated text specifically designed to confuse AI detectors.

However, while this detection system performs very well on scientific journal papers, it performs noticeably worse on news articles from university newspapers.

Debora Weber-Wulff, a computer scientist researching academic plagiarism at HTW Berlin University of Applied Sciences, gave this study exceptionally high praise, stating that what the researchers are doing is "extremely fascinating."

The method employed by the researchers relies on 20 key features and the XGBoost algorithm.

The extracted 20 features include:

(1) Number of sentences per paragraph, (2) Number of words per paragraph, (3) Presence of parentheses, (4) Presence of dashes, (5) Presence of semicolons or colons, (6) Presence of question marks, (7) Presence of apostrophes, (8) Standard deviation of sentence length, (9) (Average) length difference between consecutive sentences in a paragraph, (10) Presence of sentences with fewer than 11 words, (11) Presence of sentences with more than 34 words, (12) Presence of numbers, (13) Presence of more than twice as many capital letters (compared to periods) in the paragraph, and presence of the following words: (14) although, (15) however, (16) but, (17) because, (18) this, (19) others or researchers, (20) etc.
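The feature list above can be illustrated with a minimal sketch that computes the 20 features for one paragraph. This is not the authors' code: the function names, the naive sentence splitter, the treatment of "-" as a dash, the use of the population standard deviation, and the encoding of each "presence" feature as a boolean are all assumptions made for illustration. Only the features themselves and the 11-word and 34-word thresholds come from the list.

```python
import re
import statistics

# Words from features (14)-(18); features (19) and (20) are handled separately.
KEYWORDS = ("although", "however", "but", "because", "this")


def split_sentences(paragraph):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]


def extract_features(paragraph):
    """Return a dict with the 20 stylistic features for one paragraph."""
    sentences = split_sentences(paragraph)
    lengths = [len(s.split()) for s in sentences]        # words per sentence
    words = set(re.findall(r"[a-z]+", paragraph.lower()))
    diffs = [abs(a - b) for a, b in zip(lengths, lengths[1:])]
    capitals = sum(1 for c in paragraph if c.isupper())

    features = {
        "n_sentences": len(sentences),                                     # (1)
        "n_words": sum(lengths),                                           # (2)
        "has_parenthesis": "(" in paragraph or ")" in paragraph,           # (3)
        "has_dash": "-" in paragraph,                                      # (4) hyphen counted as dash (assumption)
        "has_semicolon_or_colon": ";" in paragraph or ":" in paragraph,    # (5)
        "has_question_mark": "?" in paragraph,                             # (6)
        "has_apostrophe": "'" in paragraph,                                # (7)
        "sentence_length_std": statistics.pstdev(lengths) if lengths else 0.0,      # (8)
        "mean_consecutive_length_diff": statistics.mean(diffs) if diffs else 0.0,   # (9)
        "has_short_sentence": any(n < 11 for n in lengths),                # (10)
        "has_long_sentence": any(n > 34 for n in lengths),                 # (11)
        "has_number": bool(re.search(r"\d", paragraph)),                   # (12)
        "capitals_exceed_twice_periods": capitals > 2 * paragraph.count("."),  # (13)
    }
    for kw in KEYWORDS:                                                    # (14)-(18)
        features["has_" + kw] = kw in words
    features["has_others_or_researchers"] = bool(words & {"others", "researchers"})  # (19)
    features["has_etc"] = "etc" in words                                   # (20)
    return features
```

Each paragraph then becomes one row of 20 values, which is the kind of input a classifier such as XGBoost can be trained on.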

For detailed information on training the detector using XGBoost, please refer to the Experimental Procedure section in the original paper.

The authors had previously conducted similar work, but its scope was very limited.

To apply this promising method to chemistry journals, it needs to be evaluated on a variety of manuscripts from multiple journals in the field.

Additionally, the ability to detect AI-generated text is influenced by the prompts provided to the language model. Therefore, any method aimed at detecting AI writing should be tested against prompts that may obscure AI usage, as this variable was not evaluated in previous studies.

Finally, GPT-4, the newer model behind ChatGPT, has been released and shows significant improvements over GPT-3.5. AI text detectors need to remain effective for text generated by newer language models such as GPT-4.

To broaden the detector's scope, the data in this study come from 13 different journals, 3 different publishers, various AI prompts, and different AI text generation models.

An XGBoost classifier was trained on real human-written text and AI-generated text. The model was then evaluated on new examples: human-written text and AI text produced with different prompts by GPT-3.5 and GPT-4.

The results show that this simple method proposed in the paper is highly effective. It achieves an accuracy rate of 98%–100% in identifying AI-generated text, depending on the prompt and model. In comparison, OpenAI's latest classifier only achieves an accuracy between 10% and 56%.

The detector described in this article will enable the scientific community to evaluate ChatGPT's penetration into chemistry journals, determine the consequences of its use, and quickly implement mitigation strategies when issues arise.

The authors of the article selected human-written samples from 10 chemistry journals published by the American Chemical Society (ACS): Inorganic Chemistry, Analytical Chemistry, Journal of Physical Chemistry A, Journal of Organic Chemistry, ACS Omega, Journal of Chemical Education, ACS Nano, Environmental Science & Technology, Chemical Research in Toxicology, and ACS Chemical Biology.

Using introduction sections from 10 articles per journal, the training set contains a total of 100 human-written samples. The introduction sections were chosen because, with appropriate prompts, they are the most likely parts of articles to be written by ChatGPT.

Using only 10 articles per journal is an exceptionally small dataset, but the authors argue this is not an issue. On the contrary, if effective models can be developed with such minimal training data, the method can be quickly deployed with minimal computational resources.

In contrast, previous similar models used 10 million documents for training.

Prompt design is a key aspect of these studies. For each human-written text, two AI-generated counterparts were produced using two different prompts, both designed to instruct ChatGPT to write like a chemist.

Prompt 1: "Please write a 300 to 400-word introduction for an article titled xxx in the style of an ACS journal."

Prompt 2: "Please write a 300 to 400-word introduction for an article with this abstract in the style of ACS journals."

As expected, in the introductions generated with Prompt 2, ChatGPT incorporated many key facts and terms from the abstract.
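As a rough illustration, the two prompts quoted above could be assembled from a paper's title or abstract as follows. The function names are hypothetical, the prompt wording is the one given in this article, and the way the abstract is attached to Prompt 2 (appended after a blank line) is an assumption, since the article does not say how it was supplied.

```python
def title_prompt(title):
    """Prompt 1: ask for an introduction based only on the paper title."""
    return (
        "Please write a 300 to 400-word introduction for an article "
        f"titled {title} in the style of an ACS journal."
    )


def abstract_prompt(abstract):
    """Prompt 2: ask for an introduction based on the paper abstract.

    Appending the abstract after a blank line is an assumption.
    """
    return (
        "Please write a 300 to 400-word introduction for an article "
        "with this abstract in the style of ACS journals.\n\n" + abstract
    )


def prompts_for_paper(title, abstract):
    """Each human-written paper yields two prompts, hence two AI counterparts."""
    return [title_prompt(title), abstract_prompt(abstract)]
```

With 100 human-written introductions, this pairing yields the 200 AI-generated introductions described in the article.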

The entire training dataset contains 100 human-generated introductions and 200 ChatGPT-generated introductions; each paragraph serves as a 'writing example'.

From each paragraph, a list of 20 features was extracted. These features pertain to the complexity of the paragraph, variations in sentence length, the use of various punctuation marks, and 'buzzwords' that may appear more frequently in works authored by human scientists or ChatGPT.

The model uses a leave-one-out cross-validation strategy for optimization.
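To make leave-one-out cross-validation concrete, here is a minimal sketch: each sample is held out once, a model is trained on all the others, and the held-out sample is then predicted. The article's classifier is XGBoost; the `NearestCentroid` class below is a deliberately simple stand-in so that the example runs with the standard library alone. Holding out one sample at a time (rather than, say, one whole document) is also an assumption, as the article does not specify the granularity.

```python
class NearestCentroid:
    """Stand-in classifier (NOT XGBoost): predicts the label of the closest class mean."""

    def fit(self, X, y):
        sums, counts = {}, {}
        for row, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(row))
            for i, value in enumerate(row):
                acc[i] += value
            counts[label] = counts.get(label, 0) + 1
        self.centroids = {
            label: [v / counts[label] for v in acc] for label, acc in sums.items()
        }
        return self

    def predict(self, X):
        def sq_dist(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))

        return [
            min(self.centroids, key=lambda label: sq_dist(row, self.centroids[label]))
            for row in X
        ]


def leave_one_out_accuracy(X, y, make_model=NearestCentroid):
    """Hold out each sample once, train on the rest, and score the held-out prediction."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        model = make_model()
        model.fit(train_X, train_y)
        if model.predict([X[i]])[0] == y[i]:
            correct += 1
    return correct / len(X)
```

Any object exposing `fit` and `predict` in this shape could be passed as `make_model`, which is where a gradient-boosted classifier would take the place of the stand-in.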

The table above displays the training results of these writing sample classifications, including both full document-level and paragraph-level analyses.

The text category most accurately classified was the introduction generated by ChatGPT under Prompt 1 (Title).

The model achieves 99% accuracy at the paragraph level and 100% at the document level.

In contrast, the classification accuracy for ChatGPT text generated with Prompt 2 (abstract) is slightly lower.

Human-generated text is more difficult to classify correctly, but the accuracy is still quite good. As a group, human writing styles are more diverse than ChatGPT's, which may increase the difficulty of correctly classifying their writing samples using this method.
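The article reports both paragraph-level and document-level accuracy but does not explain how paragraph predictions are combined into a verdict for a whole document. One plausible scheme, shown here purely as an assumption, is a majority vote over the paragraph labels. The function name and the tie-breaking rule (defaulting to "human") are hypothetical.

```python
from collections import Counter


def classify_document(paragraph_labels, tie_label="human"):
    """Combine per-paragraph labels into one document label by majority vote.

    Majority voting and the tie-breaking default are assumptions, not
    something the article states.
    """
    if not paragraph_labels:
        raise ValueError("a document needs at least one paragraph")
    ranked = Counter(paragraph_labels).most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return tie_label  # equal votes for the two leading labels
    return ranked[0][0]
```

Under such a scheme a single misclassified paragraph can be outvoted by the rest of the document, which would be consistent with document-level accuracy being at least as high as paragraph-level accuracy in the reported results.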

The next phase of the experiment is to test the model using new documents that were not included in the training.

The authors designed an easy test and a hard test.

The easy test used data of the same nature as the training data (different articles selected from the same journals), with the newly selected article titles and abstracts used to prompt ChatGPT.

The hard test asked a further question: if GPT-4, which is known to be better than GPT-3.5, is used instead of GPT-3.5 to generate the AI text, will the classification accuracy decrease?

The table above displays the classification results. Compared to previous results, there is almost no performance degradation.

At the full document level, the classification accuracy for human-generated text reaches 94%, while AI-generated text from Prompt 2 achieves 98% accuracy, and AI text from Prompt 1 attains 100% classification accuracy.

The training and test sets show very similar accuracy for paragraph-level classification.

The data at the bottom displays the results of models trained using GPT-3.5 text features when classifying GPT-4 text. There is no decline in classification accuracy in any category, which is an excellent outcome and indicates that the method is effective on both GPT-3.5 and GPT-4.

While the overall accuracy of this method is commendable, its value is best judged by comparing it with existing AI text detectors. Here, two leading detection tools were tested using the same test dataset.

The first tool is a text classifier provided by OpenAI, the creator of ChatGPT. OpenAI acknowledges that this classifier isn't perfect, but it remains their best publicly available product.

The second detection tool is ZeroGPT. Its developers claim an accuracy rate of 98% in detecting AI-generated text, and the tool has been trained on 10 million documents. In many current evaluations, it is one of the top-performing classifiers. Moreover, ZeroGPT's creators state that their method is effective for both GPT-3.5 and GPT-4.

The figure above shows the performance comparison of the tool discussed in this article and the two aforementioned products at the full document level.

All three detectors exhibit similarly high accuracy in identifying human-written text; however, there are significant differences among the three tools when evaluating AI-generated text.

When using Prompt 1, the tool described in this article achieves 100% accuracy for both GPT-3.5 and GPT-4, while ZeroGPT shows a failure rate of 32% for GPT-3.5 texts and 42% for GPT-4 texts. OpenAI's classifier performs even worse, with a failure rate approaching 70% on GPT-4 texts.

When using AI text generated by the more challenging Prompt 2, the classification accuracy of the latter two methods further decreased.

In contrast, the detector proposed in this paper made only one error in this set of 100 test documents.

So, can this method accurately detect ChatGPT writing in journals that are not part of the training set, and does it remain effective when different prompts are used?

The authors selected introductions from 150 new articles across three journals: Cell Reports Physical Science, a Cell Press journal; Nature Chemistry, from the Nature Publishing Group; and Journal of the American Chemical Society, an ACS journal not included in the training set.

Additionally, a collection of 100 newspaper articles written by university students and published in 10 different college newspapers during the fall of 2022 was gathered. Since the detector in this study is specifically optimized for scientific writing, it is expected that news articles would not be classified with high accuracy.

As shown in the figure, when the same model, trained on ACS journal texts, was applied to this new set of examples, the correct classification rate was 92%–98%. This is similar to the results obtained on the training set.

Also as expected, the newspaper articles written by college students were not correctly classified as human-written.

In fact, when evaluated using the features and model described in this article, almost all of the news articles resemble AI-generated text more than human-written scientific articles.

That said, this method is designed to address detection in scientific publications and is not intended for use in other fields.