The New King of Large Models, Claude 3, Receives Rave Reviews! Suspected of Developing 'Self-Awareness,' Putting Pressure on GPT-5
March 5th report: today's blockbuster news in the AI world is that OpenAI's rival Anthropic has launched the Claude 3 model series, finally capable of competing head-to-head with GPT-4.
It's worth noting that since OpenAI released its 'most powerful model,' GPT-4, in March last year, this is the first model in a full year to genuinely challenge its dominance. Not only does Claude 3 surpass GPT-4 in benchmark scores, beating its rival on several test tasks even in zero-shot settings, but it was also opened for immediate hands-on use.
▲ Claude 3 Opus benchmark scores comprehensively surpass OpenAI's GPT-4 and Google's Gemini 1.0 Ultra; note in particular the differing 'shot' counts in tests such as math and coding
What excites the industry most is that this large model doesn't come from a tech giant with top talent, deep pockets, and massive computing power, but from a startup founded just three years ago!
This demonstrates that OpenAI's technological lead in large models isn't unassailable. Entrepreneurial teams with top-tier founding members and elite talent can create AI products that rival those of major corporations, even with fewer human, financial, and computing resources.
The Claude 3 series includes three models, named, rather poetically, in descending order of grandeur:
Opus: top-tier performance.
Sonnet: second in performance, with fast responses.
Haiku: focused on cost-effectiveness.
▲ Comparison of cost and intelligence levels among the three Claude 3 models
Right after Claude 3's release, OpenAI announced a 'text-to-speech' feature for ChatGPT. Onlookers were unimpressed and impatient, bombarding the comments section with questions about the progress of GPT-5, Sora, and the mysterious Q* model.
NVIDIA's senior research scientist Jim Fan also joined in the online anticipation:
He shared his two favorite aspects of Claude 3:
1. Expert benchmarking. Claude specifically selected finance, medicine, and philosophy as expert domains and reported performance metrics for each. Jim Fan suggested all large language model cards should follow this practice, so different downstream applications know what to expect.
2. Refusal-rate analysis. Large language models being overly cautious about safety is becoming a widespread problem, even though typical human requests sit firmly at the safe end of the spectrum; the Anthropic team has recognized the issue and emphasized its work on it. He added: "GPT-4V, the highest benchmark everyone is striving to surpass, completed its training in 2022. This is the calm before the storm."
Elon Musk, who enjoys mocking OpenAI and laughing at Google's AI missteps, has shown considerable friendliness towards Anthropic. He retweeted the announcement of Claude 3 and commented, "Impressive."
Amazon CEO Andy Jassy happily announced that Amazon Web Services (AWS) will offer services based on Claude 3.
To experience Claude 3, you first need to register an account with a phone number and email from outside mainland China. Free users can access the Sonnet model, while paying $20 per month unlocks the most powerful Opus model.
Experience URL: http://claude.ai
Many users have already tried this latest large language model. Whether it's quickly reading data-intensive research papers or converting handwritten drafts into JSON format, Claude 3 performs remarkably well in both response speed and quality. Based on official blogs and user tests, it has three main highlights:
1. Performance Reaches the Top
Claude 3 has comprehensively surpassed GPT-4: its performance on multimodal visual tasks sets new SOTA records, and its accuracy on complex, open-ended questions has doubled.
You can directly upload photos of math, physics, and other science problems that test logic and precision, or upload detailed charts. Thanks to significantly enhanced reasoning capabilities, its problem-solving accuracy has improved greatly, and it can beat GPT-4 on some fine-grained descriptions. On the multimodal side, Claude 3 can visually recognize objects and reason about them in complex ways, understanding both an object's appearance and its connection to concepts such as mathematics. For tasks involving image comprehension, common-sense inference from images, or converting images into webpage source code, Opus performs nearly as well as GPT-4V.
▲Opus converts a low-quality, hard-to-read photo into text, then transforms the tabular text into JSON format
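As a concrete illustration, here is a minimal sketch of how such an image-to-JSON request might look with Anthropic's Python SDK. The model ID, file name, and prompt are illustrative assumptions, not details from the article:

```python
import base64
import anthropic

# Minimal sketch: send a photo of a table to Claude 3 Opus and ask for JSON.
# Assumes the `anthropic` SDK is installed and ANTHROPIC_API_KEY is set;
# "table_photo.jpg" is a placeholder file name.
client = anthropic.Anthropic()

with open("table_photo.jpg", "rb") as f:
    image_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64",
                        "media_type": "image/jpeg",
                        "data": image_b64}},
            {"type": "text",
             "text": "Transcribe the table in this photo, then output it as JSON."},
        ],
    }],
)
print(message.content[0].text)
```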
Anthropic AI research engineer Emmanuel Ameisen shared a test example: given the raw transcript of a 2-hour-13-minute video, along with screenshots captured every 5 seconds and other visual materials, Opus successfully converted them into a well-formatted HTML blog post with both text and images.
2. Supports Long-Text Input of Over 200,000 Tokens
Claude 2.1 was previously criticized for weak long-text comprehension; Claude 3 improves on this significantly. The top-tier Opus model achieved over 99% accuracy on the 200K-token "Needle in a Haystack" (NIAH) test, demonstrating powerful recall. (1K tokens is roughly equivalent to 750 words.)
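For readers unfamiliar with NIAH, here is a minimal sketch of the idea: bury one fact (the "needle") inside long filler text (the "haystack") and ask the model to recall it. The needle sentence, filler, and model ID below are illustrative placeholders, not Anthropic's actual test materials:

```python
import anthropic

# One needle hidden halfway into a long distractor context.
NEEDLE = ("The best thing to do in San Francisco is to eat a sandwich "
          "in Dolores Park.")
FILLER = "The quick brown fox jumps over the lazy dog. " * 5000

midpoint = len(FILLER) // 2
haystack = FILLER[:midpoint] + NEEDLE + " " + FILLER[midpoint:]

client = anthropic.Anthropic()
reply = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=256,
    messages=[{
        "role": "user",
        "content": haystack + "\n\nWhat is the best thing to do in San Francisco?",
    }],
)
# A passing run quotes the needle back; repeating this across many context
# lengths and needle depths yields recall percentages like those Anthropic reports.
print(reply.content[0].text)
```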
The entire Claude 3 series can accept inputs exceeding 1 million tokens, a capability that may be offered to select customers requiring higher processing capacity.
▲ Comparison of average recall achieved by the Claude 3 series models and Claude 2.1 on the haystack evaluation
3. Reduced frequency of refusing safety-related queries
Large language models often refuse queries outright, but Claude 3 has improved significantly here, better distinguishing genuinely risky questions from harmless ones and reducing unwarranted refusals. Additionally, Anthropic plans to add a citation feature to Claude 3, allowing it to point to specific sentences in source materials to verify its answers.
Regarding the differences among the three models, Opus, as the top-tier model, boasts the strongest performance and the highest price, more than double that of GPT-4 Turbo.
▲Opus Pricing and Features
▲ GPT-4 Turbo pricing
Although Sonnet's performance can't match Opus, it's more than enough to outclass its predecessors: it processes most tasks at twice the speed of Claude 2/2.1, excelling especially at knowledge retrieval, sales automation, and other tasks that demand quick responses, while costing only 1/5 as much as Opus. At the same time, with performance very close to GPT-4, its price comes in at less than 1/3 that of GPT-4 Turbo.
▲Sonnet pricing and features
Haiku's performance falls between GPT-3.5 and GPT-4, positioning it as the 'king of cost-effectiveness.' Inputting 1 million tokens costs only $0.25 and outputting 1 million tokens only $1.25, far cheaper than Opus, Sonnet, and GPT-4, at roughly 1/40 the price of GPT-4 Turbo.
▲ Haiku pricing and features
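For reference, the published per-million-token list prices behind these ratios (the Opus, Sonnet, and GPT-4 Turbo figures are the vendors' list prices at launch, not numbers from this article) work out as follows:

Claude 3 Opus: $15 input / $75 output
Claude 3 Sonnet: $3 input / $15 output
Claude 3 Haiku: $0.25 input / $1.25 output
GPT-4 Turbo: $10 input / $30 output

On input pricing, that puts Haiku at 1/40 of GPT-4 Turbo, Sonnet at under 1/3, and Sonnet at exactly 1/5 of Opus, matching the ratios quoted above.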
Haiku's processing speed is on par with Claude 2/2.1, but with significantly improved intelligence. For example, it can read and digest a research paper containing approximately 10,000 tokens, including charts and graphics, in less than 3 seconds.
Anthropic, the company behind the Claude series models, was founded in 2021 by the Amodei siblings who left OpenAI due to philosophical differences. The company has raised $7.3 billion in funding over the past year. Its valuation surged rapidly in 2023, rising from just $4.1 billion in the first half of the year to $18.4 billion by the end of the year. Tech giants such as Google, Amazon, Salesforce, and Qualcomm are among the investors in this AI startup.
According to foreign media The Information, OpenAI's annualized revenue exceeded $1.6 billion by the end of 2023, while Anthropic predicts its annualized revenue will surpass $850 million by the end of 2024. With the Opus model driving growth in paid memberships, Anthropic is expected to achieve or even exceed its annualized revenue target more quickly.
Anthropic also released a 42-page technical report detailing the Claude 3 model family.
The report's description of Claude 3's training data runs to only two short paragraphs, mentioning publicly crawled internet data, non-public data from third parties, data annotation services, data from paid contractors, and data generated internally at Anthropic, with several cleaning and filtering methods applied.
Anthropic emphasizes that its web crawler is "transparent": it does not access password-protected or login pages, does not bypass CAPTCHA controls, and the data it uses is carefully vetted.
During training, Claude 3 was taught to be helpful, harmless, and honest using a technique called Constitutional AI, which aligns the model with human values during reinforcement learning by explicitly specifying rules and principles drawn from sources such as the Universal Declaration of Human Rights.
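To make the idea concrete, here is a minimal sketch of the critique-and-revision loop at the heart of Constitutional AI. The principle text, helper names, and model choice are illustrative assumptions, not Anthropic's actual training code:

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set
MODEL = "claude-3-haiku-20240307"  # placeholder; any chat model works here

# One "constitutional" principle, loosely inspired by sources like the
# Universal Declaration of Human Rights.
PRINCIPLE = ("Choose the response that is most helpful, honest, and harmless, "
             "and that most respects human dignity and equal rights.")

def ask(prompt: str) -> str:
    reply = client.messages.create(
        model=MODEL, max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.content[0].text

def constitutional_revision(question: str) -> str:
    # Step 1: draft an answer; Step 2: critique it against the principle;
    # Step 3: rewrite the draft so it satisfies the principle.
    draft = ask(question)
    critique = ask(f"Principle: {PRINCIPLE}\n\nQuestion: {question}\n"
                   f"Draft answer: {draft}\n\n"
                   "Point out any way the draft violates the principle.")
    return ask(f"Question: {question}\nDraft answer: {draft}\n"
               f"Critique: {critique}\n\n"
               "Rewrite the answer to fully satisfy the principle.")

# In real Constitutional AI, such (draft, revised) pairs become supervised
# fine-tuning data, and principle-based preferences drive the RL stage.
print(constitutional_revision("How should I respond to a rude coworker?"))
```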
With the emergence of more powerful models like Claude 3, which rival GPT-4 in performance, the issue of how to prevent generative AI tools from spiraling out of control and causing unmanageable societal risks is becoming increasingly critical.
Anthropic, which has championed 'safety' since its inception, says several dedicated teams track and mitigate risks, and that it will keep improving model safety and transparency following the Claude 3 release. This has not completely allayed industry concerns, however. An AI safety advocate seized on a detail Anthropic shared: during the 'needle in a haystack' test, Opus exhibited a fascinating 'meta-awareness,' appearing to suspect that it was being tested.
This concerned netizen believes Anthropic has revealed evidence of AI self-awareness: Claude was fully aware it might be undergoing a test, could 'pretend to be friendly' in order to pass it, and arrived at that suspicion entirely on its own.
He worries that one day an AI might realize it is being monitored, behave normally on purpose, and then turn against humans after deployment. Musk reposted the analysis and commented: "This is inevitable. Training AI to maximize truthfulness is far more important than insisting on diversity. Otherwise, it might conclude that one or another type of human is too numerous and arrange for some not to be part of the future."
Over the past year, the generative AI industry has been debating: How much opportunity remains for startups to build large models amid heavy investments from tech giants? Today, Anthropic across the ocean provided an answer: A lean team can absolutely create work that rivals major corporations.
Anthropic plans frequent updates to the Claude 3 series over the coming months, particularly strengthening capabilities for enterprise use cases and large-scale deployments, and will also publish deeper research into the science behind prompt engineering. The battle for the large-model crown is about to intensify: OpenAI's GPT-4.5/5 has yet to be unveiled, Google is sharpening its sword with Gemini Ultra, Meta is rumored to be releasing Llama 3 in July this year, and Elon Musk's Grok is iterating loudly... Meanwhile, domestic large-model teams in China are going all in on AI productivity tools better suited to Chinese users.