Do ChatGPT and Sora Actually Limit Our Imagination of Large Models?
A recent report from a U.S. market research agency has gone viral. The report provides a detailed analysis of the hardware resources required for OpenAI to deploy Sora, calculating that at peak times, Sora would need up to 720,000 Nvidia H100 GPUs, costing 156.1 billion RMB.
At the same time, another piece of news is making waves. A Microsoft engineer revealed that 100,000 H100 GPUs were assembled to train GPT-6, but the power grid was overwhelmed as a result.
These stories have left those following large models wondering:
Is it really worth exhausting the Earth's resources just to generate some text and videos?
(Image source: Factorial Funds)
In a way, ChatGPT and Sora have limited people's imagination of what large models can achieve: text generation is framed as "understanding the world by predicting the next token," and video generation as "an engine for understanding the physical world." As a result, resources are poured almost exclusively into generating text and video.
But is this the limit of large models' imagination?
You Won't Believe How Powerful Industry-Specific Models Have Become
Recently, a series of fascinating cases circulating in the industry has gone well beyond the demos offered by ChatGPT and Sora, revealing even greater possibilities for generative AI.
The image shows an AI generating a medical examination report—yes, it's generating a 'future' medical examination report.
In the health management industry, how to provide earlier risk warnings for people's health conditions is a critical issue. Given the immense capabilities of generative AI, what if we had AI generate future medical examination reports directly?
Indeed, AI can do just that, and the future health assessments it produces merit serious attention.
It's not limited to human medical reports—AI can also generate complex "health check" reports for hydropower units.
We can see that the AI provides minute-level timestamps for the unit's operational status and warns of potential high-temperature faults, prompting experienced technicians to inspect the unit and adjust monitoring and operating strategies accordingly.
These cases come from industrial deployments by the AI company 4Paradigm. The industry-specific large models behind them are built on a platform called Prophet AIOS, which covers the development, management, and application of various AI models and has now evolved to version 5.0.
AI Generates Everything, and Everything Is AI Generation
Observant readers must have noticed that these remarkable cases share a common characteristic:
In essence, they are all about 'Predict the next X'. This X is not just the 'language' processed by large language models like ChatGPT, but rather a broader and richer array of multimodal data across various industries.
To some extent, ChatGPT has demonstrated that pre-training with massive amounts of data and using the 'Predict the next token' approach can indeed produce intelligence. Sora, on the other hand, proves that this 'Predict the next X' method should not be limited to text data represented by tokens.
The emergence of both ChatGPT and Sora has validated the "Predict the next X" approach. The way to further expand the imagination space of large models and unleash their value, then, is to keep extending what the unknown variable X can represent.
This X could be medical examination reports, hydrological data, monitoring values, or emergency plans. These industry-specific large models require a vast amount of industry-specific data and deep domain knowledge to ultimately generate the X for a particular industry.
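To make "Predict the next X" concrete, here is a deliberately minimal sketch where X is a discretized sensor reading (say, a turbine bearing temperature) rather than a word token. The data, bin count, and count-based bigram model are all hypothetical illustrations, not 4Paradigm's actual method; real industry models would use far richer architectures and multimodal data.

```python
import numpy as np

def discretize(series, n_bins=8):
    """Map continuous readings onto integer 'tokens' via equal-width bins."""
    edges = np.linspace(series.min(), series.max(), n_bins + 1)[1:-1]
    return np.digitize(series, edges)  # values in 0..n_bins-1

def fit_bigram(tokens, n_bins=8):
    """Count-based next-token model: rows estimate P(x_{t+1} | x_t)."""
    counts = np.ones((n_bins, n_bins))  # add-one smoothing
    for a, b in zip(tokens[:-1], tokens[1:]):
        counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def predict_next(probs, token):
    """Greedy 'predict the next X': the most likely successor bin."""
    return int(np.argmax(probs[token]))

# Hypothetical temperature trace drifting upward toward a fault.
rng = np.random.default_rng(0)
temps = np.cumsum(rng.normal(0.1, 0.5, size=500)) + 60.0

tokens = discretize(temps)
probs = fit_bigram(tokens)
nxt = predict_next(probs, tokens[-1])
print(f"predicted next temperature bin: {nxt}")
```

Swapping the tokenizer and model lets the same loop predict the next lab value, the next hydrological reading, or the next frame: the "X" changes, the autoregressive recipe does not.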
For example, consider this sound effect large model developed by professionals in a vertical industry.
When designing the optimal acoustic experience for a concert hall, this industry-specific model can generate sound solutions for different scenarios, supply concrete data, and present the results with intuitive visualizations.
This kind of sound-experience generation cannot be achieved by predicting the next word. But once an industry-specific large model has been trained on a large volume of the sound industry's proprietary formats and specialized data, it can be generated this way. A crucial prerequisite for developing such models is putting the initiative in the hands of professionals in each industry, so that specialized knowledge and data can play their full role.
What they may need is not a traditional large language model, nor one fine-tuned with industry data, but rather a foundational large model genuinely trained on diverse forms of data from their own industry.
4Paradigm's Prophet AIOS 5.0 can accept various types of "X" and then build the corresponding vertical industry-specific large models on top of them. As they put it, "you reap what you sow": a language model alone cannot solve math problems.
In fact, this approach is increasingly being adopted by major companies. Even OpenAI doesn't believe there will ultimately be a one-size-fits-all large model to solve all problems. OpenAI's COO recently stated at a forum, "You certainly don't need a single model to solve everything. People should dynamically call different models based on specific use cases to better allocate intelligent resources."
So, don't be limited by ChatGPT and Sora. The "X" in "Predict the next X" should have far more possibilities. And these possibilities will only sprout and grow from within various industries. When they connect, AGI may arrive even sooner.