Backed by Sequoia with a 2-Billion Investment, Tsinghua-Affiliated AI Model Star Makes Its Debut
What does it feel like to stand at ground zero of a nuclear explosion? Amid the AI explosion triggered by ChatGPT, Yang Zhilin, who has spent a decade in the relatively niche field of NLP (Natural Language Processing), finds himself in such a position. This "prodigy" who was admitted to Tsinghua University with perfect scores in programming courses had already published two seminal papers as the first author during his Ph.D. at Carnegie Mellon University—on Transformer-XL and XLNet—which became crucial breakthroughs in today's large AI model technology.
"At first, I was incredibly excited, like being hit by an apple," Yang told 36Kr, before quickly sinking into frustration. Then, realizing how much more there was to do, he became "excited again."
This is also the origin of his second AI startup, Moonshot AI. The name comes from the iconic Pink Floyd album The Dark Side of the Moon. Yang believes that developing large AI models is akin to a moonshot: "the dark side of the moon" symbolizes mystery, curiosity, and immense challenge.
In fact, Moonshot's core team has participated in the development of several major AI models including Google Gemini, Google Bard, PanGu NLP, and WuDao - this is a team that has been exploring the "moonshot" path for years. Currently, the success of large AI models still largely depends on technical capabilities.
Despite maintaining a low profile in China's competitive large model market over the past six months, Moonshot has attracted significant investor interest. According to latest information obtained by 36Kr, Moonshot has completed a funding round exceeding $200 million, placing it among the top-funded large model startups in China.
After operating for over half a year, Moonshot officially launched its first large model product on October 9th - the intelligent assistant Kimi Chat. This marks Moonshot's first attempt at creating a To C super application in the large model field.
(Image source: Moonshot AI)
Kimi Chat supports input of up to 200,000 Chinese characters, currently the longest context window supported by any large model product worldwide.
This achievement represents a new height in Moonshot's exploration of long-text technology. Compared to current mainstream models, Kimi Chat's context length is 2.5 times that of Claude 100k (which supports about 80,000 characters in practice) and 8 times that of GPT-4-32k (about 25,000 characters in practice).
Nowadays, there are numerous large model products on the market. How does Kimi Chat, with its extended context length, differ in usage?
The most obvious difference is that you can input a large amount of information to the model at once, allowing it to understand and process the information for Q&A, effectively reducing hallucination issues.
For example, you can hand over long articles from public accounts to Kimi Chat and let it summarize and analyze them for you:
When you discover a new algorithm paper, Kimi can directly help you reproduce the code based on the paper:
When exams are approaching, you can simply hand over an entire textbook to Kimi, and it will help you prepare:
Moreover, with just a link, it can role-play your favorite game character and converse with you:
Currently, Moonshot AI's intelligent assistant product Kimi Chat is open for beta testing. Users can join the beta program by visiting Moonshot.cn (or scanning the QR code at the end of the article).
Notably, unlike other large model companies that focus on parameter counts and showcasing various industry cases, 'long-context' became the absolute highlight at Moonshot's product launch.
'Whether it's text, voice, or video, lossless compression of massive data can achieve high-level intelligence. To effectively improve large model performance, we need to not only expand model parameters but also increase context length - both are equally important,' said Yang Zhilin.
The reason large models can achieve qualitative leaps in intelligence is that by scaling parameters to hundreds of billions, they enable 'emergence' (where models autonomously produce complex behaviors or characteristics).
However, the current major bottleneck in large model implementation isn't model size, but insufficient context length. Limited text length severely constrains model capabilities.
A typical problem is that in multi-turn conversations or scenarios requiring complex steps, models often fail to remember - they forget settings immediately after being mentioned. For example, Character.AI users frequently complain that models can't retain key information:
This is similar to how computers operate: CPUs handle computations while memory stores temporary data that determines processing speed. "If we say parameter count determines how complex the 'computations' a large model can support, then the amount of text input it can receive (i.e., long-context technology) determines the model's 'memory capacity' - together they define the model's application effectiveness," he explained.
This is why Moonshot prioritized maximizing context length while maintaining models with hundreds of billions of parameters.
Expanding context length presents dual challenges in both model training and inference regarding computing power and memory requirements. For instance, computational load increases quadratically with context length - a 32x increase in context leads to 1000x more computation. Even with top-tier hardware configurations (like 8 GPUs with 80GB memory each), current systems can only process about 50,000 Chinese characters in hundred-billion-parameter models.
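The "32x context, roughly 1000x compute" figure follows directly from the quadratic cost of self-attention. The sketch below illustrates the arithmetic only; the hidden size `d_model` and the 2x constant are illustrative assumptions, not Moonshot's actual configuration.

```python
# Back-of-the-envelope sketch of why attention cost grows quadratically
# with context length. d_model and the constant factor are assumptions
# chosen for illustration.

def attention_flops(seq_len: int, d_model: int = 1024) -> int:
    """Approximate FLOPs for one self-attention layer: building the
    seq_len x seq_len score matrix (Q @ K^T) and applying it to V each
    cost about seq_len^2 * d_model multiply-adds."""
    return 2 * seq_len * seq_len * d_model

base = attention_flops(1_000)      # a ~1k-token context
scaled = attention_flops(32_000)   # a 32x longer context
print(scaled // base)              # 32^2 = 1024, i.e. roughly 1000x
```

Because the ratio depends only on sequence length, no choice of hidden size changes the conclusion: naively scaling context 32x multiplies the attention compute by about a thousand, which is why architectural work rather than raw hardware is needed.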
However, with Kimi Chat, Moonshot's team implemented hundreds of optimizations across model training through innovative network architectures and improved algorithms, enabling comprehensive understanding of ultra-long texts in hundred-billion-parameter models.
Simply put, Moonshot AI doesn't rely on common but performance-limiting techniques like sliding windows, downsampling, or smaller models for long-text processing. Instead, they developed long-range attention mechanisms based on large models to create truly practical ultra-long text technology.
Enhancing the memory of models can significantly broaden the application scenarios of large models in the future. For instance, professions like lawyers and analysts could leverage large models to analyze lengthy reports, while games requiring extensive information-based reasoning, such as Werewolf, could also be handled adeptly by these models.
Prior to this product launch, 36Kr conducted an in-depth interview with Yang Zhilin. As someone at the epicenter of this technological revolution, Yang speaks about AI large models with a sense of conviction. He occasionally makes startling assertions in a casual tone.
For example, "Next token prediction is the only problem." "As long as we stay the course, we can achieve general artificial intelligence (AGI)."
Another example: "Within five years, large models will maintain strong technological barriers and will not commoditize."
From LLM (Large Language Model) to LLLM (Long-Context Large Language Model), Kimi Chat is just Moonshot's first step. However, Moonshot now embodies some of Yang Zhilin's more futuristic, almost "Black Mirror"-like visions: in the future, if machines can capture a person's lifetime of information, individuals could have AI avatars that share all their memories, effectively becoming another version of themselves.
The following is an edited transcript of 36Kr's conversation with Yang Zhilin:
36Kr: Let's start by talking about this product launch. Many large companies and startups choose to release a specific large model first, whether open-source or closed-source. After the hype around large models for half a year, Moonshot has now opted to launch a To C intelligent assistant product first. Why?
Yang Zhilin: Because I firmly believe in starting with the end goal in mind—only when large models are used by the majority can the most intelligence emerge. Moonshot adheres to application-oriented model development. We don’t just want to release a model to quickly gain short-term technical attention in the tech community.
For example, the value of 'long-context' technology might not be immediately apparent to users. But through the Kimi intelligent assistant, we can directly reach users. We want technology to become an indispensable assistant in users' daily lives, iterating the model based on real feedback to create practical value as early as possible.
36Kr: How have you felt in the six months since ChatGPT was released?
Yang Zhilin: This past year has been a rollercoaster of emotions. If it were a breakthrough in controlled nuclear fusion, it wouldn’t really concern me. But this (large language models) is something I’ve worked on for ten years—it feels like being hit by an apple (a reference to Newton's apple).
When ChatGPT was first released, I was incredibly excited. I was curious about what kind of AI the world could create and how much I could replicate or even surpass human capabilities.
At the same time, I also felt very frustrated—because this wasn’t something I had achieved, right? I started thinking about what I could contribute in this wave, and then I got excited again: now is the perfect timing, no matter what happens, we must act.
36Kr: Did ChatGPT directly prompt you to establish your new company, Moonshot?
Yang Zhilin: Yes. From initial excitement to frustration, and then deciding to start a business, I gradually regained rational thinking—considering what kind of team I wanted, what stage we're at in the technological evolution, and what we should do.
Then came the anxiety—everyone was talking about building large models. Can large models really be built? Or is it impossible?
Eventually, I returned to rationality. I began to look at these matters from a long-term perspective. Short-term progress in large models, with releases from all directions, is essentially just noise.
GPT-4 sits at a significantly higher level, while other models are below it. Right now, debates about 'who's better' are meaningless. Over the past six months, I've been thinking about the underlying logic and realized this is something we're well suited to tackle.
36Kr: In what way are you suited?
Yang Zhilin: Every technological breakthrough presents three layers of opportunity.
The first layer is seized by those who discover the first principles—like OpenAI. This requires strong vision and foresight, backed by experience.
The second layer emerges during the technological innovation phase, solving directional challenges—like how to handle long context. Teams that can master the technology can capture this.
The third layer is purely about application—when the technology is fully understood, and only implementation matters.
We can seize the second-layer opportunities, where we have strong accumulation and advantages.
36Kr: What kind of large model does Moonshot aim to create?
Yang Zhilin: We hope to first achieve world-leading model capabilities while also focusing on consumer-facing super applications. By connecting technology with users through products, we aim to jointly create general intelligence. Kimi Chat is just our first product attempt.
Our current model has reached the scale of hundreds of billions of parameters, and in the future it will evolve into a multimodal large model. For now, we are focusing on perfecting the language model.
36Kr: What is your general direction in developing applications?
Yang Zhilin: We are still in the stage of technological innovation, so we will continue to pursue world-class breakthroughs, such as in long-context understanding and multimodal capabilities.
On the product side, we are firmly committed to the consumer market, aiming to create a leading Super App. Taking ChatGPT and Character.ai as examples, these products have already accumulated vast amounts of data and user feedback, with clear signs that they have created new entry points. The new generation of AI holds tremendous potential in both "useful" and "fun" directions.
I believe that whether it's intelligent assistants or emotional companionship, we can use technology to solve real-world problems in people's work and daily lives.
36Kr: What constitutes a genuine demand?
Yang Zhilin: For example, Character.AI offers more diverse emotional interactions, which fundamentally satisfy people's desire for conquest. I believe conquest is a real, essential need.
AI will not ultimately become a completely homogeneous product. It's not like electricity, where charging in Singapore is the same as charging in China. Therefore, the intelligence achieved by Character.AI may surpass that of other companies because they can continuously accumulate data and later specialize in certain areas. This will also lead to higher gross margins for AI compared to traditional cloud computing.
36Kr: Many large model companies are busy poaching talent in Silicon Valley, such as from OpenAI, Google, and Microsoft. How did you assemble the team at Moonshot?
Yang Zhilin: Many of our team members were newly recruited. We primarily look for people around 30 years old with extensive hands-on experience. Starting last December, I went overseas to begin preparing for recruitment.
36Kr: Are overseas AI talents willing to return?
Yang Zhilin: We have overseas offices, and the two sides can actually work in combination.
36Kr: How many people are currently on the Moonshot team? What does your ideal team look like?
Yang Zhilin: Our team has about 60 members, including many technical experts. Every month, individuals with significant influence in specific global fields join us. We are striving to build the team with the highest density of product talent among large model companies.
In the internet era, technology and product roles were clearly separated, but we want our product team to be more directly involved in model optimization, significantly shortening the innovation cycle. In the AI era, there are opportunities for innovation in technology, product development, growth, and commercialization. Our vision is to create a new type of organization that empathizes with users, defines beauty and intelligence standards through objective data, and integrates technology with humanity.
36Kr: Is OpenAI the ideal model for such an organization?
Yang Zhilin: I think they provide many good practices. For example, they don’t engage in internal competition ("horse racing"), which is a very important example.
This isn’t because they lack resources or people—they have plenty. Instead, they focus resources under a unified scope. For instance, if they want to spend 10% of their effort exploring new ideas, there’s a dedicated team for that, with only one main focus—this is crucial. Moreover, they encourage grassroots innovation, where everyone contributes ideas.
36Kr: Many people are currently concerned about cost issues, which directly relate to engineering costs and subsequent commercialization progress. At this stage, what factors are you most focused on?
Yang Zhilin: The key is whether we can find PMF (Product-Market Fit) as soon as possible—that's the top priority.
36Kr: Currently, many major companies and startups are releasing open-source models. Does Moonshot have any open-source plans? How do you think about this issue?
Yang Zhilin: We currently have no plans for open source. I believe open source and closed source will play different roles in the ecosystem. A major function of open source is customer acquisition in the B2B sector. If you want to build a leading Super App, everyone will definitely use closed-source models—it's very difficult to create differentiation with C-end applications on open-source models.
36Kr: You started your entrepreneurial journey during your Ph.D. What insights did your experience founding your first AI company, 'Recurrent AI,' bring you?
Yang Zhilin: Currently, Moonshot is still in its first phase, where the more important task is technical work like reducing unpredictability, which isn't heavily influenced by external factors.
However, from a broader perspective, unpredictability has certainly increased compared to before. A few years ago, the market was more favorable, allowing for expansion and revenue growth. But in a weaker market, the focus shifts to cost control and reducing burn rates. This is what I learned most from my previous entrepreneurial experience.
Large models are expensive, so managing the pace of investment while ensuring tangible outputs and product data is crucial.
36Kr: The AI field has several major directions: computer vision (CV), natural language processing (NLP), and machine learning (ML). CV was more prominent in recent years, with the 'AI Four Dragons' (SenseTime, Megvii, CloudWalk, and Yitu) all focusing on it. You’ve always worked on NLP—why?
Yang Zhilin: Beyond coincidental factors, there are inevitable reasons. I believe the vision (CV) direction saw industrial results earlier, but NLP can address more cognitive problems, enabling AI to deliver real value.
36Kr: How does NLP help AI realize its value?
Yang Zhilin: NLP represents an evolution from visual perception to a more cognitive level.
Take AI painting products like Midjourney—they can generate stunning images, but they’re essentially painters without a brain. They don’t understand U.S.-China relations or the history of Indigenous enslavement. To become a top-tier painter, you need to know such history. Eventually, it’s not just about painting; you must do many things beyond it.
From this perspective, NLP tackles harder, more challenging problems, like reasoning, making AI’s capabilities more comprehensive.
36Kr: Transformer is your main research focus, and it's also the foundation of ChatGPT. What is the revolutionary significance of Transformer?
Yang Zhilin: I was fortunate that half of my Ph.D. was after 2017 because that's when Transformer emerged—a monumental turning point. The advent of the Transformer architecture brought a massive shift in the cognitive framework of the entire NLP field. Once it appeared, we realized the vast potential it unlocked, suddenly providing clear direction. Many things that were previously impossible became achievable.
36Kr: How do you interpret this "cognitive shift"?
Yang Zhilin: The understanding of language models in AI has evolved through three stages:
Before 2017, language models were seen as having limited utility—useful for small tasks like speech recognition, sorting, grammar, and spell-checking, but with very narrow use cases.
The second stage came with Transformer and BERT, where language models could handle most tasks, though still in a supporting role: AI engineers would fine-tune them for specific tasks.
In the third stage, the AI field has progressed to the point where the prevailing view is that everything is essentially a language model. Language models are the core problem, or rather, next-token prediction is the fundamental challenge.
Think of the world as a hard drive. Once human civilization is digitized, the sum of all human knowledge is the hard drive's total capacity. Input tokens can be language or anything else—as long as you can predict the next token, you achieve intelligence.
From conceptual to systemic levels, technology has undergone significant changes, with many variables at play. This opens up a space to explore how to further refine these technologies.
36Kr: From the emergence of Transformer in 2017 to the explosion of ChatGPT this year, there were five years in between. During those five years, your important work on Transformer-XL and XLNet was actually rejected at one point. Did you ever doubt your research direction during that time?
Yang Zhilin: This is quite interesting. When the industry undergoes a shift in understanding, and the adjustment hasn't caught up yet, there exists non-consensus. Some people think non-consensus is wrong, but in reality, it might be correct. OpenAI was undoubtedly a pioneer in this regard because they were the first to have this correct non-consensus, the first to realize that 'language modeling is the only problem that matters.'
Our research at the time achieved outstanding results, among the best in the world. But the reviewers asked us one question: What's the use of language modeling? You haven't proven its utility. However, at that point, the key wasn't to seek validation but to actually make it work.
36Kr: You said, 'The only important problem is predicting the next token.' If this was non-consensus at the time, how did you realize and firmly believe in it?
Yang Zhilin: To be honest, I wasn't entirely convinced back then. Even now, I don’t think it’s necessarily a consensus—it’s still in the process of becoming one.
36Kr: What does 'predicting the next token' mean, and how should we understand it?
Yang Zhilin: Essentially, predicting the next token is equivalent to "modeling the probability of the entire world." Given anything, you can estimate its probability. The world itself is a vast probability distribution, with some uncertainties that cannot be modeled—you don't know what will happen next. But there are certain aspects you can determine, allowing you to rule out some possibilities. This is a universal model for understanding the world. Researchers have long studied this problem, for example as density estimation. At its core, this is exactly what large models do. However, at the time, I only realized it was an important problem but didn't recognize it as the sole issue to solve.
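Formally, the claim that next-token prediction amounts to "modeling the probability of the entire world" is the chain rule of probability: the likelihood of any sequence factorizes exactly into a product of next-token predictions, so learning the conditionals is equivalent to learning the full joint distribution.

```latex
p(x_1, x_2, \ldots, x_T) \;=\; \prod_{t=1}^{T} p\!\left(x_t \mid x_1, \ldots, x_{t-1}\right)
```

A model trained to minimize next-token cross-entropy is therefore performing density estimation over whole sequences, which is the sense in which "predicting the next token" subsumes both understanding and generation.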
36Kr: What made you change your mind?
Yang Zhilin: When GPT-3 was released in 2020, there was clearer evidence. The most impressive thing about OpenAI is that they observed more data and scaled up model parameters and training earlier. They realized sooner that simply scaling up could solve all problems.
36Kr: Knowing its importance, how does this influence your technical approach?
Yang Zhilin: To revisit the earlier point, if the world had only one problem—predicting the next token—then input and output would essentially be the same. In other words, 'understanding' and 'generation' are actually the same problem.
A few years ago, we used to distinguish between whether we were building understanding models or generation models, but now that distinction is unnecessary.
36Kr: However, many teams today focus more on text understanding first, prioritizing that aspect while leaving generation for later.
Yang Zhilin: That line of thinking isn't fundamental enough. Any approach that claims 'only understanding, not generation' is misguided. The correct direction is this: understanding and generation are one and the same. If you can achieve strong understanding, you can achieve strong generation—they are entirely equivalent.
36Kr: So the two can't be separated.
Yang Zhilin: Exactly. There's only one problem now. For example, if I can generate the next 10 seconds of a video, I must first have a deep understanding of the preceding video—what happened, what the story is, and how it might logically progress. These aspects are inseparable.
36Kr: Are you confident about achieving AGI (Artificial General Intelligence)?
Yang Zhilin: Confidence comes from first principles. I think we already understand the principle now: it's essentially about predicting the next token. If we persist down this path, I believe we can achieve it.
However, there are indeed some "second-level" problems - specific technical challenges. But these are minor issues, not fundamental ones. The second level is what we need to overcome.
36Kr: How would you describe Moonshot's goals and vision in one sentence?
Yang Zhilin: Our long-term goals are: to explore the limits of intelligence, to make AI useful, and to enable everyone to have truly inclusive and accessible AI.
36Kr: How do you understand "inclusive AI"?
Yang Zhilin: One current issue is that AI's values are often controlled by a centralized institution. How a model behaves is entirely determined by the platform—what it deems 'good' or the 'correct' answer based on its values.
But everyone has their own values. Values are more fundamental; they encompass personal preferences—what you believe is right or wrong.
Everyone should have the opportunity for such personalized customization, so future AI should also have the chance for 'alignment' (the process of ensuring AI systems' behavior matches expected human values and goals).
Of course, we must establish safety baselines and regulatory frameworks. On this foundation, there can be many opportunities for personalized AI.
36Kr: What is the implementation path for personalized AI? Can everyone train an AI model that represents themselves?
Yang Zhilin: Training is one approach, but I believe in the future, it might not even require training—perhaps direct configuration will suffice.
A possible ultimate form is that AI will digitally record everything. Your phone or computer will host a symbiotic AI Agent (AI proxy or digital twin) that knows everything you know.
36Kr: On your personal homepage, you state that all your work aims to 'maximize AI's value.' What does this refer to?
Yang Zhilin: The greatest value is that, ultimately, people won't have to do what they don't want to do, preserving the most essential aspects of humanity.
For example, this conversation of ours doesn't have to be face-to-face; there could be more efficient ways—like having our AI Agents communicate directly. The same goes for companies. Current organizations spend time setting performance metrics and evaluations, which is very time-consuming. In the future, we might not even need companies. An individual's efficiency could be much higher, and people wouldn't have to go to work just to earn a little money, as AI could handle many tasks.
Achieving this will undoubtedly be difficult, but ultimately, humanity might maximize productivity. In the end, true communism could emerge.
36Kr: If you were to make a prediction about the future now, what changes do you think society will see in ten years? Or, in your opinion, what will be the most significant transformation brought by AI to society?
Yang Zhilin: Ten years is a bit hard to predict, but I can talk about the next five.
I think that, for at least the next five years, large model technology won't commoditize (meaning the technology will still have barriers and won't become a cheap commodity). At least a significant number of models haven't been released yet, and we haven't truly seen large video models.
I believe the next two years will be a window for the continuous iteration of text models. After that, the following three years will be the window for the iteration of video models. There will always be technological barriers here.
36Kr: So, large video models will be a critical milestone?
Yang Zhilin: Yes, after surpassing these milestones, a significant transformation will occur.
There's a company in the U.S. called Rewind (focused on 'recording everything,' enabling humans to search all content they've seen). Currently, its product can only answer questions like, 'What did I do last month?' It records information, but the functionality is still relatively shallow.
In the future, AI Agents will achieve deeper personalization. For example, large models will share memories with you, understanding all your value preferences and orientations. If you ask one to draft a Q3 plan, it will do so based on this shared memory, without you having to explain what was done in Q2.
36Kr: From text to images, and now to video models and Agents, what is the key to achieving this?
Yang Zhilin: It's context (context length, which can be understood as the amount of information a model can process at once). This essentially determines the upper limit of the value AI can deliver.
If a large model's context encompasses all your memories, theoretically, it could perform all the tasks you currently do.
For large models, the most critical factor is how much context can be captured. This depends on the capabilities of the video model. If the model is powerful enough, theoretically, the combined data from your phone and computer could represent your complete context.
A human life is no different—we live in a digital world every day. Except for offline conversations like this one, which it might not capture, most other interactions are manageable.
36Kr: If this state is truly achieved, how should humans coexist with machines?
Yang Zhilin: I am quite optimistic. While AI enhances productivity, it should also create many new job opportunities.
Video content currently consumes the most of people's time, so it will undoubtedly significantly impact production relations. This means everyone could potentially produce (videos), leading to a redistribution of value.
However, this is a process with a long feedback loop. The challenge lies in the fact that the current rate of job displacement outpaces the creation of new jobs. The core issue is how to address social problems before ideal job opportunities are created.
36Kr: How should ordinary people respond to this technological transformation? What should they do as these changes continue?
Yang Zhilin: I believe the most important thing is continuous learning. Not just for ordinary people—everyone must develop strong lifelong learning capabilities to realize their true value in the future.
Another key point is to stay open-minded. Five years ago, I approached many people about working on large models, but they dismissed it, saying they were focused on digital humans (laughs). People are often limited by their own perceptions. Regardless of our stance on technology, historical progress transcends individual will. Therefore, we must constantly iterate and adapt, as the only constant in this world is change itself.