GPT-2: OpenAI's Commercial Ambitions in NLP
This article reviews the evolution of NLP in recent years, tracing its progression through three key phases.
Natural Language Processing (NLP) technology is transforming our lives in various ways. The smart speakers in our living rooms are rapidly improving through daily conversations, even adapting to our preferences and habits with playful banter.
To understand how we got here, it is worth briefly retracing the leaps NLP has made in recent years. Going back to the sources of this wave of innovation makes the trajectory of NLP's evolution much easier to grasp.
Those following NLP developments are well aware that 2018 was a landmark year for the field. In June 2018, OpenAI published a paper titled Improving Language Understanding by Generative Pre-Training, introducing GPT, a model based on "pre-trained language models." GPT was the first to replace LSTM with Transformer networks for language modeling, achieving state-of-the-art (SOTA) performance in 9 out of 12 NLP tasks. However, for various reasons, GPT did not gain widespread attention at the time.
GPT's approach involves unsupervised pre-training on large-scale corpora, followed by fine-tuning on much smaller supervised datasets for specific tasks. This method does not rely on task-specific model design techniques and can achieve strong performance across multiple tasks simultaneously.
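As a rough sketch of this two-stage recipe, the code below loads already pre-trained GPT-2 weights through the Hugging Face transformers library (a convenient stand-in here, not OpenAI's original code) and fine-tunes a classification head on a tiny, made-up labeled dataset.

```python
# A minimal sketch of "pre-train, then fine-tune", assuming the Hugging
# Face transformers library; the toy sentiment data is purely illustrative.
import torch
from transformers import GPT2Tokenizer, GPT2ForSequenceClassification

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token              # GPT-2 ships without a pad token

# Stage 1 (unsupervised pre-training) is already done: we just load the weights.
model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id

# Stage 2: fine-tune on a small supervised dataset (here, two toy examples).
texts = ["a wonderful, moving film", "dull and far too long"]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
out = model(**batch, labels=labels)                    # cross-entropy loss on the labels
out.loss.backward()
optimizer.step()
```

One gradient step obviously does not constitute real fine-tuning, but the division of labor is the point: the expensive language modeling happens once, up front, and each downstream task only needs a small, cheap supervised pass.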
In October 2018, Google released BERT (Bidirectional Encoder Representations from Transformers), which quickly captured widespread attention. BERT achieved SOTA performance on 11 NLP tasks, leading Google researchers to declare that "BERT has ushered in a new era for NLP."
BERT essentially adopted the same two-stage model as GPT: first, unsupervised pre-training of a language model, followed by fine-tuning for downstream tasks. The key difference was that BERT used a bidirectional language model similar to ELMo during pre-training and employed a much larger dataset.
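To get a feel for what "bidirectional" means in practice, the snippet below, again using Hugging Face tooling as a stand-in rather than Google's original release, asks a pre-trained BERT to fill in a masked word using context from both sides:

```python
# BERT's masked-language-model objective: predict a hidden word from the
# words on *both* sides of it (assumes the Hugging Face pipeline API).
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
for pred in unmasker("The goal of [MASK] pre-training is to model language."):
    print(pred["token_str"], round(pred["score"], 3))
```

GPT, by contrast, only ever sees the words to the left of the one it is predicting.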
BERT's versatility and impressive performance in transforming downstream NLP tasks—including sequence labeling (e.g., Chinese word segmentation, part-of-speech tagging, named entity recognition, semantic role labeling), classification tasks (e.g., text classification, sentiment analysis), sentence relation judgment (e.g., entailment, QA, semantic rewriting, natural language inference), and generative tasks (e.g., machine translation, text summarization, poetry generation, image captioning)—solidified its prominence in NLP.
Just four months later, in February 2019, OpenAI released GPT-2. This large-scale unsupervised NLP model could generate coherent paragraphs of text, set new SOTA results on 7 major language-modeling datasets, and handle reading comprehension, question answering, machine translation, and other language tasks without any task-specific training.
Like BERT and the original GPT, GPT-2 continues to use the Transformer's self-attention mechanism as its underlying architecture. OpenAI's insistence on training with unlabeled data likely stems from the view that supervised learning confines a language model to excelling at specific tasks while generalizing poorly, and that simply adding more labeled examples does not meaningfully broaden its capabilities.
Thus, they opted for transfer learning with self-attention modules on more general datasets, building models capable of performing multiple NLP tasks in a zero-shot setting.
Unlike BERT, GPT-2 retained the "unidirectional language model" structure of the original GPT. GPT-2 has a single training objective: predict the next word given all the words that precede it. OpenAI stuck stubbornly to this design while scaling it up, expanding the Transformer to 48 layers with 1.5 billion parameters and training it, without supervision, on WebText, a dataset of 8 million web pages.
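The following is a minimal sketch, in plain PyTorch rather than OpenAI's own code, of the causal masking that makes such a model "unidirectional": every position may attend to itself and to earlier positions, never to later ones.

```python
import torch
import torch.nn.functional as F

# Toy single-head causal self-attention with made-up dimensions.
torch.manual_seed(0)
seq_len, d_model = 5, 16
x = torch.randn(1, seq_len, d_model)                   # one toy sequence of 5 tokens

Wq, Wk, Wv = (torch.nn.Linear(d_model, d_model) for _ in range(3))
q, k, v = Wq(x), Wk(x), Wv(x)

scores = q @ k.transpose(-2, -1) / d_model ** 0.5      # (1, 5, 5) attention scores
future = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
scores = scores.masked_fill(future, float("-inf"))     # hide everything to the right
attn = F.softmax(scores, dim=-1)
context = attn @ v                                     # used to predict the *next* token

print(attn[0])   # upper triangle is all zeros: no attention to future tokens
```

GPT-2 stacks 48 such layers (multi-headed and much wider), but the mask is the same; BERT's bidirectional pre-training simply drops it.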
In short, GPT-2 is a direct scaling-up of GPT, trained on over 10 times more data with 10 times more parameters. This allowed GPT-2 to surpass BERT through sheer brute force—increasing model capacity and training data volume.
As a text generator, GPT-2 can take a few words as input and autonomously decide how to continue writing. In essence, as a general-purpose language model, GPT-2 can be used to create AI writing assistants, more powerful chatbots, unsupervised language translation, and improved speech recognition systems. OpenAI envisioned potential malicious uses, such as generating misleading news, impersonating others online, producing fake or harmful social media content, or automating spam and phishing emails.
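As a small illustration of that prompt-and-continue behaviour, the sketch below uses the Hugging Face wrappers around the publicly released small checkpoint rather than OpenAI's original repository:

```python
# Give GPT-2 a few words and let it decide how to continue (assumes the
# Hugging Face transformers library and the small public "gpt2" checkpoint).
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "In a shocking finding, scientists discovered"
inputs = tokenizer(prompt, return_tensors="pt")

output = model.generate(
    **inputs,
    max_length=60,
    do_sample=True,                      # sample instead of greedy decoding
    top_k=40,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

It is precisely this ease of producing fluent continuations from an arbitrary prompt that fed the misuse concerns described above.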
Consequently, upon releasing GPT-2, OpenAI warned of "the risk of malicious abuse" and chose not to fully open-source the model. The decision sparked intense debate among machine learning and NLP researchers. Whether it was criticized as overconfidence in OpenAI's own product or dismissed as a PR stunt, GPT-2's ability to "generate fake news" left a strong impression on the industry: even as critics mocked the decision, many were eager to explore the model's generative capabilities.
Over nearly a year, GPT-2 underwent dazzling updates amid cautious open-sourcing and developer experimentation.
Amid controversy and developer demand, OpenAI opted for a phased release. Over the course of 2019 it published progressively larger versions: a 124-million-parameter model (about 500 MB on disk), a 355-million-parameter model (1.5 GB), and, in August, a 774-million-parameter model (3 GB).
In November 2019, it finally released the full 1.5-billion-parameter version of GPT-2. Even though no evidence of abuse had surfaced, OpenAI remained cautious, warning that a full release could make it easier for malicious actors to evade detection.
As GPT-2 versions rolled out, OpenAI collaborated with teams replicating the model to validate its performance while mitigating misuse risks and improving text-generation detectors.
OpenAI also partnered with research institutions to study human sensitivity to AI-generated content, potential malicious uses, and statistical detectability of GPT-2 outputs. Despite OpenAI's caution, developers eagerly explored the model's possibilities.
In April 2019, BuzzFeed data scientist Max Woolf open-sourced gpt-2-simple, a lightweight wrapper around the released 117-million-parameter GPT-2 model that makes fine-tuning and text generation straightforward, with surprisingly good results.
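The usage pattern below follows the gpt-2-simple README of the time; treat the exact calls and the training file name as approximate, not a guaranteed current API.

```python
import gpt_2_simple as gpt2

# Download the small released checkpoint, fine-tune it on a local text
# file, then sample from the fine-tuned weights.
gpt2.download_gpt2(model_name="124M")        # the small model, originally labeled 117M

sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              "my_corpus.txt",               # hypothetical training file
              model_name="124M",
              steps=1000)                    # more steps fit the corpus more closely

gpt2.generate(sess, prefix="Once upon a time", length=100)
```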
During OpenAI's phased release, two Brown University graduate students replicated a 1.5-billion-parameter GPT-2, naming it OpenGPT-2. Training from scratch cost about $50,000, using methods from OpenAI's paper.
Many users reported OpenGPT-2 outperforming OpenAI's 774-million-parameter version, though some disagreed.
In China, Nanjing-based developer Zeyao Du open-sourced GPT2-Chinese on GitHub, a project that can write poetry, news, novels, and scripts, or be used to train general-purpose Chinese language models. The impressive 1.5-billion-parameter model comes with pre-trained weights and a Colab demo that generates custom Chinese stories in three clicks.
Other experiments include Singaporean high schooler Rishabh Anand's lightweight GPT-2 "client," gpt2-client, a wrapper for the original repository enabling text generation in five lines of code.
Several researchers from China are using GPT models to generate high-quality classical Chinese poetry. For example, one poem mentioned in a paper, titled Seven-Character Regulated Verse: A Safe Journey, reads:
A lone goose cries across the autumn sky,
Suddenly dreaming of old friends in Qingcheng.
The path through green woods lacks departing horses,
Yet my hand holds a yellow scroll for the returning boat.
A lifetime’s ambitions fade like Shangshan’s elders,
When will high office in the Han court remain?
If only we could reminisce together,
Sharing a drink atop ten thousand hills.

A seemingly ordinary farewell is imbued with profound melancholy and nostalgia, making one wonder: does this language model truly possess emotions?
GPT-2's approach can also be applied to music composition. OpenAI introduced MuseNet, a deep neural network for generating musical works, which uses the same general-purpose unsupervised technology as GPT-2 and is built on the Sparse Transformer. MuseNet predicts the next note in a given sequence, enabling it to create four-minute compositions with ten different instruments and to learn styles ranging from Bach and Mozart to The Beatles. It can even convincingly blend styles to produce entirely new pieces.
One particularly intriguing application is AI Dungeon, a text adventure game developed using GPT-2. Through multi-turn text interactions, the AI crafts unexpected narratives, whether a knight’s dragon-slaying quest or a detective’s urban investigation. In the future, AI-generated storylines might offer even greater imaginative potential for the gaming industry.
In the year since GPT-2’s release, its open-source applications have dazzled with their variety. Beyond the excitement and innovation, however, OpenAI faces significant challenges, including the high costs of training such models. The evolution from BERT to GPT-2 shows that larger models and unsupervised training can produce superior human-like content, but this requires exorbitant GPU computing resources, massive machine-learning clusters, and lengthy training processes. This "money-burning" approach may centralize NLP advancements among a few well-funded players.
If OpenAI releases GPT-3.0 this year, it will likely continue with a unidirectional language model, scaling up training data and model size to compete with BERT, further pushing NLP achievements. Yet, the lack of clear commercial prospects for such costly models poses a dilemma: Should OpenAI prioritize its original technological vision or pivot toward profitability?
The answer seems clear. In July 2019, OpenAI accepted a $1 billion investment from Microsoft. Officially, the partnership aims to co-develop AI technologies for Microsoft Azure, with an exclusive agreement to expand large-scale AI capabilities and fulfill the promise of Artificial General Intelligence (AGI).
This move reflects OpenAI’s financial strain and the challenges of commercializing its research. For instance, the 1.5-billion-parameter GPT-2 model required 256 TPU v3 chips, costing $2,048 per hour. Future iterations like GPT-3.0 will likely demand even greater cloud computing resources.
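To put that hourly figure in perspective, here is a back-of-the-envelope calculation; only the $2,048-per-hour rate comes from the reporting above, while the one-week duration is a purely hypothetical assumption.

```python
# Rough cost estimate: the hourly rate is quoted above, the duration is
# an assumption made only for illustration.
hourly_rate_usd = 2048
assumed_training_days = 7

total_cost = hourly_rate_usd * 24 * assumed_training_days
print(f"~${total_cost:,} for {assumed_training_days} days of TPU time")   # ~$344,064
```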
Microsoft will serve as OpenAI’s exclusive cloud provider, with OpenAI’s AI technologies being delivered via Azure. OpenAI will license some technologies to Microsoft for commercialization and distribution to partners.
This financial backing empowers OpenAI to proceed confidently. As seen with GPT-2’s staggered release and eventual open-sourcing in November 2019, future commercialization efforts may leverage Microsoft Azure, such as integrating with Office365 for automated text generation, grammar correction, or more natural Q&A systems.
OpenAI was once driven by youthful idealism; now, realizing AGI depends on practical business strategy as well. In 2020, the Microsoft-OpenAI alliance is poised to shake up the NLP commercialization landscape and challenge competitors like Google.