NVIDIA Releases Nemotron-4: A 15B-Parameter General-Purpose Large Model Designed to Run on a Single A100/H100 GPU
NVIDIA's newly introduced Nemotron-4 language model has attracted significant attention. This general-purpose model features 15 billion parameters and, after training on 8 trillion tokens, excels at English, multilingual, and coding tasks. Across seven downstream evaluation areas, Nemotron-4 15B outperforms models of similar parameter counts and, in some cases, even models more than four times its size.
The model's design is informed by the Chinchilla scaling laws, which emphasize jointly optimizing data and model size under a fixed computational budget. Whereas earlier approaches primarily scaled up model size, this research allocates more of the compute budget to training on additional data, which reduces latency and the computational cost of serving the model. Consequently, Nemotron-4's primary goal is to be the best "general-purpose large model" that can run on a single NVIDIA A100 or H100 GPU.

In terms of architecture, Nemotron-4 adopts a standard decoder-only Transformer with causal attention masking. The core parameter budget comprises 3.2 billion embedding parameters and 12.5 billion non-embedding parameters. For data, the researchers used a pre-training dataset of 8 trillion tokens, divided into English natural language data (70%), multilingual natural language data (15%), and source code data (15%).
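The 70/15/15 data split translates directly into per-category token budgets. A minimal sketch of that arithmetic, using only the figures stated above (the dictionary keys and variable names are illustrative, not from the paper):

```python
# Token budget per data category for an 8T-token pre-training run,
# using the 70/15/15 English/multilingual/code split reported for Nemotron-4.
TOTAL_TOKENS = 8_000_000_000_000  # 8 trillion tokens

mixture = {
    "english": 0.70,       # English natural language data
    "multilingual": 0.15,  # multilingual natural language data
    "code": 0.15,          # source code data
}

# round() rather than int() avoids float-truncation artifacts (e.g. 0.7 * 8e12
# is fractionally below 5.6e12 in binary floating point).
tokens_per_category = {
    name: round(frac * TOTAL_TOKENS) for name, frac in mixture.items()
}

for name, tokens in tokens_per_category.items():
    print(f"{name}: {tokens / 1e12:.1f}T tokens")
```

So English accounts for 5.6T tokens, with 1.2T each for multilingual text and code.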
To train this massive model, Nemotron-4 used 384 DGX H100 nodes, each equipped with 8 NVIDIA H100 80GB SXM5 GPUs. In 16-bit floating-point (bfloat16) arithmetic, each GPU offers a peak throughput of 989 teraFLOP/s. The researchers trained the model with a combination of tensor parallelism and data parallelism, using a distributed optimizer.
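For a sense of scale, the cluster figures above imply the following aggregate peak compute. This is a back-of-envelope sketch based only on the node count, GPUs per node, and per-GPU peak quoted in the article; actual sustained throughput would be lower than the hardware peak:

```python
# Aggregate bfloat16 peak compute for the reported Nemotron-4 training setup.
nodes = 384                 # DGX H100 nodes
gpus_per_node = 8           # H100 80GB SXM5 GPUs per node
peak_tflops_per_gpu = 989   # bfloat16 peak throughput per H100, in TFLOP/s

total_gpus = nodes * gpus_per_node                      # 3072 GPUs
aggregate_pflops = total_gpus * peak_tflops_per_gpu / 1000  # in PFLOP/s

print(f"{total_gpus} GPUs, ~{aggregate_pflops:.0f} PFLOP/s peak (bfloat16)")
```

That works out to 3,072 GPUs with roughly 3 exaFLOP/s of peak bfloat16 compute across the cluster.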
In downstream evaluations, Nemotron-4 demonstrated strong performance across a range of domains, particularly commonsense reasoning, popular aggregate benchmarks, and mathematics and code tasks. The model also achieved best-in-class performance on multilingual classification and generation tasks, showcasing strong understanding across languages. Notably, Nemotron-4 made significant progress in machine translation, excelling not only at translating Chinese to English but also at direct translation from Chinese into other languages. The introduction of Nemotron-4 represents a major step forward for NVIDIA in general-purpose large models, setting a new benchmark for models that run on a single A100 or H100 GPU.
Paper link: https://arxiv.org/abs/2402.16819