Cerebras has open-sourced seven trained GPT-class large language models (LLMs), ranging in size from 111 million to 13 billion parameters, for use in research or commercial projects without royalties, Cerebras CEO Andrew Feldman told EE Times. The models were trained in a matter of weeks on Cerebras CS-2 wafer-scale systems in its Andromeda AI supercomputer.
GPT-class models are notoriously large: GPT-3, the predecessor of the models behind ChatGPT, has 175 billion parameters. Training these models is therefore limited to the small number of companies that can afford it, and it takes many months. The pre-trained GPT-class models offered by Cerebras may be fine-tuned with a “modest amount” of custom data to make an industry-specific LLM, which requires a relatively small amount of compute by comparison.
“I think if we’re not careful, we end up in this situation where a small handful of companies holds the keys to large language models,” Feldman said. “GPT-4 is a black box, and Llama is closed to for-profit organizations.”
It isn’t just companies smaller than OpenAI and DeepMind that cannot afford the compute required; many fields of academia are also locked out.
“It’s too expensive, just plain too expensive,” Feldman said. “Conversely, in some of the most interesting work, necessity is driving innovation. … You’re seeing graduate students trying to fit [LLMs] on laptop CPUs, and you’re seeing all sorts of enormous creativity in an effort to do what they can with the resources that are available to them.”
Cerebras has some CS-2 systems available in the cloud for academic use through certain programs, as well as some at the Pittsburgh Supercomputing Center and in Argonne National Laboratory’s sandbox, he said.
Trained models popular
The trained models Cerebras has released, available under the permissive Apache 2.0 license, had been downloaded more than 200,000 times from Hugging Face at the time of writing (about two weeks after release). They were trained on the Pile, a public dataset from EleutherAI.
Training seven models of different sizes allowed Cerebras to derive a scaling law linking the performance of the model (prediction accuracy) to the amount of compute required for training. That makes it possible to forecast a model’s performance from its training budget. While other companies have published scaling laws, this is the first using a public dataset, the company said.
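As a rough illustration of what such a scaling law looks like in practice, the sketch below fits a power law, loss ≈ a × compute^b, to pairs of training compute and loss using least squares in log-log space. The seven data points are synthetic placeholders generated from a made-up power law, not Cerebras measurements, and the function names are hypothetical.

```python
import math


def fit_power_law(compute, loss):
    """Fit loss ≈ a * compute**b by least squares on log-transformed data."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                       # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)   # intercept in log-log space = log(a)
    return a, b


def predict_loss(a, b, compute):
    """Forecast the loss for a given training compute budget."""
    return a * compute ** b


# Synthetic placeholder data: seven hypothetical training runs whose loss
# follows an invented power law. These are NOT real results.
TRUE_A, TRUE_B = 20.0, -0.05
compute_flops = [1e18 * 4 ** i for i in range(7)]
losses = [TRUE_A * c ** TRUE_B for c in compute_flops]

a, b = fit_power_law(compute_flops, losses)
forecast = predict_loss(a, b, 1e23)  # extrapolate to a larger, hypothetical budget
print(f"fitted exponent b = {b:.4f}")
print(f"forecast loss at 1e23 FLOPs = {forecast:.3f}")
```

With real measurements the points would not fall exactly on a line, and the quality of the fit would determine how far the forecast could be trusted.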
Cerebras was able to train these models in a few weeks on Andromeda, its 16-node CS-2 supercomputer, as there was no effort required to partition models across smaller chips, Feldman said.
Distributing training workloads across multi-chip systems can be a difficult task. Training on multi-chip systems typically uses data parallelism, wherein copies of the model are trained on subsets of the data, which is sufficient for relatively small models. Once models get to about 2.5 billion parameters, data parallelism alone isn’t enough: Single layers become too big for a single chip and need to be broken up across chips. This is called tensor model parallelism. Above about 20 billion parameters, pipelined model parallelism also applies, in which the model is broken up into chunks, with different layers running on different chips. Feldman pointed out that it took a team of 35 people to break up the training work for OpenAI’s ChatGPT and spread it over the GPUs they were using.
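To make the two ways of splitting a model concrete, here is a toy sketch in plain Python of a two-layer linear network computed three ways: on one device, with pipeline parallelism (each whole layer on its own simulated device) and with tensor parallelism (each layer’s weight matrix sliced across two simulated devices). The “devices” are ordinary function calls and the weights are arbitrary numbers; nothing here reflects the actual training code of Cerebras or OpenAI.

```python
def matvec(weights, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


# Arbitrary illustrative weights: layer 1 maps 2 -> 4 values, layer 2 maps 4 -> 2.
W1 = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0], [2.0, 2.0]]
W2 = [[1.0, 0.0, -1.0, 2.0], [0.5, 1.0, 0.0, 1.0]]
x = [1.0, 2.0]


def single_device(inp):
    """Reference: the whole model runs in one place."""
    return matvec(W2, matvec(W1, inp))


def pipeline_parallel(inp):
    """Each simulated device holds one complete layer."""
    def device0(v):
        return matvec(W1, v)

    def device1(v):
        return matvec(W2, v)

    activations = device0(inp)   # would be sent from device 0 to device 1
    return device1(activations)


def tensor_parallel_layer(weights, inp, num_devices=2):
    """Each simulated device holds a slice of one layer's rows (output units).

    Assumes the number of rows divides evenly by num_devices.
    """
    rows_per_device = len(weights) // num_devices
    out = []
    for d in range(num_devices):
        shard = weights[d * rows_per_device:(d + 1) * rows_per_device]
        out.extend(matvec(shard, inp))   # gather each device's partial output
    return out


def tensor_parallel(inp):
    """Every layer is split across the simulated devices."""
    return tensor_parallel_layer(W2, tensor_parallel_layer(W1, inp))


print(single_device(x), pipeline_parallel(x), tensor_parallel(x))
```

All three paths produce the same output. What differs is which device has to hold which weights and what has to be communicated between devices, which is where the engineering effort goes.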
“Our work took one person,” he said. “Our wafer is big enough that we never need to break up the work, and because we use the weight-streaming architecture that holds parameters off-chip, we never need to break up the parameters and spread them across chips. As a result, we can train very, very large models.”
Sticking to a strictly data-parallel approach even for very large models makes training much simpler overall, he said.
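A minimal sketch of what “strictly data-parallel” means: every simulated worker holds a full copy of the model, computes a gradient on its own shard of the batch, and the gradients are averaged before one shared update. The one-parameter linear model, the data and the learning rate are illustrative assumptions, not details from Cerebras.

```python
def gradient(w, batch):
    """Gradient of mean squared error for the one-parameter model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def data_parallel_step(w, batch, num_workers, lr):
    """One training step with the batch sharded evenly across workers.

    Assumes len(batch) is divisible by num_workers.
    """
    shard_size = len(batch) // num_workers
    shards = [batch[i * shard_size:(i + 1) * shard_size]
              for i in range(num_workers)]
    # Each worker uses its own full copy of the model (here, just w).
    grads = [gradient(w, shard) for shard in shards]
    # Averaging stands in for the all-reduce step on real hardware.
    avg_grad = sum(grads) / num_workers
    return w - lr * avg_grad


# Illustrative data generated from y = 3x; eight examples, four workers.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
w = 0.0
for _ in range(200):
    w = data_parallel_step(w, data, num_workers=4, lr=0.01)
print(f"learned w = {w:.4f}")
```

Because the shards are the same size, the averaged gradient equals the gradient over the full batch, so the model itself never has to be cut up.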
Will there be a point at which models become so large that it will be too complex to train them on multi-chip systems?
“There is an upper bound on how big a cluster one can make, because at some point, the taxes of distributing compute overwhelm the gains in compute, [but] I don’t think the parameter counts are going to keep getting bigger. … There’s a tradeoff between model size and the amount of data,” he said, referring to Meta’s work on Llama, which showed that smaller models trained on more data are easier to retrain and fine-tune.
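The tradeoff can be put in rough numbers with a widely used rule of thumb, which is an assumption here and not something stated by Feldman: training compute is approximately 6 × parameters × tokens floating-point operations. Under a fixed compute budget, a smaller model can therefore be trained on proportionally more data. The budget below is an arbitrary illustrative figure; the model sizes are the ones mentioned in this article.

```python
# Rule-of-thumb assumption (not from the article):
# training FLOPs ≈ 6 * parameters * tokens.
FLOPS_PER_PARAM_PER_TOKEN = 6


def tokens_for_budget(compute_budget_flops, num_params):
    """Number of training tokens a fixed compute budget can pay for."""
    return compute_budget_flops / (FLOPS_PER_PARAM_PER_TOKEN * num_params)


budget = 1e22  # hypothetical compute budget in FLOPs
for params in (111e6, 13e9, 175e9):
    tokens = tokens_for_budget(budget, params)
    print(f"{params:.3g} parameters -> {tokens:.3g} tokens "
          f"({tokens / params:.3g} tokens per parameter)")
```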
“If you keep growing the parameters … the models are so big, they’re difficult and awkward to work with,” he said. “I think what you’re going to see is a great deal of work on better data, cleaner data.”