The future of AI is distributed, said Ion Stoica, co-founder, executive chairman and president of Anyscale, on the first day of VB Transform. And that’s because the growth in model complexity shows no signs of slowing down.
“In recent years, the compute requirements to train a state-of-the-art model have grown, depending on the dataset, between 10 and 35 times every 18 months,” he said.
Just five years ago, the largest models could fit on a single GPU; fast forward to today, and it takes hundreds or even thousands of GPUs to hold the parameters of the most advanced models. PaLM, Google’s Pathways Language Model, has 540 billion parameters – and that’s only about half the size of the largest models, which exceed 1 trillion parameters. Google used more than 6,000 chips to train its most recent model.
Even if these models stopped growing and GPUs kept improving at the same rapid pace as in recent years, it would still take about 19 years before a single GPU was powerful enough to run today’s state-of-the-art models, Stoica added.
“Essentially, this is a huge gap, growing month by month, between the demands of machine learning applications and the capabilities of a single processor or a single server,” he said. “There is no other way to support these workloads than to distribute them. It’s that simple. Writing these distributed applications is difficult. It’s even harder than before.”
The unique challenges of scaling applications and workloads
There are multiple phases in building a machine learning application, from data labeling and preprocessing to training, hyperparameter tuning, serving, reinforcement learning and so on – and each of these phases must be scaled. Usually, each step requires a different distributed system. To build end-to-end machine learning pipelines or applications, it is necessary not only to stitch these systems together but to manage them all, and to develop against a variety of APIs. All of this adds a tremendous amount of complexity to an AI/ML project.
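To make those phases concrete, here is a deliberately minimal, single-process Python sketch of such a pipeline. Every function body is a hypothetical stand-in; in practice, each phase would typically run on its own distributed system with its own API.

```python
# Illustrative only: each phase below is a trivial stand-in for a real
# system (a data-processing cluster, a training framework, a tuning
# service, a serving fleet), each of which normally has its own API.

def preprocess(raw_records):
    # e.g., clean and normalize labeled data
    return [r.strip().lower() for r in raw_records]

def train(features, learning_rate):
    # stand-in for a real training loop
    return {"lr": learning_rate, "score": len(features) * learning_rate}

def tune(features):
    # naive hyperparameter sweep: try a few learning rates, keep the best
    return max((train(features, lr) for lr in (0.01, 0.1, 1.0)),
               key=lambda m: m["score"])

def serve(model):
    # stand-in for an inference endpoint
    return model["score"]

model = tune(preprocess(["Cat", "Dog ", " Fish"]))
print(serve(model))
```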
The mission of Ray, the open-source distributed computing project, and of Anyscale is to make scaling these workloads easier, Stoica said.
“With Ray, we tried to provide a computational framework on which to build these applications end-to-end,” he said. “Anyscale basically provides a hosted, managed Ray and, of course, security features and tools to make the development, deployment and management of these applications easier.”
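As a rough illustration, here is a minimal sketch of the same kind of pipeline on Ray, using only Ray’s core task API (`@ray.remote`, `.remote()` and `ray.get` are real Ray primitives; the function bodies and the toy hyperparameter sweep are assumptions for illustration):

```python
import ray

ray.init()  # starts or connects to a Ray cluster

@ray.remote
def preprocess(shard):
    # shards are cleaned in parallel across the cluster
    return [r.strip().lower() for r in shard]

@ray.remote
def train(shards, learning_rate):
    # stand-in for a distributed training run over all shards
    n = sum(len(s) for s in shards)
    return {"lr": learning_rate, "score": n * learning_rate}

# Fan out preprocessing, then a hyperparameter sweep, all through a
# single API instead of one distributed system per phase.
shards = ray.get([preprocess.remote(s) for s in (["Cat"], ["Dog "], [" Fish"])])
sweep = [train.remote(shards, lr) for lr in (0.01, 0.1, 1.0)]
print(max(ray.get(sweep), key=lambda m: m["score"]))
```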
Hybrid stateful and stateless computation
The company recently launched a serverless product, which abstracts away the underlying infrastructure, eliminating the worry of where functions will run and easing the burden on developers and programmers as they scale. But with a transparent infrastructure, functions are limited in what they can do: they perform a computation, write the data back to S3, for example, and then they are gone. Many applications, however, require stateful operators.
Training, for example, which iterates over a lot of data, would become far too expensive if that data were written back to S3 after each iteration, or even just moved from GPU memory to machine memory, because of the overhead of getting the data in and out and of serializing and deserializing it along the way.
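One way Ray mitigates exactly this overhead is its shared-memory object store: large data can be placed in the store once and then passed by reference to each task, so workers on the same machine read it without a round trip to external storage or repeated serialization. A minimal sketch, with an illustrative array size and a made-up training step:

```python
import numpy as np
import ray

ray.init()

# Store the training data once; ray.put returns a reference to the
# object in Ray's shared-memory object store.
data_ref = ray.put(np.random.rand(1_000_000))

@ray.remote
def train_step(data, step):
    # Workers on the same node read the array zero-copy from shared
    # memory; nothing is written back to S3 between iterations.
    return float(data.mean()) * step

# The reference, not the data itself, is shipped to each task.
print(ray.get([train_step.remote(data_ref, i) for i in range(5)]))
```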
“Ray was also built from day one around these kinds of operators that can maintain state and continuously update it, what we call ‘actors’ in software engineering jargon,” he said. “Ray has always supported this dual mode of stateless and stateful computation.”
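A minimal sketch of such an actor in Ray, with a hypothetical `Trainer` whose update rule is purely illustrative:

```python
import ray

ray.init()

@ray.remote
class Trainer:
    """A stateful worker: its weights live in the actor's own process
    memory and persist across method calls."""

    def __init__(self):
        self.weights = 0.0

    def update(self, gradient):
        # Continuously update state in place rather than reloading or
        # re-serializing it on every iteration.
        self.weights -= 0.1 * gradient
        return self.weights

trainer = Trainer.remote()
for g in (0.5, -0.2, 0.1):
    ref = trainer.update.remote(g)
print(ray.get(ref))  # final weights after all updates
```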
Where AI deployment stands today
It’s tempting to say that AI deployment has finally reached the walking stage, pushed forward in the AI transformation journey by the recent acceleration of digital growth – but we’ve only seen the tip of the iceberg, Stoica said. There is still a gap between the current market size and the opportunity, comparable to where big data stood about 10 years ago.
“It takes time because the time [needed] is not just for developing tools,” he said. “It’s educating people. Educating experts. That takes even more time. If you look at big data and what happened, eight years ago a lot of universities started offering degrees in data science. And of course there are now a lot of AI courses, but I think you’re going to see more and more applied AI and data courses, of which there aren’t many these days.”
Learn more about how distributed AI is helping companies ramp up their business strategy and catch up on all Transform sessions by registering for a free virtual pass here.