Run, tune, and scale generative models that power AI applications

Luis Ceze, CEO and Co-Founder of OctoML, gave this presentation at the Generative AI Summit in Boston in 2023.

I’ve spent most of my professional life building efficient computer systems for a bunch of applications, from life sciences and bioinformatics to simulations. For the past 10 years, I’ve been making AI systems more efficient using hardware and software techniques.

OctoML spun out of the University of Washington. We built a system called Apache TVM, which was one of the early machine learning compilers that enabled AI to be deployed on your phone, the cloud, and everywhere else.

OctoML started four years ago and has about 100 people. We began by focusing on model optimization, and over the past year or so, after realizing how hot GenAI would become, we started building a product that serves application developers who are building with AI, rather than machine learning engineers.

We're going to talk about what OctoAI is and how we help our customers be successful, and then we're going to take a look under the hood to see how it all works.

What is OctoAI?

OctoAI is a platform that enables developers to run the AI models they choose, such as open-source models or custom models, and add them to their applications. It allows you to fine-tune the model on the platform, and once you’re in deployment, it makes the model scale. That means it solves hardware availability problems.

Scaling these models involves doing a lot of deep optimizations, which is what we do really well.

It takes about 10 trillion computer operations to generate one high-definition AI-generated image. That's a lot of compute.

The reason I mention this is that as you scale this to have a global impact, you have to provide the compute for that, make it fast enough that people have the patience to use it, and have the resources to do so.
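To make that number concrete, here is a rough back-of-the-envelope sketch in Python. Only the figure of about 10 trillion operations per image comes from the talk. The sustained throughput of 50 trillion operations per second and the demand of 1,000 images per second are made-up illustrative values, and the function names are hypothetical.

```python
import math

# From the talk: roughly 10 trillion operations per high-definition image.
OPS_PER_IMAGE = 10 * 10**12


def seconds_per_image(sustained_ops_per_second: float) -> float:
    """Time one accelerator needs for one image at a given sustained throughput."""
    return OPS_PER_IMAGE / sustained_ops_per_second


def accelerators_needed(images_per_second: float,
                        sustained_ops_per_second: float) -> int:
    """Minimum number of accelerators needed to keep up with the demand."""
    total_ops_per_second = images_per_second * OPS_PER_IMAGE
    return math.ceil(total_ops_per_second / sustained_ops_per_second)


if __name__ == "__main__":
    # Assumed values for illustration only, not measurements.
    throughput = 50 * 10**12   # hypothetical accelerator: 50 trillion ops/s sustained
    demand = 1000              # hypothetical load: 1,000 images per second

    print(seconds_per_image(throughput))            # 0.2 seconds per image
    print(accelerators_needed(demand, throughput))  # 200 accelerators
```

Under these assumptions, every reduction in operations per image or gain in sustained throughput lowers the accelerator count proportionally, which is why optimization matters so much at scale.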

Based on our expertise in model optimization, we built a self-optimizing compute service that automatically chooses the right optimizations to make your model viable, the right hardware (you don't have to go and pick a specific GPU to do the work), and the right system parameters.
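One simple way to picture this kind of automatic choice is the sketch below. It is not how OctoAI works internally; it only illustrates the idea of selecting the cheapest combination of hardware and optimization that still meets a latency target. The `Candidate` class, the `pick_configuration` function, the hardware names, and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """One hypothetical (hardware, optimization) pairing with measured numbers."""
    hardware: str
    optimization: str
    latency_s: float       # measured seconds per request
    cost_per_hour: float   # price of the hardware per hour

    @property
    def cost_per_request(self) -> float:
        # Cost of occupying the hardware for the duration of one request.
        return self.cost_per_hour * self.latency_s / 3600.0


def pick_configuration(candidates: Sequence[Candidate],
                       max_latency_s: float) -> Optional[Candidate]:
    """Return the cheapest candidate that meets the latency target, if any."""
    viable = [c for c in candidates if c.latency_s <= max_latency_s]
    if not viable:
        return None
    return min(viable, key=lambda c: c.cost_per_request)


if __name__ == "__main__":
    # Made-up measurements for illustration only.
    candidates = [
        Candidate("gpu-small", "baseline", latency_s=4.0, cost_per_hour=1.0),
        Candidate("gpu-small", "compiled", latency_s=2.0, cost_per_hour=1.0),
        Candidate("gpu-large", "compiled", latency_s=1.0, cost_per_hour=3.0),
    ]
    best = pick_configuration(candidates, max_latency_s=2.5)
    print(best.hardware, best.optimization)  # gpu-small compiled
```

With a tighter target, such as 1.5 seconds, the same function would pick the larger GPU instead, because it is the only candidate that is fast enough.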

And then you can build your application by choosing models you're probably familiar with, like Stable Diffusion, Llama 2, or ControlNet.

So that's what OctoAI is, and it's not just compute. The way this works is we offer a set of end-to-end solutions, like image generation, gaming, personal expression through art, marketing, and so on.

We also have solutions for language, such as text summarization, question answering, and so on.

All of that is underpinned by a set of highly optimized open-source models. We truly believe that open-source models are the way forward to build systems that you can rely on, that you understand, and whose deployment you control.

We run these models for you. You don't have to provision the compute. That's part of the platform.

You can tune the models on the platform. For image generation, you can bring your own images to the system, fine-tune the image generation model with them, and use the resulting model in your application. We then manage these customized models for you.