The stunning rise of LLMOps: navigating the new frontier in AI deployment
Opportunities for tooling and infrastructure startups
ChatGPT exploded onto the stage at the end of 2022, sparking broad interest in foundation models and their subset, large language models (LLMs). This surge in interest significantly compressed the timeline for widespread LLM adoption, with the interval between their public debut and broad application and tool development shrinking dramatically from years to months.
The rise of machine learning and deep learning (ML/DL) witnessed a slow build-up to the moment when MLOps—operational tools for machine learning—came into focus. In stark contrast, the ascendance of LLMs rapidly sparked a parallel interest in LLMOps and vector databases, as evidenced by recent search term trends (refer to Chart 1).
Chart 1. Interest in selected AI-related search terms over time (Google Trends)
Pattern matching: ML to LLM
The rise of LLMs and foundation models poses key questions for founders and investors: To what extent will existing tech stacks accommodate these models, and where might opportunities lie for launching and investing in new startups?
These opportunities will surface gradually, emerging from the tension between the existing MLOps players and new foundation-model-native entrants. LLMOps as a category will not appear overnight; it will absorb pieces of the existing MLOps stack and combine them with new, as-yet-unknown product sub-categories.
LLMOps will then likely morph into Foundation Model Ops (FoMoOps, pun intended). LLMs are already becoming increasingly multimodal, and multimodality is moving to the core of the training process, e.g. in Rabbit's large action model.
To illustrate these dynamics, one may look at the evolution of the Machine Learning, AI and Data (MAD) landscape documented by FirstMark Capital, specifically the birth of the machine learning and artificial intelligence category. Introduced in 2021, the category did not come from nowhere; rather, it emerged out of new and existing sub-categories that used to sit within the analytics and infrastructure categories of the stack, for instance, data science notebooks and GPU cloud, respectively.
If our pattern matching is correct, and the adoption of LLMOps mirrors trends observed in broader machine learning categories, we, as investors, must identify the aspects of the nascent LLM ecosystem where incumbents will have an advantage (distribution channels, partnerships, existing portfolio of customers, talent and so on). Simultaneously, we should pinpoint where opportunities for new entrants emerge.
Analysing ML project lifecycle
The existing MLOps stack is still quite fragmented (300+ companies, according to Gartner) and dominated by relatively small companies (for comparison, DataRobot made $100M+ ARR in 2021 vs Datadog at $1.6B in 2022), so opportunities should be plentiful. To discern where LLM-specific tools and infrastructure could carve out a niche, we analysed the lifecycle of a typical ML project and compared its key phases with those needed to deploy LLMs. Here we focus on LLMs specifically, but the same approach could be applied when a candidate to replace LLMs emerges.
Within this lifecycle (Chart 2), certain segments—marked in red—remain largely unaffected by LLMs and are domains where established players continue to hold sway. Conversely, the areas highlighted in yellow represent emerging opportunities within the grasp of existing players, while the green segments indicate fields ripe for innovation and new market entries.
Chart 2. Influence of LLMs and opportunities for new entrants within the MLOps stack (Approx.vc)
Data platform, databases & feature storage
A closer look at the data platform (which includes databases/file systems, data processing frameworks, versioning, pipeline scheduling, etc.) and feature storage segments reveals a natural adaptation to the demands of LLMs. This is happening partly via the emergence of vector databases like Weaviate, and partly via vector embeddings being integrated into traditional databases, for instance via pgvector, a vector similarity search extension for Postgres. Both trajectories seem to resonate with developers (Chart 3).
We tend to believe, however, that existing database players are strong enough to make vectors a feature of database platforms. A realistic way to differentiate might be to push for real-time vectorisation while finding the best way to use traditional databases and feature stores (relational/NoSQL) together with vector DBs. So, we’d expect some opportunities to emerge in the feature storage segment of the project lifecycle, but not to the point of replacement or disruption.
Chart 3. GitHub stars history for Weaviate and pgvector (Star History)
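The core primitive both trajectories offer is nearest-neighbour search over embeddings. A minimal sketch of that idea, with a toy in-memory store standing in for a vector DB or a pgvector column (the document texts and vectors are made up for illustration):

```python
import math

# Toy embedding store: maps document text to a pre-computed embedding vector.
# In production these vectors would come from an embedding model and live in a
# vector DB (e.g. Weaviate) or a Postgres table with a pgvector column.
DOCS = {
    "refund policy":  [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.1],
    "privacy notice": [0.0, 0.2, 0.9],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def nearest(query_vec, k=1):
    """Return the k documents whose embeddings are closest to the query."""
    ranked = sorted(DOCS, key=lambda d: cosine_similarity(query_vec, DOCS[d]),
                    reverse=True)
    return ranked[:k]

print(nearest([0.85, 0.15, 0.05]))  # closest to "refund policy"
```

Real systems replace the linear scan with approximate nearest-neighbour indexes (HNSW, IVF), which is exactly the part vector DBs and pgvector compete on.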
Feature engineering & model training
LLMs radically transform several laborious steps of in-house ML model development, such as data collection, ingestion, and analysis. Foundation models and LLMs have also changed the nature of training itself: from a distinct step of an in-house ML pipeline, training has been transformed into a collection of fine-tuning strategies distributed across the model lifecycle.
For instance, prompt engineering does not involve training in the traditional sense but allows one to change the model's behaviour during inference by crafting appropriate instructions or queries. This ‘training without training’ represents an interesting opportunity outside the MLOps pipeline, and we will cover it separately (data sources, compute).
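To make ‘training without training’ concrete, here is a minimal few-shot prompt builder: the labelled examples and the sentiment task are invented for illustration, and the point is that the ‘learning’ lives in the assembled context window, not in any updated weights:

```python
def build_prompt(examples, query):
    """Assemble a few-shot prompt; the model adapts its behaviour from the
    in-context examples rather than from any change to its weights."""
    shots = "\n".join(f"Review: {text}\nSentiment: {label}"
                      for text, label in examples)
    return f"{shots}\nReview: {query}\nSentiment:"

EXAMPLES = [
    ("Great battery life, highly recommend.", "positive"),
    ("Broke after two days, total waste.", "negative"),
]

prompt = build_prompt(EXAMPLES, "Arrived late and the box was damaged.")
print(prompt)
```

Swapping the examples re-targets the model to a new task instantly, which is why prompt management is emerging as its own tooling niche.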
Labelling
Within the MLOps pipeline, fine-tuning brings a new dimension to labelling. Specifically, reinforcement learning from human feedback (RLHF) requires a preference dataset, which is not usually relevant for classical ML. However, it is not clear that RLHF will be necessary in the future: approaches like LIMA have opened a promising direction of reinforcement-learning-free alignment, and we're bullish on such techniques since they simplify the workflow substantially.
The significance of labelling also increases as more extensive labelling is required for fine-tuning a pre-trained model for specific use cases, for instance, sales or customer service.
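To illustrate why preference labelling differs from classical annotation, here is a sketch of a single RLHF preference record and how it might feed a reward model; the prompt, responses, and helper are hypothetical, chosen only to show the pairwise structure:

```python
# One RLHF preference record: the labeller compares two candidate responses to
# the same prompt. Classic supervised labelling has no such pairwise structure.
preference_example = {
    "prompt": "Summarise our refund policy for a customer.",
    "chosen": "You can return any item within 30 days for a full refund.",
    "rejected": "Refunds are handled by the refunds team.",
}

def to_training_pairs(record):
    """Flatten a preference record into (prompt, response, preferred?) tuples
    that a reward model could be trained on."""
    return [
        (record["prompt"], record["chosen"], 1),
        (record["prompt"], record["rejected"], 0),
    ]

pairs = to_training_pairs(preference_example)
```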
Model performance optimisation
LLM deployments face more acute memory bottlenecks than traditional ML models. They may require various memory optimisation strategies (data and tensor parallelism, etc.), model compression (quantisation, distillation, pruning), and runtime and scheduling optimisation. On the optimisation side, we also see tight competition from existing players (Deci.AI, OctoML, NeuralMagic) and strong pressure from open-source solutions, so founders have to find their specific edge and insight to compete in this space and be prepared to outrun open source.
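The memory argument behind quantisation is simple enough to show in a few lines. This is a toy symmetric int8 scheme on a handful of made-up weights, not any particular vendor's method:

```python
def quantise_int8(weights):
    """Symmetric int8 quantisation: store each weight as an int8 value plus
    one shared float scale, cutting memory 4x vs float32."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantise(q, scale):
    return [v * scale for v in q]

weights = [0.52, -1.27, 0.003, 0.9]
q, scale = quantise_int8(weights)
restored = dequantise(q, scale)
# The price of the 4x memory saving is a small rounding error per weight,
# bounded by half the quantisation step (scale / 2).
```

Production systems layer per-channel scales, outlier handling, and calibration on top of this basic idea, which is where the competing optimisation vendors differentiate.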
Existing MLOps tooling providers are already adjusting to a new reality of working with pre-trained LLMs. For example, it took less than two years for a leading labelling platform, Scale AI ($600M raised, according to Crunchbase), to introduce RLHF products (Chart 4) and gain more foundation model developers as customers (Chart 5). DataRobot, probably one of the largest players in the MLOps space, founded in 2012, put generative AI at the forefront of its platform (Chart 6). On the optimisation side, OctoML, rebranded as OctoAI, also moved optimisation for foundation models to the forefront of its product offering (Chart 7).
Chart 4. Evolution of Scale AI product offering as of Jan 2022 and Dec 2023 (scale.ai, Internet Archive)
Chart 5. Gen AI/foundation model customer portfolio Scale AI as of Jan 2022 and Dec 2023 (scale.ai, Internet Archive)
Chart 6. Evolution of DataRobot’s product offering as of Jan 2022 and Dec 2023 (DataRobot, Internet Archive)
Chart 7. Evolution of OctoML’s product offering as of Jan 2022 and Jan 2024 (Octo.ai, Internet Archive)
So, it might seem that the MLOps incumbents are well prepared for LLMs. There is, however, good news for entrants:
There are steps in the ML lifecycle where adopting existing MLOps solutions will be harder due to a step change in LLMs' architecture and capabilities;
Some product investments by incumbents to improve in-house ML training workflows, especially for generative tasks, are obviated by LLMs, effectively lowering barriers for new entrants; however, in some use cases (demand forecasting, credit scoring, etc.), the old MLOps guard still rules;
New customer segments emerge due to lowered entry barriers into AI.
Model evaluation, deployment, and monitoring are among the most exciting steps of the model life cycle where current MLOps incumbents are less well positioned.
Model evaluation
Evaluation of LLMs involves more complexity. LLMs require intrinsic metrics (ROUGE, BLEU, BERTScore) that measure the similarity of a response to a provided reference answer, combined with human evaluation and task-specific benchmarks (GLUE, SuperGLUE). Traditional ML, by contrast, works with a hold-out validation set and evaluation metrics such as accuracy, precision, recall, F1 score, etc. Moreover, we expect new evaluation approaches for models that use external tools like search indexes, external databases, and function calling, since designing a testing pipeline for those is a novel challenge.
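To show how reference-based metrics differ from hold-out accuracy, here is a stripped-down ROUGE-1 F1 over unigram overlap; real evaluation suites use the full ROUGE family (with stemming and multi-reference support) alongside human judgement, and the example sentences are invented:

```python
def rouge1_f1(candidate, reference):
    """Toy ROUGE-1: F1 over unigram overlap between a generated answer and a
    reference answer, rather than exact-match accuracy on a hold-out set."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    overlap = sum(min(cand.count(w), ref.count(w)) for w in set(cand))
    if not overlap:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
```

Note the metric rewards partial overlap: the two sentences differ in one word, yet score well above zero, which is exactly the behaviour a hold-out accuracy check cannot express.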
Model deployment
Also, unlike ML projects, LLMs are deployed in chains/pipelines that connect multiple LLM calls and reach out to other systems, like web search or databases. Therefore, an additional tooling layer emerges within the deployment bucket, e.g. LangChain and LlamaIndex. As we mentioned before, these tools affect evaluation, but more importantly, they will change deployment and monitoring: chains of calls increase the complexity of the stack and introduce new failure modes that have to be monitored and guarded with fallbacks.
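A minimal sketch of such a chain, in the spirit of LangChain/LlamaIndex but not using either library: `call_llm` and `search_db` are stand-ins for a real model call and a retrieval step, and the point is the new failure mode (retrieval can fail independently of the model) and its fallback:

```python
def call_llm(prompt):
    # Stand-in for a real LLM API call.
    return f"answer based on: {prompt}"

def search_db(query):
    # Stand-in for a retrieval call; fails when nothing matches.
    if "unknown" in query:
        raise LookupError("no documents found")
    return "refunds are allowed within 30 days"

def answer(question):
    try:
        context = search_db(question)            # step 1: retrieve context
    except LookupError:
        return "Sorry, I couldn't find that."    # fallback for a new failure mode
    return call_llm(f"{question}\n{context}")    # step 2: generate from context
```

Each extra hop in the chain multiplies the ways a request can fail, which is why monitoring and fallback design become first-class concerns in this layer.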
Model monitoring
The generative capabilities of foundation models introduce more complexity to model monitoring. In addition to internal model parameters, the output itself should be monitored and prevented from reaching an end user if it is offensive, incorrect, or leaks proprietary data. Attacks on LLMs will pose additional threats once models can reliably use external tools, since attackers may find ways to corrupt internal data or extract proprietary information. We believe this part of the monitoring stack poses completely new challenges, and we're seeing more and more companies emerge here.
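The simplest form of this output-side monitoring is a guard that screens a response before delivery. The blocklist terms below are invented placeholders; real guards would combine toxicity classifiers, PII detectors, and groundedness checks rather than substring matching:

```python
# Assumed policy terms, for illustration only.
BLOCKLIST = {"ssn", "internal-only"}

def guard_output(text):
    """Screen a model response before it reaches the end user; return a
    withheld-response placeholder if it trips the policy."""
    lowered = text.lower()
    if any(term in lowered for term in BLOCKLIST):
        return "[response withheld by safety filter]"
    return text
```

The key architectural point is that the guard sits between the model and the user, so it can evolve independently of the model, which is precisely where new monitoring entrants are positioning themselves.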
In addition to identifying elements of the model lifecycle that are strongly affected by LLM-specific demands but not addressed by incumbents, new entrants may win over competitors by working with new customer groups. An illustration is Weights&Biases, which seems to reach wider audiences through product-led growth, while H2O focuses on sales-led growth, which is more relevant for the enterprise segment (Chart 8).
Chart 8. Monthly visits: Weights&Biases vs DataRobot vs H2O (Similarweb)
What we are excited about
To sum up, we are excited about LLM-first tools and the emergence of LLMOps. We expect, however, fierce competition with existing MLOps vendors and open-source projects. To win this race, founders who build LLM tools will have to find go-to-market strategies that circumvent enterprise sales and/or target overlooked audiences. They also will have to identify unique niches where traditional MLOps struggles. The most exciting steps of the model life cycle are model evaluation, deployment, and monitoring.
The list of exciting components of the LLMStack is not definitive, and we appreciate that there are other overlooked niches. We’d suggest that founders use the following checklist to assess, at the high level, if there is an opportunity for an LLM tool they are building. This checklist might also help evaluate opportunities for whatever comes next after LLMs.
Look at the most well-funded MLOps startups and scaleups and compare their product offerings before and after the LLM hype. What is there, and what is not? Marketing page updates might not reveal whether incumbents solve the problem effectively, but they do point to where resources are concentrated;
What's your edge, especially regarding go-to-market, if your product directly competes with LLM offerings from MLOps vendors?
Are there any open-source alternatives to your product? Are these OS projects experiencing stratospheric growth? What’s your edge? Is there a gap or opportunity to help customers rapidly leverage these projects?
To what extent does your product address the specific unique needs of foundation models/LLMs driven by their architecture, capabilities, and deployment? What is the recent step change you wish to capitalise on?
To what extent will your LLM-first product scale into a platform and support ML models? For example, working with tabular data is a crucial capability of incumbent ML platforms. If a customer needs it, for instance, for demand forecasting, will they have to use an incumbent's product in addition to yours, which focuses on LLMs only?
If building a platform is not your goal, then in what ways is your product designed to integrate seamlessly with existing MLOps ecosystems?
What are the fundamental research risks that could make your product irrelevant, e.g. is there another LLM-type architecture around the corner that will make part of the pipeline disappear? How would you hedge against them?
The more positive answers you score, the more likely your project will become a standalone product or a platform rather than a feature for existing ML platforms.
✍️We will continue to explore the foundation model stack and the opportunities for startups in the infrastructure, tooling, and application spaces. Subscribe here to receive our writing in your inbox.
***
🙏🙏🙏 Kudos to Arthur Etchells and James Arthur for insightful suggestions and intriguing questions that helped us enhance the clarity and depth of this text.
Arthur is a product builder and AI consultant. He previously brought constructor.io to market and now builds, writes and invests in commerce, health and AI. Find him at arthuretchells.com.
James is a co-founder of Electric, a local-first software platform that makes it easy to develop high-quality, modern apps with instant reactivity, realtime multi-user collaboration and conflict-free offline support. Electric is an approx.vc portfolio company.