Enterprise-Grade ML Solutions with Scale AI, Mosaic ML, Comet ML, and Arthur AI

By Gabriella Garcia on July 3, 2023

One of the advantages of being affiliated with Two Sigma is that I get to chat with really smart and curious people who are always playing around with the latest technologies and tools. The recent hype around generative AI has been no exception. Like other data science-driven enterprises, we find the potential of large language models (LLMs) very exciting – no need for interns to label data, we can explore data like never before, we can migrate legacy codebases with a simple call, and we can even replace internal search with a chatbot.

However, it takes a lot of hard work to go from experiment to pilot to enterprise deployment – it’s easy to make something cool with LLMs, but very hard to make something production-ready. Technical challenges include scaling prompt engineering, managing latency, building tooling, and even designing good agent policies. In addition, before fully unleashing LLMs into their internal stacks or user-facing products, many companies want better tools for handling data privacy, segregation, security, copyright, and monitoring model outputs. Companies in regulated industries, from fintech to healthcare, are especially focused on this and have trouble finding good, differentiated software solutions. As a result, we’re seeing significantly less adoption of generative models within enterprises than among consumers, and no significant deployments in production (yet).

While generative AI has captured the imagination of everyday people like myself (thanks Bard for helping me write this post), it’s essential to be transparent and set expectations about the capabilities and deployment feasibility of generative models.

There are a lot of things LLMs can do, but for many use cases a traditional predictive model is easier, cheaper, and faster to deploy for the same business outcomes. From explainability and hallucinations to concerns about data privacy and logging, there are several critical challenges around LLMs that must be resolved before we can build for high-sensitivity enterprise use cases.

Enterprise ML challenges are not new – engineers are constantly combating training data bias, monitoring is essential but also painful, and evaluation is hard even with traditional deep learning models. However, the world’s newfound easy access to high-quality generative models has compounded these enterprise issues, mostly due to the ambiguous nature of natural language and partially due to the nascence of the field.

Fortunately, there are many smart people and startups focused on these problems. The LLM stack landscape is constantly shifting, and the infrastructure layer will continue to evolve rapidly for the next several years. As I’m writing this, Contextual AI has announced their $20M seed round to build contextual language models, Sequoia released their thoughts on the LLM stack, and Marc Andreessen published why AI will save the world. I have no doubt that as agent policies get clarified and more guardrails go into place, we will see another step change in adoption as these models become better trusted.

Last month, we had an opportunity to host an excellent event here in NYC, where much of the enterprise AI revolution is happening, in collaboration with MIT’s Martin Trust Center for Entrepreneurship, the MIT Club of NYC, and the MIT Sloan Club of NYC. We were fortunate to partner with some incredible people, including Vijay Karunamurthy (Field CTO of Scale AI), Jonathan Frankle (CSO of Mosaic ML), Gideon Mendels (CEO of Comet ML), and Adam Wenchel (CEO of Arthur AI), who were refreshingly honest about the state of AI and quite spicy at times with their takes. What follows is a summary of the panel discussion; no statement below should be attributed to any particular speaker.

Q1 – Challenges when building for an enterprise

Two Sigma Ventures Gabriella Garcia: From explainability and hallucinations to security concerns about proprietary data, there are several critical challenges businesses must address to make AI a reality. From your perspective, what is the biggest challenge when building for the enterprise (in comparison to consumers)?

Generative AI models have the potential to revolutionize the way we analyze data, tell stories, and create presentations. However, expecting these models to perform these tasks without human input is unrealistic. The technology is powerful and genuinely useful, but it is also deeply flawed. Therefore, it is important to understand the strengths and weaknesses of generative AI models and build accordingly. Enterprise users must approach this technology with caution and manage their expectations, rather than trusting it completely.

Evaluating generative AI models presents a significant challenge. In classical and deep learning use cases, evaluation is traditionally simpler. However, with models that generate free-form text or images, evaluating the outputs is really hard. The fine-tuning and evaluation processes typically involve a template prompt, hyperparameters like temperature, and different variables to test the model’s behavior before putting it into production. This often means poring over massive Excel sheets to figure out how the models will behave, which is not a great developer experience.
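To make that workflow concrete, here is a minimal sketch (not any panelist’s actual tooling) of what such an evaluation loop often looks like: sweep a template prompt across a few temperature settings and dump the outputs into a sheet for human review. The `generate` function is a hypothetical stand-in for a real model client.

```python
import csv
import itertools

# Hypothetical stand-in for a real model client (an API call in practice).
def generate(prompt: str, temperature: float) -> str:
    return f"<completion at T={temperature}>"

TEMPLATE = "Summarize the following support ticket in one sentence:\n{ticket}"
tickets = [
    "My invoice total doesn't match my order.",
    "The app crashes when I upload a CSV.",
]
temperatures = [0.0, 0.3, 0.7]

# Sweep the template over inputs x temperatures and dump a review sheet --
# essentially the "massive Excel sheet" workflow described above.
with open("eval_sheet.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["ticket", "temperature", "output"])
    for ticket, temp in itertools.product(tickets, temperatures):
        writer.writerow([ticket, temp, generate(TEMPLATE.format(ticket=ticket), temp)])
```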

Retrieving data containing sensitive information to augment generative models presents additional challenges. While reasoning about and aggregating this data is necessary, end-users must not be permitted to view individual records. Existing data governance policies determine who can access which columns in the database. However, in the transition from a deterministic to a probabilistic world, it is crucial to rethink data governance, privacy, and data leakage. New challenges, such as toxicity, have yet to be fully explored.
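One way to carry existing column-level governance into a retrieval-augmented setup is to redact each record down to the caller’s permitted columns before it can ever reach a prompt. A minimal sketch, with invented roles and columns:

```python
# Column-level access policies, as in an existing governance layer.
# The roles and columns here are invented for illustration.
ALLOWED_COLUMNS = {
    "analyst": {"region", "revenue"},
    "support": {"region"},
}

def redact(record: dict, role: str) -> dict:
    """Keep only the columns the caller's role is allowed to see."""
    allowed = ALLOWED_COLUMNS.get(role, set())
    return {k: v for k, v in record.items() if k in allowed}

record = {"region": "EMEA", "revenue": 1.2e6, "ssn": "123-45-6789"}
context = redact(record, role="support")
# The SSN and revenue never enter the prompt, so they cannot leak
# into a generated answer or a prompt log.
prompt = f"Answer using only this context: {context}\nQ: Which region is this account in?"
print(prompt)
```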

Although some models can be put into production without human input, caution is necessary. This is particularly true for enterprise use cases, where the high-sensitivity nature of the data requires a more careful approach. It is important to critique model output and ask for explanations to avoid potentially inaccurate results.

Despite the challenges of generative AI models, there is work being done to improve their accuracy. For example, Scale uses reinforcement learning from human feedback (RLHF) to train reward models, which is expected to become a core technique in the future. Human preferences and feedback can be incorporated to critique the policy model and improve the accuracy of the outputs. It is also important to ask the model to explain how it arrived at its answers, even for basic problems. Without these techniques, the model’s output could be a hallucination rather than grounded in fact.
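For readers unfamiliar with the mechanics, reward models in this recipe are typically trained on pairwise human preferences with a Bradley–Terry-style loss (this is the standard published approach, not a description of any vendor’s internal pipeline); a minimal sketch:

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise reward-model loss: -log sigmoid(r_chosen - r_rejected).
    Minimizing it pushes the reward of the human-preferred response
    above the reward of the rejected one."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

print(preference_loss(2.0, 0.5))  # small loss: reward model agrees with labelers
print(preference_loss(0.5, 2.0))  # large loss: reward model contradicts them
```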

In conclusion, while generative AI models have transformative potential, significant challenges must be overcome before we can fully realize their benefits. Continued investment and research will be necessary to make significant progress in this field in the years to come.

Q2 – Bottlenecks for scaling

Two Sigma Ventures Gabriella Garcia: Every wave of innovation is fueled by new powerful tools and processes, and each of your product suites (Scale AI, Mosaic ML, Arthur AI, Comet ML) is critical in enabling teams to build better, faster, and safer. However, as models and tech stacks become more complex and data-intensive, they can become more difficult to scale and maintain – leading to performance bottlenecks and outages. While some of these early challenges are being solved, what do you think are the biggest technical bottlenecks / unsolved eng challenges (latency, human evaluation, GPU availability, etc.) in deploying generative models at scale?

Large-scale models are incredibly powerful, but training and inference can be extremely challenging. One of the biggest issues with these models is their sensitivity to changes. As models improve and new versions are released, whether switching between open-source solutions or using OpenAI, it can be difficult to upgrade to the latest version without encountering issues. This means that sometimes it’s necessary to start from scratch. While this is not purely an engineering scalability issue, it’s something to keep in mind when deploying solutions at scale.

Another challenge associated with large-scale models is the cost of training. For example, to train a state-of-the-art 30 billion parameter language model on a trillion tokens, the cost is likely around a million dollars. This is assuming everything goes smoothly from the start, but the reality is that it often doesn’t. Even if you can afford the cost of training, getting a cluster of GPUs to work well together in the cloud and actually accomplish anything is a difficult task.
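It’s worth showing the back-of-envelope math behind a figure like that (my arithmetic, not the panel’s): the common estimate is ~6 FLOPs per parameter per training token, and the GPU throughput, utilization, and hourly price below are assumptions for illustration.

```python
# Back-of-envelope training cost using the ~6 * params * tokens FLOP rule.
# Throughput, utilization, and price are illustrative assumptions.
params = 30e9                      # 30B-parameter model
tokens = 1e12                      # 1T training tokens
total_flops = 6 * params * tokens  # ~1.8e23 FLOPs

peak_flops_per_gpu = 312e12        # e.g., A100 bf16 peak (FLOP/s)
utilization = 0.40                 # assumed real-world efficiency
usd_per_gpu_hour = 2.00            # assumed cloud price

gpu_hours = total_flops / (peak_flops_per_gpu * utilization) / 3600
print(f"{gpu_hours:,.0f} GPU-hours -> ~${gpu_hours * usd_per_gpu_hour:,.0f}")
# ~400,000 GPU-hours -> ~$800,000, assuming nothing goes wrong
```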

On the inference side, newer models like GPT-4 accept long context windows of up to 32,000 tokens. But even with a cluster of 100 GPUs, generating a response can still take a while. This high cost of inference is a significant factor to consider for tasks with low individual value. Some domains may not be affected by the cost of generating a response, but in others it is unacceptable from an economic standpoint. As a result, it’s important to do the math to determine the feasibility of these projects.
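“Doing the math” here can be as simple as a per-request cost estimate; the per-token prices below are hypothetical placeholders, so substitute your provider’s actual pricing.

```python
# Per-request inference cost at long context. Prices are hypothetical.
prompt_tokens = 30_000              # most of a 32k window filled with context
completion_tokens = 500
usd_per_input_token = 0.03 / 1000   # assumed input price
usd_per_output_token = 0.06 / 1000  # assumed output price

cost = prompt_tokens * usd_per_input_token + completion_tokens * usd_per_output_token
requests_per_day = 100_000
print(f"${cost:.2f}/request, ${cost * requests_per_day:,.0f}/day at {requests_per_day:,} requests")
# ~$0.93/request: fine for drafting a legal brief, untenable for autocomplete
```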

One of the biggest challenges that is still being worked on is search. One notable difference for Bard, compared to some other chatbots, is recency: if you ask most models a question about a recent event, such as “what did people think of the last episode of Succession this week?”, they cannot answer because they can only respond based on what is in their pre-training data set. In contrast, a model like Bard can actually perform a search and provide a coherent and personalized response based on recent data.

In an enterprise context, search is a fundamental requirement for many use cases. Being able to query private data sources and achieving high precision and recall objectives are critical skills. However, the way in which search is done for many of these models is very different from the way search was architected previously. There are many approaches using embedding-based searches, and making those approaches accurate and effective is a significant challenge. Therefore, it is likely that there will be many investments in this field over the next few years.
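As a concrete illustration of the embedding-based approach (a toy sketch, not a production architecture): embed documents and the query into the same vector space, then rank by cosine similarity. The `embed` function below is a hashed bag-of-words stand-in for a real embedding model.

```python
import numpy as np

# Toy embedding: hash words into a fixed-size bag-of-words vector and
# L2-normalize. A real system would call an embedding model instead.
def embed(text: str, dim: int = 64) -> np.ndarray:
    v = np.zeros(dim)
    for word in text.lower().split():
        v[hash(word) % dim] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)

docs = [
    "Q2 revenue grew 12% in EMEA",
    "Onboarding checklist for new analysts",
    "Incident report: CSV upload crash",
]
doc_vecs = np.stack([embed(d) for d in docs])

# With unit vectors, a dot product is cosine similarity.
scores = doc_vecs @ embed("revenue report for EMEA")
for i in np.argsort(-scores):
    print(f"{scores[i]:.2f}  {docs[i]}")
```

Making this accurate at enterprise scale – chunking, filtering by permissions, and hitting precision and recall targets on private data – is where most of the hard work lies.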

Q3 – Helping Enterprises Navigate the Chaos

Two Sigma Ventures Gabriella Garcia: Suddenly everyone became an AI expert in November 2022 (Alex, CEO of Scale AI, calls them AI tourists vs. natives), and tons of non-AI companies are rebranding and selling products that they only began to experiment with yesterday. As a result, enterprises are being overwhelmed with repetitive, undifferentiated marketing (if they’re looking to buy), and new open-source breakthroughs and academic papers (if they’re looking to build). For enterprises that are looking to experiment and integrate AI into their internal orgs or product suites, how would you recommend they begin (what roles to hire for, how to train)?

Many existing companies offer solutions that have been important in enterprise or government contexts. However, with the rise of large language and diffusion models, they are now reformulating their pitch around them. Despite this trend, it is important to note that there are deep challenges with these models, such as hallucinations, ungrounded answers, and gender and racial bias. This is impressive technology, but it is not limitless, and it’s important to keep that in mind. Just because a model can provide answers in a compelling way or create great poems, it doesn’t mean it can fully replace your marketing personnel. We need to be realistic about our current capabilities. These are remarkable tools and we should be enthusiastic about what we can achieve, but we also need to acknowledge their limitations.

In an enterprise context, it is crucial to evaluate who has actually done the work of AI development, and who is just a thin layer on top of GPT-4 advertising that they do AI. To determine this, ask questions such as: Have you trained a model? Have you trained a model from scratch? How big was that model? Can I use that model, interact with it, and experiment with it?

Companies that simply take a base model, such as a Llama-based model, and apply it in the enterprise context without evaluating how those models perform along the dimensions that matter for their domain are setting themselves up for failure. Especially in the finance and healthcare sectors, GPT-4 is not enough; you can’t just send all your customers’ data off to it. It is important to have a concrete understanding of the technical artifacts of the AI and what they can actually do.

In the future, we may look back at this period and realize that models without several rounds of RLHF and human alignment are not suitable for human consumption.

To successfully implement data science or machine learning within an organization, it’s crucial to bring together the people who can build or understand these technologies with those who know the business needs. This means that both the data science team and individuals from the business side must collaborate to create a successful outcome. By doing so, the team can ensure that the solutions created by the data science team meet the needs of the business, and that they are usable and feasible.

It’s recommended to come up with a list of use cases and take a portfolio approach, trying out each one for two to three weeks. By iterating through use cases, organizations can quickly determine what works and what doesn’t within their existing technical infrastructure, and adjust their approach accordingly. This can help companies avoid spending years on a project that may not yield any results. Etsy executed this approach successfully, building a model for image search in just three weeks – it started as a hackathon project and went into production.

Audience Questions

Other than compute, what costs should enterprises budget for?

There is much discussion about the high cost of training large language models, including the cost of data. In fact, data is often just as expensive as compute, if not more so. This is especially true for specialized datasets that enable these models to be intelligent. While some data costs may decrease over time, it is equally likely that the opposite will occur, as data owners recognize the value of their datasets and charge more. For example, some companies pay $50 to $100 million for access to highly valuable datasets. Collecting data for reinforcement learning from human feedback, instruction tuning, or fine-tuning can cost tens of dollars per example, and hundreds of thousands of examples may be required. Although this is quite expensive, it is essential for building good models. Larger companies may spend tens of millions of dollars on data alone.
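A quick sanity check on those numbers (the low and high ends are my assumptions, not figures quoted by the panel):

```python
# "Tens of dollars per example" x "hundreds of thousands of examples",
# bracketed with assumed low and high ends.
low = 10 * 100_000   # $1,000,000
high = 50 * 500_000  # $25,000,000
print(f"${low:,} to ${high:,} on preference/fine-tuning data alone")
```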

What are your thoughts on how the LLM landscape may evolve over time?

More and more individuals are experimenting with a wider variety of language models beyond OpenAI’s powerful and easy-to-build-on models. We are seeing people build their own models, use open source models, or use models from other vendors. Thus, we are likely to see more diversity in the foundation models used in these applications over the next year or two.

Although the model itself may become more of a commodity over time, tasks such as training, fine-tuning, and serving these models will remain challenging. OpenAI’s GPT-4 is not the end-all-be-all, and it is only a matter of time before an open source model surpasses it – a sentiment echoed in the leaked Google “no moat” memo (though it is worth noting that leaked Google memos are generally unpopular within Google).

Is this just another hype cycle?

AI winters have occurred in the past, such as the ones in the seventies and nineties, and the more recent $100 billion winter in self-driving cars. Yet despite the hype, amazing things are coming out of AI development, such as AlphaFold and ChatGPT. Even autonomous vehicles are now driving around San Francisco without incident.

While there may be a correction in hype levels, it’s unlikely to be a full-on AI winter similar to what was seen in the past. There’s simply too much real-world value being delivered, especially in the enterprise where there’s low-hanging fruit to build on with these models.

Although the hype will eventually cool down, the value of AI will remain. Use cases such as identifying threats at home and abroad demonstrate the value of AI’s ability to keep us safer. As AI continues to be applied to more aspects of our daily lives, now is a great time to think about the regulations that should be put in place to govern its use. New York City, for example, recently introduced rules governing the use of AI in hiring decisions, including transparency requirements around how that AI was evaluated. These regulations will set the ground rules for future investment in AI.

Takeaways from our discussion

  1. Generative AI models have transformative potential, but significant challenges must be overcome before we can fully realize their benefits. These models must be approached with caution, and it’s important to understand their strengths and weaknesses and build accordingly. Evaluation and data privacy are among the critical challenges that must be resolved before we can start building for the high-sensitivity nature of enterprise use cases.
  2. Technical bottlenecks and unsolved engineering challenges include the cost of training and inference, sensitivity to changes, and the high cost of generating a response. Industries that require high levels of precision and recall, such as finance and healthcare, require more than just LLMs. Moreover, search is a fundamental requirement for many use cases in an enterprise context, and making search approaches accurate and effective is a significant challenge.
  3. To successfully implement data science or machine learning within an organization, it’s crucial to bring together the people who can build or understand these technologies with those who know the business needs. It’s recommended to come up with a list of use cases and take a portfolio approach, trying out each one for two to three weeks. By iterating through use cases, organizations can quickly determine what works and what doesn’t within their existing technical infrastructure, and adjust their approach accordingly.

If you’re founding a company that will become a key pillar of the language model stack, targets some of these enterprise challenges or is an AI-first application, we would love to meet you. Reach out to Gabriella Garcia at gabriella@twosigmaventures.com or one of our investment team members.

The views expressed herein are solely the views of the author(s), are as of the date they were originally posted, and are not necessarily the views of Two Sigma Ventures, LP or any of its affiliates. They are not intended to provide, and should not be relied upon for, legal, regulatory and/or investment advice, nor is any information herein any offer to buy or sell any security or intended as the basis for the purchase or sale of any investment.  The information herein has not been and will not be updated or otherwise revised to reflect information that subsequently becomes available, or circumstances existing or changes occurring after the date of preparation.  Certain information contained herein is based on published and unpublished sources. The information has not been independently verified by TSV or its representatives, and the accuracy or completeness of such information is not guaranteed. Your linking to or use of any third-party websites is at your own risk. Two Sigma Ventures disclaims any responsibility for the products or services offered or the information contained on any third-party websites.