Cleanlab emerges with $5 million to automate data curation for LLMs and the modern AI stack

Cleanlab, a startup that provides a data curation solution for large language models (LLMs) used in enterprise AI, announced today that it has secured $5 million in seed funding. The investment round was led by Bain Capital Ventures, marking a significant vote of confidence in Cleanlab’s mission to eliminate the “dirty data problem” plaguing the machine learning space.

The startup, founded by Curtis Northcutt, Jonas Mueller and Anish Athalye, has developed an open-source product that identifies, understands and cleans incorrect labels in data. This unique approach promises to dramatically improve the effectiveness of machine learning models, which are often hampered by poor data quality.

“The dirty secret of machine learning is that your model is only as good as your data,” said Northcutt, CEO of Cleanlab, in a recent interview with VentureBeat. “And if you have incorrect labels in your data, which almost everyone does, it can wreak havoc on your model’s performance.”

Northcutt added that data curation is often a manual and tedious process that requires a lot of time and resources from data teams. He said that Cleanlab hopes to automate and simplify this process by using a method he invented during his Ph.D. studies at MIT called “confident learning.”

Confident learning estimates the joint distribution of the given (noisy) labels and the unknown true labels, then uses that estimate to find the examples most likely to be mislabeled. It can also provide a confidence score for each label.

“What we’re doing is we’re building statistical information about what is a typical data point for a given class, and we’re taking into account the distribution of probabilities that a model would output for that class — whether or not what’s given for this example seems statistically relevant and that distribution — and then we build a theoretically grounded model that we can show will give you exact guarantees in terms of label error finding,” Northcutt said.
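The thresholding idea Northcutt describes can be illustrated with a short sketch. This is a simplified illustration under stated assumptions, not Cleanlab's implementation: it assumes a classifier has already produced out-of-sample predicted probabilities for every example and that every class has at least one labeled example. The function name `find_suspect_labels` and the toy data are hypothetical. For each class, the threshold is the average probability the model assigns to that class among examples carrying that label; an example is flagged when it confidently belongs to a class other than the one it was given.

```python
from typing import List, Tuple


def find_suspect_labels(
    labels: List[int], pred_probs: List[List[float]]
) -> Tuple[List[float], List[List[int]], List[int]]:
    """Simplified confident-learning sketch (hypothetical helper).

    labels:     given (possibly noisy) class index for each example
    pred_probs: model's predicted class probabilities for each example
    Returns per-class thresholds, a count matrix of
    (given label, confident label) pairs, and indices of suspect examples.
    """
    n_classes = len(pred_probs[0])

    # Per-class threshold: mean predicted probability of class k
    # over the examples whose given label is k.
    thresholds = []
    for k in range(n_classes):
        probs_k = [p[k] for y, p in zip(labels, pred_probs) if y == k]
        thresholds.append(sum(probs_k) / len(probs_k))

    # counts[i][j] = number of examples labeled i that confidently look like j.
    counts = [[0] * n_classes for _ in range(n_classes)]
    suspects = []
    for idx, (y, p) in enumerate(zip(labels, pred_probs)):
        # Classes whose probability clears that class's own threshold.
        candidates = [k for k in range(n_classes) if p[k] >= thresholds[k]]
        if not candidates:
            continue  # not confidently any class; leave it uncounted
        confident = max(candidates, key=lambda k: p[k])
        counts[y][confident] += 1
        if confident != y:
            suspects.append(idx)
    return thresholds, counts, suspects


# Toy data: example 2 is labeled class 0 but the model strongly predicts class 1.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.10, 0.90],
    [0.15, 0.85],
    [0.10, 0.90],
    [0.40, 0.60],
]
thresholds, counts, suspects = find_suspect_labels(labels, pred_probs)
print(suspects)  # [2]
print(counts)    # [[2, 1], [0, 2]]
```

Normalizing the count matrix gives a rough estimate of the joint distribution of given and true labels. The sketch omits the calibration and ranking steps a full method would apply, so its output should be read as a list of candidates for review rather than confirmed errors.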

A new dawn for data quality

Northcutt said that Cleanlab offers two products: Cleanlab Open Source and Cleanlab Studio. Cleanlab Open Source is a free and open-source Python library that anyone can use to apply confident learning to their datasets. Cleanlab Studio is a cloud-based SaaS product that provides a user-friendly interface and advanced features for data curation. Cleanlab Studio also integrates with popular LLM frameworks and platforms, such as Hugging Face Transformers, Google Cloud AI Platform, Amazon SageMaker, Microsoft Azure Machine Learning and IBM Watson.

Northcutt said that Cleanlab has already attracted more than 10,000 users for its open-source project, and more than 100 customers for its cloud product. He said that the customers include Fortune 500 companies, government agencies, research institutions, and startups from various domains and industries, such as ecommerce, healthcare, social media, education, entertainment and finance.

Northcutt said that Cleanlab plans to use the new funding to expand its team, scale its product development and grow its customer base. He said that he is excited to partner with Bain Capital Ventures, which has a strong track record of investing in AI startups.

A sign of rising investor confidence in data-centric AI solutions

Bain Capital Ventures partner Aaref Hilaly and principal Rak Garg said that they were impressed by Cleanlab’s team, technology and vision, and that they believe Cleanlab is solving a huge and underserved problem in the enterprise AI space.

“Cleanlab is the leading solution for data curation for LLMs, which is a huge unaddressed need in the enterprise. Data curation is essential for model performance and reliability, and offers users more control and an easier-to-adopt product through open source. We are very excited to back Curtis and his co-founders Jonas and Anish, who have built an amazing product and community around confident learning,” Hilaly said.

Garg added that Cleanlab is part of a broader emphasis on artificial intelligence at Bain Capital Ventures, which invests in both foundation models and the infrastructure around them. He said that Cleanlab is one of several AI startups that Bain has invested in this year, alongside Contextual AI, Evenup and Unstructured.

“We are very active investors in AI, and we are always looking for technical founders and engineers who can build innovative AI solutions. We have a strong focus on early stage, as evidenced by BCV Labs, our AI incubator in Palo Alto, where we support and co-create with talented AI entrepreneurs. We also have a multistage approach, where we can help our portfolio companies with their go-to-market, talent and scaling challenges,” Garg said.

Shaping the future of enterprise LLMs

Cleanlab is one of many emerging startups that are tapping into the growing demand for enterprise AI solutions, especially for LLMs. According to a recent Gartner report, 69% of routine work currently done by managers will be fully automated by 2024, which would likely involve the use of LLMs for tasks such as scheduling, reporting and decision-making. One of the biggest hurdles to the adoption and deployment of LLMs in the enterprise is data quality and data curation.

Cleanlab’s data curation solution can help enterprises overcome these challenges and unlock the full potential of LLMs for various use cases and applications. By using Cleanlab, enterprises can improve the quality and reliability of their datasets and models, reduce the time and cost of data curation and ensure the ethical and responsible use of LLMs. Cleanlab can also help enterprises gain a competitive edge and create value from their data assets.
