Databricks unifies data science and engineering with a federated data mesh



During its online Data + AI Summit conference, Databricks today unveiled Databricks Machine Learning, a platform that lets data science teams build AI models based on the AutoML framework.

The offering follows yesterday’s launch of an open source Delta Sharing project that lets organizations employ a protocol to securely share data across disparate data warehouses and data lakes in real time.

The Delta Sharing project has been donated to the Linux Foundation and is being incorporated into Delta Lake, an open source data lake platform Databricks previously made available. Organizations that have pledged to support the Delta Sharing project include Nasdaq, ICE, S&P, Precisely, FactSet, Foursquare, SafeGraph, Amazon Web Services (AWS), Microsoft, Google Cloud, and Tableau.

Delta Sharing can already be applied to share data across Azure Data Share, Azure Purview, BigQuery, AtScale, Collibra, Dremio, Immuta, Looker, Privacera, Qlik, Power BI, and Tableau platforms.

A slew of updates

Delta Sharing will in effect establish a common standard for sharing all data types using an open protocol that can be invoked using SQL, visual analytics tools, and programming languages such as Python and R. Delta Sharing also enables data stored in Apache Parquet and Delta Lake formats to be shared in real time without employing copy management tools. And it provides built-in security controls and permissions to address privacy and compliance requirements.
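As a rough illustration of what invoking the protocol from Python might look like, here is a minimal sketch that assumes the open source `delta-sharing` Python connector is installed and that a data provider has supplied a profile file. The profile path and the share, schema, and table names are hypothetical placeholders, not values from Databricks.

```python
def table_url(profile_path, share, schema, table):
    """Build a table address in the '<profile>#<share>.<schema>.<table>' form."""
    return f"{profile_path}#{share}.{schema}.{table}"


def load_shared_table(profile_path, share, schema, table):
    """Read a shared table into a pandas DataFrame.

    Assumes the `delta-sharing` package is installed and that the profile
    file points at a reachable sharing server.
    """
    import delta_sharing  # imported lazily so the helper above works without it

    return delta_sharing.load_as_pandas(
        table_url(profile_path, share, schema, table)
    )


# Hypothetical names, for illustration only.
url = table_url("config.share", "sales_share", "retail", "orders")
print(url)
```

Calling `load_shared_table` with the same arguments would fetch the table directly from the provider, without first copying it into a local warehouse.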

This week, Databricks also unveiled Unity Catalog, a unified data catalog for Delta Lake that makes it easier to discover and apply controls at a more granular level in order to govern data assets using capabilities enabled in part by Delta Sharing. Alation, Collibra, Immuta, and Privacera have pledged to support Unity Catalog.

Finally, Databricks has added a cloud service dubbed Delta Live Tables to simplify the development and management of data pipelines using a set of simpler extract, transform, and load (ETL) capabilities that automate that process. Delta Live Tables abstracts away the low-level instructions data engineers previously had to code, which reduces the opportunity for errors. Delta Live Tables then automatically creates the instructions for both the data transformations and the data validations, as well as implementing error handling. Any downstream dependencies are automatically re-executed whenever a table is modified. Delta Live Tables will also make it simpler to identify the root cause of errors and restart pipelines when necessary.
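The general idea of a declarative pipeline, in which tables declare their inputs and validation rules and a scheduler reruns whatever sits downstream of a change, can be sketched in plain Python. This is a toy model using only the standard library, not the Delta Live Tables API; the `Pipeline` class, its methods, and the table names are all hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class Pipeline:
    """Toy declarative pipeline: tables declare dependencies and a row check."""

    def __init__(self):
        self.tables = {}  # name -> (function, dependency names, row check)
        self.data = {}    # name -> list of rows from the latest run

    def table(self, name, deps=(), expect=None):
        """Decorator that registers a table definition."""
        def register(func):
            self.tables[name] = (func, tuple(deps), expect)
            return func
        return register

    def downstream_of(self, name):
        """Return `name` plus every table that depends on it, directly or not."""
        affected = {name}
        changed = True
        while changed:
            changed = False
            for table, (_, deps, _) in self.tables.items():
                if table not in affected and affected.intersection(deps):
                    affected.add(table)
                    changed = True
        return affected

    def refresh(self, modified=None):
        """Recompute all tables, or only those downstream of `modified`."""
        graph = {t: set(deps) for t, (_, deps, _) in self.tables.items()}
        order = TopologicalSorter(graph).static_order()
        targets = set(self.tables) if modified is None else self.downstream_of(modified)
        ran = []
        for name in order:
            if name not in targets:
                continue
            func, deps, expect = self.tables[name]
            rows = func(*(self.data[d] for d in deps))
            if expect is not None:
                rows = [r for r in rows if expect(r)]  # drop rows failing validation
            self.data[name] = rows
            ran.append(name)
        return ran


pipeline = Pipeline()


@pipeline.table("raw_orders")
def raw_orders():
    return [{"id": 1, "amount": 20.0}, {"id": 2, "amount": -5.0}, {"id": 3, "amount": 7.5}]


@pipeline.table("clean_orders", deps=["raw_orders"], expect=lambda r: r["amount"] > 0)
def clean_orders(raw):
    return list(raw)


@pipeline.table("order_total", deps=["clean_orders"])
def order_total(clean):
    return [{"total": sum(r["amount"] for r in clean)}]


first_run = pipeline.refresh()                      # runs every table in order
rerun = pipeline.refresh(modified="clean_orders")   # reruns only the affected tables
```

In this sketch the engineer writes only the table definitions; the execution order, the validation filter, and the choice of which tables to rerun after a change are derived from the declarations.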

Unity

Collectively, these offerings are part of an effort to unify data science and data engineering and reduce friction by creating a federated data mesh, Databricks CEO Ali Ghodsi told VentureBeat. As part of that effort, Delta Sharing provides a standard mechanism through which data can be migrated from legacy platforms into any data lake without requiring data engineers to employ cumbersome processes using copy data management tools, Ghodsi noted.

IT organizations can also choose to employ specific data lakes and warehouses as they see fit instead of being forced to standardize on a single platform just to simplify sharing and accessing data, Ghodsi noted. That’s especially critical for organizations that need to share data with other entities because the odds that those organizations will have adopted the same data warehouse or data lake are slim to none, Ghodsi added. “The data is always going to be out of sync,” he said.

AI models constructed with data aggregated using these tools are also extensible, which Ghodsi said is a capability unique to the Databricks platform. That approach enables organizations to construct and train AI models with either a user interface or application programming interface (API) in a way that allows experiments to be automatically generated without compromising transparency, Ghodsi added.

Those AutoML experiments are integrated with the rest of the Databricks Lakehouse Platform, which makes it possible to employ open source MLflow software Databricks has developed to track parameters, metrics, artifacts, and even entire models, Ghodsi said. The Databricks Machine Learning platform is designed from the ground up to unify best data science and engineering practices, Ghodsi added.

It remains to be seen to what degree data science and engineering will converge, but today most organizations find data management the biggest obstacle to implementing AI.
