To further strengthen our commitment to providing industry-leading data technology coverage, VentureBeat is pleased to welcome Andrew Brust and Tony Baer as regular contributors. Check out their articles in the Data Pipeline.
Summer has barely started, but MongoDB World and Snowflake Summit have already come and gone, even if the ink is still drying on the announcements made at each event. With the Data + AI Summit kicking off today as a hybrid virtual/in-person event in San Francisco, Databricks is wasting no time responding, with a huge slate of announcements of its own.
Databricks co-founder and chief technologist (and creator of Apache Spark) Matei Zaharia briefed VentureBeat on all the announcements. They fall into two groups: improvements to the open-source technologies underlying the Databricks platform, such as Apache Spark, on the one hand, and enhancements, previews, and general availability (GA) releases for the proprietary Databricks platform on the other.
In this post I will cover the full range of announcements. There’s a lot here, so feel free to use the subheadings as a sort of random-access interface to read the bits you care about most, then come back and read the rest when you have time.
Spark Streaming goes Lightspeed
Since Spark and its companion open source projects have become de facto industry standards, I want to start with the announcements in that area. First, for Spark itself, Databricks is making two roadmap announcements, covering streaming data processing and connectivity for Spark client applications. Spark Streaming has been a subproject of Spark for many years, and its last major enhancement, a technology called Spark Structured Streaming, was GA’d five years ago. In effect, streaming data processing on Spark had been languishing, a fact that proponents of competing platforms began to exploit.
In Zaharia’s words, “We didn’t have a very big streaming team, you know, after building the Spark streaming APIs in the company’s first three or four years.” He added, “We just kind of kept that up for a bit and we found it to be one of the fastest growing workloads on our platform; it’s growing faster than the rest.”
The realization that Spark Streaming needed some love has resulted in an umbrella effort Databricks calls Project Lightspeed, which aims to create a next-generation implementation of Spark Streaming. Databricks says Lightspeed will bring performance and latency improvements to streaming data processing; add new functionality such as advanced windowing and pattern recognition; and make streaming easier to work with in general.
Databricks has formed a new streaming team to drive Lightspeed and recently hired Karthik Ramasamy, formerly of Twitter and co-creator of Apache Pulsar, to lead it. Databricks also recently recruited Alex Balikov from Google Cloud and appointed him as a senior tech lead on the streaming team. Now let’s wait and see if processing streaming data on Spark can become relatively manageable for the average developer.
REST access
Speaking of developers, another Spark roadmap announcement involves something called Spark Connect, which will essentially implement a REST API for Spark, both for operational tasks (such as submitting jobs and retrieving results) and management tasks (such as sizing and load balancing Spark clusters, or scheduling jobs). This removes the hard requirement for client libraries tied to a specific programming language and version, and lets application developers take a more loosely coupled approach to working with Spark, using only HTTP.
Delta Lake opens
Sticking with open source announcements, but moving beyond Apache Spark proper, brings us to two related projects, both hosted by the Linux Foundation: Delta Lake and MLflow. Delta Lake is one of three popular technologies for bringing data warehouse-like functionality to data lakes stored in open storage formats such as Apache Parquet. Delta Lake appeared to have the lead, but rival format Apache Iceberg has recently surged forward, with adoption by companies like Dremio, Cloudera, and Snowflake. One of the main criticisms of Delta Lake has been that Databricks has kept too tight a grip on it, mixing the open-source file format with Databricks-proprietary features such as time travel (which allows examining previous states of a dataset).
Perhaps in response to that criticism, Databricks is announcing Delta Lake 2.0 today. The new version brings both performance improvements and more openness. Specifically, Databricks says it’s contributing all of Delta Lake to the Linux Foundation’s open-source project so that all users of the format can work with the same codebase and access all of its features.
MLflow, part deux
The open source project MLflow is the backbone of Databricks’ MLOps capabilities. While proprietary components exist, including the Databricks feature store, the MLflow-based functionality includes the execution and management of machine learning experiments, as well as a model repository with versioning. Today Databricks announced MLflow 2.0, which will add a major new feature called Pipelines. Pipelines are templates for building ML applications so that everything is ready for productionization, monitoring, testing, and deployment. The templates, which are based on code files and Git-based version control, are customizable and allow for the insertion of monitoring hooks. Although Pipelines are based on source code files, developers can interact with them from within notebooks, which provides a great deal of flexibility. The addition of Pipelines should be a boon to the industry, as many companies, including all three major cloud providers, have either adopted MLflow as a standard or documented how to use it with their platforms.
Databricks SQL is coming of age
There’s a lot happening on the proprietary side, too. For starters, Databricks SQL’s Photon engine, which brings query optimization and other data warehouse-like features to the Databricks platform, will be released to GA in July. Photon has recently gained significant enhancements, including support for nested data types and accelerated sorting.
In addition, Databricks is releasing several open source connectors for Databricks SQL, for languages including Node.js, Python, and Go. Databricks SQL is also getting its own command line interface (CLI) and a query federation feature, which allows it to join tables/datasets from different sources in the same query. The latter feature builds on Spark’s native ability to query multiple data sources.
An interesting aspect of Databricks SQL is that it supports different cluster types from those available for other Databricks workloads. These special clusters, called SQL warehouses (and previously referred to as SQL endpoints), are “T-shirt-sized” and feature cloud server instances optimized for business intelligence-style queries. Now, however, a new option is launching in preview on AWS: Databricks SQL Serverless, which will let customers query their data through Databricks SQL without creating a cluster at all.
Delta Live Tables
Want more? Delta Live Tables, the Databricks platform’s SQL-based declarative facility for ETL and data pipelines, will receive several enhancements, including new performance optimizations, Enhanced Autoscaling, and change data capture (CDC). The CDC support makes the platform compatible with slowly changing dimensions, allowing them to be updated incrementally, rather than rebuilt from scratch, when dimensional hierarchies change.
The latter is important: it ensures that analytical queries stay correct when, for example, a particular branch office is reclassified as belonging to a different regional division. Queries covering a period when the office was in the original division will attribute its sales to that division; queries covering a later period will attribute its sales to the new division; and queries spanning both periods will allocate the correct sales amounts to each of the respective divisions.
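To make the branch reclassification scenario concrete, here is a minimal sketch in plain Python of how a "type 2" slowly changing dimension keeps historical attributions intact. It is not Delta Live Tables code and does not reflect any Databricks API; the branch name, divisions, dates, and sales figures are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from datetime import date

# Type 2 slowly changing dimension: one row per (branch, validity period).
# A reclassification closes the old row and appends a new one; history is kept.
# All names and figures below are hypothetical.
BRANCH_DIM = [
    {"branch": "Springfield", "division": "East",
     "valid_from": date(2021, 1, 1), "valid_to": date(2022, 1, 1)},
    {"branch": "Springfield", "division": "Central",
     "valid_from": date(2022, 1, 1), "valid_to": None},  # None = still current
]

SALES = [
    {"branch": "Springfield", "sold_on": date(2021, 6, 15), "amount": 100},
    {"branch": "Springfield", "sold_on": date(2021, 11, 2), "amount": 50},
    {"branch": "Springfield", "sold_on": date(2022, 3, 9), "amount": 70},
]


def division_on(branch, day):
    """Return the division the branch belonged to on the given day."""
    for row in BRANCH_DIM:
        ends = row["valid_to"]
        if (row["branch"] == branch
                and row["valid_from"] <= day
                and (ends is None or day < ends)):
            return row["division"]
    raise LookupError(f"no dimension row for {branch} on {day}")


def sales_by_division(start, end):
    """Total sales per division for sales dated in the range [start, end)."""
    totals = defaultdict(int)
    for sale in SALES:
        if start <= sale["sold_on"] < end:
            # Attribute each sale to the division valid on the sale date.
            totals[division_on(sale["branch"], sale["sold_on"])] += sale["amount"]
    return dict(totals)


if __name__ == "__main__":
    print(sales_by_division(date(2021, 1, 1), date(2022, 1, 1)))
    print(sales_by_division(date(2022, 1, 1), date(2023, 1, 1)))
    print(sales_by_division(date(2021, 1, 1), date(2023, 1, 1)))
```

Because the old dimension row is closed rather than overwritten, a query over 2021 credits the branch’s sales to its original division, a query over 2022 credits them to the new one, and a query over both years splits them correctly.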
Catalog, cleanrooms and marketplace
Databricks Unity Catalog will be released to GA later this summer, complete with recently added lineage capabilities. A new “Data Cleanrooms” feature allows queries involving data from two different parties to run in the cloud without either party having to send its data to the other. Instead, each party’s data is placed in a sort of digital escrow and, provided both parties agree, jobs involving both parties’ data are executed in Databricks’ cloud, from which the data is subsequently deleted.
Finally, Databricks is launching its own marketplace, but with a few differences from typical data marketplace offerings. For starters, Databricks Marketplace offerings can consist of complete solutions, including applications and examples, rather than just datasets. And because the product is based on Delta Sharing, Databricks says it can be used by customers who don’t use the Databricks platform itself.
Where this leads us
As the data and analytics space consolidates and a new generation of leaders emerges, the competition is getting fierce. Customers stand to benefit as the major players encroach on each other’s territory, all of them looking to handle analytics, operational, streaming, data engineering, and machine learning workloads in a multicloud fashion. Databricks has doubled down on its investment in some of these areas and expanded its investment in others. What’s especially nice is the knock-on effect this has on several open source projects, including Spark, Delta Lake, and MLflow.
Will Databricks eventually allow individual clusters to span multiple clouds, or even turn its attention to on-premises environments? Will Delta Lake or Apache Iceberg emerge as the default storage technology for lakehouses? Will the Databricks feature store component become open source, to round out MLflow’s appeal relative to commercial MLOps platforms? Will Databricks SQL Serverless beat Amazon Athena’s business franchise? Keep an eye on this space. Customers will place their bets in the coming years as the lakehouse’s standard-bearers build momentum and stake out their territory.