LakeFS brings branching to the data lake


Can businesses find a better way to organize a constant onslaught of data? LakeFS thinks it has the answer: Git-style version control. LakeFS lets teams create and track different versions of their data, essentially mimicking the process developers use to organize their code.

On June 27, the company announced the general availability of its service, LakeFS Cloud. Teams can use it to track how their data evolves across versions, just as they do with their code.

“LakeFS is really an infrastructure, it’s on top of the data,” explains Einat Orr, co-founder and CEO of LakeFS. “This is the interface between the data lake and the application, so any application can enjoy the Git-like operations that LakeFS provides, and data is managed through one consistent interface in the organization.”

For a long time, developers have treated software and data differently. Programmers created versioning systems like Git that help organize software development by tracking changes both small and large. Teams rely on these tools to keep the work of different programmers separate until it is merged into a final version and shipped. Software teams routinely juggle dozens, hundreds, or even thousands of different versions arranged in a figurative tree of branches.

Data, however, is usually stored in separate chunks. Developers often make full copies as backups or snapshots taken at different times. The differences between them are difficult to track down, and the proliferation of copies creates confusion and large storage bills.

“The cloud didn’t warn us that the data would get cloudy. The benefits of infinite storage quickly deteriorated into an unmanageable mess, so we needed technology like LakeFS to regain control of our data,” explained Sivan Bercovici, CTO of Karius, a medical diagnostics company whose testing products draw on artificial intelligence research and data collection.

LakeFS: Systems and Services

LakeFS is designed to work with object stores such as S3 and with data management systems such as Snowflake and BigQuery. The service provides a single interface for storage and retrieval, passing data to back-end services such as AWS while keeping track of the current branch. LakeFS anticipates that teams may work with several different storage providers. A demonstration playground lets users try it out without installing any code.

The system helps teams by tracking different branches and merging them only when needed. Developers can start experimenting with a new feature by creating a branch of the main dataset that is currently in production. There is no need to make a complete copy for testing, and the changes introduced by the experiment stay in a separate branch that doesn’t affect the main production version.
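
The copy-free branching described here can be illustrated with a toy in-memory model. This is a minimal sketch, not LakeFS's actual implementation or API: the class `ToyRepo` and its methods are hypothetical names invented for illustration, and the sketch assumes a branch is simply a mapping from paths to content hashes.

```python
import hashlib


class ToyRepo:
    """Toy model of copy-free branching (hypothetical, not the LakeFS API)."""

    def __init__(self):
        self.objects = {}             # content hash -> bytes, each stored once
        self.branches = {"main": {}}  # branch name -> {path: content hash}

    def put(self, branch, path, data):
        # Store the bytes under their hash, then point the branch at them.
        digest = hashlib.sha256(data).hexdigest()
        self.objects[digest] = data
        self.branches[branch][path] = digest

    def get(self, branch, path):
        return self.objects[self.branches[branch][path]]

    def create_branch(self, name, source="main"):
        # Only the path-to-hash mapping is copied; no object data is duplicated.
        self.branches[name] = dict(self.branches[source])


repo = ToyRepo()
repo.put("main", "events/day1.csv", b"id,value\n1,10\n")
repo.create_branch("experiment")
repo.put("experiment", "events/day1.csv", b"id,value\n1,99\n")
```

After these calls, reading the file from `main` still returns the original bytes, while the `experiment` branch sees the modified file. The store holds only the two distinct objects rather than two full copies of the dataset.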

“It’s very easy to keep copies in S3 that cause confusion and that no one deletes for years,” Orr said. “With LakeFS, we have the transparency needed for proper data management. We know when a branch isn’t being used, so we can tie retention to our business needs, and we know when a file is no longer pointed to by any LakeFS branch.”
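
The retention idea, knowing which stored files no branch points to, can be sketched as a simple set difference. This is an illustrative assumption about how such a check might look, not LakeFS's retention mechanism; the function name `unreferenced_objects` and the sample hashes are hypothetical.

```python
def unreferenced_objects(stored_hashes, branches):
    """Return hashes that no branch points to (candidates for deletion)."""
    referenced = set()
    for mapping in branches.values():
        referenced.update(mapping.values())
    return set(stored_hashes) - referenced


# Hypothetical state: three stored objects, two branches.
stored = {"a1", "b2", "c3"}
branches = {
    "main": {"events/day1.csv": "a1"},
    "experiment": {"events/day1.csv": "b2"},
}
candidates = unreferenced_objects(stored, branches)

# Deleting the experiment branch makes its private object a candidate too.
after_delete = unreferenced_objects(stored, {"main": branches["main"]})
```

In this toy state, only the object that no branch references is a deletion candidate. Once the `experiment` branch is removed, the object it alone referenced becomes one as well.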

LakeFS gives developers the option to create branches and then merge or delete them as needed. It also provides webhooks, which let teams integrate these operations with the pipelines they already use for continuous integration and deployment.
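
Webhook integration of this kind usually means that an external service receives a description of a pending operation and answers whether it may proceed. The following is a minimal sketch of such a check. The payload field names (`event_type`, `changed_paths`) and the file-type policy are hypothetical and chosen for illustration; they are not taken from LakeFS's actual payload format.

```python
def review_merge(event, allowed_suffixes=(".parquet", ".csv")):
    """Decide whether a merge may proceed (hypothetical payload and policy)."""
    if event.get("event_type") != "pre-merge":
        return True, "not a merge event; nothing to check"
    # Collect changed paths whose extension is not on the allow list.
    bad = [p for p in event.get("changed_paths", [])
           if not p.endswith(allowed_suffixes)]
    if bad:
        return False, "unexpected file types: " + ", ".join(sorted(bad))
    return True, "ok"


good_event = {"event_type": "pre-merge",
              "changed_paths": ["events/day1.parquet"]}
bad_event = {"event_type": "pre-merge",
             "changed_paths": ["events/day1.parquet", "tmp/scratch.txt"]}

good_result = review_merge(good_event)
bad_result = review_merge(bad_event)
```

A real deployment would wrap a function like this in an HTTP handler and map a rejection to an error response, so the merge is blocked before unvetted files reach the production branch.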

“Since we introduced LakeFS into our production data environment, we have enjoyed the benefits of atomic and isolated operations in our data pipeline, which allows us to spend more time improving other aspects of our data platform and less time dealing with the fallout from race conditions and partially failed operations,” said Lior Resisi, Windward’s data platform team leader.

Data lake competitors

Several other database companies are beginning to develop similar approaches. For example, PlanetScale and Neon both offer the ability to branch or fork data stored on systems built around open source databases such as MySQL and PostgreSQL. Both launched recently and focus on providing the same database interface that developers have grown accustomed to over the years.

LakeFS, by contrast, is designed to work at a lower level, on top of any object storage. Its API accepts calls for blocks of data stored in buckets. Branch information is stored alongside the data as metadata and is used to merge or delete objects as needed.
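
To show how branch metadata alone can drive a merge, here is a minimal sketch of a three-way merge over path-to-hash mappings. It assumes each branch is a dictionary of paths to content hashes and that the common ancestor (`base`) is known. The function `merge_branch` is a hypothetical illustration, not LakeFS's merge algorithm.

```python
def merge_branch(base, source, target):
    """Three-way merge of {path: hash} maps; returns (merged, conflicts)."""
    merged = dict(target)
    conflicts = []
    for path in set(base) | set(source) | set(target):
        b, s, t = base.get(path), source.get(path), target.get(path)
        if s == b or s == t:
            continue  # source left this path alone, or both sides agree
        if t == b:
            # Only the source changed this path: take its version.
            if s is None:
                merged.pop(path, None)  # the source deleted the file
            else:
                merged[path] = s
        else:
            conflicts.append(path)  # both sides changed it differently
    return merged, sorted(conflicts)


base = {"a.csv": "h1", "b.csv": "h2"}
source = {"a.csv": "h1x", "b.csv": "h2", "c.csv": "h3"}  # experiment branch
target = {"a.csv": "h1", "b.csv": "h2y"}                 # main branch
merged, conflicts = merge_branch(base, source, target)
```

No object data is read or copied during the merge. Only the pointers move, which is why branching and merging can stay cheap even for very large datasets.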

“I think it’s important to emphasize that we are format agnostic and very complementary to open table formats such as Delta Lake and Iceberg,” Orr explains. Being format agnostic lets developers work with larger and more diverse datasets that are often distributed across different products and silos.

The company also promises to extend the interface to work with other storage options. It imagines LakeFS becoming a common API for developers, with the time saved and the storage fees avoided on extra copies justifying the additional cost.

“That’s our vision,” Orr says. “In the end, it will work with all the data sources you own, not just object storage.”

The product began as an open source project sponsored by Treeverse, an American company founded by Orr and Oz Katz in 2020. Investors include Dell Technologies Capital, Norwest Venture Partners and Zeev Ventures.
