Build reliable production data and ML pipelines with Git support for Databricks Workflows

We are happy to announce native support for Git in Databricks Workflows, which enables our customers to build reliable production data and ML workflows using modern software engineering best practices. Customers can now use a remote Git reference as the source for tasks that make up a Databricks Workflow; for example, a notebook task can run a notebook from the main branch of a repository on GitHub. By using Git as the source of truth, customers eliminate the risk of accidental edits to production code. They also remove the overhead of maintaining a production copy of the code in Databricks and keeping it updated, and they improve reproducibility, as each job run is tied to a commit hash. Git support for Workflows is available in Public Preview and works with a wide range of Databricks-supported Git providers, including GitHub, GitLab, Bitbucket, Azure DevOps and AWS CodeCommit.

Customers have asked us for ways to harden their production deployments by allowing only peer-reviewed and tested code to run in production. They have also asked for ways to simplify automation and improve the reproducibility of their workflows. Git support in Databricks Workflows has already helped numerous customers achieve these goals.

"Being able to tie jobs to a specific Git repo and branch has been super valuable. It has allowed us to harden our deployment process, instill more safeguards around what gets into production, and prevent accidental edits to prod jobs. We can now track each change that hits a job through the related Git commits and PRs." - said Chrissy Bernardo, Lead Data Scientist at Disney Streaming

"We used the Databricks Terraform provider to define jobs with a git source. This feature simplified our CI/CD setup, replacing our previous mix of python scripts and Terraform code and relieved us of managing the 'production' copy. It also encourages good practices of using Git as a source for notebooks, which guarantees atomic changes of a collection of related notebooks" - said Edmondo Procu, Head of Engineering at Sapient Bio.

"Repos are now the gold standard for our mission critical pipelines. Our teams can efficiently develop in the familiar, rich notebook experience Databricks offers and can confidently deploy pipeline changes with Github as our source of truth - dramatically simplifying CI/CD. It is also straightforward to set up ETL workflows referencing Github artifacts without leaving the Databricks UI." - says Anup Segu, Senior Software Engineer at YipitData

"We were able to reduce the complexity of our production deployments by a third. No more needing to keep a dedicated production copy and having a CD system, invoke APIs to update it." - says Arash Parnia, Senior Data Scientist at Warner Music Group

Getting started

It takes just a few minutes to get started:

  1. First, add your Git provider personal access token (PAT) to Databricks. This can be done in the UI via Settings > User Settings > Git Integration, or programmatically via the Databricks Git credentials API.
  2. Next, create a job and specify a remote repository, a Git ref (branch, tag or commit) and the path to the notebook, relative to the root of the repository.
  3. Add more tasks to your job. Once you have added the Git reference, you can use the same reference for other notebook tasks in a job with multiple tasks.
  4. Run the job and view its details.

Every notebook task in the job will fetch the pre-defined commit, branch or tag from the repository on every run. For each run the Git commit SHA is logged, and all notebook tasks in a job are guaranteed to run from the same commit.

Please note that in a multitask job, you can't mix a notebook task that uses a notebook in the Databricks Workspace or Repos with another task that uses a remote repository. This restriction doesn't apply to non-notebook tasks.

These actions can also be performed via v2.1 and v2.0 of the Jobs API, as in the sketch below.
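
For example, here is a minimal sketch of creating such a job through the Jobs API 2.1 with Python and the requests library. The workspace URL, token, repository, notebook path and cluster settings are placeholders, and the payload reflects the public git_source and notebook_task fields as we understand them rather than a definitive recipe.

```python
# Minimal sketch: create a job whose notebook task runs from a Git repository
# using the Jobs API 2.1. Workspace URL, token, repo URL and cluster settings
# are placeholders -- adjust them for your environment.
import os
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"   # placeholder
TOKEN = os.environ["DATABRICKS_TOKEN"]                    # personal access token

job_spec = {
    "name": "nightly-etl-from-git",
    # The remote repository and ref every notebook task in this job will use.
    "git_source": {
        "git_url": "https://github.com/my-org/my-repo",   # placeholder repo
        "git_provider": "gitHub",                          # e.g. gitHub, gitLab, bitbucketCloud
        "git_branch": "main",                              # or git_tag / git_commit
    },
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {
                # Path is relative to the repository root when source is GIT.
                "notebook_path": "notebooks/ingest",
                "source": "GIT",
            },
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",       # placeholder
                "node_type_id": "i3.xlarge",                # placeholder
                "num_workers": 2,
            },
        }
    ],
}

resp = requests.post(
    f"{HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
resp.raise_for_status()
print("Created job:", resp.json()["job_id"])
```

Pinning git_branch runs the latest commit on that branch at execution time; pinning git_tag or git_commit instead makes every run fully reproducible.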

All Databricks notebook tasks in the job run from the same Git commit. For each run, the commit is logged and visible in the UI. You can also retrieve this information from the Jobs API, as in the sketch below.
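
As a minimal sketch (reusing the placeholder workspace URL and token from above), you could look up a run and print its Git details like this; the exact fields describing the resolved commit may vary, so this simply prints the Git-related portion of the response.

```python
# Minimal sketch: inspect which Git source a job run used, via the Jobs API
# 2.1 "runs/get" endpoint. HOST, TOKEN and run_id are placeholders.
import os
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = os.environ["DATABRICKS_TOKEN"]                   # personal access token

run_id = 123456  # placeholder: the run ID shown in the job run details UI

resp = requests.get(
    f"{HOST}/api/2.1/jobs/runs/get",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"run_id": run_id},
)
resp.raise_for_status()
run = resp.json()

# For jobs defined with a Git source, the run object describes the repository
# and ref used for that run.
print(run.get("git_source"))
```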

Ready to get started? Take Git support in Workflows for a spin or dive deeper with the resources below:

  • Explore the Databricks Workflows documentation
  • Check out this code sample and the accompanying webinar recording, which walk through an end-to-end notebook production flow using Git support in Databricks Workflows

FAQs

How to create a workflow in Databricks?

Create your first Databricks Workflow
  1. Create a Databricks job.
  2. Create your first task.
  3. Add any additional task configurations.
  4. Repeat for your other tasks.
  5. Define dependencies and control flows.
  6. Define compute clusters for your tasks.

What is the best practice of Databricks Git?

In your user folder in Databricks Git folders, clone your remote repository. A best practice is to create a new feature branch or select a previously created branch for your work, instead of directly committing and pushing changes to the main branch.
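
If you prefer to script this setup, here is a minimal sketch using the Repos API (endpoints and field names as we understand them; the repository URL, workspace path and branch are placeholders, and the feature branch is assumed to already exist in the remote repository).

```python
# Minimal sketch: clone a remote repository into your user folder in
# Databricks Git folders via the Repos API, then check out a feature branch
# instead of working directly on main.
import os
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"   # placeholder
TOKEN = os.environ["DATABRICKS_TOKEN"]
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# Clone the remote repo into your user folder (placeholder path).
repo = requests.post(
    f"{HOST}/api/2.0/repos",
    headers=HEADERS,
    json={
        "url": "https://github.com/my-org/my-repo",       # placeholder
        "provider": "gitHub",
        "path": "/Repos/me@example.com/my-repo",          # placeholder
    },
).json()

# Check out an existing feature branch in the clone.
requests.patch(
    f"{HOST}/api/2.0/repos/{repo['id']}",
    headers=HEADERS,
    json={"branch": "feature/my-change"},                 # placeholder branch
).raise_for_status()
```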

How do you automate Databricks workflows?

To automate the deployment of Databricks workflows, you can use the Databricks REST API and a scripting language such as Python or Bash. The script can create a new workflow and add steps to it, as well as manage existing workflows.
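
For instance, a minimal Python sketch along these lines might trigger an existing job and wait for it to finish. The workspace URL, token and job ID are placeholders; the endpoints used are from the Jobs API 2.1.

```python
# Minimal sketch: trigger an existing job via "run-now" and poll the run
# until it reaches a terminal state.
import os
import time
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"   # placeholder
TOKEN = os.environ["DATABRICKS_TOKEN"]
HEADERS = {"Authorization": f"Bearer {TOKEN}"}
JOB_ID = 123456                                           # placeholder

# Kick off a run of the job.
run_id = requests.post(
    f"{HOST}/api/2.1/jobs/run-now", headers=HEADERS, json={"job_id": JOB_ID}
).json()["run_id"]

# Poll the run until it finishes.
while True:
    state = requests.get(
        f"{HOST}/api/2.1/jobs/runs/get", headers=HEADERS, params={"run_id": run_id}
    ).json()["state"]
    if state["life_cycle_state"] in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
        print("Run finished with result:", state.get("result_state"))
        break
    time.sleep(30)
```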

What are the benefits of using Databricks Workflows for orchestration purposes?

Actionable insights: Because Databricks Workflows is deeply integrated into the platform, you get much deeper monitoring and observability capabilities than with external orchestration tools like Apache Airflow. With Databricks Workflows, users get job metrics and operational metadata for the jobs they execute.

How do I create an ETL pipeline in Databricks?

The tutorial uses a dataset that is available in the sample datasets included in your Azure Databricks workspace; a minimal PySpark sketch of these steps follows the list.
  1. Create a cluster.
  2. Explore the source data.
  3. Ingest the raw data.
  4. Prepare the raw data.
  5. Query the transformed data.
  6. Create an Azure Databricks job to run the pipeline.
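
Here is a minimal, hypothetical sketch of the ingest, prepare and query steps as PySpark you could paste into a notebook. The input path and table name are placeholders, not the tutorial's actual dataset.

```python
# Minimal sketch of the ingest -> prepare -> query pattern from the steps above.
from pyspark.sql import SparkSession, functions as F

# In a Databricks notebook `spark` already exists; getOrCreate() reuses it.
spark = SparkSession.builder.getOrCreate()

# Ingest the raw data (placeholder path).
raw = spark.read.json("/databricks-datasets/<sample-dataset>/")

# Prepare the raw data: drop incomplete rows and stamp the load date.
prepared = raw.dropna().withColumn("ingest_date", F.current_date())
prepared.write.mode("overwrite").saveAsTable("prepared_data")  # placeholder table name

# Query the transformed data.
spark.table("prepared_data").groupBy("ingest_date").count().show()
```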

How to schedule a workflow in Databricks?

To define a schedule for the job in the UI (an equivalent Jobs API sketch follows these steps):
  1. In the sidebar, click Workflows.
  2. In the Name column on the Jobs tab, click the job name.
  3. Click Add trigger in the Job details panel and select Scheduled in Trigger type.
  4. Specify the period, starting time, and time zone.
  5. Click Save.
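
The same schedule can be attached programmatically. Here is a minimal sketch using the Jobs API 2.1 update endpoint; the workspace URL, token, job ID, cron expression and time zone are placeholders.

```python
# Minimal sketch: attach (or replace) a cron schedule on an existing job.
import os
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = os.environ["DATABRICKS_TOKEN"]
JOB_ID = 123456                                          # placeholder

requests.post(
    f"{HOST}/api/2.1/jobs/update",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "job_id": JOB_ID,
        "new_settings": {
            "schedule": {
                # Quartz cron syntax: run every day at 06:00.
                "quartz_cron_expression": "0 0 6 * * ?",
                "timezone_id": "UTC",
                "pause_status": "UNPAUSED",
            }
        },
    },
).raise_for_status()
```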

What are the limits on workflows in Databricks?

A workspace is limited to 1000 concurrent task runs. A 429 Too Many Requests response is returned when you request a run that cannot start immediately. The number of jobs a workspace can create in an hour is limited to 10000 (includes “runs submit”).
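
If your automation may hit these limits, a minimal sketch of retrying on 429 with a backoff could look like the following (placeholders as before; the Retry-After header is honored only if the response includes one).

```python
# Minimal sketch: back off and retry when the workspace returns
# 429 Too Many Requests for a run request.
import os
import time
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"   # placeholder
TOKEN = os.environ["DATABRICKS_TOKEN"]

def run_now_with_backoff(job_id, max_attempts=5):
    for attempt in range(max_attempts):
        resp = requests.post(
            f"{HOST}/api/2.1/jobs/run-now",
            headers={"Authorization": f"Bearer {TOKEN}"},
            json={"job_id": job_id},
        )
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()["run_id"]
        # Honor Retry-After if present, otherwise back off exponentially.
        wait = int(resp.headers.get("Retry-After", 2 ** attempt))
        time.sleep(wait)
    raise RuntimeError("Job run could not be scheduled after retries")
```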

How do I run a file as a workflow on Databricks?

Using the Databricks extension for Visual Studio Code: in Explorer view (View > Explorer), right-click the file, and then select Run File as Workflow on Databricks from the context menu. Alternatively, in the file editor's title bar, click the drop-down arrow next to the play (Run or Debug) icon, and then click Run File as Workflow on Databricks in the drop-down list.

What is the difference between Databricks Workflows and Data Factory?

Here are some key differences. Purpose: ADF is primarily used as a data integration service to perform extract-transform-load (ETL) processes, while Databricks provides a collaborative platform for data engineers and data scientists to perform ETL as well as analytics and machine learning work.

What are the benefits of Databricks Runtime ML?

Ready-to-use and optimized machine learning environment

The Machine Learning Runtime (MLR) provides data scientists and ML practitioners with scalable clusters that include popular frameworks, built-in AutoML and optimizations for unmatched performance.

What is the primary purpose of Databricks?

Databricks is an industry-leading, cloud-based data engineering tool used for processing and transforming massive quantities of data and exploring the data through machine learning models.

How do I create a workflow file?

Creating a starter workflow
  1. If it doesn't already exist, create a new public repository named .github in your organization.
  2. Create a directory named workflow-templates.
  3. Create your new workflow file inside the workflow-templates directory.
  4. Create a metadata file inside the workflow-templates directory.
