DBT for Spark

I have been seeing a lot about DBT this last year. I was doing some research on Delta Live Tables (DLT) and found an article that compares the two, dbt vs Delta Live Tables, so I started thinking about how DBT would fit for me in Databricks. This is an introduction to DBT, how I have used it, and how I see it fitting with Databricks. I have been using this repo for all the examples: https://github.com/pblocz-templates/dbt-databrick-example

What is DBT?

DBT (Data Build Tool) is a popular open-source tool used for building data pipelines in data warehouses. It allows data analysts, engineers, and scientists to transform, test, and deploy data models efficiently.

Compared to traditional SQL projects, DBT provides a number of benefits. First, it allows for modularization and organization of code, making it easier to maintain and scale. Additionally, it includes testing functionality, enabling developers to catch errors early in the development process. Finally, it provides a simple interface for deploying models.

How does it fit with Databricks?

Let's look at the usual Databricks medallion architecture, with Bronze, Silver and Gold layers.

In this architecture, the place where I would put DBT is the Gold layer. In my experience, Data Analysts and BI Engineers are more familiar with SQL and traditional modeling in SQL databases, and struggle with creating notebooks, jobs and versioning their code. DBT should feel more familiar to them while giving a lot of traceability, as well as control and integration with Databricks.

Using this model, Data Engineers would get the data to the Silver layer using standard Databricks jobs, ADF, Delta Live Tables, etc., and from there Data Analysts or BI Engineers would build their queries using the Silver tables as sources.

How to set up a model?

First, setting up DBT is quite simple: installing dbt-databricks and running dbt init <project-name> will create the starter project. There is documentation that you can follow for it:

Connect to dbt Core - Azure Databricks

Learn what is dbt, and how to connect your Azure Databricks workspace to dbt Core, an open-source command line tool that enables data teams to transform data.

You can then add SQL or Python models to the models/ folder, or, if you use my example project from git, you can start from the samples there and customize them to what you want. You should have a structure like this:

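Roughly, it follows the standard scaffold created by dbt init plus the models used in this post (this is a sketch; the exact file names for the dimension models are my assumption and may differ in the example repo):

dbt_project.yml
profiles.yml                             # the generic profile shown below
models/
    sources.yml
    transaction_facts.sql
    currency_dimension.sql               # assumed file name
    transaction_provider_dimension.sql   # assumed file name
macros/
seeds/
snapshots/
tests/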

To run it, execute dbt deps to install the dbt packages, followed by dbt run. If you used the dbt init command, it would have prompted you to set up a connection, but another option is to use a generic profiles.yml file with any secrets passed as environment variables. Here is what I have been using:

databricks_sqldw:
  target: dev
  outputs:
    dev:
      host: "{{ env_var('DBT_HOST') }}"
      http_path: "/sql/1.0/endpoints/{{ env_var('DBT_DBSQL_WAREHOUSE_ID') }}"
      catalog: demo
      schema: gold
      threads: 2
      token: "{{ env_var('DBT_ACCESS_TOKEN') }}"
      type: databricks
databricks_cluster:
  target: dev
  outputs:
    dev:
      host: "{{ env_var('DBT_HOST') }}"
      http_path: "/sql/protocolv1/o/{{ env_var('DBT_CLUSTER_ID') }}"
      catalog: demo
      schema: gold
      threads: 2
      token: "{{ env_var('DBT_ACCESS_TOKEN') }}"
      type: databricks

This defines two profiles to connect to Databricks: one through a SQL warehouse and one through a shared cluster; you need to select which one to use. You can customize which one each model uses, but setting up a default profile is as simple as adding the following to your dbt_project.yml:

# This setting configures which "profile" dbt uses for this project.
profile: 'databricks_cluster'

⚠️ Right now DBT Python models only work with clusters, not SQL warehouses, so that is what I am using in my example.
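
For reference, a dbt Python model is just a .py file in the models/ folder that defines a model(dbt, session) function and returns a Spark DataFrame. Here is a minimal sketch, assuming a hypothetical model that aggregates the transaction_facts table defined later in this post:

# models/transactions_by_currency.py -- hypothetical file name
def model(dbt, session):
    # materialize the result as a table; it runs on the cluster configured in the profile
    dbt.config(materialized="table")

    # dbt.ref() returns the upstream model as a Spark DataFrame
    transactions = dbt.ref("transaction_facts")

    # simple aggregation using the Spark DataFrame API
    return transactions.groupBy("currency_id").count()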

To work with this profile locally, you can use a .env file, which pipenv loads automatically into the environment. Use it for local development and add the connection details there:

DBT_HOST=
DBT_DBSQL_WAREHOUSE_ID=
DBT_ACCESS_TOKEN=
DBT_CLUSTER_ID=

How does the modeling part work?

Let’s go over how the modeling part works. Since we are going to reference the Silver layer tables that already exist in the database, we need to add a sources.yml file inside the models/ folder. The actual location does not matter, but let’s look at its structure:

version: 2

sources:
  - name: silver_layer
    catalog: demo
    schema: main
    tables:
      - name: item_silver
      - name: user_silver
      - name: transaction_silver

You can find more information in the dbt documentation, but you can add documentation and tests to your sources, the same as you can for the models themselves. This can be useful to test any assumptions about the data coming from the sources.
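
As a sketch of what that could look like, the transaction_silver entry could be extended with column-level documentation and tests (the description is made up; the columns are ones used by the transaction_facts model shown below):

version: 2

sources:
  - name: silver_layer
    catalog: demo
    schema: main
    tables:
      - name: transaction_silver
        description: "Transactions landed in the Silver layer"  # made-up description
        columns:
          - name: userId
            tests:
              - not_null
          - name: price
            tests:
              - not_null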

With this file in place, we can now create a table in the model. This is the transaction_facts.sql Delta table; here you can see how the Jinja template syntax is used for the configuration and for referencing the source and other models.

{{ config(
  materialized='table',
  file_format='delta'
) }}

SELECT
    userId as user_id,
    itemId as item_id,
    currency_id,
    provider_id,
    price,
    date
FROM {{ source('silver_layer', 'transaction_silver') }}, {{ ref('currency_dimension') }}, {{ ref('transaction_provider_dimension') }}

This is a materialized table, which means it is recreated every time. There are ways to set up incremental processing, but for now, if we run dbt run, the model gets compiled into SQL and runs against the cluster we configured in the profile.
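
As an illustration of the incremental option, here is a minimal sketch of how a model like this could be made incremental (the append strategy and the date filter are assumptions, not what the example repo does):

{{ config(
  materialized='incremental',
  file_format='delta',
  incremental_strategy='append'
) }}

SELECT
    userId as user_id,
    itemId as item_id,
    price,
    date
FROM {{ source('silver_layer', 'transaction_silver') }}
{% if is_incremental() %}
-- only process rows newer than what is already in the target table
WHERE date > (SELECT max(date) FROM {{ this }})
{% endif %}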


As a side note, something that works quite well is the lineage part of the integration with Unity Catalog. It has its limitations, but all the lineage and documentation is synced between DBT and Unity Catalog.


How to deploy it and run it in Databricks?

Databricks has already worked on this. There is a special dbt job task type; you can find the documentation here:

Use dbt transformations in an Azure Databricks job - Azure Databricks

Learn how to integrate your dbt Core transformations in an Azure Databricks Jobs workflow.

There is an issue with this job type: the profiles.yml needs to be static, except for the token. At least, I have not found a way around this out of the box. I have not implemented anything yet, but what I would do to overcome this is a pipeline that deploys the code to a new branch with the profile updated for the environment you want to deploy the job to.
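
For example, such a pipeline could render and commit a static profile like this one to the deployment branch, keeping only the token as an environment variable (the host and http_path values below are placeholders, not real ones):

databricks_cluster:
  target: dev
  outputs:
    dev:
      host: adb-1234567890123456.7.azuredatabricks.net                      # placeholder workspace host
      http_path: /sql/protocolv1/o/1234567890123456/0123-456789-abcdefgh    # placeholder cluster path
      catalog: demo
      schema: gold
      threads: 2
      token: "{{ env_var('DBT_ACCESS_TOKEN') }}"                            # the token stays as a secret
      type: databricks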

Conclusion

This is an exercise I have been doing to learn a new tool. DBT is a powerful open-source tool for building models in data warehouses: it allows for modularization and organization of code, includes testing functionality, and provides a simple interface for deploying models, in addition to being familiar to Data Analysts and BI Engineers.

I have worked on projects where it would not have been a good fit. The trend with Databricks, and the way I have built most of my pipelines for a couple of years now, has been to use Structured Streaming so that batch and streaming pipelines write to Delta tables in the same way. This pattern greatly simplifies thinking about and designing your pipelines, since they are incremental “by design”. You can still build an incremental model with DBT, but it starts to get messy.

In my opinion, if you think about and design your project as pipelines, then DBT is not for you or your team. If, on the other hand, you think about and design the project as models, then DBT might be a good candidate.
