Putting Your DAG on the internet

June 14, 2024 · 5 min read

Ernesto Ongaro

Senior Solutions Architect @ dbt Labs

Sebastian Stan

Data Engineer @ EQT Group

Filip Byrén

VP and Software Architect @ EQT Group

Versionless is now the "latest" release track

This blog post was updated on December 04, 2024 to rename "versionless" to the "latest" release track allowing for the introduction of less-frequent release tracks. Learn more about Release Tracks and how to use them.

New in dbt: allow Snowflake Python models to access the internet

With dbt 1.8, dbt released support for Snowflake’s external access integrations further enabling the use of dbt + AI to enrich your data. This allows querying of external APIs within dbt Python models, a functionality that was required for dbt Cloud customer, EQT AB. Learn about why they needed it and how they helped build the feature and get it shipped!

Why did EQT require this functionality?

by Filip Bryén, VP and Software Architect (EQT) and Sebastian Stan, Data Engineer (EQT)

EQT AB is a global investment organization and as a long-term customer of dbt Cloud, presented at dbt’s Coalesce 2020 and 2023.

Motherbrain Labs is EQT’s bespoke AI team, primarily focused on accelerating our portfolio companies' roadmaps through hands-on data and AI work. Due to the high demand for our time, we are constantly exploring mechanisms for simplifying our processes and increasing our own throughput. Integration of workflow components directly in dbt has been a major efficiency gain and helped us rapidly deliver across a global portfolio.

Motherbrain Labs is focused on creating measurable AI impact in our portfolio. We work hand-in-hand with leadership from our deal teams and portfolio company leadership but our starting approach is always the same: identify which data matters.

While we have access to reams of proprietary information, we believe the greatest effect happens when we combine that information with external datasets like geolocation, demographics, or competitor traction.

These valuable datasets often come from third-party vendors who operate on a pay-per-use model; a single charge for every piece of information we want. To avoid overspending, we focus on enriching only the specific subset of data that is relevant to an individual company's strategic question.

In response to this recurring need, we have partnered with Snowflake and dbt to introduce new functionality that facilitates communication with external endpoints and manages secrets within dbt. This new integration enables us to incorporate enrichment processes directly into our DAGs, similar to how current Python models are utilized within dbt environments. We’ve found that this augmented approach allows us to reduce complexity and enable external communications before materialization.

An example with Carbon Intensity: How does it work?

In this section, we will demonstrate how to integrate an external API to retrieve the current Carbon Intensity of the UK power grid. The goal is to illustrate how the feature works, and perhaps explore how the scheduling of data transformations at different times can potentially reduce their carbon footprint, making them a greener choice. We will be leveraging the API from the UK National Grid ESO to achieve this.

To start, we need to set up a network rule (Snowflake instructions here) to allow access to the external API. Specifically, we'll create an egress rule to permit Snowflake to communicate with api.carbonintensity.org.

Next, to access network locations outside of Snowflake, you need to define an external access integration first and reference it within a dbt Python model. You can find an overview of Snowflake's external network access here.

This API is open and if it requires an API key, handle it similarly to managing secrets. More information on API authentication in Snowflake is available here.

For simplicity’s sake, we will show how to create them using pre-hooks in a model configuration yml file:

models:
  - name: external_access_sample
    config:
      pre_hook: 
        - "create or replace network rule test_network_rule type = host_port mode = egress value_list= ('api.carbonintensity.org.uk:443');"
        - "create or replace external access integration test_external_access_integration allowed_network_rules = (test_network_rule) enabled = true;"

Then we can simply use the new external_access_integrations configuration parameter to use our network rule within a Python model (called external_access_sample.py):

import snowflake.snowpark as snowpark
def model(dbt, session: snowpark.Session):
    dbt.config(
        materialized="table",
        external_access_integrations=["test_external_access_integration"],
        packages=["httpx==0.26.0"]
    )
    import httpx
    return session.create_dataframe(
            [{"carbon_intensity": httpx.get(url="https://api.carbonintensity.org.uk/intensity").text}]
    )

The result is a model with some json I can parse, for example, in a SQL model to extract some information:

{{
    config(
        materialized='incremental',
        unique_key='dbt_invocation_id'
    )
}}

with raw as (
    select parse_json(carbon_intensity) as carbon_intensity_json
    from {{ ref('external_access_demo') }}
)

select
    '{{ invocation_id }}' as dbt_invocation_id,
    value:from::TIMESTAMP_NTZ as start_time,
    value:to::TIMESTAMP_NTZ as end_time,
    value:intensity.actual::NUMBER as actual_intensity,
    value:intensity.forecast::NUMBER as forecast_intensity,
    value:intensity.index::STRING as intensity_index
from raw,
    lateral flatten(input => raw.carbon_intensity_json:data)

The result is a model that will keep track of dbt invocations, and the current UK carbon intensity levels.

Preview in dbt Cloud IDE of output

dbt best practices

This is a very new area to Snowflake and dbt -- something special about SQL and dbt is that it’s very resistant to external entropy. The second we rely on API calls, Python packages and other external dependencies, we open up to a lot more external entropy. APIs will change, break, and your models could fail.

Traditionally dbt is the T in ELT (dbt overview here), and this functionality unlocks brand new EL capabilities for which best practices do not yet exist. What’s clear is that EL workloads should be separated from T workloads, perhaps in a different modeling layer. Note also that unless using incremental models, your historical data can easily be deleted. dbt has seen a lot of use cases for this, including this AI example as outlined in this external engineering blog post.

A few words about the power of Commercial Open Source Software

In order to get this functionality shipped quickly, EQT opened a pull request, Snowflake helped with some problems we had with CI and a member of dbt Labs helped write the tests and merge the code in!

dbt now features this functionality in dbt 1.8+ and the "Latest" release track in dbt Cloud (dbt overview here).

dbt Labs staff and community members would love to chat more about it in the #db-snowflake slack channel.

Putting Your DAG on the internet

Why did EQT require this functionality?

An example with Carbon Intensity: How does it work?

dbt best practices

Comments

Missed Coalesce 2024?

Why did EQT require this functionality?​

An example with Carbon Intensity: How does it work?​

dbt best practices​

Comments

Missed Coalesce 2024?

Why did EQT require this functionality?

An example with Carbon Intensity: How does it work?

dbt best practices