📓 Fusion Diaries: Build conformance, cleaner logging, and a new Databricks driver

Author: Anders Swanson, Senior DX Advocate at dbt Labs

Long time no talk! Here’s what we’ve been up to since Coalesce last month in Las Vegas.

Coalesce on the road

Last week we were in Stockholm, Paris, London, and Madrid. In December we’ll be in Munich, Amsterdam, and Tokyo.


We’d love to say hi in person, so sign up if your city is on the list!

If you can’t make it, be sure to at least check out Elias’s demo linked below.

TL;DR

Velocity

  • 131 issues closed as completed across the dbt-fusion and internal repos
  • 26 new preview releases (preview.47 to preview.72)

Per usual, check out dbt-fusion’s CHANGELOG for the specifics.

Big rocks

  • A new Fusion readiness checklist
  • Cleaner logging! Check out the logs/dbt.log now
  • Improvements to dbt-autofix
  • New Databricks driver

In progress

  • Python models (in beta for Snowflake, without static analysis)
  • Fusion-readiness for package ecosystem

Big themes

We continue to march towards general availability! There are two main themes to our work right now:

  1. “build conformance”
  2. delivering against our committed roadmap

Conformance

The Fusion team uses “conformance” to mean: does a team’s dbt project work with the dbt Fusion engine, with only automated changes via dbt-autofix?

When we began the dbt Fusion engine, our original conformance goal was parse conformance: can the Fusion engine parse all of a project’s .sql and .yml files and create a manifest.json? Once we achieved a significant enough level of parse conformance across dbt projects, we graduated to compile conformance.

Since Coalesce, our focus has been on build conformance. Does a project not only build with no errors, but also produce the same state in the data warehouse? This is no easy task!

The bugs we’re discovering in our quest for build conformance tend to be tangled up in SQL understanding, differences between Rust and Python, and state deferral. This work has even surfaced several issues in dbt Core (e.g. dbt-core#12152).
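
In practice, you can approximate this progression on your own project. A rough sketch, assuming you have the Fusion CLI (dbtf) and dbt-autofix installed:

dbt-autofix deprecations   # apply the automated fixes first
dbtf parse                 # parse conformance: can Fusion build a manifest?
dbtf compile               # compile conformance: does every model render?
dbtf build                 # build conformance: no errors, same warehouse state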

Roadmap

In the Path to GA blog from May, we spelled out five milestones that are still relevant today. The Milestones in the dbt-fusion repo largely still correspond to our original roadmap and are a great place to check in on our progress. Nearly all of our current feature work falls into the “Fast follows” and “Feature parity” buckets: logging UX, model governance, semantic layer, state modified conformance, and Python models.


Big rocks: What shipped this week

Readiness checklist

Our amazing docs team shipped a great Fusion readiness checklist that spells out the steps below as the recommended path to getting Fusion working for you; a rough command-line sketch of the first few steps follows the list.

Please check it out and share with anyone you know who hopes to start using Fusion soon!

Preparing for Fusion:

  1. Upgrade to the latest dbt Core version
  2. Resolve all deprecation warnings
  3. Validate and upgrade your dbt packages to the latest versions
  4. Check for known Fusion limitations
  5. Review jobs configured in the dbt platform
  6. Stay informed about Fusion progress
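
The first three steps often boil down to something like this on the command line (the adapter package is an example; adjust for your environment):

pip install --upgrade dbt-core dbt-snowflake   # step 1: latest dbt Core + your adapter
dbt parse                                      # step 2: surfaces deprecation warnings to resolve
dbt deps                                       # step 3: after bumping versions in packages.yml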

Logging experience

We’ve asked for your feedback many times about the UX of logging, and y’all have delivered! The discussion created to collect feedback, dbt-fusion#584, has ~20 asks from the community. In the past month, we’ve shipped four of them related to logs/dbt.log, namely that it now:

  • Once again contains more information than stdout
  • No longer gets overwritten by every invocation; instead, it appends to the end, just like dbt Core’s log
  • Contains the SQL executed, which was previously only in logs/query_log.sql
  • Contains timestamps for (almost) all events

What’s keeping us from further improvements is polishing the underlying platform. We’re making strong progress on the OpenTelemetry-inspired tracing and telemetry system that replaces dbt Core’s structured logging. This work not only brings much-needed improvements to log rendering and UX, but also serves as a platform upon which we can continue to deliver improvements beyond what Core can even do.

Looking ahead, we plan to release open-source tooling that will let the community build integrations on top of this telemetry using strictly typed, well-defined Python APIs. More details coming soon.

In the meantime, if you’d like a taste of what the new logging provides, I’d encourage you to experiment with new flags like --otel-file-name or --otel-parquet-file-name.

I’m personally very excited at the prospect of really drilling down into the performance of dbt invocations to discover the biggest bottlenecks.

dbtf compile --otel-parquet-file-name compile_otel.parquet
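
Once you have a parquet trace file, you can start digging into where an invocation spends its time. Here’s a hypothetical Python sketch; the exact schema of the parquet output isn’t documented yet, so the column names (name, start_time, end_time) are placeholders:

import pandas as pd

# Load the OpenTelemetry-style spans written by --otel-parquet-file-name.
spans = pd.read_parquet("compile_otel.parquet")

# Hypothetical column names -- adjust to whatever the real schema exposes.
spans["duration_ms"] = (spans["end_time"] - spans["start_time"]).dt.total_seconds() * 1000

# The slowest spans are the likeliest bottlenecks.
print(spans.nlargest(10, "duration_ms")[["name", "duration_ms"]])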
        

dbt-autofix improvements

dbt-autofix is already a great tool for finding (and fixing!) code that contains deprecated functionality, but the team has been hard at work making it even better for getting your project ready to run on Fusion. Fixes and new functionality we’ve added lately include:

  • New --include-private-packages flag in dbt-autofix deprecations to identify and fix deprecations in private packages you’ve installed in your project. If any of the private packages have deprecations, this flag tells you whether your project will run once those packages are fixed; and if you own those packages, you can fix them and make your project Fusion-compatible.
  • (BETA) 🚧 New dbt-autofix packages command to identify which package dependencies in your project are Fusion-compatible and optionally upgrade incompatible versions to a Fusion-compatible version
  • Better handling of semantic models: support for merging complex/layered metrics, and no more duplicated descriptions when you merge a semantic model and a dbt model that both have descriptions
  • Improved --all option in dbt-autofix deprecations to fix invalid YAML, deprecation warnings, and more complex behavior changes all in one go
  • Autofix can now automatically remove extra spaces in dbt_project.yml tags (e.g. “+ tags” -> “+tags”)
  • If you love using “fancy quotes” in column and table descriptions, autofix now loves them too and won’t remove them
  • Smarter handling of custom keys in configs

To get started with dbt-autofix, run dbt-autofix in the Studio IDE or install the dbt-autofix Python package on your computer (see the README for more details).
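
A minimal local session might look like this, using only the commands and flags described above:

pip install dbt-autofix                              # install the package locally
dbt-autofix deprecations --all                       # fix YAML issues, warnings, and behavior changes
dbt-autofix deprecations --include-private-packages  # also scan private packages you’ve installed
dbt-autofix packages                                 # (beta) check package Fusion-compatibility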

If you have problems you think we could autofix, please let us know by opening an issue on GitHub.

In the coming weeks, we plan to launch the full version of dbt-autofix packages to upgrade all your packages to Fusion-compatible versions and to continue expanding the range of deprecation warnings that we can fix for you.

A new Databricks Arrow ADBC Driver

In partnership with our colleagues at both Databricks and Columnar, we’ve contributed a new driver to Arrow ADBC.

This week we plan to swap over to the new driver, which unblocks Fusion on Databricks to support both Python models and All-Purpose Compute Clusters, not just SQL warehouses.

The work that our Fusion adapters team is doing is phenomenal. It doesn’t just benefit users of the dbt Fusion engine, but the data ecosystem at large. We expect the broader data industry to begin adopting ADBC soon. Collaborating across the industry on this emergent standard lets data tooling focus more on what “makes their beer taste better” rather than reinventing the wheel. End users benefit as well, from tools that connect to data warehouses more performantly and consistently.
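
If you’re curious what ADBC looks like from the consumer side, here’s a hypothetical Python sketch using the generic ADBC driver manager. The driver name below is a placeholder, not the new Databricks driver’s documented interface:

import adbc_driver_manager.dbapi as adbc

# Hypothetical: load a Databricks ADBC driver by name (placeholder).
conn = adbc.connect(driver="adbc_driver_databricks")
try:
    cur = conn.cursor()
    cur.execute("SELECT 1")
    # ADBC speaks Arrow natively, so results come back as Arrow tables.
    print(cur.fetch_arrow_table())
finally:
    conn.close()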

🚧 Work in progress

Python models

Python models are the oldest open issue on the dbt-fusion repo (dbt-fusion#3)!

However, it’s not so simple: shipping support for Python models in Fusion is a two-step process:

  1. Materialize them the way dbt Core does, across the adapters that support them
  2. Ensure models downstream of Python models can be statically analyzed.

The first part is the lower-hanging fruit. In fact, we’ve already shipped support for Python models on Snowflake behind a flag. To try them out, set the following environment variable:

DBT_ENABLE_BETA_PYTHON_MODELS=true
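
Python models themselves keep the same shape they have in dbt Core; on Snowflake, session is a Snowpark session. A minimal example (the model and ref names are illustrative):

# models/my_python_model.py
def model(dbt, session):
    dbt.config(materialized="table")

    # Reference an upstream model as a Snowpark DataFrame.
    upstream = dbt.ref("stg_orders")

    # Any transformation that returns a DataFrame works.
    return upstream.limit(100)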
        

The second step is more challenging! Currently, we can infer all of the columns’ datatypes when they’re defined in SQL, but Python is of course a different language! It’s still early, but more than likely this will require some YAML annotation of what the output column types will be, so that the information can be used downstream.
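
For illustration only, such an annotation might reuse today’s column-level data_type syntax; nothing here is a committed design:

models:
  - name: my_python_model
    columns:
      - name: order_id
        data_type: integer
      - name: revenue
        data_type: numeric(38, 2)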

What likely needs to come first is a path forward on our approach to storing table schema information. See Looking for feedback! below for more info.

Packages and require-dbt-version

One of the biggest efforts in getting to general availability is ensuring that dbt’s amazing package ecosystem works with the new engine! Many package maintainers have done heroic work to get their packages working with Fusion.

What we don’t have yet is a way for users to know for certain whether a package will work with the dbt Fusion engine.

Grace Goheen does a great job explaining our solution:

To signal your package’s compatibility with the dbt Fusion engine, include 2.0.0 or greater in your require-dbt-version range. For example:

# dbt_project.yml
require-dbt-version: [">=1.10.0", "<3.0.0"]        

If your range excludes 2.0.0 (for example, ">=1.6.0,<2.0.0"), Fusion will soon start issuing a warning when you try to use that package; in a later release, this will become an error.

dbt-autofix autoupgrading your packages

As covered in dbt-autofix improvements above, the beta dbt-autofix packages command can already identify which package dependencies in your project are Fusion-compatible and optionally upgrade incompatible versions. In the coming weeks, we plan to launch the full version to upgrade all your packages to Fusion-compatible versions.

More flexible CSV parsing for seeds

In our conformance work, we’ve seen Fusion struggle to seed some .csv files. The reason is that Fusion’s current CSV parser is much stricter than the one dbt Core uses.

The real reason is that .csv is woefully underspecified! A great explanation of the common CSV headaches is in this decade-old blog post: So You Want To Write Your Own CSV code?

While we could force dbt users to clean up their CSVs in order to use Fusion, that isn’t very practical. So dbt-fusion#1004 describes the work underway to support CSVs that dbt Core could previously parse without issue.
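
For a flavor of the problem, consider an illustrative seed like this: escaped double quotes are standard, but single-quoted fields, stray whitespace around values, and ragged rows are not, even though many lenient parsers shrug and accept them:

id,name,notes
1,Alice,"escaped ""quotes"" are standard CSV"
2,Bob,'single quotes are not'
3, Carol ,a ragged row with an extra trailing field,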

Looking for feedback!

In the last Fusion diary, I mentioned how often we’ve heard users encounter dbt-fusion#615. While trying to figure out a way to address it, we found ourselves rather disappointed with the schema cache in general, and landed on a proposal that would not only solve that paper cut but create a much better user experience.

Click on the GitHub discussion below and leave your feedback. We’d love to hear from as many folks as possible.

👓 Stuff you should watch

If you missed the keynotes, you should watch Elias’ demo of the dbt Fusion engine! Seriously, stop what you’re doing and check it out. The demo validates all the hard work put in by so many over the past year to build the future tooling of analytics engineers. A huge shoutout is in order, not just to the Fusion team but even more so to the community, with whom we have collaborated hand-in-hand for the past year.

🏁 Made it to the meme

This almost-10-year-old meme certainly makes me feel old! Still, the point stands: a lot of Fusion’s magic is aggressive caching. There’s no free lunch, though; it’s on us to make this experience as smooth and intuitive as possible. To that end, here’s a work-in-progress docs page that explains the various ways that dbt and the Fusion engine cache things (PR: docs.getdbt.com#8084).

But above all, check out GH discussion: source schemas should be first-class, versioned artifacts and share your thoughts!


