Apache DataFusion is a silver bullet for your product

Published September 26, 2024

Do you know what Apache DataFusion is? It is a project in the Apache Arrow family, alongside Arrow IPC and Apache Parquet.

It sits on top of your data and provides a query engine. But what is a query engine? It is the component that takes a query, validates it, enriches it, builds a plan, and schedules its execution. It can do many other things; for example, it may consult indexes to figure out which pages need to be loaded when you run an aggregation with a filter. In short, it is a complicated piece of software that is usually found inside databases. Usually, but not anymore!

Because DataFusion, with a couple of lines of pretty Rust code, enables you to read CSV, Parquet, and more, and to run SQL queries on top. Very powerful. Such pretty Rust code hides a lot of complexity and a lot of code; you will notice how the compile time changes as soon as you import the project, but as I said, it is powerful.
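
Here is roughly what those couple of lines look like. A minimal sketch, assuming a recent DataFusion release and a hypothetical local file data/example.csv with a header row:

    use datafusion::prelude::*;

    #[tokio::main]
    async fn main() -> datafusion::error::Result<()> {
        let ctx = SessionContext::new();
        // Expose the CSV as a table named "example".
        ctx.register_csv("example", "data/example.csv", CsvReadOptions::new())
            .await?;
        let df = ctx.sql("SELECT count(*) FROM example").await?;
        // Peek at the logical plan the engine built (the planning step above).
        println!("{}", df.logical_plan().display_indent());
        df.show().await?;
        Ok(())
    }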

I am not a database engineer, I don’t care

Wrong!

Why?

For sure, it is not something you can put in front of users unfamiliar with SQL without guidance, but it can be a way to provide an advanced interface that users can use to look at their data. Everything I do can be summarized as an interface, sometimes JSON, sometimes HTML, on top of a subset of data. Imagine an empty page where customers can JOIN their data, or SELECT from multiple sources. Very powerful.
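
Wiring that up is mostly registration. A sketch reusing the ctx from above; the file names (orders.csv, customers.parquet) and their columns are my own assumptions, and the user's SQL does the joining:

    // Both sources become plain tables in the same session.
    ctx.register_csv("orders", "data/orders.csv", CsvReadOptions::new())
        .await?;
    ctx.register_parquet(
        "customers",
        "data/customers.parquet",
        ParquetReadOptions::default(),
    )
    .await?;

    // The customer-facing "empty page" just forwards SQL like this:
    let df = ctx
        .sql(
            "SELECT c.name, count(*) AS orders
             FROM orders o JOIN customers c ON o.customer_id = c.id
             GROUP BY c.name",
        )
        .await?;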

It is different if you provide a tool for developers. I work at an AI company that deals with customer datasets in table format. I worked in databases for a few years, but I am not a database engineer, and yet the API that helps customers visualize their datasets relies on DataFusion. Since the datasets are in the "couple of GB" order of magnitude, I don't even need to build indexes for them. Or at least not yet, but DataFusion provides hooks for when I will need them.

Conclusion

The snippet below shows how easy it is to run a SQL query against files stored in AWS S3 with DataFusion. I hope it is easy to follow, but let's start with a bit of pseudocode:

  1. Create a new session
  2. Register an object store, in my case S3
  3. Point at the files in the S3 bucket and make them look like a table
  4. Query your table and collect the result

    use std::sync::Arc;

    use datafusion::datasource::file_format::parquet::ParquetFormat;
    use datafusion::datasource::listing::ListingOptions;
    use datafusion::prelude::*;
    use url::Url;

    // 1. Create a new session (information_schema enables e.g. SHOW TABLES).
    let ctx = SessionContext::new_with_config(
        SessionConfig::new().with_information_schema(true),
    );

    // 2. Register the object store under the s3:// scheme. `object_store`
    //    is built beforehand, e.g. with object_store::aws::AmazonS3Builder.
    ctx.runtime_env().register_object_store(
        &Url::parse(format!("s3://{}", bucket_name).as_str()).unwrap(),
        object_store,
    );

    // 3. Expose the Parquet files under the prefix as a table.
    let file_format = ParquetFormat::default().with_enable_pruning(Some(true));
    let listing_options = ListingOptions::new(Arc::new(file_format));
    ctx.register_listing_table(
        "my_table",
        format!("s3://{}{}", bucket_name, prefix), // prefix starts with '/'
        listing_options,
        None,
        None,
    )
    .await?;

    // 4. Run the SQL and collect the results as Arrow RecordBatches.
    let df = ctx.sql(&query).await?;
    let results = df.collect().await?;
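
If you want to eyeball what came back, Arrow's pretty printer renders the collected batches as a table. A small, optional addition to the snippet above:

    use datafusion::arrow::util::pretty::pretty_format_batches;

    // `results` is the Vec<RecordBatch> collected above.
    println!("{}", pretty_format_batches(&results)?);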

Andrew Lamb spoke at the CMU Database Group about DataFusion, and since he is one of the core maintainers, you can't miss it.

Pretty straightforward to use! I think you should try it out, and if you have a use case you need help with, let me know, because there is a lot more you can do: register functions, hook in index files, and so on.
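
As a taste of "register functions", here is a sketch of a scalar UDF. The exact create_udf signature has moved around between DataFusion releases, so treat this as the shape of the API rather than gospel; the function name "twice" and the Float64-only handling are my own assumptions:

    use std::sync::Arc;

    use datafusion::arrow::array::{ArrayRef, Float64Array};
    use datafusion::arrow::datatypes::DataType;
    use datafusion::logical_expr::{create_udf, ColumnarValue, Volatility};

    // A scalar function that doubles a Float64 column, preserving nulls.
    let twice = Arc::new(|args: &[ColumnarValue]| {
        let arrays = ColumnarValue::values_to_arrays(args)?;
        let input = arrays[0]
            .as_any()
            .downcast_ref::<Float64Array>()
            .expect("twice expects a Float64 column");
        let doubled: Float64Array =
            input.iter().map(|v| v.map(|x| x * 2.0)).collect();
        Ok(ColumnarValue::Array(Arc::new(doubled) as ArrayRef))
    });

    ctx.register_udf(create_udf(
        "twice",
        vec![DataType::Float64],
        DataType::Float64,
        Volatility::Immutable,
        twice,
    ));

    // Now usable directly in SQL: SELECT twice(price) FROM my_table;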

Are you having trouble figuring out how to build, automate, release, and troubleshoot your software? Get actionable lessons learned straight to your inbox via email.