Traverse a SQL query to get what you like with Rust

Published November 27, 2024

Recently I had to parse a SQL query to get the list of tables it requires, so I can hunt for them in my catalog and load them dynamically to Apache Datafusion.

Most likely there is a better way to do this, but I want to validate how the user will interact with this superpower of being able to query across their datasets. Good for us currently we manage many small datasets, so it is not a problem to load them all up to Datafusion.

I quickly experimented with what I suppose is “the right way” to do it, but I am yet not ready to embrace the work of writing a CatalogProvider, then a SchemaProvider and so on, mainly because I didn’t yet understand how they map to my internal structure. This is my opportunity to learn about it moving one step and the time but still being able to serve something to the user. I am user with a CatalogProvider users will be able to show tables and see their datasets all into datafusion, but I don’t need all of that right now. I just want to give them the ability to JOIN. Sometimes it is good to do what it takes to validate a feature instead of building it all from the ground up. Understanding why you can do it requires a solid understanding of the context you are into. As I said, a lot of small datasets, with already an understood visualization from customers around listing and picking them. This is why I feel ok doing just enough.

sqlparser is the crate I obviously used. It is the same one used internally by Datafusion so it is already a dependency for my application even if it is under the hood.

sqlparser provides visitors that are the perfect subject to traverse and extract from an AST what I am looking for. In theory, you can also mutate it now, but I don’t need to do that right now.

use sqlparser::ast::{Visitor, ObjectName};
use core::ops::ControlFlow;

#[derive(Default, Debug)]
pub struct GetTables {
   visited: Vec<String>,
}

impl Visitor for GetTables {
  type Break = ();
   fn post_visit_relation(&mut self, relation: &ObjectName) -> ControlFlow<Self::Break> {
    self.visited.push(relation.to_string());
    ControlFlow::Continue(())
  }
 }

#[cfg(test)]
mod tests {
    use crate::datafusion::GetTables;
    use sqlparser::parser::Parser;
    use sqlparser::dialect::GenericDialect;
    use sqlparser::ast::Visit;

    #[test]
    fn test_sql() {
         let tests = vec![
            (
                "SELECT * from table_name",
                vec![
                    "table_name",
                ],
            ),
            (
                "SELECT * from t1 join t2 on t1.id = t2.t1_id",
                vec![
                    "t1",
                    "t2",
                ],
            ),
        ];
        let dialect = GenericDialect {}; // or AnsiDialect, or your own dialect ...
        for (sql, expected) in tests {
            let ast = Parser::parse_sql(&dialect, sql).unwrap();
            let mut get_tables = GetTables::default();
            ast.visit(&mut get_tables);
            assert_eq!(get_tables.visited, expected);
        }
    }
}

Are you having trouble figuring out your way to building automation, release and troubleshoot your software? Let's get actionables lessons learned straight to you via email.