
Overview

Most of the world’s structured data lives in relational databases. Banks, hospitals, retailers, and governments keep their records in collections of tables linked by keys: customers point to orders, orders point to products, products point to suppliers. This is where the data actually is, and it rarely moves.

Machine learning, however, expects a single flat table: one row per example, one column per feature. To bridge the gap, practitioners spend most of their effort flattening the database into that shape: joining tables, aggregating across relationships (“how many orders did this customer place in the last 30 days?”), and hand-crafting features that carry the relational structure into a form a model can read. This step is laborious, and it discards much of the structure that made the database useful in the first place. It also forces the data out of the system it lives in, which is expensive and often impossible when privacy or governance rules keep the data in place.
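To make the flattening step concrete, the following is a minimal sketch using Python's built-in sqlite3 module. The schema, table names, column names, and dates are all hypothetical and chosen only for illustration. It computes the example aggregate from the text (orders per customer in the last 30 days) and returns one row per customer, which is the flat shape a conventional model expects.

```python
import sqlite3

# Hypothetical toy schema: every table, column, and value here is illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        order_date  TEXT  -- ISO-8601 dates, so string comparison orders correctly
    );
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Edsger');
    INSERT INTO orders VALUES
        (10, 1, '2024-05-20'),
        (11, 1, '2024-05-28'),
        (12, 1, '2024-03-01'),
        (13, 2, '2024-05-30');
""")


def orders_last_30_days(conn, as_of):
    """Return one (customer_id, n_orders_30d) row per customer.

    The LEFT JOIN keeps customers with no recent orders (count 0), and the
    date window lives in the join condition so it does not filter them out.
    """
    query = """
        SELECT c.customer_id, COUNT(o.order_id) AS n_orders_30d
        FROM customers AS c
        LEFT JOIN orders AS o
          ON o.customer_id = c.customer_id
         AND o.order_date > date(:as_of, '-30 days')
         AND o.order_date <= :as_of
        GROUP BY c.customer_id
        ORDER BY c.customer_id
    """
    return conn.execute(query, {"as_of": as_of}).fetchall()


flat_rows = orders_last_30_days(conn, "2024-06-01")
print(flat_rows)  # [(1, 2), (2, 1), (3, 0)]
```

Each such feature is one hand-written query. The link between an order and its customer survives only as a single number, which is the loss of structure described above.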

Relational machine learning offers a different path: learn directly on the relational structure, without flattening and without moving the data. Recent relational deep learning work (Fey et al. 2024; Robinson et al. 2024) treats the database as a graph, with rows as nodes and foreign keys as edges, and trains a model over that graph in place. The promise is to skip the brittle, manual feature-engineering step entirely and let the model exploit the schema as it stands.
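The graph view can be illustrated with a small sketch, again using sqlite3 and a hypothetical toy schema. It is not the construction used in the cited papers. It only shows the basic idea that each row becomes a node and each foreign-key reference becomes an edge. The sketch assumes every table has a single-column primary key and that every foreign key points at the primary key of the referenced table.

```python
import sqlite3

# Hypothetical toy schema: names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE products  (product_id  INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        product_id  INTEGER REFERENCES products(product_id)
    );
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO products  VALUES (100, 'Widget');
    INSERT INTO orders    VALUES (10, 1, 100), (11, 2, 100);
""")


def database_to_graph(conn):
    """Build (nodes, edges) where a node is (table, primary_key_value).

    Assumes single-column primary keys and foreign keys that reference
    the parent table's primary key.
    """
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]

    # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
    pk_of = {}
    for table in tables:
        for row in conn.execute(f"PRAGMA table_info({table})"):
            if row[5] == 1:
                pk_of[table] = row[1]

    nodes, edges = set(), set()
    for table in tables:
        pk = pk_of[table]
        for (pk_value,) in conn.execute(f"SELECT {pk} FROM {table}"):
            nodes.add((table, pk_value))

        # PRAGMA foreign_key_list rows: (id, seq, parent_table, from_col, to_col, ...)
        for fk in conn.execute(f"PRAGMA foreign_key_list({table})").fetchall():
            parent_table, from_col = fk[2], fk[3]
            rows = conn.execute(
                f"SELECT {pk}, {from_col} FROM {table} "
                f"WHERE {from_col} IS NOT NULL")
            for pk_value, parent_key in rows:
                edges.add(((table, pk_value), (parent_table, parent_key)))
    return nodes, edges


nodes, edges = database_to_graph(conn)
print(len(nodes), "nodes,", len(edges), "edges")
```

A real pipeline would also attach each row's column values as node features and would typically avoid pulling the whole graph into memory. Both are omitted here to keep the sketch short.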

This reframes a long-standing problem. Automated Machine Learning (AutoML) was built to take the burden of model selection and hyperparameter tuning off the human (Thornton et al. 2013; Feurer et al. 2015; Hutter, Kotthoff, and Vanschoren 2019). But today’s AutoML tools still assume the data arrives as a fixed, single table. They automate the search over algorithms; they do not reason about the data itself, that is, how it could be reshaped, joined, aggregated, or enriched, which is where real performance gains usually come from (Kanter and Veeramachaneni 2015). On a relational database, the data side is the problem: which tables to bring in, which relationships to traverse, which aggregates to compute. An AutoML system that understands relational structure has a far richer space of moves available to it than one staring at a single table.
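As a rough illustration of how the schema defines a space of moves, the sketch below enumerates one-hop aggregate features for a target table. The schema description, the column names, and the list of aggregates are all hypothetical. It is a simplified toy in the spirit of automated feature synthesis, not a reimplementation of any cited method.

```python
from itertools import product

# Hypothetical schema description: (child_table, parent_table) foreign keys
# and the numeric columns available in each table. All names are illustrative.
FOREIGN_KEYS = [
    ("orders", "customers"),
    ("orders", "products"),
    ("products", "suppliers"),
]
NUMERIC_COLUMNS = {
    "orders": ["amount", "quantity"],
    "products": ["price"],
}
COLUMN_AGGREGATES = ["SUM", "AVG", "MAX"]


def candidate_features(target_table):
    """List one-hop aggregate features over tables that reference target_table."""
    candidates = []
    for child, parent in FOREIGN_KEYS:
        if parent != target_table:
            continue
        # A row count needs no column, so it is always available.
        candidates.append(f"COUNT({child}.*)")
        for agg, column in product(COLUMN_AGGREGATES, NUMERIC_COLUMNS.get(child, [])):
            candidates.append(f"{agg}({child}.{column})")
    return candidates


features = candidate_features("customers")
for feature in features:
    print(feature)
```

Even this toy schema yields seven candidates for a single table at depth one. Allowing multi-hop paths, time windows, and filters grows the space quickly, which is why the choice of which relationships to traverse becomes a search problem in its own right.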

This is where the current wave of agentic AI becomes interesting. An agent can plan, take actions, and call tools on its own. Pointed at a relational database, an agent could in principle read the schema, decide which relationships are worth exploring, derive new features across tables, augment or enrich the data, and then run the model search, all without a human flattening the database by hand and without copying the data out. But to do this in a principled way, we need to ask a basic question: what is the right way to describe and organize this whole process? What formalism lets an agent reason about schema traversal, feature enrichment, and model search together, rather than as disconnected steps?

A second, very practical question is the infrastructure underneath. When an agent experiments with many versions of a relational dataset (materializing a join, adding a derived table, augmenting rows, trying a transformation) it needs a fast and cheap way to create, modify, and discard these versions, ideally in place so the data never leaves the database. Database clones, lightweight copy-on-write branches of a database that can be spun up and thrown away, may be a natural fit. Each experiment becomes a clone the agent can play with without disturbing the original or moving anything. Whether database clones are actually a good backbone for agentic relational AutoML is an open and practical question.
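The create, modify, and discard workflow can be sketched with sqlite3 as a stand-in. Note the important caveat: `Connection.backup` makes a full copy, not a copy-on-write clone, so this shows only the shape of the workflow and none of the cost advantages a real clone mechanism would offer. The helper names and the schema are hypothetical.

```python
import sqlite3


def make_clone(source):
    """Return an independent copy of `source`.

    Stand-in only: this is a full copy via the SQLite backup API,
    not a lightweight copy-on-write clone.
    """
    clone = sqlite3.connect(":memory:")
    source.backup(clone)
    return clone


def table_names(conn):
    """Return the sorted names of all tables in the database."""
    return sorted(row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"))


# Hypothetical original database: names and values are illustrative only.
original = sqlite3.connect(":memory:")
original.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id)
    );
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 2);
""")

# One experiment: materialize a derived table inside the clone only.
clone = make_clone(original)
clone.execute("""
    CREATE TABLE customer_order_counts AS
    SELECT customer_id, COUNT(*) AS n_orders
    FROM orders
    GROUP BY customer_id
""")
clone.commit()

clone_tables = table_names(clone)
original_tables = table_names(original)

# Discard the experiment; the original was never modified.
clone.close()

print("clone:   ", clone_tables)
print("original:", original_tables)
```

With a true copy-on-write mechanism, creating the clone would cost roughly the same regardless of database size, and only the derived table would consume new storage. Whether that holds up under an agent creating many such clones is part of the open question above.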

Research Questions

We organize the investigation around three questions:

Approach

The following is a tentative plan that can be refined as the project moves forward.

Survey first.

The core deliverable is a literature review, essentially a small survey paper, mapping out the state of AutoML and relational machine learning. The student will read and organize recent work on automated model selection, hyperparameter optimization, automated feature engineering, relational deep learning, and data augmentation, and place agentic approaches in that context.

Identify the gaps.

From the survey, we want to pin down concretely where today’s tools stop short, with attention to the data side (relational feature enrichment, augmentation, learning in place) rather than only the algorithm side.

Sketch the formalism.

Based on what the survey reveals, we will try to outline what a formal description of agentic relational AutoML might look like. This is exploratory and conceptual; the goal is a clear sketch, not a finished theory.

Consider the infrastructure angle.

Where time allows, we will examine database clones as a possible substrate for managing relational dataset experiments in place, and discuss the practical trade-offs.

Feasibility and Resources

The bulk of the project is reading, organizing, and writing, which is very tractable for a motivated student. Some lightweight hands-on experimentation with existing AutoML and relational learning libraries and database tooling may be involved. Access to modest computing power will be needed for any experiments.

Logistics and Collaboration

Minimum Requirements

An ideal candidate should meet the following:

Nice to Have

The following can be picked up during the project:

References