KramaBench: A Benchmark for AI Systems on Data-to-Insight Pipelines over Data Lakes

Dataemia


KramaBench: A Benchmark for AI Systems on Data-to-Insight Pipelines over Data Lakes, by Eugenie Lai and 18 other authors


Abstract: Discovering insights from a real-world data lake potentially containing unclean, semi-structured, and unstructured data requires a variety of data processing tasks, ranging from extraction and cleaning to integration, analysis, and modeling. This process often also demands domain knowledge and project-specific insight. While AI models have shown remarkable results in reasoning and code generation, their abilities to design and execute complex pipelines that solve these data-lake-to-insight challenges remain unclear. We introduce KramaBench, which consists of 104 manually curated and solved challenges spanning 1700 files, 24 data sources, and 6 domains. KramaBench focuses on testing the end-to-end capabilities of AI systems to solve challenges that require automated orchestration of different data tasks. KramaBench also features a comprehensive evaluation framework assessing the pipeline design and individual data task implementation abilities of AI systems. We evaluate 8 LLMs using our single-agent reference framework DS-Guru, alongside both open- and closed-source single- and multi-agent systems, and find that while current agentic systems may handle isolated data-science tasks and generate plausible draft pipelines, they struggle to produce working end-to-end pipelines. On KramaBench, the best system reaches only 55% end-to-end accuracy in the full data-lake setting. Even with perfect retrieval, accuracy tops out at 62%. Leading LLMs can identify up to 42% of important data tasks but can fully implement only 20% of individual data tasks. Our code, reference framework, and data are available at this https URL.
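To make the idea of a data-to-insight pipeline concrete, here is a minimal Python sketch that chains the data tasks the abstract names (extraction, cleaning, integration, analysis) over a toy two-file "data lake", then scores the result the way an end-to-end accuracy metric might. Everything here is a hypothetical illustration: the station/temperature data, the functions `extract`, `clean`, `integrate`, `analyze`, `end_to_end_accuracy`, and `task_identification_rate`, and the exact-match scoring rule are assumptions, not KramaBench's data, API, or evaluation method, which may differ.

```python
import csv
import io
import json

# Hypothetical toy "data lake": one CSV table and one JSON-lines file.
# Invented for illustration; this is NOT KramaBench data.
STATIONS_CSV = """station_id,region
S1,north
S2,south
S3,north
"""

READINGS_JSONL = """{"station": "S1", "temp_c": "21.5"}
{"station": "S2", "temp_c": "n/a"}
{"station": "S3", "temp_c": "19.5"}
{"station": "S2", "temp_c": "25.0"}
not valid json
"""


def extract(csv_text, jsonl_text):
    """Extraction: parse the CSV table and JSON-lines records, skipping malformed lines."""
    stations = list(csv.DictReader(io.StringIO(csv_text)))
    readings = []
    for line in jsonl_text.splitlines():
        try:
            readings.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # drop lines that are not valid JSON
    return stations, readings


def clean(readings):
    """Cleaning: keep only readings whose temperature parses as a float."""
    cleaned = []
    for r in readings:
        try:
            cleaned.append({"station": r["station"], "temp_c": float(r["temp_c"])})
        except (KeyError, ValueError):
            continue  # missing field or unparseable value such as "n/a"
    return cleaned


def integrate(stations, readings):
    """Integration: join each reading to its station's region on the station ID."""
    region_of = {s["station_id"]: s["region"] for s in stations}
    return [dict(r, region=region_of[r["station"]]) for r in readings if r["station"] in region_of]


def analyze(rows):
    """Analysis: mean temperature per region."""
    totals = {}
    for row in rows:
        total, count = totals.get(row["region"], (0.0, 0))
        totals[row["region"]] = (total + row["temp_c"], count + 1)
    return {region: total / count for region, (total, count) in totals.items()}


def run_pipeline():
    """End-to-end pipeline: extract -> clean -> integrate -> analyze."""
    stations, readings = extract(STATIONS_CSV, READINGS_JSONL)
    return analyze(integrate(stations, clean(readings)))


def end_to_end_accuracy(answers, references):
    """Hypothetical metric: fraction of challenges whose final answer exactly matches the reference."""
    if not references:
        return 0.0
    solved = sum(1 for cid, ref in references.items() if answers.get(cid) == ref)
    return solved / len(references)


def task_identification_rate(identified, reference_tasks):
    """Hypothetical metric: fraction of reference data tasks that the system's pipeline includes."""
    reference = set(reference_tasks)
    if not reference:
        return 0.0
    return len(set(identified) & reference) / len(reference)


if __name__ == "__main__":
    print(run_pipeline())
    answers = {"c1": run_pipeline(), "c2": None}  # c2: system produced no answer
    references = {"c1": {"north": 20.5, "south": 25.0}, "c2": 42}
    print(end_to_end_accuracy(answers, references))
    print(task_identification_rate(["extract", "analyze"], ["extract", "clean", "integrate", "analyze"]))
```

Note how a single skipped step breaks the final answer: without `clean`, `float("n/a")` never gets filtered and the pipeline fails, which is the kind of end-to-end failure the abstract contrasts with solving isolated tasks. Exact-match scoring is the simplest possible choice; a real benchmark would likely need type-aware comparisons, which this sketch does not attempt.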

Submission history

From: Gerardo Vitagliano
[v1] Fri, 6 Jun 2025 21:18:45 UTC (374 KB)
[v2] Tue, 7 Oct 2025 18:15:23 UTC (327 KB)
[v3] Thu, 5 Mar 2026 19:25:53 UTC (1,118 KB)
