Bridging Vision and Language with Interpretable Multimodal Knowledge Graphs

[Submitted on 28 Jul 2025 (v1), last revised 10 Mar 2026 (this version, v2)]

MMGraphRAG: Bridging Vision and Language with Interpretable Multimodal Knowledge Graphs, by Xueyao Wan and 1 other author

Abstract: Large Language Models (LLMs) often suffer from hallucinations, which Retrieval-Augmented Generation (RAG) and GraphRAG mitigate by incorporating external knowledge and knowledge graphs (KGs). However, GraphRAG remains text-centric due to the difficulty of constructing fine-grained Multimodal KGs (MMKGs). Existing fusion methods, such as shared embeddings or captioning, require task-specific training and fail to preserve visual structural knowledge or cross-modal reasoning paths.

To bridge this gap, we propose MMGraphRAG, which integrates visual scene graphs with text KGs via a novel cross-modal fusion approach. It introduces SpecLink, a method leveraging spectral clustering for accurate cross-modal entity linking and path-based retrieval to guide generation. We also release the CMEL dataset, specifically designed for fine-grained multi-entity alignment in complex multimodal scenarios. Evaluations on CMEL, DocBench, and MMLongBench demonstrate that MMGraphRAG achieves state-of-the-art performance, showing robust domain adaptability and superior multimodal information processing capabilities.
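
The abstract does not describe how SpecLink works internally, so the following is only a minimal sketch of the general idea of using spectral clustering to group text and image entities, not the authors' implementation. It assumes each entity already has an embedding vector in a shared space (the toy vectors, the entity names, and the helpers `cosine`, `fiedler_vector`, and `link_entities` are all hypothetical), builds a cosine-similarity graph, and splits it in two using the sign of the Fiedler vector of the normalized Laplacian, found here by plain power iteration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def fiedler_vector(W, iters=500):
    """Second eigenvector of L_sym = I - D^-1/2 W D^-1/2 via power iteration.

    Power iteration runs on M = I + D^-1/2 W D^-1/2, which shares
    eigenvectors with L_sym; the trivial eigenvector D^1/2 * 1 is
    projected out at every step so the iteration converges to the
    next one (the Fiedler vector).
    """
    n = len(W)
    s = [math.sqrt(sum(row)) for row in W]           # sqrt of node degrees
    total = math.sqrt(sum(x * x for x in s))
    trivial = [x / total for x in s]                  # unit trivial eigenvector
    v = [float(i + 1) for i in range(n)]              # deterministic start
    for _ in range(iters):
        proj = sum(a * b for a, b in zip(v, trivial))
        v = [a - proj * b for a, b in zip(v, trivial)]
        v = [v[i] + sum(W[i][j] / (s[i] * s[j]) * v[j] for j in range(n))
             for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return v

def link_entities(names, kinds, embeddings):
    """Pair each text entity with the image entities in the same spectral cluster."""
    n = len(names)
    # Affinity graph: non-negative cosine similarity, no self-loops.
    W = [[0.0 if i == j else max(0.0, cosine(embeddings[i], embeddings[j]))
          for j in range(n)] for i in range(n)]
    labels = [x > 0 for x in fiedler_vector(W)]       # two-way split by sign
    return [(names[i], names[j])
            for i in range(n) for j in range(n)
            if kinds[i] == "text" and kinds[j] == "image"
            and labels[i] == labels[j]]

# Toy data (assumed, not from the paper): two text entities, two image regions.
names = ["dog", "region_0", "frisbee", "region_1"]
kinds = ["text", "image", "text", "image"]
embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]

links = link_entities(names, kinds, embeddings)
print(links)
```

With these toy vectors the split groups "dog" with "region_0" and "frisbee" with "region_1". A realistic pipeline would need more than two clusters and real embeddings; the two-way split is used here only to keep the sketch short.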

Submission history

From: Xueyao Wan
[v1] Mon, 28 Jul 2025 13:16:23 UTC (3,791 KB)
[v2] Tue, 10 Mar 2026 11:12:47 UTC (7,541 KB)
