CMDR: Contextual Multimodal Document Retrieval

Overview

We introduce Contextual Multimodal Document Retrieval (CMDR), a new multimodal document retrieval task in which a system must retrieve relevant pages from a multi-page document (i.e., an ordered set of document images) while leveraging the document context.

CMDR-Bench is designed to measure four skills in retrieval models:

Text Completion (TC): Key textual content is split across pages, requiring accurate reconstruction of the semantic flow to maintain coherence.

Coreference Resolution (CR): References such as pronouns or abbreviations must be resolved using contextual cues found only in other pages.

Structured Understanding (SU): Understanding how structured components relate to each other across page boundaries (e.g., a table header defined on one page and its rows continued on the next), independent of entity resolution or multi-entity aggregation.

Multi-hop Reasoning (MR): An inference path that connects the entities in the query and the relevant page via the document context. Unlike CR, which resolves referential links between an entity and its mentions, MR involves multi-entity semantic reasoning.

Statistics

Dataset distribution of the CMDR-Bench.

Key statistics of the CMDR-Bench.

Distribution of queries in the CMDR-Bench.

Model architecture

Unlike non-contextual retrievers (e.g., ColPali), which encode each page independently and thus fail to capture cross-page context, CMDR-Embed encodes not only the content of the page itself but also explicitly incorporates contextual information.

To enable cross-page interactions within a document, we adopt a chunk-then-split strategy. Consecutive pages are first encoded together, and the resulting representations are then split into page-level embeddings (i.e., sequences of token-level embeddings).

Contextual Multimodal Contrastive Learning (CMCL)

Because CMDR-Embed encodes multiple pages jointly, representations from different pages in the same document can be mixed, which weakens page-level discriminability. To address this issue, we propose a new contrastive learning framework, Contextual Multimodal Contrastive Learning (CMCL). Specifically, we reinforce the standard InfoNCE objective by introducing two types of context-aware hard negatives: In-Chunk Negatives, which are different pages within the same chunk, and In-Document Negatives, which are other chunks from the same document. Intuitively, In-Chunk Negatives are more effective when similar context appears on nearby pages, whereas In-Document Negatives are particularly helpful when information similar to the relevant page appears on distant pages outside the current chunk. This design prevents neighboring pages from becoming overly similar and ensures that each page maintains a distinct representation, even when contextual information is shared.

Main Results

CMDR-Embed, which leverages contextual information, significantly outperforms models that do not consider the document context. Importantly, this performance gain cannot be attributed solely to the training data. Even when trained on the same data, the non-contextual models ColPali and ColQwen still underperform compared to our contextual models CMDR-EmbedPali and CMDR-EmbedQwen. CMDR-EmbedQwen emerges as the strongest retriever in this setting, achieving an average improvement of 16.2 points over the best non-contextual baseline. These results demonstrate that explicitly modeling document context yields benefits beyond what existing approaches can capture.

BibTeX


          @inproceedings{tanaka2026cmdr,
            title={CMDR: Contextual Multimodal Document Retrieval},
            author={Ryota Tanaka and Taku Hasegawa and Kyosuke Nishida},
            booktitle={Proceedings of ECCV},
            year={2026},
          }

CMDR

Contextual Multimodal Document Retrieval

ECCV 2026

CMDR-Bench