CMDR-Bench focuses on indirectly retrieving relevant pages by leveraging document context, and CMDR-Embed jointly encodes multiple pages to capture cross-page contextual relationships.

[Figure: CMDR overview]

CMDR-Bench

Overview

We introduce Contextual Multimodal Document Retrieval (CMDR), a new multimodal document retrieval task in which a system must retrieve relevant pages from a multi-page document (i.e., an ordered set of document images) while leveraging the document context.

[Figure: CMDR-Bench example]

CMDR-Bench is designed to measure four skills in retrieval models:

  • Text Completion (TC): Key textual content is split across pages, requiring accurate reconstruction of the semantic flow to maintain coherence.
  • Coreference Resolution (CR): References such as pronouns or abbreviations must be resolved using contextual cues found only in other pages.
  • Structured Understanding (SU): Understanding how structured components relate to each other across page boundaries (e.g., a table header defined on one page and its rows continued on the next), independent of entity resolution or multi-entity aggregation.
  • Multi-hop Reasoning (MR): The model must follow an inference path that connects the entities in the query to the relevant page via the document context. Unlike CR, which resolves referential links between an entity and its mentions, MR involves multi-entity semantic reasoning.

Statistics

    CMDR-Embed

    Model architecture

    Unlike non-contextual retrievers (e.g., ColPali), which encode each page independently and thus fail to capture cross-page context, CMDR-Embed encodes the content of each page and also explicitly incorporates contextual information from the rest of the document.

    [Figure: CMDR-Embed overview]

    To enable cross-page interactions within a document, we adopt a chunk-then-split strategy. Consecutive pages are first encoded together, and the resulting representations are then split into page-level embeddings (i.e., sequences of token-level embeddings).
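The chunk-then-split idea can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the CMDR-Embed implementation: the non-overlapping chunking, the `chunk_size` parameter, and the `toy_context_encoder` function (which simply adds the chunk mean to every token) are all hypothetical stand-ins for whatever the actual model does. The sketch only shows the data flow: pages in a chunk are concatenated, encoded in one pass so tokens can see their neighbors, and then cut back into per-page sequences of token-level embeddings.

```python
from typing import Callable, List

Vector = List[float]   # one token-level embedding
Page = List[Vector]    # one page = a sequence of token vectors


def toy_context_encoder(tokens: List[Vector]) -> List[Vector]:
    """Hypothetical stand-in for a joint encoder.

    Adds the mean of the whole input sequence to every token, so each
    output token depends on all pages that were encoded together.
    """
    dim = len(tokens[0])
    mean = [sum(tok[d] for tok in tokens) / len(tokens) for d in range(dim)]
    return [[tok[d] + mean[d] for d in range(dim)] for tok in tokens]


def chunk_then_split(
    pages: List[Page],
    chunk_size: int,
    encode: Callable[[List[Vector]], List[Vector]],
) -> List[Page]:
    """Encode consecutive pages jointly, then split back into pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    page_embeddings: List[Page] = []
    for start in range(0, len(pages), chunk_size):
        chunk = pages[start:start + chunk_size]
        # Chunk: concatenate the tokens of consecutive pages.
        joint = [tok for page in chunk for tok in page]
        # Joint encoding: one output vector per input token (assumed).
        encoded = encode(joint)
        # Split: recover page-level sequences using the page lengths.
        offset = 0
        for page in chunk:
            page_embeddings.append(encoded[offset:offset + len(page)])
            offset += len(page)
    return page_embeddings


# Three toy pages with one-dimensional token vectors.
pages = [[[1.0], [2.0]], [[3.0]], [[4.0], [5.0]]]
embeddings = chunk_then_split(pages, chunk_size=2, encode=toy_context_encoder)
```

With `chunk_size=2`, the first two pages share a chunk, so their outputs are shifted by the mean of both pages' tokens, while the third page is encoded alone. Each page still keeps exactly as many token embeddings as it had input tokens.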

    Contextual Multimodal Contrastive Learning (CMCL)

    [Figure: CMCL overview]

    Because CMDR-Embed encodes multiple pages jointly, representations from different pages in the same document can be mixed, which weakens page-level discriminability. To address this issue, we propose a new contrastive learning framework, Contextual Multimodal Contrastive Learning (CMCL). Specifically, we reinforce the standard InfoNCE objective by introducing two types of context-aware hard negatives: In-Chunk Negatives, which are different pages within the same chunk, and In-Document Negatives, which are other chunks from the same document. Intuitively, In-Chunk Negatives are more effective when similar context appears on nearby pages, whereas In-Document Negatives are particularly helpful when information similar to the relevant page appears on distant pages outside the current chunk. This design prevents neighboring pages from becoming overly similar and ensures that each page maintains a distinct representation, even when contextual information is shared.
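As a rough illustration of how context-aware hard negatives could enter an InfoNCE-style objective, the sketch below adds In-Chunk and In-Document negative scores to the denominator alongside ordinary in-batch negatives. The page does not spell out the exact loss, so this formulation, the function name `info_nce_with_context_negatives`, and the `temperature` parameter are assumptions. The scores are assumed to be query-page similarities computed elsewhere.

```python
import math
from typing import Sequence


def info_nce_with_context_negatives(
    pos_score: float,
    in_batch_neg: Sequence[float],
    in_chunk_neg: Sequence[float],
    in_doc_neg: Sequence[float],
    temperature: float = 1.0,
) -> float:
    """InfoNCE loss for one query, with extra context-aware negatives.

    loss = -log( exp(s_pos / t) / sum_over_all_candidates exp(s / t) )

    in_chunk_neg: scores of other pages in the same chunk (assumed input).
    in_doc_neg:   scores of pages from other chunks of the same document
                  (assumed input).
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scores = [pos_score, *in_batch_neg, *in_chunk_neg, *in_doc_neg]
    scaled = [s / temperature for s in scores]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(scaled)
    log_denominator = m + math.log(sum(math.exp(s - m) for s in scaled))
    return log_denominator - scaled[0]


# Same positive and in-batch negatives, with and without context negatives.
loss_plain = info_nce_with_context_negatives(2.0, [0.5, 0.1], [], [])
loss_context = info_nce_with_context_negatives(2.0, [0.5, 0.1], [1.8], [1.5])
```

In this toy setup, neighboring pages that score almost as high as the positive (1.8 and 1.5 against 2.0) raise the loss. That gives the model a stronger signal to separate pages that share context, which is the intuition described above.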

    Experiment Results

    Main Results

    CMDR-Embed, which leverages contextual information, significantly outperforms models that do not consider the document context. This gain cannot be attributed to the training data alone: even when trained on the same data, the non-contextual models ColPali and ColQwen still underperform our contextual models CMDR-EmbedPali and CMDR-EmbedQwen. CMDR-EmbedQwen is the strongest retriever in this setting, with an average improvement of 16.2 points over the best non-contextual baseline. These results show that explicitly modeling document context yields benefits beyond what existing approaches can capture.

    [Figure: main results]

    BibTeX

    
              @inproceedings{tanaka2026cmdr,
                title={CMDR: Contextual Multimodal Document Retrieval},
                author={Ryota Tanaka and Taku Hasegawa and Kyosuke Nishida},
                booktitle={Proceedings of ECCV},
                year={2026},
              }