Working Notes on Late Interaction Dynamics: Analyzing Targeted Behaviors of Late Interaction Models
License: CC BY 4.0
arXiv:2603.26259v1 [cs.IR] 27 Mar 2026
Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Late Interaction Workshop (LIR) @ ECIR 2026, April 02, 2026. Colocated with ECIR 2026.
Antoine Edy (antoine.edy@illuin.tech), Max Conti (max.conti@illuin.tech), Quentin Macé (quentin.mace@illuin.tech)
Illuin Technology
Corresponding author: Antoine Edy. Antoine Edy and Max Conti contributed equally.
(2026)
Abstract
While Late Interaction models exhibit strong retrieval performance, many of their underlying dynamics remain understudied, potentially hiding performance bottlenecks. In this work, we focus on two topics in Late Interaction retrieval: a length bias that arises when using multi-vector scoring, and the similarity distribution beyond the best scores pooled by the MaxSim operator. We analyze these behaviors for state-of-the-art models on the NanoBEIR benchmark. Results show that while the theoretical length bias of causal Late Interaction models holds in practice, bi-directional models can also suffer from it in extreme cases. We also note that no significant similarity trend lies beyond the top-1 document token, validating that the MaxSim operator efficiently exploits the token-level similarity scores.
Keywords: Information Retrieval, Late-interaction, Multi-Vector Retrieval, Causal Encoders, Bi-directional Encoders
1 Introduction
Neural late-interaction retrieval models, such as ColBERT [khattab2020colbertefficienteffectivepassage], compute similarity between text passages through token-level interactions. While this approach allows finer semantic matching between queries and documents, some of its underlying dynamics are yet to be thoroughly studied.
In these notes, we analyze two key behaviors that provide elements for a better understanding of Late Interaction performance:
(a) Length bias: Causal encoders, when used with multi-vector MaxSim scoring, exhibit a monotonic bias that favors longer chunks, regardless of their true relevance.
(b) Similarity distribution: Given a query token, the MaxSim operator is insensitive to similarity scores of document tokens beyond the highest, collapsing information to a single maximum value.
We perform small-scale experiments on the NanoBEIR [nanobeir] benchmark to gain further insight into how current state-of-the-art models behave along these axes.
2 Length Bias In Multi-Vector Retrieval
In this section, we explore the length bias that arises in late-interaction retrieval frameworks. We highlight two key observations: multi-vector causal models appear to suffer from a strict, monotonic length bias, and while bi-directional architectures theoretically avoid this flaw, empirical insights suggest they remain sensitive to length disparities at extreme margins. Details on the experimental setup are available in Appendix A .
2.1 Theoretical Motivation
Let a chunk $c$ be a sequence of tokens represented by contextualized embeddings. In late-interaction retrieval, the MaxSim score [khattab2020colbertefficienteffectivepassage] between a query $q$ and a chunk $c$ is defined as:

$$S_{q,c} = \sum_{i \in [|E_q|]} \max_{j \in [|E_c|]} E_{q_i} \cdot E_{c_j}^{T}$$

where $E_q$ and $E_c$ are the respective sets of query and chunk embeddings. When utilizing a causal encoder with a multi-vector representation, appending tokens to a chunk yields a strict superset of embeddings. As a result, the maximum inner product for each query token can only increase or remain constant. This dynamic introduces a theoretical monotonic length bias that artificially favors longer chunks regardless of their true relevance.
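The monotonicity argument above can be made concrete with a minimal numpy sketch (not the authors' code): a MaxSim implementation over embedding matrices, plus a check that appending rows to the chunk's embedding set, as a causal encoder would, can never decrease the score.

```python
import numpy as np

def maxsim(E_q: np.ndarray, E_c: np.ndarray) -> float:
    """MaxSim: for each query embedding, take the best inner product
    over all chunk embeddings, then sum over query tokens."""
    sim = E_q @ E_c.T          # sim[i, j] = E_q[i] . E_c[j]
    return float(sim.max(axis=1).sum())

rng = np.random.default_rng(0)
E_q = rng.normal(size=(4, 8))       # 4 query tokens, dim 8
E_short = rng.normal(size=(10, 8))  # 10-token chunk
# With a causal encoder, appending tokens leaves earlier embeddings
# untouched, so the longer chunk's rows are a superset of the shorter's.
E_long = np.vstack([E_short, rng.normal(size=(5, 8))])

# The per-query-token maxima, hence the summed score, can only grow.
assert maxsim(E_q, E_long) >= maxsim(E_q, E_short)
```

Real encoders produce normalized, correlated embeddings rather than Gaussian noise, but the superset argument, and therefore the inequality, holds regardless of the embedding values.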
Bi-directional and single-vector models theoretically avoid this strict bias. In bi-directional models, appending new tokens alters the attention context for all preceding tokens, allowing scores to naturally decrease if the semantic focus is diluted. Similarly, single-vector models aggregate tokens into a fixed-length representation that does not inherently benefit from added tokens. However, to understand their practical robustness to length differences, we complement this theoretical intuition with an empirical analysis.
2.2 Multi-Vector Architectures Induce A Length Bias
Figure 1: Mean length comparison between the retrieved false-positive chunks, the relevant ground-truth documents, and the global corpus average. Queries are grouped into quantiles on the x-axis based on the average token length of their corresponding relevant chunks.
Figure 1 isolates the impact of the pooling mechanism in causal architectures by comparing a multi-vector model (jina-embeddings-v4 [günther2025jinaembeddingsv4universalembeddingsmultimodal]) with a single-vector dense model (Qwen3-Embedding-4B [qwen3embedding]). Queries on the horizontal axis are partitioned into quantiles based on the token length of their true relevant documents. For each quantile, we compare the mean length of retrieved false positives against the true relevant documents, as well as the global corpus mean (199 tokens).
While false positives are statistically expected to be slightly longer on average due to the inherently wider semantic scope of longer texts, this length gap should ideally remain marginal. However, the multi-vector causal model disproportionately retrieves false positives that are significantly longer than the relevant documents. Conversely, the single-vector causal model tracks the relevant document length much more closely. This confirms that within causal architectures, the multi-vector setup is the primary driver of length bias.
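The quantile bookkeeping behind Figure 1 can be sketched as follows; the function name and the toy per-query lengths are hypothetical, purely to illustrate the computation:

```python
import numpy as np

def length_gap_by_quantile(rel_lens, fp_lens, n_bins=4):
    """Bin queries into quantiles of their relevant chunk's token length,
    then report (mean relevant length, mean false-positive length) per bin."""
    rel_lens = np.asarray(rel_lens, dtype=float)
    fp_lens = np.asarray(fp_lens, dtype=float)
    edges = np.quantile(rel_lens, np.linspace(0.0, 1.0, n_bins + 1))
    # Map each query to its quantile bin (clipped so the max lands in the last bin).
    idx = np.clip(np.searchsorted(edges, rel_lens, side="right") - 1, 0, n_bins - 1)
    return [(float(rel_lens[idx == b].mean()), float(fp_lens[idx == b].mean()))
            for b in range(n_bins)]

# Hypothetical per-query numbers: mean relevant-chunk length and mean
# length of the false positives ranked above them.
rel = [50, 80, 120, 200, 320, 500, 640, 900]
fps = [90, 130, 180, 260, 400, 650, 800, 1100]
gaps = length_gap_by_quantile(rel, fps)
# A persistent fp > rel gap across all bins is the Figure 1 signature of length bias.
```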
2.3 Bi-Directional Models Mitigate But Do Not Eliminate Bias
(a) jina-embeddings-v4
(b) Qwen3-Embedding-4B
(c) GTE-ModernColBERT-v1
(d) ColBERT-Zero
Figure 2: Expected decrease in retrieval performance (nDCG@10) when a chunk of a specific length is added to the corpus. Chunks are categorized into equal-sized quantile bins by token length on the x-axis. The solid line plots the average nDCG penalty incurred by the presence of a chunk from that bin, evaluated against a random baseline (dashed line) and its 90% confidence interval (shaded area).
Having identified the multi-vector setup as a primary driver of length bias in causal architectures, we now investigate whether bi-directional attention mechanisms can effectively neutralize this issue. We evaluate the expected unconditional decrease in retrieval performance (nDCG) when a chunk of a specific length is added to the corpus. To isolate the effect of length, we compute a random baseline and a 90% confidence interval using a permutation test that assumes no correlation between chunk length and retrieval harm. Any deviation outside this interval demonstrates a statistically significant length bias, indicating that adding chunks of that size disproportionately harms ranking quality compared to random chance.
The results reveal distinct architectural behaviors. The causal multi-vector model (Figure 2(a)) exhibits a near-monotonic length bias; adding longer chunks consistently translates to a greater expected decrease in nDCG. In contrast, the single-vector dense model (Figure 2(b)) displays no significant length bias, remaining safely within the random baseline and corroborating the findings of the previous section. Notably, while bi-directional multi-vector models successfully dampen the aggressive bias of causal models, they remain vulnerable at length extremes (Figures 2(c) and 2(d)). For these models, adding unusually short chunks is significantly less harmful than expected by random chance, whereas introducing exceptionally long chunks disproportionately degrades overall ranking quality. Thus, while bi-directional attention refines token representations, it cannot fully overcome the systematic length bias that the MaxSim operation introduces across substantial length variations.
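The permutation baseline described above can be sketched as follows, assuming per-chunk harm values and length-bin assignments have already been computed (both input names are hypothetical):

```python
import numpy as np

def permutation_interval(harms, bins, n_perm=1000, alpha=0.10, seed=0):
    """Observed per-bin mean harm vs a null distribution that shuffles
    harms across chunks, i.e. assumes no correlation between length bin
    and retrieval harm. Returns observed means and the central
    (1 - alpha) null band per bin."""
    harms = np.asarray(harms, dtype=float)
    bins = np.asarray(bins)
    masks = [bins == b for b in np.unique(bins)]
    observed = np.array([harms[m].mean() for m in masks])
    rng = np.random.default_rng(seed)
    null = np.empty((n_perm, len(masks)))
    for t in range(n_perm):
        shuffled = rng.permutation(harms)
        null[t] = [shuffled[m].mean() for m in masks]
    lo, hi = np.quantile(null, [alpha / 2, 1 - alpha / 2], axis=0)
    return observed, lo, hi

# Toy setup: the short-chunk bin barely harms, the long-chunk bin harms a lot.
harms = np.concatenate([np.full(50, 0.1), np.full(50, 0.9)])
bins = np.array([0] * 50 + [1] * 50)
obs, lo, hi = permutation_interval(harms, bins)
# A significant length bias shows up as an observed mean outside [lo, hi].
assert obs[1] > hi[1] and obs[0] < lo[0]
```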
3 Similarity Distribution: What Happens Beyond The Top-1 Document Token
The MaxSim operator described in subsection 2.1 , by construction, only considers the single most similar document token for each query token, ignoring the number and density of relevant tokens in a chunk. This could lead to hypersensitivity to the highest similarity score of document tokens. Intuitively, a single query token that has several strong matches in a document A is more similar to it than to a document B, where only one token has a high similarity, a nuance that is lost during the MaxSim operation.
To understand how document token similarity behaves beyond the top-1, we focus on queries where retrieval fails (i.e., when the positive document falls outside the top-10) and ask whether there are unseen similarity trends that could be exploited through alternative functions. We compare the sorted document token scores on these queries, aggregated over all query tokens and across the failing queries of the NanoBEIR dataset, for both ColBERT-Zero and jina-embeddings-v4. We compare the similarity distribution of the positive sample that was not retrieved against multiple negatives: the top-1 (best), the one ranked directly below the positive, and the worst.
(a) Similarities aggregated across datasets (201 queries).
(b) Similarities on NanoArguAna (15 queries).
Figure 3: ColBERT-Zero document token similarities on failed queries. While some datasets exhibit interesting results (e.g., NanoArguAna, right), no clear tendency emerges for positive documents across NanoBEIR (left).
As Figure 3(b) shows, on NanoArguAna the positive document exhibits higher document-token similarity than the top-1 negative beyond the first tokens (starting at around the 10% rank). In such cases, leveraging the score distribution beyond the best document token could help identify positive chunks. However, this result does not hold on average across all datasets of NanoBEIR, suggesting that such techniques would not generalize well. We also analyzed the behavior on successful retrieval samples, with similar conclusions: retrieved positive samples do not have a significantly different document-token similarity distribution than negatives. jina-embeddings-v4 exhibits the same trends (Appendix C), which further hold across all individual datasets.
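The sorted-similarity curves analyzed in this section can be sketched as follows, using random embeddings in place of real model outputs:

```python
import numpy as np

def sorted_similarity_profile(E_q, E_c):
    """Sort each query token's inner products with all document tokens
    in descending order, then average over query tokens, giving one
    similarity-vs-rank curve per (query, document) pair."""
    sim = np.sort(E_q @ E_c.T, axis=1)[:, ::-1]  # descending per query token
    return sim.mean(axis=0)

rng = np.random.default_rng(0)
profile = sorted_similarity_profile(rng.normal(size=(4, 8)),
                                    rng.normal(size=(20, 8)))
# profile[0] averages exactly the values MaxSim pools; the rest of the
# curve is what this section inspects beyond the top-1 document token.
assert profile.shape == (20,) and np.all(np.diff(profile) <= 0)
```

Comparing such curves for positives against the top-1, next-ranked, and worst negatives, as done here, asks whether any exploitable signal survives beyond the first rank.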
4 Conclusion And Future Work
This work provides insights into Late Interaction model dynamics, highlighting a strict length bias in multi-vector causal architectures that bi-directional models mitigate. Similarly to previous work [teiletche2025modernvbertsmallervisualdocument], this suggests that causal models are a poor fit for Late Interaction, encouraging the development of stronger bi-directional models trained for this paradigm. Furthermore, we observed that on standard retrieval benchmarks, no significant similarity trends emerge beyond the top-1 document token, suggesting that current models do not provide information that could be leveraged beyond a MaxSim operator.
Interesting future analyses include testing for a length bias in a controlled setting, using synthetic data to precisely adjust text lengths and semantic relevance. Additionally, document token score distributions should be analyzed on a broader set of tasks, to see whether task-dependent trends emerge (e.g., in long-context retrieval). Repeating these analyses on newly released models also remains important, as training recipes can strongly impact these behaviors.
Future work could also explore translating these insights to mitigate bias and enhance performance through interventions at training time, during indexing, or by refining the retrieval similarity operator.
References
Appendix A Experimental Setup
A.1 Datasets
To empirically validate length bias, we evaluated retrieval performance using NanoBEIR [nanobeir], a subset of the BEIR benchmark [thakur2021beirheterogenousbenchmarkzeroshot] comprising 13 diverse datasets with 50 queries each. To ensure a wide distribution of chunk lengths, we pooled the 13 datasets into a single unified corpus prior to retrieval. After removing five outlier chunks exceeding 3,000 tokens (which lacked associated queries), the final merged corpus contained 56,718 chunks and 649 queries.
Figure 4: Chunk Length Distribution across the merged NanoBEIR corpus.
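The corpus pooling and outlier filtering described above can be sketched as follows; `merge_corpora` and the character-based length function are illustrative stand-ins (in practice the jina-embeddings-v4 tokenizer would measure token lengths):

```python
def merge_corpora(corpora, max_tokens=3000, token_len=len):
    """Pool per-dataset corpora {name: {chunk_id: text}} into one corpus,
    dropping chunks longer than max_tokens. Ids are prefixed with the
    dataset name to avoid collisions across datasets."""
    merged = {}
    for name, corpus in corpora.items():
        for cid, text in corpus.items():
            if token_len(text) <= max_tokens:
                merged[f"{name}/{cid}"] = text
    return merged

# Toy example with a character-based "tokenizer" and a 10-char cap.
corpora = {"A": {"1": "short", "2": "x" * 50}, "B": {"1": "tiny"}}
merged = merge_corpora(corpora, max_tokens=10)  # drops A's 50-char chunk
```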
A.2 Setup
Chunk sizes were computed using the jina-embeddings-v4 [günther2025jinaembeddingsv4universalembeddingsmultimodal] tokenizer, a Byte-Pair Encoding tokenizer inherited directly from its base model, Qwen2.5-VL-3B-Instruct [bai2025qwen25vltechnicalreport]. We evaluated four distinct model configurations representing combinations of encoder architectures (causal vs. bi-directional) and pooling strategies (single-vector vs. multi-vector):
Model | Pooling Strategy | Architecture | Parameters
jina-embeddings-v4 [günther2025jinaembeddingsv4universalembeddingsmultimodal] | Multi-vector | Causal | 4B
Qwen3-Embedding-4B [qwen3embedding] | Single-vector | Causal | 4B
GTE-ModernColBERT-v1 [GTE-ModernColBERT] | Multi-vector | Bi-directional | 0.15B
ColBERT-Zero [chaffin2026colbertzeropretrainpretraincolbert] | Multi-vector | Bi-directional | 0.15B
Table 1: Summary of the evaluated models, including their pooling strategies, architectures, and sizes.
Appendix B Retrieval Errors by Chunk Length
(a) jina-embeddings-v4
(b) Qwen3-Embedding-4B
(c) GTE-ModernColBERT-v1
(d) ColBERT-Zero
Figure 5: Absolute occurrences of irrelevant chunks ranked above the highest-ranked true positive passage. Bin limits are defined to contain an equal number of chunks. The dashed line plots the no-bias expected baseline, bounded by a 90% variance interval.
Figure 5 illustrates the raw volume of retrieval errors mapped to document chunk lengths.
While a general increase in false positives for longer chunks is expected as longer texts naturally contain more semantic coverage, the models exhibit distinct architectural vulnerabilities. The causal multi-vector model (jina-embeddings-v4) is the only configuration to display a strictly monotonic increase in errors starting from zero, corroborating the length bias of causal multi-vector models described in Section 2.2.
Conversely, bi-directional models demonstrate non-monotonic error distributions. Instead, distinct peaks emerge at the extreme ends of the length spectrum (for both very short and very long chunks), serving as an additional indicator of the marginal vulnerabilities discussed in Section 2.3 . As a side note, both bi-directional models show a high absolute volume of errors despite yielding strong overall nDCG metrics, indicating that while they rank well on average, their failures are notably severe. This contrast in error severity can largely be explained by the substantial gap in parameter count compared to the two causal models (0.15B vs. 4B).
Appendix C Similarity Distribution For jina-embeddings-v4
(a) Similarities aggregated across datasets (229 queries).
(b) Similarities on NanoArguAna (5 queries).
Figure 6: jina-embeddings-v4 document token similarities on failed queries. The positive document shows a larger gap to the top-1 negative and remains consistently lower across all document tokens.