Chapter 6: Quantization and Compression (The Complete Guide to ColPali)

Are all 1024 vectors per page really useful?

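The paper Token Pooling for ColPali (2025) finds that document patch vectors are highly redundant. Adjacent patches often represent the same region (the left and right halves of one chart, or two consecutive lines of a paragraph), so they can be merged into a single vector.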

Option 1: Token Pooling (4x compression)

Run hierarchical clustering over the 1024 vectors of each page with K=256 clusters, then replace the patches in each cluster with their centroid (the cluster mean):

import numpy as np
from sklearn.cluster import AgglomerativeClustering

def token_pool(page_vecs, target=256):
    # page_vecs: (1024, 128)
    cluster = AgglomerativeClustering(n_clusters=target, linkage="average")
    labels = cluster.fit_predict(page_vecs)
    pooled = np.stack([
        page_vecs[labels == k].mean(axis=0) for k in range(target)
    ])
    # L2 normalize
    pooled /= np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled

In measurements, nDCG@5 drops by less than 1 point while storage shrinks 4x right away. The official colpali-engine (1.3+) ships a built-in HierarchicalTokenPooler:

from colpali_engine.compression import HierarchicalTokenPooler

pooler = HierarchicalTokenPooler()
pooled_vecs = pooler.pool_embeddings(
    page_vecs, pool_factor=4,
)   # (256, 128)

Option 2: Binary Quantization (32x compression)

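128-dim vector → take the sign of each dimension → 128 bits = 16 bytes. That is 32x smaller than float32 (512 bytes) and 16x smaller than bfloat16 (256 bytes). Scoring is approximated with Hamming distance: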

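import numpy as np

def binarize(vecs):
    # vecs: (N, 128) float
    return np.packbits((vecs > 0).astype(np.uint8), axis=1)  # (N, 16) uint8

def hamming_score(q_bin, d_bin):
    # XOR, then popcount: fewer differing bits means more similar.
    # Sum in int64 so the negation cannot wrap around on an unsigned dtype.
    return -np.unpackbits(q_bin ^ d_bin, axis=-1).sum(-1, dtype=np.int64)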
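A page holds many vectors, so the per-vector Hamming score still has to be combined across query tokens. The following is a minimal sketch, assuming the late-interaction MaxSim rule carries over unchanged to bits (for each query vector keep the best-matching page vector, then sum). The name `hamming_maxsim` and the byte lookup-table popcount are illustrative choices, not part of any library.

```python
import numpy as np

# Popcount lookup table for every possible byte value (0..255).
_POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.int64)

def binarize(vecs):
    # Sign bit per dimension, packed 8 dimensions per byte: (N, 128) -> (N, 16).
    return np.packbits((vecs > 0).astype(np.uint8), axis=1)

def hamming_maxsim(q_bin, d_bin):
    """Binary MaxSim between one query (Nq, B) and one page (Nd, B), both uint8."""
    # XOR every query vector against every page vector: (Nq, Nd, B).
    xor = q_bin[:, None, :] ^ d_bin[None, :, :]
    # Hamming distance = number of differing bits: (Nq, Nd).
    dist = _POPCOUNT[xor].sum(axis=-1)
    n_bits = q_bin.shape[1] * 8
    # Similarity = matching bits; best page vector per query vector, then sum.
    return int((n_bits - dist).max(axis=1).sum())

# Synthetic data: page_a shares 10 vectors with the query, page_b is unrelated.
rng = np.random.default_rng(0)
query = rng.standard_normal((20, 128))
page_a = np.concatenate([query[:10], rng.standard_normal((246, 128))])
page_b = rng.standard_normal((256, 128))

q_bin = binarize(query)
score_a = hamming_maxsim(q_bin, binarize(page_a))
score_b = hamming_maxsim(q_bin, binarize(page_b))
```

Counting matching bits instead of negating the distance keeps the score non-negative, which makes the summed MaxSim easier to compare across queries of the same length.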

A striking number
bfloat16 at 256 KB/page → binary plus pooling at 4 KB/page = 64x compression. A 5 GB disk can hold about 1.31 million pages.
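The per-page figures can be reproduced with a few lines of arithmetic. This sketch counts only the raw vector payload (no index overhead) and assumes 1 KB = 1024 bytes and 5 GB = 5 × 1024³ bytes; the helper name `bytes_per_page` is illustrative.

```python
def bytes_per_page(n_vectors, dim, bits_per_dim):
    """Storage for one page's multi-vector embedding, in bytes."""
    return n_vectors * dim * bits_per_dim // 8

full_bf16 = bytes_per_page(1024, 128, 16)      # bfloat16, no pooling
pooled_bf16 = bytes_per_page(256, 128, 16)     # 4x token pooling
binary = bytes_per_page(1024, 128, 1)          # 1 bit per dimension
pooled_binary = bytes_per_page(256, 128, 1)    # pooling + binary

KB = 1024
print(full_bf16 // KB, pooled_bf16 // KB, binary // KB, pooled_binary // KB)
# prints: 256 64 16 4

ratio = full_bf16 // pooled_binary             # 64
pages_on_5gb = 5 * 1024**3 // pooled_binary    # 1310720
```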

Option 3: Matryoshka dimensionality reduction (2x compression)

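Matryoshka Representation Learning (MRL) trains the model so that the first K dimensions of each vector also work on their own as a shorter embedding. For a ColPali model trained this way, keeping the first 64 or even the first 32 of the 128 dimensions costs very little accuracy: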

# Add the MRL loss during training.
# q, d, and colbert_loss come from the surrounding training loop.
loss = 0.0
for k in [128, 64, 32]:
    q_k, d_k = q[..., :k], d[..., :k]
    loss += colbert_loss(q_k, d_k)

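At serving time, use the first 32 dimensions for coarse filtering and all 128 dimensions for reranking, which maps naturally onto a two-stage pipeline.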
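On the serving side, using a Matryoshka prefix comes down to slicing. The sketch below assumes the model was trained with an MRL-style loss (otherwise the prefix carries no guarantee). Re-normalizing after truncation is also an assumption, made here so that dot products stay cosine similarities. The helper name `truncate_mrl` is illustrative.

```python
import numpy as np

def truncate_mrl(vecs, k):
    """Keep the first k dimensions and re-normalize each vector to unit length."""
    head = np.asarray(vecs, dtype=np.float32)[..., :k]
    norms = np.linalg.norm(head, axis=-1, keepdims=True)
    return head / np.maximum(norms, 1e-12)  # guard against all-zero prefixes

# Synthetic stand-in for one page of unit-length patch embeddings.
rng = np.random.default_rng(0)
page_vecs = rng.standard_normal((1024, 128)).astype(np.float32)
page_vecs /= np.linalg.norm(page_vecs, axis=1, keepdims=True)

coarse = truncate_mrl(page_vecs, 32)   # (1024, 32), for coarse filtering
full = truncate_mrl(page_vecs, 128)    # (1024, 128), for reranking
```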

Two-stage retrieval: coarse filter + rerank

Query arrives
    │
    ▼
[Coarse filter] 1M pages × binary 16 KB, Hamming MaxSim to find the top-1000
    │  latency: ~20 ms
    ▼
[Rerank] 1000 pages × bfloat16 256 KB, cosine MaxSim to reorder
    │  latency: ~10 ms
    ▼
Return the final top-5

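The elegance of two-stage retrieval: the cheap coarse filter does the large-scale work over every page, and only 0.1% of pages (1000 out of 1 million) ever need their expensive full-precision vectors read.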

def two_stage_search(query, binary_index, full_index, k=5):
    # Coarse filter
    q_binary = binarize(query)
    candidates = binary_index.search(q_binary, top_k=1000)
    # Rerank (pull the full-precision vectors of the 1000 pages from disk/memory)
    full_vecs = full_index.fetch([c.id for c in candidates])
    scores = maxsim(query, full_vecs)
    return topk(scores, k)
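To make the flow concrete without an index service, here is a brute-force NumPy sketch on synthetic data. It is an illustration under stated assumptions, not a production implementation: the two indexes are replaced by plain in-memory arrays, the candidate count is scaled down from 1000 to 50, and the helper names (`hamming_maxsim`, `maxsim`) are hypothetical.

```python
import numpy as np

_POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.int64)

def binarize(vecs):
    # Sign bit per dimension, packed into bytes: (..., 128) -> (..., 16).
    return np.packbits((vecs > 0).astype(np.uint8), axis=-1)

def hamming_maxsim(q_bin, pages_bin):
    # q_bin: (Nq, B); pages_bin: (P, Nd, B). Returns one score per page, shape (P,).
    xor = q_bin[None, :, None, :] ^ pages_bin[:, None, :, :]   # (P, Nq, Nd, B)
    matches = q_bin.shape[-1] * 8 - _POPCOUNT[xor].sum(-1)     # (P, Nq, Nd)
    return matches.max(-1).sum(-1)

def maxsim(query, pages):
    # query: (Nq, D); pages: (P, Nd, D). Dot-product MaxSim per page, shape (P,).
    sims = np.einsum("qd,pnd->pqn", query, pages)
    return sims.max(-1).sum(-1)

def two_stage_search(query, pages_bin, pages_full, k=5, n_candidates=50):
    # Stage 1: rank every page by binary MaxSim, keep the best n_candidates.
    coarse = hamming_maxsim(binarize(query), pages_bin)
    cand = np.argsort(-coarse)[:n_candidates]
    # Stage 2: rescore only the candidates with full-precision vectors.
    fine = maxsim(query, pages_full[cand])
    order = np.argsort(-fine)[:k]
    return cand[order], fine[order]

# Synthetic corpus: 200 pages, 32 unit-length vectors each.
rng = np.random.default_rng(0)
pages_full = rng.standard_normal((200, 32, 128)).astype(np.float32)
pages_full /= np.linalg.norm(pages_full, axis=-1, keepdims=True)
pages_bin = binarize(pages_full)

# Query built from 8 vectors of page 17 plus a little noise.
query = pages_full[17, :8] + 0.02 * rng.standard_normal((8, 128)).astype(np.float32)
query /= np.linalg.norm(query, axis=-1, keepdims=True)

ids, scores = two_stage_search(query, pages_bin, pages_full, k=5)
```

Page 17 comes back first because it survives the binary stage and then wins the full-precision rescoring. In a real deployment the first stage would be served by a vector database rather than a full scan.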

Combined results

Configuration	Size per page	ViDoRe nDCG@5	Query latency (P99)
Native bfloat16	256 KB	0.823	180 ms
+ Token Pool 4x	64 KB	0.817	50 ms
+ Binary	4 KB	0.790	20 ms
+ Two-stage rerank	4 KB + 256 KB on-demand	0.820	30 ms

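The last row is the "production-optimal recipe": the coarse filter saves space, the rerank preserves accuracy, and P99 latency is 6x lower than the native setup (30 ms versus 180 ms).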

Hardware-side optimizations

SIMD Hamming

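The x86 POPCNT instruction computes the popcount of a 64-bit word in a single instruction. Qdrant and Vespa already use it, so all you need to do is enable binary quantization.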

GPU batched MaxSim

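In the rerank stage, load the vectors of all 1000 candidate pages onto the GPU in one go and compute MaxSim in a batch with bmm. This completes in under a millisecond.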
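The shape logic of the batched computation can be sketched on CPU with NumPy. On a GPU the same tensor shapes would go through a batched matrix multiply such as `torch.bmm`. The padding mask is an assumption added here for candidate pages that hold different numbers of vectors, and `batched_maxsim` is a hypothetical helper name.

```python
import numpy as np

def batched_maxsim(query, pages, mask):
    """
    query: (Nq, D) query token vectors
    pages: (P, Nd, D) candidate page vectors, zero-padded to a common Nd
    mask:  (P, Nd) True where a row of `pages` is a real vector, False for padding
    returns: (P,) MaxSim score per candidate page
    """
    # One batched matrix product: (P, Nd, D) @ (D, Nq) -> (P, Nd, Nq).
    sims = np.matmul(pages, query.T)
    # Padding must never win the max, so push it to -inf.
    sims = np.where(mask[:, :, None], sims, -np.inf)
    # Best page vector for each query vector, then sum over query vectors.
    return sims.max(axis=1).sum(axis=1)

# Three toy candidate pages with 5, 3, and 6 real vectors.
rng = np.random.default_rng(0)
query = rng.standard_normal((4, 8))
lengths = [5, 3, 6]
pages = np.zeros((3, 6, 8))
mask = np.zeros((3, 6), dtype=bool)
for i, n in enumerate(lengths):
    pages[i, :n] = rng.standard_normal((n, 8))
    mask[i, :n] = True

scores = batched_maxsim(query, pages, mask)
```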

Memory tiering

Binary vectors stay resident in memory (hot), while bfloat16 vectors live on NVMe SSD (cold), which works well for mixed read/write workloads.
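One simple way to prototype this hot/cold split is a memory-mapped file for the full vectors and an ordinary in-RAM array for the bits. This is a sketch under assumptions: NumPy has no bfloat16 dtype, so float16 stands in for it, and the file name, sizes, and `fetch_full` helper are illustrative.

```python
import os
import tempfile
import numpy as np

N_PAGES, N_VECS, DIM = 100, 256, 128

rng = np.random.default_rng(0)
full = rng.standard_normal((N_PAGES, N_VECS, DIM)).astype(np.float16)

# Cold tier: full-precision vectors written once to a file on disk.
path = os.path.join(tempfile.mkdtemp(), "full_vectors.f16")
full.tofile(path)
cold = np.memmap(path, dtype=np.float16, mode="r", shape=(N_PAGES, N_VECS, DIM))

# Hot tier: packed sign bits kept in RAM as an ordinary array.
hot = np.packbits(full > 0, axis=-1)  # (N_PAGES, N_VECS, 16) uint8

def fetch_full(page_ids):
    # Only the requested pages are read from disk; sorted ids keep reads in file order.
    page_ids = np.sort(np.asarray(page_ids))
    return page_ids, np.asarray(cold[page_ids])

ids, vecs = fetch_full([42, 7, 99])
```

The hot array here is 16x smaller than the cold file, matching the 1-bit versus 16-bit ratio per dimension.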

Checklist

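✅ Measured that nDCG drops by no more than 1 point after token pooling
✅ Binary coarse-filter top-1000 hit rate > 95%
✅ Rerank accounts for < 30% of two-stage retrieval latency
✅ Matryoshka training (if you train your own model): the first 32 dimensions also work on their own
✅ SIMD popcount and GPU reranking are both enabled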
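Two of these checks reduce to simple ratios. The sketch below assumes each query has exactly one relevant page, so "hit rate" means the fraction of queries whose relevant page survives the coarse filter. The function names and the toy numbers are illustrative only.

```python
def coarse_hit_rate(candidates_per_query, relevant_per_query):
    """Fraction of queries whose relevant page appears in the coarse candidate list."""
    hits = sum(
        relevant in set(candidates)
        for candidates, relevant in zip(candidates_per_query, relevant_per_query)
    )
    return hits / len(relevant_per_query)

def rerank_latency_share(coarse_ms, rerank_ms):
    """Share of total query latency spent in the rerank stage."""
    return rerank_ms / (coarse_ms + rerank_ms)

# Toy numbers for illustration only.
candidates = [[3, 8, 1], [5, 2, 9], [7, 4, 6], [0, 1, 2]]
relevant = [8, 9, 0, 2]
rate = coarse_hit_rate(candidates, relevant)    # 0.75
share = rerank_latency_share(24.0, 6.0)         # 0.2
```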

  

Chapter summary

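Token Pooling: 1024→256 patches, 4x compression, nearly lossless accuracy
Binary Quantization: 128 dimensions → 16 bytes, 32x compression
Matryoshka: the first 32/64 dimensions work on their own, a natural foundation for two-stage retrieval
Production recipe: binary coarse filter + bfloat16 rerank = 64x storage compression for the coarse index + 6x speed
End-to-end ViDoRe nDCG@5 stays at 0.82 or above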

  
