RAG Platform - Contributor

ColiVara

Python, TypeScript, RAG, Vision Embeddings, OpenAPI Generator, Benchmarking (NDCG@5), LangChain, Docker

ColiVara is TJM Labs' open-source document retrieval platform. Instead of the usual pipeline of OCR, text chunking, and text embedding, it retrieves directly on visual embeddings of document pages, which avoids OCR errors, broken tables, and missing images. The flagship repo has 1,477 stars and 121 forks.

I joined TJM Labs as the second engineer in November 2024, shortly after ColiVara's first research release, and contributed across five of its repositories.

ColiVara-Eval (github.com/tjmlabs/ColiVara-Eval) is the benchmark harness I built for evaluating retrieval quality on the ViDoRe benchmarks using NDCG@5. It ships an upsert script that loads and embeds 10 public datasets (ArxivQA, DocVQA, InfoVQA, TabFQuad, TATQ, Shift Project, and four SyntheticDocQA corpora), an evaluator that computes relevance scores, and a collection manager. ColiVara scores 87.6 average NDCG@5 against 65.5 for OCR + BM25 and 17.7 for Jina-CLIP, and comes within 1.7 points of the ViDoRe leader while serving production queries. Release 1.5.0 cut average latency roughly in half versus 1.0.0 (ArxivQA 11.1s to 3.2s, DocVQA 9.3s to 2.9s, TabFQuad 8.1s to 3.7s) by introducing hierarchical clustering into the retrieval path.
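For readers unfamiliar with the metric, here is a minimal sketch of how NDCG@5 can be computed for a single query. It is not the ColiVara-Eval implementation: the function names are hypothetical, it assumes linear gain (the relevance label itself rather than 2^rel - 1), and it derives the ideal ranking from the labels of the returned results only.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 5) -> float:
    """Discounted cumulative gain over the top-k results (linear gain)."""
    # The result at 0-based position i is discounted by log2(i + 2),
    # so the first result has a discount of log2(2) = 1.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 5) -> float:
    """NDCG@k for one query.

    `relevances` holds the relevance label of each retrieved page, in the
    order the retriever returned them (hypothetical input format).
    """
    # Assumption: the ideal ordering is built from the same labels,
    # sorted from most to least relevant.
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant page was retrieved at all
    return dcg_at_k(relevances, k) / ideal_dcg


if __name__ == "__main__":
    # One relevant page, returned at rank 2 out of 5.
    print(round(ndcg_at_k([0, 1, 0, 0, 0]), 4))
```

With a single relevant page at rank 2, the score is 1 / log2(3), about 0.63; the same page at rank 1 scores 1.0. A benchmark figure such as an average NDCG@5 is then the mean of this per-query value, typically reported on a 0 to 100 scale.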

I also built the Python (colivara-py) and TypeScript (colivara-ts) SDKs with OpenAPI Generator, and contributed to the main ColiVara repo and the docs site. Generating both SDKs from the server's OpenAPI spec kept them in sync with every release and saved roughly $30k per year in maintenance effort versus hand-written client libraries.