Baidu released Unlimited-OCR on June 22, an open-source model that parses entire multi-page documents in a single inference pass. The paper, posted to arXiv on June 23, describes a system that takes a full PDF as input and outputs structured text without chunking the document into individual pages or running separate OCR passes.

The headline capability is what the team calls “one-shot long-horizon parsing.” A model with a 32,768-token context window ingests every page of a document as a single image sequence and produces the extracted text in one go. The GitHub repository includes inference code for both Hugging Face Transformers and SGLang, with a batch inference script that handles image directories and PDF files directly.

This is not a marginal improvement on existing OCR. It is a structural change in how document extraction works.

The dominant approach today, used by cloud APIs such as Google Document AI, Amazon Textract, and Azure AI Document Intelligence, is page-by-page. Each page is an API call. Each call costs money. For a 100-page contract, a 50-page invoice batch, or a 300-page regulatory filing, the cost scales linearly with page count. Latency scales the same way: the API processes pages sequentially or in limited parallel batches, and the user stitches the results together on the backend.

Unlimited-OCR collapses that pipeline. One image sequence, one inference call, one output. The cost is the compute for that single call, not the sum of 100 separate calls.

The model builds on DeepSeek-OCR and DeepSeek-OCR-2, which the authors acknowledge in their repository. The architecture uses a vision encoder feeding into a language model backbone, with a custom logit processor that enforces a no-repeat n-gram constraint during generation. The inference code shows two image processing modes: “gundam” at 640px resolution with cropping for single images, and “base” at 1024px without cropping for multi-page inputs. The n-gram window expands from 128 tokens for single images to 1,024 tokens for multi-page documents, suggesting the model needs more context to avoid repeating text across page boundaries.
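The no-repeat n-gram constraint can be illustrated with a minimal sketch. This is a generic version of the technique, not the repository's logit processor: the function names `banned_next_tokens` and `apply_ban` are hypothetical, and logits are plain Python lists rather than tensors. The idea is to look back over a window of recent tokens and forbid any next token that would complete an n-gram already present in that window.

```python
from typing import List, Set


def banned_next_tokens(history: List[int], ngram_size: int, window: int) -> Set[int]:
    """Return token ids that would complete an n-gram already seen in the recent window."""
    if ngram_size < 1 or len(history) < ngram_size - 1:
        return set()
    # Only the most recent `window` tokens are searched for earlier n-grams.
    recent = history[-window:] if window > 0 else history
    # The last (n - 1) generated tokens are the prefix the next token would extend.
    prefix = tuple(history[len(history) - (ngram_size - 1):]) if ngram_size > 1 else ()
    banned: Set[int] = set()
    for i in range(len(recent) - ngram_size + 1):
        if tuple(recent[i:i + ngram_size - 1]) == prefix:
            banned.add(recent[i + ngram_size - 1])
    return banned


def apply_ban(logits: List[float], banned: Set[int]) -> List[float]:
    """Set banned token logits to -inf so they can never be sampled."""
    return [float("-inf") if i in banned else x for i, x in enumerate(logits)]
```

With the history `[1, 2, 3, 1, 2]` and an n-gram size of 3, a window of 128 bans token `3`, because `1, 2, 3` has already appeared. Shrinking the window to 3 bans nothing, since the earlier occurrence has fallen out of view. That is one plausible reason a multi-page input would call for a wider window than a single image.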

The practical implications for AI builders are immediate.

Any application that ingests documents at scale (contract analysis, invoice processing, compliance review, academic paper ingestion, medical record digitization) currently budgets for per-page API costs. A startup processing 10,000 PDFs a month at 20 pages each pays for 200,000 API calls. At $0.015 per page for a typical cloud OCR service, that is $3,000 a month in extraction costs alone. Unlimited-OCR, running on a single NVIDIA GPU with bfloat16 precision, replaces that with a fixed hardware cost.
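The arithmetic above is simple enough to script. The sketch below reproduces the $3,000 figure and adds a break-even helper. The function names are hypothetical, and the monthly GPU cost is a parameter you supply yourself, since the article gives no figure for it.

```python
def monthly_api_cost(documents: int, pages_per_document: int, price_per_page: float) -> float:
    """Per-page pricing: cost grows linearly with total page count."""
    return documents * pages_per_document * price_per_page


def break_even_pages(gpu_cost_per_month: float, price_per_page: float) -> float:
    """Monthly page volume at which a fixed GPU cost equals the per-page API bill."""
    if price_per_page <= 0:
        raise ValueError("price_per_page must be positive")
    return gpu_cost_per_month / price_per_page


# Figures from the example above: 10,000 PDFs, 20 pages each, $0.015 per page.
cost = monthly_api_cost(10_000, 20, 0.015)
print(f"${cost:,.2f} per month")
```

Above the break-even volume the fixed-cost GPU is cheaper on paper. Below it, the per-page API still wins before engineering and operations time are counted.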

The repository specifies a minimum of CUDA 12.9 and PyTorch 2.10.0, with inference tested on Python 3.12.3. The SGLang deployment path uses a custom logit processor and a context length of 32,768 tokens. The batch inference script supports concurrency of eight requests, which suggests the model can handle multiple documents in parallel on a single GPU.
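A client-side pattern for capping concurrency at eight in-flight requests might look like the following. This is an assumption about how such a script could be structured, not the repository's batch code: `parse_batch` is a hypothetical helper, and `parse_one` stands in for whatever function actually sends a document to the inference server.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def parse_batch(
    paths: List[str],
    parse_one: Callable[[str], str],
    max_concurrency: int = 8,
) -> Dict[str, str]:
    """Run parse_one over all paths with at most max_concurrency calls in flight."""
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # pool.map yields results in input order, so zip pairs them correctly.
        results = list(pool.map(parse_one, paths))
    return dict(zip(paths, results))
```

Threads suit this case because each worker spends its time waiting on a network response, while the GPU-side batching is left to the serving framework.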

There are caveats. The model is released under a research license, not a commercial one. The paper has not been peer reviewed. The benchmark results on standard OCR datasets are not yet published in the repository, so independent verification of accuracy is pending. The 32,768-token context window limits document length: at roughly four characters of English text per token, and with image tokens for every page drawing on the same budget, a 300-page document may exceed the window depending on image tokenization density.
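A rough budget check makes the context-window caveat concrete. The sketch assumes that each page consumes a fixed number of vision tokens plus a fixed number of generated text tokens, all drawn from the same 32,768-token window. The per-page figures in the example are placeholders rather than measurements of Unlimited-OCR, and the function names are hypothetical.

```python
def max_pages(
    vision_tokens_per_page: int,
    output_tokens_per_page: int,
    context_length: int = 32_768,
) -> int:
    """Largest page count whose image and text tokens fit in the context window."""
    per_page = vision_tokens_per_page + output_tokens_per_page
    if per_page <= 0:
        raise ValueError("per-page token count must be positive")
    return context_length // per_page


def fits_in_context(
    pages: int,
    vision_tokens_per_page: int,
    output_tokens_per_page: int,
    context_length: int = 32_768,
) -> bool:
    """True if the whole document fits under the assumed per-page token costs."""
    return pages <= max_pages(vision_tokens_per_page, output_tokens_per_page, context_length)


# Placeholder densities: 256 vision tokens and 400 text tokens per page.
print(max_pages(256, 400))             # 49
print(fits_in_context(300, 256, 400))  # False
```

Under those placeholder densities the window holds about 49 pages, well short of a 300-page filing. Measuring the real per-page token counts on your own documents is the only way to know where the limit falls.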

But the direction is clear. The cost of document extraction is trending toward zero at the margin. Cloud OCR APIs will need to justify their per-page pricing against a model that runs on a local GPU and handles an entire document in one shot. The API business model for OCR, built on the assumption that extraction is an expensive per-unit operation, faces the same pressure that per-token pricing for language models faced when open-source models matched GPT-3.5-class performance.

The authors, Youyang Yin, Huanhuan Liu, and 14 others affiliated with Baidu, have published the model on Hugging Face and ModelScope. The arXiv paper is available at 2606.23050. The repository includes a citation for the paper and acknowledges PaddleOCR alongside the DeepSeek projects.

The open question is how well the model handles the long tail of document layouts: tables with merged cells, handwritten annotations, rotated pages, low-quality scans, and non-Latin scripts. The repository’s example prompt is “document parsing.” The inference code includes a no_repeat_ngram_size parameter set to 35, which suggests the model has a tendency to repeat text on long documents and needs explicit suppression.

For AI builders, the calculation is now a concrete one. Run a benchmark on your document corpus. Compare the output quality against your current OCR provider. Multiply your monthly page volume by your per-page cost. Then compare that to the cost of a GPU for a month. The answer may not favor the cloud API.
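For the quality half of that comparison, one common yardstick is character error rate (CER): the edit distance between an OCR output and a hand-checked reference transcription, divided by the reference length. The sketch below assumes you have such reference text for a sample of your corpus. The function names are hypothetical, and the metric is a standard choice, not one the Unlimited-OCR authors are known to use.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalised by reference length; lower is better."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)
```

Run both engines over the same sample pages and score each against the reference. Comparing the average CER per engine alongside the monthly cost gives a two-number summary of the trade-off.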