Most AI tools do not actually watch a video. Paste a YouTube link into ChatGPT and it reads the transcript, not the picture. Claude will not accept a video file at all. Even Gemini, which can read video natively, samples frames at a fixed interval — one per second by default — so fast cuts slip past and static slides generate hundreds of near-identical frames.
A new open-source tool called claude-real-video does it differently. Point it at a URL or a local file, and it pulls the frames that actually matter: every scene change, not a fixed quota. It throws away near-duplicates, transcribes the audio with Whisper, and hands you a clean folder any LLM can read — on your own machine, with nothing uploaded to a cloud.
The core insight is simple. Most “let an LLM watch a video” pipelines grab frames at a fixed interval, say one per second. That over-samples a static screencast (600 near-identical frames from a 10-minute slide deck) and under-samples a fast-cut reel, where shots shorter than the sampling interval can fall between samples and never be seen. The tool instead uses ffmpeg’s scene-change detection with a density floor: a frame at every detected scene cut, plus at least one frame every N seconds. A sliding-window dedup then compares each candidate frame against the last four kept frames using real pixel difference rather than a perceptual hash, which goes blind on flat colours and equal-luma hue changes. Because the window spans several kept frames, an A-B-A cutaway does not re-send a shot the model has already seen.
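To make the "scene cuts plus a density floor" idea concrete, here is a minimal sketch in plain Python. It is not the tool's actual code: the function name, the floor_interval parameter, and the choice to always keep the opening frame are assumptions made for illustration, and the scene-cut timestamps are assumed to come from a separate detection step such as ffmpeg.

```python
def select_timestamps(scene_cuts, duration, floor_interval=1.0):
    """Keep every scene cut, and pad any gap longer than floor_interval.

    Illustrative only: scene_cuts is a list of cut times in seconds that
    some detector has already produced; floor_interval is a hypothetical
    name for the "at least one frame every N seconds" setting.
    """
    cuts = sorted({t for t in scene_cuts if 0.0 < t < duration})
    picked = [0.0]  # assumption: the opening frame is always kept
    for target in cuts + [duration]:
        # Fill long quiet stretches so no gap exceeds the floor.
        while target - picked[-1] > floor_interval:
            picked.append(round(picked[-1] + floor_interval, 3))
        # Keep the cut itself (the end of the video is only a boundary).
        if target < duration and target > picked[-1]:
            picked.append(target)
    return picked


if __name__ == "__main__":
    # A static 3-second clip falls back to the floor alone.
    print(select_timestamps([], duration=3.0))
    # Two quick cuts are both kept even though they are 0.5 s apart.
    print(select_timestamps([3.0, 3.5], duration=6.0))
```

In this sketch a static clip yields only the floor samples, while a burst of cuts yields one frame per cut regardless of how close together they are.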
The result is fewer, more meaningful frames. Cheaper context. Better understanding.
What the tool actually does
Run crv "https://www.youtube.com/watch?v=..." and it produces a folder with frames/*.jpg, a transcript.txt, and a MANIFEST.txt that summarises everything for the model. Drop the folder into Claude, ChatGPT, or Gemini and ask away. The tool supports YouTube, Instagram, TikTok, local files, and login-gated sources via a Netscape cookie file.
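For readers who would rather feed the output to a model programmatically than drag a folder into a chat window, the layout described above can be read with a few lines of Python. This is a sketch under assumptions: the helper name load_crv_folder and the returned dictionary shape are hypothetical, and only the file names (frames/*.jpg, transcript.txt, MANIFEST.txt) come from the description above.

```python
from pathlib import Path


def load_crv_folder(folder):
    """Collect the manifest, transcript, and frame paths from an output folder.

    Hypothetical helper: only the file names are taken from the article;
    the return shape is an illustration, not part of the tool.
    """
    root = Path(folder)
    manifest = (root / "MANIFEST.txt").read_text(encoding="utf-8")
    transcript_path = root / "transcript.txt"
    # Assumption: treat a missing transcript as empty rather than an error.
    transcript = (
        transcript_path.read_text(encoding="utf-8")
        if transcript_path.exists()
        else ""
    )
    # Sort by file name so frames are handed to the model in a stable order.
    frames = sorted((root / "frames").glob("*.jpg"))
    return {"manifest": manifest, "transcript": transcript, "frames": frames}
```

The manifest and transcript would typically go into the text prompt, with the frame files attached as images in whatever form the chosen model's client expects.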
Key parameters expose the tradeoffs. --scene controls scene-change sensitivity (default 0.30, lower means more frames). --fps-floor guarantees at least one frame every N seconds (default 1.0). --max-frames hard-caps the total at 150 by default. --dedup-threshold sets the percentage of pixels that must change for a frame to count as new (default 8%). --dedup-window sets how many previous frames to compare against (default 4). A --report flag writes a visual HTML file showing every keep/drop decision with its diff percentage, for tuning.
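The interaction between the dedup threshold and the dedup window is easier to see in code. The following is a minimal sketch, not the tool's implementation: frames are modelled as flat lists of (r, g, b) tuples, and pixel_tolerance (how far a single pixel must move before it counts as changed) is an assumed parameter that the description above does not mention. The defaults of 8 percent and a window of 4 mirror the defaults listed above.

```python
from collections import deque


def changed_percent(a, b, pixel_tolerance=16):
    """Percentage of pixels that differ by more than pixel_tolerance
    in any colour channel. pixel_tolerance is an illustrative assumption."""
    changed = sum(
        1
        for pa, pb in zip(a, b)
        if max(abs(ca - cb) for ca, cb in zip(pa, pb)) > pixel_tolerance
    )
    return 100.0 * changed / len(a)


def dedup(frames, threshold=8.0, window=4):
    """Return indices of frames that differ from every recently kept frame."""
    recent = deque(maxlen=window)  # the last `window` kept frames
    kept = []
    for index, frame in enumerate(frames):
        # A frame is new only if it differs enough from all frames in the window.
        if all(changed_percent(frame, old) >= threshold for old in recent):
            recent.append(frame)
            kept.append(index)
    return kept


if __name__ == "__main__":
    shot_a = [(255, 0, 0)] * 100
    shot_b = [(0, 0, 255)] * 100
    # Same shot with 5% of pixels altered: below the 8% threshold.
    shot_a_noisy = [(0, 255, 0)] * 5 + shot_a[5:]
    # A, noisy A, B, back to A, back to B: only the first A and first B survive.
    print(dedup([shot_a, shot_a_noisy, shot_b, shot_a, shot_b]))
```

Comparing against a window rather than only the previous frame is what drops the return to shot A after the cutaway to B. Lowering threshold or shrinking window would keep more frames.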
The audio pipeline is smart too. If the video already has subtitles (a sidecar .srt or .vtt next to a local file, or an embedded subtitle track), those are used as the transcript, which is faster and more accurate than re-transcribing. Only when there are no subtitles does it fall back to Whisper on the audio. An optional --keep-audio flag saves the full original soundtrack as audio.m4a, so a model that can listen, such as Gemini or GPT-4o, can actually hear the music and tone, not just read the words.
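The subtitles-first fallback described above can be sketched as a small decision function. This is an illustration rather than the tool's code: the function name and the has_embedded_subtitles flag are hypothetical (a real pipeline would have to probe the container for subtitle streams), and checking sidecar files before embedded tracks is an assumed order of precedence.

```python
from pathlib import Path


def pick_transcript_source(video_path, has_embedded_subtitles=False):
    """Decide where the transcript should come from.

    Returns a (kind, path) pair where kind is "sidecar", "embedded",
    or "whisper". The precedence order is an assumption.
    """
    video = Path(video_path)
    # 1. A subtitle file with the same stem, sitting next to the video.
    for ext in (".srt", ".vtt"):
        sidecar = video.with_suffix(ext)
        if sidecar.exists():
            return ("sidecar", sidecar)
    # 2. A subtitle track inside the container (detected by the caller).
    if has_embedded_subtitles:
        return ("embedded", video)
    # 3. No subtitles anywhere: transcribe the audio instead.
    return ("whisper", video)
```

Only the last branch needs a speech model at all, which is why checking for subtitles first saves both time and compute.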
What this means for AI builders
The tool exposes a blind spot in how the industry thinks about video understanding. The frontier labs are racing to build native video models — Gemini 2.0 Flash can process hours of video, OpenAI’s GPT-4o can watch a live camera feed. But those capabilities are locked inside proprietary APIs, priced per token, and subject to whatever sampling strategy the provider chose. The user has no control over which frames get seen.
Claude-real-video is the opposite. It runs locally, costs nothing beyond compute, and gives the user full control over the sampling strategy. That matters for any application where frame selection is critical: surveillance review, sports analysis, UI testing, medical video, any domain where a missed frame is a missed diagnosis or a missed bug.
The tool also points to a deeper architectural question. If a cheap Python script with ffmpeg and Whisper can extract meaningful frames and transcripts from any video, how much of the value in a “video model” is actually in the video processing pipeline versus the language model that reads the output? The frontier labs are spending billions training models that can watch video natively. This tool suggests that for many use cases, the bottleneck is not the model’s ability to understand frames — it is the ability to select the right frames in the first place.
The open-source angle
Claude-real-video is MIT-licensed and installs with pip install claude-real-video. It depends on ffmpeg and, optionally, OpenAI Whisper. The code is a single Python module with clear parameter names and a straightforward pipeline: fetch, extract, dedup, transcribe, manifest.
The tool’s author, Huang Chih-Hung Leo, built it to solve a specific problem: Claude will not accept a video file, and the models that do, such as Gemini or GPT-4o, rely on fixed-interval sampling that misses too much. The solution is a pre-processing step that runs on the user’s machine, not in the cloud. That pattern will likely replicate across other modalities: audio, 3D, point clouds, anything where the model’s native sampling is too coarse or too expensive.
What to watch
The tool works today for any LLM that accepts images and text. The user drops in a folder of frames and a manifest, and the model sees what the user chose to show it. That is a fundamentally different capability from “the model watches the video itself” — it is more constrained, but also more predictable, more debuggable, and cheaper.
The open question is whether the frontier labs will respond by offering more flexible sampling — letting users specify scene-change thresholds, dedup windows, and frame caps — or whether they will continue to lock the video processing pipeline inside their APIs. If the labs open up the sampling controls, tools like claude-real-video become a bridge to better native video understanding. If they do not, the tool becomes a permanent workaround: a pre-processing step that every serious video-analysis pipeline will need to run before touching an LLM.
Either way, the hack works. Point it at a video, get back the frames that matter, and ask your model what it sees.