CLI tool for video analysis using LLMs, CV, and ASR
This project provides a tool for analyzing video content using Large Language Models (LLMs), Computer Vision, and Automatic Speech Recognition. It's designed for researchers and developers who need to extract detailed, natural language descriptions from videos, leveraging either local LLM deployments or cloud-based APIs.
How It Works
The system operates in three stages: frame extraction and audio processing, frame analysis, and video reconstruction. It uses OpenCV for intelligent keyframe extraction and Whisper for high-quality audio transcription. Each keyframe is then analyzed by a vision LLM (like Llama3.2 Vision) to capture details, with context from previous frames maintained. Finally, these analyses are combined chronologically with the audio transcript to generate a comprehensive video description.
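The keyframe-selection step described above can be sketched in a few lines. This is a minimal illustration of difference-based keyframe picking, not the project's actual implementation; the function name, the grayscale-array input, and the threshold value are all assumptions chosen for clarity.

```python
import numpy as np

def select_keyframes(frames, threshold=12.0):
    """Keep frames that differ noticeably from the last kept frame.

    `frames`: iterable of grayscale frames as 2-D numpy uint8 arrays.
    `threshold`: mean absolute pixel difference (0-255 scale) that
    triggers a new keyframe. Both choices are illustrative only.
    """
    keyframes = []
    last = None
    for i, frame in enumerate(frames):
        if last is None:
            # Always keep the first frame as the initial reference.
            keyframes.append((i, frame))
            last = frame
            continue
        # Compare against the last kept frame, not the previous frame,
        # so slow drifts still accumulate into a scene change.
        diff = np.abs(frame.astype(np.int16) - last.astype(np.int16)).mean()
        if diff > threshold:
            keyframes.append((i, frame))
            last = frame
    return keyframes

# Synthetic demo: three identical dark frames, then a bright scene change.
frames = [np.zeros((4, 4), dtype=np.uint8)] * 3 + [np.full((4, 4), 200, dtype=np.uint8)]
picked = [i for i, _ in select_keyframes(frames)]
print(picked)  # -> [0, 3]
```

In a real pipeline the frames would come from OpenCV (`cv2.VideoCapture`), and each selected keyframe would then be passed to the vision LLM along with the running context from prior frames.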
Quick Start & Requirements
Install with pip install . (or pip install -e . for development). To run analysis against a local vision model, pull it first with Ollama (ollama pull llama3.2-vision).
Highlighted Details
Maintenance & Community
The project welcomes contributions and provides guidelines in docs/CONTRIBUTING.md.
Licensing & Compatibility
Licensed under the Apache License 2.0, allowing for commercial use and integration with closed-source projects.
Limitations & Caveats
The project is primarily designed for Linux and macOS; Windows compatibility for local LLM execution might require additional setup. Performance is heavily dependent on the chosen LLM and hardware.