Vlogger by Vchitect

AI system for minute-level vlog generation from user descriptions

Created 1 year ago
427 stars

Top 69.2% on SourcePulse

View on GitHub
Project Summary

Vlogger is an AI system designed to generate minute-level video blogs (vlogs) from user descriptions. It targets users who need to create longer, narrative-driven video content, offering a structured approach to complex video generation tasks. The system aims to simplify vlog creation by mimicking human production workflows, enabling coherent and engaging long-form video output from simple text prompts.

How It Works

Vlogger employs a modular architecture, leveraging a Large Language Model (LLM) as a "Director" to orchestrate the generation process. This Director decomposes the vlog creation into four stages: Script generation, Actor selection, ShowMaker (video snippet generation), and Voicer (audio generation). The core innovation is the "ShowMaker," a novel video diffusion model that acts as a videographer. ShowMaker enhances spatial-temporal coherence by incorporating textual and visual prompts from the Script and Actor stages, utilizing a mixed training paradigm for both text-to-video (T2V) generation and prediction.
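The four-stage flow described above can be sketched as a simple orchestration loop. This is a minimal illustration with hypothetical stub functions standing in for the LLM "Director" and the foundation models; it is not the project's actual API.

```python
# Sketch of Vlogger's four-stage pipeline; every function below is a
# hypothetical stand-in, not code from the repository.

def write_script(description):
    # Stage 1 (Script): the LLM Director decomposes the user
    # description into an ordered list of scene descriptions.
    return [f"scene {i}: {description}" for i in range(1, 4)]

def select_actors(script):
    # Stage 2 (Actor): choose a reference image ("actor") per scene.
    return {scene: f"actor_for({scene})" for scene in script}

def make_snippet(scene, actor):
    # Stage 3 (ShowMaker): generate a video snippet conditioned on
    # the textual prompt (scene) and visual prompt (actor).
    return f"video[{scene} | {actor}]"

def voice_over(scene):
    # Stage 4 (Voicer): synthesize narration audio for the scene.
    return f"audio[{scene}]"

def generate_vlog(description):
    """Run all four stages and return (video, audio) pairs per scene."""
    script = write_script(description)
    actors = select_actors(script)
    return [(make_snippet(s, actors[s]), voice_over(s)) for s in script]

vlog = generate_vlog("a day hiking in the Alps")
```

Concatenating the per-scene (video, audio) pairs in script order is what yields the minute-level, coherent output the Director plans for.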

Quick Start & Requirements

  • Environment Setup: Use conda create -n vlogger python==3.10.11 and conda activate vlogger, then pip install -r requirements.txt.
  • Model Downloads: Requires Stable Diffusion v1.4, OpenCLIP-ViT-H-14, and the Vlogger ShowMaker checkpoint. All should be placed in a ./pretrained directory.
  • Dependencies: Python 3.10.11, PyTorch, Hugging Face libraries, and an OpenAI API key for LLM planning.
  • Inference:
    • Script/Actor generation: python sample_scripts/vlog_write_script.py
    • Vlog generation: python sample_scripts/vlog_read_script_sample.py
  • Resources: Requires downloading several large model checkpoints.
  • Links: Project Page, Hugging Face Models, Hugging Face Space.
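Since inference depends on several large checkpoints landing in `./pretrained`, a small pre-flight check can save a failed run. The exact subdirectory and file names below are assumptions for illustration, not taken from the repository.

```python
from pathlib import Path

# Hypothetical layout under ./pretrained; the names are assumptions.
REQUIRED = [
    "pretrained/stable-diffusion-v1-4",
    "pretrained/OpenCLIP-ViT-H-14",
    "pretrained/ShowMaker.pt",
]

def missing_checkpoints(root=".", required=REQUIRED):
    """Return the required model paths that are not present under root."""
    return [p for p in required if not (Path(root) / p).exists()]

if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All checkpoints present.")
```

Running this before `vlog_read_script_sample.py` reports which downloads are still outstanding instead of failing mid-generation.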

Highlighted Details

  • Achieves state-of-the-art performance on zero-shot T2V generation and prediction.
  • Generates vlogs longer than five minutes while maintaining narrative coherence.
  • Integrates LLMs for directorial planning and foundation models for professional roles.
  • ShowMaker model enhances snippet coherence using text and image prompts.

Maintenance & Community

The project is associated with researchers from institutions like PJLab. Contact information for key contributors is provided. The code is built upon existing libraries like SEINE, LaVie, diffusers, and Stable Diffusion.

Licensing & Compatibility

The code is licensed under Apache-2.0. Model weights are fully open for academic research; for commercial licensing inquiries, contact zhuangshaobin@pjlab.org.cn.

Limitations & Caveats

The system is not trained for realistic representation of people or events, and its use for generating demeaning, harmful, or violent content is prohibited. Users are solely liable for their actions.

Health Check

Last Commit: 4 months ago
Responsiveness: 1 day
Pull Requests (30d): 0
Issues (30d): 0
Star History: 0 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Jiaming Song (Chief Scientist at Luma AI).

MoneyPrinterTurbo by harry0703

Top 0.4% · 40k stars
AI tool for one-click short video generation from text prompts
Created 1 year ago · Updated 3 months ago