Awesome-Video-Diffusion-Models
Model · Free · [CSUR] A Survey on Video Diffusion Models
Capabilities (12 decomposed)
hierarchical-taxonomy-based-research-organization
Medium confidence: Organizes video diffusion research into a three-pillar taxonomy (video generation, video editing, video understanding) using a hub-and-spoke model where the survey document serves as the central organizing principle. The taxonomy implements nested subcategories (e.g., Text-to-Video subdivided into Training-based and Training-free approaches) with structured tables that systematically link to external papers, GitHub repositories, and project websites, enabling researchers to navigate the research landscape through semantic categorization rather than chronological or alphabetical ordering.
Implements a three-pillar taxonomy (generation, editing, understanding) with nested subcategories and external linkage tables rather than a flat list or chronological archive. The hub-and-spoke model positions the survey paper as the authoritative organizing principle while maintaining distributed links to external implementations and papers, creating a living research index that bridges academic literature and open-source implementations.
More comprehensive and systematically organized than GitHub awesome-lists that rely on alphabetical sorting; provides semantic structure comparable to academic surveys but with direct links to code repositories and live projects rather than citations alone
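To make the nested structure concrete, here is a minimal Python sketch (not the repository's actual data model) of how the three-pillar taxonomy and its linked table entries could be represented. The category names come from the description above; the `Entry` fields and the helper function are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    """One table row: a paper plus its external links (fields are illustrative)."""
    title: str
    paper_url: str = ""
    code_url: str = ""
    project_url: str = ""

@dataclass
class Category:
    """A taxonomy node; subcategories nest arbitrarily deep."""
    name: str
    entries: list[Entry] = field(default_factory=list)
    subcategories: list["Category"] = field(default_factory=list)

# The three-pillar structure described above, with one nested example.
taxonomy = Category("Video Diffusion Models", subcategories=[
    Category("Video Generation", subcategories=[
        Category("Text-to-Video", subcategories=[
            Category("Training-based"),
            Category("Training-free"),
        ]),
    ]),
    Category("Video Editing"),
    Category("Video Understanding"),
])

def iter_categories(node: Category, depth: int = 0):
    """Depth-first walk, mirroring how a reader scans the nested headings."""
    yield depth, node.name
    for child in node.subcategories:
        yield from iter_categories(child, depth + 1)

for depth, name in iter_categories(taxonomy):
    print("  " * depth + name)
```

Modeling the taxonomy as nested categories rather than a flat list is what makes the semantic navigation described above possible: a reader (or a script) descends from pillar to subcategory to entry instead of scanning an alphabetical index.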
text-to-video-generation-method-comparison
Medium confidence: Provides structured comparison of text-to-video generation approaches by categorizing them into training-based methods (e.g., Make-A-Video, CogVideoX) and training-free methods, with linked papers and implementations for each. The capability enables researchers to understand the trade-offs between approaches that require fine-tuning on video datasets versus those that leverage pre-trained image diffusion models without additional training, facilitating architectural decision-making for practitioners building text-to-video systems.
Explicitly bifurcates text-to-video methods into training-based and training-free subcategories with separate tables for each, making the computational and data requirements distinction immediately visible. This binary classification helps practitioners quickly identify whether they need to invest in dataset curation and fine-tuning or can leverage existing pre-trained models.
More structured than a flat list of text-to-video papers; provides explicit categorization by training approach rather than requiring readers to infer computational requirements from paper abstracts
research-paper-and-implementation-cross-referencing
Medium confidence: Maintains bidirectional cross-references between research papers and their implementations, enabling practitioners to navigate from a paper to its GitHub repository and vice versa. The capability uses structured table entries that link papers (with arXiv/conference links) to corresponding GitHub repositories and project websites, creating a unified view of research and its practical instantiation. This supports practitioners who want to understand both the theoretical approach and the implementation details.
Explicitly maintains bidirectional links between papers and implementations in structured tables, rather than treating them as separate resources. This enables practitioners to navigate seamlessly between research and code, supporting both top-down (paper-to-implementation) and bottom-up (implementation-to-paper) discovery.
More practical than paper-only surveys or code-only repositories; provides unified access to both research and implementations, enabling practitioners to understand both theoretical and practical aspects
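As an illustration of how such table entries can be consumed programmatically, the following Python sketch extracts paper, code, and project links from a single markdown table row. The row format, column heuristics, and example URLs are assumptions for demonstration, not a parser that the repository provides.

```python
import re

# Matches markdown links of the form [label](url)
LINK_RE = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")

def parse_row(row: str) -> dict:
    """Extract links from one markdown table row and bucket them by target.

    The paper/code/project split is a heuristic based on the table structure
    described above, not the repository's guaranteed column order.
    """
    links = {"paper": None, "code": None, "project": None, "other": []}
    for label, url in LINK_RE.findall(row):
        if "arxiv.org" in url or "openaccess" in url:
            links["paper"] = (label, url)
        elif "github.com" in url:
            links["code"] = (label, url)
        elif url.startswith("http"):
            links["project"] = links["project"] or (label, url)
        else:
            links["other"].append((label, url))
    return links

# Placeholder URLs; a real row would link to the actual paper, repo, and site.
row = ("| [Some T2V Method](https://arxiv.org/abs/XXXX.XXXXX) "
       "| [Code](https://github.com/example/repo) "
       "| [Project](https://example.com) |")
print(parse_row(row))
```

A script like this is one way the bidirectional navigation described above could be automated, e.g. to build a local index mapping GitHub repositories back to their papers.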
survey-paper-citation-and-academic-usage
Medium confidence: Provides citation information and academic usage guidance for the survey paper itself, enabling researchers to properly cite the comprehensive video diffusion survey in their own work. The capability includes BibTeX entries, citation formats, and information about the paper's publication in ACM Computing Surveys (CSUR), supporting academic reproducibility and proper attribution. This enables the survey to be used as an authoritative reference in academic work.
Explicitly provides citation information and academic usage guidance for the survey itself, recognizing that comprehensive surveys serve as authoritative references in academic work. This enables the survey to be properly cited and used in literature reviews and related work sections.
More academically rigorous than informal awesome-lists; provides proper citation information and publication venue (CSUR) that enables use as an authoritative reference in academic work
conditional-video-generation-taxonomy
Medium confidence: Organizes conditional video generation methods into pose-guided, motion-guided, sound-guided, and multi-modal control subcategories, with linked papers and implementations for each. The taxonomy enables practitioners to identify which conditioning modality (skeletal pose, motion vectors, audio, or combined inputs) best fits their use case, and to discover methods like AnimateAnyone and FollowYourPose that implement specific conditioning approaches. This capability maps user intents (e.g., 'animate a character from a pose sequence') to specific research papers and implementations.
Implements a four-way taxonomy of conditioning modalities (pose, motion, sound, multi-modal) rather than treating conditional generation as a monolithic category. This enables practitioners to quickly identify which conditioning approach matches their input data and use case, and to discover methods like AnimateAnyone that specialize in specific modalities.
More granular than generic 'conditional video generation' categorization; provides modality-specific organization that maps directly to practitioner input data (pose sequences, audio, motion vectors) rather than requiring inference about which method accepts which inputs
image-to-video-synthesis-method-discovery
Medium confidence: Catalogs image-to-video (I2V) synthesis and animation methods with links to papers and implementations like Stable Video Diffusion and DynamiCrafter. The capability enables practitioners to discover methods that generate video sequences from static images, with subcategories distinguishing between pure I2V synthesis (generating motion from a single image) and animation approaches (bringing static artwork or illustrations to life). This supports use cases like creating video from photographs or animating artwork.
Distinguishes between I2V synthesis (generating motion from single images) and animation (bringing static artwork to life) as separate but related subcategories, recognizing that these approaches have different architectural requirements and use cases despite both operating on static image inputs.
More specific than generic 'video generation' categorization; provides explicit focus on image-conditioned generation methods rather than requiring practitioners to filter through text-to-video and other approaches
text-guided-video-editing-method-catalog
Medium confidence: Organizes text-guided video editing methods into a structured catalog with links to papers and implementations that enable users to modify videos using natural language descriptions. The capability maps text prompts to video editing operations (e.g., 'change the sky to sunset', 'make the character smile'), enabling practitioners to discover methods that support semantic video manipulation without frame-by-frame manual editing. This differs from video generation by operating on existing video content rather than creating from scratch.
Explicitly separates text-guided video editing from text-to-video generation, recognizing that editing existing video content requires different architectural approaches (e.g., preserving unedited regions, maintaining temporal consistency across edits) than generating video from scratch. This distinction helps practitioners understand which methods apply to their use case.
More focused than generic 'video diffusion' categorization; provides explicit organization of editing-specific methods rather than requiring practitioners to filter through generation approaches
multi-modal-video-editing-integration
Medium confidence: Catalogs multi-modal video editing methods that combine multiple input modalities (text, images, sketches, masks) to enable fine-grained control over video editing. The capability links to methods that support combined conditioning signals, enabling practitioners to discover approaches that go beyond text-only editing to incorporate visual constraints, spatial masks, or reference images. This supports complex editing workflows where text descriptions alone are insufficient.
Recognizes multi-modal video editing as a distinct category beyond text-guided editing, acknowledging that combining multiple input modalities (text, image, mask, sketch) enables more precise control than single-modality approaches. This reflects the architectural complexity of methods that must reconcile multiple conditioning signals.
More granular than generic 'video editing' categorization; explicitly organizes multi-modal methods separately from text-only approaches, helping practitioners understand which methods support their specific input modality combinations
video-understanding-and-analysis-research-index
Medium confidence: Provides a structured index of video understanding and analysis research methods, enabling practitioners to discover approaches for video classification, action recognition, temporal reasoning, and semantic understanding. The capability catalogs papers and implementations that analyze video content rather than generate or edit it, supporting use cases like video captioning, action detection, and scene understanding. This represents the third pillar of the survey alongside generation and editing.
Positions video understanding and analysis as a co-equal pillar alongside video generation and editing, rather than treating it as secondary. This reflects the survey's comprehensive scope across the full video diffusion research landscape, including both generative and analytical approaches.
More comprehensive than generation-focused surveys; includes video understanding research alongside generation and editing, providing a complete view of video diffusion applications
dataset-and-evaluation-metric-reference
Medium confidence: Catalogs datasets and evaluation metrics used in video diffusion research, enabling practitioners to understand how video generation, editing, and understanding methods are evaluated. The capability provides links to benchmark datasets (e.g., UCF101, Kinetics) and evaluation metrics (e.g., FVD, LPIPS, temporal consistency measures) used across the field, supporting practitioners in selecting appropriate evaluation approaches for their own systems. This enables informed comparison of methods and reproducible evaluation.
Centralizes dataset and evaluation metric information as a dedicated section of the survey, recognizing that reproducible evaluation is critical for comparing video diffusion methods. This provides practitioners with a single reference point for understanding how methods are evaluated rather than requiring them to extract this information from individual papers.
More comprehensive than individual paper evaluations; provides a unified view of datasets and metrics used across the field, enabling practitioners to understand standard evaluation practices and select appropriate benchmarks
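For context on one of the metrics mentioned, FVD is conventionally computed as the Fréchet distance between Gaussians fitted to features of real and generated videos (the features typically come from an I3D network). The sketch below assumes the features have already been extracted and shows only that final distance computation; it is not a reference FVD implementation.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two feature sets.

    feats_* are (num_videos, feature_dim) arrays of precomputed features;
    FVD conventionally uses I3D features, whose extraction is out of scope here.
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)

    # sqrtm can return tiny imaginary parts from numerical error; discard them.
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

# Toy usage with random features; a real FVD run would use features
# extracted from actual real and generated video clips.
rng = np.random.default_rng(0)
real = rng.normal(size=(256, 16))
gen = rng.normal(loc=0.5, size=(256, 16))
print(frechet_distance(real, gen))
```

Lower values indicate that the generated-video feature distribution is closer to the real one, which is why FVD is reported alongside perceptual metrics like LPIPS and temporal consistency measures.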
external-ecosystem-integration-and-linking
Medium confidence: Implements a hub-and-spoke architecture that connects the survey to external resources including academic papers, GitHub repositories, project websites, and commercial platforms. The capability uses structured link patterns in README.md tables to systematically reference external implementations and research, creating a distributed knowledge network where the survey serves as the organizing principle while actual code and papers reside in external repositories. This enables practitioners to navigate from research concepts to implementations without leaving the survey context.
Implements a hub-and-spoke model where the survey acts as the central organizing principle while maintaining distributed links to external implementations and papers, rather than attempting to host all code and papers locally. This architecture enables the survey to remain lightweight and current while providing comprehensive access to the ecosystem.
More practical than academic surveys that only cite papers; provides direct links to implementations and code repositories, enabling practitioners to move from research concepts to working code without manual searching
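Because this hub-and-spoke design depends on external links staying live (see Known Limitations below), a maintainer or reader might want to spot-check them. The standard-library Python sketch below is one way to do that; the README path, timeout, and HEAD-request approach are assumptions, and some servers reject HEAD requests, so results may include false positives.

```python
import re
import urllib.request
from urllib.error import HTTPError, URLError

# Markdown links with absolute http(s) URLs.
URL_RE = re.compile(r"\[([^\]]+)\]\((https?://[^)\s]+)\)")

def check_links(readme_path: str = "README.md", timeout: float = 10.0):
    """Return links that no longer respond; path and timeout are illustrative."""
    with open(readme_path, encoding="utf-8") as f:
        text = f.read()
    dead = []
    for label, url in URL_RE.findall(text):
        req = urllib.request.Request(
            url, method="HEAD", headers={"User-Agent": "link-check"}
        )
        try:
            urllib.request.urlopen(req, timeout=timeout)
        except (HTTPError, URLError, TimeoutError) as err:
            dead.append((label, url, str(err)))
    return dead

if __name__ == "__main__":
    for label, url, err in check_links():
        print(f"STALE: {label} -> {url} ({err})")
```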
visual-demonstration-and-example-curation
Medium confidence: Curates a collection of visual demonstrations (GIFs, video clips) that illustrate key concepts and capabilities in video diffusion research. The capability organizes visual assets by type (algorithm demonstrations, motion examples, generation results, comparative examples) to provide practitioners with concrete examples of what different methods produce. This supports learning and evaluation by showing actual outputs rather than relying solely on text descriptions and paper figures.
Organizes visual assets by demonstration type (algorithm visualization, motion examples, generation results, comparisons) rather than simply embedding random examples, creating a structured visual learning experience that complements the textual taxonomy. This enables practitioners to quickly understand method capabilities through concrete visual examples.
More pedagogically useful than text-only surveys; provides visual examples that enable quick evaluation of method capabilities without reading full papers or running code
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with Awesome-Video-Diffusion-Models, ranked by overlap. Discovered automatically through the match graph.
PaperTalk.io
PaperTalk.io is a platform that uses Generative AI technology to enhance the understanding of research...
Paperguide
AI-driven platform for research discovery, writing, and...
Diffusion-Models-Papers-Survey-Taxonomy
Diffusion model papers, survey, and taxonomy
data-to-paper
A framework for systematically navigating the power of AI to perform complete end-to-end...
genei
Summarise academic articles in seconds and save 80% on your research times.
Awesome-Text-to-Image
(ෆ`꒳´ෆ) A Survey on Text-to-Image Generation/Synthesis.
Best For
- ✓ researchers conducting literature reviews in video diffusion
- ✓ practitioners evaluating which video diffusion approach fits their use case
- ✓ students learning the taxonomy and landscape of video generation methods
- ✓ teams building video diffusion systems who need to understand competing approaches
- ✓ ML engineers building text-to-video generation systems
- ✓ researchers comparing architectural approaches for video synthesis
- ✓ teams evaluating whether to implement training-based or training-free approaches
- ✓ practitioners with limited compute budgets deciding between fine-tuning and zero-shot methods
Known Limitations
- ⚠ taxonomy is static and requires manual updates as new research categories emerge
- ⚠ no algorithmic ranking or recommendation of papers within categories based on citation count or recency
- ⚠ does not capture interdependencies between categories (e.g., how video editing techniques relate to generation methods)
- ⚠ external links may become stale as projects are archived or moved
- ⚠ does not provide quantitative benchmarks or performance comparisons (e.g., FVD scores, inference time)
- ⚠ no implementation tutorials or code walkthroughs; only links to external repositories
Requirements
Input / Output
UnfragileRank
UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.
Repository Details
Last commit: Apr 15, 2026
About
[CSUR] A Survey on Video Diffusion Models
Categories
Alternatives to Awesome-Video-Diffusion-Models
Implementation of Imagen, Google's Text-to-Image Neural Network, in Pytorch
Are you the builder of Awesome-Video-Diffusion-Models?
Claim this artifact to get a verified badge, access match analytics, see which intents users search for, and manage your listing.