AI Video in 2024: The Models and Research That Changed Everything
A review of the most significant AI video research and product releases in 2024 — from OpenAI Sora and Runway Gen-3 to Google's VideoPoet and Meta's Movie Gen.
2024 was the year AI video generation went from impressive demos to something creators could actually build workflows around. The technical gap between what research labs were showing and what was available to users closed significantly. Here’s a review of the most important developments — both the research papers and the products that shipped.
OpenAI Sora: The Benchmark-Setting Announcement
The year’s most-discussed release was arguably the one most people still can’t fully access. OpenAI’s Sora model was announced in February 2024 with demo videos that genuinely changed what the industry thought was possible: 60-second photorealistic clips, multiple characters interacting, objects casting accurate shadows, and camera movements that felt like a seasoned DP was behind the lens.
The technical report that accompanied the announcement described Sora as a diffusion transformer for video: the same transformer-based diffusion architecture that scaled image generation past earlier U-Net designs, applied here to "spacetime patches" of video data. Instead of treating video as a flat sequence of images, Sora compresses clips into a lower-dimensional latent representation and cuts it into 3D chunks spanning both space and time, which lets the model reason about motion coherently across frames.
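For intuition, here is a minimal sketch (my own illustration, not OpenAI's code) of what cutting a clip into spacetime patches can look like in PyTorch. In the real system the patches are taken from the compressed latent rather than raw pixels, and the patch sizes below are arbitrary.

```python
import torch

def spacetime_patchify(video, pt=4, ph=16, pw=16):
    """Cut a (T, H, W, C) clip into non-overlapping pt x ph x pw blocks,
    one flattened token per block (illustrative; real systems patchify a latent)."""
    T, H, W, C = video.shape
    video = video[: T - T % pt, : H - H % ph, : W - W % pw]           # trim to multiples
    blocks = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)  # split each axis
    blocks = blocks.permute(0, 2, 4, 1, 3, 5, 6)                      # group patch dims last
    return blocks.reshape(-1, pt * ph * pw * C)                       # one row per patch

clip = torch.randn(32, 256, 256, 3)   # 32 frames of 256x256 RGB noise
tokens = spacetime_patchify(clip)     # shape: (2048, 3072), i.e. 2048 spacetime tokens
print(tokens.shape)
```

Each row is a token spanning a few frames and a small image region, so the transformer attends across time and space in the same way an image transformer attends across patches.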
OpenAI stated Sora was trained on both licensed and publicly available video, curated specifically for quality and motion variety. The full model has not been open-sourced, and the public version that launched in December 2024, available to paid ChatGPT subscribers, operates under tight rate limits and content policies. The publicly available product is impressive but notably more conservative than the February demos suggested, a common pattern in AI product launches.
Runway Gen-3 Alpha: The Practical Standard
While Sora attracted the headlines, Runway’s Gen-3 Alpha, released in July 2024, became the practical tool that film-adjacent creators actually adopted. Runway’s focus has always been on tools that can integrate into real production pipelines, and Gen-3 delivered on that with improved temporal consistency, more controllable camera behaviour, and a noticeable jump in output realism compared to Gen-2.
Runway published limited technical details about Gen-3’s architecture, but stated it was trained on a proprietary dataset curated for “high-motion quality and cinematic character.” Industry reports suggest Gen-3 uses a transformer-based diffusion architecture similar to Sora but with the training emphasis shifted toward controllability.
Gen-3 Alpha is notable for being the first mainstream video model where professional directors and VFX artists openly acknowledged using it for pre-visualisation and concept development on real projects. A number of short films and music videos released in late 2024 publicly credited Runway.
Google’s VideoPoet and Lumiere
Google Research contributed two significant video-generation papers to the 2024 conversation.
VideoPoet (December 2023, widely discussed through early 2024) introduced an autoregressive LLM approach to video generation, treating video, audio, and text as discrete tokens in a single unified model. Unlike diffusion models, VideoPoet can generate video with synchronized audio from the same model, predict future frames, and apply style transformations. The paper demonstrated compelling results on tasks like generating a short clip with matching ambient audio from a text description. VideoPoet has not been released as a public product.
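To make the unified-token idea concrete, here is a toy sketch that assumes nothing about Google's actual implementation: a tiny decoder-only transformer over a shared text/video/audio vocabulary, decoding new tokens autoregressively after a text prefix. Vocabulary sizes, dimensions, and the greedy loop are all illustrative.

```python
import torch
import torch.nn as nn

# Unified vocabulary: text, video, and audio tokens share one ID space.
TEXT_VOCAB, VIDEO_VOCAB, AUDIO_VOCAB = 32_000, 8_192, 4_096
VOCAB = TEXT_VOCAB + VIDEO_VOCAB + AUDIO_VOCAB

class TinyUnifiedLM(nn.Module):
    """Decoder-only transformer over the shared token vocabulary (toy scale)."""
    def __init__(self, dim=256, layers=4, heads=4, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, dim)
        self.pos = nn.Embedding(max_len, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, layers)
        self.head = nn.Linear(dim, VOCAB)

    def forward(self, tokens):                      # tokens: (batch, seq_len)
        T = tokens.shape[1]
        causal = nn.Transformer.generate_square_subsequent_mask(T)
        x = self.embed(tokens) + self.pos(torch.arange(T, device=tokens.device))
        return self.head(self.blocks(x, mask=causal))

# Condition on tokenized text, then decode video/audio tokens one at a time.
model = TinyUnifiedLM().eval()
seq = torch.randint(0, TEXT_VOCAB, (1, 16))         # stand-in for a tokenized prompt
with torch.no_grad():
    for _ in range(64):                              # 64 new tokens, any modality
        next_token = model(seq)[:, -1].argmax(-1, keepdim=True)
        seq = torch.cat([seq, next_token], dim=1)
```

The point of the design is that the same next-token machinery handles captioning, continuation, and audio generation; the modality is determined only by which part of the vocabulary the decoded tokens fall into.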
Lumiere, published as a research paper in January 2024, proposed a space-time U-Net architecture that generates the entire video at once rather than frame-by-frame. The Lumiere paper showed strong results on consistent large-motion generation — running figures, vehicles, and water — that exceeded prior published benchmarks. Like VideoPoet, Lumiere remains a research demonstration without public product access.
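The frame-by-frame versus whole-clip distinction comes down to where the network downsamples. The block below is a simplified assumption on my part, not Lumiere's published code: a 3D convolution that halves the time axis along with the spatial axes, so deeper layers see a coarse summary of the entire clip at once.

```python
import torch
import torch.nn as nn

class SpaceTimeDownBlock(nn.Module):
    """Downsample time and space together, so the network reasons over the whole clip."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv3d(c_in, c_out, kernel_size=3, padding=1)
        self.down = nn.Conv3d(c_out, c_out, kernel_size=3, stride=2, padding=1)  # halves T, H, W
        self.act = nn.SiLU()

    def forward(self, x):                          # x: (batch, channels, T, H, W)
        return self.act(self.down(self.act(self.conv(x))))

clip = torch.randn(1, 3, 16, 64, 64)               # one 16-frame, 64x64 clip
print(SpaceTimeDownBlock(3, 32)(clip).shape)       # torch.Size([1, 32, 8, 32, 32])
```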
Meta’s Movie Gen
In October 2024, Meta published Movie Gen, one of the most ambitious video AI research papers of the year. Movie Gen pairs a 30-billion-parameter transformer for video with a companion audio model, and together they generate 16-second, 1080p videos with synchronized sound, one of the first published systems to handle both modalities at that resolution.
The paper demonstrated video generation driven by text prompts, video editing guided by natural language instructions (“make the background a sunset instead of midday”), and audio generation from visual content. The evaluation showed Movie Gen outperforming Runway Gen-3, Pika, and Kling on a range of quality benchmarks.
Meta stated Movie Gen will power future features in its products but has not specified a timeline or access model. Given Meta’s pattern with AI releases (LLaMA, Segment Anything), a partial or full open-source release is plausible but unconfirmed.
Kling and the Chinese Model Surge
The most significant competitive shift in 2024 was the emergence of high-quality AI video models from Chinese developers. Kling, released by Kuaishou in June 2024, arrived with a capability that no Western model had demonstrated: smooth, consistent generation of clips up to two minutes long at 1080p.
Kuaishou has not published detailed technical papers about Kling’s architecture, but the outputs suggest strong temporal coherence mechanisms and a training corpus that emphasises longer-form motion. The international release of Kling brought genuine competition to Runway’s dominance among professional creators.
CogVideoX, from Tsinghua University and Zhipu AI, also launched as an open-source release in August 2024. CogVideoX-5B can run locally on consumer-grade hardware and produced outputs that were widely compared favorably to commercial tools, representing a meaningful advance in open-source video generation.
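As a rough sketch of what running it locally looks like, the Hugging Face diffusers library carries a CogVideoX pipeline. The snippet assumes a recent diffusers release, a CUDA GPU, and the public THUDM/CogVideoX-5b checkpoint; the prompt and sampling settings are just examples.

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# Load the open CogVideoX-5B weights; CPU offload trades speed for lower VRAM use.
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()

frames = pipe(
    prompt="a paper boat drifting down a rain-soaked city street at dusk",
    num_frames=49,              # roughly six seconds at 8 fps
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]

export_to_video(frames, "cogvideox_clip.mp4", fps=8)
```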
Stable Video Diffusion and the Open-Source Ecosystem
Stability AI released Stable Video Diffusion (SVD) in November 2023, making it the first high-quality open-source video model accessible to individual developers. SVD XT, released alongside, generates 25 frames at up to 1024×576 pixels. The model runs on a GPU with 16GB VRAM — for the first time putting serious video generation capability in the hands of people who couldn’t or wouldn’t pay cloud API fees.
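A minimal image-to-video sketch with the diffusers integration of SVD looks roughly like this, assuming the public Stability AI checkpoint on the Hugging Face Hub and a GPU in that memory range; the input image path is a placeholder.

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",  # the 25-frame SVD XT checkpoint
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.enable_model_cpu_offload()                       # keeps peak VRAM in consumer range

image = load_image("still_frame.png").resize((1024, 576))  # conditioning image (placeholder path)
frames = pipe(image, decode_chunk_size=8).frames[0]        # 25 generated frames
export_to_video(frames, "svd_clip.mp4", fps=7)
```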
SVD spawned a wave of fine-tunes and extensions in 2024, and the wider open ecosystem moved with it. AnimateDiff v3 plugs a trained motion module into ordinary Stable Diffusion checkpoints, allowing creators to animate imagery from any SD model and steer the movement with motion LoRAs: small, shareable adapters trained on specific motion patterns.
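A sketch of that workflow through diffusers, assuming the community motion-adapter and motion-LoRA checkpoints referenced in the diffusers documentation; swap in your own Stable Diffusion base model and LoRA as needed.

```python
import torch
from diffusers import AnimateDiffPipeline, MotionAdapter
from diffusers.utils import export_to_gif

# The motion adapter adds temporal layers to an ordinary Stable Diffusion checkpoint.
adapter = MotionAdapter.from_pretrained(
    "guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16
)
pipe = AnimateDiffPipeline.from_pretrained(
    "emilianJR/epiCRealism", motion_adapter=adapter, torch_dtype=torch.float16
)
# A motion LoRA biases the generated movement (here, a zoom-out camera move).
pipe.load_lora_weights("guoyww/animatediff-motion-lora-zoom-out", adapter_name="zoom-out")
pipe.enable_model_cpu_offload()

result = pipe(
    prompt="a lighthouse on a cliff, waves rolling in below",
    num_frames=16,
    num_inference_steps=25,
    guidance_scale=7.5,
)
export_to_gif(result.frames[0], "animatediff_clip.gif")
```

The diffusers examples typically also configure a tuned scheduler for cleaner motion; the defaults are kept here for brevity.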
The open-source ecosystem matters because it provides access to model internals, enables training on custom data, and develops capabilities (like longer clip lengths and specific motion types) that commercial platforms may not prioritise.
What the Research Points to in 2025
The consistent finding across the 2024 papers is that scale is still the primary driver of quality — larger models, trained on more diverse and higher-quality video data, consistently outperform smaller ones. This suggests that the gap between research-grade models and publicly available tools will continue to narrow as compute costs fall and training datasets improve.
The Meta Movie Gen paper’s joint video-and-audio generation is likely to become a product in 2025 — the technical capability is demonstrated; it’s a matter of productisation. Similarly, the long-form generation demonstrated by Kling will likely spread to Western platforms.
The more uncertain question is whether the open-source ecosystem will keep pace with commercial development. The release of CogVideoX in mid-2024 suggested it could; the months-long gap before another comparable open release suggested commercial labs still held a meaningful lead.