Two Brothers Build Text-to-Video Model from Scratch
Technology


Hacker News · 7h ago
3 min read

Key Facts

  • ✓ Sahil and Manu, two brothers, spent two years training a text-to-video model entirely from scratch, releasing it under the Apache 2.0 license.
  • ✓ The 2B parameter model generates 2-5 seconds of footage at either 360p or 720p resolution; in size, its closest comparison is Alibaba's Wan 2.1 1.3B model.
  • ✓ Development focused heavily on building effective curation pipelines, including hand-labeling aesthetic properties and fine-tuning VLMs for large-scale filtering.
  • ✓ The model uses T5 for text encoding, Wan 2.1 VAE for compression, and a DiT-variant backbone trained with flow matching.
  • ✓ Current strengths include cartoon/animated styles, food and nature scenes, and simple character motion, while complex physics and fast motion remain challenging.
  • ✓ The brothers view this as a stepping stone toward state-of-the-art capabilities, with future plans for post-training, distillation, and audio integration.

In This Article

  1. Quick Summary
  2. The Two-Year Journey
  3. Technical Architecture
  4. Capabilities & Limitations
  5. Why Build Another Model?
  6. Future Roadmap
  7. Looking Ahead

Quick Summary

Two brothers have spent two years building a text-to-video model entirely from scratch and have released it as open-source software. The project, led by Sahil and Manu, shows that a small independent team can train a video model end to end without the backing of a large corporate lab.

The resulting model contains 2 billion parameters and can generate short video clips from text descriptions. While not claiming to match the performance of commercial systems like Sora or Veo, the brothers view their work as a crucial stepping stone toward state-of-the-art capabilities.

The Two-Year Journey

The brothers began their work in early 2024 and shipped their first model in January of that year, before OpenAI's Sora made headlines. That initial release was a 180p, 1-second GIF bot bootstrapped off Stable Diffusion XL. They quickly ran into fundamental limitations of using image-based models for video generation.

Image VAEs don't understand temporal coherence, and without the original training data, it's impossible to smoothly transition between image and video distributions. At some point, the brothers determined they were better off starting over rather than trying to patch existing solutions.

Their second version represents a complete rebuild from the ground up. The model uses:

  • T5 for text encoding
  • Wan 2.1 VAE for compression
  • A DiT-variant backbone trained with flow matching

Although they built their own temporal VAE, they ultimately used Wan's smaller VAE because it offered equivalent performance while saving on embedding costs. The brothers say they will open-source their own VAE shortly.
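The article names flow matching as the training objective but gives no details of the formulation. As a rough illustration only, the sketch below assumes the common straight-line variant, where a sample moves linearly from noise to data and the network learns the constant velocity along that path. The function names and the plain-list "latents" are hypothetical stand-ins, not the brothers' code.

```python
import random

def interpolate(noise, data, t):
    """Point on the straight path from noise (t=0) to data (t=1)."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]

def target_velocity(noise, data):
    """Velocity of the straight path; it does not depend on t."""
    return [d - n for n, d in zip(noise, data)]

def flow_matching_loss(predict_velocity, data, rng):
    """Single-sample mean squared error between predicted and target velocity.

    predict_velocity(x_t, t) stands in for the DiT backbone (assumption).
    """
    noise = [rng.gauss(0.0, 1.0) for _ in data]
    t = rng.random()
    x_t = interpolate(noise, data, t)
    v_pred = predict_velocity(x_t, t)
    v_true = target_velocity(noise, data)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(data)

def euler_sample(predict_velocity, noise, steps):
    """Generate a sample by integrating dx/dt = v(x, t) from t=0 to t=1."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        v = predict_velocity(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In a real system the lists would be VAE latents of video clips and `predict_velocity` would also receive the T5 text embedding; both are omitted here to keep the sketch short.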

"We're not claiming to have reached the frontier. For us, this is a stepping stone towards SOTA - proof we can train these models end-to-end ourselves."

— Sahil and Manu, Model Developers

Technical Architecture

The model generates 2-5 seconds of footage at either 360p or 720p resolution. In terms of model size, the closest comparison is Alibaba's Wan 2.1 1.3B model, though the brothers report that their model achieves significantly better motion capture and aesthetics in their testing.

The bulk of their development time wasn't spent on the model architecture itself, but on building curation pipelines that actually work. This involved hand-labeling aesthetic properties and fine-tuning Vision-Language Models (VLMs) to filter training data at scale.
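The article does not describe how the curation pipeline is implemented. The sketch below is one hypothetical shape for it: a small hand-labeled set is used to check how often an automated scorer agrees with human judgment, and the scorer is then applied as a threshold filter over a larger pool. `Clip`, `filter_clips`, `agreement`, and `score_fn` (a stand-in for a fine-tuned VLM) are all illustrative names, not the brothers' code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Clip:
    path: str
    caption: str

def filter_clips(clips: List[Clip],
                 score_fn: Callable[[Clip], float],
                 threshold: float) -> List[Clip]:
    """Keep only clips whose score meets the threshold."""
    return [clip for clip in clips if score_fn(clip) >= threshold]

def agreement(clips: List[Clip],
              human_labels: Dict[str, bool],
              score_fn: Callable[[Clip], float],
              threshold: float) -> float:
    """Fraction of hand-labeled clips where the scorer's keep/drop
    decision matches the human label. Expects a non-empty list."""
    matches = sum(
        1 for clip in clips
        if (score_fn(clip) >= threshold) == human_labels[clip.path]
    )
    return matches / len(clips)
```

A design like this lets the threshold be tuned on the hand-labeled set before the scorer is trusted on unlabeled data.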

When asked about their approach, the brothers explained their philosophy:

Products are extensions of the underlying model's capabilities. If users want a feature the model doesn't support—character consistency, camera controls, editing, style mapping, etc.—you're stuck. To build the product we want, we need to update the model itself.

This perspective drives their decision to own the entire development process, despite the significant computational costs involved.

Capabilities & Limitations

The model demonstrates particular strengths in specific domains. Through extensive testing, the brothers identified what works best:

  • Cartoon and animated styles
  • Food and nature scenes
  • Simple character motion

However, the model still faces challenges with more complex scenarios. Areas that don't work well include:

  • Complex physics
  • Fast motion sequences (gymnastics, dancing)
  • Consistent text rendering

The brothers are transparent about their model's position in the current landscape. They explicitly state: "We're not claiming to have reached the frontier." Instead, they view this release as proof of concept—demonstrating they can train these models end-to-end themselves.

Why Build Another Model?

With commercial offerings like Google's Veo and OpenAI's Sora already available, the brothers' decision to build from scratch might seem counterintuitive. Their reasoning centers on product control and flexibility.

When commercial models don't support specific features, developers are limited by what those models can do. The brothers believe that to build the product they envision, they need to update the model itself. This requires owning the development process rather than relying on external APIs.

It's a significant bet that requires substantial GPU compute resources and time to pay off, but they believe it's the right long-term strategy. Their approach allows them to:

  • Customize capabilities for specific use cases
  • Iterate quickly on model improvements
  • Control the entire technology stack
  • Build features that commercial models don't support

Future Roadmap

The brothers have outlined a clear roadmap for future development. Their immediate priorities include:

  • Post-training for physics and deformations
  • Distillation for speed optimization
  • Audio capabilities integration
  • Model scaling for improved performance

They've also maintained a detailed "lab notebook" of all their experiments in Notion, which they're willing to share with others interested in the technical details of building models from zero to one.

The model is released under the Apache 2.0 license, making it freely available for commercial and non-commercial use. This open-source approach aligns with their goal of democratizing access to advanced AI capabilities.

Looking Ahead

The release of this 2B parameter model represents more than just a technical achievement—it demonstrates that independent developers can compete in the advanced AI space with sufficient dedication and resources. The brothers' two-year journey from a 180p GIF bot to a sophisticated text-to-video model shows what's possible with focused effort.

While the model may not yet match the performance of commercial giants, it serves as a stepping stone toward state-of-the-art capabilities. The brothers' commitment to open-source development and transparent documentation could inspire other independent researchers to pursue similar projects.

As the AI landscape continues to evolve, projects like this highlight the importance of diversity in development approaches. Rather than relying solely on large corporate research labs, the field benefits from contributions from independent developers who bring different perspectives and priorities to the table.