BlobGEN-Vid: Compositional Text-to-Video Generation
with Blob Video Representations


Weixi Feng1
Chao Liu2
Sifei Liu2

William Yang Wang1
Arash Vahdat2
Weili Nie2

1 UC Santa Barbara 2 NVIDIA
Paper (arXiv) Code (Coming soon!)

Introducing BlobGEN-Vid in 2 minutes. (Best viewed on Chrome.)

Abstract

Existing video generation models struggle to follow complex text prompts and synthesize multiple objects, raising the need for additional grounding input for improved controllability. In this work, we propose to decompose videos into visual primitives -- the blob video representation, a general representation for controllable video generation. Based on blob conditions, we develop a blob-grounded video diffusion model named BlobGEN-Vid that allows users to control object motions and fine-grained object appearance. In particular, we introduce a masked 3D attention module that effectively improves regional consistency across frames. In addition, we introduce a learnable module to interpolate text embeddings so that users can control semantics in specific frames and obtain smooth object transitions. Our framework is model-agnostic: we build BlobGEN-Vid on both U-Net-based and DiT-based video diffusion models. Extensive experiments show that BlobGEN-Vid achieves superior zero-shot video generation ability and state-of-the-art layout controllability on multiple benchmarks. When combined with an LLM for layout planning, our framework even outperforms proprietary text-to-video generators in compositional accuracy.
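The frame-level semantic control described above rests on interpolating text embeddings between user-specified keyframes. As a rough illustration, here is a plain-Python sketch that interpolates linearly between the nearest keyframe embeddings; this is a hypothetical stand-in, since the paper replaces the fixed linear rule with a learnable interpolation module:

```python
def interpolate_embeddings(keyframes, num_frames):
    """Interpolate per-frame text embeddings from sparse keyframe embeddings.

    keyframes: {frame_index: embedding vector (list of floats)}
    Frames before the first / after the last keyframe are clamped to it;
    frames in between are linearly blended between the surrounding keyframes.
    (Hypothetical simplification: BlobGEN-Vid learns this interpolation.)
    """
    idxs = sorted(keyframes)
    out = []
    for t in range(num_frames):
        if t <= idxs[0]:
            out.append(list(keyframes[idxs[0]]))
        elif t >= idxs[-1]:
            out.append(list(keyframes[idxs[-1]]))
        else:
            lo = max(i for i in idxs if i <= t)   # nearest keyframe at or before t
            hi = min(i for i in idxs if i >= t)   # nearest keyframe at or after t
            if lo == hi:
                out.append(list(keyframes[lo]))
            else:
                w = (t - lo) / (hi - lo)
                out.append([(1 - w) * a + w * b
                            for a, b in zip(keyframes[lo], keyframes[hi])])
    return out
```

With keyframe embeddings at frames 0 and 4, intermediate frames receive smoothly blended embeddings, which is what lets a prompt like "fresh bread" transition gradually to "moldy bread" over the clip.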


BlobGEN-Vid Method Overview

BlobGEN-Vid proposes a model-agnostic design for controllable video generation. It inserts Masked Spatial Cross-Attention and Masked 3D Self-Attention layers into a pre-trained video diffusion model and finetunes them. The Masked 3D Self-Attention layers significantly improve object-level consistency across frames. BlobGEN-Vid can be adapted to both U-Net-based (with spatial-temporal attention) and DiT-based (with full 3D attention) video diffusion models.
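The idea behind masked 3D self-attention can be shown with a toy sketch: each spatio-temporal token attends only to tokens belonging to the same blob region, across all frames, so a region's appearance is aggregated consistently over time. The plain-Python example below is a hypothetical simplification (single head, no learned query/key/value projections), not the paper's implementation:

```python
import math

def masked_3d_attention(tokens, region_ids, dim):
    """Toy masked 3D self-attention over spatio-temporal tokens.

    tokens: list of feature vectors, one per token (all frames flattened).
    region_ids: blob region label per token; attention is masked so a token
    only attends to tokens of the same region, in any frame.
    (Hypothetical simplification of BlobGEN-Vid's masked 3D self-attention.)
    """
    n = len(tokens)
    out = []
    for i in range(n):
        # Scaled dot-product scores; different-region tokens are masked out.
        scores = []
        for j in range(n):
            if region_ids[j] == region_ids[i]:
                s = sum(a * b for a, b in zip(tokens[i], tokens[j])) / math.sqrt(dim)
            else:
                s = float("-inf")  # masked: token j is in another blob region
            scores.append(s)
        m = max(scores)
        exps = [math.exp(s - m) if s != float("-inf") else 0.0 for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Output is a convex combination of same-region token features only.
        out.append([sum(w * tokens[j][d] for j, w in enumerate(weights))
                    for d in range(dim)])
    return out
```

Because the softmax is restricted to same-region tokens, features from one object never leak into another object's region, which is the mechanism the page credits for improved object-level consistency across frames.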



BlobGEN-Vid Generation Results

We validate BlobGEN-Vid on multiple tasks and benchmarks, from multi-view image generation to text/layout-to-video generation. BlobGEN-Vid outperforms all layout-guided video generation baselines in controllability and object consistency, and surpasses proprietary video generators such as Gen-3 and Kling 1.0 in compositional correctness. See our paper for detailed experimental results.

Click the tabs below to visualize results of each task.

 
BlobGEN-3D
BlobGEN-Vid (Ours)
BlobGEN-Vid w/ visualized blobs (Ours)
 
Pika
Dream Machine
Kling
Gen-3
BlobGEN-Vid (Ours)

Spatial Relationships: A cat sitting on the left of a fireplace.

Spatial Relationships: A sheep grazing on the left of a surfboard on a sandy beach.

Motion Binding: A robot walking from right to left across the moon with a car driving left to right in the background.

Motion Binding: A toy car drives from left to right, passing miniature buildings and trees

Dynamic Attribute Binding: A timelapse of a piece of bread initially fresh, then growing moldy.

Dynamic Attribute Binding: Clear ice cube melts into shapeless water

 
CogVideoX-5B
Pika
Dream Machine
Kling
Gen-3
BlobGEN-Vid (Ours)

A cake frosting changing from vanilla white to sunset orange.

A piece of fruit dropping from a tree into a basket underneath.

A leaf falls from a tree, landing on a floating lake surface.

Dew drops fall from a leaf to the ground.

A statue with sunrise changing to sunset around it.

 
LVD
VideoTetris
TrackDiffusion
BlobGEN-Vid (Ours)

The video shows a tiger lying on the ground, looking directly at the camera. It appears to be in a zoo enclosure, and there are trees and a building in the background. The tiger is seen licking its paw, and the camera zooms in on its face.

The video shows a close-up of a zebra's face, with its eyes and nose clearly visible. The zebra appears to be in a natural habitat, possibly a savannah or grassland, with trees and a blue sky in the background. The zebra's stripes are distinct and its eyes are open, giving a clear view of its facial features.

The video shows a group of colorful birds, including parrots and parakeets, perched on a wooden stand and eating from a tray of seeds. One bird is yellow, another is green, and the third is blue. They are in a cage, and the camera zooms in on the yellow bird as it eats.

The video shows a snowboarder performing a trick on a ramp, launching off a rail, and landing on the snow. The snowboarder is wearing a white jacket and black pants, and the ramp has a red roof. The scene is set against a backdrop of a snowy mountain with clouds in the sky.

The video shows two tigers walking in a zoo enclosure. One tiger is walking towards the camera while the other is walking away. They are surrounded by rocks and trees, and the ground is covered in dirt.



BibTex:

@article{feng2025blobgen-vid,
  title={BlobGEN-Vid: Compositional Text-to-Video Generation with Blob Video Representations},
  author={Feng, Weixi and Liu, Chao and Liu, Sifei and Wang, William Yang and Vahdat, Arash and Nie, Weili},
  year={2025},
}

Back to top