Tuesday, 2 July 2024

Video-Infinity: Multi-GPU Open-Source AI for Long Video Generation

Introduction

Diffusion models have transformed generative AI and are now widely used to generate images and videos. These models progressively refine noisy data to produce high-quality, realistic images and short video clips. They still face obstacles, however.

The first challenge is generating long videos. Video generation on a single GPU is expensive in both memory and time, which has limited existing video diffusion models to short sequences, typically a few seconds long.

A team of researchers at the National University of Singapore created Video-Infinity to overcome this limitation of current diffusion models and generate much longer videos. Their primary goal was a model that generates very long videos without compromising quality and without requiring impractically large amounts of additional training.

What is Video-Infinity?

Video-Infinity is a cutting-edge AI model designed for long video generation. It stands out from conventional models by leveraging a distributed inference pipeline. This unique approach enables the model to parallelize the processing of video frames across multiple GPUs. As a result, Video-Infinity can generate videos that extend far beyond the typical few seconds produced by other models.

Key Features of Video-Infinity

  • Clip Parallelism: This innovative feature optimizes the gathering and sharing of context information across GPUs. By doing so, it minimizes communication overhead, enhancing the model’s efficiency.
  • Dual-Scope Attention: This mechanism modulates temporal self-attention. It efficiently balances local and global contexts across devices, ensuring the model’s output remains coherent and high-quality.
  • High-Speed Generation: Video-Infinity can generate up to 2,300 frames in approximately 5 minutes on a setup of 8 Nvidia 6000 Ada GPUs. The authors report that this is 100 times faster than prior methods.
    Multiple GPUs generate a complete video in parallel, producing 2,300 frames in 5 minutes
    source - https://arxiv.org/pdf/2406.16260

Capabilities/Use Case of Video-Infinity

  • Film Production: Video-Infinity can create long, high-quality videos quickly, which could make it a convenient tool for film production.
  • Video Game Development: The model can be used to create realistic, immersive environments.
  • Virtual Reality: Virtual reality depends on long sequences of high-quality video, which Video-Infinity could supply.
  • Machine Learning Training: Large-scale video data from Video-Infinity could enhance machine learning training datasets.

How Does Video-Infinity Work?

Video-Infinity operates on a divide-and-conquer strategy, breaking down the task of long video generation into smaller segments. These segments are then distributed across multiple GPUs, enabling parallel processing.

(a) Pipeline of Video-Infinity   (b) Illustration of Clip parallelism
source -  https://arxiv.org/pdf/2406.16260

The core of Video-Infinity’s architecture, as depicted in the figure above, is the segmentation of the video latent into chunks. These chunks are distributed across multiple devices, each handling non-overlapping frames. This partitioning allows parallel denoising on different devices, a process known as Clip parallelism. Clip parallelism also synchronizes temporal information across devices efficiently, which keeps the clips held on different devices coherent with one another.
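To make the partitioning concrete, here is a minimal sketch of splitting a video's frame indices into contiguous, non-overlapping clips, one per device. The function name `split_into_clips` and the rule of giving leftover frames to the first devices are assumptions made for illustration; the actual implementation in the repository may split the latent differently.

```python
def split_into_clips(num_frames, num_devices):
    """Split frame indices 0..num_frames-1 into contiguous, non-overlapping clips.

    Hypothetical helper for illustration; not taken from the Video-Infinity code.
    """
    if num_devices <= 0:
        raise ValueError("num_devices must be positive")
    base, extra = divmod(num_frames, num_devices)
    clips = []
    start = 0
    for device in range(num_devices):
        # The first `extra` devices take one more frame so every frame is assigned.
        length = base + (1 if device < extra else 0)
        clips.append(range(start, start + length))
        start += length
    return clips


# Frame and device counts taken from the article's 8-GPU, 2,300-frame example.
clips = split_into_clips(num_frames=2300, num_devices=8)
for device, clip in enumerate(clips):
    print(f"device {device}: frames {clip.start}..{clip.stop - 1} ({len(clip)} frames)")
```

Each device can then denoise its own clip independently, as long as the temporal layers receive context from the other clips.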

The diffusion model predicts noise in parallel with communication, and the predicted noises are concatenated to produce the final output. In each layer of the video diffusion model, spatial modules operate independently on each device, whereas temporal modules synchronize context elements across devices. Peer-to-peer and collective communications are used for this synchronization.
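The two communication patterns mentioned above can be simulated on plain Python lists, with no GPUs involved. This is only a sketch: it assumes that neighbouring devices exchange a few boundary frames (peer-to-peer) and that every device contributes frames to a shared pool (an all-gather style collective). The helper names and the choice of which frames are sent are hypothetical.

```python
def exchange_with_neighbours(clips, k):
    """Peer-to-peer step: each device receives k boundary frames from each neighbour."""
    if k <= 0:
        raise ValueError("k must be positive")
    received = []
    for i in range(len(clips)):
        # Last k frames of the previous clip, first k frames of the next clip.
        from_prev = clips[i - 1][-k:] if i > 0 else []
        from_next = clips[i + 1][:k] if i < len(clips) - 1 else []
        received.append((from_prev, from_next))
    return received


def all_gather(per_device_items):
    """Collective step: every device ends up with all devices' contributions."""
    gathered = [item for items in per_device_items for item in items]
    return [list(gathered) for _ in per_device_items]


# Three simulated devices with four frames each; frames are stood in for by their indices.
clips = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
neighbour_context = exchange_with_neighbours(clips, k=2)
# Assumption: each device shares its first frame as its part of the global context.
global_context = all_gather([clip[:1] for clip in clips])

print(neighbour_context[1])
print(global_context[0])
```

In a real multi-GPU run the lists would be tensors and the two helpers would be replaced by the point-to-point and collective primitives of a distributed communication library.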

Furthermore, Video-Infinity incorporates Dual-scope attention, which modulates temporal attention to ensure training-free long video coherence. This attention module revises the computation of Key-Value pairs to incorporate both local and global contexts into the attention, reducing the communication overhead and enhancing the coherence of long videos.
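One way to picture the two scopes is as two sets of frame indices whose keys and values a clip attends to. The sketch below assumes the local scope is the clip plus a margin of neighbouring frames and the global scope is a handful of frames sampled at a uniform stride across the whole video. The function name, the parameters `local_margin` and `num_global`, and the sampling rule are all illustrative assumptions, not the paper's exact formulation.

```python
def dual_scope_context(clip, num_frames, local_margin, num_global):
    """Return (local, global) frame indices that would feed a clip's keys and values.

    Hypothetical helper for illustration; the real selection rule may differ.
    """
    if num_global < 2:
        raise ValueError("num_global must be at least 2")
    # Local scope: the clip's own frames plus a margin on each side,
    # clamped to the boundaries of the video.
    lo = max(0, clip.start - local_margin)
    hi = min(num_frames, clip.stop + local_margin)
    local_ctx = list(range(lo, hi))
    # Global scope: frames spread evenly from the first to the last frame.
    global_ctx = [(i * (num_frames - 1)) // (num_global - 1) for i in range(num_global)]
    return local_ctx, global_ctx


# A 24-frame video where this device holds frames 8..15.
local_ctx, global_ctx = dual_scope_context(
    clip=range(8, 16), num_frames=24, local_margin=2, num_global=4
)
print("local context :", local_ctx)
print("global context:", global_ctx)
```

Because the global scope is a small fixed number of frames rather than the whole video, the amount of data each device has to receive stays bounded as the video gets longer.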

Performance Evaluation of Video-Infinity

The authors assessed Video-Infinity in a series of experiments. The base model was VideoCrafter2, a text-to-video model that performs well, and the evaluation metrics came from VBench, a benchmark that scores videos along several dimensions.

Comparison of maximum frames and generation times for different methods.
source -  https://arxiv.org/pdf/2406.16260

As the table above shows, Video-Infinity is superior to the other methods in both capacity and efficiency. It produced videos of 2,300 frames at 512 × 320 resolution, which is roughly 95 seconds at 24 frames per second. The whole computation took roughly 5 minutes, thanks to efficient communication and multi-GPU parallelism. Video-Infinity also has the shortest generation time among the compared methods, both for short videos of 128 frames and for long videos of 1,024 frames.
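The figures above can be checked with a few lines of arithmetic. The snippet uses only the numbers quoted in the article (2,300 frames, 24 frames per second, about 5 minutes). The derived throughput and the ratio of generation time to playback time are back-of-the-envelope values, not results reported by the authors.

```python
frames = 2300
fps = 24
generation_minutes = 5  # approximate figure quoted in the article

generation_s = generation_minutes * 60
duration_s = frames / fps           # playback length of the video in seconds
throughput = frames / generation_s  # frames generated per second of wall-clock time
ratio = generation_s / duration_s   # seconds of compute per second of video

print(f"video length     : {duration_s:.1f} s")
print(f"throughput       : {throughput:.2f} frames/s")
print(f"compute per video: {ratio:.2f} s per second of video")
```

The video length works out to just under 96 seconds, consistent with the roughly 95 seconds quoted above, and generation takes a little over three times as long as playback.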

Evaluation metrics
source -  https://arxiv.org/pdf/2406.16260

In terms of quality, Video-Infinity maintains better consistency and produces more motion in its videos. Compared with the base model VideoCrafter2 in the table above, Video-Infinity scores slightly lower on every metric except the dynamic one. When generating longer 192-frame videos, where StreamingT2V is the only other method capable of producing videos of this length, Video-Infinity surpasses StreamingT2V on most metrics. These results suggest that Video-Infinity is more effective than the compared methods at generating long videos.

Comparison of Video Generation Models

AI models are rapidly pushing the boundaries of video generation. Notable examples include Video-Infinity, FREENOISE, VideoCrafter2, and Open-Sora. Each has specific capabilities that address different challenges in the field.

Video-Infinity is unique in its distributed inference pipeline, which enables long video generation by spreading the workload over multiple GPUs.

FREENOISE, on the other hand, is a tuning-free method that reschedules noise so that pre-trained video diffusion models can generate longer videos with little additional time cost. VideoCrafter2 addresses the data limitations of earlier high-quality video diffusion models by making use of low-quality videos together with synthetic high-quality images. Open-Sora is a text-to-video generative AI model capable of creating videos up to one minute long, with a focus on generating high-quality videos efficiently.

However, in long video generation, Video-Infinity is at the cutting edge. Its architecture distributes work across several GPUs, which makes long video generation fast. In this respect it outperforms FREENOISE, VideoCrafter2, and Open-Sora, which lack such distributed processing.

All of these models have their strengths, but Video-Infinity stands out for generating very long videos. Distributing the workload over many GPUs lets it do so in very little time, which makes it a strong choice for jobs that need lengthy videos generated quickly. As AI continues to advance, models such as Video-Infinity keep pushing the frontier and opening up new possibilities in video generation.

How to Access and Use this Model?

Video-Infinity is an open-source model that you can access on GitHub. If you're interested in using this model, you can find it in the repository along with step-by-step instructions on how to set it up and use it on your own machine. The model is licensed under CC BY 4.0, which means you can use it for both academic and commercial purposes as long as you meet the conditions specified in the license.

If you would like to read more details about this AI model, the sources are all included at the end of this article in the 'source' section.

Limitations 

Despite its advancements, Video-Infinity faces challenges:

  1. Dependency on Multiple GPUs: Effective utilization of Video-Infinity requires access to multiple GPUs.
  2. Scene Transitions: The model struggles with generating videos that involve scene transitions.

Conclusion

Video-Infinity is a major step forward for AI and diffusion models. It aims to remove the length limits of current models and enables the production of extended, high-quality videos. There are still obstacles to overcome, but with further research and development, Video-Infinity has considerable potential.


Source
research paper : https://arxiv.org/abs/2406.16260
research document : https://arxiv.org/pdf/2406.16260
GitHub repo : https://github.com/Yuanshi9815/Video-Infinity
Project details : https://video-infinity.tanzhenxiong.com/
