Generative AI has rapidly evolved from text-only models to multimodal systems that handle tasks such as image captioning and visual question answering. This shift aims to make AI more human-like by enabling it to interpret and generate content across modalities such as images, video, and text. Generative AI is now expanding into video, which is set to unlock new potential in industries such as robotics, automotive, and retail.
In robotics, video AI is enhancing autonomous navigation in complex environments and improving processes such as warehouse management and manufacturing. The automotive sector also stands to benefit, with video AI being used to advance autonomous driving, enhance vehicle perception, and increase safety. These advancements point to a future in which AI models can process and understand complex visual data, making the systems built on them more efficient and more predictive.
To develop high-performing image and video foundation models, developers need to curate and preprocess massive amounts of training data. The quality of this data is crucial, as it directly impacts the performance of generative models. The data must then be tokenized with high fidelity, the models trained efficiently at scale, and finally high-quality images and videos generated during inference.
NVIDIA recently announced an expansion of its NeMo platform to support the full end-to-end pipeline for building multimodal generative AI models. NeMo is designed to help developers curate high-quality visual data, accelerate training, and streamline customization. NeMo’s tools include tokenizers and parallelism techniques that speed up data processing and make it possible to reconstruct high-quality visuals during inference.
One of the major challenges in developing generative AI models is building efficient data processing pipelines that can handle large volumes of data. To address this, NVIDIA has introduced NeMo Curator, which accelerates data curation and reduces the total cost of ownership. NeMo Curator is designed to scale, enabling developers to process petabytes of data. It also features a highly optimized pipeline that processes video up to 7 times faster than previous GPU-based implementations.
NeMo Curator provides optimized models for high-throughput filtering, captioning, and embedding, which significantly improve dataset quality. Higher-quality training data in turn makes it easier to build more accurate AI models. For example, NeMo Curator’s optimized captioning model delivers much higher throughput than traditional models, which can improve the overall efficiency of the training process.
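The filter, caption, and embed stages described above can be pictured as a simple streaming pipeline. The following is only a minimal sketch in plain Python, not the NeMo Curator API: the `Clip` record, the `quality` score, and the `captioner` and `embedder` callables are hypothetical stand-ins for the optimized models the article mentions.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Optional


@dataclass
class Clip:
    """Hypothetical record for one video clip moving through curation."""
    path: str
    quality: float                      # assumed score in [0, 1] from some quality model
    caption: Optional[str] = None
    embedding: Optional[List[float]] = None


def filter_clips(clips: Iterable[Clip], min_quality: float) -> Iterator[Clip]:
    """Drop clips whose quality score falls below the threshold."""
    for clip in clips:
        if clip.quality >= min_quality:
            yield clip


def caption_clips(clips: Iterable[Clip],
                  captioner: Callable[[Clip], str]) -> Iterator[Clip]:
    """Attach a caption produced by the supplied captioning callable."""
    for clip in clips:
        clip.caption = captioner(clip)
        yield clip


def embed_clips(clips: Iterable[Clip],
                embedder: Callable[[Clip], List[float]]) -> Iterator[Clip]:
    """Attach an embedding produced by the supplied embedding callable."""
    for clip in clips:
        clip.embedding = embedder(clip)
        yield clip


def curate(clips: Iterable[Clip], min_quality: float,
           captioner: Callable[[Clip], str],
           embedder: Callable[[Clip], List[float]]) -> List[Clip]:
    """Chain the three stages lazily; only clips that survive filtering are captioned."""
    kept = filter_clips(clips, min_quality)
    captioned = caption_clips(kept, captioner)
    return list(embed_clips(captioned, embedder))


# Toy usage with placeholder models (real ones would be GPU-backed).
raw = [Clip("a.mp4", 0.9), Clip("b.mp4", 0.2), Clip("c.mp4", 0.7)]
curated = curate(
    raw,
    min_quality=0.5,
    captioner=lambda c: f"placeholder caption for {c.path}",
    embedder=lambda c: [float(len(c.caption or "")), c.quality],
)
```

Filtering first means the more expensive captioning and embedding steps only run on clips that will actually be kept, which is one reason stage order matters at large scale.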
In addition to data curation, NVIDIA also introduced Cosmos tokenizers, which play a key role in the tokenization of visual data. These tokenizers map redundant and implicit visual information into compact, semantic tokens, enabling generative models to process large-scale data more efficiently and with fewer computational resources. Traditional video and image tokenizers have often produced low-quality reconstructions, with distorted images and temporally unstable videos. Cosmos tokenizers offer superior visual representation with faster encoding and decoding, making them well suited to building high-performance generative AI models.
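To get a feel for what "compact" means here, the arithmetic below estimates how many tokens a clip becomes after temporal and spatial downsampling. It is a rough sketch: the function names are hypothetical, and the compression factors (8 in time, 16 in each spatial dimension) are illustrative assumptions rather than figures taken from the announcement.

```python
import math


def latent_grid(frames: int, height: int, width: int,
                temporal_factor: int, spatial_factor: int) -> tuple:
    """Size of the token grid after downsampling each axis by its factor."""
    return (
        math.ceil(frames / temporal_factor),
        math.ceil(height / spatial_factor),
        math.ceil(width / spatial_factor),
    )


def token_count(frames: int, height: int, width: int,
                temporal_factor: int, spatial_factor: int) -> int:
    """Total number of tokens: one per cell of the latent grid."""
    t, h, w = latent_grid(frames, height, width, temporal_factor, spatial_factor)
    return t * h * w


# Illustrative clip: 32 frames at 512x512, with assumed factors 8 (time) and 16 (space).
tokens = token_count(32, 512, 512, temporal_factor=8, spatial_factor=16)
pixels = 32 * 512 * 512
ratio = pixels / tokens
print(tokens, ratio)  # 4096 2048.0
```

Under these assumed factors, each token stands in for 2048 pixel positions, which is why a downstream generative model can work with far shorter sequences than the raw video would require.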
The Cosmos tokenizer employs an advanced encoder-decoder structure with 3D causal convolution blocks that effectively process spatiotemporal data. This method improves learning efficiency and enables faster tokenization with reduced computational costs. During inference, the Cosmos tokenizer is able to reconstruct visuals up to 12 times faster than leading open-weight tokenizers while maintaining high image and video quality.
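The "causal" part of a 3D causal convolution means that the output for a given frame depends only on that frame and earlier ones, never on future frames. The sketch below illustrates that property with a deliberately slow, pure-Python convolution over nested lists. It is an assumption-laden toy (single channel, zero padding, hypothetical function name `causal_conv3d`), not the Cosmos tokenizer's implementation.

```python
def causal_conv3d(video, kernel):
    """Single-channel 3D convolution that is causal along the time axis.

    video:  nested list indexed [t][y][x]
    kernel: nested list indexed [dt][dy][dx], with odd spatial sizes
    Time is padded only on the past side, so out[t] never reads frames after t.
    Space is zero-padded symmetrically, so the output keeps the input shape.
    """
    T, H, W = len(video), len(video[0]), len(video[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    ph, pw = kh // 2, kw // 2
    out = [[[0.0] * W for _ in range(H)] for _ in range(T)]
    for t in range(T):
        for y in range(H):
            for x in range(W):
                acc = 0.0
                for dt in range(kt):
                    # dt = kt - 1 is the current frame; smaller dt reaches further back.
                    src_t = t - (kt - 1) + dt
                    if src_t < 0:
                        continue  # zero padding before the first frame
                    for dy in range(kh):
                        src_y = y + dy - ph
                        if not 0 <= src_y < H:
                            continue
                        for dx in range(kw):
                            src_x = x + dx - pw
                            if 0 <= src_x < W:
                                acc += kernel[dt][dy][dx] * video[src_t][src_y][src_x]
                out[t][y][x] = acc
    return out


# Toy check of causality: changing the last frame must not affect earlier outputs.
video = [[[float(t + y + x) for x in range(3)] for y in range(3)] for t in range(4)]
kernel = [[[1.0] * 3 for _ in range(3)] for _ in range(2)]  # 2 frames deep, 3x3 in space
baseline = causal_conv3d(video, kernel)

edited = [[row[:] for row in frame] for frame in video]
edited[3] = [[100.0] * 3 for _ in range(3)]
after_edit = causal_conv3d(edited, kernel)
```

Here `baseline[:3]` and `after_edit[:3]` are identical, while the outputs for the last frame differ. Because the first frame is processed without looking ahead, a design like this can in principle treat a single image as a one-frame video.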
NVIDIA’s innovations in data curation and tokenization, through tools like NeMo Curator and Cosmos tokenizers, are empowering developers to build powerful multimodal AI models capable of processing complex visual data. By enabling more efficient and high-quality video and image AI systems, these tools bring developers closer to creating advanced generative AI solutions that can impact industries such as robotics, automotive, and beyond.