Michelangelo: Using a Shape-Image-Text-Aligned Space to Create and Translate 3D Shapes

Introduction

Researchers from ShanghaiTech University, Tencent PCG, Fudan University, the Shanghai Engineering Research Center of Intelligent Vision and Imaging, and the Shanghai Engineering Research Center of Energy Efficient and Custom AI IC have developed a deep learning model that generates realistic and diverse 3D shapes from different modalities, such as images, text or sketches. These institutions are active in computer vision, natural language processing, artificial intelligence and 3D graphics. The motivation behind the model is to enable creative exploration and manipulation of 3D shapes using natural language and visual cues. The model is called 'Michelangelo'.

What is Michelangelo?

Michelangelo is a conditional generative model that learns a shared latent representation for 3D shapes, images and text. It pairs a variational auto-encoder (SITA-VAE) with a latent diffusion model (ASLDM), and uses the shared representation to generate 3D shapes that match a given condition, such as an image, a text description or a sketch. Michelangelo can also perform cross-modal translation, such as converting an image to a text description or a text description to a sketch.

Key Features of Michelangelo

Some of the key features of Michelangelo are:

It can generate high-quality 3D shapes that are realistic, diverse and consistent with the given condition.
It can handle complex and fine-grained conditions, such as multiple objects, attributes, poses and viewpoints.
It can generate 3D shapes from different modalities, such as images, text or sketches, and perform cross-modal translation between them.
It can generate 3D shapes in various formats, such as point clouds, voxels or meshes.
It can generate 3D shapes for different categories, such as animals, cars or chairs.

Capabilities/Use Case of Michelangelo

Michelangelo has many potential applications in various domains, such as:

Computer graphics and animation: Michelangelo can be used to create realistic and diverse 3D models for games, movies or virtual reality.
Computer vision and robotics: Michelangelo can be used to recognize and manipulate 3D objects from images or text.
Education and art: Michelangelo can be used to teach and learn about 3D shapes and their properties using natural language and visual cues.
Design and engineering: Michelangelo can be used to explore and prototype new 3D designs using sketches or descriptions.

How does Michelangelo work?

Michelangelo is a model that can create 3D shapes from different kinds of inputs, such as pictures, words or drawings. It can also change one kind of input into another, such as turning a picture into words or words into a drawing. To do this, Michelangelo uses two main parts: a SITA-VAE and an ASLDM.

source - https://neuralcarver.github.io/michelangelo/

The SITA-VAE (Shape-Image-Text-Aligned Variational Auto-Encoder) is like a translator that can speak three languages: 3D shapes, pictures and words. It can take any of these inputs and turn it into a code that the other parts can understand. For example, it can take a picture of a cat and turn it into a code that can be used to make a 3D shape of a cat. The SITA-VAE has three sub-parts: a shape encoder, an image encoder and a text encoder. Each sub-part takes one kind of input and turns it into a code. The codes are all in the same format, so they can be compared and mixed and matched. This is what lets the SITA-VAE translate between 3D shapes, pictures and words.
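To make the idea of codes "in the same format" concrete, here is a minimal sketch in plain Python. It is not the authors' code: the three-number codes and the names `shape_codes`, `text_codes`, `image_codes` and `nearest_shape` are made up for illustration, and cosine similarity is assumed as the way to compare codes.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical hand-written codes standing in for real encoder outputs.
# In an aligned space, matching shapes, texts and images land close together.
shape_codes = {
    "cat_mesh": [0.9, 0.1, 0.0],
    "chair_mesh": [0.0, 0.2, 0.9],
}
text_codes = {"a sitting cat": [0.8, 0.2, 0.1]}
image_codes = {"chair_photo": [0.1, 0.1, 0.8]}

def nearest_shape(query_code):
    """Return the name of the shape whose code is most similar to the query."""
    return max(
        shape_codes,
        key=lambda name: cosine_similarity(query_code, shape_codes[name]),
    )

print(nearest_shape(text_codes["a sitting cat"]))
print(nearest_shape(image_codes["chair_photo"]))
```

Because every code lives in the same space, the same `nearest_shape` function works for a text query and an image query without any modality-specific logic.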

The ASLDM (Aligned Shape Latent Diffusion Model) is like an artist that can make 3D shapes from codes. During training, randomness (noise) is added to a shape code a little at a time, and the ASLDM learns to undo it step by step. To create a new shape, it starts from a random code and gradually removes the noise until a clean shape code remains, which can then be turned into a smooth and realistic 3D shape.
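The gradual noising can be sketched as follows. This is a generic diffusion-style illustration, not the ASLDM itself: the linear schedule, the values `1e-4` and `0.02`, the 1000 steps and the function names are all assumptions chosen for the example, and the learned denoising network is left out.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise amounts that grow linearly (assumed values)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """Fraction of the original signal variance left after each step."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(code, t, alpha_bars, rng):
    """Jump straight to step t: shrink the code and mix in Gaussian noise."""
    signal = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    return [signal * x + noise_scale * rng.gauss(0.0, 1.0) for x in code]

betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)

rng = random.Random(0)
clean_code = [1.0, -1.0, 0.5]            # stand-in for a shape code
slightly_noisy = add_noise(clean_code, 0, alpha_bars, rng)
almost_pure_noise = add_noise(clean_code, 999, alpha_bars, rng)
print(slightly_noisy)
print(almost_pure_noise)
```

At step 0 the code is almost unchanged, while at the last step almost none of the original signal is left. A diffusion model is trained to reverse this process one step at a time.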

The ASLDM can also make 3D shapes from pictures or words. Because the SITA-VAE puts 3D shapes, pictures and words into the same kind of code, the code of a picture or a sentence can guide the ASLDM toward a 3D shape that matches it.

Performance Evaluation

To assess Michelangelo's capabilities, the researchers ran a series of tests on datasets of diverse 3D shapes, including ShapeNet and 3D Cartoon Monster. They also compared Michelangelo against several other 3D shape methods, including Occ, ConvOcc, IF-Net, 3DILG, and 3DS2V. The findings showed that Michelangelo outperformed these methods at producing accurate and diverse 3D shapes.

source - https://arxiv.org/pdf/2306.17115.pdf

To look more closely, the researchers measured how well Michelangelo could recreate a 3D shape from its code on the ShapeNet dataset. They also tested its ability to generate 3D shapes from both images and text, using a combined dataset of ShapeNet and 3D Cartoon Monster. The same codes and inputs were used for the other methods. Michelangelo performed best across most categories and produced 3D shapes that closely resembled the originals within each category.
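One common way to put a number on how closely a recreated shape resembles the original is intersection over union (IoU) on occupied voxels. This article does not name the metric used, so treat the sketch below as an assumed, illustrative choice; `voxel_iou` and the tiny example shapes are hypothetical.

```python
def voxel_iou(occupied_a, occupied_b):
    """Intersection over union of two sets of occupied (x, y, z) voxels."""
    a, b = set(occupied_a), set(occupied_b)
    union = a | b
    if not union:
        return 1.0  # two empty shapes are treated as identical
    return len(a & b) / len(union)

# Toy example: the recreated shape gets three of four voxels right
# and adds one voxel that is not in the original.
original = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (1, 0, 0)]
recreated = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (1, 1, 1)]

print(voxel_iou(original, recreated))  # 3 shared / 5 total = 0.6
```

A score of 1.0 means the two shapes occupy exactly the same voxels, and a score of 0.0 means they do not overlap at all.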

Across these evaluations and comparisons with other methods, Michelangelo produced accurate and diverse 3D shapes, which points to its potential for 3D shape generation.

How to access and use this model?

Michelangelo is open-source and can be run locally. You can find the source code, the pre-trained models and instructions for running the model in the project's GitHub repository. The model is licensed under the MIT License, which means you can use it for any purpose, as long as you give credit to the original authors.

The Michelangelo project page shows shapes produced by 3DS2V, 3DILG, and Michelangelo itself. You can move each result in different directions to examine it in three dimensions. The results are displayed side by side, which makes it easy to compare the generated shapes and to see how Michelangelo performs against other models for 3D shape generation and cross-modal translation.

If you are interested in learning more about the Michelangelo model, all relevant links are provided under the 'source' section at the end of this article.

Limitations

Michelangelo is a remarkable model that can generate realistic and diverse 3D shapes from different modalities, but it also has some limitations, such as:

It requires a large amount of data and computational resources to train and run the model.
It may generate shapes that are not semantically or physically plausible, especially for complex or rare conditions.
It may not capture all the details or variations of the input condition, especially for fine-grained attributes or poses.
It may not generalize well to unseen categories or modalities that are not in the training data.

Conclusion

Michelangelo is a novel and powerful model that can generate realistic and diverse 3D shapes from different modalities, such as images, text or sketches. However, it also has some limitations that need to be addressed in future work. Michelangelo is a creative and inspiring model that opens new possibilities for 3D shape generation and manipulation.

source
research paper - https://arxiv.org/abs/2306.17115
project details - https://neuralcarver.github.io/michelangelo/
GitHub Repo - https://github.com/NeuralCarver/michelangelo

SocialViews From TechWorld

Monday, 3 July 2023
