Point-E is OpenAI’s new system that produces 3D models from text prompts in only 1-2 minutes on a single GPU.
Generating 3D models previously looked very different from image generation with models such as DALL-E: those can typically produce images within seconds or minutes, while a state-of-the-art 3D model required multiple GPU-hours to produce a single sample, according to a group of OpenAI researchers in a post.
The method behind the project builds on a growing body of work on diffusion-based models and draws on two main categories of approaches to text-to-3D synthesis.
One category trains generative models directly on paired text and 3D data, which can leverage existing generative modeling approaches to produce samples efficiently, but is difficult to scale to diverse and complex text prompts.
The other leverages pre-trained text-to-image models to optimize differentiable 3D representations, which can handle complex and diverse text prompts but requires an expensive optimization process to produce each sample.
Point-E aims to get the best of both worlds by pairing a text-to-image model with an image-to-3D model.
Once a user enters a prompt such as “a corgi wearing a red Santa hat,” the model first generates a single synthetic rendered view with a text-to-image model, then uses diffusion models to produce a 3D RGB point cloud from that image.
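The two-stage pipeline can be sketched roughly as follows. This is a minimal illustrative stand-in, not the real openai/point-e API: the function names, shapes, and stubbed "models" are assumptions made for the example, with random data in place of actual diffusion sampling.

```python
import numpy as np

def text_to_image(prompt: str, size: int = 64) -> np.ndarray:
    """Stage 1 stand-in: a text-to-image diffusion model would return
    a synthetic rendered view of the object described by the prompt.
    Here we just return deterministic random pixels (hypothetical)."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.random((size, size, 3))  # H x W x RGB in [0, 1]

def image_to_point_cloud(image: np.ndarray, n_points: int = 1024) -> np.ndarray:
    """Stage 2 stand-in: an image-conditioned diffusion model would
    denoise random points into an RGB point cloud matching the view.
    Here we pair random positions with colors sampled from the image."""
    rng = np.random.default_rng(0)
    xyz = rng.standard_normal((n_points, 3))       # point positions
    rgb = image.reshape(-1, 3)[:n_points]          # per-point color
    return np.concatenate([xyz, rgb], axis=1)      # (N, 6): x, y, z, r, g, b

prompt = "a corgi wearing a red santa hat"
view = text_to_image(prompt)            # text -> single synthetic view
cloud = image_to_point_cloud(view)      # view -> RGB point cloud
print(cloud.shape)  # (1024, 6)
```

The design point the sketch captures is that the expensive text-conditioning happens once, in image space, and the 3D stage only has to map an image to points, which is what makes the pipeline fast enough to run in minutes on one GPU.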
The researchers say that even though the model’s samples are lower in quality than those of state-of-the-art techniques, it produces them in a fraction of the time.