We are excited to introduce the Pathways Autoregressive Text-to-Image model (Parti), an autoregressive text-to-image generation model that produces high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge.
Recently, diffusion models such as Google’s Imagen have demonstrated impressive capabilities and state-of-the-art performance on text-to-image research benchmarks. Parti and Imagen explore two different families of generative models, autoregressive and diffusion respectively, which opens exciting possibilities for combining these powerful models to unlock even greater potential.
Parti approaches text-to-image generation as a sequence-to-sequence modeling problem, akin to machine translation. This lets Parti benefit from advances in large language models, especially the capabilities unlocked by scaling data and model sizes. In Parti’s case, instead of translating between languages, the model translates text prompts into sequences of image tokens. To encode images as sequences of discrete tokens, Parti uses the powerful image tokenizer ViT-VQGAN, whose decoder can reconstruct such token sequences into visually diverse, high-quality images. Here are some of the results we have observed:
- Consistent quality improvements by scaling Parti’s encoder-decoder up to 20 billion parameters.
- Achieving a state-of-the-art zero-shot FID score of 7.23 and a finetuned FID score of 3.22 on MS-COCO.
- Demonstrating effectiveness across a wide variety of categories and difficulty aspects through analysis on Localized Narratives and PartiPrompts, our new holistic benchmark comprising over 1600 English prompts released as part of this work.
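To make the sequence-to-sequence framing above concrete, here is a toy sketch of generating an image as a sequence of discrete tokens. This is not Parti’s actual code: the vocabulary size, token-grid size, and the random stand-in for the decoder are all assumptions for illustration only.

```python
import numpy as np

# Toy sketch of text-to-image generation as next-token prediction.
# A real system would use a Transformer encoder-decoder and a ViT-VQGAN
# detokenizer; the random "decoder" below is a placeholder.

VOCAB_SIZE = 8192      # assumed discrete image-token vocabulary size
GRID_TOKENS = 32 * 32  # assumed number of image tokens per image

def toy_decoder_logits(text_tokens, prefix, rng):
    """Placeholder for decoder(encode(text), prefix) -> next-token logits."""
    return rng.normal(size=VOCAB_SIZE)

def generate_image_tokens(text_tokens, seed=0):
    """Autoregressively produce a full sequence of discrete image tokens."""
    rng = np.random.default_rng(seed)
    tokens = []
    for _ in range(GRID_TOKENS):
        logits = toy_decoder_logits(text_tokens, tokens, rng)
        tokens.append(int(np.argmax(logits)))  # greedy; real models sample
    # A detokenizer (e.g. ViT-VQGAN) would then map these tokens to pixels.
    return tokens

tokens = generate_image_tokens(text_tokens=[101, 202, 303])
```

The key design choice this sketch highlights is that the image is treated as just another token sequence, so the same scaling recipes that work for large language models apply directly.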
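For reference, the FID scores above measure the Fréchet distance between Gaussian fits to Inception features of real and generated images (lower is better). A minimal NumPy sketch of that distance follows; the feature-extraction step is omitted, and this is not the evaluation code used in the paper.

```python
import numpy as np

def _sqrtm_psd(a):
    """Matrix square root of a symmetric positive semi-definite matrix."""
    w, v = np.linalg.eigh(a)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between two Gaussians, the core of the FID metric."""
    diff = mu1 - mu2
    s1_half = _sqrtm_psd(sigma1)
    # Tr((sigma1 @ sigma2)^(1/2)) computed via the equivalent symmetric form.
    covmean = _sqrtm_psd(s1_half @ sigma2 @ s1_half)
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

# Identical distributions give distance 0; with identity covariances the
# distance reduces to the squared difference of the means.
mu, eye = np.zeros(2), np.eye(2)
print(frechet_distance(mu, eye, mu, eye))                    # 0.0
print(frechet_distance(mu, eye, np.array([3.0, 4.0]), eye))  # 25.0
```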
Scaling from 350M to 20B parameters
Parti is implemented in Lingvo and scaled with GSPMD on TPU v4 hardware for both training and inference. This enabled us to train a 20B parameter model, which has achieved record-breaking performance across multiple benchmarks. We conducted detailed comparisons of four scales of Parti models: 350M, 750M, 3B, and 20B. Our observations revealed consistent and substantial improvements in model capabilities and output image quality.
In particular, when comparing the 3B and 20B models, human evaluators preferred the latter for image realism/quality (63.2% preference) and image-text match (75.9% preference). The 20B model especially excels at prompts that are abstract, that require world knowledge or specific perspectives, or that involve writing and symbol rendering. To compare Parti models across different scales, consider the following prompts:
- A portrait photo of a kangaroo wearing an orange hoodie and blue sunglasses standing on the grass in front of the Sydney Opera House, holding a sign on the chest that says “Welcome Friends!”
- A green sign that says “Very Deep Learning” at the edge of the Grand Canyon. Puffy white clouds fill the sky.
- A photo of an astronaut riding a horse in the forest. There is a river in front of them with water lilies.
- A map of the United States made out of sushi. It is on a table next to a glass of red wine.
- A squirrel giving an apple to a bird.
- The back of a violin.
Explore these prompts to see the capabilities of Parti models across different scales: 350M, 750M, 3B, and 20B.
Composing real-world knowledge
The most fascinating aspect of text-to-image generation is the ability to create scenes that have never been seen before. Parti excels in managing long and complex prompts that require accurate reflection of world knowledge, composition of multiple participants and objects with fine-grained details and interactions, and adherence to specific image formats and styles.
Let’s explore some examples of prompts and their corresponding output images to observe how Parti responds to changes in participants, activities, descriptions, locations, and formats:
- A raccoon wearing formal clothes, a top hat, and holding a cane. The raccoon is also carrying a garbage bag. The output can be generated in various styles, such as oil painting in the style of Rembrandt, Vincent Van Gogh, Hokusai, pixel art, abstract cubism, or Egyptian tomb hieroglyphics.
- A portrait of a tiger wearing a train conductor’s hat and holding a skateboard with a yin-yang symbol on it. The output can be generated in different styles, including photograph, comic book illustration, oil painting, marble statue, charcoal sketch, woodcut, child’s crayon drawing, color ink-and-wash drawing, or Chinese ink and wash painting.
- A person standing in front of Loch Awe with Kilchurn Castle behind them. Variations place the subject driving a speed boat near the Golden Gate Bridge, car surfing on a taxi cab in New York City, or riding a motorcycle in Rio de Janeiro with Dois Irmãos in the background.
- A teddy bear wearing a motorcycle helmet and cape, captured in a DSLR photo.
- A composition of a maple leaf, palm tree, four-leaf clover, lotus flower, panda, teddy bear, crocodile, and dragonfly, formed into a photo made of water.
- A photo of an Athenian vase with a painting depicting various sports, such as tennis, soccer, and basketball, in the style of Egyptian hieroglyphics.
- An imaginative scenario featuring a tornado made of sharks, tigers, and bees crashing into a skyscraper. The output can be rendered in the style of Hokusai, abstract cubism, or watercolor.
As part of this work, we present PartiPrompts (P2), an extensive collection of over 1600 English prompts. P2 serves as a valuable benchmark for measuring the capabilities of text-to-image models across various categories and challenge aspects.
The prompts range from simple ones, allowing us to assess the progress achieved through scaling, to complex descriptions that require a higher level of creativity and understanding. For instance, we created a detailed 67-word description for Vincent van Gogh’s “The Starry Night” (1889), an oil-on-canvas painting capturing a blue night sky with roiling energy, a bright yellow crescent moon, exploding yellow stars, radiating swirls of blue, a distant village, a flame-like cypress tree, and a church spire rising over rolling blue hills. P2 prompts offer a comprehensive assessment of model capabilities.
Discussion and limitations
While we present selected images from a large set of examples generated during prompt exploration and modification, it is important to acknowledge Parti’s limitations. In the paper, we discuss these limitations in depth, provide examples of failure cases, and identify areas for future improvement. One observed challenge is the improper handling of negation or indications of absence, which leads to suboptimal outputs. We take seriously the responsibility to address these limitations and to mitigate risks related to bias, safety, disinformation, and visual communication.
Responsibility and broader impact
Text-to-image models like Parti present a plethora of opportunities and risks, impacting areas such as bias and safety, visual communication, disinformation, and creativity and art. We recognize the potential risks of encoding harmful stereotypes and representations, similar to Imagen. The training data used for models like Parti often contains biases, affecting the representation of people from diverse backgrounds. This can result in stereotypical depictions of individuals in certain professions or events. To address these concerns, we have chosen not to release our Parti models, code, or data for public use without implementing further safeguards.
Moving forward, we plan to focus on measuring and mitigating model biases, applying strategies like prompt and output filtering, and recalibrating the model. We believe it is possible to leverage text-to-image generation models to uncover biases in large image-text datasets and explore new artistic styles in collaboration with artists. Our goal is to ensure that these models augment human creativity and productivity while delivering responsible and diverse visual experiences. We remain committed to conducting further research and development to create a positive impact on the field of text-to-image generation.
Parti is the result of collaborative efforts among authors across multiple Google Research teams. We extend our gratitude to all individuals who have contributed to this project, including researchers, reviewers, and support teams. Their valuable insights and assistance have been instrumental in the development and refinement of Parti. Special thanks to the Imagen team for sharing their insights and findings, which greatly influenced the final Parti model. We also acknowledge the importance of safeguarding against biases and ensuring responsible use of these models in art and creativity.