
DALL-E: AI Image Generator by OpenAI

January 5, 2021

OpenAI's newest and most intriguing creation is DALL-E, which can be summed up as "GPT-3 for images." It produces illustrations, photos, renderings, or whatever visual form you prefer from a text description, ranging from "a cat wearing a bow tie" to "a daikon radish in a tutu walking a dog." But it's premature to predict the demise of the stock photography and illustration industries.

As usual, OpenAI's description of its invention is quite readable and not overly technical. But a bit of context helps.

The researchers behind GPT-3 built an AI that, given a text prompt, attempts to generate a plausible version of whatever it describes. So if you give it "a story about a child who finds a witch in the woods," it will try to write that story; run it again and it will write a different one, and so on, as many times as you like.

These generations vary in quality; some are incoherent, while others are nearly indistinguishable from human writing. Crucially, though, the system doesn't produce gibberish or glaring grammatical errors, which makes it suitable for a wide variety of tasks, as startups and researchers alike are now exploring.

DALL-E (a portmanteau of Dalí and WALL-E) takes this concept a step further. Turning text into images is something AI systems have pursued for years, with steadily improving results. Here, the system uses the language understanding and contextual awareness of GPT-3 and its underlying structure to create an image that matches the prompt it's given.

As OpenAI explains:

What this means is that an image generator of this kind can be manipulated intuitively, simply by telling it what to make. You could dig into its guts, find the tokens that represent color, and decode their pathways so you can activate and change them, the way you might stimulate neurons in a living brain. But you wouldn't bother doing that when asking a staff artist to change a color. You'd just say "a blue car" instead of "a green car," and they'd understand.

So it is with DALL-E, which parses these prompts and rarely stumbles in any serious way, though it should be said that even among the best of many attempts, the images it produces are often a bit… off. More on that later.

In the OpenAI post, the researchers offer copious interactive examples showing how the system can be told to produce slight variations on the same concept, with plausible and often impressive results. These systems can be brittle, as the researchers admit DALL-E is in some ways: asking for "a green leather handbag shaped like a pentagon" might return exactly that, while "a blue suede handbag shaped like a pentagon" produces something unsettling instead. Why? It's hard to say, given the black-box nature of these systems.

Image Credits: OpenAI

But DALL-E is remarkably robust to such changes, and reliably produces pretty much whatever you ask for. A torus of guacamole, a sphere of zebra stripes; a big blue cube sitting on a small red cube; a front view of a happy capybara, an isometric view of a sad capybara; and so on. You can play with the examples in the original post.
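The interactive examples in OpenAI's post work by filling slots in a template prompt ("a {color} {material} handbag shaped like a {shape}") and regenerating for each combination. Here's a minimal sketch of that pattern in Python; the image model itself is out of scope, so this only enumerates the prompts that would be sent to it:

```python
from itertools import product

def prompt_variations(template, **slots):
    """Yield the template filled with every combination of slot values."""
    keys = list(slots)
    for values in product(*(slots[k] for k in keys)):
        yield template.format(**dict(zip(keys, values)))

# Two choices per slot -> 2 * 2 * 2 = 8 prompt variations.
prompts = list(prompt_variations(
    "a {color} {material} handbag shaped like a {shape}",
    color=["green", "blue"],
    material=["leather", "suede"],
    shape=["pentagon", "hexagon"],
))
```

Each resulting string would then be handed to the generator as an independent prompt, which is what makes the side-by-side comparisons in the post possible.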

The system also exhibited some unexpected but useful behaviors, applying a kind of logical reasoning to requests like producing multiple sketches of the same (nonexistent) cat, with the original shown above the sketch. No special coding went into this: "We did not anticipate this capability and made no alterations to the neural network or training process to encourage it." Which is a pleasant surprise.

Interestingly, another new system from OpenAI, CLIP, was used in conjunction with DALL-E to understand and rank the images in question, though CLIP is a harder concept to grasp. OpenAI describes CLIP in a separate post.
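The role OpenAI describes for CLIP, ranking DALL-E's candidate images against the prompt so only the best are shown, follows a general generate-then-rerank pattern. A sketch of that pattern, with a hypothetical toy scorer standing in for the real models (actual CLIP computes a learned image-text similarity, not word overlap):

```python
def top_k_by_score(prompt, candidates, score, k=8):
    """Rank candidates by a prompt-vs-image score, best first, and keep k."""
    ranked = sorted(candidates, key=lambda img: score(prompt, img), reverse=True)
    return ranked[:k]

# Toy stand-in: "images" are just caption strings, and the score counts
# words shared with the prompt. A real scorer would embed both and
# compare the embeddings.
def toy_score(prompt, image):
    return len(set(prompt.split()) & set(image.split()))

candidates = ["blue car", "green car", "a blue car on a road", "red bicycle"]
best = top_k_by_score("a blue car", candidates, toy_score, k=2)
```

The design point is that generation and ranking are decoupled: the generator can be sampled freely many times, and any prompt-image scoring model can be swapped in to filter the results.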

The potential applications of this capability are so numerous and varied that a full accounting is beyond the scope of this post. Even OpenAI hedges:

Currently, like GPT-3, this technology is impressive but making precise forecasts about its future is challenging.

Importantly, much of what it produces doesn't look quite "finished": I couldn't, for instance, ask it to make a featured image for this article and expect something usable right away. Even a quick look reveals all kinds of AI weirdness (a specialty of Janelle Shane), and while these rough edges will surely be smoothed over in time, it isn't yet safe to use the way GPT-3 text can be used, with some editing, in place of human writing.

Generating lots of options and picking the best few helps, as the following collection illustrates:

The top eight of X total generations, with X increasing to the right. Image Credits: OpenAI

None of this is meant to diminish OpenAI's accomplishment. This is fascinating and powerful work, and like the company's other projects, it will likely grow into something even more fascinating and more powerful before long.

Tags: DALL-E, OpenAI, AI image generator, text to image, artificial intelligence, image creation