I recently took the plunge into generative AI. My goal was to make a cover image for the short story “Sofi”. This YA fantasy involves three main characters: a boy and two goddesses who are fighting over his soul. There is one magic battle in the story, which naturally lends itself to an intriguing cover.
To get started, I read some online reviews about which tool is best. Those reviews tended to be bland and non-technical, so I want to offer some notes from my own experience.
Midjourney — This was the first tool I tried. I paid for the cheapest subscription level ($10/month), which was sufficient to explore possibilities for one final product. One great strength of Midjourney is the ability to refine results. Each time you generate, you get back four pictures. You can pick any of them to generate more variants, zoom out, resize, or recenter. This lets you guide Midjourney toward what you really want. I suspect Midjourney uses all this feedback to further train the AI, and it shows in the quality of the art.
Unfortunately, Midjourney can't keep multiple subjects in a scene distinct. If I ask for two human characters, every word in the prompt affects both of them, and sometimes the descriptions get jumbled. For example, I asked for Sofi to have a white shirt and Raca a black dress, but sometimes the colors, or even the whole outfits, got swapped. It's better to use Midjourney for single subjects.
Stable Diffusion — I tried the web demo on Hugging Face. For a similar prompt, the art was noticeably lower quality than Midjourney's: not just extra fingers, but extra arms, and faces that melted and blurred. I don't think this was an issue with prompt engineering so much as with the tool itself. You get what you pay for. I didn't pursue it further.
DALL-E 3 — I started by paying for OpenAI's lowest subscription tier ($20/month), which gave access to the latest GPT-4 and, indirectly, to DALL-E 3. Immediately afterward, I learned that you can access DALL-E 3 for free via Microsoft's Bing Image Creator, which only requires a registered Microsoft account. My experiences with these two routes were a bit different.
GPT-mediated DALL-E 3 — This produced the most satisfactory results. I was able to name the two goddesses and describe each one in detail. For the most part, the tool did not mix up race, hair color, or clothing color and style. However, when I tried to describe differences in the goddesses' figures, they came out looking the same.
GPT took my prompt and generated a more embellished one for DALL-E. I'm not sure how much the embellishments contributed to the result.
Original prompt: Photorealistic image of clash between two teenage goddesses. Sofi, the good goddess, represents wisdom. Sofi forms a ball of glowing white magic. Sofi has an Asian face, long silky black hair and white shirt. Raca, the evil goddess, is Caucasian with red hair, and black lacy dress.
GPT-4 embellishment: A photorealistic image depicting an epic clash between two teenage goddesses. The good goddess, Sofi, embodies wisdom and is creating a ball of glowing white magic. Sofi has an Asian appearance, with long, silky black hair and is wearing a white shirt. Opposing her is Raca, the evil goddess, who is Caucasian with striking red hair and is dressed in a black, lacy dress. The scene is dynamic, illustrating the stark contrast between the two goddesses, both in appearance and the powers they wield.
One serious deficiency of the GPT interface is that you can't vary or refine an image. You can follow up with another prompt, and GPT will try again, but the new image is completely different. I made a point of saving any image that was good, because it would never come again. It would have been nice to take the good ones and tweak them just a little until they were perfect.
In general, all of the AI tools had trouble with fingers. One of the very best images from GPT-4 gave Raca six fingers on one hand and four on the other. Too few fingers is forgivable, because the viewer can imagine one is hidden; too many is a disaster. I wanted to rerun that image and say, "But Raca has five fingers." However, as noted above, GPT-4 doesn't do true image refinement.
Bing DALL-E 3 — I believe this gives direct access to DALL-E. I fed it the prompts listed above, both the original and the GPT-embellished version, but it never produced quite the same level of results. I'm not sure whether this was just the luck of the random draw or a real difference between the tools.