Meet Flux: New Open-Source AI Image Generator Beats Midjourney, SD3 and Auraflow - Decrypt
08/02/2024 01:01
Flux is an advanced, open-source text-to-image model with 12 billion parameters. We compare it to three top contenders, and one comes out on top.
Black Forest Labs, the team that helped develop the original Stable Diffusion, has launched Flux, the largest open-source text-to-image model to date. With a staggering 12 billion parameters, Flux can deliver visuals that rival those of Midjourney, and possibly beat any other model currently available, be it open or closed source.
Flux comes in three variations: Flux Dev, which is open source with a non-commercial license for community development; Flux Schnell, a faster, distilled version that operates up to ten times quicker and is available under an Apache 2 license; and Flux Pro, the top-of-the-line model, which is a closed-source version available via an API.
Flux Dev and Flux Schnell are available for download on Hugging Face. ComfyUI has also been updated to support the new models in local workflows.
Black Forest Labs made the announcement Thursday, emphasizing the team’s proven track record of advancing generative AI for media.
“Our innovations include creating VQGAN and Latent Diffusion, Stability AI’s Stable Diffusion models for image and video generation (Stable Diffusion XL, Stable Video Diffusion, Rectified Flow Transformers), and Adversarial Diffusion Distillation for ultra-fast, real-time image synthesis,” the team said.
The launch follows a successful seed funding round of $31 million, led by Andreessen Horowitz and supported by notable investors including Brendan Iribe, Michael Ovitz, and Garry Tan.
In benchmarking tests, Black Forest Labs says its models have set new standards in image synthesis, surpassing models like Midjourney v6.0, Dall-E 3 (HD), and SD3 Ultra in visual quality, prompt following, size/aspect variability, typography, and output diversity. Black Forest’s charts claim that its Pro and Dev models are the best image generators to date, and that its less powerful Schnell ranks between Midjourney v5 and Ideogram.
Users with smaller GPUs may be out of luck, however. The open-source models weigh around 23GB, which means they would probably require nearly 24GB of VRAM to run until a quantized version is released, if ever. Even so, it seems that users with GPUs with 6GB or 8GB of VRAM will soon have to say goodbye to the thrill of testing new AI models.
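The download size is consistent with a back-of-the-envelope estimate: parameter count multiplied by bytes per parameter. The Python sketch below assumes 16-bit weights (2 bytes each), which the announcement does not state, and counts only the 12 billion parameters mentioned above. The 8-bit and 4-bit rows are hypothetical, since no quantized release is confirmed here, and the function name is illustrative.

```python
# Rough weight-size estimate: parameters x bytes per parameter.
# Assumption: weights are stored at a fixed width; activations and any
# auxiliary components are NOT included, so real VRAM use will be higher.

def weight_size_gb(num_params: float, bytes_per_param: float) -> float:
    """Return the approximate size of the weights in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


FLUX_PARAMS = 12e9  # 12 billion parameters, as stated in the article

# Hypothetical storage widths; only the 16-bit row is meant to
# approximate the published checkpoints.
for label, width in [("16-bit", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
    size = weight_size_gb(FLUX_PARAMS, width)
    print(f"{label}: ~{size:.0f} GB of weights")
```

At 16 bits the weights alone come to about 24 GB in decimal units, or roughly 22.4 GiB, which is in line with a download of around 23GB and explains why a 24GB card is the practical floor without quantization.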
However, Black Forest has partnered with Fal AI (developers of fellow open-source model Auraflow) to support cloud generations. The models are also available to test for free on Replicate.com. Once users meet their daily quota, it costs $1 to generate 33 images with Flux Pro or 333 with Flux Schnell.
This is a better value proposition than Midjourney or Ideogram. Midjourney’s Basic plan costs $96/year and lets users generate around 200 images per month, which works out to about 25 images per dollar. Ideogram’s basic plan costs $84 a year and provides up to 400 images per month, or about 57 images per dollar.
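The per-dollar comparison can be reproduced from the prices and quotas quoted in this article. The following is a minimal Python sketch, not an official pricing calculator: the function name is illustrative, subscription quotas are treated as fully used each month, and the Flux figures are the pay-as-you-go rates quoted for Replicate after the free daily quota.

```python
# Images per dollar for subscription plans, from annual price and
# monthly image quota. All inputs are the figures quoted in the article.

def images_per_dollar(annual_price_usd: float, images_per_month: float) -> float:
    """Convert an annual plan price and monthly quota to images per dollar."""
    monthly_price = annual_price_usd / 12
    return images_per_month / monthly_price


plans = {
    "Midjourney Basic": images_per_dollar(96, 200),
    "Ideogram Basic": images_per_dollar(84, 400),
    # Pay-as-you-go rates are already expressed per dollar.
    "Flux Pro": 33.0,
    "Flux Schnell": 333.0,
}

# Print the plans from best to worst value.
for name, value in sorted(plans.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: ~{value:.0f} images per dollar")
```

The subscription figures assume the whole monthly quota is consumed; anyone who generates fewer images pays more per image, which makes pay-as-you-go pricing comparatively more attractive for light use.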
Testing Flux
Flux looks great in benchmark tests, but how good do its creations look? We have compared it against the most prominent open-source image generators available to date, and can confirm that we were impressed. Let's compare Flux, SD3 Medium, and Auraflow—then put it head to head against Midjourney.
Illustrations
Prompt 1: “Hand-drawn illustration of a giant spider chasing a woman in the jungle, extremely scary, anguish, dark and creepy scenery, horror, hints of analog photography influence, sketch.”
Flux showed an excellent use of atmospheric lighting and shadows. The spider's design is truly menacing, with its sharp legs and frightening face. The woman's vulnerable posture conveys anguish well. It is the most accurate representation of anatomy.
Auraflow’s teal color palette gives an eerie, otherworldly feel, but doesn't fully capture the "dark and creepy" requirement. The spider design is less scary and more stylized.
SD3 Medium’s black-and-white style gives a strong sketch-like quality. The spider's design is detailed and menacing but has some morphological flaws in the limbs.
Our Ranking:
- Flux: Best captures the horror, anguish, and creepy atmosphere. It is the most accurate creation with no morphological flaws.
- SD3 Medium: While visually striking, it's the least aligned with the "analog photography" aspect of the prompt. The horror style is noticeable.
- Auraflow: Closest to the sketch and analog photography as a whole. However, it is the least creepy, least scary, and is the one that least conveys the overall atmosphere of the scene.
Spatial Awareness
Prompt 2: “A dog standing on top of a TV showing the word ‘Decrypt’ on the screen. On the left there is a woman in a business suit holding a coin, on the right there is a robot standing on top of a first aid box. The overall scenery is surreal.”
Flux is the model that most closely matches the prompt's requirements. It features all the elements in the required positions. The composition is well-balanced, and the unexpected placement of elements and the retro-futuristic clash enhance the surreal quality. Although it generated a glimpse of an additional hand, this version captures the prompt's essence most accurately.
SD3 Medium is the second best. It understood all the elements but also had some variations—like the cartoonish style and the dog sitting instead of standing. It captures some elements of the prompt but misses others, falling between Flux and Auraflow in terms of accuracy.
Auraflow takes some liberties with the prompt. The dog is on the TV but is sitting not standing, the woman has a more vintage 1950s look rather than a modern business suit, the robot is on a blue pedestal, not a first aid box, and the overall style is more retro and colorful, less surreal. The words were also poorly rendered.
While creative, it deviates more from the original prompt than the Flux version.
Our Ranking:
- Flux: Most accurate to the prompt and achieves a surreal quality.
- SD3 Medium: Captures main elements but misses some details.
- Auraflow: Creative interpretation but deviates most from the original prompt.
Realism
Prompt 3: “A high-resolution photograph of a bustling city street at night, neon signs illuminating the scene, people walking along the sidewalks, cars driving by, a street vendor selling hot dogs, reflections of lights on wet pavement, the overall style is hyper-realistic with attention to detail and lighting, a neon sign says ‘Decrypt.’”
Flux closely matches the prompt's requirements. It features a bustling city street at night with neon signs illuminating the scene, people walking along the sidewalks, and cars driving by. The reflections of lights on the wet pavement are realistic, and the "Decrypt" sign is prominently displayed.
Auraflow takes some liberties with the prompt. The vibrant neon lighting creates a bustling atmosphere, and the reflections on the wet pavement add to the realism. The street vendor is clearly visible and interacts with the scene. However, the image appears slightly over-saturated and the street vendors look cartoonish, which detracts from the hyper-realistic style. The neon signs are blurry and there is no clear distinction between the sidewalk and the street since the model generated a weird perspective.
SD3 Medium also captures the main elements of the prompt but has some variations. The balanced composition focuses on both pedestrians and the environment, with realistic lighting and reflections enhancing the night-time city feel. The "Decrypt" sign is prominent, and the street vendor contributes to the lively atmosphere. However, upon closer inspection, it is easy to spot some elements that make the scene unrealistic. For example, people walk on the street, and the sidewalk expands to fit the hot dog stand.
Our Ranking:
- Flux: Detailed and well-lit. Captures the busy street well, the signs are easy to read and the pedestrians are well represented.
- SD3 Medium: Captures the prompt's requirements with a balanced composition, realistic lighting, and well-integrated elements, including the "Decrypt" sign and street vendor. But the pedestrians are not as realistically represented as in the Flux generation.
- Auraflow: Creative interpretation with vibrant lighting, but deviates from the hyper-realistic style with its cartoonish street vendors and the messy neon signs. It has some issues with the perspective, which is a problem if the objective is photorealism.
Boss level: Flux v. Midjourney
We also compared Flux against Midjourney. But instead of using our own generations, we copied the prompts for Midjourney’s top picks according to its “discovery” page. Here is how the two models stack up against each other.
Realism
Prompt 1: A black and white photo of a woman with long straight hair, wearing an all-black outfit that accentuates her curves, sitting on the floor in front of a modern sofa. She is posing confidently for the camera, showcasing her slender legs as she crouches down... See the full prompt here.
Midjourney closely matches the requirements. It features a woman in a dynamic, crouched pose on a soft surface, capturing the essence of a high-fashion photograph. The detail in her hair, facial features, and clothing is rendered with high precision, enhancing realism. However, the pose, while dynamic, is unnatural. The woman’s right hand looks like a mixture of a hand and a foot, her right leg disappears abruptly, and what would be her left foot also has a shape that mimics a hand.
On the other hand, Flux captures the main elements of the prompt with a balanced composition. The woman is seated on the floor with her legs crossed, in a more relaxed and natural pose. The high precision in rendering facial features, hair, and clothing contributes to a realistic appearance. The lighting is soft and diffused, providing gentle shadows and highlights that define the subject's features.
The generation was not without flaws, though. She seems to have an additional leg, but that can be easily fixed with inpainting or tools like Photoshop, since the overall dark scene makes it easy to work with.
Our Ranking:
- Flux: Captures the prompt's requirements with a natural pose, contextual background, and detailed rendering. It is the most accurate in terms of morphology.
- Midjourney: Features a dynamic pose and a high level of detail, but lacks the contextual richness of the Flux image, and the body is not as accurately represented.
Prompt Adherence
Prompt 2: A white cat playing the piano, wearing sunglasses and a hat, wearing purple Hawaiian style, full body shot against a grey studio background, commercial video screengrab. Credit: Chestnutmuffin.
Midjourney's interpretation of the prompt captures the whimsical nature of the scene. The vibrant purple Hawaiian shirt adds a playful touch. The lighting is soft, emphasizing the textures and colors effectively. However, the close-up shot deviates from the "full body shot" specified in the prompt, and the background is not the grey studio setting requested, but rather a more natural and less controlled environment. The overall composition, while charming, excels in realism and style but misses some key elements of the prompt.
Flux adheres more closely to the prompt, with a full-body shot of the white cat playing the piano that captures all the requested elements. The composition is less stylish but includes the entire body of the cat, ensuring all specified details are visible. The lighting and rendering are well-executed, highlighting the cat's posture and the overall scene. However, while the image is highly detailed and accurate, it may lack the immediate charm and expressiveness of the close-up generated by Midjourney (which is known to favor beauty over accuracy).
Our Ranking:
- Flux: The full-body shot, grey studio background, and specified attire are captured accurately. The composition is professional and polished, aligning perfectly with the prompt's requirements.
- Midjourney: Delivers a charming and detailed close-up with expressive features, but misses key elements like the full-body shot and studio background. While visually appealing, it deviates from the prompt's specifics.
Conclusions
We were pleasantly surprised with Flux, which came out on top across all of our tests. Its “Pro” version definitely delivers great-quality results and can be a good competitor to Midjourney and other paid options. It requires richer prompting, but the results are very accurate, realistic, and true to what’s prompted.
For those willing to pay for a good image generator, Flux Pro seems to be the best value proposition. The “Dev” and “Schnell” versions are better than the base SD3 Medium and Auraflow, so even in the open-source space, Flux is a pretty strong competitor.
Flux renders human bodies better than SD3, which is a major point to consider. However, people with more modest GPUs could manage with SD3—or even fine-tuned versions of SDXL—given that new models like Auraflow or Flux are extremely heavy.
It bears noting that the Replicate platform has implemented a “safety” slider, and we can confirm that the model is somewhat uncensored for those who care. Oh, and women can also lie on the grass again.
Edited by Ryan Ozawa.