Elegant architecture, trained from scratch, excels at image editing. This looks very interesting!
From https://arxiv.org/html/2409.11340v1
> Unlike popular diffusion models, OmniGen features a very concise structure, comprising only two main components: a VAE and a transformer model, without any additional encoders.
> OmniGen supports arbitrarily interleaved text and image inputs as conditions to guide image generation, rather than text-only or image-only conditions.
> Additionally, we incorporate several classic computer vision tasks such as human pose estimation, edge detection, and image deblurring, thereby extending the model’s capability boundaries and enhancing its proficiency in complex image generation tasks.
This enables prompts for edits like: "|image_1| Put a smile face on the note." or "The canny edge of the generated picture should look like: |image_1|"
> To train a robust unified model, we construct the first large-scale unified image generation dataset X2I, which unifies various tasks into one format.
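For a sense of how those interleaved prompts are used in practice, here is a minimal sketch assuming the `OmniGenPipeline` API and the `<img><|image_1|></img>` placeholder syntax shown in the repo's README (file paths and parameters are illustrative):

```python
# Hedged sketch: assumes the OmniGenPipeline usage shown in the OmniGen README.
from OmniGen import OmniGenPipeline

pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1")

# Interleaved text + image condition: the placeholder refers to input_images[0].
images = pipe(
    prompt="<img><|image_1|></img> Put a smile face on the note.",
    input_images=["./note.png"],   # illustrative input path
    height=1024,
    width=1024,
    guidance_scale=2.5,
    img_guidance_scale=1.6,
    seed=0,
)
images[0].save("edited_note.png")
```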
> trained from scratch
Not exactly. They mention starting from Stable Diffusion XL's VAE and Phi-3's transformer.
Looks like these LLMs can really be used for anything
Pretty cool. ComfyUI and its community are too cumbersome for me, and it still results in too much throwaway content.
I left all the defaults as is, uploaded a small image, typed in "cafe," and 15 minutes later I'm still waiting for it to finish.
Same, I left it running for half an hour but nothing happened.
The author updated their code a couple of days ago, and it runs smoothly on my end, producing results in about one minute. https://github.com/VectorSpaceLab/OmniGen
Left it running for an hour and nothing happened. Maybe this is a social experiment.
With consistent representation of characters, are we now on the precipice of a Cambrian explosion of manga/graphic novels/comics?
I would say we already had one of those. There's more hand-crafted, human-made content available than anyone cares to read.
While this will enable a certain amount of additional spam, it will, more importantly and on the positive side, democratize the creative process for those who want to tell a story in images but lack the skill and resources to produce one traditionally.
I sure hope so - at the very least I will use it for tabletop illustrations instead of having to describe a party's scenario result - I can give them a character-accurate image showing their success (or epic lack thereof).
Not yet; it still can't generate transparent images.
Why do you need that? For manga specifically, generate in greyscale and convert luminance to alpha; then composite; then color.
Or, if you need solid regions that overlap and mask out other regions, then generate objects over a chroma-keyable flat background.
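For the luminance-to-alpha route, a rough sketch with Pillow and NumPy (thresholds and file names are illustrative):

```python
# Hedged sketch: turn a greyscale panel into an RGBA layer where darker ink
# becomes more opaque, then composite it over a background.
import numpy as np
from PIL import Image

line_art = Image.open("panel_greyscale.png").convert("L")   # illustrative path
lum = np.asarray(line_art, dtype=np.float32) / 255.0

# Invert luminance so black ink -> alpha 255, white paper -> alpha 0.
alpha = ((1.0 - lum) * 255).astype(np.uint8)

rgba = np.zeros((*alpha.shape, 4), dtype=np.uint8)   # ink color stays black
rgba[..., 3] = alpha
layer = Image.fromarray(rgba, mode="RGBA")

background = Image.open("background.png").convert("RGBA").resize(layer.size)
Image.alpha_composite(background, layer).save("composited.png")
```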
From the ControlNet author:
Transparent Image Layer Diffusion using Latent Transparency
https://arxiv.org/abs/2402.17113
https://github.com/lllyasviel/sd-forge-layerdiffuse
This looks promising. I love how you can reference uploaded images with markup - this is exactly what the field needs more of. After spending the last two weeks generating thousands of album cover images using DALL-E and being generally disappointed with the results (especially with the variations feature of DALL-E 2), I'm excited to give this a try.
I think this type of capability will make a lot of image generation stuff obsolete eventually. In a year or two, 75%+ of what people do with ComfyUI workflows might be built into models.
I am working on an API to generate avatars/profile pics based on a prompt. I looked into training my own model, but I think it's a titanic task and impossible to do myself. Is my best option to use an external API and then crop the face from what was generated?
You can use a few ControlNet templates with whatever model you like and consistently get the posture correct. The diffusion plugin for Krita is a great playground for exploring this.
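If you'd rather script it than use the Krita plugin, a rough diffusers sketch using an OpenPose ControlNet with an SD 1.5 base checkpoint (the model IDs and pose image are just one common choice):

```python
# Hedged sketch: the OpenPose ControlNet keeps the posture fixed while the
# prompt and base checkpoint control the rest. Assumes a pose image on disk.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

pose = load_image("pose_reference.png")   # illustrative: a pre-rendered OpenPose skeleton
image = pipe(
    "portrait avatar, studio lighting, head and shoulders",
    image=pose,
    num_inference_steps=30,
).images[0]
image.save("avatar.png")
```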
The simplest commercial product for fine-tuning your own model is probably Adobe Firefly, although there's no API access yet. But there are cheap and only slightly more involved options like Replicate or Civit.ai. Replicate has solid API support.
Check out:
https://replicate.com/blog/fine-tune-flux
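For reference, calling a stock or fine-tuned Flux model through Replicate's Python client looks roughly like this (the slug below is Replicate's public flux-dev listing; a fine-tuned model would use your own slug):

```python
# Hedged sketch: assumes the `replicate` Python client and an API token in the
# REPLICATE_API_TOKEN environment variable.
import replicate

output = replicate.run(
    "black-forest-labs/flux-dev",   # swap in your fine-tuned model slug
    input={"prompt": "profile avatar, soft studio lighting, 35mm film"},
)
print(output)   # URLs or file objects for the generated image(s), depending on client version
```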
Is it possible to download Flux 1 and deploy it to my own server (and build a simple API on top of it)? I don't need fine-tuning.
The easiest Flux API I've seen is Fal.ai.
It is expensive though: Flux dev images are around $0.035/image.
If you have GPUs on your server that can handle it.
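A rough self-hosting sketch with diffusers plus a minimal FastAPI endpoint; FLUX.1-schnell is used here since it's the more permissively licensed variant, and the endpoint shape and parameters are illustrative:

```python
# Hedged sketch: serve FLUX.1-schnell locally behind a tiny FastAPI endpoint.
# Assumes a GPU with enough VRAM and the weights downloaded from Hugging Face.
import io

import torch
from diffusers import FluxPipeline
from fastapi import FastAPI
from fastapi.responses import Response

app = FastAPI()
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
).to("cuda")

@app.get("/generate")
def generate(prompt: str):
    # schnell is tuned for few steps and no classifier-free guidance.
    image = pipe(prompt, num_inference_steps=4, guidance_scale=0.0).images[0]
    buf = io.BytesIO()
    image.save(buf, format="PNG")
    return Response(content=buf.getvalue(), media_type="image/png")
```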
https://github.com/VectorSpaceLab/OmniGen
Cool, they even released the weights! [1] I didn't expect that from the tone of the release post, to be honest.
[1]: https://huggingface.co/Shitao/OmniGen-v1
Love this idea -- you have a typo in the tools list: "Satble Diffusion".
Anyone know how it handles text? That's kind of a deal breaker for me; I like Ideogram for its ability to do really cool fonts, etc.
I mean, I struggle even getting DALL-E to iterate on one image without changing everything, so this is pretty cool.
Curious what the actual cost is for each edit. Will this infra always be reliable?
I was able to clone the repo and run it locally, even on a Windows machine, with only minimal Python dependency grief. Takes about a minute to create or edit an image on a 4090.
It's pretty impressive so far. Image quality isn't mind-blowing, but the multi-modal aspects are almost disturbingly powerful.
Not a lot of guardrails, either.
It seems like there's a lot of potential for abuse if you can get it to generate AI images of real people reliably.
Hrmm, so this is how it's gonna be moving forward then? Use a smidgen of truth, to tell the whole falsehood, and nuttin' but the falsehoods. Sheesh- but, at least the subject is real? And that's that- nuttin' else doh.
We've been manipulating photos as long as we've been taking them.
Art is what you can get away with. (Andy Warhol)