System memory fallback for Stable Diffusion on Nvidia GPUs

I feel this is because of Apple, and other ARM and x86 laptops (rather than desktops) with dedicated AI chips that are able to use system memory, which Nvidia has to compete with.
 
No doubt implemented because of their mistake of shipping reasonably high-end GPUs in 2023 with only 8GB of VRAM.
Yeah…
But it’s not hard to have a small CUDA project that easily creeps into the 40+ GB range on just the demo libraries. That’s not feasible for somebody to download and tinker with at home. But upgrading your home machine to 64GB of RAM, paired with a 4060, is at least a realistic option for the casual enthusiast, unlike an RTX 6000.

This could also serve as a precursor to some shared memory pools for CUDA. CUDA requires extensive memory control, and sharing resources across GPU memory is not easy; it’s one of the major complaints about CUDA and one of the primary selling points of some of the other frameworks out there, mainly PyTorch. So this may be step one in a multi-step process to address some of the more serious developer complaints as well.
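
To make that concrete, here's a rough Python sketch of the kind of manual juggling you have to do today with PyTorch and the diffusers library (my assumption about the stack; the checkpoint name and the 6 GB threshold are just placeholders I picked, not anything Nvidia or the driver actually does): check how much VRAM is free and, if it's tight, keep the weights in system RAM and stream them to the GPU as needed. The driver-level fallback would basically do this shuffling for you, transparently.

```python
# Sketch only: manual "check VRAM, fall back to system RAM" with PyTorch/diffusers.
# The checkpoint name and the 6 GB threshold are placeholders, not anything the driver does.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # assumed example checkpoint
    torch_dtype=torch.float16,
)

free_bytes, total_bytes = torch.cuda.mem_get_info()   # free/total VRAM on the current GPU
if free_bytes < 6 * 1024**3:
    # Keep weights in system RAM and stream submodules to the GPU as needed
    # (requires the accelerate package).
    pipe.enable_model_cpu_offload()
else:
    pipe = pipe.to("cuda")

image = pipe("a test prompt", num_inference_steps=20).images[0]
image.save("out.png")
```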
 
I had never heard of Stable Diffusion before.

Seems like it is some sort of open source image generating AI model.

I'm curious to try it out and all, but I wonder how much of an actual use case this is for more than a corner case of people.
 

So I gave it a try, via a web interface called AUTOMATIC1111 or something. It's certainly interesting, but thus far it seems to be good for little more than laughs. The example images I have generated using text prompts aren't exactly realistic, and would best belong in a coffee table book of "hilarious AI-generated images".

Good for a few minutes of distraction, but that is about it.

Though to be fair, I don't really know what I am doing, so I may be missing something.
 
Interestingly enough my 4090 (24GB) seems to run out of VRAM if you exceed ~1024x768 resolution, after maxing out the number of sampling steps to 150.
 
I'm curious to try it out and all, but I wonder how much of an actual use case this is for more than a corner case of people.
You can have indirect-use applications that use it, say a Photoshop/GIMP/Blender plugin:
https://twitter.com/wbuchw/status/1563162131024920576 (via https://www.banana.dev/blog/project-ideas-built-with-stable-diffusion)


For direct use we can imagine a game dev, say one making a Magic: The Gathering type of game, with images made from the descriptions of the cards, textures, and so on.
 

Well, for that it will need a little bit more training first.

At least if my "blue 1976 volvo 242 sedan" is any indication:

[attached image]


That said, there are LOTS of settings, and I have no idea what I am doing.

Also, on my system it seems CPU limited.

While it does load up the GPU, it absolutely pins a single thread on my CPU while running.


Here is Arnold Schwarzenegger drinking tea:

[attached image]
 
I am learning I need to add lots of models and scalers to get the most out of it, and that it works best at low resolutions, then using an AI upscaler to make them higher resolution.

Anyway, it seems the rabbit hole goes deep
 
Well, for that it will need a little bit more training first.

At least if my "blue 1976 volvo 242 sedan" is any indication:

[attached image]

That said, there are LOTS of settings, and I have no idea what I am doing.

Also, on my system it seems CPU limited.

While it does load up the GPU, it absolutely pins a single thread on my CPU while running.


Here is Arnold Schwarzenegger drinking tea:

[attached image]

I'm not sure exactly how you're using it, but there's usually a specific resolution a model is trained on, and if you generate at a different resolution you'll often get pretty poor results. I would guess that is your issue here, but there are also a bunch of other settings that you want to change per model. Usually the default settings are pretty good though.

I've used stable diffusion for a bunch of things. From just fooling around for fun to actual business stuff. It's pretty fun.
 
Interestingly enough my 4090 (24GB) seems to run out of VRAM if you exceed ~1024x768 resolution, after maxing out the number of sampling steps to 150.
There are upscale techniques to properly push you past 4K/8K, versus trying to generate an image at that resolution. But welcome to the club; it’s very easy to sit down and get lost for hours without realizing it.

To expand on the above post, most model authors will tell you the resolution, CFG, Scaler, Steps etc….

Also, there is ComfyUI, a node-based SD setup that is compatible with SD 1.5/XL, but you need specific workflows. It allows for more customization, but requires a lot more tinkering to get right.
 
I am learning I need to add lots of models and scalers to get the most out of it, and that it works best at low resolutions, then using an AI upscaler to make them higher resolution.

Anyway, it seems the rabbit hole goes deep
SD1.5 models are trained on 512x512 resolution, so you should use that or a similar resolution; e.g. 512x768 and 768x512 usually work well too. Then upscale.
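
If you script it rather than using a UI, that two-pass idea looks roughly like this with the diffusers library (a sketch only; the checkpoint name, resolutions, and denoising strength are just example values I picked, and A1111's hires fix does essentially the same thing):

```python
# Sketch: generate at an SD1.5-native resolution, then upscale with a low-strength img2img pass.
import torch
from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline

model_id = "runwayml/stable-diffusion-v1-5"   # assumed example checkpoint
txt2img = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
img2img = StableDiffusionImg2ImgPipeline(**txt2img.components)   # reuse the loaded weights

prompt = "blue 1976 volvo 242 sedan, photo"
base = txt2img(prompt, width=512, height=768, num_inference_steps=25).images[0]

# Naive 2x resize, then let img2img re-add detail at low denoising strength
# so the composition is kept.
upscaled = img2img(
    prompt,
    image=base.resize((1024, 1536)),
    strength=0.35,
).images[0]
upscaled.save("volvo_1024x1536.png")
```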

But the most important bit for SD is the checkpoint you use, there are thousands of checkpoints out there all trained for different things.

I've been using SD for fun, and for generating illustrations for my white papers and documents for about 11 months now. Only the enemies of AI art think it is easy to get good results with it. Especially when you don't just want a random image, but something very specific.
 
The problem with Stable Diffusion right now is that SDXL is out, but very few character and/or person models are available on it. It's getting better, but atm I don't really use it. SD1.5 models are good enough for me due to the sheer variety. SDXL is "better", but it's also much harder to train, so the community is in a bit of a divide. I've read that training SDXL models requires 24GB of VRAM minimum, depending on what you're doing, which essentially means people need a 4090 to do it. That's a high barrier of entry. A lot of the better model trainers (I'm not one of them, I'm just a prompter) probably don't have a 4090. This update by Nvidia is probably quite welcome. Speed will end up being an issue, but maybe it will make the difference between being able to do something and not being able to do it at all.


Interestingly enough my 4090 (24GB) seems to run out of VRAM if you exceed ~1024x768 resolution, after maxing out the number of sampling steps to 150.

More steps isn't always better. I leave steps at 20, and many models or checkpoints in the SD1.5 space (granted, you're using XL) were intended to be used at 20-30, maybe 40. The algorithm used for generation determines a lot of things, such as convergence or non-convergence, and overall quality. I use DPM++ 2M Karras. For upscaling, there are various options. They can introduce random unintended details in the image, depending on the denoising level and how high you're attempting to upscale. The higher the denoising level, the more the original generated image will be "replaced" to "fill in details". The details filled in depend a lot on your prompt and/or settings in the upscaler. CFG determines how closely it attempts to follow your prompt; higher settings generally mean it'll try harder, but you can lose coherency. Unless the base model had a significant amount of training with a character (or celebrity) specifically, you're very unlikely to get good results with them without using a Lora/Lyco model for them.
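
For anyone driving this from a script, those knobs map onto an A1111 API call roughly like below. This is just a sketch under my assumptions: it presumes the webui was launched with --api, the Lora tag is a made-up placeholder, and you should double-check the field names against the /docs page of your install.

```python
# Sketch: the main generation knobs as an AUTOMATIC1111 txt2img API request.
# Assumes the webui was started with --api; verify field names against
# http://127.0.0.1:7860/docs for your version. The lora tag is a placeholder.
import requests

payload = {
    "prompt": "masterpiece, 1girl, guns akimbo pose, <lora:some_character:0.56>",
    "negative_prompt": "lowres, bad hands, bad anatomy",
    "steps": 25,                        # 20-30 is plenty for most SD1.5 checkpoints
    "cfg_scale": 7,                     # how strongly it follows the prompt
    "sampler_name": "DPM++ 2M Karras",
    "width": 512,
    "height": 512,
    "seed": -1,                         # -1 = random
}

r = requests.post("http://127.0.0.1:7860/sdapi/v1/txt2img", json=payload, timeout=600)
r.raise_for_status()
images_base64 = r.json()["images"]      # list of base64-encoded PNGs
```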

Here's an easy example (no, this isn't really cherry picked):

[attached image]


The pose is the "guns akimbo" meme of Daniel Radcliffe origin. The character is Ha Yuri Zahard from Tower of God, who you're probably not aware of. The base model is some chibi-ish cute mix. All of these require downloading from CivitAI. The clothing came just from the prompt. The Yuri model barely has any images for training, but even at 0.56 weight it does a pretty decent job of approximating her within the base model's art style. The fingers will usually be problematic, so that takes some cherry picking (actually, I just realized I forgot the BetterHands lyco for this meme, too). You can "weight" a model's effect on the image; a higher weight means it more closely follows its training data, but the image quality tends to get worse since it might go against the base model's style too much. At higher weights, Yuri in this image would look more like her OG portrayal, but might not be able to do the pose properly (since it's based on an RL photograph). Everything is a balancing act when prompting, and it might take multiple tries. My script just has weights that I found are flexible enough to work well maybe 50-80% of the time, even with various combinations of Lora/Lyco.

I see a lot of people just straight off the bat jump into making 4k-8k high res megaimages and I don't really get it. I just generate at 512x512 (and many base models were trained at this) and then upscale to 920x920 (1.8x). I generally find this strikes a half decent balance between quality and speed (~1 image per 2-3 seconds on a 4090), since I have a script that generates characters in various poses/clothes/etc. I'm usually doing 10k++ images a day or something. If I like any result enough, I might upscale it higher by reusing the seed and then just tweaking the upscaling options.

Some more meme images, for fun:
(Kirino, same pose, different base model):
[attached image]

(Same Yuri, just Ace Attorney OBJECTION pose)
[attached image]

(Hatsune Miku with a PVC figurine base model, doing that conspiracy theory meme; just ignore the right hand fingers):
[attached image]


This is supposed to be Arnold with the "This is fine" meme, so everything is supposed to be on fire, but I guess it replaced it all with fireplaces; base model is Neverendingdream:
[attached image]
 
You can start looking into ControlNet; it can greatly help with generating specific poses.

Also, in A1111 there is a script drop-down towards the bottom where you can plot the same seed/prompt across many different models in an X/Y/Z fashion. You can also set ranges for the CFG/steps to see how they impact the image, without having to do it manually yourself.
 
I wonder what this means for OOC (out-of-core) 3D renders. One of the current limitations within Octane, for example, is that the AI denoiser cannot be used on renders that go OOC. Perhaps the new memory management/model used to let SD go OOC would allow the various rendering engines to seamlessly do the same?
 
Before I dip out (since I've got some labbing and Robocop gaming lined up), I wanted to mention there's a 'styles.csv' file circulating online – you can find it on Reddit, YouTube, Google etc... It comes pre-filled with various positive and negative styles, and its drop-down menu feature makes it easy to select the style you're aiming for instead of having to recall from memory what combinations are effective.





[attached images]
 
I had never heard of Stable Diffusion before.

Seems like it is some sort of open source image generating AI model.

I'm curious to try it out and all, but I wonder how much of an actual use case this is for more than a corner case of people.
Check these out:


https://youtu.be/GVT3WUa-48Y?si=gHxXk27eJHUKFkfK


https://youtu.be/tWZOEFvczzA?si=JxEGu6xWmeEPFeOq

The characters were generated with Stable Diffusion, though heavily tweaked. They have behind the scenes breakdowns of what they did.
 
You can start looking into ControlNet; it can greatly help with generating specific poses.

Also, in A1111 there is a script drop-down towards the bottom where you can plot the same seed/prompt across many different models in an X/Y/Z fashion. You can also set ranges for the CFG/steps to see how they impact the image, without having to do it manually yourself.
I've heard about ControlNets, but definitely never used them. The way I generate things at the moment is I have a Python script that creates about 17-34k lines of text prompts that are then fed into the machine(s) that I have running SD, via the prompts_from_file.py that is in the script directory by default. They're based on simply stringing together various permutations of what I have in my script. I think a lot of these meme loras are supposed to sort of implicitly have a controlnet included? I'm not sure how compatible the controlnet extension is with this approach. I know I can do --prompt "some prompt" --negative_prompt "something else", and add some other options on a per-line basis (and I do that), but I don't really see any way to specify and use a controlnet via the prompt line fed into it. I had to actually modify the py script just to add a denoising_strength parameter for the upscaler. I guess if it's possible via the API, I could manually modify the file to accept it. At least the Regional Prompter extension works without any real modification since it's purely prompt-based.
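
For reference, here's a stripped-down sketch of the kind of permutation script I mean (with generic placeholder component lists rather than my actual ones); the one-flag-per-line output is the format prompts_from_file expects:

```python
# Sketch: build a prompts_from_file-style batch by permuting prompt components.
# The component lists here are generic placeholders.
import itertools

characters = ["<lora:character_a:0.6>", "<lora:character_b:0.6>"]
poses      = ["guns akimbo pose", "sitting, drinking tea"]
outfits    = ["red jacket", "business suit"]
negative   = "lowres, bad anatomy, bad hands"

with open("prompts.txt", "w", encoding="utf-8") as f:
    for char, pose, outfit in itertools.product(characters, poses, outfits):
        prompt = f"masterpiece, {char}, {pose}, {outfit}"
        # One generation per line; other per-line options get appended the same way.
        f.write(f'--prompt "{prompt}" --negative_prompt "{negative}"\n')
```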

I'm having my third computer generate realistic Arnold padoru so everyone can suffer lol.
[attached images]


It's a little difficult, though. It's almost like these two things were never meant to mix. But that's what Stable Diffusion is for. And I couldn't upscale past 1.8x at all, at least not with the default upscaler steps, so I guess a lot of fine-tuning is needed to even get past that with the 8GB of VRAM my 2080 has on it. I guess maybe that's where this new driver will help. Getting 64GB of RAM in a computer is not hard at all. If it can overflow into that, it's fine even if it finishes a bit slower, since you're not likely to be batch generating at such a resolution; it'll be a one-off.
 
I've heard about ControlNets, but definitely never used them. The way I generate things at the moment is I have a Python script that creates about 17-34k lines of text prompts that are then fed into the machine(s) that I have running SD, via the prompts_from_file.py that is in the script directory by default. They're based on simply stringing together various permutations of what I have in my script. I think a lot of these meme loras are supposed to sort of implicitly have a controlnet included? I'm not sure how compatible the controlnet extension is with this approach. I know I can do --prompt "some prompt" --negative_prompt "something else", and add some other options on a per-line basis (and I do that), but I don't really see any way to specify and use a controlnet via the prompt line fed into it. I had to actually modify the py script just to add a denoising_strength parameter for the upscaler. I guess if it's possible via the API, I could manually modify the file to accept it. At least the Regional Prompter extension works without any real modification since it's purely prompt-based.

I'm having my third computer generate realistic Arnold padoru so everyone can suffer lol.

It's a little difficult, though. It's almost like these two things were never meant to mix. But that's what Stable Diffusion is for. And I couldn't upscale past 1.8x at all, at least not with the default upscaler steps, so I guess a lot of fine-tuning is needed to even get past that with the 8GB of VRAM my 2080 has on it. I guess maybe that's where this new driver will help. Getting 64GB of RAM in a computer is not hard at all. If it can overflow into that, it's fine even if it finishes a bit slower, since you're not likely to be batch generating at such a resolution; it'll be a one-off.

I am having a good laugh at the generations.

For the upscaling process, I would employ the CN tiling method. This approach involves breaking down the image into smaller tiles — typically 256x256 or 512x512 in size — when working with ultra-high resolutions exceeding 8K. These tiles are then upscaled individually and seamlessly stitched back together to form the complete image. Previously, I utilized this method in conjunction with the CN Tile model found here: ControlNet-v1-1 on Hugging Face, based on the multi-diffusion upscaler for automatic1111's script available at GitHub. This technique is particularly advantageous for GPUs with low VRAM, enabling them to achieve exceptionally high resolutions. In this setup, CN 1 represents the original image you wish to upscale, while CN 2 refers to the tile upscaling function. However, this was the solution I used several months ago, and there may be more advanced options available now.
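
To illustrate just the split/upscale/stitch part of that idea (without the ControlNet Tile model actually synthesizing detail, which is where the real benefit comes from), a bare-bones Pillow sketch could look like the following; the tile size, scale factor, and file names are placeholders:

```python
# Sketch: cut an image into 512x512 tiles, upscale each tile, and stitch the result.
# A real pipeline would run every tile through SD plus the ControlNet Tile model
# instead of the plain Lanczos resize used here, and blend the seams.
from PIL import Image

def upscale_tiled(path_in, path_out, tile=512, scale=2):
    src = Image.open(path_in).convert("RGB")
    w, h = src.size
    out = Image.new("RGB", (w * scale, h * scale))
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            piece = src.crop((x, y, min(x + tile, w), min(y + tile, h)))
            # Placeholder for the per-tile diffusion/upscale step:
            piece = piece.resize((piece.width * scale, piece.height * scale), Image.LANCZOS)
            out.paste(piece, (x * scale, y * scale))
    out.save(path_out)

upscale_tiled("render_1024.png", "render_2048.png")   # placeholder file names
```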

Regarding general CN usage, it all hinges on what you want to do. Take my case: I sought a particular pose, so I utilized it with the OpenPose model. I started by inputting a photo of someone seated on the sidewalk and then set a prompt depicting an astronaut in outer space. My goal was to achieve consistent results with the seated pose and prompt everything else around that.

You have the option to prioritize the prompt over the ControlNet settings or vice versa. This is usually managed by adjusting a slider rather than altering the prompt. However, it's possible to configure ControlNet to place more emphasis on the prompt rather than on the base model itself by using a toggle feature. I've been experimenting with this for months and continue to discover new aspects regularly. It can be an incredible time sink.
 

Regarding general CN usage, it all hinges on what you want to do. Take my case: I sought a particular pose, so I utilized it with the OpenPose model. I started by inputting a photo of someone seated on the sidewalk and then set a prompt depicting an astronaut in outer space. My goal was to achieve consistent results with the seated pose and prompt everything else around that.

You have the option to prioritize the prompt over the ControlNet settings or vice versa. This is usually managed by adjusting a slider rather than altering the prompt. However, it's possible to configure ControlNet to place more emphasis on the prompt rather than on the base model itself by using a toggle feature. I've been experimenting with this for months and continue to discover new aspects regularly. It can be an incredible time sink.

What I'm asking is whether it's possible to fully control the controlnet process via the prompts_from_file.py script, with a single line. Like say, --controlnet "<some controlnet name>" --controlnet_strength <some integer> or something. If I can't do that, unfortunately it's going to be kind of impossible for me to integrate controlnets because I depend on doing huge batches of images, with high variety, all at once, at a pretty fast pace. I guess they're just not for me if I have to dial in a bunch of things and do them on a singleton basis, so I'll have to just make do with using Loras alone.
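
One workaround I'm considering, if per-line flags aren't supported, is skipping prompts_from_file entirely and posting to the API, where the ControlNet extension apparently hangs its settings off an alwayson_scripts block. The sketch below is from memory, and the exact key names (image vs input_image, module, model, weight) are assumptions I'd still need to verify against the extension's API documentation:

```python
# Sketch with UNVERIFIED key names: txt2img via the A1111 API with one ControlNet unit.
# Check the ControlNet extension's API docs for the authoritative schema before relying on this.
import base64
import requests

with open("pose_reference.png", "rb") as f:    # placeholder pose image
    pose_b64 = base64.b64encode(f.read()).decode()

payload = {
    "prompt": "astronaut sitting on a sidewalk, outer space",
    "steps": 25,
    "cfg_scale": 7,
    "alwayson_scripts": {
        "controlnet": {
            "args": [{
                "image": pose_b64,        # assumed key; some builds used "input_image"
                "module": "openpose",
                "model": "control_v11p_sd15_openpose",   # placeholder model name
                "weight": 0.8,            # how strongly the pose constrains the output
            }]
        }
    },
}

r = requests.post("http://127.0.0.1:7860/sdapi/v1/txt2img", json=payload, timeout=600)
r.raise_for_status()
```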

How does it handle multi-character images, by the way? Let's say it's all the same person, like in this case the Tom Cruise laughing meme, where it shows closeups of his face and stuff. The Lora's a bit weird with it:
[attached image]

A lot of times the other models don't generate. Wonder how a controlnet does with that.
 
No doubt implemented because of their mistake of shipping reasonably high-end GPUs in 2023 with only 8GB of VRAM.
It's more that SDXL and other recent AI models are using more VRAM because they're generating at a higher resolution.
SD 1.5's 512x512 fits within 6-8 gigs pretty damn easily with some room to spare. SDXL's 1024x1024 or larger goes right up to the limit of my 12GB 3080. I've had hard crashes over DWM using an extra 300-400MB due to leaving the image generation output open.

I've been messing around a LOT with Stable Diffusion in the past year. I'm using ComfyUI now with a tiled rendering upscaler to get image output at an extremely detailed 4K.
[attached image]

This is an example of my workflow (with Controlnet disabled for this particular image)
[attached image]

[attached images]

I mainly use it to generate locations and stuff related to my tabletop campaigns.

The RAM fallback doesn't seem to be working with ComfyUI yet, as I'm still getting hard crashes if I go over my VRAM limit, but there's an extremely good chance that one of my many add-ons is causing the issue.
 