
What do I get when I download Llama 4?

Those are quite big LLMs (110B-400B params); the DeepSeek-style mixture-of-experts approach saves you compute and bandwidth during inference, but you still need all of the parameters loaded in RAM. They seem to be more enterprise-level (or meant for people to make smaller distilled versions of them for regular devices).
https://apxml.com/posts/llama-4-system-requirements
They seem to be aiming for an H100 80GB for the smallest version.

Maybe some smaller-memory-footprint versions based on them will become available (I think it is discussed a bit here):
https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct/discussions/20
Let's wait until https://github.com/ggml-org/llama.cpp/issues/12774 is resolved and a PR is merged to master, then try llama.cpp's --n-gpu-layers with the 4-bit quant https://huggingface.co/unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-bnb-4bit/tree/main, which is <60 GB. That should be doable if you have 24 GB VRAM + 64 GB RAM.
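
Once that lands, a minimal sketch of the partial-offload idea with the llama-cpp-python bindings (assuming a local GGUF quant; the filename and layer count below are placeholders, not tested values):

```python
# Minimal sketch: partial GPU offload via llama-cpp-python.
# Assumes Llama 4 support is merged in llama.cpp and you have a GGUF quant
# downloaded locally -- the model path below is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="Llama-4-Scout-17B-16E-Instruct-Q4_K_M.gguf",  # placeholder filename
    n_gpu_layers=24,  # offload as many layers as fit in 24 GB VRAM; the rest stays in system RAM
    n_ctx=8192,       # keep the context modest; long contexts blow up the KV cache
)

out = llm("Explain mixture-of-experts in one paragraph.", max_tokens=256)
print(out["choices"][0]["text"])
```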

Not sure if they are particularly good, either.
 
You can run big models slowly on a CPU-based system with hundreds of GB of RAM. Otherwise you need datacenter-class GPUs.

What hardware do you have? A 12 GB 3060 can do some fun stuff. 8B and 14B models are closer to what is possible with consumer GPUs.

LLMs are limited by memory bandwidth and capacity. Loading them in RAM is pretty slow compared to VRAM.
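
Rough napkin math for that (decode speed is roughly memory bandwidth divided by the bytes touched per token; the numbers are just illustrative):

```python
# Napkin math: decode speed ~= memory bandwidth / bytes read per token
# (roughly the size of the active weights). Illustrative numbers only.
def tokens_per_sec(active_params_b, bytes_per_param, bandwidth_gb_s):
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token

# An 8B model at 4-bit (~0.5 bytes/param):
print(tokens_per_sec(8, 0.5, 50))   # dual-channel DDR4, ~50 GB/s  -> ~12 tok/s
print(tokens_per_sec(8, 0.5, 360))  # RTX 3060 12GB, ~360 GB/s     -> ~90 tok/s
```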

I run llama3:8b on my intel phi server. It runs pretty well and is actually fairly impressive in terms of what it can do.
 

I have 2 machines with 384 GB RAM each, and if I pick up the DDR4 RDIMMs from the 4sale section here I can double one.

Slow is OK for this. I just want to see what the fuss is about, i.e. whether it is justified.
 
An alternative could be to use a cloud provider:

https://groq.com/llama-4-now-live-on-groq-build-fast-at-the-lowest-cost-without-compromise/

It is almost free just to try at those rates, and you would need 19 terabytes of RAM for the 10-million-token window.

The smaller Scout can be tried for free on Groq (extremely fast there; it just ran at nearly 500 tokens per second):
https://chat.groq.com/?model=meta-llama/llama-4-scout-17b-16e-instruct
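
If you'd rather hit it from a script than the chat page, Groq also has an OpenAI-compatible API; a minimal sketch (the model id is taken from the URL above, and the key/endpoint handling is my assumption, so check their docs):

```python
# Minimal sketch: calling Scout on Groq's OpenAI-compatible endpoint.
# Assumes a GROQ_API_KEY environment variable; model id taken from the chat URL above.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.environ["GROQ_API_KEY"],
)

resp = client.chat.completions.create(
    model="meta-llama/llama-4-scout-17b-16e-instruct",
    messages=[{"role": "user", "content": "Summarize what's new in Llama 4 Scout."}],
)
print(resp.choices[0].message.content)
```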

Is there really a strong buzz around Llama 4? It seems like a bit of an overall disappointment (the 10-million-token window is maybe a bit oversold, or maybe that's just on the coding side of things).
 
I usually use 1.2*params for my memory sizing (wherever it lives), but I haven't loaded a Llama 4 model yet (just had a new set of RAM show up, so I will hopefully have a chance to get my Epyc rig running this weekend).

LukeTbk -- how are you calculating the 19TB req? (I'm legitimately curious, want to make sure the napkin math I am using is still appropriate for these newer models).

[edit]nvm -- saw this: https://apxml.com/posts/llama-4-system-requirements (though, I don't know why they bother with FP16 as I seem to recall llama 4 is FP8 native)

also this: https://simonwillison.net/2025/Apr/5/llama-4-notes/ (Llama 4 Scout @ Q3 is nearly 48GB memory usage -- oof!)

Am I missing something? I thought the point of a mixture-of-experts model was to reduce memory requirements to the "active" components.
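
For what it's worth, one way to sketch the sizing: MoE cuts the active compute per token, but every expert's weights still have to be resident, and the KV cache grows with the full context window regardless of how few experts are active. The dims below are made-up placeholders, not Scout's actual config:

```python
# Rough sizing sketch: resident weights plus KV cache.
# Hyperparameters are illustrative placeholders, not Scout's real config.
def weight_gb(total_params_b, bytes_per_param, overhead=1.2):
    # the "1.2 * params" rule of thumb
    return total_params_b * bytes_per_param * overhead

def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_elem=2.0):
    # 2x for keys and values, per layer, per token
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_elem / 1e9

print(weight_gb(109, 1.0))                  # ~109B total params at 1 byte/param -> ~131 GB
print(kv_cache_gb(48, 8, 128, 10_000_000))  # 10M-token window, made-up dims     -> ~2,000 GB
```

The exact totals obviously depend on the real layer/head counts and the KV precision, but it shows why a 10M-token window can dominate the budget even though only 17B params are active per token.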
 
I have 2 machines with 384 GB RAM each, and if I pick up the DDR4 RDIMMs from the 4sale section here I can double one.

Slow is OK for this. I just want to see what the fuss is about, i.e. whether it is justified.
AI is pretty overhyped; for normal tasks you won't beat even the free online options locally. I don't think it would necessarily be worth it to max out RAM just to test a model (maybe pillage some memory from one of the 2 servers); just test ones that fit on your current setup. Running large models on CPU and RAM is pretty dang slow. You may want to try a small model, especially one you can fit in GPU VRAM if you have it available. Llama3:8b would likely impress you more with its usability on a GPU than a huge model on a CPU server.
 
@LukeTbk -- how are you calculating the 19TB req? (I'm legitimately curious, want to make sure the napkin math I am using is still appropriate for these newer models).
I'm really not sure how to take the window size into account in that type of calculation; it was from that link...

though, I don't know why they bother with FP16 as I seem to recall llama 4 is FP8 native
It will probably be quantized in many ways; Nvidia made an int4 version and an fp4 version for Blackwell, I think:

https://developer.nvidia.com/blog/nvidia-accelerates-inference-on-meta-llama-4-scout-and-maverick/

But there is little reason to run AI locally instead of using a cloud provider, especially if it is just to test without anything specific in mind... (often it sounds like people wanting a reason to build a system, which, being on HardOCP, we can understand...; not sure my Plex server's local content collection was worth it either)
 