My vllm workflow

1.1.2.3.5... · Jun 5, 2026

Putting this here now; will update when I'm off work with details of my workflow (not trying to type that out on my phone, haha).

Hash_Immutable_1337x · Jun 5, 2026

New to this LLM stuff as well... got baited in from the Gemma4 talk... (hopefully this isn't hijacking.. but could be useful for someone that is as new as I was.. I know this is most likely elementary for some of [H])

I'm on a Windows workstation; home rig with a 9070 XT. My main use case would be my own PowerShell collaboration partner to assist me with scripting, etc.. as my main job is a Windows system engineer...

So far I've downloaded and installed Ollama and Python (along with uvx), then installed Open WebUI (this binds to localhost:8080).
I then launch Open WebUI in an isolated environment (whenever I want to use the LLMs via the web UI).

Then my AI virgin self learned that to pull down the LLMs I need, I run the following. So far I have pulled down:

ollama run gemma4:latest ( this pulls down the E4B size I believe )
ollama run qwen2.5-coder:14b

And .. away we go...

1.1.2.3.5... · Jun 5, 2026

No hijack at all.

1_rick · Jun 5, 2026

This looks nearly exactly like what I'm looking for, except not PowerShell.

Also, as OP said, it's not hijacking because this is literally what I was asking about on the other thread.

If you've got a 9070XT you probably ought to at least look into the just-released 12B model, because more Bs are better (I assume.)

1.1.2.3.5... · Jun 5, 2026

Okay, just sitting down. I will start with the hardware I'm running, for reference.

One box has 2x 3090s with NVLink. One is liquid cooled to keep the system thermals in check. 5950X CPU, 128GB DDR4-3600, 360 AIO. 4TB SN850X Black SSD as the main drive, a 2TB 990 Pro as backup, and 4x 4TB Crucial SSDs in Linux RAID.

The goal for the main PC is to run inference and MCP servers as well as to train smaller models, focusing mostly on vision models.

The second PC is a 5900X with 128GB and a V100 32GB PCIe with a 3D-printed fan and 2 small 40mm fans.

The second PC is for any FP64 work and serves as the MCP client.

I'll either push everything to my git and share or just add code here.

Just to be complete on the workflow, those typically sit in the basement and I SSH into them from VS Code on my Windows laptop upstairs.

1.1.2.3.5... · Jun 5, 2026

This is the bash script I run to start the vllm server. Setup shouldn't be too terrible because it is a docker. However, this is an Nvidia docker image, so you would need to tell me what GPUs you're running and we can find a ROCm container that will work for you.

To download the model I use the Hugging Face CLI with a command like:
hf download havenoammo/Qwen3.6-27B-INT8-MTP --local-dir ./Qwen3.6_dense_hotd_int8_MTP

Instructions to install the Hugging Face CLI are below, and you can add a token for faster downloads if you want.
https://huggingface.co/docs/huggingface_hub/guides/cli

Note: some of these parameters are specific to my machine, and won't work unless you have a similar setup. So I will share all this and wait for more information from you on your specific setup.

Bash:

#!/usr/bin/env bash
set -euo pipefail

# echo "[*] Killing all GPU-using processes…"

# sudo systemctl isolate multi-user.target
# sudo systemctl mask --force display-manager.service
# sudo systemctl mask display-manager
# sudo systemctl stop display-manager
# sudo kill -9 $(nvidia-smi --query-compute-apps=pid --format=csv,noheader) 2>/dev/null || true
# sudo fuser -k -9 /dev/nvidia*

# echo "[*] GPU VRAM should now be fully freed."

# Launch vLLM with Qwen3.6 dense 8-bit
# --model /workspace/Qwen3.6_dense_8bit \

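# /home/user/workspace is the host path to your local model directory.
# (A comment after a trailing backslash would end the command early, so it lives up here.)
docker run --gpus all --rm \
  -p 8000:8000 \
  --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v /home/user/workspace:/workspace \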
  vllm/vllm-openai:cu130-nightly \
    --model /workspace/Qwen3.6_dense_hotd_int8_MTP \
    --tensor-parallel-size 2 \
    --attention-backend FLASHINFER \
    --performance-mode interactivity \
    --max-model-len auto \
    --max-num-batched-tokens 2048 \
    --max-num-seqs 1 \
    --gpu-memory-utilization 0.93 \
    --compilation-config '{"mode":"VLLM_COMPILE","cudagraph_capture_sizes":[3]}' \
    -O3 \
    --async-scheduling \
    --language-model-only \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --speculative-config '{"method":"mtp","num_speculative_tokens":2}' \
    --default-chat-template-kwargs.preserve_thinking true \
    --mamba-cache-mode all \
    --mamba-block-size 8 \
    --enable-prefix-caching \
    --enable-chunked-prefill

When the server has finished starting, you should see 'Application startup complete', and at that point it is serving. Make sure the port the vLLM server publishes (8000 in the script above) matches the port Open WebUI is pointed at.
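A quick way to confirm the server is actually answering on the expected port is to ask it which models it serves. This is a minimal sketch, assuming the OpenAI-compatible `/v1/models` route and the `-p 8000:8000` mapping from the script above; the function names `models_url` and `list_models` are made up for illustration.

```python
import json
import urllib.request


def models_url(host="localhost", port=8000):
    """Build the URL of the OpenAI-compatible model listing route."""
    return f"http://{host}:{port}/v1/models"


def list_models(host="localhost", port=8000, timeout=5.0):
    """Return the model ids the server reports; raises OSError if unreachable."""
    with urllib.request.urlopen(models_url(host, port), timeout=timeout) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload.get("data", [])]
```

Calling `list_models()` on the box running the container should return a list containing the model path passed to `--model`. If it raises a connection error instead, the port mapping or the host address is the first thing to check.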

1.1.2.3.5... · Jun 5, 2026

You can also click the 'Use this model' button on the model page, which brings up a script to run the model if you don't want Docker.

Below is the bash script I use to launch Open WebUI from Docker too.

Bash:

#!/usr/bin/env bash
set -euo pipefail

echo "[*] Stopping old container (if running)…"
docker stop open-webui 2>/dev/null || true

echo "[*] Removing old container (if exists)…"
docker rm open-webui 2>/dev/null || true

echo "[*] Launching Open WebUI…"
docker run -d \
  -p 3000:8080 \
  -v open-webui:/app/backend/data \
  -e OPENAI_API_BASE_URL=http://172.17.0.1:8000/v1 \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

echo "[*] Open WebUI is now running at: http://localhost:3000"

Once this and the vllm server are running you should get something like:

Test your model with whatever query you want:

Code:

To compute **Fourier coefficients** for a continuous-time (CT) signal, you're typically working with the **Continuous-Time Fourier Series (CTFS)**, which applies only to **periodic** signals. If your signal is aperiodic, you'd use the Fourier Transform instead (which yields a continuous spectrum, not discrete coefficients). I'll cover both, but focus on coefficients as requested.

---
### 🔹 1. When to Use Fourier Coefficients
Use Fourier coefficients when:
- The signal \(x(t)\) is **periodic**: \(x(t+T) = x(t)\) for all \(t\).
- It satisfies the **Dirichlet conditions**:
  - Absolutely integrable over one period: \(\int_T |x(t)| dt < \infty\)
  - Finite number of maxima/minima and discontinuities per period.

---
### 🔹 2. Core Formulas (Complex Exponential Form)
For a periodic CT signal with period \(T\) and fundamental frequency \(\omega_0 = \frac{2\pi}{T}\):

**Synthesis (reconstruction):**
\[
x(t) = \sum_{k=-\infty}^{\infty} a_k e^{j k \omega_0 t}
\]

**Analysis (coefficient extraction):**
\[
a_k = \frac{1}{T} \int_{t_0}^{t_0+T} x(t) e^{-j k \omega_0 t} \, dt
\]
- \(a_k\) are the **Fourier coefficients** (complex numbers).
- \(t_0\) can be any starting point; choose it to simplify the integral (e.g., \(0\) or \(-T/2\)).
- \(a_0\) is the **DC component** (average value over one period).

---
### 🔹 3. Step-by-Step Computation
1. **Find the period \(T\)** and compute \(\omega_0 = 2\pi/T\).
2. **Choose integration limits** covering exactly one period.
3. **Split the integral** if \(x(t)\) is defined piecewise.
4. **Evaluate \(a_0\) separately** (set \(k=0\) or use \(\frac{1}{T}\int_T x(t)dt\)).
5. **Compute \(a_k\) for \(k \neq 0\)** using the analysis formula.
6. **Simplify using symmetries** (saves work and reveals properties):
   - Real & even \(x(t) \Rightarrow a_k\) real & even
   - Real & odd \(x(t) \Rightarrow a_k\) purely imaginary & odd
   - Half-wave symmetry \(\Rightarrow a_k = 0\) for even \(k\)
7. **Check convergence**: At discontinuities, the series converges to the midpoint value. Gibbs phenomenon may appear in partial sums.

---
### 🔹 4. Worked Example
**Signal**: Periodic rectangular pulse
\(x(t) = \begin{cases} 1, & -T/2 < t < 0 \\ 0, & 0 \le t < T/2 \end{cases}\), repeated every \(T\).

- \(\omega_0 = 2\pi/T\)
- Choose limits: \([-T/2, T/2]\)

**DC component**:
\[
a_0 = \frac{1}{T} \int_{-T/2}^{0} 1 \, dt = \frac{1}{2}
\]

**For \(k \neq 0\)**:
\[
a_k = \frac{1}{T} \int_{-T/2}^{0} e^{-j k \omega_0 t} dt
= \frac{1}{T} \left[ \frac{e^{-j k \omega_0 t}}{-j k \omega_0} \right]_{-T/2}^{0}
= \frac{1}{T} \cdot \frac{1 - e^{j k \omega_0 T/2}}{j k \omega_0}
\]
Since \(\omega_0 T = 2\pi\), \(e^{j k \pi} = (-1)^k\):
\[
a_k = \frac{1}{j k 2\pi} \left(1 - (-1)^k\right)
\]
- If \(k\) is even \(\Rightarrow a_k = 0\)
- If \(k\) is odd \(\Rightarrow a_k = \frac{1}{j k \pi}\)

Final:
\[
a_k = \begin{cases}
\frac{1}{2}, & k = 0 \\
0, & k \text{ even}, \, k \neq 0 \\
\frac{1}{j k \pi}, & k \text{ odd}
\end{cases}
\]

---
### 🔹 5. Practical & Numerical Notes
- **Analytical vs Numerical**: Closed-form solutions work for piecewise smooth signals. For arbitrary waveforms, discretize the integral:
  \[
  a_k \approx \frac{1}{N} \sum_{n=0}^{N-1} x(nT/N) e^{-j k n 2\pi / N}
  \]
  This is essentially the **DFT/FFT** applied to one period of sampled data.
- **Windowing**: If you only have a finite observation window of a signal, multiply by a window function before computing coefficients to reduce spectral leakage.
- **Trigonometric Form**: If you prefer real coefficients:
  \[
  x(t) = a_0 + \sum_{k=1}^{\infty} \left( A_k \cos(k\omega_0 t) + B_k \sin(k\omega_0 t) \right)
  \]
  where \(A_k = a_k + a_{-k}\), \(B_k = j(a_k - a_{-k})\).

---
### 🔹 6. What If the Signal Isn't Periodic?
For aperiodic CT signals, use the **Continuous-Time Fourier Transform (CTFT)**:
\[
X(\omega) = \int_{-\infty}^{\infty} x(t) e^{-j\omega t} dt
\]
This yields a **continuous spectrum** \(X(\omega)\), not discrete coefficients. You can approximate Fourier coefficients by:
1. Truncating the signal to a large window \(T\)
2. Treating it as periodic
3. Computing \(a_k\) as above (introduces spectral leakage/aliasing if not careful)

---
### ✅ Quick Checklist
- [ ] Signal is periodic → use CTFS coefficients
- [ ] Found \(T\) and \(\omega_0\)
- [ ] Chose convenient integration limits
- [ ] Handled \(k=0\) separately
- [ ] Used symmetry to simplify
- [ ] Verified Dirichlet conditions / convergence behavior
- [ ] For numerical work: sampled uniformly over one period, used FFT

Let me know your specific signal or context (analytical vs numerical, real/complex, periodic/aperiodic), and I can tailor the steps or derive coefficients for it.

Note the above is LaTeX output that renders perfectly nicely in Markdown.
And note the answer from Qwen3.6 27b is pretty damn good.
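For anyone who wants to sanity-check a derivation like the worked example in that output, the analysis integral is easy to approximate numerically. The sketch below assumes T = 1 and uses a plain midpoint rule; `fourier_coefficient` and `pulse` are names made up here, not anything the model produced.

```python
import cmath
import math


def fourier_coefficient(x, k, period=1.0, n=4096):
    """Approximate a_k = (1/T) * integral over one period of x(t) * exp(-j*k*w0*t) dt.

    Uses the midpoint rule on [-T/2, T/2) with n equal slices.
    """
    w0 = 2.0 * math.pi / period
    dt = period / n
    total = 0j
    for i in range(n):
        t = -period / 2.0 + (i + 0.5) * dt  # midpoint of slice i
        total += x(t) * cmath.exp(-1j * k * w0 * t) * dt
    return total / period


def pulse(t):
    """The worked example's signal on one period: 1 on (-T/2, 0), 0 on [0, T/2)."""
    return 1.0 if t < 0.0 else 0.0


a0 = fourier_coefficient(pulse, 0)
a1 = fourier_coefficient(pulse, 1)
a2 = fourier_coefficient(pulse, 2)
print(a0, a1, a2)
```

This gives a_0 ≈ 0.5 and a_2 ≈ 0, matching the output above, but a_1 comes out close to j/π ≈ 0.318j, which equals -1/(jπ). That is the opposite sign from the closed form in the worked example, so the step that evaluates the bracket at the limits is worth re-deriving by hand before relying on it.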

Tuning the model is a whole thing, and it's one big reason I like to stick with Docker containers and scripts, as they make this easy to play with.
Note: I am perfectly happy with both the token rate I am getting and the quality of the answers I get.

sharknice · Jun 5, 2026

What are you doing with them?

I just use Vane with some local models through ollama on a machine with a 3090 running Ubuntu.

I tried openclaw but didn't like how it's meant to be used with paid cloud models.
I'm thinking about trying that PewDiePie thing built specifically for having everything local.

I also use comfyui for a bunch of stuff.

1.1.2.3.5... · Jun 5, 2026

https://rocm.docs.amd.com/projects/...ml?utm_source=copilot.com#docker-with-toolkit

That is the AMD ROCm container, and I would recommend running that for your RX 6800.

Note: you're running 16GB of VRAM, so you're going to need to make sure you're running a model that is sized right in terms of both the number of parameters and the quantization, and that uses types which will run natively on your RX 6800 GPU.

Additionally, you are limited on parameter types on the RX 6800 to BF16 and FP16, which will limit the overall size of the model you can load into VRAM. Also, vLLM is great for performance, but it does not officially support the RX 6800 GPU, so you might get better mileage with Ollama or one of the other LLM servers.
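The sizing point can be made concrete with back-of-the-envelope arithmetic: weight memory is roughly parameter count times bits per parameter. The sketch below is only an estimate. The 2 GiB of headroom for KV cache and runtime overhead is an assumed placeholder, and `weight_memory_gib` and `fits` are illustrative names.

```python
def weight_memory_gib(params_billions, bits_per_param):
    """Approximate memory for the weights alone, in GiB."""
    total_bytes = params_billions * 1e9 * bits_per_param / 8
    return total_bytes / 2**30


def fits(params_billions, bits_per_param, vram_gib, headroom_gib=2.0):
    """True if the weights plus an assumed headroom fit in the given VRAM."""
    return weight_memory_gib(params_billions, bits_per_param) + headroom_gib <= vram_gib


for label, params, bits in [("12B @ FP16", 12, 16), ("12B @ 4-bit", 12, 4), ("7B @ FP16", 7, 16)]:
    print(f"{label}: {weight_memory_gib(params, bits):.1f} GiB, fits in 16 GiB: {fits(params, bits, 16)}")
```

By this estimate a 12B model at FP16 needs about 22 GiB for weights alone and will not fit in 16GB, while the same model at 4-bit needs under 6 GiB. Real usage depends on the runtime and the context length, so treat the result as a first filter rather than a guarantee.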

zandor · Jun 5, 2026

Read up a bit on VLLM... not sure it's the right option for me. From the sounds of things it's good at multi-GPU and multiple simultaneous requests. Presently I don't have either. Aside from screwing around with image generation in ComfyUI I've just been firing up LM Studio, plugging models into an IDE and using them to write code. Mostly PyCharm + Qwen 3.6 at home. At work it's IntelliJ + Google Gemini. Maybe I would if I started playing with OpenClaw. I'm more likely to get some good out of a low level llama.cpp setup. I have a 5090, 285k and 64GB DDR5-6400 on my main rig and a 3090, i9-10980XE, and 64GB in my old rig. llama.cpp is supposed to be the way to go for partial CPU/GPU offloading. Haven't tried that yet, but it would let me run larger models. Funny thing is my old rig may actually be faster at CPU inference. 18 cores with AVX-512 and quad channel DDR4-3600, so it actually has more memory bandwidth than my current rig. That DDR5-6400 was supposed to be temporary until larger XMP CUDIMMs came out. Still waiting, and now ram is stupid expensive.

I could maybe get the old rig up to 128GB of mismatched RAM without buying anything. It'd end up running at DDR4-2133 (JEDEC) speeds, or I could add heatsinks to a cheap kit of green-PCB, no-heatsink Crucial and do some manual OC. The Crucial is DDR4-3200 non-XMP. It's the XMP 3600 that ends up at 2133 with XMP off. Not sure if it's worth the hassle. Mostly I'm just trying to figure out how much better a 128GB setup would be.

I've been eyeing an NV DGX Spark or AMD Ryzen AI 395+ machine. I'd consider a Mac but you can't actually order a Studio with more than 96GB right now. Given that all I want it for is running AI stuff the Mac is out. Not enough RAM. Might get an Air next time I need a laptop. I looked at various multi-GPU options, but getting to 96-128GB of combined VRAM and building the rest of a system that can support them is difficult without going way over the price of one of those mini-PCs.
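The memory bandwidth comparison above is easy to check on paper: theoretical peak bandwidth is channels times bytes per transfer times transfer rate. This sketch assumes 64-bit channels and that the 285K rig runs dual-channel DDR5-6400. These are theoretical peaks, not measured throughput, and `peak_bandwidth_gbs` is an illustrative name.

```python
def peak_bandwidth_gbs(channels, mega_transfers_per_s, bus_width_bits=64):
    """Theoretical peak DRAM bandwidth in GB/s (decimal), assuming 64-bit channels."""
    bytes_per_transfer = bus_width_bits / 8
    return channels * bytes_per_transfer * mega_transfers_per_s / 1000


old_rig = peak_bandwidth_gbs(channels=4, mega_transfers_per_s=3600)    # quad-channel DDR4-3600
old_jedec = peak_bandwidth_gbs(channels=4, mega_transfers_per_s=2133)  # same rig at DDR4-2133
main_rig = peak_bandwidth_gbs(channels=2, mega_transfers_per_s=6400)   # assumed dual-channel DDR5-6400
print(f"{old_rig:.1f} {old_jedec:.1f} {main_rig:.1f}")
```

That works out to about 115.2 GB/s for the quad-channel DDR4-3600 setup versus 102.4 GB/s for dual-channel DDR5-6400, so the old rig does come out ahead on paper. Dropping to DDR4-2133 cuts it to roughly 68.3 GB/s, which would give that advantage back.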

sharknice · Jun 6, 2026

It really depends on what you want to do with it. The large amount of memory means you can use bigger models, but they sometimes run so slow it's not worth it.

1.1.2.3.5... · Jun 6, 2026

Yeah, I haven't personally dived into running mixture-of-experts (MoE) models, mostly because once I started playing with Qwen 3.6 I felt the model was good enough for my needs and started trying to use it.

I've tested it pretty thoroughly and I just don't see a big enough lift to go larger, so instead I've shifted to playing with retrieval augmentation for my massive collection of ebooks and with MCP servers. I don't have that down pat yet.

In terms of IDE integration and code assistants, there are a LOT of great options, but I'm most focused on building competent local inference agents and am still trying to get solid RAG performance (chunking and dependency graphs are hard).

FrgMstr · Jun 6, 2026

sharknice said:
It really depends on what you want to do with it. The large amount of memory means you can use bigger models, but they sometimes run so slow it's not worth it.

A basic thread for a while now, and it used to mean more than it does now. With the advent of Hermes, you can set your goals for your agent and they will work without much if any input for a long time. That said, I am running Qwen 3.5 122B right now, because Qwen 3.6 36B was not getting things done. 3.5 seems to be pretty fast on the coding front, and while I have not run any benchmarks, I know beating on 3.6 for five hours and not getting things done makes 3.5 seem hella fast. Really just depends on what your goal and your timeline is.

I have SuperGrok. I usually go and ask, what model for this project? What model for this platform? What tweaks to make it run stable and fast? Grok is pretty good at getting things set up well.

Just started playing with Hermes Desktop on Windows. This will be a compelling UI for a lot of folks that are scared of anything Linux...

1_rick · Jun 7, 2026

1.1.2.3.5... said:
This is the bash script I run to start the vllm server. Setup shouldn't be too terrible because it is a docker. However, this is an Nvidia docker image, so you would need to tell me what GPUs you're running and we can find a ROCm container that will work for you.

Sorry, stuff came up this weekend. I've got an RX 6800 (non-XT).

1_rick · Jun 7, 2026

I may have to switch to something other than Arch... AMD's instructions assume Ubuntu and apt, and the commands don't translate directly to using pacman, and I don't want to spend a ton of time figuring that out.

1.1.2.3.5... · Jun 7, 2026

1_rick said:
I may have to switch to something other than Arch... AMD's instructions assume Ubuntu and apt, and the commands don't translate directly to using pacman, and I don't want to spend a ton of time figuring that out.

Yeah, that tracks. Ubuntu is effectively the Windows of Linux distros.

1_rick · Jun 8, 2026

Good Lord, this is crazy. I installed Ubuntu 26.04, because that's what Ubuntu.com offered. AMD's instructions you linked expect 22.04 or 24.04, so I tried the latter, silly me. I got to the "installing kernel drivers" part of the ROCm quickstart, and it sat and spun for a while and started to recompile something large, then bombed out, probably because I have a newer version of the kernel than expected (7.0.0.22 vs .14), I guess. So I suppose I'll just reinstall AGAIN, but do 24.04 this time.

I did get ollama with llama3 installed pretty easily at least, and it was able to create a workable C# Hello World program.

1.1.2.3.5... · Jun 8, 2026

Yeah, I never claimed setup would be super easy, which is why I was sharing my setup to hopefully speed you along.

Again, I would really try to stick with docker if you can and write launch scripts to keep what works. I have a local git repo I use for most of my experiments.

The really annoying bits come down to quantization and what types your specific hardware will be able to run.

zandor · Jun 8, 2026

FrgMstr said:
A basic thread for a while now, and it used to mean more than it does now. With the advent of Hermes, you can set your goals for your agent and they will work without much if any input for a long time. That said, I am running Qwen 3.5 122B right now, because Qwen 3.6 36B was not getting things done. 3.5 seems to be pretty fast on the coding front, and while I have not run any benchmarks, I know beating on 3.6 for five hours and not getting things done makes 3.5 seem hella fast. Really just depends on what your goal and your timeline is.

What was Qwen 3.6 messing up on for you?

I've had it get wedged a couple times. Running a 4-bit quant of 35B-A3 on a 5090. Just sat there doing nothing and one time it pegged a bunch of CPU cores at 100% for no apparent reason. Wasn't out of vram, 5090 had a couple gigs left available and was sitting there idle. Might be LMStudio causing that. At some point I'll get around to setting up llama.cpp & try out vllm. The other one that keeps getting me is it makes assumptions about naming conventions. Arguably that's partially my fault. I've been writing code since the 1990s and mostly work in C-style languages, particularly Java, so I tend to camel case things. Now I'm having Qwen write python. I write something in camel case and Qwen goes and stuffs an underscore in there. So instead of ThingOne Qwen wants to call it thing_one. Usually it messes up an import statement.

1_rick · Jun 8, 2026

1.1.2.3.5... said:
Again, I would really try to stick with docker if you can and write launch scripts to keep what works.

Makes sense. I followed your ROCm link, installed docker, and failed partway through AMD's instructions because something (amdgpu-dkms, I think) wouldn't compile properly. Something about linux-headers-7.0.0.14-generic, when what my kernel was, was .22. I didn't want to spend time figuring out hundreds of lines of compiler errors tonight, not when I hadn't had any chance to do video games yet.

FrgMstr · Jun 9, 2026

zandor said:
What was Qwen 3.6 messing up on for you?

It was not messing up. It was simply not a robust enough model for the coding I was doing. It simply could not figure out some of the things I was asking it to do. Had to go a bit "bigger."

1_rick · Jun 10, 2026

I remembered yesterday that I also have a Snapdragon X laptop, so I asked Gemini about using it. It said yeah and suggested a few models. So I installed ollama and starcoder2:7b last night--I mentioned this in the thread on the other subforum. Then I downloaded a couple of Gemma4 models (e4b and a 4-bit quantized 12b).

I tried again with one of the three, and it did a significantly better job, although it did it in a way unlike the way I originally solved the problem, or how ChatGPT did it last year. No biggie, except I noticed I had given it one detail wrong so I asked it to retry "with the assumption the image will be 1bpp instead of 2bpp" and something about color assumptions that isn't really important. It gave me about half the revised program, and just stopped, no errors or diagnostics, midway through a statement. Also, both times it gave me a partial C# class and then forgot to write the separate program that defines the (gui) window layout, so it didn't actually come up with a complete answer, although after reading the whole thing I think it was reasonably close. Also, last time I tried (12-18 months ago, maybe?), it completely botched my instructions that it should handle me dropping files onto the window, which, to be fair, is a bit complicated, and there wasn't much documentation. This time, it got that right and even added support for dropping multiple files, which I never got working. So definite improvement in spite of the ultimate failure.

Today I tried using gemma4:12b to do something I had already done successfully with Claude 4.5 (I think, or maybe ChatGPT) a month or two ago: given an XML file that represents a SOAP template (some tags are empty, others have placeholder values), create a GUI program that lets me fill in values, generates a file based on the template, and calls cURL to submit the file to an API endpoint. This time it spent a lot of time on a digression about the using statement, including a gratuitous example involving student grades, and then got halfway through the program I requested before, again, dying in the middle of the program. I asked it to try again without the using digression and it started repeating the exact same stuff, so I stopped it and went to lunch.

zandor · Jun 10, 2026

1_rick said:
This time it spent a lot of time on a digression about the using statement, including a gratuitous example involving student grades, and then got halfway through the program I requested before, again, dying in the middle of the program.

That might actually just be configuration or it could be the model. It'll just stop dead in its tracks if it hits the token limit. Max context length is generally something you can set in whatever app you're using (e.g. in ollama), and models also have limits. I can't say for certain that's the issue but that is what will happen if you hit the context length limit.

You might not want to set the context length to maximum depending on how much memory you have on that machine. Larger contexts use up more memory, and running out of memory can be more annoying than hitting the context length limit. Run out of ram and other random apps crash, or you start swapping, machine gets really slow, and you're chewing through write cycles on your SSD.
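To put rough numbers on how context length eats memory: for a standard transformer, the KV cache grows linearly with the number of tokens held in context. The sketch below uses a made-up model shape (48 layers, 8 KV heads, head dimension 128, FP16 cache) that is not taken from any model in this thread. Hybrid or sliding-window architectures will differ, so this is only an order-of-magnitude guide. In Ollama the relevant setting is `num_ctx`.

```python
def kv_cache_gib(context_tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV cache size in GiB for a standard attention stack."""
    # The leading 2 accounts for one key tensor and one value tensor per layer.
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens / 2**30


# Hypothetical model shape, chosen only to illustrate the scaling.
for ctx in (8192, 32768, 131072):
    print(ctx, kv_cache_gib(ctx, layers=48, kv_heads=8, head_dim=128))
```

With that assumed shape, an 8K context costs about 1.5 GiB on top of the weights, 32K costs 6 GiB, and 128K costs 24 GiB. That is why maxing out the context on a machine with limited memory tends to end in swapping.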

1.1.2.3.5... · Jun 10, 2026

https://huggingface.co/Qwen/Qwen2.5-Coder-14B

https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct

I have considered trying those just because they would be extremely fast on my machine. They are excellent coding-specific models that I think you should consider giving a go... eff it... imma give them a go...

Nobu · Jun 10, 2026

1_rick said:
Makes sense. I followed your ROCm link, installed docker, and failed partway through AMD's instructions because something (amdgpu-dkms, I think) wouldn't compile properly. Something about linux-headers-7.0.0.14-generic, when what my kernel was, was .22. I didn't want to spend time figuring out hundreds of lines of compiler errors tonight, not when I hadn't had any chance to do video games yet.

Yeah, the headers version needs to match the kernel or it won't find them (they are in separate directories in /usr/src iirc). If you installed a new kernel+headers and haven't rebooted, it may fail because the old headers were removed but you are still running the old kernel.
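A quick way to check for that mismatch before kicking off a DKMS build is sketched below. It assumes the Debian/Ubuntu layout where headers live in /usr/src/linux-headers-<release>; the function names are made up for illustration, and other distros may put headers elsewhere.

```python
import os
import platform


def headers_dir_for(release: str, src_root: str = "/usr/src") -> str:
    """Return the directory where headers for `release` are expected.

    Assumes the Debian/Ubuntu naming scheme linux-headers-<release>.
    """
    return os.path.join(src_root, f"linux-headers-{release}")


def matching_headers_installed(release: str, src_root: str = "/usr/src") -> bool:
    """True if a headers directory exists for exactly this kernel release."""
    return os.path.isdir(headers_dir_for(release, src_root))


if __name__ == "__main__":
    running = platform.release()  # same string that `uname -r` prints
    if matching_headers_installed(running):
        print(f"Headers found for running kernel {running}")
    else:
        print(f"No headers for running kernel {running}; "
              f"expected {headers_dir_for(running)}")
```

If it reports missing headers right after a kernel update, rebooting into the new kernel (or installing headers for the running one) is the likely fix.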

1_rick · Jun 10, 2026

zandor said:
That might actually just be configuration or it could be the model. It'll just stop dead in its tracks if it hits the token limit. Max context length is generally something you can set in whatever app you're using (e.g. in ollama), and models also have limits. I can't say for certain that's the issue but that is what will happen if you hit the context length limit.

You might not want to set the context length to maximum depending on how much memory you have on that machine. Larger contexts use up more memory, and running out of memory can be more annoying than hitting the context length limit. Run out of ram and other random apps crash, or you start swapping, machine gets really slow, and you're chewing through write cycles on your SSD.
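To put a rough number on "larger contexts use up more memory": for a plain transformer, the KV cache grows linearly with context length. The sketch below is a back-of-the-envelope estimate only. The layer count, KV head count and head size are hypothetical values, not the real numbers for any model in this thread, and it ignores the model weights and any tricks like sliding-window attention or cache quantization.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_value: int = 2) -> int:
    """Estimate KV cache size: one K and one V vector per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


if __name__ == "__main__":
    # Hypothetical model shape, fp16 cache (2 bytes per value).
    for ctx in (8192, 32768, 131072):
        size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                              context_len=ctx)
        print(f"context {ctx:>6}: {size / 1024**3:.1f} GiB of KV cache")
```

With those made-up numbers, 8K of context costs about 1 GiB and 32K about 4 GiB on top of the weights, which is why maxing out the context on a 16GB card can backfire.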

Yeah, I didn't max it out because of that. Then I went down a side passage into an entirely new rabbit hole about trying to use the NPU instead of the CPU (this was on my Snapdragon laptop, not the Ryzen 255 mini pc I started with.)

Now, after running some errands and stuff, I'm back home on the mini PC, having put Ubuntu 24.04 on it, and going out to the ROCm pages, where I discovered the current version doesn't support the RX 6000 series! So I don't know whether to YOLO it or look for an older version. In the meantime I'll install WINE so I can play a Windows game (or else install it into Steam, for Proton.)

1_rick · Jun 11, 2026

So, finally got it working! I don't remember why I decided to start with ollama, probably because that's what Gemini suggested or something, and once I started I wanted to see it through.

Anyway: after some false starts only partly caused by AI (and ollama's own docs!) I just managed to get the GPU inference working! As a refresher, I've got a Ryzen 255 with 32GB of RAM and an RX 6800 with 16GB. Measured by eye, the GPU is at least twice as fast as the CPU in terms of how fast it displays text (I guess that's a loose approximation of the token rate?) It kinda makes me curious to try CPU inference on my 14700K, but to do that I'd need to be able to do a benchmark.

1.1.2.3.5... · Jun 11, 2026

There's usually something that logs the token rate, cache hits etc...but not sure for ollama?

There's llm-benchmark on GitHub you could try.

https://github.com/thomasdavis/llm-benchmark

https://pypi.org/project/llm-benchmark/
there's a wheel for you too

1_rick · Jun 11, 2026

1.1.2.3.5... said:
There's usually something that logs the token rate, cache hits etc...but not sure for ollama?

I'm willing to try other stuff than ollama. At work I use VS Code and my work's Copilot license, and there's a built-in chat. I'd like to work that way on Linux (or reasonably close to it) but I'm not wedded to that model.

1.1.2.3.5... · Jun 11, 2026

1_rick said:
I'm willing to try other stuff than ollama. At work I use VS Code and my work's Copilot license, and there's a built-in chat. I'd like to work that way on Linux (or reasonably close to it) but I'm not wedded to that model.

continue.dev extension works ish.

But you won't get the copilot experience.

Alternatively, you can use an mcp server and just have agents that do the lifting for you.

1.1.2.3.5... · Jun 11, 2026

View: https://youtu.be/QYSmQx55e44?is=t76ReGdKhbjhrfpE

Interesting. I can get a steady 65-70 tokens/sec on my 2x 3090s. That feels both good and bad. I have a much larger context window, which helps.

1_rick · Jun 11, 2026

1.1.2.3.5... said:
mcp server

I know what the acronym stands for but I have been avoiding learning about them before this. Once work forced my hand, I figured I might as well see if I can get any personal benefit.

1_rick · Jun 12, 2026

Doing a bit of comparison testing this morning. It looks like "ollama run model --verbose" will give you some stats when the run finishes. With a simple query ("write a program to reverse the colors of a 1bpp windows bitmap") I'm getting an eval rate of about 37 tokens/s on GPU compute. Now I'm running the same query with the GPU disabled and the CPU compute rate is a much wimpier 6.5 tokens/s.

This isn't intended to be scientific, just to get a ballpark speed difference. The Ryzen 255 is an 8C/16T processor and I have 32GB of RAM.

1_rick · Jun 12, 2026

GPU

total duration: 47.125871167s
load duration: 317.834968ms
prompt eval count: 39 token(s)
prompt eval duration: 145.353ms
prompt eval rate: 268.31 tokens/s
eval count: 1734 token(s)
eval duration: 46.652501s
eval rate: 37.17 tokens/s

CPU

total duration: 5m37.928418533s
load duration: 311.566063ms
prompt eval count: 31 token(s)
prompt eval duration: 752.212ms
prompt eval rate: 41.21 tokens/s
eval count: 2168 token(s)
eval duration: 5m36.8626s
eval rate: 6.44 tokens/s
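For anyone wanting to compare runs without eyeballing it, the eval rate is just eval count divided by eval duration. Here's a minimal sketch that parses the duration strings as they appear in the stats above ("46.652501s", "5m36.8626s", "145.353ms") and recomputes the rates. The helper names are made up, and the parser only handles the duration shapes seen in this output plus a few obvious unit suffixes.

```python
import re

# Seconds per unit suffix; "ms" must be tried before "m" in the regex below.
_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0,
                 "ms": 1e-3, "us": 1e-6, "ns": 1e-9}
_PART = re.compile(r"(\d+(?:\.\d+)?)(ms|us|ns|h|m|s)")


def parse_duration(text: str) -> float:
    """Convert a duration like '5m36.8626s' or '145.353ms' to seconds."""
    parts = _PART.findall(text.strip())
    if not parts:
        raise ValueError(f"unrecognized duration: {text!r}")
    return sum(float(value) * _UNIT_SECONDS[unit] for value, unit in parts)


def tokens_per_second(token_count: int, duration: str) -> float:
    """Tokens generated (or evaluated) per second of wall time."""
    return token_count / parse_duration(duration)


if __name__ == "__main__":
    gpu = tokens_per_second(1734, "46.652501s")
    cpu = tokens_per_second(2168, "5m36.8626s")
    print(f"GPU: {gpu:.2f} tokens/s")
    print(f"CPU: {cpu:.2f} tokens/s")
    print(f"GPU is about {gpu / cpu:.1f}x faster at generation")
```

Plugging in the numbers above reproduces the reported 37.17 and 6.44 tokens/s, so the GPU comes out a bit under 6x faster at generation for this query. The two runs produced different amounts of output (1734 vs 2168 tokens), so the rate is the fair comparison, not the total duration.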

1_rick · Jun 12, 2026

I guess Gemma4 doesn't know Star Trek or understand humor. I told it "help me spock" and got a page of thinking and a somewhat generic expression of concern and support.

I guess I'd better not repeat the line from Indiana Jones and the Crystal Skull and demand it tell me everything.

1_rick · Jun 12, 2026

Little bit of humor, maybe: I asked gemma4:12b to write a variation of my test program: remove the border from a black-and-white bitmap, but with a twist: do it for Arduino, with a specific board. It took almost 13 minutes, because it spent well over 1000 lines going back and forth about what my cropping instructions meant. With the context window set to 32K, this time it did eventually spit out a program. I have skimmed the program, not analyzed it fully, but I'm pretty sure it's not going to work anyway, and if it does, it'll be poorly optimized.

For example:

#include <SdFat.h>
#include <Arduino_Get_Firmware_Revision.h>
#; 130

// Configuration for SD Card (Standard for Adafruit M0)
Sd { } ; // Placeholder if using standard SdRoot, but better to use Sd { }
Sd { } ;

// For the M0, we use the standard SdFat library structure
Sd { } ;
using namespace SdFat;

Not sure what's up with line 3 there but "Sd {};" is probably not quite valid[1] and certainly it makes no sense to repeat it 3 times.

// Locate dimensions from DIB info (Offset 14 in many systems, but let's look at standard headers)
uint32_t bytesPerPixel = (bitDepth == 1) ? 1 : 0; // Placeholder for logic
// In a real scenario, we would parse the header correctly to find width/height.
int originalWidth = 128; // Example placeholder - should be parsed from buffer
int originalHeight = 128; // Example placeholder

It then proceeds to not bother to actually figure out the actual dimensions. A little disappointing.
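For reference, getting the real dimensions isn't much work. In a standard Windows BMP the 14-byte file header is followed by a BITMAPINFOHEADER, with width at byte offset 18, height at 22 and bit depth at 28, all little-endian. Below is a minimal sketch in Python rather than Arduino C++, just to show the offsets; the function name is made up, and it assumes the common 40-byte-or-larger info header, not the old OS/2 variant.

```python
import struct


def read_bmp_dimensions(data: bytes) -> tuple[int, int, int]:
    """Return (width, height, bits_per_pixel) from the start of a BMP file."""
    if len(data) < 30 or data[:2] != b"BM":
        raise ValueError("not a BMP file, or header too short")
    # Info header starts at offset 14: size (u32), width (i32), height (i32),
    # planes (u16), bit count (u16), all little-endian.
    header_size, width, height, _planes, bit_count = struct.unpack_from(
        "<IiiHH", data, 14)
    if header_size < 40:
        raise ValueError("unsupported header (older BITMAPCOREHEADER layout)")
    # A negative height means the rows are stored top-down.
    return width, abs(height), bit_count


if __name__ == "__main__":
    # Fake header for a 128x64 1bpp image: 14-byte file header + info header.
    fake = b"BM" + bytes(12) + struct.pack("<IiiHH", 40, 128, 64, 1, 1)
    print(read_bmp_dimensions(fake))
```

The same three reads at the same offsets would work on the Arduino side once the first 30 bytes of the file are in a buffer.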

[1] it might technically be valid but I think all it would do is allocate space for one object, which, without a name, would be inaccessible.

1_rick · Jun 12, 2026

Where's the correct place to get LM Studio? When I did a search for it I found a bunch of sites, and I don't want to grab malware from the wrong one.

zandor · Jun 12, 2026

1_rick said:
Where's the correct place to get LM Studio? When I did a search for it I found a bunch of sites, and I don't want to grab malware from the wrong one.

I got it from lmstudio.ai. That's where HuggingFace says to get it from: https://huggingface.co/lmstudio-ai Hope that was the right move.
