• Some users have recently had their accounts hijacked. It seems that the now defunct EVGA forums might have compromised your password there and seems many are using the same PW here. We would suggest you UPDATE YOUR PASSWORD and TURN ON 2FA for your account here to further secure it. None of the compromised accounts had 2FA turned on.
    Once you have enabled 2FA, your account will be updated soon to show a badge, letting other members know that you use 2FA to protect your account. This should be beneficial for everyone that uses FSFT.

My vllm workflow

New to this LLM stuff as well... got baited in from the Gemma4 talk... (hopefully this isn't hijacking.. but could be useful for someone that is as new as I was.. I know this is most likely elementary for some of [H])

I'm on a Windows workstation; home rig with a 9070 XT. My main use case would be my own PowerShell collaboration partner to assist me with scripting, etc.. as my main job is a Windows system engineer...

So far I've downloaded/installed Ollama, Python (along with uvx) installed Open WebUI (this binds to localhost:8080)
I then launch Open WebUi in an isolated environment (whenever I want to utilize the LLM's via the web ui)
1780669675598.png

Then my AI virgin self learned to pull down the LLM's I need to run the following, which I have so far pulled down:

ollama run gemma4:latest ( this pulls down the E4B size I believe )
ollama run qwen2.5-coder:14b

1780670147227.png



And .. away we go...

1780670349788.png
 
New to this LLM stuff as well... got baited in from the Gemma4 talk... (hopefully this isn't hijacking.. but could be useful for someone that is as new as I was.. I know this is most likely elementary for some of [H])

I'm on a Windows workstation; home rig with a 9070 XT. My main use case would be my own PowerShell collaboration partner to assist me with scripting, etc.. as my main job is a Windows system engineer...

So far I've downloaded/installed Ollama, Python (along with uvx) installed Open WebUI (this binds to localhost:8080)
I then launch Open WebUi in an isolated environment (whenever I want to utilize the LLM's via the web ui)
View attachment 807379
Then my AI virgin self learned to pull down the LLM's I need to run the following, which I have so far pulled down:

ollama run gemma4:latest ( this pulls down the E4B size I believe )
ollama run qwen2.5-coder:14b

View attachment 807383


And .. away we go...

View attachment 807385
This looks nearly exactly like what I'm looking for except not powershell.

Also, as OP said, it's not hijacking because this is literally what I was asking about on the other thread.

If you've got a 9070XT you probably ought to at least look into the just-released 12B model, because more Bs are better (I assume.)
 
Okay, just sitting down. I will start with hardware im running for reference.

One box is 2x 3090s with nvlink. One is liquid cooled to keep the system thermals in check. 5950x cpu, 128gb ddr4 3600 360 aio. 4tb sn 850x black ssd main drive and 2tb 990 pro backup and 4x 4tb crucial ssd drives in linux raid .

Goals for main pc is to run inference and mcp servers as well as training smaller models, focusing mostly on vision models.

Second pc is a 5900x 128gb, and a v100 32 gb pcie with a 3d printed fan and 2 small 40mm fans.

Second pc is for any fp64 and as the mcp client.

I'll either push everything to my git and share or just add code here.

Just to be complete on the workflow, those typically sit in the basement and i ssh into them on vs code on my windows laptop upstairs.
 

Attachments

  • IMG_8393.jpeg
    IMG_8393.jpeg
    321.7 KB · Views: 0
  • IMG_8394.jpeg
    IMG_8394.jpeg
    404.1 KB · Views: 0
  • IMG_8395.jpeg
    IMG_8395.jpeg
    412.8 KB · Views: 0
This is the bash script I run to start the vllm server. Setup shouldn't be too terrible because it is a docker. However, this is an Nvidia docker image, so you would need to tell me what GPUs you're running and we can find a ROCm container that will work for you.


To download the model I use hugging face cli with a command like
hf download havenoammo/Qwen3.6-27B-INT8-MTP --local-dir ./Qwen3.6_dense_hotd_int8_MTP

Instructions to install hugging face cli, and you can add a token for faster downloads if you want.
https://huggingface.co/docs/huggingface_hub/guides/cli

Note: some of these parameters are specific to my machine, and won't work unless you have a similar setup. So I will share all this and wait for more information from you on your specific setup.
Bash:
#!/usr/bin/env bash
set -euo pipefail

# echo "[*] Killing all GPU-using processes…"

# sudo systemctl isolate multi-user.target
# sudo systemctl mask --force display-manager.service
# sudo systemctl mask display-manager
# sudo systemctl stop display-manager
# sudo kill -9 $(nvidia-smi --query-compute-apps=pid --format=csv,noheader) 2>/dev/null || true
# sudo fuser -k -9 /dev/nvidia*

# echo "[*] GPU VRAM should now be fully freed."

# Launch vLLM with Qwen3.6 dense 8‑bit
# --model /workspace/Qwen3.6_dense_8bit \

docker run --gpus all --rm \
  -p 8000:8000 \
  --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v /home/user/workspace:/workspace \ # Path to your local model
  vllm/vllm-openai:cu130-nightly \
    --model /workspace/Qwen3.6_dense_hotd_int8_MTP \
    --tensor-parallel-size 2 \
    --attention-backend FLASHINFER \
    --performance-mode interactivity \
    --max-model-len auto \
    --max-num-batched-tokens 2048 \
    --max-num-seqs 1 \
    --gpu-memory-utilization 0.93 \
    --compilation-config '{"mode":"VLLM_COMPILE","cudagraph_capture_sizes":[3]}' \
    -O3 \
    --async-scheduling \
    --language-model-only \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --speculative-config '{"method":"mtp","num_speculative_tokens":2}' \
    --default-chat-template-kwargs.preserve_thinking true \
    --mamba-cache-mode all \
    --mamba-block-size 8 \
    --enable-prefix-caching \
    --enable-chunked-prefill

When the server is done, you should see 'Application startup complete' and you're good and should be serving. You need to ensure the ports from the vllm server and the open webui server are lined up.
1780705205695.png
 
Last edited:
1780703917132.png


You can also click the 'Use This model' button and you get this to come up with a script to run the model if you don't want docker.

Below is the bash script I use to launch open webui from docker too.

Bash:
#!/usr/bin/env bash
set -euo pipefail

echo "[*] Stopping old container (if running)…"
docker stop open-webui 2>/dev/null || true

echo "[*] Removing old container (if exists)…"
docker rm open-webui 2>/dev/null || true

echo "[*] Launching Open WebUI…"
docker run -d \
  -p 3000:8080 \
  -v open-webui:/app/backend/data \
  -e OPENAI_API_BASE_URL=http://172.17.0.1:8000/v1 \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

echo "[*] Open WebUI is now running at: http://localhost:3000"

Once this and the vllm server are running you should get something like:

1780705338042.png


Test your model with whatever query you want:


Code:
To compute **Fourier coefficients** for a continuous-time (CT) signal, you're typically working with the **Continuous-Time Fourier Series (CTFS)**, which applies only to **periodic** signals. If your signal is aperiodic, you'd use the Fourier Transform instead (which yields a continuous spectrum, not discrete coefficients). I'll cover both, but focus on coefficients as requested.

---
### 🔹 1. When to Use Fourier Coefficients
Use Fourier coefficients when:
- The signal \(x(t)\) is **periodic**: \(x(t+T) = x(t)\) for all \(t\).
- It satisfies the **Dirichlet conditions**:
  - Absolutely integrable over one period: \(\int_T |x(t)| dt < \infty\)
  - Finite number of maxima/minima and discontinuities per period.

---
### 🔹 2. Core Formulas (Complex Exponential Form)
For a periodic CT signal with period \(T\) and fundamental frequency \(\omega_0 = \frac{2\pi}{T}\):

**Synthesis (reconstruction):**
\[
x(t) = \sum_{k=-\infty}^{\infty} a_k e^{j k \omega_0 t}
\]

**Analysis (coefficient extraction):**
\[
a_k = \frac{1}{T} \int_{t_0}^{t_0+T} x(t) e^{-j k \omega_0 t} \, dt
\]
- \(a_k\) are the **Fourier coefficients** (complex numbers).
- \(t_0\) can be any starting point; choose it to simplify the integral (e.g., \(0\) or \(-T/2\)).
- \(a_0\) is the **DC component** (average value over one period).

---
### 🔹 3. Step-by-Step Computation
1. **Find the period \(T\)** and compute \(\omega_0 = 2\pi/T\).
2. **Choose integration limits** covering exactly one period.
3. **Split the integral** if \(x(t)\) is defined piecewise.
4. **Evaluate \(a_0\) separately** (set \(k=0\) or use \(\frac{1}{T}\int_T x(t)dt\)).
5. **Compute \(a_k\) for \(k \neq 0\)** using the analysis formula.
6. **Simplify using symmetries** (saves work and reveals properties):
   - Real & even \(x(t) \Rightarrow a_k\) real & even
   - Real & odd \(x(t) \Rightarrow a_k\) purely imaginary & odd
   - Half-wave symmetry \(\Rightarrow a_k = 0\) for even \(k\)
7. **Check convergence**: At discontinuities, the series converges to the midpoint value. Gibbs phenomenon may appear in partial sums.

---
### 🔹 4. Worked Example
**Signal**: Periodic rectangular pulse
\(x(t) = \begin{cases} 1, & -T/2 < t < 0 \\ 0, & 0 \le t < T/2 \end{cases}\), repeated every \(T\).

- \(\omega_0 = 2\pi/T\)
- Choose limits: \([-T/2, T/2]\)

**DC component**:
\[
a_0 = \frac{1}{T} \int_{-T/2}^{0} 1 \, dt = \frac{1}{2}
\]

**For \(k \neq 0\)**:
\[
a_k = \frac{1}{T} \int_{-T/2}^{0} e^{-j k \omega_0 t} dt
= \frac{1}{T} \left[ \frac{e^{-j k \omega_0 t}}{-j k \omega_0} \right]_{-T/2}^{0}
= \frac{1}{T} \cdot \frac{1 - e^{j k \omega_0 T/2}}{j k \omega_0}
\]
Since \(\omega_0 T = 2\pi\), \(e^{j k \pi} = (-1)^k\):
\[
a_k = \frac{1}{j k 2\pi} \left(1 - (-1)^k\right)
\]
- If \(k\) is even \(\Rightarrow a_k = 0\)
- If \(k\) is odd \(\Rightarrow a_k = \frac{1}{j k \pi}\)

Final:
\[
a_k = \begin{cases}
\frac{1}{2}, & k = 0 \\
0, & k \text{ even}, \, k \neq 0 \\
\frac{1}{j k \pi}, & k \text{ odd}
\end{cases}
\]

---
### 🔹 5. Practical & Numerical Notes
- **Analytical vs Numerical**: Closed-form solutions work for piecewise smooth signals. For arbitrary waveforms, discretize the integral:
  \[
  a_k \approx \frac{1}{N} \sum_{n=0}^{N-1} x(nT/N) e^{-j k n 2\pi / N}
  \]
  This is essentially the **DFT/FFT** applied to one period of sampled data.
- **Windowing**: If you only have a finite observation window of a signal, multiply by a window function before computing coefficients to reduce spectral leakage.
- **Trigonometric Form**: If you prefer real coefficients:
  \[
  x(t) = a_0 + \sum_{k=1}^{\infty} \left( A_k \cos(k\omega_0 t) + B_k \sin(k\omega_0 t) \right)
  \]
  where \(A_k = a_k + a_{-k}\), \(B_k = j(a_k - a_{-k})\).

---
### 🔹 6. What If the Signal Isn't Periodic?
For aperiodic CT signals, use the **Continuous-Time Fourier Transform (CTFT)**:
\[
X(\omega) = \int_{-\infty}^{\infty} x(t) e^{-j\omega t} dt
\]
This yields a **continuous spectrum** \(X(\omega)\), not discrete coefficients. You can approximate Fourier coefficients by:
1. Truncating the signal to a large window \(T\)
2. Treating it as periodic
3. Computing \(a_k\) as above (introduces spectral leakage/aliasing if not careful)

---
### ✅ Quick Checklist
- [ ] Signal is periodic → use CTFS coefficients
- [ ] Found \(T\) and \(\omega_0\)
- [ ] Chose convenient integration limits
- [ ] Handled \(k=0\) separately
- [ ] Used symmetry to simplify
- [ ] Verified Dirichlet conditions / convergence behavior
- [ ] For numerical work: sampled uniformly over one period, used FFT

Let me know your specific signal or context (analytical vs numerical, real/complex, periodic/aperiodic), and I can tailor the steps or derive coefficients for it.

Note the above is latex output that renders perfectly nice in markdown.
And note the answer from Qwen3.6 27b is pretty damn good.
1780705531589.png


Tuning the model is a whole thing and one big reason I like to stick with docker containers and scripts as it makes this easy to play with.
Note: I am perfectly happy with both the token rate I am getting and the quality of the answers I get.
1780705652797.png
 
Last edited:
What are you doing with them?

I just use Vane with some local models through ollama on a a machine with a 3090 running Ubuntu.

I tried openclaw but didn't like how it's meant to be used with paid cloud models.
I'm thinking about trying that PewDiePie thing built specifically for having everything local.

I also use comfyui for a bunch of stuff.
 
https://rocm.docs.amd.com/projects/...ml?utm_source=copilot.com#docker-with-toolkit

That is the AMD ROCm container, and I would recommend running that for your RX 6800.

Note: you're running 16gb vram, so you're going to need to make sure you're running a model that is sized right in terms of both number of parameters and the quantization and which types will run natively on your RX 6800 gpu.


Additionally you are limited on parameter types on the Rx 6800 to BF16 and FP16, which will limit the overall size of the model you can load into vram. Also, vllm is great for performance, but it does not officially support the RX 6800 gpu, so you might get better mileage with ollama or on of the other LLM servers.
 
Read up a bit on VLLM... not sure it's the right option for me. From the sounds of things it's good at multi-GPU and multiple simultaneous requests. Presently I don't have either. Aside from screwing around with image generation in ComfyUI I've just been firing up LM Studio, plugging models into an IDE and using them to write code. Mostly PyCharm + Qwen 3.6 at home. At work it's IntelliJ + Google Gemini. Maybe I would if I started playing with OpenClaw. I'm more likely to get some good out of a low level llama.cpp setup. I have a 5090, 285k and 64GB DDR5-6400 on my main rig and a 3090, i9-10980XE, and 64GB in my old rig. llama.cpp is supposed to be the way to go for partial CPU/GPU offloading. Haven't tried that yet, but it would let me run larger models. Funny thing is my old rig may actually be faster at CPU inference. 18 cores with AVX-512 and quad channel DDR4-3600, so it actually has more memory bandwidth than my current rig. That DDR5-6400 was supposed to be temporary until larger XMP CUDIMMs came out. Still waiting, and now ram is stupid expensive.

I could maybe get the old rig up to 128GB of mismatched ram without buying anything. It's end up running at DDR4-2133 (JEDEC) speeds or add heatsinks to a cheap kit of green PCB no heatsink Crucial and do some manual OC. The Crucial is DDR4-3200 non-XMP. It's the XMP 3600 that ends up at 2133 with XMP off. Not sure if it's worth the hassle. Mostly I'm just trying to figure out how much better a 128GB setup would be.

I've been eyeing an NV DGX Spark or AMD Ryzen AI 395+ machine. I'd consider a Mac but you can't actually order a Studio with more than 96GB right now. Given that all I want it for is running AI stuff the Mac is out. Not enough ram. Might get an Air next time I need a laptop. I looked at various multi-GPU options, but getting to 96-128GB of combined vram and building the rest of a system that can suppor them is difficult without going way over the price of one of those mini-PCs.
 
Read up a bit on VLLM... not sure it's the right option for me. From the sounds of things it's good at multi-GPU and multiple simultaneous requests. Presently I don't have either. Aside from screwing around with image generation in ComfyUI I've just been firing up LM Studio, plugging models into an IDE and using them to write code. Mostly PyCharm + Qwen 3.6 at home. At work it's IntelliJ + Google Gemini. Maybe I would if I started playing with OpenClaw. I'm more likely to get some good out of a low level llama.cpp setup. I have a 5090, 285k and 64GB DDR5-6400 on my main rig and a 3090, i9-10980XE, and 64GB in my old rig. llama.cpp is supposed to be the way to go for partial CPU/GPU offloading. Haven't tried that yet, but it would let me run larger models. Funny thing is my old rig may actually be faster at CPU inference. 18 cores with AVX-512 and quad channel DDR4-3600, so it actually has more memory bandwidth than my current rig. That DDR5-6400 was supposed to be temporary until larger XMP CUDIMMs came out. Still waiting, and now ram is stupid expensive.

I could maybe get the old rig up to 128GB of mismatched ram without buying anything. It's end up running at DDR4-2133 (JEDEC) speeds or add heatsinks to a cheap kit of green PCB no heatsink Crucial and do some manual OC. The Crucial is DDR4-3200 non-XMP. It's the XMP 3600 that ends up at 2133 with XMP off. Not sure if it's worth the hassle. Mostly I'm just trying to figure out how much better a 128GB setup would be.

I've been eyeing an NV DGX Spark or AMD Ryzen AI 395+ machine. I'd consider a Mac but you can't actually order a Studio with more than 96GB right now. Given that all I want it for is running AI stuff the Mac is out. Not enough ram. Might get an Air next time I need a laptop. I looked at various multi-GPU options, but getting to 96-128GB of combined vram and building the rest of a system that can suppor them is difficult without going way over the price of one of those mini-PCs.

It really depends on what you want to do with it. The large amount of memory means you can use bigger models, but they sometimes run so slow it's not worth it.
 
Back
Top