Guide for Windows Users

LM Studio · OpenCode · (Optional: llama-server · OpenRouter)

Hardware Requirements

On a Windows PC, local inference requires a discrete GPU with enough VRAM to hold an LLM’s model weights plus a KV cache.

Laptop GPU examples VRAM What runs well
RTX 4060 / 4070 Laptop GPU, RTX 5060 / 5070 Laptop GPU 8 GB Qwen3.5-4B at Q4 or Q6; Qwen3.5-2B or 0.8B if you need extra context headroom
RTX 4080 Laptop GPU, RTX 5070 Ti Laptop GPU 12 GB Qwen3.5-9B at Q4 or Q5; larger Qwen3.5 models do not leave enough room for a useful KV cache
RTX 4090 Laptop GPU, RTX 5080 Laptop GPU 16 GB Qwen3.5-9B at Q6 or Q8 comfortably; Qwen3.5-27B only with low-bit quantization, reduced context, or RAM offload
RTX 5090 Laptop GPU 24 GB Qwen3.6-27B at Q4 comfortably; Qwen3.5-27B at Q4; Qwen3.5-35B-A3B with careful settings

NOTE: At less than 16 GB you are limited to sub-10B models, which struggle with multi-file agentic tasks. The Qwen3.6-27B used in this guide requires roughly 17 GB at 4-bit quantization. This model is just used as an example; smaller models are still a great place to start for experimentation.

Plan your hardware

If you’re ever planning a hardware upgrade to run local models, use tps.bunai.cc to simulate tokens-per-second (TPS) and time-to-first-token (TTFT) for a given model on any GPU before buying. Enter your target model and desired performance to find the right card.


1. LM Studio

LM Studio is a desktop app that lets you browse, download, and run local models. It detects your hardware and predicts which models will fit in your VRAM.

Install

  • Download from lmstudio.ai
  • At first launch, when prompted for a usage mode, select Developer (unlocks local API server and advanced model settings)

Download a model

  1. Click the Search (magnifying glass) icon in the left sidebar to open the model search panel.
  2. In the Formats dropdown, select GGUF. GGUF is the standard format for llama.cpp-based inference on Windows with NVIDIA GPUs.
  3. Search for qwen3.6-27b and find the highest quantization that will fit. Usually, 4bit (Q4_K_M) is a practical minimum for ensuring quality output. 6bit (Q6_K) will produce noticeably better results but with higher memory utilization and slower inference.
    • Any official model published by Qwen themselves, or by unsloth, is a safe bet.
  4. In the model card, check whether LM Studio reports Full GPU Offload Possible.
    • If 27b will not fit, search for qwen3.5-9b and if this is still too big, qwen3.5-4b.
      • NOTE: qwen3.6 is not yet available in smaller variants.
    • The Qwen3.6-35B-A3B MoE variant is also worth considering if it will fit, even with partial GPU offload. This is because it activates only 3B parameters at once. This results in very fast inference at the expense of reduced output quality.
  5. Download the model.
Why Qwen3.5 / 3.6?

As of May 2026, the community consensus is that the Qwen3.5 and Qwen3.6 series are the best open-weight models for local agentic coding.


2. Model Settings

LM Studio’s defaults are tuned for general chat and will produce poor results for agentic coding.

Click the My Models icon in the left panel and find the model you downloaded. Click it, and in the Model Configuration panel on the right, set the following:

In the Load Tab

Parameter Value Notes
Context Length 131072 Recommended minimum for stable agent behaviour. Increase if working on large codebases. The maximum for Qwen3.5/3.6 models is 262144. Note that larger context windows will consume more VRAM, so take into consideration how much you have.

Leave other settings in this tab as the defaults.

In the Inference Tab

Parameter Value Notes
Temperature 0.6 Good balance to prevent erratic output during reasoning
Top K 20 Chooses only from the top 20 tokens in a given forward pass; functions similar to temperature
Repeat Penalty 1 Leave at default; Qwen3.6/3.5 doesn’t need it
Top P 0.95 Broad sampling for thinking mode
Min P 0.0 Disable minimum probability filter

Leave other settings in this tab as the defaults.

preserve_thinking (Qwen3.6 only)

The Qwen3.6 series introduced a new chat template flag, preserve_thinking. Setting this to true improves inference quality and tool calling behaviour.

In the Chat Template editor (Model Configuration → Advanced → Chat Template), add this line to the very top of the template:

{%- set preserve_thinking = true %}

Why this matters: By default, the model’s internal reasoning (<think>...</think> blocks) is discarded between turns. Over a long agentic session with many tool calls, the model can lose the coherent mental model it built in earlier turns, causing inconsistency and repeated mistakes. With preserve_thinking = true, the template includes prior reasoning in the conversation context, so the model’s chain-of-thought carries forward across turns.


3. Test in LM Studio Chat

Switch to the Chat tab. In a new chat, load the model using the dropdown at the top of the window. Send a test prompt, for example:

Write a Python function that reverses a linked list. Add type hints and a docstring.

You should see a <think> block followed by code output.


4. Start the Local Server

  1. Click the Developer icon (left sidebar, >_).
  2. Toggle Server to On. The default port is 1234.
  3. Confirm the model is loaded (it appears in the Active Models list).

The server is now available at http://127.0.0.1:1234/v1 (OpenAI-compatible endpoint).


5. OpenCode Setup

Install OpenCode

Install OpenCode via curl:

curl -fsSL https://opencode.ai/install | bash

or the npm package manager:

npm i -g opencode-ai@latest

Configure the local model

Create or edit %USERPROFILE%\.config\opencode\opencode.json:

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "lmstudio": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "LM Studio (Local)",
      "options": {
        "baseURL": "http://localhost:1234/v1"
      },
      "models": {
        "qwen/qwen3.6-27b": {
          "name": "qwen3.6-27b"
        }
      }
    }
  }
}

In the JSON above, be sure to enter the correct model ID for the model you downloaded.

Finding the model ID: In LM Studio, go to My Models and look at the LLM column. The string shown there (e.g., qwen3.6-27b) is what you use as the value in the models object above.

External providers (cloud models)

You can add external providers, but you must bring your own key (BYOK). OpenCode does not provide API access. If you have them, add your own keys in the provider section the same way and switch between them with /models in OpenCode.


6. Code a Hello World App

Open a terminal in VS Code inside an empty project folder and run:

opencode

When the TUI appears, type your task, for example:

Create a Python hello world script that prints the current date and time.
Run it and show me the output.

OpenCode will write the script, execute it, and display the result.


Advanced (Optional): llama-server + OpenRouter

This section upgrades the stack for better performance and cloud model access.

Each component’s role:

  • LM Studio — model browser and downloader only; no inference
  • llama-server — replaces LM Studio’s built-in server with direct llama.cpp, giving full control over KV cache quantization, flash attention, context flags, and prefix caching, resulting in better TTFT and generation speed
  • OpenRouter — single API key for 300+ cloud models including free tiers (DeepSeek, Qwen3 235B, GPT-OSS); configured as a second provider alongside llama-server in OpenCode
  • OpenCode — the coding agent; selects between llama-server (local) and OpenRouter (cloud) via /models

Install llama-server

Download a pre-built Windows binary from the llama.cpp releases page. Choose the llama-b*-bin-win-cuda-cu12*-x64.zip build matching your CUDA version. Extract to a permanent location, e.g., C:\llama.cpp\.

Optimal parameters for llama-server

The model is stored in LM Studio’s folder and there is no need to move it. Find your model file at:

%USERPROFILE%\.lmstudio\models\lmstudio-community\Qwen3.6-27B-GGUF\

The key llama-server flags for agentic coding:

Flag Value Purpose
--ctx-size 131072 128K minimum per Qwen’s recommendation for thinking mode
--temp 0.6 Temperature
--top-p 0.95 Top-P sampling
--top-k 20 Top-K sampling
--min-p 0.0 Disable min-P filter
--repeat-penalty 1.0 No repeat penalty (Qwen3.6 default)
--flash-attn (flag only) Faster attention
--cache-prompt (flag only) Prefix caching; reduces TTFT on repeated prefixes
--n-predict 65536 Max generation tokens; covers long think blocks + code output
--chat-template-kwargs '{"preserve_thinking": true}' Preserve thinking

Create a launcher batch script

Create C:\llama.cpp\start-qwen.bat (be sure to change the model file location if it differs):

@echo off
SET MODEL="%USERPROFILE%\.lmstudio\models\lmstudio-community\Qwen3.6-27B-GGUF\Qwen3.6-27B-Q4_K_M.gguf"

C:\llama.cpp\llama-server.exe ^
  --model %MODEL% ^
  --ctx-size 131072 ^
  --temp 0.6 ^
  --top-p 0.95 ^
  --top-k 20 ^
  --min-p 0.0 ^
  --repeat-penalty 1.0 ^
  --flash-attn ^
  --cache-prompt ^
  --n-predict 65536 ^
  --port 8080
  --chat-template-kwargs '{"preserve_thinking": true}'

pause

Double-click start-qwen.bat to start the server. The terminal window stays open while it’s running. Close the window to stop the server.

Tip

You can minimize (not close) the terminal window while coding. The server continues running in the background.

Connect OpenCode to llama-server

Next, update %USERPROFILE%\.config\opencode\opencode.json to point to the model being served on llama-server:

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "llamaserver": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "llama-server (Local)",
      "options": {
        "baseURL": "http://localhost:8080/v1",
        "apiKey": "none"
      },
      "models": {
        "qwen3.6-27b": {
          "name": "qwen3.6-27b"
        }
      }
    }
  }
}

Set up OpenRouter

  1. Create an account at openrouter.ai.
  2. On the welcome page, click Get API Key. Create one and copy it.

Connect OpenCode to OpenRouter

Run inside OpenCode:

/connect

Search for OpenRouter and enter your API key.

OpenRouter’s built-in models load automatically from your connected key. You don’t need to configure them manually.

Switching between local and cloud in OpenCode

Type /models inside OpenCode. You will see:

  • qwen3.6-27b — your local model via llama-server
  • All other OpenRouter models you have access to (scroll down).

The OpenRouter free tier includes models like DeepSeek V3, Qwen3 235B, and GPT-OSS 120B. To find free-tier models, type the word free in the model search.

Free tier limits

Without a paid balance, OpenRouter limits you to 50 requests/day across free models. A single active agentic coding session can exhaust this quickly. Adding $10 to your account raises the limit to 1,000 requests/day, which can go a very long way for some models. For example, Deepseek V4-Flash charges $0.14/M input tokens.