Guide for macOS Users
LM Studio · OpenCode · (Optional: oMLX · OpenRouter)
Hardware Requirements
On an Apple Silicon chip, local inference requires enough unified memory to hold the model weights plus a working KV cache. As a practical baseline:
| Hardware | Minimum VRAM | Recommended |
|---|---|---|
| M1 / M2 | 18 GB | 24 GB+ |
| M3 / M4 | 16 GB | 36 GB+ |
NOTE: At less than 16 GB you are limited to sub-10B models, which struggle with multi-file agentic tasks. The Qwen3.6-27B model used in this guide requires roughly 17 GB at 4-bit quantization. This model is just used as an example; smaller models are still a great place to start for experimentation.
If / when you plan a hardware upgrade to run local models, use tps.bunai.cc to simulate tokens-per-second (TPS) and time-to-first-token (TTFT) for a given model on any GPU before buying. Enter your target model and desired performance to find the right chip.
1. LM Studio
LM Studio is a desktop app that lets you browse, download, and run local models. It detects your hardware and predicts which models will fit in your memory.
Install
- Download from lmstudio.ai
- At first launch, when prompted for a usage mode, select Developer (unlocks local API server and advanced model settings)
Download a model
- Click the Search (magnifying glass) icon in the left sidebar to open the model search panel.
- In the Formats dropdown, select MLX. MLX is Apple’s native inference framework and is significantly faster than GGUF on Apple Silicon.
- Search for
qwen3.6-27band find the highest quantization that will fit. Usually, 4bit is a practical minimum for ensuring quality output. 5bit or 6bit will produce noticeably better results but with higher memory utilization and slower inference.- Any model published by
mlx-communityis a safe bet.
- Any model published by
- In the model card, check whether LM Studio reports
Full GPU Offload Possible.- If 27b will not fit, search for
qwen3.5-9band if this is still too big,qwen3.5-4b.- NOTE: you have to use a qwen3.5 model as smaller variants of qwen3.6 are not yet available.
- The Qwen3.6-35B-A3B MoE variant is also worth considering if it will fit, even with partial GPU offload. This is because it activates only 3B parameters at once. This results in very fast inference at the expense of reduced output quality.
- If 27b will not fit, search for
- Download the model.
As of May 2026, the community consensus is that the Qwen3.5 and Qwen3.6 series are the best open-weight models for local agentic coding.
2. Model Settings
LM Studio’s defaults are tuned for general chat and will produce poor results for agentic coding.
Click the My Models icon in the left panel and find qwen3.6-27b. Click it, and in the Model Configuration panel on the right, set the following:
In the Load Tab
| Parameter | Value | Notes |
|---|---|---|
| Context Length | 131072 |
Recommended minimum for stable agent behaviour. Increase if working on large codebases. The maximum for Qwen3.5/3.6 models is 262144. Note that larger context windows will consume more VRAM, so take into consideration how much you have. |
Leave other settings in this tab as the defaults.
In the Inference Tab
| Parameter | Value | Notes |
|---|---|---|
| Temperature | 0.6 |
Good balance to prevent erratic output during reasoning |
| Top K | 20 |
Chooses only from the top 20 tokens in a given forward pass; functions similar to temperature |
| Repeat Penalty | 1 |
Leave at default; Qwen3.6/3.5 doesn’t need it |
| Top P | 0.95 |
Broad sampling for thinking mode |
| Min P | 0.0 |
Disable minimum probability filter |
Leave other settings in this tab as the defaults.
preserve_thinking (Qwen3.6 only)
The Qwen3.6 series introduced a new chat template flag, preserve_thinking. Setting this to true improves inference quality and tool calling behaviour.
In the Chat Template editor (Model Configuration → Advanced → Chat Template), add this line to the very top of the template:
{%- set preserve_thinking = true %}
Why this matters: By default, the model’s internal reasoning (<think>...</think> blocks) is discarded between turns. Over a long agentic session with many tool calls, the model can lose the coherent mental model it built in earlier turns, causing inconsistency and repeated mistakes. With preserve_thinking = true, the template includes prior reasoning in the conversation context, so the model’s chain-of-thought carries forward across turns.
3. Test in LM Studio Chat
Switch to the Chat tab. In a new chat, load the model using the dropdown at the top of the window. Send a test prompt, for example:
Write a Python function that reverses a linked list. Add type hints and a docstring.
You should see a <think> block followed by code output.
4. Start the Local Server
- Click the Developer icon (left sidebar,
>_). - Toggle Server to On. The default port is 1234.
- Confirm the model is loaded (it appears in the Active Models list).
The server is now available at http://127.0.0.1:1234/v1 (OpenAI-compatible endpoint).
5. OpenCode Setup
Install OpenCode
Install OpenCode via curl:
curl -fsSL https://opencode.ai/install | bashor the Homebrew package manager:
brew install anomalyco/tap/opencodeConfigure the local model
Create or edit ~/.config/opencode/opencode.json:
{
"$schema": "https://opencode.ai/config.json",
"provider": {
"lmstudio": {
"npm": "@ai-sdk/openai-compatible",
"name": "LM Studio (Local)",
"options": {
"baseURL": "http://localhost:1234/v1"
},
"models": {
"qwen/qwen3.6-27b": {
"name": "qwen3.6-27b"
}
}
}
}
}In the JSON above, be sure to enter the correct model ID for the model you downloaded.
Finding the model ID: In LM Studio, go to My Models and look at the LLM column. The string shown there (e.g., qwen3.6-27b) is what you use as the value in the models object above.
You can add external providers, but you must bring your own key (BYOK). OpenCode does not provide API access. If you have them, add your own keys in the provider section the same way and switch between them with /models in OpenCode.
6. Code a Hello World App
Open a terminal in VS Code inside an empty project folder and run:
opencodeWhen the TUI appears, type your task, for example:
Create a Python hello world script that prints the current date and time.
Run it and show me the output.
OpenCode will write the script, execute it, and display the result.
Advanced (Optional): oMLX + OpenRouter
This section upgrades the stack for better performance and cloud model access.
Each component’s role:
- LM Studio — model browser and downloader only; no inference
- oMLX — replaces LM Studio’s built-in server with paged SSD KV caching and continuous batching, dramatically reducing TTFT on long agentic runs
- OpenRouter — single API key for 300+ cloud models including free tiers (DeepSeek, Qwen3 235B, GPT-OSS); you configure it as a second provider alongside oMLX in OpenCode
- OpenCode — the coding agent; selects between oMLX (local) and OpenRouter (cloud) via
/models
Install oMLX
NOTE: oMLX requires either macOS Tahoe (the latest version) or Sequoia (the next most recent version).
Download the DMG from omlx.ai, drag to Applications, and launch. The Welcome screen walks through three steps:
Model directory — point it at your existing LM Studio models folder:
~/.lmstudio/models/oMLX reuses your already-downloaded MLX models; nothing re-downloads.
Start — click Start server. oMLX runs on port 8000 by default and lives in your menu bar.
Configure — click the oMLX menu bar icon → Admin Panel → Settings.
Optimal parameters in oMLX
Global Settings
In the oMLX Admin Panel, Click Settings → Global Settings. Use the following:
| Parameter | Value |
|---|---|
| Hot Cache Limit | 10% |
| Max Context Window | 131072 |
| Max Tokens | 65536 |
| Temperature | 0.6 |
| Top P | 0.95 |
| Top K | 20 |
| Repetition Penalty | 1 |
Notes: - Hot Cache Limit keeps the most-accessed blocks in unified memory, reducing SSD read latency. - Max Context Window of 131072 corresponds to 128k tokens, which Qwen recommends as the minimum to preserve high-quality thinking for agentic coding tasks. - To enable preserve_thinking for Qwen3.6 models (strongly recommended), you need to do so on a per-model basis (see below).
Leave other settings in this tab as the defaults.
Per-Model Settings
Click Model Settings at the top. Open the settings for your model and apply the same parameters as in LM Studio:
| Parameter | Value |
|---|---|
| Ctx Window | 131072 |
| Max Tokens | 65536 |
| Temperature | 0.6 |
| Top P | 0.95 |
| Top K | 20 |
| Min P | 0 |
| Repetition Penalty | 1 |
| Presence Penalty | 0 |
In the Advanced Settings sidebar:
- Enable Thinking: ON
- If you’re using a Qwen3.6+ model, you can also enable
preserve_thinkingin oMLX. Scroll to Chat Template Kwargs in the right panel. Add a new Custom Kwarg calledpreserve_thinkingwith a value oftrue. Check the Force box.
Leave other settings in this modal as the defaults and save your changes.
Connect OpenCode to oMLX
Next, update ~/.config/opencode/opencode.json to include your model served on oMLX:
{
"$schema": "https://opencode.ai/config.json",
"provider": {
"omlx": {
"npm": "@ai-sdk/openai-compatible",
"name": "oMLX (Local)",
"options": {
"baseURL": "http://localhost:8000/v1",
"apiKey": "YOUR_API_KEY"
},
"models": {
"qwen3.6-27b": {
"name": "qwen3.6-27b"
}
}
}
}
}Your API key (if any) was chosen during setup.
Set up OpenRouter
- Create an account at openrouter.ai.
- On the welcome page, click Get API Key. Create one and copy it.
Connect OpenCode to OpenRouter
Run inside OpenCode:
/connect
Search for OpenRouter and enter your API key.
OpenRouter’s built-in models load automatically from your connected key. You don’t need to configure them manually.
Switching between local and cloud in OpenCode
Type /models inside OpenCode. You will see:
omlx/qwen3.6-27b— your local model via oMLX- All other OpenRouter models you have access to (scroll downfree)
The OpenRouter free tier includes models like DeepSeek V3, Qwen3 235B, and GPT-OSS 120B. To find free-tier models, type the word free in the model search.
Without a paid balance, OpenRouter limits you to 50 requests/day across free models. A single active agentic coding session can exhaust this quickly. Adding $10 to your account raises the limit to 1,000 requests/day, which can go a very long way for some models. For example, Deepseek V4-Flash charges $0.14/M input tokens.