Guide for macOS Users

LM Studio · OpenCode · (Optional: oMLX · OpenRouter)

Hardware Requirements

On an Apple Silicon chip, local inference requires enough unified memory to hold the model weights plus a working KV cache. As a practical baseline:

Hardware	Minimum VRAM	Recommended
M1 / M2	18 GB	24 GB+
M3 / M4	16 GB	36 GB+

NOTE: At less than 16 GB you are limited to sub-10B models, which struggle with multi-file agentic tasks. The Qwen3.6-27B model used in this guide requires roughly 17 GB at 4-bit quantization. This model is just used as an example; smaller models are still a great place to start for experimentation.

Plan your hardware

If / when you plan a hardware upgrade to run local models, use tps.bunai.cc to simulate tokens-per-second (TPS) and time-to-first-token (TTFT) for a given model on any GPU before buying. Enter your target model and desired performance to find the right chip.

1. LM Studio

LM Studio is a desktop app that lets you browse, download, and run local models. It detects your hardware and predicts which models will fit in your memory.

Install

Download from lmstudio.ai
At first launch, when prompted for a usage mode, select Developer (unlocks local API server and advanced model settings)

Download a model

Click the Search (magnifying glass) icon in the left sidebar to open the model search panel.
In the Formats dropdown, select MLX. MLX is Apple’s native inference framework and is significantly faster than GGUF on Apple Silicon.
Search for qwen3.6-27b and find the highest quantization that will fit. Usually, 4bit is a practical minimum for ensuring quality output. 5bit or 6bit will produce noticeably better results but with higher memory utilization and slower inference.
- Any model published by mlx-community is a safe bet.
In the model card, check whether LM Studio reports Full GPU Offload Possible.
- If 27b will not fit, search for qwen3.5-9b and if this is still too big, qwen3.5-4b.
  - NOTE: you have to use a qwen3.5 model as smaller variants of qwen3.6 are not yet available.
- The Qwen3.6-35B-A3B MoE variant is also worth considering if it will fit, even with partial GPU offload. This is because it activates only 3B parameters at once. This results in very fast inference at the expense of reduced output quality.
Download the model.

Why Qwen3.5 / 3.6?

As of May 2026, the community consensus is that the Qwen3.5 and Qwen3.6 series are the best open-weight models for local agentic coding.

2. Model Settings

LM Studio’s defaults are tuned for general chat and will produce poor results for agentic coding.

Click the My Models icon in the left panel and find qwen3.6-27b. Click it, and in the Model Configuration panel on the right, set the following:

In the Load Tab

Parameter	Value	Notes
Context Length	`131072`	Recommended minimum for stable agent behaviour. Increase if working on large codebases. The maximum for Qwen3.5/3.6 models is `262144`. Note that larger context windows will consume more VRAM, so take into consideration how much you have.

Leave other settings in this tab as the defaults.

In the Inference Tab

Parameter	Value	Notes
Temperature	`0.6`	Good balance to prevent erratic output during reasoning
Top K	`20`	Chooses only from the top 20 tokens in a given forward pass; functions similar to temperature
Repeat Penalty	`1`	Leave at default; Qwen3.6/3.5 doesn’t need it
Top P	`0.95`	Broad sampling for thinking mode
Min P	`0.0`	Disable minimum probability filter

Leave other settings in this tab as the defaults.

preserve_thinking (Qwen3.6 only)

The Qwen3.6 series introduced a new chat template flag, preserve_thinking. Setting this to true improves inference quality and tool calling behaviour.

In the Chat Template editor (Model Configuration → Advanced → Chat Template), add this line to the very top of the template:

{%- set preserve_thinking = true %}

Why this matters: By default, the model’s internal reasoning (<think>...</think> blocks) is discarded between turns. Over a long agentic session with many tool calls, the model can lose the coherent mental model it built in earlier turns, causing inconsistency and repeated mistakes. With preserve_thinking = true, the template includes prior reasoning in the conversation context, so the model’s chain-of-thought carries forward across turns.

3. Test in LM Studio Chat

Switch to the Chat tab. In a new chat, load the model using the dropdown at the top of the window. Send a test prompt, for example:

Write a Python function that reverses a linked list. Add type hints and a docstring.

You should see a <think> block followed by code output.

4. Start the Local Server

Click the Developer icon (left sidebar, >_).
Toggle Server to On. The default port is 1234.
Confirm the model is loaded (it appears in the Active Models list).

The server is now available at http://127.0.0.1:1234/v1 (OpenAI-compatible endpoint).

5. OpenCode Setup

Install OpenCode

Install OpenCode via curl:

curl -fsSL https://opencode.ai/install | bash

or the Homebrew package manager:

brew install anomalyco/tap/opencode

Configure the local model

Create or edit ~/.config/opencode/opencode.json:

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "lmstudio": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "LM Studio (Local)",
      "options": {
        "baseURL": "http://localhost:1234/v1"
      },
      "models": {
        "qwen/qwen3.6-27b": {
          "name": "qwen3.6-27b"
        }
      }
    }
  }
}

In the JSON above, be sure to enter the correct model ID for the model you downloaded.

Finding the model ID: In LM Studio, go to My Models and look at the LLM column. The string shown there (e.g., qwen3.6-27b) is what you use as the value in the models object above.

External providers (cloud models)

You can add external providers, but you must bring your own key (BYOK). OpenCode does not provide API access. If you have them, add your own keys in the provider section the same way and switch between them with /models in OpenCode.

6. Code a Hello World App

Open a terminal in VS Code inside an empty project folder and run:

opencode

When the TUI appears, type your task, for example:

Create a Python hello world script that prints the current date and time.
Run it and show me the output.

OpenCode will write the script, execute it, and display the result.

Advanced (Optional): oMLX + OpenRouter

This section upgrades the stack for better performance and cloud model access.

Each component’s role:

LM Studio — model browser and downloader only; no inference
oMLX — replaces LM Studio’s built-in server with paged SSD KV caching and continuous batching, dramatically reducing TTFT on long agentic runs
OpenRouter — single API key for 300+ cloud models including free tiers (DeepSeek, Qwen3 235B, GPT-OSS); you configure it as a second provider alongside oMLX in OpenCode
OpenCode — the coding agent; selects between oMLX (local) and OpenRouter (cloud) via /models

Install oMLX

NOTE: oMLX requires either macOS Tahoe (the latest version) or Sequoia (the next most recent version).

Download the DMG from omlx.ai, drag to Applications, and launch. The Welcome screen walks through three steps:

Model directory — point it at your existing LM Studio models folder:
```
~/.lmstudio/models/
```
oMLX reuses your already-downloaded MLX models; nothing re-downloads.
Start — click Start server. oMLX runs on port 8000 by default and lives in your menu bar.
Configure — click the oMLX menu bar icon → Admin Panel → Settings.

Optimal parameters in oMLX

Global Settings

In the oMLX Admin Panel, Click Settings → Global Settings. Use the following:

Parameter	Value
Hot Cache Limit	`10%`
Max Context Window	`131072`
Max Tokens	`65536`
Temperature	`0.6`
Top P	`0.95`
Top K	`20`
Repetition Penalty	`1`

Notes: - Hot Cache Limit keeps the most-accessed blocks in unified memory, reducing SSD read latency. - Max Context Window of 131072 corresponds to 128k tokens, which Qwen recommends as the minimum to preserve high-quality thinking for agentic coding tasks. - To enable preserve_thinking for Qwen3.6 models (strongly recommended), you need to do so on a per-model basis (see below).

Leave other settings in this tab as the defaults.

Per-Model Settings

Click Model Settings at the top. Open the settings for your model and apply the same parameters as in LM Studio:

Parameter	Value
Ctx Window	`131072`
Max Tokens	`65536`
Temperature	`0.6`
Top P	`0.95`
Top K	`20`
Min P	`0`
Repetition Penalty	`1`
Presence Penalty	`0`

In the Advanced Settings sidebar:

Enable Thinking: ON
If you’re using a Qwen3.6+ model, you can also enable preserve_thinking in oMLX. Scroll to Chat Template Kwargs in the right panel. Add a new Custom Kwarg called preserve_thinking with a value of true. Check the Force box.

Leave other settings in this modal as the defaults and save your changes.

Connect OpenCode to oMLX

Next, update ~/.config/opencode/opencode.json to include your model served on oMLX:

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "omlx": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "oMLX (Local)",
      "options": {
        "baseURL": "http://localhost:8000/v1",
        "apiKey": "YOUR_API_KEY"
      },
      "models": {
        "qwen3.6-27b": {
          "name": "qwen3.6-27b"
        }
      }
    }
  }
}

Your API key (if any) was chosen during setup.

Set up OpenRouter

Create an account at openrouter.ai.
On the welcome page, click Get API Key. Create one and copy it.

Connect OpenCode to OpenRouter

Run inside OpenCode:

/connect

Search for OpenRouter and enter your API key.

OpenRouter’s built-in models load automatically from your connected key. You don’t need to configure them manually.

Switching between local and cloud in OpenCode

Type /models inside OpenCode. You will see:

omlx/qwen3.6-27b — your local model via oMLX
All other OpenRouter models you have access to (scroll downfree)

The OpenRouter free tier includes models like DeepSeek V3, Qwen3 235B, and GPT-OSS 120B. To find free-tier models, type the word free in the model search.

Free tier limits

Without a paid balance, OpenRouter limits you to 50 requests/day across free models. A single active agentic coding session can exhaust this quickly. Adding $10 to your account raises the limit to 1,000 requests/day, which can go a very long way for some models. For example, Deepseek V4-Flash charges $0.14/M input tokens.