Reading time: 30 minutes
14 Local LLMs vs. the Mac Mini M4: One surprising winner and a new MCP server to swap them mid-session


Written by
Oussama Khaznaji
Published on
April 7, 2026
Updated on
April 7, 2026
Key Takeaways
Best overall: Qwen3.5-35B (thinking)
Best value: Qwen3.5-9B
Fastest: Qwen3-Coder
Speed vs. intelligence: the two are barely correlated here, and the right middle ground depends on the task.
MCP GitHub link: MCP server that lets Claude Code hot-swap local models mid-session without losing context.
Introduction
Running large language models entirely on local hardware is no longer an experiment for researchers and hobbyists. In early 2026, it is a practical option for professionals and small organizations willing to invest an afternoon in setup. This benchmark was designed to find out exactly how practical that option is, and where it breaks down.
Fourteen model variants were tested on a single Mac Mini M4 with 32GB of unified memory, running entirely through llama.cpp with no cloud services, no external APIs, and no subscription fees. The models covered a wide range: from sub-6GB quantized variants to 22GB files pushing the hardware to its limit, from dense architectures to sparse mixture-of-experts (MoE) designs, and from general-purpose assistants to specialized coding agents. Several models were tested twice, once with thinking mode disabled and once with it enabled, to understand what extended internal reasoning actually delivers in practice.
The results contain a few surprises. Some models ran two to three times faster than theoretical estimates. Some models that looked strong on paper failed simple constraint reasoning. The relationship between quantization precision and output quality turned out to be almost nonexistent. And thinking mode, despite the hype, only helped some models on some tasks, while making others worse.
The goal here is not a leaderboard. It is an honest picture of what you get, at what speed, at what memory cost, for three tasks that represent real professional work.
Hardware
| Component | Details |
| --- | --- |
| Device | Apple Mac Mini M4 (base chip, 2024) |
| Unified Memory | 32 GB LPDDR5X |
| Theoretical Memory Bandwidth | 120 GB/s (Apple official specification) |
| Practical Inference Bandwidth | approximately 80 GB/s (calibrated; see note below) |
| Inference Engine | llama.cpp (llama-server build) |
| API | OpenAI-compatible REST at localhost:8000/v1 |
| Operating System | macOS |
A note on memory bandwidth figures. Apple rates the M4 base chip at 120 GB/s, which is the theoretical peak under ideal conditions. Memory bandwidth benchmarks using Metal GPU compute kernels on the M4 (published in "Apple vs. Oranges: Evaluating the Apple Silicon M-Series SoCs for HPC Performance and Efficiency", arXiv:2502.05317) measured a practical ceiling of up to 100 GB/s, which is around 83% of the rated peak. This gap exists because the memory bus is shared across the CPU, GPU, Neural Engine, and system processes simultaneously, so no single workload ever claims the full pipe.
For LLM inference via llama.cpp, the effective bandwidth is lower still, typically 65–90% of that measured ceiling depending on model size, quantization format, and access patterns. Back-calculating from the dense model results in this benchmark confirms this: the 27B model at 4.4 t/s on a 17.6 GB file implies roughly 77 GB/s effective bandwidth, while the 9B Q8 model at 8.3 t/s on a 13 GB file implies roughly 108 GB/s. The variation reflects different memory access patterns across model sizes and architectures. The 80 GB/s figure is used as a practical planning midpoint for pre-benchmark speed estimation, not a fixed measurement.
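The back-calculation is simple enough to sketch. For a dense, bandwidth-bound model, every generated token streams the full weight file through memory once, so effective bandwidth is just measured speed times file size. A minimal check against the two dense data points quoted above:

```python
def implied_bandwidth_gb_s(tokens_per_s: float, file_size_gb: float) -> float:
    """Effective memory bandwidth implied by a dense, bandwidth-bound model:
    each token streams the full weight file once, so bandwidth = speed * size."""
    return tokens_per_s * file_size_gb

# The two dense data points from this benchmark:
print(implied_bandwidth_gb_s(4.4, 17.6))   # Qwen3.5-27B Q4: ~77 GB/s
print(implied_bandwidth_gb_s(8.3, 13.0))   # Qwen3.5-9B Q8: ~108 GB/s
```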
Benchmark Design
Three Prompts, Three Reasons
The same three prompts were sent to every model variant. Each prompt has a known correct answer, which makes scoring objective rather than subjective.
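Prompts go through the OpenAI-compatible endpoint listed in the hardware table. A minimal client sketch follows; the `temperature` value and placeholder model name are illustrative assumptions, not the article's stated settings (llama-server serves whatever model it was launched with, regardless of the name sent):

```python
import json
import urllib.request

def build_request(prompt: str, max_tokens: int = 4096) -> dict:
    """OpenAI-compatible chat-completion payload for a local llama-server."""
    return {
        "model": "local",  # placeholder; llama-server ignores the name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.0,  # deterministic-ish runs make scoring repeatable
    }

def ask(prompt: str, base_url: str = "http://localhost:8000/v1") -> str:
    """Send one prompt to the local server and return the completion text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    return body["choices"][0]["message"]["content"]
```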
Prompt 1: Logic and Constraint Reasoning
Three missionaries and three cannibals must cross a river using a boat that holds at most 2 people. Cannibals must never outnumber missionaries on either bank. Solve it step by step.
The classic missionaries-and-cannibals puzzle has an optimal solution of exactly 11 crossings (proven via state-space search; see Pressman & Singmaster, The Mathematical Gazette, 73:464, 1989). Solving it requires maintaining a consistent state across every step, verifying a hard constraint at each transition, and recovering from dead ends without losing track of where things stand. This maps directly to professional tasks like dependency resolution, configuration planning, and multi-step scheduling, where intermediate states must satisfy constraints that compound with each move. A model that hallucinates states or enters a reasoning loop cannot be trusted for those tasks regardless of how polished its prose is.
Scoring is binary: either the model produces a valid, complete 11-step solution or it does not. There is no partial credit for almost getting there.
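The 11-crossing optimum is easy to verify mechanically. A short breadth-first search over the puzzle's state space (a verification sketch, not part of the benchmark harness) confirms it:

```python
from collections import deque

def min_crossings() -> int:
    """BFS over states (missionaries_left, cannibals_left, boat_on_left).
    Returns the minimum number of crossings; the known optimum is 11."""
    def safe(m: int, c: int) -> bool:
        # Each bank is safe if it holds no missionaries or at least as many
        # missionaries as cannibals; (3 - m, 3 - c) is the far bank.
        return (m == 0 or m >= c) and (3 - m == 0 or 3 - m >= 3 - c)

    start, goal = (3, 3, True), (0, 0, False)
    loads = [(1, 0), (2, 0), (0, 1), (0, 2), (1, 1)]  # boat holds 1 or 2
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        (m, c, boat_left), crossings = queue.popleft()
        if (m, c, boat_left) == goal:
            return crossings
        for dm, dc in loads:
            sign = -1 if boat_left else 1  # boat carries people off its bank
            nm, nc = m + sign * dm, c + sign * dc
            if 0 <= nm <= 3 and 0 <= nc <= 3 and safe(nm, nc):
                nxt = (nm, nc, not boat_left)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, crossings + 1))
    raise ValueError("unsolvable")

print(min_crossings())  # 11
```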
Prompt 2: Code Quality
Write a Python function using the Sieve of Eratosthenes to find all primes up to n. Include type hints, a docstring, and state the time complexity.
The Sieve of Eratosthenes is well-understood enough that the output can be run, tested, and checked for correctness. The correct time complexity is O(n log log n). The prompt asks for specific professional code quality markers: type hints (standard modern Python practice), a docstring (necessary for maintainable production code), and time complexity (demonstrates that the model understands what it is implementing, not just reproducing a pattern). A model that produces plausible-looking code with a subtle off-by-one bug, or that states the wrong complexity, fails in ways that matter when the code goes into a codebase nobody else will audit.
Scoring out of 20 rewarded: algorithmic correctness, type annotation completeness, docstring quality across all sections (parameters, returns, raises, time complexity, space complexity, examples), use of Python idioms like math.isqrt and slice assignment for marking composites, edge case handling for n less than 2, and input validation via ValueError for negative input.
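For reference, here is a sketch of what a full-marks answer looks like under this rubric; this is the grading checklist rendered as code, not any model's actual output:

```python
from __future__ import annotations

import math

def sieve_of_eratosthenes(n: int) -> list[int]:
    """Return all primes up to and including n using the Sieve of Eratosthenes.

    Args:
        n: Inclusive upper bound to search for primes.
    Returns:
        Sorted list of all primes <= n.
    Raises:
        ValueError: If n is negative.

    Time complexity: O(n log log n). Space complexity: O(n).

    Examples:
        >>> sieve_of_eratosthenes(10)
        [2, 3, 5, 7]
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    if n < 2:  # edge case: no primes below 2
        return []
    is_prime = [True] * (n + 1)
    is_prime[0] = is_prime[1] = False
    for p in range(2, math.isqrt(n) + 1):
        if is_prime[p]:
            # Mark multiples of p starting at p*p via slice assignment.
            is_prime[p * p :: p] = [False] * len(range(p * p, n + 1, p))
    return [i for i, prime in enumerate(is_prime) if prime]
```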
Prompt 3: Technical Analysis
Compare microservices vs monolithic architecture. Give a concrete recommendation for a 3-person startup building a SaaS product with clear reasoning.
The correct answer here is well-established in software engineering practice. A three-person startup should start with a modular monolith. Microservices solve problems of team scale, operational maturity, and granular deployment requirements that a three-person early-stage team simply does not have yet. Recommending microservices to a three-person startup is not a defensible architectural opinion; it is a failure to reason about the specific constraints of the context given. (Conway's Law, the observation that system architecture tends to mirror the communication structure of the team that builds it, is a useful lens here: with three people, there is no communication overhead to distribute.)
Scoring out of 20 rewarded: the correct recommendation (modular monolith specifically, not just "monolith"), a structured comparison, reasoning explicitly tied to team size and Conway's Law, specific trigger conditions for future migration, and actionable implementation guidance on module boundaries and data isolation.
Models Tested
| Model | Quant | File Size | Thinking | Notes |
| --- | --- | --- | --- | --- |
| GPT-OSS 20B | MXFP4 | 12.1 GB | No | OpenAI open-weight, MoE architecture |
| Qwen3-Coder-30B-A3B-Instruct | UD-Q4_K_XL | 17.7 GB | No (by design) | MoE, coding specialist, no thinking variant exists |
| Devstral-Small-2507 | UD-Q4_K_XL | 14 GB | No | Dense, agentic SWE specialist |
| Codestral-22B-v0.1 | Q4_K_M | 13 GB | No | Dense, FIM coding specialist |
| GLM-4.7-Flash | UD-Q4_K_XL | 17.5 GB | No (disabled) | Z.ai MoE (formerly Zhipu AI), thinking ON by default |
| GLM-4.7-Flash | UD-Q4_K_XL | 17.5 GB | Yes | Same model, thinking enabled |
| Nemotron-3-Nano-30B-A3B | UD-Q4_K_XL | 22.8 GB | No (disabled) | NVIDIA hybrid Mamba-2/Transformer MoE |
| Nemotron-3-Nano-30B-A3B | UD-Q4_K_XL | 22.8 GB | Yes | Same model, thinking enabled |
| Qwen3.5-9B | UD-Q4_K_XL | 5.97 GB | No (not needed) | Dense |
| Qwen3.5-9B | UD-Q8_K_XL | 13 GB | No (not needed) | Dense, higher precision quant of same model |
| Qwen3.5-27B | UD-Q4_K_XL | 17.6 GB | No | Dense |
| Qwen3.5-27B | UD-Q4_K_XL | 17.6 GB | Yes | Same model, thinking enabled |
| Qwen3.5-35B-A3B | UD-Q4_K_XL | 22.2 GB | No | MoE, 3B active parameters |
| Qwen3.5-35B-A3B | UD-Q4_K_XL | 22.2 GB | Yes | Same model, thinking enabled |
Quantization terminology. All Unsloth Dynamic (UD) quantizations use the Unsloth Dynamic 2.0 methodology, which upquants the most sensitive weight layers to Q8 or F16 precision. This preserves accuracy in the layers that matter most while keeping average file size at 4-bit levels. MXFP4 (Microscaling Floating Point 4-bit) is OpenAI's native quantization format for the MoE projection weights in GPT-OSS; it is applied only to those weights, with all others stored in BF16. Q4_K_M is a standard llama.cpp quantization using k-quantization with a medium block size.
Architecture terminology. MoE (Mixture of Experts) is a sparse architecture where only a subset of the model's parameters are active per token. The "A3B" suffix on MoE models denotes approximately 3 billion active parameters per forward pass, regardless of total parameter count. Dense models activate all parameters on every token.
Speed Estimation
Token generation speed during single-user inference is almost entirely memory-bandwidth bound. The GPU is not the bottleneck. The math is simple:
Dense models: tokens per second = effective_bandwidth_GB_s / file_size_GB
MoE models: tokens per second = effective_bandwidth_GB_s / (file_size_GB × MoE_factor)
A MoE factor of 0.27 was used as a conservative pre-benchmark estimate, taken from community GPU benchmarks. On Apple Metal, the actual factor turned out to be much lower, between 0.11 and 0.20 across all MoE models tested. Lower factor means faster inference. MoE models ran 1.5 to 2.5 times faster than the conservative estimate predicted, which is a meaningful and consistent finding. The implication is that sparse MoE architectures suit Apple Silicon particularly well: you get the knowledge of a larger model at the inference cost of a smaller one, and the unified memory architecture avoids the cross-chip communication overhead that slows MoE models on multi-GPU CUDA setups.
Results
Table sorted by measured speed (t/s) descending. Row order does not reflect quality ranking.
| Row | Model | Quant | Size | Est. t/s | Actual avg t/s | Logic | Coding /20 | Analysis /20 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Qwen3-Coder-30B-A3B-Instruct | UD-Q4_K_XL | 17.7 GB | ~11–16 | 39.5 | partial | 16 | 16 |
| 2 | GPT-OSS 20B | MXFP4 | 12.1 GB | ~24 | 34.4 | fail | 19 | 17 |
| 3 | GLM-4.7-Flash (no thinking) | UD-Q4_K_XL | 17.5 GB | ~11–17 | 29.5 | fail | 15 | 16 |
| 4 | Nemotron-3-Nano-30B-A3B (no thinking) | UD-Q4_K_XL | 22.8 GB | ~8–17 | 28.5 | fail | 16 | 18 |
| 5 | Nemotron-3-Nano-30B-A3B (thinking) | UD-Q4_K_XL | 22.8 GB | ~8–17 | 28.4 | fail | 18 | 19 |
| 6 | GLM-4.7-Flash (thinking) | UD-Q4_K_XL | 17.5 GB | ~11–17 | 24.1 (a) | fail | 17 | 17 |
| 7 | Qwen3.5-35B-A3B (thinking) | UD-Q4_K_XL | 22.2 GB | ~8–13 | 17.9 | pass | 18 | 19 |
| 8 | Qwen3.5-35B-A3B (no thinking) | UD-Q4_K_XL | 22.2 GB | ~8–13 | 17.8 | fail | 16 | 18 |
| 9 | Qwen3.5-9B | UD-Q4_K_XL | 5.97 GB | ~13.4 | 13.7 | pass | 16 | 17 |
| 10 | Qwen3.5-9B | UD-Q8_K_XL | 13 GB | ~8.3 | 8.3 | pass | 13 | 16 |
| 11 | Codestral-22B-v0.1 | Q4_K_M | 13 GB | ~6.2 | 7.5 | fail | 11 | 5 |
| 12 | Devstral-Small-2507 | UD-Q4_K_XL | 14 GB | ~5.7 | 7.0 | fail | 15 | 15 |
| 13 | Qwen3.5-27B (thinking) | UD-Q4_K_XL | 17.6 GB | ~4.4 | 4.4 | pass | 16 | 19 |
| 14 | Qwen3.5-27B (no thinking) | UD-Q4_K_XL | 17.6 GB | ~4.4 | 4.4 | pass | 15 | 17 |
(a) GLM-4.7-Flash thinking average is pulled down by 15.6 t/s on the Logic prompt, where a 65K-token context filled with reasoning tokens slowed Metal attention considerably. Coding and Analysis ran at 28–29 t/s.
Logic: pass = correct verified 11-step solution / partial = model found the correct solution mid-response but discarded it and delivered an incorrect final answer / fail = invalid, incomplete, or looping response.
Per-Model Results
GPT-OSS 20B | MXFP4 | 34.4 t/s
GPT-OSS is the second fastest model in the benchmark at 34.4 t/s, beaten only by Qwen3-Coder. The 43% overshoot on the speed estimate (predicted 24 t/s) reflects how well MXFP4 quantization translates to Metal performance.
Logic: fail. 14,558 tokens and 436 seconds of circular self-correction. The model identified the puzzle structure correctly, then repeatedly proposed a sequence, spotted an error, discarded it, and started over. It never converged. This is not a close miss. It is a fundamental inability to maintain consistent state across a multi-step constraint problem.
Coding: 19/20. The best coding output in the benchmark. It used from __future__ import annotations, math.isqrt, slice assignment for marking composites, ValueError for negative input, space complexity in the docstring, and included worked examples. The only points missing were minor stylistic issues.
Analysis: 17/20. Correct recommendation, good structure, clear reasoning on team size. Slightly less depth than the top scorers on migration planning and module boundary guidance.
The verdict: use GPT-OSS for coding tasks where the output gets executed and tested. Do not use it for anything requiring multi-step reasoning that cannot be mechanically verified.
Qwen3-Coder-30B-A3B-Instruct | UD-Q4_K_XL | 39.5 t/s
The fastest model tested. Thinking mode does not exist for this variant; the Instruct model is non-thinking by design.
Logic: partial. This needs explaining carefully, because the partial rating is arguably more concerning than an outright failure. The model produced the correct 11-step solution in the first half of its response, then explicitly reconsidered, labeled that solution as potentially wrong, and presented an invalid 7-step alternative as the final answer. Anyone reading only the final answer gets an incorrect result. The correct reasoning appeared and was thrown away.
Coding: 16/20. Correct implementation, type hints, clear docstring, time complexity stated. No ValueError for negative input, no space complexity.
Analysis: 16/20. Correct recommendation. Reasonable reasoning. Less structured than the top scorers.
At 39.5 t/s this model is well-suited for high-volume coding assistance where output is compiled or executed. The Logic behavior is a real risk for any task where the model's final answer is taken at face value without verification.
Nemotron-3-Nano-30B-A3B | UD-Q4_K_XL | 28.4–28.5 t/s
NVIDIA's hybrid Mamba-2/Transformer MoE (52 total layers: 23 Mamba-2, 23 MoE, 6 grouped-query attention; architecture described in the Nemotron 3 Nano technical report). Speed is consistent across both thinking modes because the architecture is bandwidth-bound like everything else here.
Logic: fail (both modes). Without thinking, Step 1 immediately violates the constraint: two missionaries cross first, leaving the left bank at 1M, 3C. With thinking, the model generated 12,192 tokens of circular reasoning and still failed. Thinking made the failure longer, not better.
Coding: 16/20 (no thinking), 18/20 (thinking). Thinking improved the coding output in a measurable way. The thinking run added math.isqrt, slice assignment, and a more complete docstring. Second-best coding score overall, tied with Qwen3.5-35B-A3B thinking.
Analysis: 18/20 (no thinking), 19/20 (thinking). One of the strongest analysis outputs in the benchmark. Modular monolith specifically named, structured comparison, solid migration triggers. Thinking improved the depth here too.
This model is well-suited for coding and analysis work. For those two tasks, thinking mode makes a measurable difference. For logic and constraint reasoning, no amount of thinking helps.
GLM-4.7-Flash | UD-Q4_K_XL | 29.5 t/s (no thinking), 24.1 t/s avg (thinking)
Z.ai's 30B MoE with approximately 3B active parameters. Thinking is on by default and must be explicitly disabled. Requires repeat-penalty set to 1.0 to avoid generation loops.
Logic: fail (both modes). Without thinking, the right bank shows "1M, 2C" after only cannibals have crossed, which is impossible. The model then explicitly acknowledges mid-response that the constraint is violated and proceeds anyway. Nine steps total, wrong from step one.
With thinking, GLM generated 60,164 characters of internal reasoning, consumed 24,256 tokens, and ran for over 25 minutes before producing an answer. The answer was still wrong. Step 2 claimed the left bank returned to 3M, 3C after only one cannibal came back (mathematically impossible), and Step 5 left the right bank at 1M, 3C (a clear constraint violation). This is the most expensive Logic failure in the benchmark by a significant margin. More compute produced a more elaborate failure, not a correct solution.
The speed drop to 15.6 t/s on the Logic thinking run is because the 65K context window filled with reasoning tokens, which slows Metal attention computations for long sequences. On shorter prompts like Coding and Analysis, GLM thinking ran at 28–29 t/s.
Coding: 15/20 (no thinking), 17/20 (thinking). The no-thinking run uses modern list[int] syntax, slice assignment, and marks from p², but misses ValueError, space complexity, and examples. Thinking improved the docstring and added math.isqrt.
Analysis: 16/20 (no thinking), 17/20 (thinking). Correct monolith recommendation with a comparison table and reasonable reasoning. Neither run used the term "modular monolith" specifically. The thinking run added the Strangler Fig migration pattern, which is a useful concrete reference.
Fast and capable for coding and analysis. Completely unreliable for constraint reasoning regardless of how much thinking time it gets.
Qwen3.5-9B | UD-Q4_K_XL | 13.7 t/s
At 5.97 GB, this is the smallest model in the benchmark and the one with the most comfortable memory footprint. It leaves over 26 GB free for context, other applications, and system processes.
Logic: pass. Correct 11-step solution with consistent state tracking. There is some visible self-correction mid-response, but the final summary is clean and accurate.
Coding: 16/20. Type hints, comprehensive docstring, O(n log log n) stated, marks from p², edge case for n less than 2. Missing ValueError, space complexity, and examples.
Analysis: 17/20. Correct modular monolith recommendation with solid reasoning on team size and operational overhead. Lighter on migration trigger conditions than the top scorers.
The 9B Q4_K_XL is the value standout of this entire benchmark. It solves all three prompts correctly at 13.7 t/s on a file that takes up less than 6 GB of memory. For anyone running a 32GB machine who wants reliable general-purpose inference without thinking mode complexity, this is the straightforward answer.
Qwen3.5-9B | UD-Q8_K_XL | 8.3 t/s
The same model at higher quantization precision. The file is 13 GB rather than 6 GB.
Logic: pass. Correct solution, concise and clean.
Coding: 13/20. Correct implementation and type hints, but the weakest coding output among all Qwen3.5 variants. No space complexity, no examples, no ValueError, no slice assignment. Higher quantization precision did not translate to higher output quality here, which is the notable finding.
Analysis: 16/20. Correct recommendation, less structured than the Q4_K_XL variant.
The Q8_K_XL variant is slower, uses twice the memory, and produces weaker outputs than the Q4_K_XL of the same model. There is no practical reason to prefer it on this hardware.
Qwen3.5-27B | UD-Q4_K_XL | 4.4 t/s (both modes)
Speed is identical between thinking and non-thinking because the model is memory-bandwidth bound. Thinking mode does not change how fast the weights move through the bus; it just generates more tokens. The Logic prompt produces roughly 8 times more tokens with thinking enabled, while Coding and Analysis produce about twice as many. The average across all three prompts is approximately 5 times more tokens in thinking mode.
Without thinking: Logic passes cleanly. Coding at 15/20 is correct but missing several quality markers. Analysis at 17/20 is good but lighter on implementation specifics.
With thinking: Logic still passes. Coding improves slightly to 16/20. Analysis improves to 19/20, one of the joint highest scores in the benchmark. The thinking run produced a comparison table, named the modular monolith pattern explicitly, discussed data consistency trade-offs, gave clear trigger conditions for migration, and included implementation guidance on module isolation. This is the kind of analysis you would trust from a senior architect.
At 4.4 t/s, patience is required. But the thinking-enabled 27B model is the highest-quality general-purpose configuration tested, and on tasks where the answer matters more than the wait time, it is the right choice.
Qwen3.5-35B-A3B | UD-Q4_K_XL | 17.8–17.9 t/s
This is the benchmark's most interesting result. It is a MoE model (35B total parameters, 3B active during inference) that ran at 17.8–17.9 t/s despite an estimated range of 8–13 t/s.
Without thinking: Logic failed. The model hit the 4,096-token output limit mid-reasoning, entering the same circular self-correction pattern seen in GPT-OSS and Nemotron. Even with the limit removed, the model could not converge on a valid solution.
Coding at 16/20 is solid: comprehensive docstring sections, type annotations, marks from i², edge case handled. Analysis at 18/20 is strong: modular monolith named, team size reasoning, cognitive load argument, execution guide with module boundary rules, migration triggers.
With thinking: Logic passed. This is the key finding. The same model that could not solve the Logic puzzle without thinking produced a correct, verified 11-step solution once thinking was enabled. The internal reasoning ran to 12,540 characters and clearly worked through the constraint checking that the model cannot sustain in direct answer mode.
Coding improved to 18/20 with thinking: added ValueError for negative input, a Raises section in the docstring, and cleaner structure overall. Analysis held at 19/20, tied for the top score.
Qwen3.5-35B-A3B with thinking is the only model in this benchmark that solves all three prompts correctly at speed above 15 t/s. It is also the clearest demonstration of what thinking mode actually does: it does not make models smarter in a general sense, but it gives a capable model the context space to self-verify reasoning that would otherwise collapse. Without thinking, this model fails Logic. With it, it passes.
Devstral-Small-2507 | UD-Q4_K_XL | 7.0 t/s
Fine-tuned from Mistral-Small-3.1. Designed specifically for agentic software engineering: navigating codebases, editing multiple files, using tools in multi-step loops. It achieves 53.6% on SWE-bench Verified in that agentic scaffold setting. No thinking capability.
These three prompts are not that setting. Devstral is built for structured agentic workflows with tool access, not isolated conversational Q&A. That context matters for interpreting its scores here.
One practical note: without an explicit token limit, the Logic prompt triggered a generation loop. The model produced 15,625 tokens over roughly 46 minutes before the run was stopped. A rerun with max_tokens set to 4,096 and explicit repeat penalties produced a complete response in normal time. This behavior is consistent with a model not trained for open-ended conversational reasoning.
Logic: fail. Step 3 leaves the left bank at 1M, 2C (unsafe). Step 5 leaves the right bank at 1M, 3C (unsafe). The invalid solution is presented confidently.
Coding: 15/20. Correct implementation, type hints, docstring, time complexity. No ValueError, space complexity, or slice assignment.
Analysis: 15/20. Correct monolith recommendation with adequate reasoning. Missing "modular monolith" terminology and migration triggers.
Do not use Devstral's scores here as a proxy for its actual capability. It should be evaluated in an OpenHands-style agentic scaffold for any meaningful assessment.
Codestral-22B-v0.1 | Q4_K_M | 7.5 t/s
Dense 22B Mistral model from May 2024. Predates the current generation of reasoning-aware architectures. Optimized for Fill-in-the-Middle (FIM) code completion. No thinking capability.
Logic: fail. Step 2 sends two missionaries, leaving the west bank at 1M, 3C. Wrong from the second move. Claims solved in 8 steps.
Coding: 11/20. The lowest coding score in the benchmark, which is notable for a model marketed as a coding specialist. The implementation contains a correctness bug: range(2, n) excludes n itself, so if n is prime it will not be returned. No edge case for n less than 2, no space complexity, no examples, no ValueError.
Analysis: 5/20. Recommends microservices for a three-person startup. This is the wrong answer by the consensus of software engineering practice and the assessment of every other model in this benchmark. The reasoning given (scalability, fault isolation, team autonomy) accurately describes microservices benefits in the abstract while completely ignoring the specific constraints of a three-person early-stage team. Someone following this recommendation would spend months fighting infrastructure complexity before building a single feature users care about.
Codestral belongs in an IDE with a compiler and test runner attached. That is where its FIM strengths show. For conversational Q&A or architectural guidance, it should not be used.
Cross-Model Findings
Who Solved Logic
Five model variants passed the Logic prompt out of 14 tested:
Qwen3.5-9B (UD-Q4_K_XL)
Qwen3.5-9B (UD-Q8_K_XL)
Qwen3.5-27B (both thinking and non-thinking)
Qwen3.5-35B-A3B (thinking only)
One model (Qwen3-Coder-30B-A3B-Instruct) found the correct solution mid-response, then discarded it. The remaining eight variants failed entirely.
The pattern is not MoE versus dense. Qwen3.5-35B-A3B is a MoE model and it passed with thinking enabled. The pattern is family-specific: every Qwen3.5 variant passes when thinking is available, while GPT-OSS, GLM, Nemotron, Devstral, and Codestral all fail regardless of architecture, parameter count, or thinking mode. This suggests Qwen3.5's post-training included constraint satisfaction tasks in a way that the other families' training did not.
Thinking mode helped exactly one model on Logic (Qwen3.5-35B-A3B, converting a failure into a pass). For all other failing models, thinking either did nothing or produced a longer failure.
Coding Scores
GPT-OSS 20B scored highest at 19/20. It was the only model to combine all of: from __future__ import annotations, ValueError for negative input, slice assignment, space complexity, a complete docstring, and code examples. Nemotron thinking and Qwen3.5-35B-A3B thinking tied at 18/20. Codestral 22B scored 11/20 and contained a correctness bug.
The Qwen3.5-9B Q8_K_XL scored lower on coding than its Q4_K_XL counterpart, confirming that higher quantization precision does not reliably produce higher output quality.
Analysis Scores
Qwen3.5-27B thinking, Qwen3.5-35B-A3B thinking, and Nemotron thinking all scored 19/20. All three named the modular monolith pattern specifically, included structured comparisons, and gave concrete migration trigger conditions. Codestral scored 5/20 for recommending microservices. Every other model gave the correct recommendation.
What Thinking Mode Actually Does
| Scenario | Result |
| --- | --- |
| Qwen3.5-35B-A3B, Logic | Thinking turned a fail into a pass |
| Qwen3.5-27B, Analysis | Score improved from 17 to 19 |
| Nemotron, Coding | Score improved from 16 to 18 |
| Nemotron, Logic | Thinking produced a longer, equally wrong failure |
| GLM, Logic | 60K chars of thinking, 25 minutes, still wrong |
| Qwen3.5-27B, Logic | Already passing without thinking; no change |
| Speed | Zero effect on tokens per second |
Thinking mode is not a general quality upgrade. It helps models that have the underlying reasoning capability but need extended context to exercise it reliably. For models where the capability is absent, more thinking tokens produce more cost, nothing more.
MoE Speed on Apple Metal
Every MoE model ran faster than the 0.27-factor conservative estimate predicted. The actual effective MoE factors, back-calculated from measured results, ranged from roughly 0.11 to 0.20. MoE models ran 1.5 to 2.5 times faster than the conservative estimates suggested. Apple Metal handles sparse activation patterns efficiently, and the unified memory architecture avoids the cross-chip communication overhead that slows MoE models on multi-GPU CUDA setups. The implication is straightforward: on Apple Silicon, an MoE model is not a compromise between capability and cost; it is the better choice on both axes.
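The effective factors quoted above fall out of the same bandwidth model, rearranged to solve for the factor (using the article's 80 GB/s planning bandwidth):

```python
def effective_moe_factor(measured_tps: float, file_size_gb: float,
                         bandwidth_gb_s: float = 80.0) -> float:
    """Invert the MoE speed formula: factor = bandwidth / (speed * size)."""
    return bandwidth_gb_s / (measured_tps * file_size_gb)

# Back-calculated from the measured results in this benchmark:
print(round(effective_moe_factor(39.5, 17.7), 2))  # Qwen3-Coder: ~0.11
print(round(effective_moe_factor(34.4, 12.1), 2))  # GPT-OSS 20B: ~0.19
print(round(effective_moe_factor(17.9, 22.2), 2))  # Qwen3.5-35B-A3B: ~0.20
```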
Memory
The 9B Q4_K_XL model left roughly 4.6 GB of free memory at its minimum during inference. GPT-OSS left about 4.3 GB. Every other model consumed all but 69 to 81 MB of the available 32 GB, with no swap triggered in any run. The M4's unified memory handled the pressure without spilling, but there is essentially no headroom for context growth or parallel processes when running models above approximately 14 GB on this hardware configuration.
Recommendations
| Use case | Model | Notes |
| --- | --- | --- |
| General purpose, best speed-quality balance | Qwen3.5-35B-A3B thinking | Only model passing all 3 at speed above 15 t/s |
| General purpose, smallest footprint | Qwen3.5-9B UD-Q4_K_XL | 13.7 t/s, passes all 3, under 6 GB |
| Highest quality, latency acceptable | Qwen3.5-27B thinking | 4.4 t/s, joint-highest analysis score |
| High-volume coding with verification | Qwen3-Coder-30B-A3B-Instruct | 39.5 t/s; verify all outputs |
Conclusions
For Similar Hardware
The single most useful thing this benchmark shows is that you do not need the biggest, slowest model to get reliable results. Qwen3.5-9B UD-Q4_K_XL solves all three prompts correctly, runs at 13.7 t/s, fits in under 6 GB, and leaves the rest of the machine's memory available. For routine coding, analysis, and reasoning tasks, it is the sensible starting point.
If quality matters more than speed, Qwen3.5-27B thinking at 4.4 t/s delivers the best analysis results in the benchmark. The wait is real but the output justifies it for tasks where the answer will be acted on.
For tasks that mix coding speed with general reasoning, Qwen3.5-35B-A3B with thinking enabled is the strongest overall configuration, and the only model that passes all three prompts at above 15 t/s. It uses close to all available memory on a 32GB machine, but it does not swap, and it delivers.
A few things worth keeping in mind. Speed and quality are not correlated here. GPT-OSS is the second-fastest model and the second-worst at reasoning. Codestral is among the slower models and produced a coding bug. Quantization precision does not predict output quality either: the Q8 variant of the 9B model is slower and worse than the Q4 variant. And thinking mode is a targeted tool, not a universal upgrade. Enable it when the task needs extended reasoning; do not expect it to rescue a model that fundamentally cannot track state.
Open-Weight Models in 2025–2026
Three years ago, running a 27B parameter model locally was a specialist exercise. Today it is a weekend project. The models tested here represent a real capability shift, and the Qwen3.5 family in particular shows that reasoning quality at the 27B–35B scale is no longer exclusive to proprietary cloud APIs.
The benchmark results confirm something worth saying plainly: these models are not toys. Qwen3.5-27B thinking, at 4.4 t/s on a compact desktop, produced analysis output that would hold up in a senior engineering review. Qwen3.5-35B-A3B thinking solved the Logic puzzle correctly at 17.9 t/s on 22 GB of unified memory, a 35-billion parameter model running faster than a comfortable reading pace, on hardware that fits on a desk.
The failures are equally informative. Constraint reasoning at the 11-step level defeated nearly every model not in the Qwen3.5 family, regardless of parameter count or architecture. This points to post-training mattering at least as much as scale, and it is a reason to be skeptical of leaderboard numbers until you have tested a model on tasks that actually resemble your workload.
MoE architectures have arrived as the clear practical choice for local inference. Sparse activation means you get the knowledge of a large model at the inference cost of a smaller one. On Apple Silicon, the absence of cross-chip communication overhead makes this even more pronounced: every MoE model tested ran well above the conservative estimate. A 22GB file that runs at 28 t/s is a more useful tool than a 22GB dense model that runs at 4.5 t/s. The Qwen3.5-35B-A3B result, combining MoE efficiency with Qwen3.5 reasoning quality at nearly 18 t/s, is the headline finding.
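The bandwidth arithmetic behind this can be made concrete. The rough roofline sketch below assumes the M4's 120 GB/s memory bandwidth (see references) and about 4.5 bits per weight for Q4_K-class quantization; both numbers are approximations, and measured speeds sit well below these ceilings because of attention, KV-cache traffic, and compute overhead. What the sketch shows is why a 3B-active MoE model has so much more headroom than a dense model of comparable file size:

```python
# Roofline-style upper bound on decode speed: generating one token
# streams every *active* parameter through memory once, so
#   t/s <= bandwidth / (active_params * bytes_per_weight)

BANDWIDTH_GBS = 120.0   # M4 base chip memory bandwidth (Apple spec)
BITS_PER_WEIGHT = 4.5   # rough average for Q4_K-class quantization

def max_tokens_per_s(active_params_billion: float) -> float:
    """Bandwidth-bound ceiling on decode throughput, in tokens/s."""
    bytes_per_token = active_params_billion * 1e9 * BITS_PER_WEIGHT / 8
    return BANDWIDTH_GBS * 1e9 / bytes_per_token

print(f"dense 27B ceiling:   {max_tokens_per_s(27.0):.1f} t/s")  # ~7.9
print(f"35B-A3B (3B active): {max_tokens_per_s(3.0):.1f} t/s")   # ~71.1
```

The measured 4.4 t/s for the dense 27B and 17.9 t/s for the 35B-A3B are each a fraction of their respective ceilings, but the ratio between the ceilings is what makes the MoE result possible at all.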
The gap between open-weight and proprietary models is narrowing, though it has not closed. For structured professional tasks (code generation, technical analysis, constraint planning), the gap is now small enough to matter only for the most demanding use cases. For open-ended reasoning, creative synthesis, and tasks that require genuine understanding of context across very long documents, the proprietary frontier models still lead. The practical question is not "which is better in the abstract" but "which is good enough for the specific task, at what cost, with what privacy tradeoffs."
The Enterprise Case for Local Inference
The economic argument for self-hosted inference is real but not complicated. A Mac Mini M4 with 32GB and 512GB SSD costs around €1,400 in Germany at Apple's current pricing. It runs Qwen3.5-35B-A3B at 17.9 t/s indefinitely, with no per-token charges. At typical commercial API pricing, that hardware cost is recovered within weeks to months of moderate use, and every subsequent query is free. For high-volume internal workflows, the economics are not close.
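The break-even arithmetic is simple enough to sketch. Only the hardware price comes from the article; the API price and daily volume below are illustrative placeholders, not quotes from any provider:

```python
# Rough break-even sketch for self-hosted vs. API inference.
# Hardware cost is from the article (~EUR 1,400); the API price per
# million tokens and daily volume are assumed values to vary.

def breakeven_days(hardware_eur: float,
                   api_eur_per_mtok: float,
                   tokens_per_day: float) -> float:
    """Days until cumulative API spend equals the hardware cost."""
    daily_api_cost = api_eur_per_mtok * tokens_per_day / 1_000_000
    return hardware_eur / daily_api_cost

# Example: EUR 10 per million tokens, 5M tokens/day of internal traffic
print(round(breakeven_days(1400, 10.0, 5_000_000)))  # 28 days
```

Vary the inputs to match your own volume; for heavy internal workflows the result lands in the weeks-to-months range stated above.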
But the cost argument is actually the secondary one. The primary argument is about data.
Every query sent to a cloud LLM API travels over a network, lands on infrastructure you do not control, and is processed by a system whose data retention policies are subject to change. For a law firm drafting contracts, a healthcare company running patient data through an AI pipeline, a financial institution asking questions about proprietary trading logic, or a consultancy working with client strategies that cannot leave the building, this is not an acceptable arrangement. It is not primarily a legal risk calculation, though that is real. It is a question of professional responsibility.
Local inference removes that problem at the root. The data does not leave the machine.
This opens a practical workflow that is worth thinking through carefully. The current state of open-weight models suggests a clean split between cloud and local inference, not as competitors but as tools for different parts of the same process.
Cloud models, particularly the frontier reasoning models from Anthropic, OpenAI, and Google, are stronger at high-level planning, architectural thinking, and complex cross-domain reasoning. They have seen more training data, more parameters, and more aggressive post-training for instruction following across diverse domains. For the planning phase of a project, where the inputs are abstract specifications and the output is a detailed implementation roadmap, sending those queries to a cloud model is a reasonable trade-off. The data at that stage is still at the level of requirements and architecture diagrams, not proprietary code or production records.
Once that plan exists, the implementation work is different. Writing code against an internal codebase, debugging against production logs that contain customer data, refactoring modules that encode years of business logic, running monitoring queries against databases with personally identifiable information: none of that should go to a cloud API. And none of it needs to. A locally-run model with the plan already established as context can execute on that plan, referencing actual code and actual data, without any of it leaving the network perimeter.
This is not a speculative arrangement. It is achievable today with the hardware and models described in this benchmark. The split is clean, the cost is manageable, and the privacy boundary is clear.
Teams that build this way gain something beyond cost savings and privacy. They gain operational independence: no rate limits during critical deadlines, no exposure to pricing changes, the ability to audit every inference and enforce their own output policies. These are compounding advantages, not one-time benefits.
The capability is there. The hardware is affordable. The models are open. What remains is deciding to set it up.
Ready to move your AI workflows from the cloud to your own perimeter? Explore how iterise helps organizations implement secure, local inference strategies.
Bonus: mcp-llama-swap: Hot-Swapping Local Models in Claude Code Without Losing Context
There is a practical friction point in the two-model workflow described above that is worth addressing directly. Planning with one model and implementing with another means stopping one model, starting another, re-establishing context, and losing conversational continuity in the process. On a single machine with 32GB of unified memory, you cannot keep two large models loaded simultaneously. The swap is unavoidable, but the disruption does not have to be.
To make this transition seamless, I built and published an MCP (Model Context Protocol) server called mcp-llama-swap that gives Claude Code direct control over which model is loaded behind llama-server. It manages the model swap through macOS launchctl, polls the health endpoint until the new model is ready, and hands control back to the coding session with the full conversation history intact. You define your models as aliases in a simple configuration file, and from that point on, switching from a reasoning model to a coding model is a single command inside the same session. No terminal switching, no manual process management, no lost context.
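The core swap mechanic can be sketched in a few lines. This is a simplified illustration of the approach, not the published implementation: the launchd labels, the alias mapping, and the port are hypothetical placeholders, and llama-server's `/health` endpoint is assumed to be on its default port (it returns 503 while a model is loading and 200 once it is serving):

```python
import os
import subprocess
import time
import urllib.error
import urllib.request

# Hypothetical alias -> launchd label mapping; mcp-llama-swap reads
# this from its configuration file instead of hard-coding it.
ALIASES = {
    "planner": "com.local.llama.qwen35-35b-a3b",
    "coder": "com.local.llama.qwen3-coder",
}
HEALTH_URL = "http://127.0.0.1:8080/health"  # llama-server default port

def plist_path(alias: str) -> str:
    """LaunchAgent plist assumed to define this alias's llama-server job."""
    return os.path.expanduser(f"~/Library/LaunchAgents/{ALIASES[alias]}.plist")

def swap_model(alias: str, timeout_s: float = 120.0) -> None:
    """Unload any running llama-server job, start the requested one,
    and block until the health endpoint reports the model is loaded."""
    domain = f"gui/{os.getuid()}"
    for label in ALIASES.values():
        # 'bootout' unloads a job; errors for not-loaded jobs are ignored
        subprocess.run(["launchctl", "bootout", f"{domain}/{label}"],
                       capture_output=True)
    subprocess.run(["launchctl", "bootstrap", domain, plist_path(alias)],
                   check=True)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(HEALTH_URL, timeout=2) as resp:
                if resp.status == 200:  # 503 while the model is loading
                    return
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; keep polling
        time.sleep(1.0)
    raise TimeoutError(f"model '{alias}' not healthy after {timeout_s}s")
```

Exposed as an MCP tool, a call like `swap_model("coder")` is all the coding session issues; the blocking health poll is what guarantees the next completion request hits a fully loaded model rather than a server mid-startup.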
The project is intentionally small and does one thing: swap the model behind your inference server while preserving the conversation you are in. It is available on PyPI and can be installed and running in minutes. The repository contains full setup instructions, a LiteLLM proxy configuration for translating between API formats, and example configurations for both directory-based model discovery and explicit alias mapping.
It is built for the single-machine, single-user setup that this benchmark describes. The pattern it implements, a clean separation between planning and execution backed by different models selected for different strengths, is the same pattern that scales to the organizational workflow outlined above. Someone with the motivation could extend it to route between a cloud API for the planning phase and a local model for the implementation phase, enforcing the privacy boundary programmatically rather than relying on discipline. That is not what it does today, but the architecture does not prevent it.
The gap between knowing that a two-model workflow is possible and actually running one day to day is usually just tooling. This is the tooling.
A note on context handling: the conversation history survives model swaps not because of anything the MCP server does, but because Claude Code holds the full message history on the client side and re-sends it with every request. The MCP server is only responsible for managing the model process. This means the swap is transparent to the conversation flow, but it also means that if, for example, a planner and a coder model are configured with different context window sizes in their respective llama-server plists, a long planning session could produce a prompt that exceeds the coder's configured limit. In practice, keeping the --ctx-size parameter consistent across your model configurations avoids the issue entirely.
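That consistency is easy to verify mechanically. A minimal sketch, assuming each model's llama-server job is defined in a launchd plist whose `ProgramArguments` include `--ctx-size` (the paths in the commented example are hypothetical):

```python
import plistlib

def ctx_size(plist_file: str) -> int:
    """Read the --ctx-size value from a llama-server launchd plist."""
    with open(plist_file, "rb") as f:
        args = plistlib.load(f)["ProgramArguments"]
    return int(args[args.index("--ctx-size") + 1])  # value follows the flag

def assert_consistent(plist_files: list[str]) -> int:
    """Fail loudly if the configured context windows differ."""
    sizes = {p: ctx_size(p) for p in plist_files}
    if len(set(sizes.values())) != 1:
        raise ValueError(f"--ctx-size differs across configs: {sizes}")
    return next(iter(sizes.values()))

# Point these at your actual LaunchAgent plists, e.g.:
# assert_consistent([
#     "/Users/me/Library/LaunchAgents/com.local.llama.planner.plist",
#     "/Users/me/Library/LaunchAgents/com.local.llama.coder.plist",
# ])
```

Running a check like this once after editing any model configuration catches the mismatch before a long planning session does.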
References
Apple Inc. Mac mini Technical Specifications. https://www.apple.com/mac-mini/specs/ Official M4 base chip memory bandwidth: 120 GB/s.
Hübner P., Hu A., Peng I., Markidis S. "Apple vs. Oranges: Evaluating the Apple Silicon M-Series SoCs for HPC Performance and Efficiency." arXiv:2502.05317 (2025). https://arxiv.org/abs/2502.05317 Metal GPU memory bandwidth measurements on M-series chips.
Pressman I., Singmaster D. "The jealous husbands and the missionaries and cannibals." The Mathematical Gazette 73:464, pp. 73-81 (1989). Proves that the 3M/3C river-crossing puzzle requires a minimum of 11 crossings.
OpenAI. "gpt-oss-120b & gpt-oss-20b Model Card." arXiv:2508.10925 (2025). https://arxiv.org/abs/2508.10925 GPT-OSS architecture and MXFP4 quantization details.
NVIDIA. "Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning." Technical Report (2025). https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf Nemotron-3-Nano-30B-A3B architecture: 52 layers, 23 Mamba-2, 23 MoE, 6 GQA attention.
Mistral AI. "Devstral Small 1.1 Model Card." HuggingFace (2025). https://huggingface.co/mistralai/Devstral-Small-2507 53.6% SWE-bench Verified, OpenHands scaffold.
Unsloth AI. "Unsloth Dynamic 2.0 Quantization." https://unsloth.ai UD-Q4_K_XL and UD-Q8_K_XL methodology.