⚡ Qwen3.7 Max — Quick Verdict
Released: May 20, 2026 · Alibaba Cloud Summit, Hangzhou
Intelligence Index: 56.6 — #5 at launch, highest Chinese AI model ever recorded
Best for: Long-horizon agentic coding, mathematics, multilingual tasks, sustained autonomous execution
Price: $2.50/M input · $7.50/M output · $0.25/M cached input (90% discount)
Context: 1 million tokens · Max output: 65,536 tokens per request
Open weights: No — API only via Alibaba Cloud DashScope and OpenRouter
Caveat: Highly verbose — generates 97M tokens per eval vs median 24M — cost implications at scale
Alibaba launched Qwen3.7 Max at the Alibaba Cloud Summit in Hangzhou on May 20, 2026. The model scored 56.6 on the Artificial Analysis Intelligence Index at launch — placing it #5 globally and making it the highest-ranked Chinese AI model ever recorded on that leaderboard. It ran a 35-hour autonomous coding session firing 1,158 tool calls. It leads every competitor on HMMT 2026 competition mathematics. And it costs $2.50 per million input tokens with a 1M token context window. Here is the full review.
What Is Qwen3.7 Max?
Qwen3.7 Max is the flagship model in Alibaba's Qwen3.7 series, released alongside Qwen3.7-Plus-Preview — a multimodal variant with vision input at a lower price point. Max is the heavy hitter: purpose-built for reasoning, agentic coding, long-horizon autonomous execution, and document-scale context. Unlike several earlier Qwen releases, Qwen3.7 Max is not open-weight — it runs only through Alibaba's hosted API via DashScope, Alibaba Cloud Model Studio, and OpenRouter.
The model represents a significant step up from Qwen3.6-Plus. On Terminal Bench 2.0-Terminus, Qwen3.7-Max scores 69.7 against Qwen3.6-Plus's 61.6. On the YC-Bench startup simulation, it achieved 2.08 million USD in total revenue — double Qwen3.6-Plus's 1.05 million USD.
Qwen3.7 Max Benchmarks — Where It Leads and Where It Trails
| Benchmark |
Qwen3.7 Max |
Claude Opus 4.6 Max |
DeepSeek V4 Pro Max |
| AA Intelligence Index |
56.6 |
N/A |
N/A |
| HMMT 2026 (math) |
97.1% |
96.2% |
95.2% |
| HLE (Humanity's Last Exam) |
41.4% |
40.0% |
37.7% |
| GPQA Diamond |
92.3–92.4% |
N/A |
N/A |
| IMOAnswerBench |
90.0% |
N/A |
89.8% |
| Apex benchmark |
44.5 |
34.5 |
38.3 |
| PolyMATH |
86.5% |
80.2% |
N/A |
| WMT24++ (multilingual) |
85.8% |
82.7% |
N/A |
| Terminal Bench 2.0 |
69.7% |
N/A |
N/A |
Qwen3.7 Max leads every competitor on HMMT 2026 mathematics, HLE, Apex, PolyMATH, and multilingual translation. It scores 92.3% on GPQA Diamond — graduate-level scientific reasoning — ahead of where Claude Opus 4.6 and Gemini 3.1 Pro landed at the same benchmark stage. The overall Intelligence Index of 56.6 places it between Gemini 3.5 Flash and Claude Opus 4.7 on Artificial Analysis's composite leaderboard.
The 35-Hour Autonomous Coding Run
The most striking demonstration in the Qwen3.7 Max launch is the autonomous kernel optimization run. Alibaba's internal testing reports a 35-hour autonomous coding run that fired 1,158 tool calls and hit a 10x speedup over the standard Triton reference. The task was GPU kernel optimization — the kind of sustained, multi-step engineering work that requires holding context across hundreds of iterations, debugging failures autonomously, and making architectural decisions without human intervention.
35 hours of continuous autonomous operation is longer than any documented agentic coding run from Claude Code, GPT-5.5 Codex, or Kimi Code CLI. It demonstrates that Qwen3.7 Max's 1M token context window is not a marketing figure — it is actually being used to maintain coherent state across a task that spans a working day and a half.
The 10x speedup on Triton reference is the practical outcome: not a benchmark score but a real-world engineering result. GPU kernel optimization that takes a skilled human engineer weeks was completed autonomously in 35 hours.
Pricing — The Important Caveat
Qwen3.7 Max costs $2.50 per million input tokens and $7.50 per million output tokens, based on the median across providers. Cached input drops to $0.25 per million tokens — a 90% discount for repeated long-context calls that makes sustained agentic sessions with shared context significantly cheaper.
Here is where it gets complicated. The model is notably verbose. Artificial Analysis observed approximately 97 million tokens generated during their evaluation, far above the median of 24 million. That is roughly 4x more output tokens than comparable frontier models generate for the same evaluation tasks.
The practical implication: at $7.50/M output tokens and 4x verbosity, a task that costs $7.50 on another frontier model may cost $30 on Qwen3.7 Max if you do not actively manage output length. For long agentic sessions, the 90% cached input discount partially offsets this — but output token cost is the variable that needs controlling.
| Model |
Input $/M |
Output $/M |
Context |
| Qwen3.7 Max |
$2.50 |
$7.50 |
1M tokens |
| Claude Opus 4.7 |
$5.00 |
$25.00 |
200K tokens |
| GPT-5.5 |
$5.00 |
$30.00 |
1M tokens |
| Gemini 3.5 Flash |
~$0.30 |
~$1.20 |
1M tokens |
| Kimi K2.6 |
$0.95 |
$4.00 |
256K tokens |
Where Qwen3.7 Max Wins
Mathematics at the frontier: 97.1% on HMMT 2026 is the highest score recorded on that benchmark. Combined with 90.0% on IMOAnswerBench and strong PolyMATH performance, Qwen3.7 Max is now the strongest model for competition-level mathematics outside of specialized reasoning models.
Long-horizon autonomous execution: The 35-hour, 1,158-tool-call coding run is the most compelling evidence of sustained autonomous capability released alongside any model in 2026. For tasks that require multi-day autonomous execution, Qwen3.7 Max's 1M context window and demonstrated endurance make it the clearest choice.
Multilingual performance: 85.8% on WMT24++ across 55 languages leads all measured competitors. For teams working across Asian and European languages at the same time, this is a meaningful advantage over models tuned primarily for English.
Input cost vs context window: $2.50/M input with a 1M token context window compares favorably to GPT-5.5 ($5.00/M, 1M context) for long-document workloads. The input cost advantage holds even before the 90% cached input discount applies.
Where Qwen3.7 Max Falls Short
Output verbosity cost: Generating 97M tokens per evaluation vs a 24M median means real-world output costs run 2–4x higher than comparable tasks on other models unless you explicitly constrain response length in your prompts. The $7.50/M output price becomes expensive fast if the model is generating unnecessary elaboration.
Not open-weight: Unlike Kimi K2.6 (Modified MIT) or DeepSeek V4 (partially open), Qwen3.7 Max is API-only. Teams that need to self-host, fine-tune on proprietary data, or run inference behind their own firewall cannot use Qwen3.7 Max.
No multimodal in Max tier: Qwen3.7-Plus-Preview adds vision input — Qwen3.7 Max does not. For teams that need image or document analysis alongside reasoning, the Plus variant or a different model is required.
No enterprise compliance documentation yet: As a new model from Alibaba Cloud, SOC 2, HIPAA, and similar compliance certifications are not confirmed available for Qwen3.7 Max at launch. For regulated industries, this is a blocker until documentation is published.
Who Should Use Qwen3.7 Max
Use Qwen3.7 Max if your primary workloads are mathematics-heavy (research, financial modeling, scientific computing), you need sustained autonomous coding execution measured in hours rather than minutes, you work across multiple languages and need frontier-level multilingual performance, or you want GPT-5.5-level context (1M tokens) at half the input cost.
Manage the verbosity actively: add explicit output length constraints in your system prompt ("Respond concisely. Do not elaborate beyond what is required.") to avoid the 4x output token inflation that Artificial Analysis observed in unguided evaluation.
Do not use Qwen3.7 Max if you need open weights, multimodal (vision) input in the same model, enterprise compliance certifications, or the cheapest possible token cost — Gemini 3.5 Flash and Kimi K2.6 are significantly cheaper for high-volume workloads that do not require Qwen3.7 Max's specific strengths.
FAQ
When was Qwen3.7 Max released?
Qwen3.7 Max launched on May 20, 2026 at the Alibaba Cloud Summit in Hangzhou. It is available via Alibaba Cloud DashScope, Alibaba Cloud Model Studio, and OpenRouter.
What is Qwen3.7 Max's context window?
Qwen3.7 Max supports a 1 million token context window with a maximum output of 65,536 tokens per request. The 1M context window is shared with GPT-5.5 and Gemini 3.5 Flash — significantly larger than Claude Opus 4.7's 200K tokens.
Is Qwen3.7 Max open source?
No. Unlike several earlier Qwen releases, Qwen3.7 Max is not open-weight. It runs only through Alibaba's hosted API. An open-weight mid-tier Qwen3.7 variant may follow, consistent with Alibaba's pattern of releasing hosted flagship models before open-weight equivalents.
How does Qwen3.7 Max compare to Claude Opus 4.7?
Qwen3.7 Max leads on mathematics (HMMT 2026, IMOAnswerBench, PolyMATH), costs 50% less per input token ($2.50 vs $5.00/M), and offers a larger context window (1M vs 200K tokens). Claude Opus 4.7 leads on SWE-Bench Pro coding benchmarks (64.3% vs Qwen3.7 Max's Terminal Bench score), enterprise reliability, and has a mature ecosystem via Claude Code. The right choice depends on your workload — math and multilingual favor Qwen3.7 Max, production coding reliability favors Claude Opus 4.7.
What is the Qwen3.7 Max API price?
The API costs $2.50 per million input tokens and $7.50 per million output tokens, with a maximum output of 65,536 tokens per request. Cached input drops to $0.25 per million tokens — a 90% discount that makes repeated long-context calls significantly cheaper. Note the verbosity caveat: Qwen3.7 Max generates roughly 4x more output tokens than comparable models in unguided tasks, which inflates real-world output costs unless prompt-level length constraints are applied.
What is the Qwen3.7 Max Intelligence Index score?
Qwen3.7 Max scores 57 on the Artificial Analysis Intelligence Index — well above average among comparable reasoning models, placing it in the global top 10 of 151 measured models at launch and making it the highest-ranked Chinese AI model on that leaderboard to date.