Qwen Introduces Qwen3.7-Max: A Reasoning Agent Model with a 1M-Token Context Window


Most AI models today are not designed for continuous, multi-step autonomous execution. Tasks such as running hundreds of iterative code modifications or chaining tool calls over hours without human intervention require a different model architecture and training focus.

Alibaba’s Qwen team officially announced Qwen3.7-Max at the 2026 Alibaba Cloud Summit on May 20, although two preview builds of the Qwen3.7 series had already appeared quietly on the Arena AI leaderboard with no press release and no official API announcement.

Two preview models were released simultaneously

Alibaba previewed two models simultaneously: Qwen3.7-Max-Preview and Qwen3.7-Plus-Preview. They are ranked 13th globally in text capabilities and 16th in vision capabilities, respectively, according to LM Arena.

In the Text Arena, Qwen3.7-Max-Preview ranked No. 13 overall, making Alibaba the No. 6 text lab. In the Vision Arena, Qwen3.7-Plus-Preview ranked No. 16 overall, making Alibaba the No. 5 vision lab. Model rank and lab rank are two separate numbers.

Qwen3.7-Plus-Preview is described as a preview of a balanced, high-performance release focused on logical reasoning and expression, with its toolchain to be opened up gradually. It handles vision and multimodal input. Qwen3.7-Max is the text-only reasoning flagship. This article focuses on Qwen3.7-Max, as it is the model Alibaba officially announced with API access.

What is Qwen3.7-Max designed for?

The Alibaba Qwen team describes Qwen3.7-Max as its most advanced and comprehensive agent model to date. The model is proprietary and closed-weight. It is capable of handling coding and debugging, office workflow automation, and long-running tasks spanning hundreds or even thousands of steps.

Extended thinking mode

Qwen3.7-Max is a reasoning model. It first generates a chain of thought, an internal sequence of steps in which it plans, checks its work, and course-corrects before committing to a final answer. On interfaces like Qwen Chat, this appears as a “thinking” mode that you can turn on to see the model’s reasoning trace.

Reasoning models produce far more output tokens than standard completions. When Artificial Analysis ran its Intelligence Index evaluation, Qwen3.7-Max generated about 97 million tokens, compared to an average of 24 million for models on this benchmark. For short or simple tasks, this overhead adds latency without improving output quality. For multi-step planning, code refactoring, or long agent chains, extended reasoning is where the model’s strength shows.

Context window

The model features a 1M-token context window, compared to 256K in Qwen3.6 Max Preview. It supports text input and output only. Pricing has not been announced yet. For reference, Qwen3.6 Max Preview is priced at $1.30/$7.80 per million input/output tokens on Alibaba Cloud.
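Because Qwen3.7-Max pricing is unannounced, a budget can only be sketched with placeholder rates. The minimal example below uses the Qwen3.6 Max Preview prices quoted above as stand-ins; the function name and the example token counts are illustrative assumptions, not official figures.

```python
# Placeholder rates: the Qwen3.6 Max Preview prices quoted in the article.
# Qwen3.7-Max pricing is NOT announced; replace these once it is.
INPUT_USD_PER_MTOK = 1.30
OUTPUT_USD_PER_MTOK = 7.80


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request in USD from its token counts."""
    return (
        input_tokens / 1_000_000 * INPUT_USD_PER_MTOK
        + output_tokens / 1_000_000 * OUTPUT_USD_PER_MTOK
    )


if __name__ == "__main__":
    # Hypothetical request: a full 1M-token prompt with 20k tokens of output.
    print(f"${estimate_cost_usd(1_000_000, 20_000):.2f}")
```

Output tokens dominate quickly at these rates, which is why long reasoning traces matter for cost planning.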

A context window of 1 million tokens can hold an entire medium-sized code repository or a large set of documents in a single request. Models are often less reliable when the context window is nearly full, and independent long-context testing of Qwen3.7-Max is not yet available.
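One way to check whether a repository plausibly fits is to estimate its token count before sending it. The sketch below assumes roughly four characters per token, a common rule of thumb and not the real Qwen tokenizer; the helper names, file extensions, and output reserve are all illustrative assumptions.

```python
import os

CONTEXT_LIMIT_TOKENS = 1_000_000  # window size reported for Qwen3.7-Max
CHARS_PER_TOKEN = 4  # rough rule of thumb, NOT the real Qwen tokenizer


def estimate_repo_tokens(root: str, extensions: tuple[str, ...] = (".py", ".md")) -> int:
    """Walk a directory and estimate tokens from the character count of matching files."""
    total_chars = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(extensions):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="ignore") as handle:
                    total_chars += len(handle.read())
    return total_chars // CHARS_PER_TOKEN


def fits_in_context(token_estimate: int, reserve_for_output: int = 50_000) -> bool:
    """Check the estimate against the window, leaving headroom for the reply."""
    return token_estimate + reserve_for_output <= CONTEXT_LIMIT_TOKENS
```

For a real decision, count tokens with the provider's tokenizer; the heuristic only indicates the order of magnitude.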

Benchmark results

Qwen3.7-Max scored 56.6 on the Artificial Analysis Intelligence Index, ranking fifth overall. This is an increase of 4.8 points over its predecessor, Qwen3.6 Max Preview (51.8), and puts it ahead of Google’s Gemini 3.5 Flash (55.3). GPT-5.5 (60.2), Claude Opus 4.7 (57.3), and Gemini 3.1 Pro Preview (57.2) still lead the overall rankings.

Intelligence Index v4.0 aggregates ten evaluations, including GDPval-AA, Terminal-Bench Hard, SciCode, AA-Omniscience, Humanity’s Last Exam, and GPQA Diamond.

https://qwen.ai/blog?id=qwen3.7

The improvement over Qwen3.6 Max Preview is not uniform. Most of the index gains are concentrated in scientific reasoning, agentic capability, and coding. CritPt rose 9.7 percentage points (from 3.7% to 13.4%), Humanity’s Last Exam rose 9.2 points (from 28.9% to 38.1%), and Terminal-Bench Hard rose 6.9 points (from 43.9% to 50.8%). GDPval-AA gained 42 Elo points (from 1504 to 1546). Scores on the other benchmarks are largely flat compared to Qwen3.6 Max Preview.
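The reported gains can be checked with simple subtraction. The snippet below recomputes them from the before and after values quoted above; the dictionary layout and function name are illustrative choices.

```python
# (before, after) pairs as reported for Qwen3.6 Max Preview -> Qwen3.7-Max.
REPORTED = {
    "CritPt (%)": (3.7, 13.4),
    "Humanity's Last Exam (%)": (28.9, 38.1),
    "Terminal-Bench Hard (%)": (43.9, 50.8),
    "GDPval-AA (Elo)": (1504, 1546),
}


def deltas(scores: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Return after-minus-before for each benchmark, rounded to one decimal."""
    return {name: round(after - before, 1) for name, (before, after) in scores.items()}


if __name__ == "__main__":
    for name, gain in deltas(REPORTED).items():
        print(f"{name}: +{gain}")
```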

One result in the index requires careful reading. In AA-Omniscience, Qwen3.7-Max’s raw accuracy decreased by 7.6 percentage points (from 37.7% to 30.1%), while its hallucination rate decreased by 21.3 points (from 44.2% to 22.9%). The model chooses to say “I don’t know” more often rather than recalling more facts. Its attempt rate dropped from 67.3% to 48.0%, the lowest among the models in the comparison. The AA-Omniscience benchmark rewards correct answers and penalizes hallucinations, but it does not penalize declining to answer. For use cases that rely on large-scale factual recall, this is a meaningful constraint to test against your workload.
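To see why a model can lose raw accuracy and still do better under this kind of scoring, consider a minimal sketch that assumes +1 per correct answer, -1 per incorrect answer, and 0 per abstention. These weights and the question counts are illustrative assumptions; they are not Artificial Analysis's published formula or Qwen's actual results.

```python
def abstention_aware_score(correct: int, incorrect: int, abstained: int) -> float:
    """Score in [-100, 100]: +1 per correct, -1 per incorrect, 0 per abstention (assumed weights)."""
    total = correct + incorrect + abstained
    if total == 0:
        raise ValueError("at least one question is required")
    return 100.0 * (correct - incorrect) / total


# Two hypothetical models on 100 questions each (made-up counts).
eager = abstention_aware_score(correct=38, incorrect=30, abstained=32)
cautious = abstention_aware_score(correct=30, incorrect=18, abstained=52)
```

Under these assumed weights the cautious model scores higher despite answering fewer questions correctly, which is the trade-off to watch for in recall-heavy workloads.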

In the Text Arena, Qwen3.7-Max-Preview ranked No. 13 overall with an Elo score of 1,475. Category rankings include No. 7 in Mathematics, No. 9 in Expert Prompts, No. 9 in Software and Information Technology, and No. 10 in Programming.

All rankings are preliminary. The model carries a “preview” status, indicating that Alibaba considers it an early build.

Agent performance – internal testing

In an internal Alibaba test on a new chip platform, the model autonomously made more than 1,000 tool calls and iterative code modifications to optimize a key kernel. Alibaba claims the process improved inference speed about 10-fold compared to the previous version.

Marktechpost’s Visual Explainer






Slide 1 of 6
What is Qwen3.7-Max?
Alibaba’s proprietary reasoning model, designed for long-running agent tasks, code generation, and multi-step automation.

Context window
1 million tokens, enough to fit an entire medium-sized code repository in a single request.
Reasoning model
Uses chain-of-thought reasoning (extended thinking mode) before providing a final answer.
I/O
Text in, text out. This model does not support image input.
API string
Use qwen3.7-max when connecting via Alibaba Cloud Model Studio.

API compatible
OpenAI and Anthropic specifications
Preview: no open weights yet

Slide 2 of 6
Quick start: Chat interface
The fastest way to test Qwen3.7-Max without requiring an API key or setup.

  • 1. Go to Qwen Chat
    Go to chat.qwen.ai and create a free account.
  • 2. Select the model
    In the model selector drop-down, choose Qwen3.7-Max. It may appear as Qwen3.7-Max-Preview during the preview period.
  • 3. Enable thinking mode
    Turn on Thinking mode in the chat interface. This activates chain-of-thought reasoning and shows the model’s internal reasoning before the final answer.
  • 4. Submit your prompt
    Type your query. For best results on complex tasks, be specific about the steps, constraints, and expected output format.
💡
Use difficult, realistic prompts when testing. Multi-step math problems, complex refactoring requests, and ambiguous expert questions reveal more about the model’s quality than simple prompts.

Slide 3 of 6
API access
Qwen3.7-Max is compatible with the OpenAI and Anthropic API specifications. You can connect it to existing pipelines with minimal changes.

OpenAI-compatible Python call

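import os

from openai import OpenAI

# Point the OpenAI SDK at DashScope's OpenAI-compatible endpoint (international region).
client = OpenAI(
    # Read the key from the environment; the placeholder is only a fallback for illustration.
    api_key=os.environ.get("DASHSCOPE_API_KEY", "YOUR_DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
)

# Send a single-turn chat request to the model.
response = client.chat.completions.create(
    model="qwen3.7-max",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user",   "content": "Explain chain-of-thought reasoning."}
    ]
)

# Print the final answer text.
print(response.choices[0].message.content)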

ℹ️
Get your API key from Alibaba Cloud Model Studio (DashScope). The base URL host for international access is dashscope-intl.aliyuncs.com.
⚠️
Pricing has not yet been announced for Qwen3.7-Max. For reference, Qwen3.6 Max Preview is priced at $1.30 / $7.80 per million input/output tokens.

Slide 4 of 6
Understanding thinking mode
Thinking mode is the model’s chain-of-thought reasoning layer. It determines how the model works through a problem before generating a response.

When to use it
Multi-step code refactoring, complex mathematical proofs, long agent task chains, and ambiguous problems that require step-by-step planning.
When to skip it
Short rewrites, simple completions, quick lookups, or tasks that need to keep latency and token cost to a minimum.


API: Enable reasoning via extra_body

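# Reuses the client created in the previous example.
response = client.chat.completions.create(
    model="qwen3.7-max",
    messages=[{"role": "user", "content": "Your prompt here"}],
    # extra_body passes provider-specific fields through the OpenAI SDK.
    extra_body={"enable_thinking": True}
)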

💡
Qwen3.7-Max generated approximately 97 million tokens in Artificial Analysis’s benchmark run, versus an average of 24 million for comparable models. Every reasoning token adds latency and cost, so use thinking mode selectively.
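One way to apply thinking mode selectively is to decide per request. The sketch below builds the keyword arguments for the call shown above and switches enable_thinking on only for task types that benefit; the task labels and the build_request helper are hypothetical, and it assumes the endpoint accepts an explicit False for enable_thinking.

```python
# Task labels are hypothetical; adapt them to your own routing logic.
THINKING_TASKS = {"refactor", "proof", "agent", "planning"}


def build_request(prompt: str, task_type: str, model: str = "qwen3.7-max") -> dict:
    """Build keyword arguments for client.chat.completions.create()."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # Assumption: the endpoint accepts enable_thinking=False as well as True.
        "extra_body": {"enable_thinking": task_type in THINKING_TASKS},
    }
```

With a configured client, the call becomes client.chat.completions.create(**build_request("Refactor this module", "refactor")).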

Slide 5 of 6
Agentic and long-horizon tasks
Qwen3.7-Max is designed to run long, autonomous task loops. In Alibaba’s internal testing, it executed more than 1,000 tool calls and sustained autonomous execution for up to 35 hours.

  • 1. Define tools clearly
    Pass tool definitions in the OpenAI-standard tools parameter. The model natively supports function calling and iterative tool calling.
  • 2. Use the 1M context window intentionally
    Pass the complete task history, previous tool output, and code state into the context. Truncate aggressively when full context is not needed, because every token is billed.
  • 3. Target the final answer in assertions
    Reasoning output is longer and more varied than a standard completion. When writing tests, assert on the final answer, not the exact wording of the reasoning trace.
  • 4. Good use cases
    Kernel optimization, code debugging loops, office workflow automation, and multi-step data pipelines with iterative validation.
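The tool-calling advice above can be sketched as a small loop. This is a minimal illustration, not Alibaba's agent harness: run_tests is a hypothetical tool, and chat is a hypothetical adapter that wraps the API call and returns the assistant message as a plain dict in the OpenAI chat format.

```python
import json
from typing import Callable

# Tool schema in the OpenAI "tools" format; run_tests is a hypothetical tool.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the test suite and return a summary.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]


def run_tests(path: str) -> str:
    """Stand-in implementation; a real tool would invoke the test runner."""
    return f"all tests passed in {path}"


LOCAL_TOOLS: dict[str, Callable[..., str]] = {"run_tests": run_tests}


def agent_loop(chat: Callable[[list, list], dict], user_prompt: str, max_steps: int = 10) -> str:
    """Call the model until it answers without requesting a tool, or max_steps is hit."""
    messages = [{"role": "user", "content": user_prompt}]
    for _ in range(max_steps):
        reply = chat(messages, TOOLS)  # assistant message as a plain dict
        messages.append(reply)
        tool_calls = reply.get("tool_calls") or []
        if not tool_calls:
            return reply["content"]  # final answer: the only part worth asserting on
        for call in tool_calls:
            fn = LOCAL_TOOLS[call["function"]["name"]]
            args = json.loads(call["function"]["arguments"])
            messages.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": fn(**args),
            })
    raise RuntimeError("step budget exhausted without a final answer")
```

The explicit max_steps cap is a design choice for safety: a long-horizon run should fail loudly rather than loop indefinitely, and tests should check the returned final answer rather than intermediate reasoning.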
⚠️
The 35-hour and 1,000+ tool-call figures come solely from Alibaba’s internal tests. There is no independent verification of these specific claims.
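For the advice on truncating context when the full history is not needed, one simple policy is to keep the original task message plus the most recent messages that fit a token budget. The sketch below uses a rough four-characters-per-token heuristic rather than the real tokenizer, and the helper names are illustrative assumptions.

```python
CHARS_PER_TOKEN = 4  # rough heuristic, not the real tokenizer


def rough_tokens(message: dict) -> int:
    """Approximate a message's token count from its content length."""
    return len(message.get("content") or "") // CHARS_PER_TOKEN + 1


def trim_history(messages: list[dict], budget_tokens: int) -> list[dict]:
    """Keep the first message (the task) plus the newest messages that fit the budget."""
    if not messages:
        return []
    head, tail = messages[0], messages[1:]
    remaining = budget_tokens - rough_tokens(head)
    kept: list[dict] = []
    for message in reversed(tail):  # walk from newest to oldest
        cost = rough_tokens(message)
        if cost > remaining:
            break
        kept.append(message)
        remaining -= cost
    return [head] + kept[::-1]
```

A production version would cut on turn boundaries so that a tool result is never kept without the assistant message that requested it.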

Slide 6 of 6
Known limitations
Understanding these limitations before integrating will save debugging time and help you set the right expectations.

No image input
Qwen3.7-Max is text only. For multimodal tasks, use Qwen3.7-Plus-Preview instead, which supports vision input.
AA-Omniscience abstention
In the AA-Omniscience benchmark, the model’s attempt rate decreased from 67.3% to 48.0%. It abstains more and hallucinates less, but its raw factual recall is also lower. Test carefully for knowledge recall tasks.
Preview status
The model currently has the -preview suffix. Benchmark scores, behavior, and pricing could change before the stable release. No open-weight version is available as of May 2026.
Long context reliability
The 1M token context window is a cap, not a guarantee. Independent long-context testing of Qwen3.7-Max is not yet available. Validate retrieval quality on your specific workload.
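One common way to validate retrieval quality is a needle-in-a-haystack check: plant a known fact at different depths of a long context and see whether the model returns it. The sketch below is a minimal version under stated assumptions; ask is a hypothetical callable that sends a prompt to the model and returns its text answer, and the filler text, positions, and code word are made up.

```python
from typing import Callable


def build_haystack(needle: str, filler: str, total_lines: int, position: float) -> str:
    """Return total_lines of text with the needle inserted at a relative position in [0, 1]."""
    lines = [filler] * (total_lines - 1)
    index = int(position * len(lines))
    lines.insert(index, needle)
    return "\n".join(lines)


def retrieval_hit_rate(ask: Callable[[str], str], expected: str, total_lines: int = 1000) -> float:
    """Fraction of needle positions at which the model's answer contains the expected string."""
    needle = f"The deployment code word is {expected}."
    positions = [0.0, 0.25, 0.5, 0.75, 1.0]
    hits = 0
    for position in positions:
        context = build_haystack(
            needle, "Routine log line with no useful facts.", total_lines, position
        )
        answer = ask(context + "\n\nWhat is the deployment code word?")
        hits += expected in answer
    return hits / len(positions)
```

Raising total_lines toward the size of your real inputs shows where, if anywhere, recall starts to degrade on your workload.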
ℹ️
For the latest model updates, check Qwen’s official blog at qwen.ai/blog and the Alibaba Cloud Model Studio documentation.

Key takeaways:

  • Alibaba has released two Qwen3.7 preview models: Max (text/reasoning) and Plus (multimodal).
  • Qwen3.7-Max scored 56.6 on the Artificial Analysis Intelligence Index, ranking fifth overall, up 4.8 points from Qwen3.6 Max Preview.
  • The 1 million token context window is roughly four times the 256K limit of Qwen3.6 Max Preview; the model is text only, with no image input.
  • In AA-Omniscience, raw accuracy decreased while abstention increased, which is worth testing in knowledge recall use cases.
  • The model sustained over 1,000 tool calls and 35 hours of autonomous execution, but only in Alibaba’s internal testing; there is no independent verification yet.
