Meet AntAngelMed: an open source medical language model with a 103B parameter built on the MOE architecture with a 1/32 activation ratio


A team of researchers from China has released AntAngelMed, a large, open-source medical language model that the team describes as the largest and most capable of its kind currently available.

What is AntAngelMed?

AntAngelMed is a language model for the medical domain that has 103 billion total parameters, but it doesn’t activate all of those parameters during inference. Instead, it uses a mixture of experts (MoE) architecture with an activation ratio of 1/32, which means that only 6.1 billion parameters are active at any given time when processing a query.

It is useful to know how the Ministry of Education structures work. In the standard dense model, every parameter is involved in processing each token. In the MOE model, the network is divided into many “expert” subnetworks, and the routing mechanism selects only a small subset of them to handle each input. This allows you to have a very large total number of parameters – which is usually associated with strong knowledge power – while keeping the actual computing cost of inference proportional to the smaller number of active parameters.

AntAngelMed inherits this design from Ling-flash-2.0, a basic model developed by includeAI and guided by what the team calls Ling Scaling Laws. Specific improvements on top include: improved expert detail, fine-tuned joint expert ratio, attention balancing mechanisms, auxiliary lossless sigmoid routing, MTP (Multi-Token Prediction) layer, QK-Norm, and Partial-RoPE (rotary position embedding is applied to a subset of attention vertices instead of all of them). According to the research team, together these design choices allow small MoE activation models to deliver up to 7× efficiency compared to dense architectures of similar size, meaning that with only 6.1 billion activation parameters, AntAngelMed can match the performance of an approximately 40 billion dense model. Separately, as the output length grows during inference, the relative speed advantage can also reach 7× or more compared to dense models of similar size.

https://modelscope.cn/models/MedAIBase/AntAngelMed

Training pipeline

AntAngelMed uses a The training process has three stages It is designed to put a general understanding of language above deep adaptation to the medical field.

the The first stage It is ongoing pre-training on wide-ranging medical collections, including encyclopedias, web texts, and academic publications. This phase is built on the Ling-flash-2.0 checkpoint, giving the model a strong foundation for general reasoning before medical specialization begins.

the The second stage It is supervised fine-tuning (SFT), where the model is trained on a multi-source instruction dataset. This dataset combines general reasoning tasks-mathematics, programming, and logic-to maintain train-of-thought capabilities, along with medical scenarios such as doctor-patient QandA, diagnostic reasoning, and safety and ethics situations.

the The third stage It is reinforcement learning using the GRPO (Group Proportional Policy Optimization) algorithm, combined with task-specific reward models. GRPO, originally presented in the DeepSeekMath paper, is a variation of PPO that estimates baselines from group scores rather than a separate critical model, making it computationally lighter. Here, reward cues are designed to model behavior toward empathy, structured clinical responses, safety boundaries, and evidence-based reasoning-all with the goal of reducing hallucinations in medical questions.

Perform inference

On H20 devices, AntAngelMed exceeds 200 tokens per second, which the research team reports is about 3 times faster than a 36 billion parameter-intensive model. Extrapolating from YaRN (another RoPE extension), it supports a context length of 128 KB – long enough to handle full clinical documents, extended patient histories, or multi-turn clinical dialogues.

The research team also released a quantitative version of the FP8 model. When this quantization is combined with the EAGLE3 speculative decoder improvement, the inference throughput at concurrency 32 improves significantly compared to FP8 alone: ​​71% on HumanEval, 45% on GSM8K, and 94% on Math-500. These benchmarks measure mathematical coding and reasoning tasks-not clinical tasks directly-but serve as proxies for the stability of the model’s overall throughput across output types.

Standard results

In HealthBench, OpenAI’s open source medical evaluation benchmark that uses multi-turn simulated medical dialogues to measure real-world clinical performance, AntAngelMed ranks… Firstly Among all the open source models it outperforms some of the best proprietary models as well, with a particularly large advantage in the HealthBench-Hard subset.

In MedAIBench, an evaluation system run by the National AI Medical Industry Pilot Facility of China, AntAngelMed ranks first, with particularly strong scores in the Medical Knowledge QandA and Medical Ethics and Safety categories.

In MedBench, a Chinese healthcare management MSc benchmark covering 36 independently curated datasets and nearly 700,000 samples across five dimensions – answering medical knowledge questions, understanding medical language, generating medical language, complex medical reasoning, and safety and ethics – AntAngelMed ranks first overall.

Visual explanation of Marktechpost

Technical guide
AntAngelMed

1 / 7

01 – Overview
What is AntAngelMed?
It is jointly developed by Zhejiang Provincial Health Information Center, Ant Healthcare and Zhejiang Anzhen’er Medical AI Technology Co., Ltd.

103 bTotal parameters
6.1 bActive in reasoning
128 thousandContext length

AntAngelMed is a medical field based LLM The structure of the Ministry of Education has an activation rate of 1/32. With a total of 103B parameters and only 6.1B active at inference time, it almost matches the performance 40B dense models At a fraction of the computational cost.

Model weights are released below Apache 2.0. The code repository is licensed under Massachusetts Institute of Technology.

02 – Architecture
Architecture and basic model of the Ministry of Education
Built on Ling-flash-2.0 by includeAI, guided by the laws of Ling Scaling.

AntAngelMed uses a 1/32 activation rate Ministry of Environment With improvements across all core components. These options allow MOE models with a small activation to provide up to 7× efficiency On dense structures of similar size – and as the output length grows, the relative speedups can reach 7x or more.

Main architectural components:

Expert accuracy
Percentage of experts involved
X-guidance
No loss of help
Medium-term plan layer
QK Norm
Partial rope
Spinning extrapolation
Attention balance

03- Training
Three-stage training pipeline
It is designed to layer general language understanding on top of deep adaptation to the medical field.

Stage 01
Continuous pre-training
Built on Ling-flash-2.0, it is trained on large-scale medical collections – encyclopedias, web texts, academic publications – to inject deep knowledge and global scope.

Stage 02
Supervised Fine Tuning (SFT)
Multi-source instruction data that combines general tasks (math, programming, and logic) for ideation, as well as clinical scenarios (doctor-patient QandA, diagnostic reasoning, and safety/ethics) for clinical adaptation.

Stage 03
Reinforce learning via GRPO
Optimizing group relative policy using task-specific reward models. Models behavior toward empathy, structural clarity, safety boundaries, and evidence-based reasoning to reduce hallucinations.

04 – Reasoning
Perform inference
Hardware benchmarks on H20 and productivity improvements through FP8+EAGLE3 optimization.

> 200 toc/s
On H20 devices. Nearly three times faster than a similarly dense 36B model.

7× efficiency
MoE vs. dense equivalent volume. The acceleration increases further as the output length grows.

+71% / +45% / +94%
Throughput gains of FP8 + EAGLE3 compared to FP8 alone on HumanEval/GSM8K/Math-500 in conjunction 32.

Context 128 kilos
Powered by YaRN extrapolation. Handles complete clinical documents and extended multi-turn dialogues.

05 – Standards
Standard results
It is assessed by three accredited criteria for the LLM.

Standard range a result
Health BenchOpenAI Multi-turn medical dialogue simulation for real-world clinical performance. 1 Open Source; Outperforms many proprietary models. Biggest lead on HealthBench-Hard.
MedAIBenchNat’l AI Medical Pilot Facility Chinese authority standard covering cognitive questions and answers and medical ethics/safety. Top level. Strongest in QandA knowledge and medical ethics/safety.
MidbenchChinese healthcare field 36 datasets, approximately 700,000 samples across 5 clinical dimensions. No. 1 overall in all five dimensions.

06 – Quick Start
Run with face-hugging transformers
Requires trust_remote_code=True for MoE routing code.

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "MedAIBase/AntAngelMed",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("MedAIBase/AntAngelMed")

messages = [
  {"role": "system", "content": "You are AntAngelMed, a helpful medical assistant."},
  {"role": "user",   "content": "What should I do if I have a headache?"}
]
text   = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt",
    return_token_type_ids=False).to(model.device)
out    = model.generate(**inputs, max_new_tokens=16384)
out    = [o[len(i):] for i, o in zip(inputs.input_ids, out)]
print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])

Also supports: vLLM v0.11.0 (4-GPU parallel tensor), SGLang With FlashAttention-3, and vLLM-asc For Huawei Ascend 910B NPUs.

07 – Arrival
Resources and links
Model weights Apache 2.0 – MIT Code Repository – FP8 quantization variant available separately.

Developed by Zhejiang Provincial Health Information Center, Ant health careand Zhejiang Anziner Medical Artificial Intelligence Co., Ltd
Coverage by marketpost -Marktechpost.com

Key takeaways

  • AntAngelMed is an open source 103B parameter medical LLM that activates only 6.1B parameters at inference time using a 1/32 activation ratio MoE architecture inherited from Ling-flash-2.0.
  • It uses a three-stage training pipeline: continuous pre-training on clinical materials, SFT with mixed general instruction and clinical data, and GRPO-based reinforcement learning for safety and diagnostic inference.
  • On H20 devices, the model exceeds 200 tokens/sec and supports a context length of 128K via YaRN extrapolation – about 3x faster than the similarly dense 36B model.
  • AntAngelMed ranks first among open source models in OpenAI’s HealthBench, outperforms many proprietary models, and tops both the MedAIBench and MedBench leaderboards.
  • The model is available on Hugging Face, ModelScope, and GitHub; The model weights are Apache 2.0, the code is MIT, and a quantum version FP8 has also been released.

verify Typical weights at high frequency, GitHub repo and Technical details. Also, feel free to follow us on twitter And don’t forget to join us 150k+ mil SubReddit And subscribe to Our newsletter. I am waiting! Are you on telegram? Now you can join us on Telegram too.

Do you need to partner with us to promote your GitHub Repo page, face hug page, product release, webinar, etc.? Contact us


Leave a Reply