Poolside AI has released the first two models in the Laguna family: Laguna M.1 and Laguna XS.2. Alongside the models, the company is launching Pool – a lightweight terminal coding agent and Agent Client Protocol (ACP) binary – the same harness Poolside uses internally to train and evaluate agentic RL, now available as a research preview.
What are these models, and why should you care about them?
Both Laguna M.1 and Laguna XS.2 are Mixture-of-Experts (MoE) models. Instead of activating all parameters for every token, MoE models route each token through only a subset of specialized subnetworks called “experts.” This gives a large total parameter count – and the capabilities that come with it – while only paying the compute cost of a much smaller number of “activated” parameters at inference time.
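To make the routing idea concrete, here is a minimal, generic top-k MoE forward pass in PyTorch – an illustrative sketch with toy sizes, not Laguna’s actual router (XS.2 uses sigmoid gating, covered below):

```python
import torch
import torch.nn.functional as F

def moe_forward(x, router, experts, k=2):
    """Route each token to its top-k experts and mix their outputs.

    x:       (num_tokens, d_model) token activations
    router:  linear layer mapping d_model -> num_experts
    experts: list of per-expert feed-forward networks
    """
    logits = router(x)                            # (tokens, num_experts)
    weights, idx = torch.topk(logits, k, dim=-1)  # keep only k experts per token
    weights = F.softmax(weights, dim=-1)          # normalize over the selected experts

    out = torch.zeros_like(x)
    for slot in range(k):
        for e, expert in enumerate(experts):
            mask = idx[:, slot] == e              # tokens routed to expert e in this slot
            if mask.any():
                out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
    return out

# Toy usage: only k of n_experts FFNs run per token, which is the whole point.
d, n_experts = 64, 8
router = torch.nn.Linear(d, n_experts, bias=False)
experts = [torch.nn.Sequential(torch.nn.Linear(d, 4 * d), torch.nn.GELU(),
                               torch.nn.Linear(4 * d, d)) for _ in range(n_experts)]
y = moe_forward(torch.randn(10, d), router, experts, k=2)
```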
Laguna M.1 is a 225B total-parameter MoE model with 23B active parameters, trained from scratch on 30T tokens using 6,144 NVIDIA Hopper GPUs. It completed pre-training at the end of last year and serves as the foundation for the entire Laguna family. On benchmarks, it reaches 72.5% on SWE-bench Verified, 67.3% on SWE-bench Multilingual, 46.9% on SWE-bench Pro, and 40.7% on Terminal-Bench 2.0.

Laguna XS.2 is Poolside’s second-generation MoE model and its first open-weight release, building on everything learned since M.1’s training. With 33B total parameters and 3B activated per token, it is designed for agentic coding and long-running operation on a local machine – small enough to run on a Mac with 36GB of RAM via Ollama. It scores 68.2% on SWE-bench Verified, 62.4% on SWE-bench Multilingual, 44.5% on SWE-bench Pro, and 30.1% on Terminal-Bench 2.0. Poolside will also release Laguna XS.2-Base soon for practitioners who want to fine-tune.
Architecture: Efficiency decisions in XS.2
XS.2 uses sigmoid gating with per-layer router scales and a hybrid attention scheme that mixes sliding-window attention (SWA) and global attention at a 3:1 ratio across 40 total layers – 30 SWA layers and 10 global attention layers. Sliding-window attention restricts each token’s attention to a local window of 512 tokens rather than the entire sequence, which dramatically shrinks the KV cache. Global attention in one of every four layers maintains long-range dependencies without paying the full cost everywhere. The model also quantizes the KV cache to FP8, further reducing memory per token.
Under the hood, XS.2 uses 256 experts plus 1 shared expert, supports a context window of 131,072 tokens, and offers native reasoning support – interleaved reasoning between tool calls, with per-request control to enable or disable reasoning.
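A small sketch of what the published numbers imply for the attention stack. The exact interleaving of SWA and global layers is an assumption – the article only gives the 30/10 split – as is the mask construction:

```python
import torch

NUM_LAYERS, WINDOW, CONTEXT = 40, 512, 131_072  # context = 131,072 tokens

# One in four layers uses global attention (10 global, 30 SWA). The exact
# placement below is an assumption; the article only states the 3:1 ratio.
layer_kinds = ["global" if (i + 1) % 4 == 0 else "swa" for i in range(NUM_LAYERS)]
assert layer_kinds.count("global") == 10 and layer_kinds.count("swa") == 30

def attention_mask(seq_len: int, kind: str, window: int = WINDOW) -> torch.Tensor:
    """Boolean mask: True where query i may attend to key j (causal)."""
    i = torch.arange(seq_len)[:, None]
    j = torch.arange(seq_len)[None, :]
    causal = j <= i
    if kind == "global":
        return causal                    # full causal attention
    return causal & (i - j < window)     # only the last `window` tokens are visible

# A SWA layer's KV cache only ever needs `window` entries per head instead of
# the full 131,072-token context, which is where most of the memory saving comes from.
print(attention_mask(6, "swa", window=3).int())
```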




Training: Three areas where Poolside pushed hard
The Poolside team trains all of its models from scratch using its own data pipeline, its own training stack (Titan), and its own agentic RL infrastructure. Three areas received particular investment for Laguna.
AutoMixer: Optimizing the data mix automatically. The curation of data and the mix that goes into training has a huge impact on final model performance. Instead of relying on manual heuristics, Poolside developed an auto-mixing framework that trains a population of about 60 proxy models, each on a different data mix, and measures performance across key capability clusters – code, math, STEM, and common sense. Surrogate regressors are then fit to approximate how changes in dataset proportions affect final evaluations, providing a learned mapping from data mix to performance that can be optimized directly. The approach is inspired by prior work including OLMix, MDE, and RegMix, adapted to Poolside’s setting with richer datasets.
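Here is a toy sketch of the surrogate-regression idea, with hypothetical domains, random stand-in scores, and a simple ridge regressor (Poolside’s actual proxy runs, feature set, and regressor are not disclosed):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# ~60 proxy runs: each row is a data mix (proportions over 4 domains summing
# to 1), each score is that proxy model's aggregate eval result. Both are
# hypothetical stand-ins for Poolside's actual proxy-model measurements.
domains = ["code", "math", "stem", "commonsense"]
mixes = rng.dirichlet(np.ones(len(domains)), size=60)
scores = rng.normal(size=60)  # placeholder for measured eval scores

# Fit a surrogate mapping mix -> score. A quadratic ridge model is one simple
# choice; it captures basic interactions between dataset proportions.
features = np.hstack([mixes, mixes**2])
surrogate = Ridge(alpha=1.0).fit(features, scores)

# Search the simplex for the mix the surrogate predicts to be best, instead
# of paying for a full training run per candidate mix.
candidates = rng.dirichlet(np.ones(len(domains)), size=100_000)
preds = surrogate.predict(np.hstack([candidates, candidates**2]))
best = candidates[preds.argmax()]
print({d: round(p, 3) for d, p in zip(domains, best)})
```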
On the data side, both Laguna models were trained on over 30T tokens. Poolside’s diversity-preserving data curation approach – which keeps portions of medium- and low-quality collections alongside high-quality data to avoid a STEM bias – yields roughly 2x more tokens than precision-focused pipelines, with the gains persisting over longer training runs. A separate deduplication analysis also found that global deduplication disproportionately removes high-quality data, which informed how the team tuned its pipeline. Synthetic data contributes 13% of the final training mix for Laguna XS.2, with the Laguna series using roughly 4.4T+ synthetic tokens in total.
Muon optimizer. Instead of AdamW – the most common optimizer for training large models – Poolside used a distributed implementation of Muon during all training stages for both models. In early pre-training ablations, the research team reached approximately the same training loss as the AdamW baseline in 15% fewer steps, with significant absolute evaluation improvements on the final model and learning rates that transferred across model scales. An additional benefit: Muon requires only one optimizer state per parameter instead of two, reducing memory requirements for both training and checkpointing. During Laguna M.1 pre-training, optimizer overhead was less than 1% of the training step time.
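For reference, here is a compact sketch of a Muon-style update for a single 2D weight matrix, following the publicly known recipe (momentum, then Newton–Schulz orthogonalization, with coefficients from the public reference implementation); Poolside’s distributed variant is not public:

```python
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a matrix via an odd polynomial iteration,
    as in the public Muon reference implementation."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if transposed:
        X = X.T
    return X

@torch.no_grad()
def muon_step(param, grad, momentum_buf, lr=0.02, beta=0.95):
    """One Muon update for a 2D weight matrix.

    Note the single optimizer state (momentum_buf) per parameter, versus
    AdamW's two (exp_avg and exp_avg_sq) -- the source of the memory saving.
    """
    momentum_buf.mul_(beta).add_(grad)
    update = newton_schulz(momentum_buf)
    # Scale so the update RMS is roughly independent of the matrix shape.
    update = update * max(1.0, param.size(0) / param.size(1)) ** 0.5
    param.add_(update, alpha=-lr)
```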
Poolside also runs periodic hash checks on model weights across training replicas to detect silent data corruption (SDC) from faulty GPUs – specifically errors in arithmetic logic units and pipeline registers, which, unlike DRAM and SRAM, are not covered by ECC protection.
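A minimal illustration of the idea, assuming data-parallel replicas whose weights should be bit-identical after each synchronized step; the function names are hypothetical and this presumes an initialized torch.distributed process group:

```python
import hashlib
import torch
import torch.distributed as dist

def weights_digest(model: torch.nn.Module) -> str:
    """Deterministic byte-level digest of all parameters on this replica."""
    h = hashlib.sha256()
    for name, p in sorted(model.named_parameters()):
        h.update(name.encode())
        # Reinterpret raw bytes (works for bf16 too) so the hash is exact.
        h.update(p.detach().cpu().contiguous().view(torch.uint8).numpy().tobytes())
    return h.hexdigest()

def check_replicas(model: torch.nn.Module) -> None:
    """Compare digests across data-parallel replicas; any mismatch indicates
    silent corruption on some rank (e.g. a faulty ALU), since synchronized
    replicas must hold bit-identical weights."""
    digest = weights_digest(model)
    all_digests = [None] * dist.get_world_size()
    dist.all_gather_object(all_digests, digest)
    if len(set(all_digests)) != 1:
        raise RuntimeError(f"SDC detected: weight digests diverge: {all_digests}")
```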
Asynchronous agentic RL approach. This is arguably the most complex piece of the Laguna training stack. Poolside built a fully asynchronous online RL system in which actor processes pull tasks from a dataset, spin up sandboxed containers, and run the production agent binary against each task using the most recently deployed model. The resulting trajectories are logged, filtered, and written to Iceberg tables, while the trainer continuously consumes these records and produces the next checkpoint – inference and training run asynchronously in parallel, with throughput tuned to bound off-policy slack.
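A skeletal asyncio sketch of the actor/trainer decoupling with a staleness bound; the queue size, lag threshold, and helpers (run_agent_in_sandbox, train_step, broadcast_weights) are hypothetical placeholders, not Poolside’s implementation:

```python
import asyncio
import random

MAX_POLICY_LAG = 4          # assumed bound on off-policy slack, for illustration
traj_queue = asyncio.Queue(maxsize=256)
policy_version = 0

async def actor(actor_id: int, tasks: list) -> None:
    """Pull tasks, run the agent in a sandbox, emit tagged trajectories."""
    while True:
        task = random.choice(tasks)
        version_at_start = policy_version          # which policy produced this rollout
        trajectory = await run_agent_in_sandbox(task)
        await traj_queue.put((version_at_start, trajectory))

async def trainer() -> None:
    """Continuously consume trajectories and produce the next checkpoint."""
    global policy_version
    while True:
        batch = []
        while len(batch) < 32:
            version, traj = await traj_queue.get()
            if policy_version - version <= MAX_POLICY_LAG:  # drop overly stale rollouts
                batch.append(traj)
        await train_step(batch)       # hypothetical: one optimizer step on the batch
        policy_version += 1
        await broadcast_weights()     # hypothetical: ship new weights to actors

async def run_agent_in_sandbox(task: str) -> dict:
    # Stand-in for launching a container and running the agent binary on the task.
    await asyncio.sleep(random.random())
    return {"task": task, "events": []}

async def train_step(batch: list) -> None:
    await asyncio.sleep(0.1)          # placeholder for a real training step

async def broadcast_weights() -> None:
    await asyncio.sleep(0.05)         # placeholder for a real weight transfer
```

Because actors never block on the trainer (and vice versa), inference and training hardware both stay saturated; the staleness filter is one simple way to keep rollouts close enough to on-policy.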
Key takeaways
- Poolside launches its first open-weight model: Laguna XS.2 is a 33B total-parameter MoE with 3B active parameters per token, small enough to run locally on a Mac with 36GB of RAM via Ollama, with Laguna XS.2-Base coming soon.
- Strong benchmark performance at small scale: Laguna XS.2 scores 68.2% on SWE-bench Verified, 62.4% on SWE-bench Multilingual, 44.5% on SWE-bench Pro, and 30.1% on Terminal-Bench 2.0.
- Muon optimizer outperforms AdamW by ~15% in training efficiency: Poolside replaced AdamW with a distributed implementation of the Muon optimizer, reaching the same training loss in approximately 15% fewer steps, with lower memory requirements – just one optimizer state per parameter instead of two.
- AutoMixer replaces manual data mixing with learned optimization: instead of hand-crafted data recipes, Poolside trains a population of about 60 proxy models on different data mixes and fits surrogate regressors to optimize dataset proportions – with synthetic data making up about 13% of the final training mix for Laguna XS.2, out of 4.4T+ synthetic tokens used across the Laguna series.
- Fully asynchronous agentic RL with GPUDirect RDMA weight transfer: Poolside’s RL system runs inference and training in parallel, transferring hundreds of gigabytes of BF16 weights between nodes in under 5 seconds via GPUDirect RDMA, and uses a decoupled actor design and the CISPO algorithm to stabilize off-policy training.
Check out the model weights and technical details.