In this tutorial, we explore the lambda/hermes-agent-reasoning-traces dataset to understand how agentic models think, use tools, and create responses across multi-turn conversations. We start by loading and inspecting the dataset, examining its structure, categories, and conversation format to get a clear idea of the information available. We then build simple parsers to extract core components such as reasoning traces, tool calls, and tool responses, allowing us to separate internal reasoning from external actions. Next, we analyze patterns such as frequency of tool use, conversation length, and error rates to better understand agent behavior, and we create visualizations to highlight these trends and make analysis easier. Finally, we prepare the dataset for training by converting it to a model-friendly format, making it suitable for tasks such as supervised fine-tuning.
!pip -q install -U datasets pandas matplotlib seaborn transformers accelerate trl
import json, re, random, textwrap
from collections import Counter, defaultdict
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datasets import load_dataset, concatenate_datasets
random.seed(0)
CONFIG = "kimi"
ds = load_dataset("lambda/hermes-agent-reasoning-traces", CONFIG, split="train")
print(ds)
print("Config:", CONFIG, "| Fields:", ds.column_names)
print("Categories:", sorted(set(ds["category"])))
COMPARE_BOTH = False
if COMPARE_BOTH:
    ds_kimi = load_dataset("lambda/hermes-agent-reasoning-traces", "kimi", split="train")
    ds_glm = load_dataset("lambda/hermes-agent-reasoning-traces", "glm-5.1", split="train")
    ds_kimi = ds_kimi.add_column("source", ["kimi"] * len(ds_kimi))
    ds_glm = ds_glm.add_column("source", ["glm-5.1"] * len(ds_glm))
    ds = concatenate_datasets([ds_kimi, ds_glm]).shuffle(seed=0)
    print("Combined:", ds, "→ counts:", Counter(ds["source"]))
sample = ds[0]
print("\n=== Sample 0 ===")
print("id :", sample["id"])
print("category :", sample["category"], "/", sample["subcategory"])
print("task :", sample["task"])
print("turns :", len(sample["conversations"]))
print("system[0] :", sample["conversations"][0]["value"][:220], "...\n")
We install all the required libraries and import the necessary modules to set up our environment. We then load the lambda/hermes-agent-reasoning-traces dataset and examine its structure, fields, and categories. We also optionally combine multiple dataset configurations and inspect a sample to understand the conversation format.
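Before writing any parsers, it helps to confirm the turn structure. The quick check below is a minimal sketch that only assumes the from/value message layout printed above:
# Count message roles across the first 50 trajectories to see the turn structure.
# Assumes each message is a dict with "from" in {system, human, gpt, tool}.
role_counts = Counter()
for ex in ds.select(range(min(50, len(ds)))):
    for t in ex["conversations"]:
        role_counts[t["from"]] += 1
print("Role counts:", dict(role_counts))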
# Hermes-style tags wrap reasoning, tool calls, and tool outputs in the raw text.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)
TOOL_RESP_RE = re.compile(r"<tool_response>\s*(.*?)\s*</tool_response>", re.DOTALL)
def parse_assistant(value: str) -> dict:
    thoughts = [t.strip() for t in THINK_RE.findall(value)]
    calls = []
    for raw in TOOL_CALL_RE.findall(value):
        try:
            calls.append(json.loads(raw))
        except json.JSONDecodeError:
            calls.append({"name": "<unparseable>", "arguments": {}})
    final = TOOL_CALL_RE.sub("", THINK_RE.sub("", value)).strip()
    return {"thoughts": thoughts, "tool_calls": calls, "final": final}
def parse_tool(value: str):
    raw = TOOL_RESP_RE.search(value)
    if not raw: return {"raw": value}
    body = raw.group(1)
    try: return json.loads(body)
    except json.JSONDecodeError: return {"raw": body}
first_gpt = next(t for t in sample["conversations"] if t["from"] == "gpt")
p = parse_assistant(first_gpt["value"])
print("Thought preview :", (p["thoughts"][0][:160] + "...") if p["thoughts"] else "(none)")
print("Tool calls :", [(c.get("name"), list(c.get("arguments", {}).keys())) for c in p["tool_calls"]])
We define regex-based parsers to extract reasoning traces, tool calls, and tool responses from the dataset. We process assistant messages to separate thoughts, actions, and final outputs in an organized manner. We then test the parser on a sample conversation turn to ensure that the extraction works properly.
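To make the expected tag format explicit, we can also exercise the parser on a small hand-written assistant turn. The demo string below is hypothetical, assuming the Hermes-style <think> and <tool_call> tags the regexes target:
# Hypothetical assistant turn for a quick self-test of parse_assistant.
demo = (
    "<think>I should look up the weather first.</think>"
    '<tool_call>{"name": "get_weather", "arguments": {"city": "Paris"}}</tool_call>'
    "It is sunny in Paris."
)
out = parse_assistant(demo)
assert out["thoughts"] == ["I should look up the weather first."]
assert out["tool_calls"][0]["name"] == "get_weather"
assert out["final"] == "It is sunny in Paris."
print("Parser self-test passed:", out["tool_calls"])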

N = 3000
sub = ds.select(range(min(N, len(ds))))
tool_calls = Counter()
parallel_widths = Counter()
thoughts_per_turn = []
calls_per_traj = []
errors_per_traj = []
turns_per_traj = []
cat_counts = Counter()
for ex in sub:
    cat_counts[ex["category"]] += 1
    n_calls = n_err = 0
    turns_per_traj.append(len(ex["conversations"]))
    for t in ex["conversations"]:
        if t["from"] == "gpt":
            p = parse_assistant(t["value"])
            thoughts_per_turn.append(len(p["thoughts"]))
            if p["tool_calls"]:
                parallel_widths[len(p["tool_calls"])] += 1
                for c in p["tool_calls"]:
                    tool_calls[c.get("name", "")] += 1
                n_calls += len(p["tool_calls"])
        elif t["from"] == "tool":
            r = parse_tool(t["value"])
            blob = json.dumps(r).lower()
            if "error" in blob or '"exit_code": 1' in blob or "traceback" in blob:
                n_err += 1
    calls_per_traj.append(n_calls)
    errors_per_traj.append(n_err)
print(f"\nScanned {len(sub)} trajectories")
print(f"Avg turns/traj : {np.mean(turns_per_traj):.1f}")
print(f"Avg tool calls/traj : {np.mean(calls_per_traj):.1f}")
print(f"% with andgt;=1 error : {100*np.mean([eandgt;0 for e in errors_per_traj]):.1f}%")
print(f"% parallel turns : {100*sum(v for k,v in parallel_widths.items() if kandgt;1)/max(1,sum(parallel_widths.values())):.1f}%")
print("Top 10 tools :", tool_calls.most_common(10))
fig, axes = plt.subplots(2, 2, figsize=(13, 9))
top = tool_calls.most_common(15)
axes[0,0].barh([t for t,_ in top][::-1], [c for _,c in top][::-1], color="teal")
axes[0,0].set_title("Top 15 tools by call volume")
axes[0,0].set_xlabel("calls")
ks = sorted(parallel_widths)
axes[0,1].bar([str(k) for k in ks], [parallel_widths[k] for k in ks], color="coral")
axes[0,1].set_title("Tool-calls per assistant turn (parallel width)")
axes[0,1].set_xlabel(" tool calls in one turn"); axes[0,1].set_ylabel("count")
axes[0,1].set_yscale("log")
axes[1,0].hist(turns_per_traj, bins=40, color="steelblue")
axes[1,0].set_title("Conversation length"); axes[1,0].set_xlabel("turns")
cats, vals = zip(*cat_counts.most_common())
axes[1,1].pie(vals, labels=cats, autopct="%1.0f%%", startangle=90)
axes[1,1].set_title("Category distribution")
plt.tight_layout(); plt.show()
We perform dataset-level analyses to measure tool usage, conversation length, and error patterns. We aggregate statistics across thousands of trajectories to understand overall agent behavior, and we create visualizations to highlight trends such as tool popularity, parallel tool calls, and category distribution.
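As a complement to the plots, a small pandas groupby yields per-category averages. This is a sketch that reuses sub and the parser above; the column names are illustrative:
# Per-category averages: turns and tool calls per trajectory (illustrative columns).
rows = []
for ex in sub:
    n_calls = sum(len(parse_assistant(t["value"])["tool_calls"])
                  for t in ex["conversations"] if t["from"] == "gpt")
    rows.append({"category": ex["category"], "turns": len(ex["conversations"]), "tool_calls": n_calls})
df_stats = pd.DataFrame(rows)
print(df_stats.groupby("category")[["turns", "tool_calls"]].mean().round(1))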
def render_trace(ex, max_chars=350):
    print(f"\n{'='*72}\nTASK [{ex['category']} / {ex['subcategory']}]: {ex['task']}\n{'='*72}")
    for t in ex["conversations"]:
        role = t["from"]
        if role == "system":
            continue
        if role == "human":
            print(f"\n[USER]\n{textwrap.shorten(t['value'], 600)}")
        elif role == "gpt":
            p = parse_assistant(t["value"])
            for th in p["thoughts"]:
                print(f"\n[THINK]\n{textwrap.shorten(th, max_chars)}")
            for c in p["tool_calls"]:
                args = json.dumps(c.get("arguments", {}))[:200]
                print(f"[CALL] {c.get('name')}({args})")
            if p["final"]:
                print(f"\n[ANSWER]\n{textwrap.shorten(p['final'], max_chars)}")
        elif role == "tool":
            print(f"[TOOL_RESPONSE] {textwrap.shorten(t['value'], 220)}")
    print("="*72)
idx = int(np.argmin(np.abs(np.array(turns_per_traj) - 10)))
render_trace(sub[idx])
def get_tool_schemas(ex):
    try: return json.loads(ex["tools"])
    except (KeyError, TypeError, json.JSONDecodeError): return []
schemas = get_tool_schemas(sample)
print(f"\nSample 0 has {len(schemas)} tools available")
for s in schemas[:3]:
    fn = s.get("function", {})
    print(" -", fn.get("name"), "-", (fn.get("description") or "")[:80])
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant", "tool": "tool"}
def to_openai_messages(conv):
    return [{"role": ROLE_MAP[t["from"]], "content": t["value"]} for t in conv]
example_msgs = to_openai_messages(sample["conversations"])
print("\nFirst 2 OpenAI messages:")
for m in example_msgs[:2]:
    print(" ", m["role"], "→", m["content"][:120].replace("\n", " "), "...")
We build utilities to render complete conversation traces in a readable format for deeper examination. We also extract tool schemas and convert the dataset to an OpenAI-style message format for compatibility with training pipelines. This helps us understand the structure of the available tools and standardize conversations.
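One useful cross-check is whether every tool the assistant calls is actually declared in the trajectory's schema list. Below is a minimal sketch, assuming the OpenAI-style schema layout shown above:
# Cross-check: tool names called by the assistant vs. tools declared in the schema.
declared = {s.get("function", {}).get("name") for s in schemas}
used = set()
for t in sample["conversations"]:
    if t["from"] == "gpt":
        for c in parse_assistant(t["value"])["tool_calls"]:
            used.add(c.get("name"))
print("Called but not declared:", (used - declared) or "none")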
from transformers import AutoTokenizer
TOK_ID = "Qwen/Qwen2.5-0.5B-Instruct"
tok = AutoTokenizer.from_pretrained(TOK_ID)
def build_masked(conv, tokenizer, max_len=2048):
    msgs = to_openai_messages(conv)
    for m in msgs:
        if m["role"] == "tool":
            m["role"] = "user"
            m["content"] = "[TOOL OUTPUT]\n" + m["content"]
    input_ids, labels = [], []
    for m in msgs:
        text = tokenizer.apply_chat_template([m], tokenize=False, add_generation_prompt=False)
        ids = tokenizer.encode(text, add_special_tokens=False)
        input_ids.extend(ids)
        labels.extend(ids if m["role"] == "assistant" else [-100] * len(ids))
    return input_ids[:max_len], labels[:max_len]
ids, lbls = build_masked(sample["conversations"], tok)
trainable = sum(1 for x in lbls if x != -100)
print(f"\nTokenized example: {len(ids)} tokens, {trainable} trainable ({100*trainable/len(ids):.1f}%)")
think_lens, call_lens, ans_lens = [], [], []
for ex in sub.select(range(min(500, len(sub)))):
    for t in ex["conversations"]:
        if t["from"] != "gpt": continue
        p = parse_assistant(t["value"])
        for th in p["thoughts"]: think_lens.append(len(th))
        for c in p["tool_calls"]: call_lens.append(len(json.dumps(c)))
        if p["final"]: ans_lens.append(len(p["final"]))
plt.figure(figsize=(10,4))
plt.hist([think_lens, call_lens, ans_lens], bins=40, log=True,
         label=["<think>", "<tool_call>", "final answer"], stacked=False)
plt.legend(); plt.xlabel("characters"); plt.title("Length distributions (log y)")
plt.tight_layout(); plt.show()
class TraceReplayer:
    def __init__(self, ex):
        self.ex = ex
        self.steps = []
        pending = None
        for t in ex["conversations"]:
            if t["from"] == "gpt":
                if pending: self.steps.append(pending)
                pending = {"think": parse_assistant(t["value"]), "responses": []}
            elif t["from"] == "tool" and pending:
                pending["responses"].append(parse_tool(t["value"]))
        if pending: self.steps.append(pending)
    def __len__(self): return len(self.steps)
    def play(self, i):
        s = self.steps[i]
        print(f"\n── Step {i+1}/{len(self)} ──")
        for th in s["think"]["thoughts"]:
            print(f"💭 {textwrap.shorten(th, 280)}")
        for c in s["think"]["tool_calls"]:
            print(f"⚙️ {c.get('name')}({json.dumps(c.get('arguments', {}))[:140]})")
        for r in s["responses"]:
            print(f"📥 {textwrap.shorten(json.dumps(r), 200)}")
        if s["think"]["final"]:
            print(f"💬 {textwrap.shorten(s['think']['final'], 200)}")
rp = TraceReplayer(sample)
for i in range(min(3, len(rp))):
    rp.play(i)
TRAIN = False
if TRAIN:
    import torch
    from transformers import AutoModelForCausalLM
    from trl import SFTTrainer, SFTConfig
    train_subset = ds.select(range(200))
    def to_text(batch):
        msgs = to_openai_messages(batch["conversations"])
        for m in msgs:
            if m["role"] == "tool":
                m["role"] = "user"; m["content"] = "[TOOL]\n" + m["content"]
        batch["text"] = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=False)
        return batch
    train_subset = train_subset.map(to_text)
    model = AutoModelForCausalLM.from_pretrained(
        TOK_ID,
        torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
        device_map="auto" if torch.cuda.is_available() else None,
    )
    cfg = SFTConfig(
        output_dir="hermes-sft-demo",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        max_steps=20,
        learning_rate=2e-5,
        logging_steps=2,
        max_seq_length=1024,
        dataset_text_field="text",
        report_to="none",
        fp16=torch.cuda.is_available(),
    )
    SFTTrainer(model=model, args=cfg, train_dataset=train_subset, processing_class=tok).train()
    print("Fine-tune demo finished.")
print("\n✅ Tutorial complete. You now have parsers, analytics, plots, a replayer, "
"tokenized + label-masked SFT examples, and an optional training hook.")
We tokenize conversations and apply label masking so that only assistant responses contribute to the training loss. We analyze the length distributions of thoughts, tool calls, and final answers to gain further insight. We also implement a trace replayer to step through the agent's behavior and optionally run a small supervised fine-tuning loop.
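If you prefer to train outside this notebook, the converted messages can also be exported to JSONL. The snippet below is a hypothetical export step; the file name and the 100-example cap are illustrative:
# Hypothetical export: one {"messages": [...]} record per line, usable by most SFT stacks.
with open("hermes_traces.jsonl", "w") as f:
    for ex in ds.select(range(min(100, len(ds)))):
        f.write(json.dumps({"messages": to_openai_messages(ex["conversations"])}) + "\n")
print("Wrote hermes_traces.jsonl")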
In conclusion, we have developed a structured workflow to analyze and work with agent reasoning traces effectively. We broke conversations down into meaningful components, examined how agents think step by step, and measured how they interact with tools while solving problems. Using visualizations and analytics, we gained insight into common patterns and behaviors across the dataset. In addition, we transformed the data into a format suitable for training language models, including tokenization and label masking for assistant responses. This process provides a solid foundation for studying, evaluating, and improving tool-using AI systems in a practical and scalable way.