How to Use AgentTrove: Streaming 1.7M Agentic Traces and Building a Clean ShareGPT SFT Dataset in Python

In this tutorial, we explore AgentTrove, one of the largest open-source collections of agentic interaction traces, and learn how to work with it efficiently. Instead of downloading the full dataset, we use streaming to inspect rows, detect the conversation schema, normalize agent turns, and understand how user, assistant, system, and tool messages are structured. We also build utilities to parse command-style assistant outputs, render complete trajectories in a readable format, and study how agents interact with tools across different tasks. Also, we create a lightweight analytical workflow that samples thousands of traces, converts them into a DataFrame, summarizes turn-level statistics, visualizes important dataset patterns, and exports successful traces into a clean ShareGPT-style JSONL format for supervised fine-tuning.

!pip -q install “datasets>=2.19” pandas matplotlib pyarrow huggingface_hub
import itertools, json, collections, textwrap, re, random, statistics
import pandas as pd
import matplotlib.pyplot as plt
from datasets import load_dataset
REPO = “open-thoughts/AgentTrove”
random.seed(0)
print(” Imports ready. Target dataset:”, REPO)
ds = load_dataset(REPO, split=”train”, streaming=True)
print(” Streaming dataset opened.”)
first = next(iter(ds))
print(“n Columns present in a row:”)
for k in first.keys():
v = first[k]
t = type(v).__name__
preview = (str(v)[:70] + “…”) if v is not None and len(str(v)) > 70 else v
print(f” • {k:<18} ({t}): {preview}”)

We install the required libraries and import the core tools needed for streaming, analysis, and visualization. We define the AgentTrove repository, open the dataset in streaming mode, and avoid downloading the full dataset locally. We then inspect the first row to understand the available columns and get an initial view of the dataset schema.

def find_trace_key(row):
for cand in (“conversations”, “messages”):
if cand in row and isinstance(row[cand], list):
return cand
for k, v in row.items():
if isinstance(v, list) and v and isinstance(v[0], dict) and
(“content” in v[0] or “role” in v[0] or “value” in v[0]):
return k
raise KeyError(“No conversation-like column found.”)
TRACE_KEY = find_trace_key(first)
print(f”n Trace column detected: ‘{TRACE_KEY}'”)
def normalize_turns(trace):
turns = []
for turn in trace:
if not isinstance(turn, dict):
turns.append((“unknown”, str(turn)))
continue
role = turn.get(“role”) or turn.get(“from”) or “unknown”
content = turn.get(“content”)
if content is None:
content = turn.get(“value”, “”)
turns.append((str(role), “” if content is None else str(content)))
return turns
sample_turns = normalize_turns(first[TRACE_KEY])
print(f” First trace has {len(sample_turns)} turns. “
f”Roles: {collections.Counter(r for r, _ in sample_turns)}”)

We create a defensive function to automatically detect the column that contains the conversation or trace data. We then normalize each turn into a consistent role-content format so that different dataset schemas can be handled smoothly. We also inspect the first trajectory to count the number of turns and understand the roles present in the conversation.

def extract_commands(assistant_content):
“””Best-effort: pull shell commands out of an assistant JSON turn.”””
cmds = []
txt = re.sub(r”“`(?:json)?|“`”, “”, assistant_content).strip()
try:
obj = json.loads(txt)
except Exception:
return cmds
def walk(o):
if isinstance(o, dict):
for key in (“commands”, “command”, “keystrokes”, “cmd”, “action”):
if key in o:
val = o[key]
if isinstance(val, str):
cmds.append(val.strip())
elif isinstance(val, list):
for item in val:
if isinstance(item, str):
cmds.append(item.strip())
elif isinstance(item, dict):
walk(item)
for v in o.values():
if isinstance(v, (dict, list)):
walk(v)
elif isinstance(o, list):
for v in o:
walk(v)
walk(obj)
return [c for c in cmds if c]

We define a command-extraction utility that reads assistant responses and attempts to parse shell commands from JSON-style outputs. We clean possible code fences, load the content as JSON, and recursively search through common command-related fields. This helps us identify tool-like actions inside agent trajectories and measure how often agents issue executable commands.

def render_trace(row, max_chars=600):
meta = {k: row.get(k) for k in
(“original_source”, “original_teacher”, “model”, “task”,
“result”, “reward”, “model_provider”) if k in row}
print(“=” * 78)
print(” METADATA:”, {k: v for k, v in meta.items() if v is not None})
print(“=” * 78)
for i, (role, content) in enumerate(normalize_turns(row[TRACE_KEY])):
tag = {“system”: ” SYSTEM”, “user”: ” USER”,
“assistant”: ” ASSISTANT”, “tool”: ” TOOL”}.get(role, f” {role.upper()}”)
snippet = content if len(content) <= max_chars else content[:max_chars] + ” …[truncated]”
print(f”n[{i}] {tag}”)
print(textwrap.indent(snippet, ” “))
if role == “assistant”:
for c in extract_commands(content)[:5]:
print(f” └─ parsed command: {c!r}”)
print(“=” * 78, “n”)
print(“n EXAMPLE TRAJECTORY (first row):”)
render_trace(first, max_chars=400)

We build a trace-rendering function that prints the metadata and the full conversation trajectory in a readable format. We label each turn by role, truncate long messages for clarity, and show parsed commands under assistant messages.

N = 2000
records = []
print(f”n Streaming {N} rows for analysis…”)
for row in itertools.islice(load_dataset(REPO, split=”train”, streaming=True), N):
turns = normalize_turns(row[TRACE_KEY])
roles = collections.Counter(r for r, _ in turns)
total_chars = sum(len(c) for _, c in turns)
asst_cmds = sum(len(extract_commands(c)) for r, c in turns if r == “assistant”)
records.append({
“original_source”: row.get(“original_source”),
“original_teacher”: row.get(“original_teacher”),
“model”: row.get(“model”),
“model_provider”: row.get(“model_provider”),
“result”: row.get(“result”),
“reward”: row.get(“reward”),
“n_turns”: len(turns),
“n_user”: roles.get(“user”, 0),
“n_assistant”: roles.get(“assistant”, 0),
“n_tool”: roles.get(“tool”, 0),
“total_chars”: total_chars,
“n_commands”: asst_cmds,
})
df = pd.DataFrame(records)
print(f” Built DataFrame: {df.shape[0]} rows × {df.shape[1]} cols”)
print(“n Numeric summary (turns / length / commands):”)
print(df[[“n_turns”, “n_assistant”, “n_tool”, “total_chars”, “n_commands”]]
.describe().round(1).to_string())
def show_dist(col, top=15):
if col in df and df[col].notna().any():
print(f”n Top values for ‘{col}’:”)
print(df[col].value_counts(dropna=True).head(top).to_string())
else:
print(f”n ‘{col}’ is empty/absent in this sample.”)
for c in (“original_source”, “original_teacher”, “model”, “model_provider”, “result”):
show_dist(c)

We stream a sample of rows from AgentTrove and collect useful statistics, including turn counts, tool usage, total characters, and parsed command counts. We store these lightweight features in a pandas DataFrame to make the dataset easier to summarize and analyze. We also print distribution tables for fields such as source, teacher model, model provider, and result to understand where the traces originate.

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
src = df[“original_source”].value_counts().head(10)
axes[0, 0].barh(src.index[::-1], src.values[::-1], color=”#4C72B0″)
axes[0, 0].set_title(“Top 10 Task Sources”); axes[0, 0].set_xlabel(“traces”)
tch = df[“original_teacher”].value_counts().head(10)
axes[0, 1].barh(tch.index[::-1], tch.values[::-1], color=”#55A868″)
axes[0, 1].set_title(“Teacher Models”); axes[0, 1].set_xlabel(“traces”)
axes[1, 0].hist(df[“n_turns”].clip(upper=df[“n_turns”].quantile(0.97)),
bins=30, color=”#C44E52″, edgecolor=”white”)
axes[1, 0].set_title(“Turns per Trajectory (97th-pct clipped)”)
axes[1, 0].set_xlabel(“turns”); axes[1, 0].set_ylabel(“count”)
axes[1, 1].scatter(df[“n_assistant”], df[“n_commands”], alpha=0.3, s=12, color=”#8172B2″)
axes[1, 1].set_title(“Assistant Turns vs. Parsed Commands”)
axes[1, 1].set_xlabel(“assistant turns”); axes[1, 1].set_ylabel(“shell commands extracted”)
plt.tight_layout(); plt.show()

We create four visualizations to explore the sampled traces from different angles. We plot the top task sources, teacher models, turn-count distribution, and the relationship between assistant turns and parsed commands. These charts help us quickly identify patterns in the dataset and understand how agent behavior varies across sources and tasks.

def is_success(row):
res = (row.get(“result”) or “”).lower()
if res in (“resolved”, “success”, “pass”, “passed”, “correct”):
return True
rw = row.get(“reward”)
try:
return float(rw) >= 1.0
except (TypeError, ValueError):
return False
out_path = “agenttrove_clean_sft.jsonl”
kept, scanned, SCAN, KEEP = 0, 0, 1500, 200
print(f”n Scanning up to {SCAN} rows, keeping up to {KEEP} successful traces…”)
with open(out_path, “w”) as f:
for row in itertools.islice(load_dataset(REPO, split=”train”, streaming=True), SCAN):
scanned += 1
if not is_success(row):
continue
turns = normalize_turns(row[TRACE_KEY])
conv = [{“from”: r, “value”: c} for r, c in turns if c.strip()]
if len(conv) < 2:
continue
f.write(json.dumps({
“conversations”: conv,
“source”: row.get(“original_source”),
“teacher”: row.get(“original_teacher”),
}) + “n”)
kept += 1
if kept >= KEEP:
break
print(f” Scanned {scanned} rows → wrote {kept} clean traces to ‘{out_path}'”)
def search_traces(keyword=None, source=None, limit=3, scan=3000):
“””Stream the dataset and yield-print traces matching filters.”””
hits = 0
for row in itertools.islice(load_dataset(REPO, split=”train”, streaming=True), scan):
if source and row.get(“original_source”) != source:
continue
if keyword:
blob = ” “.join(c for _, c in normalize_turns(row[TRACE_KEY]))
if keyword.lower() not in blob.lower():
continue
render_trace(row, max_chars=300)
hits += 1
if hits >= limit:
break
if hits == 0:
print(“No matches in the scanned window — try increasing `scan`.”)
print(“n Searching for ‘nl2bash’ source traces:”)
search_traces(source=”nl2bash”, limit=2, scan=4000)
print(“n Tutorial complete! Next ideas:”)
print(” • Increase N / SCAN for bigger analyses.”)
print(” • Filter by original_source (swesmith, codeforces, r2egym…) for a domain SFT set.”)
print(” • Feed agenttrove_clean_sft.jsonl into Axolotl / LLaMA-Factory for fine-tuning.”)

We define a success filter that retains traces marked as resolved, passed, correct, or positively rewarded. We then export successful trajectories into a clean ShareGPT-style JSONL file for downstream fine-tuning workflows. Also, we add a search utility to find traces by keyword or source, making the dataset easier to explore for specific agentic tasks.

In conclusion, we built a complete, hands-on pipeline to inspect, analyze, filter, and export data from AgentTrove in a Colab-friendly way. We started with streaming access, then progressively added schema detection, turn normalization, command extraction, trajectory rendering, statistical analysis, visualization, success-based filtering, and keyword or source-based search. This workflow helps us understand the internal structure of agentic traces and gives us a reusable foundation for preparing high-quality subsets for fine-tuning or evaluation. We also keep the process scalable by avoiding full dataset downloads and using streamed samples only when needed. Also, we demonstrated how AgentTrove can be used as more than a static dataset: we treated it as a rich source of agent behavior, tool usage, task outcomes, and training-ready conversations that can support future experiments in agent learning, workflow analysis, and domain-specific SFT dataset creation.

Check out the Full Codes with NotebookAlso, feel free to follow us on Twitter and don’t forget to join our 150k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.

Need to partner with us for promoting your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar etc.? Connect with us

The post How to Use AgentTrove: Streaming 1.7M Agentic Traces and Building a Clean ShareGPT SFT Dataset in Python appeared first on MarkTechPost.

By

Leave a Reply

Your email address will not be published. Required fields are marked *