A Coding Guide on LLM Post Training with TRL from Supervised Fine Tuning to DPO and GRPO Reasoning

In this tutorial, we walk through a complete, hands-on journey of post-training large language models using the powerful TRL (Transformer Reinforcement Learning) library ecosystem. We start from a lightweight base model and progressively apply four key techniques: Supervised Fine-Tuning (SFT), Reward Modeling (RM), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO). Also, we leverage efficient methods like LoRA to make training feasible even on limited hardware, such as Google Colab’s T4 GPU. As we move step by step, we build intuition for how modern alignment pipelines work, from teaching models how to respond to shaping their behavior using preferences and verifiable rewards.

import subprocess, sys
subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "-U",
    "torchao>=0.16",
    "trl>=0.20",
    "transformers>=4.45",
    "datasets",
    "peft>=0.13",
    "accelerate",
    "bitsandbytes",
])

import sys as _sys
for _m in [m for m in list(_sys.modules) if m.startswith(("torchao", "peft"))]:
    _sys.modules.pop(_m, None)
try:
    import torchao
except Exception:
    import types
    _fake = types.ModuleType("torchao")
    _fake.__version__ = "0.16.1"
    _sys.modules["torchao"] = _fake

import os, re, gc, torch, warnings
warnings.filterwarnings("ignore")
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["WANDB_DISABLED"] = "true"
os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = "1"

from datasets import load_dataset, Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import LoraConfig

print(f"torch={torch.__version__} cuda={torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)} "
          f"({torch.cuda.get_device_properties(0).total_memory/1e9:.1f} GB)")

MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
BF16_OK = torch.cuda.is_available() and torch.cuda.is_bf16_supported()

LORA_CFG = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05, bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

def cleanup():
    """Release VRAM between training stages (Colab T4 is tight)."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

def chat_generate(model, tokenizer, prompt, max_new_tokens=120):
    """Helper: format as chat, generate, decode just the assistant turn."""
    msgs = [{"role": "user", "content": prompt}]
    ids = tokenizer.apply_chat_template(
        msgs, return_tensors="pt", add_generation_prompt=True
    ).to(model.device)
    with torch.no_grad():
        out = model.generate(
            ids, max_new_tokens=max_new_tokens,
            do_sample=True, temperature=0.7, top_p=0.9,
            pad_token_id=tokenizer.eos_token_id,
        )
    return tokenizer.decode(out[0][ids.shape[-1]:], skip_special_tokens=True)

We install and configure the full training stack, ensuring compatibility across libraries like TRL (Transformer Reinforcement Learning library), Transformers, and PEFT. We set up environment variables and GPU checks, and define reusable components such as LoRA configuration and helper functions. We also prepare utility functions for memory cleanup and chat-style generation to support all later stages.
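As an optional sanity check (not part of the pipeline itself), we can point the same helper at the untouched base model, which gives us a before/after reference for the stages that follow:

# Optional: sample the untrained base model once as a baseline for later stages.
base_tok = AutoTokenizer.from_pretrained(MODEL_NAME)
base_model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).to(DEVICE)

print("[Base model]", chat_generate(base_model, base_tok,
      "Explain the bias-variance tradeoff in two sentences."))

del base_model, base_tok; cleanup()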

print("\n" + "="*72 + "\nPART 1 — Supervised Fine-Tuning (SFT)\n" + "="*72)

from trl import SFTTrainer, SFTConfig

sft_ds = load_dataset("trl-lib/Capybara", split="train[:300]")
print(f"SFT dataset rows: {len(sft_ds)}")
print(f"Example messages: {sft_ds[0]['messages'][:1]}")

sft_args = SFTConfig(
    output_dir="./sft_out",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    logging_steps=10,
    save_strategy="no",
    bf16=BF16_OK, fp16=not BF16_OK,
    max_length=768,
    gradient_checkpointing=True,
    report_to="none",
)

sft_trainer = SFTTrainer(
    model=MODEL_NAME,
    args=sft_args,
    train_dataset=sft_ds,
    peft_config=LORA_CFG,
)
sft_trainer.train()

print("\n[SFT inference]")
print("Q: Explain the bias-variance tradeoff in two sentences.")
print("A:", chat_generate(sft_trainer.model, sft_trainer.processing_class,
                          "Explain the bias-variance tradeoff in two sentences."))

sft_trainer.save_model("./sft_out/final")
del sft_trainer; cleanup()

We begin with supervised fine-tuning, loading a conversational dataset and configuring the SFT trainer. We train the model to imitate high-quality responses, using LoRA for efficient adaptation on limited hardware. We then validate the model's behavior through inference to confirm it follows instruction-style outputs.
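Because we trained with LoRA, save_model above typically writes just the adapter weights to ./sft_out/final. As a rough sketch (assuming that adapter layout), the fine-tuned model can be reloaded later by attaching the adapter to a fresh copy of the base model:

from peft import PeftModel

# Reload the base model and attach the LoRA adapter saved to ./sft_out/final.
reload_tok = AutoTokenizer.from_pretrained(MODEL_NAME)
reload_base = AutoModelForCausalLM.from_pretrained(MODEL_NAME).to(DEVICE)
reload_model = PeftModel.from_pretrained(reload_base, "./sft_out/final")

print(chat_generate(reload_model, reload_tok,
                    "Give one sentence on why regularization helps generalization."))

del reload_model, reload_base, reload_tok; cleanup()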

print("\n" + "="*72 + "\nPART 2 — Reward Modeling\n" + "="*72)

from trl import RewardTrainer, RewardConfig

rm_ds = load_dataset("trl-lib/ultrafeedback_binarized", split="train[:300]")
print(f"RM dataset rows: {len(rm_ds)} keys: {list(rm_ds[0].keys())}")

rm_args = RewardConfig(
    output_dir="./rm_out",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    learning_rate=1e-4,
    logging_steps=10,
    save_strategy="no",
    bf16=BF16_OK, fp16=not BF16_OK,
    max_length=512,
    gradient_checkpointing=True,
    report_to="none",
)

rm_lora = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05, bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="SEQ_CLS",
)

rm_trainer = RewardTrainer(
    model=MODEL_NAME,
    args=rm_args,
    train_dataset=rm_ds,
    peft_config=rm_lora,
)
rm_trainer.train()
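# (Added sketch) Before freeing the trainer, we can peek at the learned reward head:
# the sequence-classification model emits one scalar logit per sequence, and a usable
# reward model should rank a sensible answer above a sloppy one. We assume
# `processing_class` holds the tokenizer, as it does for the other trainers, and note
# that training used templated conversations, so raw-text scores are only indicative.
rm_model = rm_trainer.model.eval()
rm_tok = rm_trainer.processing_class

def reward_score(prompt, response):
    """Return the scalar reward assigned to a single prompt/response pair."""
    ids = rm_tok(prompt + "\n" + response, return_tensors="pt",
                 truncation=True, max_length=512).to(rm_model.device)
    with torch.no_grad():
        return rm_model(**ids).logits[0, 0].item()

print("chosen  :", reward_score("What is 2 + 2?", "2 + 2 = 4."))
print("rejected:", reward_score("What is 2 + 2?", "I don't feel like answering."))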
del rm_trainer; cleanup()

We move to reward modeling, where we train a model to score responses based on human preference data. We configure a sequence classification setup and train using chosen vs rejected pairs. This stage helps us learn a reward signal that can guide alignment in later methods.

print("\n" + "="*72 + "\nPART 3 — Direct Preference Optimization (DPO)\n" + "="*72)

from trl import DPOTrainer, DPOConfig

dpo_ds = load_dataset("trl-lib/ultrafeedback_binarized", split="train[:300]")

dpo_args = DPOConfig(
    output_dir="./dpo_out",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=5e-6,
    logging_steps=10,
    save_strategy="no",
    bf16=BF16_OK, fp16=not BF16_OK,
    max_length=512,
    max_prompt_length=256,
    beta=0.1,
    gradient_checkpointing=True,
    report_to="none",
)

dpo_trainer = DPOTrainer(
    model=MODEL_NAME,
    args=dpo_args,
    train_dataset=dpo_ds,
    peft_config=LORA_CFG,
)
dpo_trainer.train()
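# (Added check) A quick qualitative look at the preference-tuned model before we
# free it, mirroring the SFT inference step above.
print("\n[DPO inference]")
print("A:", chat_generate(dpo_trainer.model, dpo_trainer.processing_class,
                          "Explain the bias-variance tradeoff in two sentences."))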
del dpo_trainer; cleanup()

We implement Direct Preference Optimization to directly optimize the model using preference data without needing a separate reward model. We configure a low learning rate and control divergence using the beta parameter. We train the model to efficiently align its outputs with preferred responses.
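To make the role of beta concrete, here is a tiny, self-contained illustration (made-up log-probabilities, not values from the run above) of the DPO loss on a single preference pair: the loss is the negative log-sigmoid of beta times the gap between the policy-vs-reference log-ratio of the chosen response and that of the rejected one, so a larger beta turns small preference margins into stronger updates while keeping the policy tied to the reference model.

import torch.nn.functional as F

# Illustrative log-probabilities for one preference pair (policy vs. frozen reference).
logp_chosen_policy, logp_chosen_ref = -12.0, -14.0       # policy favors the chosen answer
logp_rejected_policy, logp_rejected_ref = -20.0, -18.0   # and disfavors the rejected one

margin = (logp_chosen_policy - logp_chosen_ref) - (logp_rejected_policy - logp_rejected_ref)
for beta in (0.05, 0.1, 0.5):
    loss = -F.logsigmoid(torch.tensor(beta * margin))
    print(f"beta={beta}: implicit reward margin={margin:.1f}, DPO loss={loss.item():.4f}")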

print("\n" + "="*72 + "\nPART 4 — GRPO with verifiable math rewards\n" + "="*72)

from trl import GRPOTrainer, GRPOConfig
import random

random.seed(0)
def make_math_problem():
    a, b = random.randint(1, 50), random.randint(1, 50)
    op = random.choice(["+", "-", "*"])
    expr = f"{a} {op} {b}"
    return {
        "prompt": f"Solve this and end your reply with only the final number. {expr} =",
        "answer": str(eval(expr)),
    }

grpo_ds = Dataset.from_list([make_math_problem() for _ in range(200)])
print(f"GRPO dataset rows: {len(grpo_ds)}")
print(f"Example: {grpo_ds[0]}")

def correctness_reward(completions, **kwargs):
    """+1 if the last number in the completion matches the gold answer."""
    answers = kwargs["answer"]
    rewards = []
    for c, gold in zip(completions, answers):
        nums = re.findall(r"-?\d+", c)
        rewards.append(1.0 if nums and nums[-1] == gold else 0.0)
    return rewards

def brevity_reward(completions, **kwargs):
    """Small bonus for short answers — discourages rambling."""
    return [max(0.0, 1.0 - len(c) / 200) * 0.2 for c in completions]

grpo_args = GRPOConfig(
    output_dir="./grpo_out",
    learning_rate=1e-5,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    num_generations=4,
    max_prompt_length=128,
    max_completion_length=96,
    logging_steps=2,
    save_strategy="no",
    bf16=BF16_OK, fp16=not BF16_OK,
    gradient_checkpointing=True,
    max_steps=15,
    report_to="none",
)

grpo_trainer = GRPOTrainer(
    model=MODEL_NAME,
    args=grpo_args,
    train_dataset=grpo_ds,
    reward_funcs=[correctness_reward, brevity_reward],
    peft_config=LORA_CFG,
)
grpo_trainer.train()

print("\n[GRPO inference]")
for q in ["What is 17 + 28?", "What is 9 * 7?", "What is 100 - 47?"]:
    a = chat_generate(grpo_trainer.model, grpo_trainer.processing_class, q, 60)
    print(f"Q: {q}\nA: {a}\n")

del grpo_trainer; cleanup()

print("\n✓ Tutorial complete — you've trained 4 post-training algorithms!")

We apply GRPO by generating multiple responses per prompt and evaluating them using custom reward functions. We design deterministic rewards for correctness and brevity, allowing the model to learn from verifiable signals. We finally test the model on arithmetic queries to observe improved reasoning behavior.
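The "group relative" part of GRPO is easy to see in isolation: the rewards of the num_generations completions sampled for the same prompt are standardized against each other, so a completion is reinforced only if it beats its siblings. A minimal sketch of that normalization with made-up reward values (illustrative only, not TRL's internals verbatim):

# Rewards for four sampled completions of one prompt (correctness + brevity, made up).
group_rewards = torch.tensor([1.15, 0.10, 1.05, 0.12])

# Group-relative advantages: standardize rewards within the group.
advantages = (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-4)
print(advantages)  # correct, concise completions get positive advantage; the rest negative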

In conclusion, we implemented and understood four major post-training paradigms that define today's LLM alignment workflows. We saw how each method builds on the previous one, starting with structured learning in SFT, moving to preference understanding in RM, simplifying optimization with DPO, and finally scaling reasoning with GRPO. Along the way, we demonstrated that advanced training techniques are not restricted to massive infrastructure; they can be prototyped efficiently with the right tools and abstractions. This gives us a strong foundation for further experimentation: customizing reward functions, scaling models, and designing our own aligned AI systems.
