How to Build Smarter Multilingual Text Wrapping with BudouX Through Parsing, HTML Rendering, Model Introspection, and Toy Training

In this tutorial, we explore how to use BudouX to bring intelligent, phrase-aware line breaking to languages where whitespace does not naturally mark word boundaries, such as Japanese, Chinese, and Thai. We begin by setting up the library and working with its default parsers to understand how raw text is segmented into meaningful chunks. We then move into HTML transformation, where we visually see how BudouX improves readability in constrained layouts by inserting invisible breakpoints. As we progress, we dive deeper into the underlying model, inspecting its learned features and weights to understand how decisions are made. We also experiment with custom model manipulation, integrate BudouX into practical workflows like line wrapping and JSON-based pipelines, and evaluate its performance. Finally, we build a minimal end-to-end training pipeline to gain intuition about how such lightweight ML models are constructed.

import subprocess, sys

def pip(*pkgs):
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", *pkgs])

pip("budoux")

import json, time, textwrap, html, random, re, os, tempfile
from pathlib import Path
import budoux
from IPython.display import HTML, display, Markdown

print(f"BudouX version: {budoux.__version__ if hasattr(budoux, '__version__') else 'installed'}")

def header(title):
    display(Markdown(f"## {title}"))

header("1⃣ Default parsers — Japanese / Chinese (Simplified & Traditional) / Thai")

samples = {
    "Japanese (ja)": ("今日は天気です。BudouXは機械学習を用いた改行整形ツールです。",
                      budoux.load_default_japanese_parser()),
    "Simplified Chinese": ("今天是晴天。BudouX 是一个使用机器学习的换行整理工具。",
                           budoux.load_default_simplified_chinese_parser()),
    "Traditional Chinese": ("今天是晴天。BudouX 是一個使用機器學習的換行整理工具。",
                            budoux.load_default_traditional_chinese_parser()),
    "Thai (th)": ("วันนี้อากาศดีมากและฉันอยากออกไปเดินเล่นที่สวนสาธารณะ",
                  budoux.load_default_thai_parser()),
}

for name, (text, parser) in samples.items():
    chunks = parser.parse(text)
    print(f"\n• {name}")
    print(f"  raw   : {text}")
    print(f"  parsed: {' | '.join(chunks)} ({len(chunks)} phrases)")

We install BudouX and set up all required imports to begin working with the library. We load default parsers for multiple languages and pass sample sentences through them to observe how the text is segmented into meaningful phrases. This helps us understand the core functionality of BudouX and how it handles different linguistic structures out of the box.
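Since `parse()` returns a plain list of Python strings, downstream handling is ordinary list processing. Joining phrases with U+200B (the zero-width space) is effectively what the HTML helper in the next section does; the phrase list below is hard-coded so this small sketch runs without budoux installed.

```python
# Joining BudouX-style phrases with a zero-width space: the text renders
# identically, but the browser gains a legal break opportunity between
# phrases. Phrases are hard-coded here for illustration.
phrases = ["今日は", "天気です。"]
wrapped = "\u200b".join(phrases)
print(repr(wrapped))
print(len(wrapped))  # 9: eight visible characters plus one invisible break
```

Stripping the marker back out with `wrapped.replace("\u200b", "")` recovers the original sentence unchanged, which is why this technique is safe for copy-paste and search.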

header("2⃣ HTML translation with `translate_html_string`")

ja_parser = budoux.load_default_japanese_parser()
html_in = "今日は<b>とても天気</b>です。"
html_out = ja_parser.translate_html_string(html_in)
visible = html_out.replace("\u200b", "·")
print("Input HTML  :", html_in)
print("Output HTML :", html_out)
print("Visualised  :", visible)

demo_text = ("BudouXは機械学習を用いて、CJK言語の文章を意味のある"
             "フレーズに分割し、自然な位置で改行できるようにします。")
demo_html = ja_parser.translate_html_string(demo_text)
display(HTML(f"""
<div style="display:flex; gap:16px; font-family:'Hiragino Sans',sans-serif;">
  <div style="width:140px; border:2px solid #c33; padding:8px;">
    <b style="color:#c33;">Plain</b><br>{demo_text}
  </div>
  <div style="width:140px; border:2px solid #2a8; padding:8px;">
    <b style="color:#2a8;">BudouX</b><br>{demo_html}
  </div>
</div>
"""))

header("3⃣ Model introspection — features & weights")

model_dir = Path(budoux.__file__).parent / "models"
print("Bundled models:", [p.name for p in model_dir.glob("*.json")])

with open(model_dir / "ja.json", encoding="utf-8") as f:
    ja_model = json.load(f)

print(f"\nFeature categories in ja.json: {list(ja_model.keys())}")
total = sum(len(v) for v in ja_model.values())
print(f"Total learned features: {total:,}")
for cat, feats in ja_model.items():
    print(f"  • {cat:5s} → {len(feats):,} features")

flat = [(cat, feat, w) for cat, d in ja_model.items() for feat, w in d.items()]
flat.sort(key=lambda x: x[2], reverse=True)
print("\nTop 5 features that vote 'BREAK HERE':")
for cat, feat, w in flat[:5]:
    print(f"  [{cat}] {feat!r} → weight={w}")
print("\nTop 5 features that vote 'DO NOT BREAK':")
for cat, feat, w in flat[-5:]:
    print(f"  [{cat}] {feat!r} → weight={w}")

We use BudouX to transform HTML strings by inserting invisible breakpoints that improve text wrapping. We visualize the effect by comparing plain text rendering with BudouX-enhanced output in a constrained layout. We also inspect the internal model structure, exploring feature categories and weights to understand how the segmentation decisions are learned.

header("4⃣ Loading a custom model with `budoux.Parser(model)`")

neutered = {cat: {k: 0 for k in d} for cat, d in ja_model.items()}
flat_parser = budoux.Parser(neutered)
print("All-zero model output :", flat_parser.parse("今日は天気です。"))
print("Default model output  :", ja_parser.parse("今日は天気です。"))
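Why does zeroing every weight suppress all breaks? BudouX's scoring idea is, roughly, to start from a negative base score of half the total weight mass, add the weights of the features matched at each candidate boundary, and vote "break" only when the sum turns positive. The following pure-Python sketch illustrates that idea; the category names, features, and weights are invented for illustration and are not taken from the shipped ja.json.

```python
# Simplified sketch of BudouX-style inference at one candidate boundary:
# start at -(total weight)/2, add the weights of matched features, and
# vote "break" only when the score turns positive. Toy values throughout.
def should_break(matched, model):
    total = sum(w for cat in model.values() for w in cat.values())
    score = -0.5 * total
    for cat, feats in matched.items():
        for feat in feats:
            score += model.get(cat, {}).get(feat, 0)
    return score > 0

toy_model = {"UW1": {"は": 3000, "の": 2500}, "BW1": {"です": -4000}}
print(should_break({"UW1": ["は"]}, toy_model))   # strong positive weight wins
print(should_break({"BW1": ["です"]}, toy_model))  # negative weight suppresses a break
```

With every weight set to zero, the score can never exceed the base, which is consistent with the neutered parser above returning the whole sentence as a single phrase.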

header("5⃣ Practical: custom separators, line-wrapping, JSON export")

def wrap_with_budoux(text, parser, max_width=12, sep="\n"):
    lines, current = [], ""
    for phrase in parser.parse(text):
        if len(current) + len(phrase) > max_width and current:
            lines.append(current); current = phrase
        else:
            current += phrase
    if current: lines.append(current)
    return sep.join(lines)

novel = ("吾輩は猫である。名前はまだ無い。どこで生れたかとんと見当がつかぬ。"
         "何でも薄暗いじめじめした所でニャーニャー泣いていた事だけは記憶している。")
print("Wrapped at width 12:")
print(wrap_with_budoux(novel, ja_parser, max_width=12))

seg = {"text": novel, "phrases": ja_parser.parse(novel)}
print("\nJSON payload (first 120 chars):", json.dumps(seg, ensure_ascii=False)[:120], "…")

We experiment with a custom model by modifying all feature weights to zero and observing how segmentation behavior changes. We then implement a practical text-wrapping function that respects BudouX phrase boundaries for better readability. Finally, we export the segmented output as JSON, making it easy to integrate into downstream systems or front-end applications.
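The single JSON payload above generalizes naturally to a batch pipeline emitting one JSON record per input line (JSONL). Here is a minimal sketch with a stand-in segmenter (splitting on the ideographic full stop) so the example runs without budoux installed; in a real pipeline you would call `parser.parse` instead.

```python
import json

# Stand-in segmenter: splits on the ideographic full stop, keeping the
# punctuation attached to each piece. Swap in parser.parse for real
# BudouX phrase segmentation.
def segment(text):
    return [p + "。" for p in text.rstrip("。").split("。") if p]

lines = ["今日は天気です。明日も晴れです。", "吾輩は猫である。"]
records = [json.dumps({"text": t, "phrases": segment(t)}, ensure_ascii=False)
           for t in lines]
print("\n".join(records))  # one self-contained JSON object per line
```

Because each record is independent, this format streams cleanly into front-end renderers or downstream NLP jobs without loading the whole corpus.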

header("6⃣ Performance benchmark")

big_text = novel * 200
t0 = time.perf_counter()
phrases = ja_parser.parse(big_text)
elapsed = time.perf_counter() - t0
print(f"Parsed {len(big_text):,} chars → {len(phrases):,} phrases "
      f"in {elapsed*1000:.1f} ms ({len(big_text)/elapsed/1000:.0f}k chars/sec)")

header("7⃣ Mini end-to-end trainer (toy demo)")

training_lines = [
    "私は▁遅刻魔で、▁待ち合わせに▁いつも▁遅刻して▁しまいます。",
    "メールで▁待ち合わせ▁相手に▁一言、▁「ごめんね」と▁謝れば▁どうにか▁なると▁思って▁いました。",
    "海外では▁ケータイを▁持って▁いない。",
    "今日は▁とても▁いい▁天気です。",
    "明日は▁雨が▁降る▁かも▁しれません。",
    "週末は▁友達と▁映画を▁見に▁行きます。",
] * 20

SEP = "\u2581"  # the "▁" break marker used in the training lines above

def extract_features(s, i):
    def g(idx): return s[idx] if 0 <= idx < len(s) else ""
    feats = []
    for off in (-3, -2, -1, 0, 1, 2):
        feats.append(f"U{off}:{g(i+off)}")
    for off in (-2, -1, 0, 1):
        feats.append(f"B{off}:{g(i+off)}{g(i+off+1)}")
    for off in (-1, 0):
        feats.append(f"T{off}:{g(i+off)}{g(i+off+1)}{g(i+off+2)}")
    return feats

def make_examples(lines):
    X, y = [], []
    for line in lines:
        clean = line.replace(SEP, "")
        breaks = set()
        j = 0
        for ch in line:
            if ch == SEP: breaks.add(j)
            else: j += 1
        for i in range(1, len(clean)):
            X.append(extract_features(clean, i))
            y.append(1 if i in breaks else -1)
    return X, y

X, y = make_examples(training_lines)
print(f"Training examples: {len(X)} (positives: {sum(1 for v in y if v == 1)})")

We benchmark BudouX’s performance to evaluate its efficiency in processing large amounts of text. We then begin constructing a minimal training pipeline by preparing labeled data and extracting features around potential breakpoints. This gives us insight into how training data is structured and how features contribute to segmentation decisions.

def adaboost(X, y, rounds=80):
    n = len(y)
    w = [1/n]*n
    feat_set = sorted({f for fx in X for f in fx})
    fmap = [set(fx) for fx in X]
    model_rounds = []
    for r in range(rounds):
        best_feat, best_err, best_pol = None, 1.0, 1
        for f in feat_set:
            err_pos = sum(w[i] for i in range(n) if (f in fmap[i]) != (y[i] == 1))
            err_neg = 1 - err_pos
            if err_pos < best_err: best_feat, best_err, best_pol = f, err_pos, +1
            if err_neg < best_err: best_feat, best_err, best_pol = f, err_neg, -1
        if best_err >= 0.5 - 1e-9: break
        eps = max(best_err, 1e-6)
        alpha = 0.5 * ((1 - eps) / eps) ** 0.5
        new_w = []
        for i in range(n):
            pred = best_pol if best_feat in fmap[i] else -best_pol
            new_w.append(w[i] * (0.5 if pred == y[i] else 2.0))
        s = sum(new_w); w = [x/s for x in new_w]
        model_rounds.append((best_feat, best_pol, alpha))
    return model_rounds

print("Training (this is a toy trainer — be patient, ~10 s)…")
t0 = time.perf_counter()
rounds = adaboost(X, y, rounds=60)
print(f"Done in {time.perf_counter()-t0:.1f}s, {len(rounds)} stumps kept.")

correct = 0
for fx, label in zip(X, y):
    score = sum(a if (f in fx) == (p == 1) else -a for f, p, a in rounds)
    pred = 1 if score > 0 else -1
    correct += (pred == label)
print(f"Training accuracy of toy model: {correct/len(X)*100:.1f}%")
print("For a production model, use `scripts/train.py` from the BudouX repo "
      "with the matching feature extractor — this section is illustrative.")
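To connect the toy trainer back to the model format we inspected in section 3, the stump list can be collapsed into a single feature-to-weight map, the same shape as the entries inside ja.json. The conversion and the scale factor below are illustrative assumptions, not BudouX's official export path.

```python
def stumps_to_weights(rounds, scale=1000):
    # Each stump is (feature, polarity, alpha); accumulate the signed,
    # scaled alphas into one integer weight per feature, ja.json-style.
    weights = {}
    for feat, polarity, alpha in rounds:
        weights[feat] = weights.get(feat, 0) + int(polarity * alpha * scale)
    return weights

# Hypothetical stumps, as the toy adaboost() above would return them.
toy_rounds = [("U0:は", 1, 0.8), ("B0:です", -1, 0.5), ("U0:は", 1, 0.2)]
print(stumps_to_weights(toy_rounds))  # {'U0:は': 1000, 'B0:です': -500}
```

Repeated picks of the same feature simply accumulate, which is why boosted stump ensembles flatten so naturally into the additive weight tables BudouX ships.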

header("8⃣ Real-world demo — narrow column comparison")

paragraph = ("BudouXはGoogleが開発したオープンソースの改行ライブラリです。"
             "機械学習モデルを使って、文章を意味のあるフレーズに分割し、"
             "読みやすい位置でのみ改行が起こるようにします。"
             "依存関係がなく軽量なため、ウェブサイトやモバイルアプリに"
             "簡単に組み込むことができます。")
display(HTML(f"""
<div style="display:flex; gap:24px; font-family:'Hiragino Sans','Yu Gothic',sans-serif; font-size:15px;">
  <div style="flex:1; border:2px solid #c33; padding:12px; max-width:180px;">
    <b style="color:#c33;">Without BudouX</b>
    <p style="line-height:1.7;">{paragraph}</p>
  </div>
  <div style="flex:1; border:2px solid #2a8; padding:12px; max-width:180px;">
    <b style="color:#2a8;">With BudouX</b>
    <p style="line-height:1.7;">{ja_parser.translate_html_string(paragraph)}</p>
  </div>
</div>
<p style="font-size:12px;color:#666;">Resize the browser/Colab pane to see the difference more clearly — BudouX never breaks a phrase mid-word.</p>
"""))

print("\nTutorial complete. Try plugging BudouX output into your own UI.")

We implement a simple AdaBoost-based training loop to build a toy segmentation model from scratch. We evaluate the model’s accuracy to understand how well it learns phrase boundaries from the data. Finally, we present a real-world comparison that shows how BudouX improves readability in narrow layouts, reinforcing its practical value.

In conclusion, we developed a comprehensive understanding of how BudouX applies machine learning to solve the nuanced problem of natural line breaking in CJK and similar languages. We saw how it operates efficiently without heavy dependencies, making it ideal for web and mobile integrations. Through hands-on exploration, from parsing and HTML rendering to model introspection, customization, and even training, we learned how to use BudouX and also how to extend and adapt it for our own use cases. This equips us with both the practical tools and conceptual clarity needed to incorporate phrase-aware text segmentation into real-world applications with confidence.

The post How to Build Smarter Multilingual Text Wrapping with BudouX Through Parsing, HTML Rendering, Model Introspection, and Toy Training appeared first on MarkTechPost.
