How we built voice-matched drafts with Claude Sonnet

Most AI drafts sound like ChatGPT in business attire — over-formal, oddly cheerful, full of "I hope this email finds you well." The reason is simple: the model has no idea how YOU write. It's writing how it thinks an executive should write. The result reads like a hostage letter from a polite stranger.

Triagd's voice-matched drafts are different — and the difference is mostly a retrieval problem, not a generation problem. Here's how we built it.

The corpus: your last 500 sent emails

When you connect Gmail, we index the most recent 500 emails you've sent — excluding one-line replies ("thanks!", "sg", "will do"), excluding auto-responses, and excluding anything to mailing lists. What remains is a representative sample of how you actually write to other humans.

We extract three things from each sent email: the body text (cleaned of signatures and quoted replies), the recipient's relationship to you (peer, report, external, family), and stylometric signals — average sentence length, formality score, sign-off pattern, emoji rate, and a small handful of vocabulary fingerprints.

Stylometry: the cheap signal that does most of the work

Before we touch any LLM, we compute six stylometric features per user:

Average sentence length (in words)
Greeting pattern: "Hi <name>", "Hey", no greeting, etc.
Sign-off pattern: "Thanks,", "Best,", "– A", initial only
Use of bullets vs. prose (binary, per recipient relationship)
Question density: questions per 100 words
Casual marker rate: contractions, lowercased starts, mid-sentence dashes

These six numbers, computed per recipient-relationship, are passed into every draft prompt as a style header. They're the cheapest thing we could possibly send and they fix about 60% of the "sounds like an AI" problem on their own.

Retrieval: finding the in-style examples

Stylometry gets you partway. The rest is exemplar retrieval. For each draft we need to write, we find 3–5 past sent emails from the corpus that are tonally similar to what we're about to write.

Tonally similar means: same recipient-relationship, similar topic vibe, similar length. We use a small embedding model (Voyage-Lite over the cleaned body) and filter by recipient-relationship as a hard constraint. The 3–5 nearest neighbors become exemplars in the prompt.

Why not just fine-tune?

Per-user fine-tunes were our first instinct and it's the wrong instinct. Three reasons:

1Cost: a fine-tune per operator means a custom model per operator, with all the storage and inference complexity that implies. RAG over 500 emails scales linearly with users; fine-tuning doesn't.
2Freshness: operators write differently after a job change, a fundraise, a kid being born. Fine-tunes calcify; retrieval adapts the moment the most recent emails change.
3Quality: stylometric headers + exemplar retrieval with a strong base model (Claude Sonnet 4.6) beats per-user fine-tunes on every blind A/B we've run, and it's not close.

The prompt structure

When we draft a reply, we send Claude Sonnet 4.6 the following, in order:

System prompt: "You write email replies in the operator's voice. Match their tone exactly."
Style header: the six stylometric features, formatted as a tight markdown table
3–5 exemplar emails the operator has sent to similar recipients
The incoming email to reply to
A one-line intent statement extracted from the original ("accept the meeting," "push back politely," "ask a clarifying question")

Prompt-cache the system prompt and the exemplars (they don't change per request). The variable part — the incoming email and intent — is the only thing that costs full input tokens. With Anthropic's prompt caching, this is a 90%+ cache hit rate per user.

Result quality, measured

We measure draft quality two ways. The hard metric is edit distance — how much did the operator change before pressing send? Median edit distance across 12,000 drafts in our beta: 18%. Most drafts go out with a single tweak — usually a personalization the model couldn't know.

The softer metric is whether operators trust the draft enough to send it without reading carefully. We don't recommend that — but in our anonymous telemetry, 31% of Tier 2 drafts are sent within 5 seconds of being opened. That's the trust.

What we got wrong (and fixed)

Early versions over-indexed on the most recent 50 sent emails. That worked fine until someone's last 50 emails were all post-fundraise announcements — and then every draft sounded like a fundraise announcement. We now bucket the corpus by recipient-relationship and sample evenly across buckets. Lesson: recency bias is the enemy of voice fidelity.

We also had a brief, embarrassing period where the model would generate signatures ("Best, Alex") because the exemplars contained them. We now strip signatures aggressively before embedding and we tell the model to omit sign-offs in its drafts — the operator's email client appends their real signature.

Voice matching isn't magic. It's a corpus, a few features, careful retrieval, and a frontier model. Together, they produce something most operators can press send on with one click.

Triagd builds what we write about.

If this post landed for you, the product behind it is what you actually want. Connect Gmail or Outlook in under a minute. 7-day free trial — no card required.

Start free trial →Try the free toolkit