test prompts and voice settings in the playground before shipping them into a bot.
write your prompt in latin script only, e.g. yeh ek test message hai, not यह एक टेस्ट मैसेज है.

silk muga 1 is our more expressive model: a hinglish emotion-TTS model with two control surfaces. the first is a paragraph tone, set by the selector or a [tone] marker. the second is a set of discrete inline events (<laugh>, <chuckle>, <sigh>) that you place wherever you want them to sound.

1. paragraph tone

the voice tone selector sets the default tone for every paragraph. type [ at the start of a paragraph to insert a specific [tone] marker for that paragraph.
| tone | when to use | delivery |
|---|---|---|
| [happy] | light, positive, casual chat | bright, smiling, mid-energy |
| [excited] | high-energy reactions: wins, surprises, hype | loud, fast, pitch-up |
| [sad] | loss, disappointment, grief | slow, breathy, low pitch |
| [angry] | frustration, confrontation, blame | tight, clipped, sharp |
| [neutral] | information delivery: instructions, factual | flat, even, no affect |
| [whisper] | secrets, late-night, intimate | quiet, breathy, no voiced energy |

rules
  • one tone marker per paragraph; a blank line starts a new paragraph.
  • paragraphs without an explicit marker use the selected default tone; [neutral] is the default when nothing else is selected.
  • the marker applies to the whole paragraph, even if it is inserted after the first word.
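
the rules above can be sketched in python as a small resolver that maps each paragraph to its effective tone. this is a minimal sketch, not the playground's implementation: resolve_tones and its return shape are hypothetical names chosen for illustration, and raising an error on a second marker is only one possible way to enforce the one-marker rule.

```python
import re

# Tone names are taken from the table above. Everything else here
# (function name, return shape, error behaviour) is an assumption.
TONES = ("happy", "excited", "sad", "angry", "neutral", "whisper")
TONE_MARKER = re.compile(r"\[(" + "|".join(TONES) + r")\]")


def resolve_tones(text, default_tone="neutral"):
    """Return one (tone, spoken_text) pair per paragraph."""
    resolved = []
    # A blank line starts a new paragraph.
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        if not paragraph.strip():
            continue
        markers = TONE_MARKER.findall(paragraph)
        if len(markers) > 1:
            raise ValueError(f"more than one tone marker in: {paragraph!r}")
        # The marker applies to the whole paragraph wherever it sits;
        # paragraphs without a marker fall back to the default tone.
        tone = markers[0] if markers else default_tone
        spoken = " ".join(TONE_MARKER.sub(" ", paragraph).split())
        resolved.append((tone, spoken))
    return resolved


prompt = "[happy] kya baat hai\n\naapka order aa gaya\n\nyaar [sad] time lagega"
for tone, spoken in resolve_tones(prompt):
    print(tone, "->", spoken)
```

the third paragraph shows the last rule: the marker sits after the first word, but the whole paragraph still resolves to sad.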

2. inline events

type these directly in the transcript at the position you want the sound. three are supported:
| event | duration | sound |
|---|---|---|
| <laugh> | 0.5-1.5s | loud, voiced laughter (haha, hehe) |
| <chuckle> | 0.3-0.7s | soft, amused laugh, almost a breath |
| <sigh> | 0.4-0.8s | audible exhale, breathy |

rules
  • lowercase, angle brackets, no spaces inside. <laugh>, never <Laugh> or < laugh >.
  • space on both sides when between words. never mid-word.
  • position-sensitive. <laugh> kya baat hai sounds different from kya baat hai <laugh>.
  • stack at most two. <laugh> <laugh> for a longer/harder laugh. three or more becomes unstable.
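
a quick pre-flight check for these rules could look like the sketch below. check_events is a hypothetical helper, not part of the product. it treats "mid-word" as an event tag touching a letter or digit, and it reads "stack" as any run of consecutive events; both readings are assumptions.

```python
import re

EVENTS = ("laugh", "chuckle", "sigh")
# Anything that looks like an angle-bracket tag, including malformed ones.
TAG_LIKE = re.compile(r"<[^<>]*>")
# Three or more supported events in a row, separated only by whitespace.
EVENT_RUN = re.compile(r"(?:<(?:" + "|".join(EVENTS) + r")>\s*){3,}")


def check_events(transcript):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for match in TAG_LIKE.finditer(transcript):
        tag = match.group()
        if tag[1:-1] not in EVENTS:
            # Catches <Laugh>, < laugh > and unsupported names alike.
            problems.append(f"{tag}: use lowercase <laugh>, <chuckle> or <sigh>, no spaces inside")
            continue
        before = transcript[:match.start()]
        after = transcript[match.end():]
        if (before and before[-1].isalnum()) or (after and after[0].isalnum()):
            problems.append(f"{tag}: needs a space on both sides, never mid-word")
    for match in EVENT_RUN.finditer(transcript):
        problems.append(f"{match.group().strip()}: stack at most two events")
    return problems


print(check_events("<laugh> <laugh> kya baat hai"))  # no problems
print(check_events("kya<laugh> baat hai"))           # one spacing problem
```

the check cannot hear position effects: <laugh> kya baat hai and kya baat hai <laugh> both pass, and choosing between them is still a listening decision.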

3. tone and event compatibility

events have to match the tone. laughter belongs to high-energy positive states; sighs belong to low-energy reflective ones. mix them (a laugh in a sad line, a sigh in an excited shout) and the model fights itself, because the training set has almost no examples of those combinations.
tone   <laugh>   <chuckle>   <sigh>
[happy]   ✓✓✓✓
[excited]   ✓✓
[sad]   ✓✓
[angry]   ~
[neutral]
[whisper]
✓✓ best · ✓ ok · ~ rare · ✗ avoid

4. good vs bad

[sad] <laugh> sab kuch khatam ho gaya
[neutral] <laugh> aaj ka mausam saaf rahega
[angry] <chuckle> tumne phir galti ki
[happy] <sigh> kya mast din tha aaj
[excited] <sigh> jeet gaye!
[whisper] <laugh> sab so rahe hain

5. language register

the training data is hinglish: romanised hindi with english code-mixing. the model speaks that best. avoid:
  • devanagari. the model saw zero hindi script. मैं ठीक हूँ produces garbage.
  • other indian languages (tamil, bengali, marathi, bhojpuri).
  • heavy regional dialects (very bambaiyya, very punjabi).
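
the devanagari case is easy to catch before a prompt is sent. the sketch below tests each character against the unicode devanagari block (U+0900 to U+097F). contains_devanagari is a hypothetical helper; it does not detect other indian languages or dialects written in latin script.

```python
def contains_devanagari(text):
    """True if any character falls in the Unicode Devanagari block."""
    return any("\u0900" <= ch <= "\u097f" for ch in text)


for prompt in ("main theek hoon", "मैं ठीक हूँ"):
    status = "rewrite in latin script" if contains_devanagari(prompt) else "ok"
    print(repr(prompt), "->", status)
```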

6. length

silk muga 1 is built around 2 to 30 second utterances and extrapolates reliably up to around 40s. beyond that, tone drifts, pacing slips, and you start seeing repetitions or cutoffs.
  • 2 to 30s: sweet spot, one to three sentences.
  • ~30 to 40s: still works for longer monologues.
  • beyond 40s: split across prompts.
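
one way to split a long script across prompts is to pack whole sentences into chunks that stay inside the 30s sweet spot. the sketch below is illustrative only: split_for_length and estimate_seconds are hypothetical helpers, and the speaking rate of 2.5 words per second is a placeholder assumption, not a measured figure for silk muga 1. measure real output and adjust it.

```python
import re

WORDS_PER_SECOND = 2.5  # ASSUMPTION: placeholder rate, tune against real audio
MAX_SECONDS = 30.0      # upper edge of the sweet spot described above


def estimate_seconds(text):
    """Rough duration estimate from word count alone."""
    return len(text.split()) / WORDS_PER_SECOND


def split_for_length(text, max_seconds=MAX_SECONDS):
    """Greedily pack whole sentences into chunks under max_seconds."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and estimate_seconds(candidate) > max_seconds:
            chunks.append(current)  # close the chunk before it overflows
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


script = " ".join(["ek do teen char paanch."] * 20)  # 100 words
for chunk in split_for_length(script):
    print(round(estimate_seconds(chunk), 1), "s:", chunk[:40], "...")
```

a single sentence longer than the limit is kept whole rather than cut mid-sentence, so very long sentences still need manual editing.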

7. examples

| tone | transcript |
|---|---|
| neutral | Aapka order place ho gaya hai. Confirmation SMS aapke registered number par bhej diya gaya hai. |
| happy | <chuckle> Pata hai tumne kya kiya kal? Pure office mein viral ho gaya. |
| excited | <laugh> Bhai sun, abhi abhi pata chala, wo job mil gayi mujhe! |
| sad | <sigh> Yaar, samajh sakti hoon. Itna kuch hua hai, time lagega. |
| whisper | Phir achanak, kuch khatka hua. Maine darwaza dekha, koi nahi tha. |
| angry | Tumne phir wahi kiya. Maine kitni baar bola tha aisa mat karo. |

temperature 0.7 is the most reliable inference setting for the v3 fine-tune the API ships.