test prompts and voice settings in the playground
before shipping them into a bot.
write your prompt in latin script only, e.g. `yeh ek test message hai`, not
`यह एक टेस्ट मैसेज है`.
silk muga 1 is our more expressive model, a hinglish emotion-TTS model with two
control surfaces: a paragraph tone, set by the selector or a [tone] marker, and
discrete inline events (laugh, chuckle, sigh) that you place exactly where you
want them to sound.
1. paragraph tone
the voice tone selector sets the default tone for every paragraph. type [ at
the start of a paragraph to insert a specific [tone] marker for that paragraph.
| tone | when to use | delivery |
|---|---|---|
| `[happy]` | light, positive, casual chat | bright, smiling, mid-energy |
| `[excited]` | high-energy reactions: wins, surprises, hype | loud, fast, pitch-up |
| `[sad]` | loss, disappointment, grief | slow, breathy, low pitch |
| `[angry]` | frustration, confrontation, blame | tight, clipped, sharp |
| `[neutral]` | information delivery: instructions, factual | flat, even, no affect |
| `[whisper]` | secrets, late-night, intimate | quiet, breathy, no voiced energy |
rules
- one tone marker per paragraph; a blank line starts a new paragraph.
- paragraphs without an explicit marker use the selected default tone;
`[neutral]` is the default when nothing else is selected.
- the marker applies to the whole paragraph, even if it is inserted after the
first word.
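the rules above can be sketched as a small pre-processing step. this is a minimal illustration, not the playground's actual parser: `resolve_tones` and its behaviour on bad input (raising `ValueError`) are assumptions made for the example.

```python
import re

TONES = {"happy", "excited", "sad", "angry", "neutral", "whisper"}
MARKER = re.compile(r"\[([a-z]+)\]")

def resolve_tones(text, default="neutral"):
    """Return one (tone, paragraph_text) pair per paragraph.

    Hypothetical helper: mirrors the documented rules, not the real parser.
    """
    results = []
    # a blank line (it may contain stray whitespace) starts a new paragraph
    for para in re.split(r"\n\s*\n", text.strip()):
        if not para.strip():
            continue
        markers = MARKER.findall(para)
        unknown = [m for m in markers if m not in TONES]
        if unknown:
            raise ValueError(f"unknown tone marker(s): {unknown}")
        if len(markers) > 1:
            raise ValueError(f"more than one tone marker in: {para!r}")
        # no marker: fall back to the selected default tone
        tone = markers[0] if markers else default
        # the marker covers the whole paragraph wherever it sits, so drop it
        body = " ".join(MARKER.sub(" ", para).split())
        results.append((tone, body))
    return results
```

calling it on three paragraphs, one with a leading marker, one with a marker after the first word, and one with no marker, returns `happy`, `sad`, and the default `neutral` in that order.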
2. inline events
type these directly in the transcript at the position you want the sound. three
are supported:
| event | duration | sound |
|---|---|---|
| `<laugh>` | 0.5-1.5s | loud, voiced laughter (haha, hehe) |
| `<chuckle>` | 0.3-0.7s | soft, amused laugh, almost a breath |
| `<sigh>` | 0.4-0.8s | audible exhale, breathy |
rules
- lowercase, angle brackets, no spaces inside: `<laugh>`, never `<Laugh>` or
`< laugh >`.
- space on both sides when between words. never mid-word.
- position-sensitive: `<laugh> kya baat hai` sounds different from
`kya baat hai <laugh>`.
- stack at most two: `<laugh> <laugh>` gives a longer, harder laugh. three or
more becomes unstable.
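a lint pass over a transcript can catch most violations of these rules before synthesis. the sketch below is illustrative only: `check_events` is a hypothetical helper, and treating any run of three or more event tags (not just repeats of the same tag) as a stack is an assumption.

```python
import re

EVENTS = ("laugh", "chuckle", "sigh")
# anything shaped like an angle-bracket tag, well-formed or not
TAG = re.compile(r"<[^<>]*>")
# ASSUMPTION: any three or more event tags in a row count as a stack
STACK = re.compile(r"(?:<(?:%s)>\s*){3,}" % "|".join(EVENTS))

def check_events(transcript):
    """Return a list of problems; an empty list means the line passes."""
    problems = []
    for m in TAG.finditer(transcript):
        tag = m.group(0)
        # catches <Laugh>, < laugh > and unsupported names in one test
        if tag[1:-1] not in EVENTS:
            problems.append(f"{tag}: unsupported or malformed event")
            continue
        # neighbours default to a space at the start or end of the line
        before = transcript[m.start() - 1] if m.start() > 0 else " "
        after = transcript[m.end()] if m.end() < len(transcript) else " "
        if before.isalnum() or after.isalnum():
            problems.append(f"{tag}: glued to a word, add a space on both sides")
    if STACK.search(transcript):
        problems.append("three or more events stacked; keep it to two")
    return problems
```

position sensitivity cannot be linted, since both placements are valid; the check only covers spelling, spacing, and stacking.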
3. tone and event compatibility
events have to match the tone. laughter belongs to high-energy positive states;
sighs belong to low-energy reflective ones. mix them (a laugh in a sad line, a
sigh in an excited shout) and the model fights itself, because the training set
has almost no examples of those combinations.
| tone | `<laugh>` | `<chuckle>` | `<sigh>` |
|---|---|---|---|
| `[happy]` | ✓✓ | ✓✓ | ✗ |
| `[excited]` | ✓✓ | ✓ | ✗ |
| `[sad]` | ✗ | ✗ | ✓✓ |
| `[angry]` | ✗ | ~ | ✓ |
| `[neutral]` | ✓ | ✗ | ✓ |
| `[whisper]` | ✗ | ✓ | ✓ |
✓✓ best · ✓ ok · ~ rare · ✗ avoid
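the matrix translates directly into a lookup table, which is handy for flagging clashes in generated prompts. a minimal sketch: the ratings are copied from the table above, while `COMPAT`, `rate`, and `clashes` are hypothetical names chosen for the example.

```python
import re

# ratings transcribed from the compatibility table
COMPAT = {
    "happy":   {"laugh": "best",  "chuckle": "best",  "sigh": "avoid"},
    "excited": {"laugh": "best",  "chuckle": "ok",    "sigh": "avoid"},
    "sad":     {"laugh": "avoid", "chuckle": "avoid", "sigh": "best"},
    "angry":   {"laugh": "avoid", "chuckle": "rare",  "sigh": "ok"},
    "neutral": {"laugh": "ok",    "chuckle": "avoid", "sigh": "ok"},
    "whisper": {"laugh": "avoid", "chuckle": "ok",    "sigh": "ok"},
}

def rate(tone, event):
    """Look up the rating for one tone/event pair."""
    return COMPAT[tone][event]

def clashes(tone, transcript):
    """Return the events in the transcript that are rated 'avoid' for this tone."""
    found = re.findall(r"<(laugh|chuckle|sigh)>", transcript)
    return sorted({e for e in found if COMPAT[tone][e] == "avoid"})
```

for instance, `clashes("sad", "<laugh> sab kuch khatam ho gaya")` reports the laugh, while a `<chuckle>` in a happy line passes.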
4. good vs bad
[sad] <laugh> sab kuch khatam ho gaya
[neutral] <laugh> aaj ka mausam saaf rahega
[angry] <chuckle> tumne phir galti ki
[happy] <sigh> kya mast din tha aaj
[excited] <sigh> jeet gaye!
[whisper] <laugh> sab so rahe hain
5. language register
training data is hinglish: romanised hindi with english code-mixing. the model
speaks that best.
avoid
- devanagari. the model saw no devanagari text in training, so `मैं ठीक हूँ`
produces garbage.
- other indian languages (tamil, bengali, marathi, bhojpuri).
- heavy regional dialects (very bambaiyya, very punjabi).
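of the items above, only the script rule can be checked mechanically. the sketch below flags any letter that is not latin, using unicode character names; `non_latin_chars` is a hypothetical helper, and it cannot detect other languages or dialects written in latin script.

```python
import unicodedata

def non_latin_chars(text):
    """Return every letter in text whose unicode name is not a LATIN one."""
    bad = []
    for ch in text:
        # digits, punctuation, spaces and event tags' brackets are ignored;
        # only alphabetic characters are tested
        if ch.isalpha() and "LATIN" not in unicodedata.name(ch, ""):
            bad.append(ch)
    return bad
```

an empty result means the prompt is latin-only; a devanagari prompt such as `मैं ठीक हूँ` returns its base letters.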
6. length
silk muga 1 is built around 2 to 30 second utterances and extrapolates reliably
up to around 40s. beyond that, tone drifts, pacing slips, and you start seeing
repetitions or cutoffs.
- 2 to 30s: sweet spot, one to three sentences.
- ~30 to 40s: still works for longer monologues.
- beyond 40s: split across prompts.
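splitting a long script across prompts can be automated by packing whole sentences into chunks under an estimated duration. this is a rough sketch: `split_for_length` is a hypothetical helper, and the 2.5 words-per-second speaking rate is a placeholder assumption, not a measured figure for silk muga 1, so tune it against real output.

```python
import re

def split_for_length(text, max_seconds=30.0, words_per_second=2.5):
    """Greedily pack whole sentences into chunks under the estimated limit.

    ASSUMPTION: duration is estimated as word count / words_per_second,
    and 2.5 is a placeholder rate rather than a documented value.
    """
    budget = int(max_seconds * words_per_second)  # max words per chunk
    # a sentence is a run of text plus its trailing . ! or ?
    sentences = re.findall(r"[^.!?]+[.!?]*", text)
    chunks, current, count = [], [], 0
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue
        n = len(sentence.split())
        # flush the current chunk if this sentence would overflow it;
        # a single over-long sentence is kept whole rather than cut mid-way
        if current and count + n > budget:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

only the chunk containing a `[tone]` marker keeps it, so repeat the marker on later chunks if they should not fall back to the default tone.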
7. examples
| tone | transcript |
|---|---|
| neutral | Aapka order place ho gaya hai. Confirmation SMS aapke registered number par bhej diya gaya hai. |
| happy | <chuckle> Pata hai tumne kya kiya kal? Pure office mein viral ho gaya. |
| excited | <laugh> Bhai sun, abhi abhi pata chala, wo job mil gayi mujhe! |
| sad | <sigh> Yaar, samajh sakti hoon. Itna kuch hua hai, time lagega. |
| whisper | Phir achanak, kuch khatka hua. Maine darwaza dekha, koi nahi tha. |
| angry | Tumne phir wahi kiya. Maine kitni baar bola tha aisa mat karo. |
temperature 0.7 is the most reliable inference setting for the v3 fine-tune
that the API ships.