prompting guide
how to steer silk muga 1 and silk mulberry 1.5.
test prompts and voice settings in the playground before shipping them into a bot.
silk muga 1#
write your prompt in latin script only — e.g. yeh ek test message hai, not यह एक टेस्ट मैसेज है.
silk muga 1 is a hinglish emotion-TTS model with two control surfaces: a paragraph tone set by the selector or a [tone] marker, and discrete inline events (laugh, chuckle, sigh) you place wherever you want them to sound.
1. paragraph tone#
the voice tone selector sets the default tone for every paragraph. type [ at the start of a paragraph to insert a specific [tone] marker for that paragraph.
| tone | when to use | delivery |
|---|---|---|
| [happy] | light, positive, casual chat | bright, smiling, mid-energy |
| [excited] | high-energy reactions: wins, surprises, hype | loud, fast, pitch-up |
| [sad] | loss, disappointment, grief | slow, breathy, low pitch |
| [angry] | frustration, confrontation, blame | tight, clipped, sharp |
| [neutral] | information delivery: instructions, factual | flat, even, no affect |
| [whisper] | secrets, late-night, intimate | quiet, breathy, no voiced energy |
rules
- one tone marker per paragraph; a blank line starts a new paragraph.
- paragraphs without an explicit marker use the selected default tone; [neutral] is the default when nothing else is selected.
- the marker applies to the whole paragraph, even if it is inserted after the first word.
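the paragraph rules above can be sketched as a small pre-flight check. this is a minimal sketch, not part of any silk SDK: `resolve_tones`, `TONES`, and `MARKER` are hypothetical names, and the function only mirrors the rules as written (blank line splits paragraphs, one marker per paragraph, marker position does not matter, default falls back to neutral).

```python
import re

# Tone names taken from the table in this section.
TONES = {"happy", "excited", "sad", "angry", "neutral", "whisper"}
MARKER = re.compile(r"\[(\w+)\]")


def resolve_tones(prompt: str, default: str = "neutral") -> list[tuple[str, str]]:
    """Return a (tone, text) pair for each paragraph of a prompt."""
    resolved = []
    # A blank line (possibly containing spaces) starts a new paragraph.
    for paragraph in re.split(r"\n\s*\n", prompt.strip()):
        markers = MARKER.findall(paragraph)
        unknown = [m for m in markers if m not in TONES]
        if unknown:
            raise ValueError(f"unknown tone marker(s): {unknown}")
        if len(markers) > 1:
            raise ValueError(f"more than one tone marker in: {paragraph!r}")
        tone = markers[0] if markers else default
        # The marker applies to the whole paragraph wherever it sits,
        # so strip it and keep the remaining words as the spoken text.
        text = " ".join(MARKER.sub(" ", paragraph).split())
        resolved.append((tone, text))
    return resolved
```

running it on a draft before pasting into the playground shows which tone each paragraph will actually get, including paragraphs that silently fall back to the default.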
2. inline events#
type these directly in the transcript at the position you want the sound. three are supported:
| event | duration | sound |
|---|---|---|
| <laugh> | 0.5–1.5s | loud, voiced laughter (haha, hehe) |
| <chuckle> | 0.3–0.7s | soft, amused laugh, almost a breath |
| <sigh> | 0.4–0.8s | audible exhale, breathy |
rules
- lowercase, angle brackets, no spaces inside. <laugh>, never <Laugh> or < laugh >.
- space on both sides when between words. never mid-word.
- position-sensitive. <laugh> kya baat hai sounds different from kya baat hai <laugh>.
- stack at most two. <laugh> <laugh> for a longer/harder laugh. three or more becomes unstable.
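the event rules above lend themselves to a quick lint pass. the sketch below is illustrative only: `lint_events` is a hypothetical helper, and it assumes that "stacking" means repeating the same event back to back, which is how the rule's own example reads.

```python
import re

EVENTS = {"laugh", "chuckle", "sigh"}
# Anything that looks like a tag, including malformed ones such as < laugh >.
TAG_LIKE = re.compile(r"<\s*([A-Za-z_]+)\s*>")
# The same event three or more times in a row (assumed meaning of "stacking").
STACKED = re.compile(r"(<(?:laugh|chuckle|sigh)>)(?:\s*\1){2,}")


def lint_events(transcript: str) -> list[str]:
    """Return human-readable problems found in a transcript's inline events."""
    problems = []
    for match in TAG_LIKE.finditer(transcript):
        raw, name = match.group(0), match.group(1)
        if name.lower() not in EVENTS:
            problems.append(f"{raw}: not a supported event")
            continue
        if raw != f"<{name.lower()}>":
            problems.append(f"{raw}: use lowercase with no inner spaces")
        before = transcript[: match.start()]
        after = transcript[match.end():]
        # An event must never touch a word on either side.
        if (before and before[-1].isalnum()) or (after and after[0].isalnum()):
            problems.append(f"{raw}: add a space between the event and the adjacent word")
    if STACKED.search(transcript):
        problems.append("more than two stacked events")
    return problems
```

an empty list means the transcript passes these checks; it does not say anything about how the line will sound.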
3. tone–event compatibility#
events have to match the tone. laughter belongs to high-energy positive states; sighs belong to low-energy reflective ones. mix them — a laugh in a sad line, a sigh in an excited shout — and the model fights itself, because the training set has almost no examples of those combinations.
| tone | <laugh> | <chuckle> | <sigh> |
|---|---|---|---|
| [happy] | ✓✓ | ✓✓ | ✗ |
| [excited] | ✓✓ | ✓ | ✗ |
| [sad] | ✗ | ✗ | ✓✓ |
| [angry] | ✗ | ~ | ✓ |
| [neutral] | ✓ | ✗ | ✓ |
| [whisper] | ✗ | ✓ | ✓ |
✓✓ best · ✓ ok · ~ rare · ✗ avoid
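the compatibility table can be carried into code as a plain lookup. a minimal sketch, with `COMPAT` and `flagged_events` as hypothetical names; the ratings are copied from the table and the legend (✓✓ best, ✓ ok, ~ rare, ✗ avoid).

```python
import re

# Ratings transcribed from the tone/event table in this section.
COMPAT = {
    "happy":   {"laugh": "best",  "chuckle": "best",  "sigh": "avoid"},
    "excited": {"laugh": "best",  "chuckle": "ok",    "sigh": "avoid"},
    "sad":     {"laugh": "avoid", "chuckle": "avoid", "sigh": "best"},
    "angry":   {"laugh": "avoid", "chuckle": "rare",  "sigh": "ok"},
    "neutral": {"laugh": "ok",    "chuckle": "avoid", "sigh": "ok"},
    "whisper": {"laugh": "avoid", "chuckle": "ok",    "sigh": "ok"},
}


def flagged_events(tone: str, transcript: str, levels=("avoid",)) -> list[str]:
    """List the events in a transcript whose rating for this tone is in `levels`."""
    events = re.findall(r"<(laugh|chuckle|sigh)>", transcript)
    return [event for event in events if COMPAT[tone][event] in levels]
```

passing `levels=("avoid", "rare")` makes the check stricter, which may be useful for lines where stability matters more than expressiveness.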
4. good vs bad#
bad:
[sad] <laugh> sab kuch khatam ho gaya
[neutral] <laugh> aaj ka mausam saaf rahega
[angry] <chuckle> tumne phir galti ki
[happy] <sigh> kya mast din tha aaj
[excited] <sigh> jeet gaye!
[whisper] <laugh> sab so rahe hain

good:
[happy] <laugh> Yaar tumne phir wahi joke maara!
[excited] <laugh> Bhai jeet gaye, vishwas nahi ho raha!
[sad] <sigh> Pata nahi yaar, kuch samajh nahi aata.
[whisper] <sigh> Itna lamba din tha, thak gayi hoon.
5. language register#
training data is hinglish — romanised hindi with english code-mixing. the model speaks that best.
avoid
- devanagari. the model saw zero hindi script. मैं ठीक हूँ produces garbage.
- other indian languages (tamil, bengali, marathi, bhojpuri).
- heavy regional dialects (very bambaiyya, very punjabi).
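devanagari input is easy to catch before it reaches the model, because the script occupies a single unicode block (U+0900 to U+097F). a minimal sketch with a hypothetical helper name; it does not detect other scripts or romanised text in other languages.

```python
def devanagari_chars(text: str) -> list[str]:
    """Return every character of `text` that falls in the Devanagari block."""
    # U+0900 to U+097F is the Unicode Devanagari block.
    return [ch for ch in text if "\u0900" <= ch <= "\u097f"]
```

an empty result means no devanagari was found; anything else is worth transliterating to latin script first.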
6. length#
silk muga 1 is built around 2 to 30 second utterances and extrapolates reliably up to around 40s. beyond that, tone drifts, pacing slips, and you start seeing repetitions or cutoffs.
- 2 to 30s — sweet spot, one to three sentences.
- ~30 to 40s — still works for longer monologues.
- beyond 40s — split across prompts.
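one way to stay inside the length window is to split long scripts at sentence boundaries against a word budget. this is a rough sketch: `split_for_length` is a hypothetical helper, and the 2.5 words-per-second speaking rate is an assumption, not a documented figure, so calibrate it against real output.

```python
import re


def split_for_length(text: str, max_seconds: float = 30.0,
                     words_per_second: float = 2.5) -> list[str]:
    """Group sentences into chunks whose estimated duration fits max_seconds."""
    # words_per_second is an assumed rate; inline events count as one word each.
    budget = int(max_seconds * words_per_second)
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        if not sentence:
            continue
        n = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and count + n > budget:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

a single sentence longer than the budget is kept whole as its own chunk, so very long sentences still need manual editing.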
7. examples#
| tone | transcript |
|---|---|
| neutral | Aapka order place ho gaya hai. Confirmation SMS aapke registered number par bhej diya gaya hai. |
| happy | <chuckle> Pata hai tumne kya kiya kal? Pure office mein viral ho gaya. |
| excited | <laugh> Bhai sun, abhi abhi pata chala, wo job mil gayi mujhe! |
| sad | <sigh> Yaar, samajh sakti hoon. Itna kuch hua hai, time lagega. |
| whisper | Phir achanak, kuch khatka hua. Maine darwaza dekha, koi nahi tha. |
| angry | Tumne phir wahi kiya. Maine kitni baar bola tha aisa mat karo. |
temperature 0.7 is the most reliable inference setting for the v3 fine-tune that the API ships.
silk mulberry 1.5#
write the voice in natural language. the model picks up on attributes you mention. the lists below are the vocabulary it understands; weave them into a single sentence rather than listing them as fields.
1. inline tags#
drop these tags anywhere in the text to trigger a sound. they render as part of the performance, not as words.
<laugh> <laugh_harder> <sigh> <chuckle> <gasp> <angry> <excited>
<whisper> <cry> <scream> <sing> <snort> <exhale> <gulp> <giggle>
<sarcastic> <curious>
2. voice attributes#
mention any of these in your description.
- age: 20s, 30s, 40s
- accent (global): american, british, middle_eastern, asian_american, indian
- accent (indian regional): hindi, punjabi, bihari, south_indian, bengali, rajasthani, marathi, gujarati, kashmiri, assamese, odia, telugu, kannada, malayali, haryanvi, chhattisgarhi
- pitch: low, normal, high
- timbre (realistic): deep, warm, gravelly, smooth, raspy, nasally, throaty, harsh, whisper
- timbre (creative): adds robotic, ethereal to the realistic set
- pacing: very slow, slow, conversational, brisk, fast, very_fast
- emotion: neutral, energetic, excited, sad, sarcastic, dry, crying, angry
- intensity: low, med, high
- register: formal, neutral, casual
3. speaking role#
pick a role from a domain to anchor the delivery style.
- social: youtube_vlogger, social_media_creator, influencer_voice, streamer_companion
- podcast: podcast_host, interviewer
- commercial: ad_narrator, brand_spokesperson, product_demo_voice, sales_pitch_voice
- education: elearning_instructor, kids_story_voice
- support: customer_support_agent, virtual_receptionist, healthcare_assistant
- entertainment: storyteller, social_media_reaction, meme_voice
- corporate: explainer_video_voice, event_host, corporate_training_narrator
- viral: short_form_narrator, meme_voice
4. creative-only attributes#
available when you want a non-realistic timbre (e.g. characters, stylized voices).
animated_cartoon ai_machine_voice alien_scifi seductively flirty anime
cyborg pirate dark_villain demon gangster mafia dramatic_narrator
mythical_godlike_magical spy vampire alpha
5. examples#
description: a warm 30s hindi accent voice, conversational pacing, casual register, sounds like a podcast host walking you through a story.
transcript: aaj ka episode thoda alag hai. <chuckle> ek minute ke liye seedha baith jao.

description: a high pitched 20s american voice, excited, very fast pacing, like a streamer reacting live.
transcript: oh my god did you see that play, that was insane.

description: a deep gravelly low pitched 40s british voice, slow pacing, formal register, dramatic narrator.
transcript: the door creaked open. nobody was there. and yet, something watched.

keep descriptions short and concrete. one sentence with 3 to 5 attributes beats a paragraph of vague adjectives.
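if descriptions are generated in code, the attribute vocabulary from section 2 can be woven into one sentence rather than sent as fields. a minimal sketch: `describe_voice` and `VOCAB` are hypothetical names, the sentence shape simply imitates the examples in this section, and accent and role values are passed through unchecked to keep the sketch short.

```python
# Vocabulary copied from the voice attributes list (accent and role omitted).
VOCAB = {
    "age": {"20s", "30s", "40s"},
    "pitch": {"low", "normal", "high"},
    "timbre": {"deep", "warm", "gravelly", "smooth", "raspy", "nasally",
               "throaty", "harsh", "whisper", "robotic", "ethereal"},
    "pacing": {"very slow", "slow", "conversational", "brisk", "fast", "very_fast"},
    "emotion": {"neutral", "energetic", "excited", "sad", "sarcastic",
                "dry", "crying", "angry"},
    "register": {"formal", "neutral", "casual"},
}


def _with_article(phrase: str) -> str:
    return ("an " if phrase[0] in "aeiou" else "a ") + phrase


def describe_voice(*, accent: str = "", role: str = "", **attrs: str) -> str:
    """Build a one-sentence voice description from keyword attributes."""
    for field, value in attrs.items():
        if field not in VOCAB:
            raise ValueError(f"unknown attribute: {field}")
        if value not in VOCAB[field]:
            raise ValueError(f"{value!r} is not a known {field}")
    head = [
        attrs.get("timbre", ""),
        f"{attrs['pitch']} pitched" if "pitch" in attrs else "",
        attrs.get("age", ""),
        f"{accent} accent" if accent else "",
        "voice",
    ]
    parts = [_with_article(" ".join(word for word in head if word))]
    if "emotion" in attrs:
        parts.append(attrs["emotion"])
    if "pacing" in attrs:
        parts.append(f"{attrs['pacing']} pacing")
    if "register" in attrs:
        parts.append(f"{attrs['register']} register")
    if role:
        parts.append(f"sounds like {_with_article(role)}")
    # Identifiers such as podcast_host read more naturally with spaces.
    return (", ".join(parts) + ".").replace("_", " ")
```

for example, `describe_voice(timbre="warm", age="30s", accent="hindi", pacing="conversational", register="casual", role="podcast_host")` yields a sentence close to the first example above.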