prompting guide

how to steer silk muga 1 and silk mulberry 1.5.

test prompts and voice settings in the playground before shipping them into a bot.

silk muga 1

write your prompt in latin script only — e.g. yeh ek test message hai, not यह एक टेस्ट मैसेज है.

silk muga 1 is a hinglish emotion-TTS model with two control surfaces: a paragraph tone set by the selector or a [tone] marker, and discrete inline events (laugh, chuckle, sigh) you place wherever you want them to sound.

1. paragraph tone

the voice tone selector sets the default tone for every paragraph. type [ at the start of a paragraph to insert a specific [tone] marker for that paragraph.

tone	when to use	delivery
`[happy]`	light, positive, casual chat	bright, smiling, mid-energy
`[excited]`	high-energy reactions: wins, surprises, hype	loud, fast, pitch-up
`[sad]`	loss, disappointment, grief	slow, breathy, low pitch
`[angry]`	frustration, confrontation, blame	tight, clipped, sharp
`[neutral]`	information delivery: instructions, factual	flat, even, no affect
`[whisper]`	secrets, late-night, intimate	quiet, breathy, no voiced energy

rules

one tone marker per paragraph; a blank line starts a new paragraph.
paragraphs without an explicit marker use the selected default tone; [neutral] is the default when nothing else is selected.
the marker applies to the whole paragraph, even if it is inserted after the first word.
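the rules above can be sketched as a small pre-flight check. this is a minimal sketch, not the model's own parser: `resolve_paragraph_tones` and its `default_tone` argument are hypothetical names, with `default_tone` standing in for whatever the selector is set to.

```python
import re

TONES = {"happy", "excited", "sad", "angry", "neutral", "whisper"}
TONE_MARKER = re.compile(r"\[([a-z]+)\]")


def resolve_paragraph_tones(text, default_tone="neutral"):
    """Return a list of (tone, paragraph_text) pairs, one per paragraph."""
    results = []
    # A blank line (possibly containing spaces) starts a new paragraph.
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        if not paragraph.strip():
            continue
        markers = [m for m in TONE_MARKER.findall(paragraph) if m in TONES]
        if len(markers) > 1:
            raise ValueError(f"more than one tone marker in paragraph: {paragraph!r}")
        tone = markers[0] if markers else default_tone
        # Drop the marker wherever it sits; it applies to the whole paragraph.
        body = TONE_MARKER.sub(
            lambda m: "" if m.group(1) in TONES else m.group(0), paragraph
        )
        results.append((tone, " ".join(body.split())))
    return results


if __name__ == "__main__":
    sample = "[happy] kya baat hai\n\nyeh [sad] theek nahi hua\n\naapka order aa gaya"
    for tone, body in resolve_paragraph_tones(sample):
        print(tone, "|", body)
```

running a draft through a check like this before pasting it into the playground catches a second marker in the same paragraph early.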

2. inline events

type these directly in the transcript at the position you want the sound. three are supported:

event	duration	sound
`<laugh>`	0.5–1.5s	loud, voiced laughter (haha, hehe)
`<chuckle>`	0.3–0.7s	soft, amused laugh, almost a breath
`<sigh>`	0.4–0.8s	audible exhale, breathy

rules

lowercase, angle brackets, no spaces inside. <laugh>, never <Laugh> or < laugh >.
space on both sides when between words. never mid-word.
position-sensitive. <laugh> kya baat hai sounds different from kya baat hai <laugh>.
stack at most two. <laugh> <laugh> for a longer/harder laugh. three or more becomes unstable.
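a linter for these rules could look like the sketch below. `check_events` is a hypothetical helper, and it is stricter than the rules strictly require: it asks for whitespace next to every tag, so a tag touching punctuation is also reported.

```python
import re

EVENTS = {"laugh", "chuckle", "sigh"}
# Anything in angle brackets, so malformed variants like <Laugh> or < laugh > are caught too.
TAG_LIKE = re.compile(r"<[^<>]*>")


def check_events(transcript):
    """Return a list of problems; an empty list means the transcript passes."""
    problems = []
    for match in TAG_LIKE.finditer(transcript):
        tag = match.group(0)
        if tag[1:-1] not in EVENTS:
            problems.append(f"unsupported or malformed tag: {tag}")
            continue
        before = transcript[:match.start()]
        after = transcript[match.end():]
        # A tag must be separated from its neighbours by whitespace (never mid-word).
        if (before and not before[-1].isspace()) or (after and not after[0].isspace()):
            problems.append(f"tag needs a space on both sides: {tag}")
    # Three or more of the same event in a row is reported as unstable.
    for name in sorted(EVENTS):
        if re.search(rf"(?:<{name}>\s*){{3,}}", transcript):
            problems.append(f"more than two stacked <{name}> events")
    return problems


if __name__ == "__main__":
    print(check_events("<laugh> <laugh> kya baat hai"))
    print(check_events("kya<Laugh> baat hai"))
```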

3. tone–event compatibility

events have to match the tone. laughter belongs to high-energy positive states; sighs belong to low-energy reflective ones. mix them — a laugh in a sad line, a sigh in an excited shout — and the model fights itself, because the training set has almost no examples of those combinations.

tone	`<laugh>`	`<chuckle>`	`<sigh>`
`[happy]`	✓✓	✓✓	✗
`[excited]`	✓✓	✓	✗
`[sad]`	✗	✗	✓✓
`[angry]`	✗	~	✓
`[neutral]`	✓	✗	✓
`[whisper]`	✗	✓	✓

✓✓ best · ✓ ok · ~ rare · ✗ avoid
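the table above can be carried into code as a plain lookup. the sketch below copies the ratings as given; `COMPATIBILITY` and `flag_clashes` are hypothetical names, and treating only "avoid" as a clash is a design choice of this example.

```python
import re

# Ratings copied from the table above: "best", "ok", "rare", "avoid".
COMPATIBILITY = {
    "happy":   {"laugh": "best",  "chuckle": "best",  "sigh": "avoid"},
    "excited": {"laugh": "best",  "chuckle": "ok",    "sigh": "avoid"},
    "sad":     {"laugh": "avoid", "chuckle": "avoid", "sigh": "best"},
    "angry":   {"laugh": "avoid", "chuckle": "rare",  "sigh": "ok"},
    "neutral": {"laugh": "ok",    "chuckle": "avoid", "sigh": "ok"},
    "whisper": {"laugh": "avoid", "chuckle": "ok",    "sigh": "ok"},
}


def flag_clashes(tone, transcript):
    """Return the events in the transcript that the table rates 'avoid' for this tone."""
    events = re.findall(r"<(laugh|chuckle|sigh)>", transcript)
    return [event for event in events if COMPATIBILITY[tone][event] == "avoid"]


if __name__ == "__main__":
    print(flag_clashes("sad", "<laugh> sab kuch khatam ho gaya"))
    print(flag_clashes("happy", "<laugh> Yaar tumne phir wahi joke maara!"))
```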

4. good vs bad

bad

[sad] <laugh> sab kuch khatam ho gaya
[neutral] <laugh> aaj ka mausam saaf rahega
[angry] <chuckle> tumne phir galti ki
[happy] <sigh> kya mast din tha aaj
[excited] <sigh> jeet gaye!
[whisper] <laugh> sab so rahe hain

good

[happy] <laugh> Yaar tumne phir wahi joke maara!
[excited] <laugh> Bhai jeet gaye, vishwas nahi ho raha!
[sad] <sigh> Pata nahi yaar, kuch samajh nahi aata.
[whisper] <sigh> Itna lamba din tha, thak gayi hoon.

5. language register

training data is hinglish — romanised hindi with english code-mixing. the model speaks that best.

avoid

devanagari. the model saw zero hindi script. मैं ठीक हूँ produces garbage.
other indian languages (tamil, bengali, marathi, bhojpuri).
heavy regional dialects (very bambaiyya, very punjabi).
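the script rule is easy to check mechanically. the sketch below uses python's standard `unicodedata` module to list letters that are not latin script; `find_non_latin` is a hypothetical helper. it only catches script problems: another language or a heavy dialect written in latin letters will pass.

```python
import unicodedata


def find_non_latin(text):
    """Return every letter in `text` whose Unicode name is not a LATIN letter."""
    return [
        ch for ch in text
        # Digits, spaces, punctuation and <event> brackets are not letters, so they pass.
        if ch.isalpha() and "LATIN" not in unicodedata.name(ch, "")
    ]


if __name__ == "__main__":
    print(find_non_latin("yeh ek test message hai"))  # nothing to report
    print(find_non_latin("मैं ठीक हूँ"))               # Devanagari letters are reported
```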

6. length

silk muga 1 is built around 2 to 30 second utterances and extrapolates reliably up to around 40s. beyond that, tone drifts, pacing slips, and you start seeing repetitions or cutoffs.

2 to 30s — sweet spot, one to three sentences.
~30 to 40s — still works for longer monologues.
beyond 40s — split across prompts.
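one way to split a long script is to pack whole sentences into chunks that stay inside the sweet spot. the sketch below is an illustration under a loud assumption: `WORDS_PER_SECOND = 2.5` is a rough guess at conversational pace, not a measured figure for this model, so calibrate it against real output. the function names are hypothetical.

```python
import re

WORDS_PER_SECOND = 2.5  # assumption: rough speaking rate, not measured on silk muga 1
TARGET_SECONDS = 30     # upper edge of the sweet spot described above


def estimate_seconds(text):
    """Crude duration estimate from the word count (event tags count as words)."""
    return len(text.split()) / WORDS_PER_SECOND


def split_for_length(text, limit=TARGET_SECONDS):
    """Greedily pack whole sentences into chunks estimated at or under `limit` seconds."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and estimate_seconds(candidate) > limit:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    # A single sentence longer than `limit` is kept whole; split it by hand.
    return chunks


if __name__ == "__main__":
    script = " ".join(["ek do teen chaar paanch."] * 20)
    for chunk in split_for_length(script):
        print(round(estimate_seconds(chunk), 1), "s |", chunk[:40], "...")
```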

7. examples

tone	transcript
neutral	Aapka order place ho gaya hai. Confirmation SMS aapke registered number par bhej diya gaya hai.
happy	`<chuckle>` Pata hai tumne kya kiya kal? Pure office mein viral ho gaya.
excited	`<laugh>` Bhai sun, abhi abhi pata chala, wo job mil gayi mujhe!
sad	`<sigh>` Yaar, samajh sakti hoon. Itna kuch hua hai, time lagega.
whisper	Phir achanak, kuch khatka hua. Maine darwaza dekha, koi nahi tha.
angry	Tumne phir wahi kiya. Maine kitni baar bola tha aisa mat karo.

temperature 0.7 is the most reliable inference setting for the v3 fine-tune that the API ships.

silk mulberry 1.5

describe the voice in natural language; the model picks up on the attributes you mention. the lists below are the vocabulary it understands. weave them into a single sentence rather than listing them as fields.

1. inline tags

drop these tags anywhere in the text to trigger a sound. they render as part of the performance, not as words.

<laugh>  <laugh_harder>  <sigh>  <chuckle>  <gasp>  <angry>  <excited>
<whisper>  <cry>  <scream>  <sing>  <snort>  <exhale>  <gulp>  <giggle>
<sarcastic>  <curious>
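because anything outside this list is not a supported tag, a quick check for strays can help. the sketch below copies the tag names from the list above; `unknown_tags` is a hypothetical helper and says nothing about how the model treats an unsupported tag.

```python
import re

# Tag names copied from the list above.
MULBERRY_TAGS = {
    "laugh", "laugh_harder", "sigh", "chuckle", "gasp", "angry", "excited",
    "whisper", "cry", "scream", "sing", "snort", "exhale", "gulp", "giggle",
    "sarcastic", "curious",
}


def unknown_tags(text):
    """Return tag names found in `text` that are not in the supported list."""
    found = re.findall(r"<([^<>\s]+)>", text)
    return [name for name in found if name not in MULBERRY_TAGS]


if __name__ == "__main__":
    print(unknown_tags("aaj ka episode thoda alag hai. <chuckle> ek minute"))
    print(unknown_tags("wow <yawn> that was close <gasp>"))
```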

2. voice attributes

mention any of these in your description.

age — 20s, 30s, 40s
accent (global) — american, british, middle_eastern, asian_american, indian
accent (indian regional) — hindi, punjabi, bihari, south_indian, bengali, rajasthani, marathi, gujarati, kashmiri, assamese, odia, telugu, kannada, malayali, haryanvi, chhattisgarhi
pitch — low, normal, high
timbre (realistic) — deep, warm, gravelly, smooth, raspy, nasally, throaty, harsh, whisper
timbre (creative) — adds robotic, ethereal to the realistic set
pacing — very slow, slow, conversational, brisk, fast, very_fast
emotion — neutral, energetic, excited, sad, sarcastic, dry, crying, angry
intensity — low, med, high
register — formal, neutral, casual

3. speaking role

pick a role from a domain to anchor the delivery style.

social — youtube_vlogger, social_media_creator, influencer_voice, streamer_companion
podcast — podcast_host, interviewer
commercial — ad_narrator, brand_spokesperson, product_demo_voice, sales_pitch_voice
education — elearning_instructor, kids_story_voice
support — customer_support_agent, virtual_receptionist, healthcare_assistant
entertainment — storyteller, social_media_reaction, meme_voice
corporate — explainer_video_voice, event_host, corporate_training_narrator
viral — short_form_narrator, meme_voice

4. creative-only attributes

available when you want a non-realistic timbre (e.g. characters, stylized voices).

animated_cartoon  ai_machine_voice  alien_scifi  seductively  flirty  anime
cyborg  pirate  dark_villain  demon  gangster  mafia  dramatic_narrator
mythical_godlike_magical  spy  vampire  alpha

5. examples

description: a warm 30s hindi accent voice, conversational pacing, casual
register, sounds like a podcast host walking you through a story.

transcript: aaj ka episode thoda alag hai. <chuckle> ek minute ke liye seedha
baith jao.

description: a high pitched 20s american voice, excited, very fast pacing, like
a streamer reacting live.

transcript: oh my god did you see that play, that was insane.

description: a deep gravelly low pitched 40s british voice, slow pacing, formal
register, dramatic narrator.

transcript: the door creaked open. nobody was there. and yet, something watched.

keep descriptions short and concrete. one sentence with 3 to 5 attributes beats a paragraph of vague adjectives.
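if descriptions are generated by code, a small builder keeps them in the one-sentence shape of the examples above. this is a sketch under assumptions: `describe_voice` and `VOCAB` are hypothetical, only a few of the attribute lists are validated, the exact phrasing is not a required format, and underscores are rendered as spaces because the examples write "podcast host" and "very fast".

```python
# Subset of the attribute lists above, copied as written.
VOCAB = {
    "age": {"20s", "30s", "40s"},
    "pitch": {"low", "normal", "high"},
    "pacing": {"very slow", "slow", "conversational", "brisk", "fast", "very_fast"},
    "register": {"formal", "neutral", "casual"},
}


def describe_voice(timbre, age, accent, pacing=None, register=None, role=None):
    """Weave attribute values into one description sentence."""
    for field, value in (("age", age), ("pacing", pacing), ("register", register)):
        if value is not None and value not in VOCAB[field]:
            raise ValueError(f"{value!r} is not a listed {field}")
    parts = [f"a {timbre} {age} {accent} accent voice"]
    if pacing:
        parts.append(f"{pacing} pacing")
    if register:
        parts.append(f"{register} register")
    if role:
        parts.append(f"sounds like a {role}")
    # Render identifiers such as podcast_host the way the examples write them.
    return ", ".join(parts).replace("_", " ") + "."


if __name__ == "__main__":
    print(describe_voice("warm", "30s", "hindi", pacing="conversational",
                         register="casual", role="podcast_host"))
```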