prompting guide

how to steer silk muga 1 and silk mulberry 1.5.

test prompts and voice settings in the playground before shipping them into a bot.

silk muga 1

write your prompt in latin script only — e.g. yeh ek test message hai, not यह एक टेस्ट मैसेज है.

silk muga 1 is a hinglish emotion-TTS model with two control surfaces: a paragraph tone set by the selector or a [tone] marker, and discrete inline events (laugh, chuckle, sigh) you place wherever you want them to sound.

1. paragraph tone

the voice tone selector sets the default tone for every paragraph. type [ at the start of a paragraph to insert a specific [tone] marker for that paragraph.

| tone | when to use | delivery |
| --- | --- | --- |
| [happy] | light, positive, casual chat | bright, smiling, mid-energy |
| [excited] | high-energy reactions: wins, surprises, hype | loud, fast, pitch-up |
| [sad] | loss, disappointment, grief | slow, breathy, low pitch |
| [angry] | frustration, confrontation, blame | tight, clipped, sharp |
| [neutral] | information delivery: instructions, factual | flat, even, no affect |
| [whisper] | secrets, late-night, intimate | quiet, breathy, no voiced energy |

rules

  • one tone marker per paragraph; a blank line starts a new paragraph.
  • paragraphs without an explicit marker use the selected default tone; [neutral] is the default when nothing else is selected.
  • the marker applies to the whole paragraph, even if it is inserted after the first word.
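the rules above can be sketched as a small pre-flight check. this is a minimal sketch, not part of the product: `resolve_paragraph_tones` is a hypothetical helper name, and it assumes any lowercase word in square brackets is meant as a tone marker.

```python
import re

# Tone names from the table above.
TONES = {"happy", "excited", "sad", "angry", "neutral", "whisper"}
TONE_MARKER = re.compile(r"\[([a-z]+)\]")


def resolve_paragraph_tones(text, default_tone="neutral"):
    """Return a list of (tone, paragraph_text) pairs, one per paragraph."""
    if default_tone not in TONES:
        raise ValueError(f"unknown default tone: {default_tone}")
    resolved = []
    # A blank line starts a new paragraph.
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        if not paragraph.strip():
            continue
        markers = TONE_MARKER.findall(paragraph)
        unknown = [m for m in markers if m not in TONES]
        if unknown:
            raise ValueError(f"unknown tone marker(s): {unknown}")
        if len(markers) > 1:
            raise ValueError(f"more than one tone marker in: {paragraph!r}")
        # No marker means the paragraph falls back to the default tone.
        tone = markers[0] if markers else default_tone
        # The marker covers the whole paragraph wherever it sits,
        # so strip it and tidy the whitespace it leaves behind.
        body = " ".join(TONE_MARKER.sub(" ", paragraph).split())
        resolved.append((tone, body))
    return resolved
```

calling it on three paragraphs, one with a leading marker, one with a marker after the first word, and one with no marker, returns the tones happy, sad, and the default in that order.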

2. inline events

type these directly in the transcript at the position you want the sound. three are supported:

| event | duration | sound |
| --- | --- | --- |
| <laugh> | 0.5–1.5s | loud, voiced laughter (haha, hehe) |
| <chuckle> | 0.3–0.7s | soft, amused laugh, almost a breath |
| <sigh> | 0.4–0.8s | audible exhale, breathy |

rules

  • lowercase, angle brackets, no spaces inside. <laugh>, never <Laugh> or < laugh >.
  • space on both sides when between words. never mid-word.
  • position-sensitive. <laugh> kya baat hai sounds different from kya baat hai <laugh>.
  • stack at most two. <laugh> <laugh> for a longer/harder laugh. three or more becomes unstable.
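these rules are easy to lint before sending a transcript. the sketch below is illustrative only: `lint_events` is a hypothetical helper, and it reads "stack at most two" as applying to any run of supported events, which is an interpretation rather than something the guide states. it cannot check position-sensitivity, which is a matter of taste.

```python
import re

EVENTS = {"laugh", "chuckle", "sigh"}
# Anything that looks like an angle-bracket tag, including malformed ones.
TAG_LIKE = re.compile(r"<\s*([A-Za-z_]+)\s*>")
# Three or more well-formed events in a row, separated only by whitespace.
STACKED = re.compile(r"(?:<(?:laugh|chuckle|sigh)>\s*){3,}")


def lint_events(transcript):
    """Return a list of problems; an empty list means no rule was broken."""
    problems = []
    for match in TAG_LIKE.finditer(transcript):
        raw, name = match.group(0), match.group(1)
        if name.lower() not in EVENTS:
            problems.append(f"{raw}: unsupported event")
            continue
        if raw != f"<{name.lower()}>":
            problems.append(f"{raw}: use lowercase with no spaces inside")
        before = transcript[:match.start()]
        after = transcript[match.end():]
        # Start and end of the transcript count as valid boundaries.
        if (before and not before[-1].isspace()) or (after and not after[0].isspace()):
            problems.append(f"{raw}: needs a space on both sides")
    if STACKED.search(transcript):
        problems.append("stack at most two events in a row")
    return problems
```

a transcript such as `<laugh> <laugh> kya baat hai` passes, while `kya<laugh>baat` is flagged for the missing spaces.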

3. tone–event compatibility

events have to match the tone. laughter belongs to high-energy positive states; sighs belong to low-energy reflective ones. mix them — a laugh in a sad line, a sigh in an excited shout — and the model fights itself, because the training set has almost no examples of those combinations.

tone<laugh><chuckle><sigh>
[happy]✓✓✓✓
[excited]✓✓
[sad]✓✓
[angry]~
[neutral]
[whisper]

✓✓ best · ✓ ok · ~ rare · ✗ avoid

4. good vs bad

bad

[sad] <laugh> sab kuch khatam ho gaya
[neutral] <laugh> aaj ka mausam saaf rahega
[angry] <chuckle> tumne phir galti ki
[happy] <sigh> kya mast din tha aaj
[excited] <sigh> jeet gaye!
[whisper] <laugh> sab so rahe hain

good

[happy] <laugh> Yaar tumne phir wahi joke maara!
[excited] <laugh> Bhai jeet gaye, vishwas nahi ho raha!
[sad] <sigh> Pata nahi yaar, kuch samajh nahi aata.
[whisper] <sigh> Itna lamba din tha, thak gayi hoon.
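the mismatched pairs in this section can be turned into a quick check. this is a rough sketch under an explicit assumption: the first six example lines are read as the mismatched cases and the last four as the matched ones. `MISMATCHED` and `mismatches` are hypothetical names, and the set covers only the pairs shown here, not every possible combination.

```python
import re

# (tone, event) pairs read from the mismatched examples in this section.
# Assumption: the first six example lines are the ones to avoid.
MISMATCHED = {
    ("sad", "laugh"),
    ("neutral", "laugh"),
    ("angry", "chuckle"),
    ("happy", "sigh"),
    ("excited", "sigh"),
    ("whisper", "laugh"),
}

LINE = re.compile(r"^\[([a-z]+)\]\s*(.*)$")
EVENT = re.compile(r"<(laugh|chuckle|sigh)>")


def mismatches(line):
    """Return the (tone, event) pairs in a '[tone] transcript' line that clash."""
    parsed = LINE.match(line.strip())
    if not parsed:
        return []  # no leading tone marker, nothing to compare against
    tone, body = parsed.groups()
    return [(tone, event) for event in EVENT.findall(body) if (tone, event) in MISMATCHED]
```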

5. language register

training data is hinglish — romanised hindi with english code-mixing. the model speaks that best.

avoid

  • devanagari. the model saw zero hindi script. मैं ठीक हूँ produces garbage.
  • other indian languages (tamil, bengali, marathi, bhojpuri).
  • heavy regional dialects (very bambaiyya, very punjabi).
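the devanagari rule is the one item on this list that can be checked mechanically. a minimal sketch, assuming it is enough to look for characters in the main unicode devanagari block (U+0900 to U+097F); the function names are hypothetical, and the check does not detect other indian scripts or regional dialects written in latin script.

```python
def find_devanagari(prompt):
    """Return the Devanagari characters found in prompt, in order of appearance."""
    # U+0900..U+097F is the main Unicode Devanagari block.
    return [ch for ch in prompt if "\u0900" <= ch <= "\u097f"]


def is_latin_script_prompt(prompt):
    """True when the prompt contains no Devanagari characters."""
    return not find_devanagari(prompt)
```

`is_latin_script_prompt("yeh ek test message hai")` is true, while the same check on `मैं ठीक हूँ` is false.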

6. length

silk muga 1 is built around 2 to 30 second utterances and extrapolates reliably up to around 40s. beyond that, tone drifts, pacing slips, and you start seeing repetitions or cutoffs.

  • 2 to 30s — sweet spot, one to three sentences.
  • ~30 to 40s — still works for longer monologues.
  • beyond 40s — split across prompts.
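splitting long text across prompts can be sketched as follows. the guide gives limits in seconds, not words, so the sketch relies on an assumed speaking rate: `WORDS_PER_SECOND = 2.5` is a placeholder, not a measured value for this model, and both function names are hypothetical. it splits only at sentence-ending punctuation, so a single sentence longer than the limit is left whole.

```python
import re

WORDS_PER_SECOND = 2.5  # assumed speaking rate, not from the guide; measure your own
MAX_SECONDS = 30.0      # upper end of the sweet spot

# Tokens that are not spoken words: inline events and tone markers.
NON_WORD = re.compile(r"<[a-z_]+>|\[[a-z]+\]")


def estimate_seconds(text):
    """Rough duration estimate from the spoken-word count."""
    words = [w for w in text.split() if not NON_WORD.fullmatch(w)]
    return len(words) / WORDS_PER_SECOND


def split_for_length(text, max_seconds=MAX_SECONDS):
    """Greedily pack whole sentences into chunks of at most max_seconds each."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and estimate_seconds(candidate) > max_seconds:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

each chunk is sent as its own prompt. the sketch does not copy a [tone] marker onto later chunks, so repeat the marker (or rely on the selector default) when a paragraph is split.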

7. examples

| tone | transcript |
| --- | --- |
| neutral | Aapka order place ho gaya hai. Confirmation SMS aapke registered number par bhej diya gaya hai. |
| happy | <chuckle> Pata hai tumne kya kiya kal? Pure office mein viral ho gaya. |
| excited | <laugh> Bhai sun, abhi abhi pata chala, wo job mil gayi mujhe! |
| sad | <sigh> Yaar, samajh sakti hoon. Itna kuch hua hai, time lagega. |
| whisper | Phir achanak, kuch khatka hua. Maine darwaza dekha, koi nahi tha. |
| angry | Tumne phir wahi kiya. Maine kitni baar bola tha aisa mat karo. |

temperature 0.7 is the most reliable inference setting for the v3 fine-tune the API ships.

silk mulberry 1.5

write the voice in natural language. the model picks up on attributes you mention. the lists below are the vocabulary it understands; weave them into a single sentence rather than listing them as fields.

1. inline tags

drop these tags anywhere in the text to trigger a sound. they render as part of the performance, not as words.

<laugh>  <laugh_harder>  <sigh>  <chuckle>  <gasp>  <angry>  <excited>
<whisper>  <cry>  <scream>  <sing>  <snort>  <exhale>  <gulp>  <giggle>
<sarcastic>  <curious>

2. voice attributes

mention any of these in your description.

  • age: 20s, 30s, 40s
  • accent (global): american, british, middle_eastern, asian_american, indian
  • accent (indian regional): hindi, punjabi, bihari, south_indian, bengali, rajasthani, marathi, gujarati, kashmiri, assamese, odia, telugu, kannada, malayali, haryanvi, chhattisgarhi
  • pitch: low, normal, high
  • timbre (realistic): deep, warm, gravelly, smooth, raspy, nasally, throaty, harsh, whisper
  • timbre (creative): adds robotic, ethereal to the realistic set
  • pacing: very slow, slow, conversational, brisk, fast, very_fast
  • emotion: neutral, energetic, excited, sad, sarcastic, dry, crying, angry
  • intensity: low, med, high
  • register: formal, neutral, casual

3. speaking role

pick a role from a domain to anchor the delivery style.

  • social: youtube_vlogger, social_media_creator, influencer_voice, streamer_companion
  • podcast: podcast_host, interviewer
  • commercial: ad_narrator, brand_spokesperson, product_demo_voice, sales_pitch_voice
  • education: elearning_instructor, kids_story_voice
  • support: customer_support_agent, virtual_receptionist, healthcare_assistant
  • entertainment: storyteller, social_media_reaction, meme_voice
  • corporate: explainer_video_voice, event_host, corporate_training_narrator
  • viral: short_form_narrator, meme_voice

4. creative-only attributes

available when you want a non-realistic timbre (e.g. characters, stylized voices).

animated_cartoon  ai_machine_voice  alien_scifi  seductively  flirty  anime
cyborg  pirate  dark_villain  demon  gangster  mafia  dramatic_narrator
mythical_godlike_magical  spy  vampire  alpha

5. examples

description: a warm 30s hindi accent voice, conversational pacing, casual register, sounds like a podcast host walking you through a story.
transcript: aaj ka episode thoda alag hai. <chuckle> ek minute ke liye seedha baith jao.

description: a high pitched 20s american voice, excited, very fast pacing, like a streamer reacting live.
transcript: oh my god did you see that play, that was insane.

description: a deep gravelly low pitched 40s british voice, slow pacing, formal register, dramatic narrator.
transcript: the door creaked open. nobody was there. and yet, something watched.

keep descriptions short and concrete. one sentence with 3 to 5 attributes beats a paragraph of vague adjectives.
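one way to keep descriptions to that shape is to build them from the attribute vocabulary. the sketch below is illustrative only: `describe_voice` and `VOCAB` are hypothetical names, the sentence template is just one possible wording, and it covers the realistic attributes while leaving out intensity, roles, and creative attributes.

```python
# Vocabulary copied from the voice attributes list (realistic set only).
VOCAB = {
    "age": {"20s", "30s", "40s"},
    "accent": {
        "american", "british", "middle_eastern", "asian_american", "indian",
        "hindi", "punjabi", "bihari", "south_indian", "bengali", "rajasthani",
        "marathi", "gujarati", "kashmiri", "assamese", "odia", "telugu",
        "kannada", "malayali", "haryanvi", "chhattisgarhi",
    },
    "pitch": {"low", "normal", "high"},
    "timbre": {"deep", "warm", "gravelly", "smooth", "raspy", "nasally",
               "throaty", "harsh", "whisper"},
    "pacing": {"very slow", "slow", "conversational", "brisk", "fast", "very_fast"},
    "emotion": {"neutral", "energetic", "excited", "sad", "sarcastic", "dry",
                "crying", "angry"},
    "register": {"formal", "neutral", "casual"},
}


def describe_voice(**attrs):
    """Build a one-sentence description from 3 to 5 known attributes."""
    if not 3 <= len(attrs) <= 5:
        raise ValueError("use 3 to 5 attributes")
    for key, value in attrs.items():
        if key not in VOCAB:
            raise ValueError(f"unknown attribute: {key}")
        if value not in VOCAB[key]:
            raise ValueError(f"unknown {key}: {value}")
    # Write tokens the way they would be spoken: underscores become spaces.
    say = {key: value.replace("_", " ") for key, value in attrs.items()}
    head = []
    if "timbre" in say:
        head.append(say["timbre"])
    if "pitch" in say:
        head.append(f"{say['pitch']} pitched")
    if "age" in say:
        head.append(say["age"])
    if "accent" in say:
        head.append(f"{say['accent']} accent")
    head.append("voice")
    article = "an" if head[0][0] in "aeiou" else "a"
    parts = [f"{article} " + " ".join(head)]
    if "emotion" in say:
        parts.append(say["emotion"])
    if "pacing" in say:
        parts.append(f"{say['pacing']} pacing")
    if "register" in say:
        parts.append(f"{say['register']} register")
    return ", ".join(parts)
```

for example, `describe_voice(timbre="warm", age="30s", accent="hindi", pacing="conversational", register="casual")` returns `a warm 30s hindi accent voice, conversational pacing, casual register`, to which a role phrase can be appended by hand.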