The Dubbing Work Area: Fine-tuning AI Voice Tones and Emotions

Learn how to direct AI voice performance in Stra.ai. A practical guide to the voice directing field, Convert Tone, and how to write prompts that get natural, expressive dubbing results using Gemini TTS.
Mar 30, 2026
Getting a technically correct dub is straightforward. Getting a dub that sounds natural, expressive, and matched to the emotion of the original performance takes a bit more direction. This guide covers everything in the dubbing work area that affects how the AI voice actually sounds, including the voice directing field, the Convert Tone button, and how to write prompts that get results.


The three columns

Every dialogue segment in the work area has three columns.

The left column is the source, the original language transcription.

The middle column is the translation, the text the AI will read aloud.

The right column is the voice directing field. This is where you tell the AI how to perform the line, not what to say but how to say it.

The voice directing field is powered by Gemini TTS, which means it responds to natural language instructions much as a human actor responds to a voice director's brief.


How to write voice directing prompts

The voice directing field accepts plain English. You do not need special syntax. Write it the way you would tell an actor what you want.

A useful starting template is:

"Speak in a [emotion or attitude] way."

Then add specifics on top:

"Speak in an excited, slightly breathless way. Fast paced, like breaking news."

"Speak in a calm and authoritative way. Slow and deliberate, like a documentary narrator."

"Speak in a mocking, condescending way. Bored but sharp."

"Speak drowsily, as if half asleep. Trailing off at the end of sentences."

"Speak nonchalantly. Completely unbothered."

A few things worth knowing about how Gemini TTS interprets prompts:

Use "shout" for loud, forceful delivery rather than "scream." The model responds more reliably to "shout" as a direction.

Pacing responds well to explicit instructions. "Speak fast" and "speak slowly" both work as written. You can also describe the context instead: "speak as if running out of time" or "speak as if explaining to a child."

Emotion stacks. You can combine multiple qualities in one prompt and the model will try to balance them. "Warm but professional" or "excited but controlled" both work.

The more specific and coherent the direction, the better the result. Vague prompts like "sound natural" give the model less to work with than "speak conversationally, like catching up with a friend."
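
If you prepare directions for many segments outside the work area, for example in a script or spreadsheet export, the template above can be sketched as a small helper. This is only an illustration: build_direction is a hypothetical name, not part of Stra.ai or Gemini TTS, and the output is ordinary text that you would still paste into the voice directing field yourself.

```python
def build_direction(attitude: str, *details: str) -> str:
    """Compose a voice direction from the "Speak in a ... way." template.

    Hypothetical helper for drafting prompts; it calls no Stra.ai or Gemini API.
    """
    attitude = attitude.strip()
    if not attitude:
        raise ValueError("attitude must not be empty")
    # Pick "a" or "an" from the first letter. This is a rough heuristic only.
    article = "an" if attitude[0].lower() in "aeiou" else "a"
    sentences = [f"Speak in {article} {attitude} way."]
    for detail in details:
        # Normalise each extra detail into its own capitalised sentence.
        detail = detail.strip().rstrip(".")
        if detail:
            sentences.append(detail[0].upper() + detail[1:] + ".")
    return " ".join(sentences)


print(build_direction("excited, slightly breathless", "fast paced, like breaking news"))
print(build_direction("calm and authoritative", "slow and deliberate, like a documentary narrator"))
```

Keeping the attitude and the specifics as separate arguments makes it easy to reuse one attitude across a scene while varying pacing or context per line.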


Generating and regenerating audio

Once you have your translation and voice direction set for a segment, click Generate dub in the speaker panel below the work area. The AI generates the audio for that segment.

If you are not happy with the result, adjust the voice directing field and generate again. Each generation produces a fresh take. You can also use Generate selected in the top right corner of the work area to generate multiple segments at once after setting their directions.

If the tone is right but the voice sounds slightly unstable or inconsistent between segments, try generating again without changing the direction. Small variations between generations are normal and a fresh attempt often resolves them.


The Convert Tone button

The Convert Tone button sits in the top right corner of the work area next to Generate selected. It applies a speech style conversion to the translation text itself, changing the register and tone of the written words before audio generation.

Clicking Convert Tone opens a panel with style options. By default, the styles shown are in Korean, since Stra.ai is built for Korean-language workflows. The default presets cover formal honorifics, narration style, and conversational interview style.

You can add your own custom style by clicking Manage Styles at the bottom of the panel. Give it a name and write a conversion instruction in plain language. For example:

"Change everything to formal usted address" for Spanish content targeting formal audiences.

"Convert to polite formal register" for Japanese or Korean content.

"Use informal tú address throughout" for Spanish content targeting younger audiences.

This is especially useful for languages with multiple levels of formality or honorific systems, where a direct translation may come out in the wrong register for the target audience even if the words are technically correct.
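
Teams that dub into several languages may want to keep their conversion instructions in one place before entering them under Manage Styles. The sketch below is one hypothetical way to do that bookkeeping. The STYLES table, its language and audience keys, and pick_style are assumptions made for illustration; Stra.ai itself only needs the style name and the plain-language instruction.

```python
# Hypothetical lookup of Convert Tone instructions, keyed by (language, audience).
# The instruction strings follow the examples above; the keys are illustrative only.
STYLES = {
    ("es", "formal"): "Change everything to formal usted address",
    ("es", "young"): "Use informal tú address throughout",
    ("ja", "formal"): "Convert to polite formal register",
    ("ko", "formal"): "Convert to polite formal register",
}


def pick_style(language: str, audience: str) -> str:
    """Return the stored instruction, or raise if no style has been defined."""
    key = (language.lower(), audience.lower())
    if key not in STYLES:
        raise KeyError(f"no conversion style defined for {key}")
    return STYLES[key]


print(pick_style("es", "formal"))
```

Raising an error for a missing combination, rather than falling back to a default, keeps a wrong register from slipping through unnoticed.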

ElevenLabs vs Gemini TTS

The voice model you selected when creating the project determines which engine generates the audio.

If you chose ElevenLabs, the voice directing field still appears but the performance is shaped more by the voice clone than by the direction text. ElevenLabs excels at maintaining the character and identity of the original speaker's voice.

If you chose Gemini TTS, the voice directing field has full effect. Gemini TTS is built to respond to natural language performance direction and gives you precise control over tone, pacing, emotion, and delivery style. If directorial control over the performance matters for your project, Gemini TTS is the right choice.


What to do next

Your translations are written, your voice directions are set, and your audio has been generated. The next step is exporting the finished project.

Continue here: High-Fidelity Export: Downloading Dubbed MP4s and Clean Audio Tracks
