The model is sensitive to the wider context surrounding each utterance: it assesses whether something makes sense by how it ties to the preceding and succeeding text. This zoomed-out perspective lets it intone longer fragments properly, overlaying a train of thought that stretches across multiple sentences with a unifying emotional pattern.
Here are a few tips for producing emotions:
- Context is key for generating specific emotions. If you input laughing or otherwise funny text, you are likely to get a happy-sounding output; the same goes for anger, sadness, and other emotions.
- Punctuation and voice settings play the leading role in how the output is delivered.
- Add emphasis by putting the relevant words/phrases in quotation marks.
- For speech generated with a cloned voice, the output replicates the speaking style of the samples you upload for cloning. If the speech in the uploaded sample is monotone, the model will struggle to produce expressive output.
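The tips above can be sketched as simple text-preparation helpers. This is a minimal, illustrative example: the helper names `with_context` and `add_emphasis`, and the framing sentences, are assumptions for demonstration, not part of any official API.

```python
def with_context(context: str, line: str) -> str:
    """Prepend framing text so the model can infer the intended emotion."""
    return f"{context} {line}"

def add_emphasis(text: str, *phrases: str) -> str:
    """Wrap each given phrase in quotation marks to mark it for emphasis."""
    for phrase in phrases:
        text = text.replace(phrase, f'"{phrase}"')
    return text

# Steer toward a happy delivery by framing the line with laughter.
happy = with_context(
    "She burst out laughing.",
    "That is the funniest thing I have ever heard!",
)

# Emphasize a single word within the sentence.
emphatic = add_emphasis("I told you never to come back here.", "never")
```

The prepared strings would then be passed to the speech-generation call as the input text; the surrounding context can be trimmed from the audio afterwards if it should not be spoken.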
These tips improve the odds of producing the intended emotion, but they do not guarantee the result. We will be introducing features that allow emotions to be controlled directly within the text.