

Wednesday, February 12, 2025

Wiki Wednesday #310 - VoiceFace's Astrid / アストリッド

I had a rule when I started that I wouldn't cover all of the UTAUs in a family in a row, because that might get boring. I have fully ignored that rule by now!


Astrid is a shape-shifting alien who is enamored with Earth. She generally appears to be around nineteen years old.

Who is Astrid?

Art from site.

Let's explain Diffsinger vocal modes! 

This information may change. The original guidance of "you cannot have overlapping phonemes" (English 'r' vs. Japanese 'r'; both will end up wonky if you label them both as 'r') seems to be getting replaced with "each folder keeps its own phonemes in isolation." I've only seen that mentioned in the comments of a GitHub issues page, along with a note that it breaks inputting phonemes into notes without a lot of extra typing (needing to type "a/zh" instead of just "a", or something like that). I have no interest in breaking that feature! (Note: this has since been fixed.)
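To make the "a/zh" situation concrete, here's a little toy sketch in Python. This is not OpenUtau's or Diffsinger's actual code, and the phoneme tables and `resolve` function are made up for illustration; it just shows why a plain symbol that exists in two language folders becomes ambiguous, while a language-tagged one doesn't.

```python
# Toy sketch (hypothetical, not real OpenUtau code): per-language phoneme
# tables, and why a shared symbol like "r" needs a language tag.

PHONEMES = {
    "en": {"r", "ay", "ey"},   # made-up English folder
    "ja": {"r", "a", "k"},     # made-up Japanese folder
}

def resolve(symbol: str) -> str:
    """Resolve 'symbol' or 'symbol/lang' to a language-tagged phoneme."""
    if "/" in symbol:
        sym, lang = symbol.split("/")
        if sym in PHONEMES.get(lang, set()):
            return f"{lang}:{sym}"
        raise KeyError(f"{sym!r} not in language {lang!r}")
    owners = [lang for lang, table in PHONEMES.items() if symbol in table]
    if len(owners) == 1:
        return f"{owners[0]}:{symbol}"  # unambiguous: no extra typing needed
    raise KeyError(f"{symbol!r} is ambiguous across {owners}")

print(resolve("ay"))    # only the English folder has "ay"
print(resolve("r/ja"))  # "r" exists in both folders, so it must be tagged
```

Typing the plain symbol only works while exactly one folder owns it, which is exactly the extra-typing annoyance described above.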

When you use cross-language synthesis, you are technically using vocal modes. This is probably the best way to explain them!

Let's pretend you're silly and we're in a much simpler universe. This is an abstraction! You are training your data set together with someone else's English data set, and you recorded "a" exactly once, at C4. In reality this would probably break things, but it keeps the example simple.

Diffsinger learns what your voice sounds like at C4. When you play your vocal mode at C4, suddenly YOU are the voice that is singing! The English data set taught Diffsinger what ay and ey sound like. Diffsinger knows what you sound like at C4, and it knows what ay sounds like, so it can make your voice say "ay" at C4 even though you only said "a".

However, Diffsinger will only know what you sound like at C4. If you trained your one little data set alone, the model would growl at you and just make weird noises if you tried to make it sing much of anything. But when you train it alongside that English data set and then stray from C4... it's the English data set you hear singing through your vocal mode. Diffsinger can't know what you sound like at anything other than C4, but it does know what the English data set sounds like there! So it falls back on that knowledge, and you end up with something that sounds nothing like you!
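The abstraction above can be sketched as a coverage table. This is a toy illustration, not how Diffsinger actually works internally: I'm just treating training data as a set of (speaker, phoneme, pitch) facts to show which questions the combined model can and can't answer about your voice.

```python
# Toy illustration (not Diffsinger internals): what each training set
# "teaches" the model, written as (speaker, phoneme, pitch) coverage.

coverage = {
    ("you", "a", "C4"),                                   # your one recording
    ("english", "a", "C4"), ("english", "ay", "C4"),      # English data set
    ("english", "a", "G4"), ("english", "ay", "G4"),
}

def knows_timbre(speaker, pitch):
    """Has the model heard this speaker at this pitch at all?"""
    return any(s == speaker and p == pitch for s, _, p in coverage)

def knows_phoneme(phoneme):
    """Has the model heard this phoneme from anyone?"""
    return any(ph == phoneme for _, ph, _ in coverage)

# "ay" at C4 in your voice: phoneme known + your timbre known at C4 -> works.
print(knows_phoneme("ay") and knows_timbre("you", "C4"))   # True
# Anything at G4 in your voice: your timbre at G4 was never recorded,
# so the model leans on the English speaker's sound instead.
print(knows_timbre("you", "G4"))                           # False
```

The gap in the second check is exactly where "something that sounds nothing like you" comes from.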

From a variance perspective, my current understanding is that all vocal modes more or less share the same variance model. Not gonna lie, I'm not fully certain what the variance model does. I do know what the duration model does, however. As far as I know, the entire model shares one duration model, which determines how long each phoneme will be in the output singing. One of the big parts of changing your singing style is changing how long your phonemes are. That's a small nuance that gets lost when training a multispeaker model... but, as we saw last week, not a big enough issue to warrant all of the time and money needed to make separate appends for most people!
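Here's a tiny sketch of the kind of job a duration model does. Everything here is made up for illustration (the consonant list, the 60 ms base, the `consonant_scale` knob); real duration models are learned, not rule-based. It just shows how "style" can live in phoneme lengths: crisper singing clips the consonants, softer singing stretches them.

```python
# Toy sketch (hypothetical rules, not a learned model): split one note's
# time among its phonemes, with a style knob that scales consonant length.

def assign_durations(phonemes, note_ms, consonant_scale=1.0):
    """Return {phoneme: ms}. Consonants get a scaled base duration;
    the vowel absorbs whatever time is left in the note."""
    CONSONANTS = {"k", "s", "t"}
    base = 60 * consonant_scale     # made-up 60 ms consonant baseline
    out = {}
    remaining = note_ms
    for ph in phonemes:
        if ph in CONSONANTS:
            out[ph] = base
            remaining -= base
        else:
            out[ph] = None          # vowel: fill in after consonants
    for ph, v in out.items():
        if v is None:
            out[ph] = remaining
    return out

crisp = assign_durations(["k", "a"], 500, consonant_scale=0.8)  # short "k"
soft = assign_durations(["k", "a"], 500, consonant_scale=1.5)   # long "k"
```

Same note, same phonemes, different style; only the durations moved, which is the nuance a single shared duration model averages away.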

So... vocal modes are actually cool and magical and can make really neat morphs between voices. You can even use a curve to morph the different modes together! Separate appends wouldn't morph together like that.
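A rough sketch of why a curve can morph vocal modes but separate appends can't: if each mode is represented as a vector inside one model (the two-number "vectors" and the `morph` function below are made up for illustration), a 0-to-1 curve can blend them frame by frame. Separate appends are separate voicebanks, so there's nothing in between to interpolate.

```python
# Toy sketch (hypothetical): morphing two vocal-mode vectors with a curve.

def morph(mode_a, mode_b, curve):
    """Linearly interpolate between two mode vectors at each curve value,
    where 0.0 means all mode_a and 1.0 means all mode_b."""
    return [
        [a * (1 - w) + b * w for a, b in zip(mode_a, mode_b)]
        for w in curve
    ]

soft = [0.0, 1.0]    # made-up "soft" mode vector
power = [1.0, 0.0]   # made-up "power" mode vector

# Curve sweeps from fully soft to fully power across four frames.
frames = morph(soft, power, [0.0, 0.33, 0.66, 1.0])
```

Every intermediate frame is a real blend of the two modes, which is the smooth morph you hear when you draw a curve between them.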

How Is Astrid's Bank?

Astrid has a three-pitch CVVC bank. She has a lovely, sweet voice!



Where can I download Astrid?

You can find her on her official site. She is lovely!
