When it comes to vocal synths, we should strive to remember to just have fun! This article goes into detail about the handling of syllables like "hya" and "kya" in Japanese. If you are happy with how your UTAU or DiffSinger model sings in Japanese, then please ignore this article.
Introduction
Okay, so if the goal is to have the best possible product, it is important that I explain a quirk of DiffSinger that existed at the time of writing.
I trained two Lewy banks using the same Japanese Lewy data and the same English Nicky data. The difference was the labels. The first only changed phoneme names when there was an obvious conflict. I knew it would break things if there were both Japanese Rs and English Rs going by the same name!
The second was multidict, meaning each language had its own set of phoneme names. There were no shared phonemes aside from SP, AP, and one that I shared by mistake. (Maybe there were a few more, this was a long time ago.)
With the first model, Lewy struggled to pronounce things naturally. This didn't make much sense to me at first.
But once I used the multidict model, I understood the issue.
It's tempting to think that the pronunciation of each phoneme is a mishmash of every instance in the whole model, but it seems to be siloed so that each voice pronounces each phoneme based on its own data... provided that data exists.
Lewy's voice provider never recorded English, so there was obviously never an instance of Lewy saying "t" as "ch". Therefore she couldn't naturally say "tree" in the original model.
However, in the multidict model, Lewy did not have "en/t". Because of that, she took "en/t" entirely from Nicky's data. She could say all forms of "en/t" naturally. She could say tree as easily as she said tea.
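To make the naming scheme concrete, here is a minimal sketch of how multidict-style labels could be produced. The helper name `to_multidict`, the `ja` tag, and the example phoneme lists are my own assumptions for illustration; only the "en/t" naming style and the shared SP and AP come from the models described above.

```python
# Hypothetical helper: prefix every phoneme with a language tag so that
# Japanese and English phonemes never share a name in the training labels.
SHARED = {"SP", "AP"}  # silence and breath stay shared across languages


def to_multidict(phonemes, lang, shared=SHARED):
    """Return phoneme names in the "lang/phoneme" style, e.g. "en/t"."""
    return [p if p in shared else f"{lang}/{p}" for p in phonemes]


# Example label sequences (made up for illustration).
english = to_multidict(["t", "r", "iy", "SP"], "en")
japanese = to_multidict(["t", "e", "AP"], "ja")

print(english)   # ['en/t', 'en/r', 'en/iy', 'SP']
print(japanese)  # ['ja/t', 'ja/e', 'AP']
```

Because "en/t" and "ja/t" are now different names, a speaker with no English data has no "en/t" of their own and has to borrow it from whoever does.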
If you are training a multispeaker model alongside a data set recorded by a native Japanese speaker, I feel it is safe to tell a good portion of users to, in a sense, label their data sets correctly.
On labeling overseas data sets
What does that mean?
If you record "hya" as "h" + "ya", then label it as "h", then "y", then "a". "hy" is not "h + y".
This is what "hy" sounds like:
source: Wikipedia
If you don't want to record it correctly, then that's fine! If you want your model to have an accent, that's fine! You can actually split the difference here. You can label it correctly by using "h" + "y" and write a dsdict that only uses the phonemes you actually included. That way, when you type hya in a note in OpenUtau, your model says "h y a". This is a lovely compromise, because if you were to train alongside a native Japanese data set in the future, you would inherit the correctly palatalized consonants.
This is where the confusion about whether to label ki as [ky i] or [k i] comes from. If you think that kya is k + y + a, then yes, ki is not k y i. However, the hy example shows that hi is not [h i]... It is the completely distinct [hy i], just like in hya. This pattern holds through basically all of Japanese. If you followed the wisdom of mi = [m i] to the full extent, you would be obligated to label chi as [t i] and shi as [s i]. This is because ch is roughly the palatalization of t, and sh is roughly the palatalization of s.
If you actually pronounce ki as [k i], then yes, label it that way. Label what is there and not what technically should be there.
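Here is a rough sketch of that compromise in code. Everything in it is an assumption made for illustration: the `READINGS` table, the helper names, and the tab-separated "syllable, then phonemes" line format. Check the dictionary format your own tools expect before relying on it.

```python
# Hypothetical table: each syllable lists its palatalized reading first
# and its decomposed "h + y + a" style reading second.
READINGS = {
    "hya": (["hy", "a"], ["h", "y", "a"]),
    "kya": (["ky", "a"], ["k", "y", "a"]),
    "ha": (["h", "a"], ["h", "a"]),
}


def pick_reading(syllable, trained_phonemes):
    """Use the palatalized reading only if the bank has every phoneme in it."""
    palatalized, decomposed = READINGS[syllable]
    if all(p in trained_phonemes for p in palatalized):
        return palatalized
    return decomposed


def dictionary_lines(trained_phonemes):
    """Format every entry as "syllable<TAB>phonemes" (assumed layout)."""
    return [
        f"{syllable}\t{' '.join(pick_reading(syllable, trained_phonemes))}"
        for syllable in READINGS
    ]


# A bank recorded without palatalized consonants falls back to h + y + a.
for line in dictionary_lines({"h", "k", "y", "a"}):
    print(line)
```

If the same bank is later trained alongside data that does contain "hy" and "ky", passing the larger phoneme set to `dictionary_lines` switches those entries to the palatalized readings without touching the labels.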
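The rule of thumb above can be sketched as a tiny lookup. The `PALATALIZED` table and the `label_before_i` helper are hypothetical names of mine, and the table only covers the consonants mentioned in this article; it is not a complete phoneme set.

```python
# Hypothetical mapping from a plain consonant to the label of its
# palatalized form, limited to the consonants discussed in this article.
PALATALIZED = {"k": "ky", "h": "hy", "m": "my", "t": "ch", "s": "sh"}


def label_before_i(consonant, speaker_palatalizes=True):
    """Label a consonant + "i" syllable according to what was actually sung."""
    if speaker_palatalizes:
        # Native-style pronunciation: ki is [ky i], just as kya is [ky a].
        return [PALATALIZED.get(consonant, consonant), "i"]
    # The singer really said a plain consonant, so label what is there.
    return [consonant, "i"]


print(label_before_i("k"))  # ['ky', 'i']
print(label_before_i("t"))  # ['ch', 'i']
print(label_before_i("k", speaker_palatalizes=False))  # ['k', 'i']
```

The `speaker_palatalizes` flag stands in for your own ears: it should reflect what is in the recording, not what the syllable is "supposed" to be.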
Now that that's over, let's look at some palatalized consonants.
Demonstrations of phonemes
ch, sh, j
It feels unnecessary to demonstrate the differences between these. t is hard to mistake for ch, and s is hard to mistake for sh. The reason "ti" and "si" don't have a dedicated Hiragana symbol is that naturally all consonants preceding an "i" will be palatalized.
Now, why is the palatalized form of "z" not the rather rare sound in "Asia"?
Generally, and I was shocked when I realized this, Japanese people aren't saying "z". They're saying "dz". Sometimes they can say "z" as an American would, and sometimes they make the sound in "Asia" when they say "ja". The Japanese Vocaloid phoneme set accounted for this so that you could pick and choose whether it was dz or z. I believe the default was "z".
I can't remember Japanese UTAU banks ever using "dz" exclusively for "z". However, labeling singing recorded by native Japanese speakers made me realize that the natural state of "zu" is actually "dzu".