
Monday, May 19, 2025

On Labeling Palatalized Consonants for Japanese

When it comes to vocal synths, we should strive to remember to just have fun! This article will go into detail about the handling of syllables like "hya" and "kya" in Japanese. If you are happy with how your UTAU or DiffSinger model sings in Japanese, then please ignore this article.

 Introduction

Okay, so from the perspective of making the best possible product, it is important that I explain a quirk of DiffSinger that existed at the time of writing this.

I trained two Lewy banks using the same Japanese Lewy data and the same English Nicky data. The difference was the labels. The first only changed phoneme names when there was an obvious conflict; I knew it would break things if the Japanese R and the English R went by the same name!

The second was multidict. There were no shared phonemes aside from SP, AP, and a little mistake I made. (Maybe there were a few more; this was a long time ago.)

With the first model, Lewy struggled to pronounce things naturally. This didn't make much sense to me at first. 

But once I used the multidict model, I understood the issue. 

It's tempting to think that the pronunciation of each phoneme is a mishmash of every instance in the whole model, but it seems to be siloed so that each voice pronounces each phoneme based on its own data... given the data exists.

Lewy's voice provider never recorded English, so there was obviously never an instance of Lewy saying "t" as "ch". Therefore she couldn't naturally say "tree" in the original model.

However, in the multidict model, Lewy did not have "en/t". Because of that, she took "en/t" entirely from Nicky's data. She could say all forms of "en/t" naturally. She could say tree as easily as she said tea.
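To make that siloing idea concrete, here is a rough sketch in Python of what language-prefixed phoneme labels look like. The "en/" prefix follows the "en/t" example above; the "ja/" prefix and the specific phoneme names are placeholders I made up for illustration.

```python
# Purely illustrative: keep two languages' phonemes siloed by prefixing them,
# while silence (SP) and breath (AP) stay shared between the voices.
# "en/" follows the "en/t" example above; "ja/" and the phoneme names are placeholders.

SHARED = {"SP", "AP"}

def prefix_phonemes(ph_seq, lang):
    """Prefix every non-shared phoneme with its language tag."""
    return [ph if ph in SHARED else f"{lang}/{ph}" for ph in ph_seq]

print(prefix_phonemes(["SP", "t", "r", "iy", "AP"], "en"))
# ['SP', 'en/t', 'en/r', 'en/iy', 'AP']
print(prefix_phonemes(["SP", "t", "i", "AP"], "ja"))
# ['SP', 'ja/t', 'ja/i', 'AP']
```

Because the English "t" and the Japanese "t" end up with different names, the model never blends them, which is exactly the siloing described above.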

If you are training a multispeaker model that includes a data set recorded by a native Japanese speaker, I feel it is safe to tell a good portion of users to, in a sense, label their data set correctly.

On labeling overseas data sets

What does that mean?

If you record "hya" as "h" + "ya", then label it as "h", then "y", then "a". The real "hy" is not "h + y".

This is what "hy" sounds like:

[audio clip, source: Wikipedia]

If you don't want to record it correctly, then that's fine! If you want your model to have an accent, that's fine! You can actually split the difference here: label what you recorded accurately, using "h" + "y", and write a dsdict that only uses phonemes you actually included. That way, when you type "hya" in a note in OpenUtau, your model says "h y a". This is a lovely compromise, because it means that if you were to train alongside a native Japanese data set in the future, you would inherit the correctly palatalized consonants.
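Here is a small, hypothetical sketch of what generating that kind of dictionary could look like. I'm assuming a YAML dsdict with a list of grapheme-to-phoneme entries; the exact schema depends on your phonemizer, so treat the field names as placeholders and check the documentation for whatever you're actually using.

```python
# Hypothetical sketch: build a tiny dictionary that maps graphemes to the
# phonemes that were actually recorded, and never asks for a phoneme
# (like a true "hy") that the data set does not contain.
# The schema below is an assumption; check your phonemizer's dsdict format.
import yaml  # PyYAML

entries = [
    {"grapheme": "hya", "phonemes": ["h", "y", "a"]},
    {"grapheme": "hi", "phonemes": ["h", "i"]},
]

with open("dsdict.yaml", "w", encoding="utf-8") as f:
    yaml.safe_dump({"entries": entries}, f, sort_keys=False, allow_unicode=True)
```

The point is only that the dictionary sticks to phonemes your recordings actually cover.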

This is where the confusion about whether to label ki as [ky i] or [k i] comes from. If you think that kya is k + y + a, then yes, ki is not [k y i]. However, looking at the hy example shows that hi is not [h i]... it is the completely distinct [hy i], just like in hya. This pattern holds through basically all of Japanese. If you followed the wisdom of mi = [m i] to its full extent, you would be obligated to label chi as [t i] and shi as [s i], because ch is roughly the palatalization of t and sh is roughly the palatalization of s.

If you pronounce ki as [k i], then yes, label it that way. Label what is there and not what technically should be there.
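For anyone who likes seeing the pattern from the last few paragraphs in one place, here it is as a tiny, illustrative mapping. The label names are just the romanizations used in this article, not any official phoneme set.

```python
# Illustrative only: how the "i" row and the ya row line up if the palatalized
# consonant is treated as its own phoneme, per the argument above.
LABELS = {
    "ka": ["k", "a"],  "ki":  ["ky", "i"],  "kya": ["ky", "a"],
    "ha": ["h", "a"],  "hi":  ["hy", "i"],  "hya": ["hy", "a"],
    "ma": ["m", "a"],  "mi":  ["my", "i"],  "mya": ["my", "a"],
    "ta": ["t", "a"],  "chi": ["ch", "i"],  "cha": ["ch", "a"],
    "sa": ["s", "a"],  "shi": ["sh", "i"],  "sha": ["sh", "a"],
}

for syllable, phonemes in LABELS.items():
    print(f"{syllable}: {' '.join(phonemes)}")
```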

Now that that's over, let's look at some palatalized consonants.

Demonstrations of phonemes

The demonstrations are taken from LICCA's Japanese data set. Because it can be a bit hard to find things like "mya" or "pyu", most palatalized consonants here will be followed by "i". This may make it seem like they are never followed by a tiny "y" sound. Sometimes there will be a movement of the tongue that you can visibly see; it's best to ignore it and treat it as part of the vowel.

All images come from Praat.
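The images in this post come from the Praat GUI itself, but if you would rather script the same kind of picture, the praat-parselmouth library can draw a similar spectrogram. This is only a sketch; "sample.wav" is a placeholder file name and the display settings are just the library defaults.

```python
# A sketch of drawing a Praat-style spectrogram with praat-parselmouth.
# "sample.wav" is a placeholder; the images in this post were made in Praat itself.
import numpy as np
import matplotlib.pyplot as plt
import parselmouth

snd = parselmouth.Sound("sample.wav")
spectrogram = snd.to_spectrogram()

# Convert power to dB so darker really does mean louder, like in Praat.
sg_db = 10 * np.log10(spectrogram.values)
plt.pcolormesh(spectrogram.x_grid(), spectrogram.y_grid(), sg_db,
               vmax=sg_db.max(), cmap="Greys")
plt.xlabel("time [s]")
plt.ylabel("frequency [Hz]")
plt.show()
```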

ch, sh, j

It feels unnecessary to demonstrate the differences between these. t is hard to mistake for ch, and s is hard to mistake for sh. The reason "ti" and "si" don't have dedicated Hiragana symbols is that naturally every consonant preceding an "i" will be palatalized.

Now, why is the palatalized form of "z" not the rather rare sound in "Asia"?

Generally, and I was shocked when I realized this, Japanese people aren't saying "z". They're saying "dz". Sometimes they say "z" as an American would, and sometimes they make the sound in "Asia" when they say "ja". The Japanese Vocaloid phoneme set accounted for this so that you could pick and choose whether it was dz or z. I believe the default was "z".

I can't remember Japanese UTAU banks ever using "dz" exclusively for "z". However, labeling singing recorded by native Japanese speakers made me realize that the natural state of "zu" is actually "dzu". 

k

LICCA's "k" sounds are a bit sharper than most, but the contrast between k and ky is very apparent.

Here is an example of her k sounds:

Here is an example of her ky sounds:

As you can see, not only is ky louder, it is also longer. You can see a difference between k and ky in the spectrogram.


That darker band is not present in k, even when a k is longer or louder because of something like emphasis.

g

The difference between k and g is voicing, and the same pattern holds up: the spectrogram is darker at the top for gy and darker at the bottom for g.




s

While I did state that it was silly to go into s vs sh, it is interesting to see the difference in the spectrogram.


n





Let's note something important here. While nya can sometimes sound like "n + ya" to untrained ears, the tongue is visibly moving into place before the ny.


Sometimes the tongue is moving back slowly enough that it may sound a bit like a "y" follows the consonant, but the opposite can also happen with liquids.

h



Not only is the spectrogram lighter for hy, you can also see the movement of the tongue into the correct position.

b and p

"py" is extremely rare. The easiest way to find a song with lyrics containing "py" is to give up and just sing a line from an UTAU VCV recording list.



Let's take a moment to explain why that "y" is so big with byo. To say a palatalized consonant, your tongue needs to be at the top of your mouth. There isn't a noticeable "y" after "ky" because you already use your tongue to say "k" in the first place. But with b, you say it entirely with your lips, so your tongue has more noticeable movement, which you can hear and see.

m



The power of "my" is so strong that it can make the tongue movement visible on "i".

r



Bonus: y and w

I very much dislike labeling y. The only thing that makes it tolerable is formants.


w is just slightly less upsetting.



Conclusion

Hopefully, this was helpful in showing what palatalized consonants look like when recorded by a Japanese singer.
