UTAU Search and Rescue: Trying To Make a Diffsinger G2P Phonemizer Plugin

So, hey guys! I'll just be real with you... I haven't actually gotten this to work all the way to the end. But it was really painful to try and get this all to work so... I'll share my current knowledge with you.

Diffsinger does not need a traditional phonemizer. The timing is all taken care of by the duration model. Technically with a good enough dsdict, you'll be happy with just using the default DIFFS phonemizer. Heck, you can just type phoneme hints and be happy.

But by creating a phonemizer, you are able to make it so that people can type in any word (or nonsense) and get results... Whereas using a dsdict method means that if a word isn't in the dsdict, it just doesn't know what the heck is going on.

There are three ways to make your phonemizer usable as far as I know. The first is by handing it over to the people in charge of OpenUtau for them to put into the newest release. You really wanna make sure it's good before you do that! The second is by recompiling OpenUtau after creating the phonemizer. This would be great for testing, but very annoying to share. The final method is releasing it as a plugin. I feel a little bad I'm writing this before I actually complete the project, but I just keep running into problems that I'm not equipped to handle given I just woke up.

So, if you're interested in learning how to do this, keep reading!

As a precaution: Anaconda and Visual Studio are pretty hefty. If your computer has little hard drive space, you may want to sit this one out. That's why I've been doing this all on my laptop and not my tablet.

I am a Colab girl. I love Google Colab. However, I was unable to get the Colab that had been provided to work at all... Though shout out to the person behind it because I wouldn't have been able to figure out the correct requirements without it.

Credits:

OpenUtau (and g2p code): Stakira

Google Colab (that I used to figure out the correct requirements): LotteV

Repo to create phonemizer plugin: Tyler (spicytigermeat)

Update 2025/03/05 -

I made it work!!!

Training a G2P

There are many different architectures of G2P models. I thought maybe I could use something like MFA to train, but that seems to not be the case. The code intended to train G2P models for OpenUtau is within the GitHub repository.

You don't need to get Git for this step. Click the green button that says "< > Code" and select the option to download as ZIP.

Put that wherever you want! Now... I struggled here because I wasn't thinking the best.

You need to download Visual Studio. Community edition is free! You also need the C++ build tools... Or was it make tools? Later, you'll also need .NET development tools. I didn't know I'd need Visual Studio, so I just downloaded the C++ stuff on its own. If you are downloading Visual Studio before starting this, I'm pretty sure you'll have the option to download all of that stuff.

I tried without Anaconda to save space and it didn't work. The issue was that everything was peachy keen until ONNX refused to install at all. I trained a whole model without Anaconda and it just couldn't export.

So... I had to go and get Anaconda.

Setting up your Environment

Alright! So a fun thing about Anaconda is that it comes with the Anaconda Navigator. You can make a new environment in there and open it in the terminal. It only saves you a small step compared to not having the Navigator, but it's nice.

Requirements

According to the Colab, the requirements file had errors at the time the Colab was released. What I did was install each requirement individually, one at a time, and it worked!

Here are each of the requirements:

Install antlr4-python3-runtime:

pip install antlr4-python3-runtime==4.9.*

Install hydra-core:
```
pip install hydra-core==1.3.2
```
Install omegaconf:
```
pip install omegaconf==2.3.0
```
Install torch:
```
pip install torch
```
Install torchaudio:
```
pip install torchaudio
```
Install editdistance:
```
pip install editdistance==0.6.2
```
Install tqdm:
```
pip install tqdm==4.65.0
```
Install onnx:
```
pip install onnx
```

There is a chance that I removed some of the version numbers (torch, torchaudio, and onnx don't have one here). You can put these in a requirements file or you could paste them all at once. But I had so many darn errors without Anaconda, I wanted to be safe. Aside from onnx, they all installed pretty quickly, so it wasn't too much of a waste of time to sit here and manually add each one.
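If you want to double check what actually ended up in your environment after installing one by one, here's a minimal sketch using Python's standard importlib.metadata. The package names come straight from the install commands above. The function name installed_versions is just something I made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version

# Package names taken from the install commands above.
REQUIRED = [
    "antlr4-python3-runtime",
    "hydra-core",
    "omegaconf",
    "torch",
    "torchaudio",
    "editdistance",
    "tqdm",
    "onnx",
]


def installed_versions(packages):
    """Return {package name: installed version, or None if it is missing}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(REQUIRED).items():
        print(f"{name}: {ver or 'NOT INSTALLED'}")
```

Run it inside the Anaconda environment you made. Anything that prints NOT INSTALLED is the one to go back and retry.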

Training

The instructions for training are on the OpenUtau GitHub Wiki. It's important that you edit the cfg.yaml. You will also need to edit train.py so that it looks in the correct directory (the one your dictionary and cfg.yaml are in).

The dictionary is just a tab-separated txt file: each line has the word, then a tab, then each phoneme for that word separated by spaces.

So it may look like this:

tab t ae b

take t ey k
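Since a broken dictionary is exactly what tripped me up, here's a rough sketch of a checker for that format. It assumes one entry per line, a tab between the word and the phonemes, and spaces between phonemes, like the description above. The function name parse_dictionary and the sample lines are made up for illustration, and this is not the loader the training code itself uses.

```python
def parse_dictionary(lines):
    """Parse 'word<TAB>ph1 ph2 ...' lines into (entries, problems)."""
    entries = {}
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        word, tab, phonemes = line.partition("\t")
        if not tab or not word.strip() or not phonemes.split():
            # No tab, no word, or no phonemes: remember the line for fixing.
            problems.append((number, line))
            continue
        # Note: a repeated word overwrites the earlier entry in this sketch.
        entries[word.strip()] = phonemes.split()
    return entries, problems


sample = ["tab\tt ae b\n", "take\tt ey k\n", "broken line with no tab\n"]
entries, problems = parse_dictionary(sample)
print(entries)
print(problems)
```

For a real file you'd pass open("dict.txt", encoding="utf-8") instead of the sample list, then look at the problems list before you spend time training.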

I just ended up in the g2p directory so I didn't run

python g2p/train.py

I just ran

python train.py

On Dictionaries...

There are actually a lot of dictionaries for a lot of languages online. I'm pretty sure you could even illegally use some! I mean, you should only use things as their licenses allow. But you don't need to sit down and write out new ones by hand.

Here is an example of a lot of dictionaries. Being frank, OpenUtau shouldn't have that much of an issue with anything that's the correct kind of Unicode. If you magically knew that users would never need phoneme hints, I'm pretty sure you could make a model that only uses IPA for its phonemes. I have no idea how to type IPA symbols, so I would be very lost if a model wanted me to type IPA symbols to use phonetic hints.

It should be easy enough to find and replace IPA symbols using text editors. The only dictionary I found for Norwegian was IPA with no spaces. That one... If I used that one, it would be a mess to try and get right.
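For the "IPA with no spaces" problem, find and replace in a text editor gets hairy because some symbols are two characters long. Here's a minimal sketch of one way around it: walk through the string and always try the longest symbol first. The mapping table below is completely hypothetical and only covers a few English-ish symbols. You would have to build the real one from the symbols in your own dictionary.

```python
# Hypothetical mapping, for illustration only: build your own table from
# the symbols that actually appear in your dictionary.
IPA_TO_ASCII = {
    "tʃ": "ch",
    "ʃ": "sh",
    "æ": "ae",
    "eɪ": "ey",
    "t": "t",
    "b": "b",
    "k": "k",
}


def convert_ipa(ipa, table):
    """Split an unspaced IPA string into mapped phonemes, longest match first."""
    longest = max(len(symbol) for symbol in table)
    phonemes = []
    i = 0
    while i < len(ipa):
        # Try the longest possible chunk first so "tʃ" wins over plain "t".
        for size in range(min(longest, len(ipa) - i), 0, -1):
            chunk = ipa[i:i + size]
            if chunk in table:
                phonemes.append(table[chunk])
                i += size
                break
        else:
            raise ValueError(f"no mapping for {ipa[i]!r} in {ipa!r}")
    return phonemes


print(" ".join(convert_ipa("teɪk", IPA_TO_ASCII)))
print(" ".join(convert_ipa("tʃæt", IPA_TO_ASCII)))
```

It raises an error on any symbol that isn't in the table, which is what you want: better to find out now than after training. Stress marks, length marks, and combining diacritics would each need their own entries (or to be stripped first), and longest-match can still guess wrong where a language really does have "t" followed by "ʃ" as two separate sounds.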

Packaging your G2P

To package your trained G2P, you need a dict.txt, a phones.txt, and your g2p.onnx.

The dict.txt is just your dictionary that you trained with.

The phones.txt lists every phoneme, one per line, each followed by either "vowel" (if it's a vowel) or "-" (if it isn't).

This is an example of a phones.txt:

a vowel

b -

B -

ch -

d -

D -

e vowel
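You don't have to write the phones.txt by hand. Here's a small sketch that collects every phoneme used in a dictionary and labels it, assuming the "phoneme, space, vowel or -" layout from the example above. The entries and the vowel set are hypothetical. You have to tell it which phonemes are vowels, since it can't guess that.

```python
def build_phones(entries, vowels):
    """Return phones.txt lines: each phoneme plus 'vowel' or '-'."""
    phonemes = sorted({p for phones in entries.values() for p in phones})
    return [f"{p} {'vowel' if p in vowels else '-'}" for p in phonemes]


# Hypothetical inputs for illustration.
entries = {"tab": ["t", "ae", "b"], "take": ["t", "ey", "k"]}
vowels = {"ae", "ey"}

lines = build_phones(entries, vowels)
print("\n".join(lines))
```

To save it, write the joined lines to phones.txt with UTF-8 encoding. I'm not sure whether the order of the lines matters to OpenUtau, so compare against a phones.txt from an existing G2P package before trusting the sorted order here.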

You will want to name your package "g2p-(language code).zip".
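If you'd rather script the zipping, here's a minimal sketch with Python's standard zipfile module. It assumes the three files sit together in one folder and that they belong at the top level of the zip (not inside a subfolder), which is my assumption and worth checking against an existing package. The function name make_pack is made up.

```python
import zipfile
from pathlib import Path

# The three files named above.
PACK_FILES = ["dict.txt", "phones.txt", "g2p.onnx"]


def make_pack(folder, language_code):
    """Zip the three pack files from `folder` into g2p-<language_code>.zip."""
    folder = Path(folder)
    missing = [name for name in PACK_FILES if not (folder / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing from {folder}: {', '.join(missing)}")
    target = folder / f"g2p-{language_code}.zip"
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as pack:
        for name in PACK_FILES:
            # arcname keeps the files at the top level of the archive.
            pack.write(folder / name, arcname=name)
    return target
```

Calling make_pack("my_model", "cs") would produce my_model/g2p-cs.zip, where "my_model" and "cs" are just example values.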

If you want your phonemizer to be added to the main repo or used in a recompiled version of OpenUtau, here is the information you need.

But if you don't want that...

Creating the Plugin

So, get Visual Studio if you don't already have it. You don't need Git, but you will need to clone a repo onto your own computer. Visual Studio gives you that option!

This is the repo you want to clone. You'll want to open the .csproj file and then do any edits you need to do. This is where I fell apart. I did all of the edits and then it told me to change it to a .NET 6 whatever it is and everything broke!

I know you'll need to add two things. The first is the ONNX runtime, which you add using the NuGet package manager (the package is called Microsoft.ML.OnnxRuntime). The second is one of the OpenUtau dlls, which you add as a reference. Those aren't in any repo as far as I know, so you get them from one of the OpenUtau releases. After that, you can open the folder in Windows Explorer, delete the g2p-th.zip, and replace it with your own. Just remember to replace the text every time g2p-th comes up!
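To make sure you catch every g2p-th, here's a little sketch that lists each line mentioning it in the cloned folder. The list of file extensions is my guess at where the name could show up, so adjust it if the repo has other file types. The function name find_mentions is made up.

```python
from pathlib import Path


def find_mentions(root, needle="g2p-th", suffixes=(".cs", ".csproj", ".resx")):
    """Return (path, line number, line text) for each line containing `needle`."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append((path, number, line.strip()))
    return hits
```

Point it at the folder you cloned the repo into and print the results. It only reports where the name appears and doesn't change any files, so you still do the actual replacing yourself in Visual Studio.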

Ideally, and I know I'm missing steps because the dictionary I used was basically useless due to mistakes on my end so I gave up, you'll be able to build the plugin and put the dll in the OpenUtau plugin folder... And it should work! If it doesn't, I'm stumped because I haven't even gotten that far.

Conclusion

I don't actually think I'll have much need for this in the end. But... It was eating me up not knowing how to do this! Now that I know how to do everything (except for possibly the very last steps) I feel peace. I don't need to actually do it - I just needed to know how to do it to stop feeling absolutely crazy over not knowing how things worked.

Please let me know if this was any help! If there's missing information, please let me know. My memory is pretty bad so I may have forgotten whole chunks of the process. I would love it if this helped anyone make their own phonemizers!

Update!!

Okay so... I struggled here for far too long, but I was just throwing errors at ChatGPT and being confused. It's really simple.

You can't drag and drop onto the Resource.resx. (Heh, I had seen myself writing that exactly in a dream.) You need to press the + button with the Resource.resx open in Visual Studio. There was also a whole hoopla that I can't even remember properly where it couldn't find the ONNX runtime, so I needed to get it from OpenUtau itself.

No idea how to get it to work with the phonetic hint thing as a plugin, but close enough!! Yeah!!

So yeah. Edit the .cs files from Tyler's GitHub repo after cloning it in Visual Studio. Grab the dlls from OpenUtau and set them as references. Set the zip to an embedded resource. Or maybe you don't need to if you directly add the resource by clicking the green + sign instead of literally any other way. I must stress this, because I would have binge watched like four hours of 90 Day Fiance by now if I had known it.

I don't have a Czech model to test this with yet, but testing it using phonemes a Japanese model would know... It works!!


Thursday, February 27, 2025
