How long should I train my Diffsinger model?!
Honestly, I have no idea. I've never really seen a firm number on what to aim for. For most people, the answer is genuinely "for as long as you're willing to wait."
I train locally, running the Colab notebook via Docker. Because of that setup, I don't have access to Tensorboard, which means I have no way of seeing how training is going until I export the entire model. So I start training and then wander away to watch 90 Day Fiance. I go for 100,000 steps each for the variance and acoustic models. If I'm super excited and can't wait to hear the result, I'll stop at 80,000. My memory is spotty, but I think there were cases where 60,000 steps wasn't enough, though that might have just been dataset issues. So I go overboard, based on one piece of throwaway advice from some random person I can't even remember.
But when you're stuck using Google's GPU for just under two hours a day before being booted off... yeah, 60,000 steps is a marathon. And in reality it's more like 120,000 steps, since you're training two models. All you can really do for the acoustic model is listen to the test segments in Tensorboard and see if they sound okay. I have no idea how to tell from Tensorboard whether your variance model is doing okay.
So, let's say you can get 20,000 steps a day from the Colab. At 60,000 steps for each model, that's six days: an entire week of training! If you can only get 10,000 steps a day? That's almost two weeks of just showing up, setting it up, and waiting.
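If you want to play with the math above yourself, here's a quick back-of-the-envelope sketch. The step counts and steps-per-day numbers are just the examples from this post, not anything official:

```python
# Rough estimate of calendar time to train a DiffSinger voicebank on a
# free-tier GPU. All numbers are illustrative, taken from the text above.

def days_needed(target_steps, steps_per_day, num_models=2):
    """Days of 'show up, set it up, and wait' to get num_models models
    (acoustic + variance) to target_steps each."""
    total_steps = target_steps * num_models
    # Round up: a partial day still costs you a session.
    return -(-total_steps // steps_per_day)

print(days_needed(60_000, 20_000))   # 6 days, about a week
print(days_needed(60_000, 10_000))   # 12 days, almost two weeks
print(days_needed(100_000, 20_000))  # 10 days at my preferred 100k steps
```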
How can you get around this?! Well, I do believe that datasets with more audio tend to need less training to sound great. And here's a huge thing: training cannot fix every problem. If your dataset doesn't include a certain vowel at a certain pitch, the model will growl at you when you ask for it. But even if you have all of the right data and don't train for long enough... it will also growl at you. Growling can mean several different things.
Is there a downside to training too much? In a broad sense, training an AI model for too long leads to it copying the training data too closely. But I don't know at what point that becomes an issue with Diffsinger, or what it would even look like given how Diffsinger works. Theoretically, it's a problem. In practice, I don't even know how you'd test for it: you could train two models to different step counts, but how would you judge the results against each other?
If I were stuck using Google's GPU, I would aim for 60,000 steps and see if I went totally crazy while waiting. Since I don't have to do that, I aim for 100,000 steps. If you have the fortitude to do that using Google's GPU, your patience has my respect! I did do it out of ignorance my first time. It was not fun!