Wiki Wednesday #308 - VoiceFace's Nemo / ニモ
Who is Nemo?
Nemo is genderfluid and has no age. No one knows if they are even human.
Art from official site
Is Diffsinger a singing synth? The answer may surprise you!
Obviously, you load Diffsinger into OpenUtau and give it a UST. It sings at you. If that's not a singing synth, then nothing is!
But Diffsinger itself doesn't produce audio. It works much like the image generator Stable Diffusion: it takes your prompt and starts out with a bunch of noise. It dreams up what you're asking for and then refines that for as many steps as you give it. Then, that generated image (a spectrogram) is given to the vocoder, which turns the image into sound! That is why people fine-tune their own vocoders, so that their model sounds even more like them! (Note: this made little sense to me until I learned that you can set things up so that a specific model always uses its own vocoder, packaged with it.)
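To make the "noise, then refine, then vocoder" idea concrete, here is a toy sketch in plain Python. Everything in it is a stand-in I made up for illustration: `refine` cheats by peeking at the target instead of using a trained network, and `toy_vocoder` just sums sine waves. It is not how Diffsinger or any real vocoder is implemented; it only shows the shape of the two stages.

```python
import math
import random

def refine(noisy, target, steps, rate=0.5):
    """Nudge a noisy 'spectrogram' toward the target once per step.

    Hypothetical stand-in: a real diffusion model predicts what to remove
    with a trained network; this toy simply looks at the answer.
    """
    current = [row[:] for row in noisy]
    for _ in range(steps):
        for i, row in enumerate(current):
            for j, value in enumerate(row):
                row[j] = value + rate * (target[i][j] - value)
    return current

def toy_vocoder(spectrogram, bin_freqs, sample_rate=8000, hop=80):
    """Turn each frame of magnitudes into `hop` audio samples (sum of sines)."""
    audio = []
    for frame_index, frame in enumerate(spectrogram):
        for n in range(hop):
            t = (frame_index * hop + n) / sample_rate
            audio.append(sum(mag * math.sin(2 * math.pi * freq * t)
                             for mag, freq in zip(frame, bin_freqs)))
    return audio

def mean_abs_error(a, b):
    """Average per-cell distance between two same-sized 'images'."""
    total = sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return total / (len(a) * len(a[0]))

random.seed(0)
bin_freqs = [220.0, 440.0, 880.0]                       # made-up frequency bins
target = [[1.0, 0.5, 0.0]] * 4 + [[0.0, 1.0, 0.5]] * 4  # 8 frames, 2 "notes"
noise = [[random.gauss(0, 1) for _ in bin_freqs] for _ in target]

rough = refine(noise, target, steps=2)      # few steps: still noisy
polished = refine(noise, target, steps=8)   # more steps: closer to the target
audio = toy_vocoder(polished, bin_freqs)    # only now is there any "sound"
```

The point of the split is that `refine` never touches audio at all. Swap in a different `toy_vocoder` and the same picture comes out sounding different, which is the same reason a custom vocoder changes how a model sounds.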
UTAU and old-school VOCALOID worked by taking snippets of singing and then chaining them together. This led to results that I thought were totally human for at least a decade! Except for English. I thought English was special in being nearly impossible to make work in vocal synths, but that was just because I wasn't that great at Japanese. I didn't realize how different VOCALOID Japanese and human Japanese were until I was in the middle of labeling a Japanese data set!
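For contrast, the "snippets chained together" approach can be sketched in a few lines. This is a toy under my own assumptions: snippets are plain lists of sample values, and `crossfade_concat` is a hypothetical helper, not a function from UTAU or VOCALOID. Real engines also stretch and pitch-shift each sample, which is skipped here.

```python
def crossfade_concat(snippets, overlap):
    """Chain recorded snippets, blending `overlap` samples at each join.

    Assumes overlap > 0 and every snippet is at least `overlap` samples long.
    """
    out = list(snippets[0])
    for snippet in snippets[1:]:
        for k in range(overlap):
            weight = (k + 1) / (overlap + 1)  # ramps toward the incoming snippet
            index = len(out) - overlap + k
            out[index] = out[index] * (1 - weight) + snippet[k] * weight
        out.extend(snippet[overlap:])
    return out

ka = [1.0] * 10   # stand-in for one recorded syllable
sa = [0.0] * 10   # stand-in for the next one
line = crossfade_concat([ka, sa], overlap=4)
```

Nothing new is generated here: every output sample comes straight from a recording, only blended at the seams. That is the big difference from the diffusion approach.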
However, the most popular way to create Diffsinger models ignores all of that. Instead of focusing on a model that excels at one target language, people will record separate languages and mix them together. While the voice will be more realistic than UTAU or VOCALOID, it won't have the same "punch you in the face" realism as a single-language bank.
If you are like me and only want monolingual banks (with a few little fun touches for the Japanese in the dsdict), then Diffsinger feels categorically different from any vocal synth before Synth-V. But, if you want to mix languages and have more fun, then it does feel much more like another vocal synth.
You personally don't need to record yourself singing in other languages - you can use public data sets to give your model almost any language!
I'm a weirdo and I know it. What I personally feel shouldn't deter you from giving your model all the languages that you want! I think I'm the only person I've seen who prefers just one language per model.
But, this may partially stem from what I'll talk about next week.
As a note, I'm still fine-tuning what makes a realistic English Diffsinger. I stay far away from the fandom, so I only have my own notes to go off of in this regard. I've had really good results with twenty minutes of data using a specially curated list sung in a slightly boring tone. I'm trying to record at least forty-five minutes in a punk tone for my next test! Thank goodness for the Harvard Sentences... I'd be lost without them!
How are Nemo's banks?
Nemo only has one bank, but I'll treat it as two because I just gotta give in and try C+V English at some point... even though it's just, like, not my jam.
So, I don't know if Nemo has OTOs that are considered correct for this method, but that is not the done thing!
So, disclaimer: I use OpenUtau for Diffsinger and that's it. My poor little tablet can't handle OpenUtau beyond basic troubleshooting. But I don't need a phonemizer because I do CVVC voice copy. Converting to C+V is just a matter of renaming a few things. The issue is the OTOs. Given my knowledge of how things work, I retooled the OTO to do the done thing. Nemo has an adorable voice that lends itself well to this kind of artificial sound.
The rest of the bank is a four-pitch CV+VV. The voice is said to be inspired by Eric Cartman. I didn't hear it in the English, but I can hear it here. Nemo has a nice and unique voice.
Where can I download Nemo?
You can find Nemo on their official site. They are a cool UTAU!