UTAU Search and Rescue: Using English C+V As a Voice Copier

I never thought I'd be C+V anything!

Something I wish I had learned a long, long time ago was to just let people have fun. How people use UTAU doesn't affect me. If the done thing is suddenly to put the green line after the red line, all that means is that I'll figure out it's a thing everyone does for some reason and just silently fix it for my renders.

As intended, English C+V banks are unusable for this blog. I can't run OpenUtau nearly as well as OG UTAU. I'm deeply uncomfortable with the results of non-AI phonemizers. The only intended way to use these banks is with a specific OpenUtau phonemizer.

But these banks are getting popular. I need to know how to use them!

Aliases

Let's start off by asking... Are you going to use this bank more than once? For me, I'll use a bank once and then it stops existing for me. In that case, completely changing the alias system would be silly.

But if I were to know I was going to regularly use a bank, then it would make absolute sense to make it work with my textgrids.

I have been labeling a lot of English using Arpabet. Editing the aliases in the following ways will break the bank for anyone who wants to use it in OpenUtau, but it means I can turn my labels into USTs with only one little edit to the textgrids, and even that one isn't fully necessary.

The original C+V bank I used had "- C" and "C -" aliases with no "C" aliases. The bank my friend gave me for this article includes "C". We can just disregard all of the "- C" in this case! (Note: "- C" would be useful for the beginning of phrases, but it isn't really necessary to preserve that, as it would require editing the UST to include.)

To remind everyone, do not proceed if you use this bank with a phonemizer in OpenUtau. While the edited bank would still work with USTs made for it in OpenUtau, it will not work with a phonemizer. This is only for people who make voice copy USTs.

The first thing I realized using C+V is that every consonant should just use "- C" with one big exception: plosives. Things like "d" and "k" must use "C -".

As a note, diphthongs are annoying. There is kind of no way to naturally make those happen without VC outside of whatever the phonemizer must be doing. You can simply just replace "ay" with "ae" and "y" instead of worrying about those. Or, you could just alias it so that you can go "ay -" "-ay" like I did with my old English lists.

Alright! So let's delete all the entries we don't need. In the case of my friend's bank, that's both "- C" and most "C -". If the bank you are using lacks "C" aliases, you can just find and replace "= -" with "=" in a text editor. This will affect "- V" also, but that's no big deal.
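If you'd rather not do the find and replace by hand, here is a rough Python sketch of the same idea. It assumes the usual oto.ini line layout (`file.wav=alias,offset,consonant,cutoff,preutterance,overlap`), and `strip_leading_dash` is just a name I made up for illustration. It works on a list of lines, so reading and writing the file (and picking the right encoding for your bank) is left to you.

```python
def strip_leading_dash(oto_lines):
    """Rename "- X" aliases to "X"; every other line is passed through untouched."""
    out = []
    for line in oto_lines:
        if "=" not in line:
            out.append(line)  # not an OTO entry, keep as-is
            continue
        filename, rest = line.split("=", 1)
        alias, sep, params = rest.partition(",")
        alias = alias.strip()
        if alias.startswith("-"):
            # "- k" becomes "k"; "k -" does not start with a dash, so it is kept
            alias = alias[1:].strip()
        out.append(f"{filename}={alias}{sep}{params}")
    return out
```

Like the text editor version, this renames "- V" entries too. It does not delete anything, so removing the entries you don't want is still a manual step.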

All those edits got the OTO down to a slim 40 lines!

Edits to Textgrids

There are two places in which you will need to edit Diffsinger label Textgrids to work for this. The first is plosives. Using the method I use, the preutterance falls right before the plosive. In CVVC, a VC for plosives means silence. You should likely replace the majority of the labels for plosives with rests. Unreleased plosives are replaced completely by rests.

The second is diphthongs. As said before, it is up to you how you handle them, but they will need to be handled.
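For the "replace the diphthong with two phonemes" option, a minimal sketch could look like this. It assumes your labels are already loaded as `(start, end, label)` tuples in seconds; the only split taken from this post is "ay" into "ae" and "y". The function name, the tuple layout, and the 50/50 split point are all assumptions for illustration, so tune `first_share` by ear.

```python
# Only the "ay" entry comes from the post; add your own splits as needed.
DIPHTHONG_SPLITS = {"ay": ("ae", "y")}

def split_diphthongs(intervals, first_share=0.5):
    """Split each listed diphthong interval into two shorter intervals.

    first_share is the fraction of the duration given to the first half
    (0.5 is an arbitrary starting point, not a tested value).
    """
    out = []
    for start, end, label in intervals:
        parts = DIPHTHONG_SPLITS.get(label)
        if parts is None:
            out.append((start, end, label))  # not a diphthong we split
            continue
        mid = start + (end - start) * first_share
        out.append((start, mid, parts[0]))
        out.append((mid, end, parts[1]))
    return out
```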

A big note is that you need to make sure the script you use to export your textgrids to USTs has a BPM of like 360. Single phonemes are tiny!
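To see what the BPM does to the numbers, here is a small worked example. UST note lengths are counted in ticks, with 480 ticks to a quarter note, so the same phoneme covers three times as many ticks at 360 BPM as it does at 120 BPM. The 30 ms phoneme length is just an illustrative figure, and `seconds_to_ticks` is a made-up helper, not part of any export script.

```python
TICKS_PER_QUARTER = 480  # a Length of 480 in a UST is one quarter note

def seconds_to_ticks(seconds, bpm):
    """Convert a duration in seconds to a whole number of UST ticks."""
    quarter_notes = seconds * bpm / 60
    return round(quarter_notes * TICKS_PER_QUARTER)

# Compare how many ticks a short phoneme gets at two tempos.
for bpm in (120, 360):
    print(bpm, "BPM:", seconds_to_ticks(0.030, bpm), "ticks")
```

More ticks per phoneme means less is lost when each note length gets rounded to a whole tick.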

OTOs

Something about C+V is that the timing, by necessity, will have to be slightly off.

Starting with vowels, ideally, you would want the vowel to start right with a preutterance and overlap of zero. This would be really choppy and not sound very good! So we need to simulate my OTO style of "red line on the FULL vowel". I haven't tested much, but it looks like an overlap of 30 and a preutterance of 40 is good.

Here is a question. Should overlap be set as equal to preutterance? I mean, yeah. But it's not aesthetic, so I'd prefer not. You want as little of the vowel before the preutterance as possible while still crossfading so that nothing is abrupt and wonky.
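If you want to apply those starting values to a bunch of entries at once, something like this sketch would do it. It again assumes the usual oto.ini field order (alias, offset, consonant, cutoff, preutterance, overlap) and an alias with no commas in it. `set_pre_ovl` is a hypothetical helper, and the 40/30 defaults are only the lightly tested values mentioned above.

```python
def set_pre_ovl(line, preutterance=40, overlap=30):
    """Return an oto.ini entry with its preutterance and overlap replaced."""
    filename, rest = line.rstrip("\n").split("=", 1)
    # Field order assumed: alias, offset, consonant, cutoff, preutterance, overlap
    alias, offset, consonant, cutoff, _old_pre, _old_ovl = rest.split(",")
    return f"{filename}={alias},{offset},{consonant},{cutoff},{preutterance},{overlap}"
```

The offset still has to be placed by hand so that the 40 ms before the red line lands where you want it on the vowel; this only fills in the two numbers.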

You may run into a problem that can't be fixed with OTOs with plosives. It's actually pretty unnatural to say things like "bat" or "sag" with the consonants being the way they need to be to work correctly. The only option to create something that would work if the voice provider recorded it naturally (unreleased) would be to edit samples to add silence to the original "- C".

The way I oto the plosives is to set the preutterance to touch the top of the consonant. It's important to use the original "C -" so that the consonant is followed by silence.

Consonants other than plosives are just treated like vowels.

Okay, so how does it sound?

I never used it before, but you definitely want to play with the P flag. P0 just doesn't make samples all the same volume. I'll include the same render, one with P0 and one with P100.

So! I made Kao Nashi English C+V sing "if you're happy and you know it (clap your hands)" using labels from one of my Diffsinger data sets. I feel like I could pick at it and pick at it more, but I think it's good for someone who had their first time using C+V English earlier today! (it was technically yesterday, but I haven't slept yet!)

The first render is at P0 and the second is at P100. My UST export script includes the intensities of notes. Anything being off pitch has nothing to do with the bank - that's just my singing.

In conclusion...

I can use English C+V without a phonemizer! That's pretty cool.

UTAU Search and Rescue

Wednesday, December 11, 2024
