Hey there!
I love Diffsinger. I love labeling. I love testing hypotheses.
For a lot of vocal synth users, there's an easy solution! Just record your own voice!
Someone who was probably trying to troll commented this on one of my videos:
Of course I was happy to be told I sound like a woman! I had it made up in my mind that my high range sounded like SpongeBob SquarePants. I don't like Lois's voice, but, you know, girl. (Note: I am a cis woman. When I was younger I basically spoke just like Chris-Chan. I overcorrected really, really hard.)
I'm working on two data sets using my voice because I have the recordings and why not?
But what I would really like is to find someone to record for me!
So, I'll start by saying that you don't need to be a good singer. Diffsinger is really, really kind. As long as you never use the model's own pitch model, a Diffsinger made from a "bad" singer who lets themselves pitch shift their samples for range will sound more or less as competent as one made from a "good" singer.
If you don't think you're a great singer and you're embarrassed for people to hear your singing, I have great news! You cannot (to my knowledge) reverse engineer a Diffsinger bank to hear the original audio. Only one person would hear your singing... And that's me! I'm not critical of singing these days. I'll like basically anything I'm given.
So... I'm very manic and I very much have one specific thing I want to do. I want to find a Chinese speaker to record a Chinese data set for me. I'm holding myself back from going and harassing random people because that's pretty weird behavior.
But outside of wanting to do a Chinese project, Diffsinger models are like Pokemon to me. I want to catch them all! And by catch I mean create. I don't touch things I didn't make because I didn't "catch" that Pokemon. It would be traded!
So let's talk about recording for a moment.
Microphone quality is just, you know, one of the easiest things to get bullied for in the vocal synth fandom. I had to deal with it a lot. My computers were so busted, it didn't matter how good my AT2020 was. I didn't even know that was possible until I attached the microphone to my phone.
The thing about Diffsinger is that it's going to add engine noise. In my experiments, my clearest, most "high quality" sounding data set came from my phone. The one recorded on a fancy microphone sounded "better" to me, but I think a large part of the fandom would assume that the one from my phone was the "high quality" data set. You have to straight up put effort into finding a microphone that sounds genuinely low quality. But I still love the softness and sweetness of the data set I recorded using the worst microphone I could find. It's low quality and a bit hard to understand, but it's just so soft and nice to me!
So, the next part is what to record.
I've made four different Japanese models for three different people. One of them needed no notes. Everything was there. It wasn't even recorded for Diffsinger. She just gave me the raw stems for covers she did.
The other two just needed to have one thing added and that was an instance of the phoneme "py". All of the data was either nursery rhymes or vocaloid songs.
English is not so polite at all. It's a very annoying language because there are so many vowels and you need each vowel to exist at each pitch range within the bank so that there's no growling. There's also so... So much CC that can get wonky! That's why I made a special Diffsinger list that covers everything.
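If you want to check that "every vowel at every pitch range" idea on your own recordings, here's a rough Python sketch. To be clear, this is not part of Diffsinger or any of its tools. It assumes you've already gotten your labels into a plain list of (phoneme, MIDI note) pairs, and the pitch bands, the function names, and the ARPABET-style vowel names in the usage example are all placeholders I made up for illustration.

```python
from collections import defaultdict

# Hypothetical pitch bands as inclusive MIDI note ranges.
# These are made-up defaults; adjust them to the singer's actual range.
PITCH_BANDS = {"low": (48, 59), "mid": (60, 71), "high": (72, 83)}


def band_for(midi_note):
    """Return the name of the band containing midi_note, or None if outside all bands."""
    for name, (lowest, highest) in PITCH_BANDS.items():
        if lowest <= midi_note <= highest:
            return name
    return None


def missing_coverage(labels, vowels):
    """Report which vowels have no sample in which pitch bands.

    labels: iterable of (phoneme, midi_note) pairs (an assumed format).
    vowels: list of vowel names to check.
    Returns a dict mapping each under-covered vowel to its missing bands.
    """
    seen = defaultdict(set)
    for phoneme, midi_note in labels:
        band = band_for(midi_note)
        if phoneme in vowels and band is not None:
            seen[phoneme].add(band)

    gaps = {}
    for vowel in vowels:
        missing = [band for band in PITCH_BANDS if band not in seen[vowel]]
        if missing:
            gaps[vowel] = missing
    return gaps


if __name__ == "__main__":
    # Tiny made-up example: "aa" is sung in all three bands, "iy" only in the middle.
    example_labels = [("aa", 50), ("aa", 65), ("aa", 75), ("iy", 62)]
    print(missing_coverage(example_labels, ["aa", "iy"]))
```

Anything that shows up in the result is a vowel you'd want to sing again somewhere in the listed ranges. The same counting trick works for spotting a phoneme that never appears at all.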
I'm sure that every language is on a spectrum like that.
As a note, I would definitely be willing to use my own data sets to help train English models. That way, you could have the luxury of just singing like five songs and get a result that doesn't growl at you. Or in the case of non-English speakers, a model that sings in English.
As far as labeling goes, if I have phonetic transcriptions, I can do literally anything. I have labeled a German dataset already! But the only languages I wouldn't need help with are English as long as I have the lyrics, Japanese as long as I can get my hands on the romaji, and Chinese if I have access to the pinyin or jyutping lyrics.
Now, a lot of people do charge for labeling models. For me, just having someone sing for me and being given the chance to make a Diffsinger model feels like a gift. It's another Pokemon to put in my box! So in my eyes, it would be a trade.
But no RVC stuff. It's gotta come straight from a human throat, without any AI or UTAU. Maybe some compression, equalization, pitch shifting... But no throwing in USTs.
I'm not exactly expecting that this will result in anyone singing for me.
But if you're interested, here is my Twitter and here is my Bluesky.