Text-to-speech Models Suck

Prompt: stand-up comedy, live comedy, monologue

Lyrics

[Audience cheers]
Okay, okay...
Okay, so you know how there's all this buzz around A.I. voice generation?
Like, text-to-speech, voice cloning, yada yada yada...
And all those apps coming out telling you that
[sarcastic tone] "we got the most realistic one".
[Audience laughs]
I don't think I'm gonna have to drop names here, you know which ones I'm talking about.
[Audience laughs]
You know who's NOT telling you they have the best text-to-speech?
Udio.
And I know what you're gonna say,
"hey Kenny, that's not a text-to-speech model".
[Audience laughs]
[firmly] Yes it is.
[Audience cheers]
It is a text-to-speech model, if you WANT it to be.
And if you don't mind the whole "audience laughing in the background" stuff.
[Audience laughs]
Exactly.
But I'm not even kidding, it's PHENOMENAL.
People tell me "wow, Kenny, did you try out the new Eleven Labs text-to-speech model?"
Oh, oops, there went the name-drop anyway I guess.
[Audience laughs]
And yeah, I play around with it for three seconds and am immediately bored.
[Audience laughs]
[yelling] CAUSE IT CAN'T YELL WHEN I WANT IT TO.
It just refuses to do emotion.
Nothing realistic about that, if you ask me.
[Audience cheers]
And then you find this random model that generates music.
And it's like "wow, so A.I. can do emotions after all".
So yeah,
apparently a random music model does text-to-speech better
than the best text-to-speech model.
[Audience laughs]
What kinda timeline is this?
[Audience cheers]