Hi, I run a weekly anime night for my friends, where I curate a list of great and/or remarkable series to show them the highlights of the medium. I've been doing this for about two years now, and we've watched around 20 series so far, each running one to two cours (basically seasons of 11-13 episodes).
Even though I usually watch shows in the original language to get the Author Intended Experience™, I've played them here dubbed for the sake of viewing convenience. This is where we run into a problem though, and a subsequent solution that I've been fascinated by.
The big official localization platforms, whether for streaming or physical releases, do not bother to add closed captions (abbreviated to CC from here on) to dubs. Because this content technically isn't on western TV, the companies aren't subject to the regulations requiring CC, and even small costs are apparently too much for a feature few people outright demand.
Some prestige companies like GKIDS actually give a shit and do this on their releases, but they're a minority. Even Netflix is often guilty of skipping it.
Before anyone asks: no, the dub script is not the same as the subbed version. Dubs naturally alter dialogue to account for differences in how long it takes to say the same thing in English vs Japanese, and to make the dialogue flow better in English compared to the straightforwardness of subtitle scripts. Trying to watch the dub with the subbed captions causes constant audio-visual disconnects; it's not a good experience.
So, an unavoidable issue with hosting viewing events is that people will naturally chatter at certain points and not always recognize when they're talking over important dialogue. That leads to various awkward situations, especially if you've got anyone more interested in the show than the social experience.
You might think that adding CC yourself isn't a huge deal if you already know how to use software like Aegisub. Maybe an hour? Nah. Every 100 lines takes around 40-60 minutes. An episode of an action-heavy show like Gurren Lagann will have a brisk 200 lines, whereas one from something dialogue-heavy like Oddtaxi can have over 400.
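To put rough numbers on that, here's a back-of-the-envelope sketch in Python. It only multiplies out the figures above, assuming those line counts are per episode; the function name `captioning_minutes` is made up for illustration and nothing here is measured.

```python
def captioning_minutes(lines, minutes_per_100=(40, 60)):
    """Return a (low, high) estimate in minutes for hand-captioning `lines` lines."""
    low, high = minutes_per_100
    return (lines * low / 100, lines * high / 100)


# Line counts are the rough figures quoted above, assumed to be per episode.
for show, lines in [("Gurren Lagann", 200), ("Oddtaxi", 400)]:
    low, high = captioning_minutes(lines)
    print(f"{show}: {lines} lines -> {low / 60:.1f} to {high / 60:.1f} hours")
```

By that math, a 12-episode cour of something talky lands somewhere in the range of 32 to 48 hours of captioning work.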
I've gone to the effort a few times, but doing that for multiple shows is a genuine drag on my week, and I'll basically never do it if I haven't watched the show because that's not a good way to experience it for the first time. Recently though, I caved and set up Whisper, an open-source AI transcription model.
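For anyone curious what that looks like in practice, here's a minimal sketch, assuming the `openai-whisper` Python package and ffmpeg are installed. The helper names (`format_timestamp`, `segments_to_srt`, `transcribe_to_srt`) are made up for illustration, and this isn't necessarily the exact workflow I use. It turns Whisper's timed segments into an SRT file, which Aegisub can open for cleanup.

```python
def format_timestamp(seconds):
    """Convert seconds (float) into an SRT timestamp such as 00:01:02,500."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments):
    """Turn Whisper-style segments (dicts with start, end, text) into SRT text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{format_timestamp(seg['start'])} --> {format_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)


def transcribe_to_srt(audio_path, model_name="small"):
    """Transcribe a file with Whisper and return SRT text.

    Needs the third-party openai-whisper package plus ffmpeg on the PATH.
    """
    import whisper  # imported lazily so the helpers above work without it

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path)
    return segments_to_srt(result["segments"])
```

The output still needs a human pass for names, timing, and speaker changes, but it beats typing every line from scratch.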
My ethical standard for AI is that if it helps someone do something they were already capable of doing, but faster/easier, then it's simply a useful tool. If it's being deployed in order to outright replace someone paid to do that thing, however, then we're in the territory of "until we're in the idealistic automated luxury communist society, this is a really immoral thing to use or work on."
As I had this draft sitting around, something fucky happened with Crunchyroll and one of the shows they streamed for the fall 2023 season, The Yuzuki Family's Four Sons, had clear signs of machine translation. Not "this has wonky translations from an obvious novice," we're talking "this machine does not have context for what's going on and is making bizarre assumptions about what the characters are saying."
It's possible that this was the result of an individual bad actor in the system, but considering these departments typically have managers who review the scripts, and CR is infamous for paying its translators slave wages, it's far more likely that this was a management-level decision to test the waters for this kind of thing on one of the lesser-known shows of the season.
Fortunately for us, machine translation still struggles when it comes to context in languages like Japanese, and that's not likely to change anytime soon unless you're overly confident in the idea that technology always gets better. As much as big companies like this want to cut the workers out of the equation, they can't do that without a massive sacrifice in quality, beyond what any paying customer would find acceptable.
All that aside, this isn't a very in-depth test, but before all this went down I wanted to see how well these models would handle transcription followed by translation, then compare that against a fan-translation. The results can sometimes impress, but are still inherently limited by that aforementioned context issue.
In this scene from Vinland Saga, Floki, a general of the Viking forces, is paying a pirate, Askeladd, under the table to assassinate a deserter from the war: the current main character, Thors.
Here's the transcription for the fan-translation:

Floki: The broad details have not changed. You will be compensated with five pounds of gold. I will only pay you in exchange for his dead body. His ship and cargo are yours to do with as you please. Just kill Thors.
Askeladd: Thors, eh? I've heard tell of this "Troll of Jom." You really want us to kill him? I thought he was a big hero to you folks.
Floki: If he were a hero, I wouldn't have told you to kill him. He has flagrantly flouted the precepts of our band. He deserted in the face of the enemy. Orders to have him executed were issued fifteen years ago.
Askeladd: I knew you folks were picky about your laws and all, but fifteen years? My goodness.
Pretty straightforward, right? It's easy to underestimate the work it takes to get a translation like this, one that both conveys the original meaning and recreates the personality of the original in the target language, especially when you see what AI tries to do with the material.
We'll start with the default "small" model:

Floki: It's not a big change, but the reward is 5 pounds of gold. If you don't exchange it with his corpse, you won't be able to pay. It's okay to like his ship and his sins. Kill Taurus.
Askeladd: Hey, don't say that, Taurus! Taurus, right? You're the boss of Yomu, aren't you? I've heard about it. Are you okay with killing him? You're a hero, aren't you?
Floki: If you're a hero, you can't kill him. He's a huge rebel. He's an enemy and a king. His first order has been given to him 15 years ago.
Askeladd: I know you're being so modest. It's been 15 years.
The smaller model gets caught up on proper nouns, especially western ones spoken in a Japanese phonetic pronunciation, but that's an easily-fixable issue. A more difficult general error to fix here is that, because of the differences in context for Japanese—"I" and "you" are rarely said, and statements are direct instead of needing an empty subject container (e.g. "it's" or "this") like in English—the AI just doesn't know who the characters are referring to at any given time.
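The proper noun part really is the easy bit. As a rough sketch (the glossary and the `fix_proper_nouns` name are hypothetical, and you'd have to build the list by hand for each show), a simple find-and-replace pass over the output would cover it:

```python
import re

# Hypothetical glossary: misheard spelling -> intended name.
NAME_FIXES = {
    "Taurus": "Thors",
    "Yomu": "Jom",
}


def fix_proper_nouns(text, fixes=NAME_FIXES):
    """Replace whole-word misheard names with their intended spelling."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, fixes)) + r")\b")
    return pattern.sub(lambda match: fixes[match.group(1)], text)


print(fix_proper_nouns("Kill Taurus."))
print(fix_proper_nouns("You're the boss of Yomu, aren't you?"))
```

That only patches the spelling. It does nothing for the question of who is talking about whom, which is the actual problem.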
Shorthand and colloquialisms also trip up the machine here. "Take the ship if you like" gets garbled into "You can like his ship," "I know you're picky/particular about your laws" becomes "I know you're being so modest," etc. If you weren't fluent in the original language at all and were told to unscramble this into a comprehensible script, you'd have a damned tough time.
Now for the medium model:

Floki: There's no big change. The reward is 5 pounds of gold. If you don't exchange it with his body, I won't pay you back. You can do whatever you want with his ship and his sins. Kill the Tors.
Askeladd: Tors, right? Youmu and Troll. I've heard of them before. Is it okay to kill them? They're heroes to you, aren't they?
Floki: If they're heroes, I won't tell them to kill them. They're serious military criminals. They were sentenced to 15 years in prison.
Askeladd: I know you guys are noisy about the rules, but 15 years, huh?
It's a lot closer, at least. The context becomes a lot less muddy, though it still has to make a generic "they" guess with the subject. I'm not sure how "his crew" keeps getting translated as "his sins" though. The original statement is:
"奴の船とが好きにしていい" (Yatsu no fune to ga suki ni shite ii)
"You can do whatever you want with his ship."
The "to" particle after "his ship" implies a collective grouping, i.e. including the crew in addition to the ship. It might be mishearing it as "罪と船 (tsumi to fune)" or "sins and ship" to get this result. I'm still a rank amateur at this language, so that's my best guess.
Another example: the mention of his execution order being 15 years old somehow gets confused for a prison sentence. You can get the gist of the scene, but there are still too many misshapen pieces for it to accurately convey the conversation.
Finally, let's throw the large model at it. This requires a fair amount of processing power, so it's not an option available to someone on a toaster PC. Let's take a look:

Floki: I'm not going to make a big change. The reward is 5 pounds of gold. I won't pay you unless you exchange it for his body. You can do whatever you want with his ship and the loot. Kill Thors.
Askeladd: Thors, huh? Youmu's troll. I've heard of him. Is it okay to kill him? He's a hero to you, isn't he?
Floki: If he's a hero, I won't ask you to kill him. He's a serious military offender. He's a fugitive. He was executed 15 years ago.
Askeladd: I know you guys are loud and noisy, but it's been 15 years.
We're at the point where it can at least get Thors' name correct, and I would call this 90% of the way to capturing what the conversation is about. If you were a viewer who understood the AI's limitations, you could make reasonable assumptions about the original meaning of these lines at this point.
But 90% is not 100%. It still can't understand "ship and the crew" here and goes with a guess informed by the language model (ships have loot, naturally). It can't get the distinction between a regular possessive statement and a title since both use the same "の (no)" particle; Thors is the "Troll of Jom," he's not a troll belonging to someone named Jom.
It's almost there on the execution order, but it can't understand which part of that statement the continuous-tense auxiliary verb "いる (iru)" is referring to, so it assumes that an execution happened 15 years ago. Somehow Askeladd's statement about how strict their laws are actively regresses from the previous model and completely misses the meaning.
We're also just talking basic translation here. It might be almost there on conveying the exact words being said, but it has no capability to convey personality. In Japanese, it's actually relatively easy to identify a character's formality just from the verb forms and sentence endings they use. The AI can't distinguish that, so we get a very dry translation with no distinction between the personalities of these two characters, where a human being could easily identify it through both conversational style and clear visual differences in body language.
Maybe that will be good enough for these services and for customers. They'll just take the large model output, clean up the script a little, and give a result which at least tells you what's going on in the most basic sense. I sure hope not, because even someone who has clear biases in how to present nuances in language (e.g. older fansubs overusing curse words) will have a better fundamental understanding of the source they're working with than a machine. As with most things to do with AI, a lot of the humanity inherent in art can't fit through the sieve of 1s and 0s.
Whisper is terrifically useful for cutting out most of the work of transcribing. It'll probably get used in the future for same-language professional CC, if it isn't already. But for localization, it still has to hand that mostly-accurate transcription over to a translation module, which continues to be full of faults derived from the complicated nature of context. Ideally it'll stay that way for a long while so that the expertise of localizers doesn't get further diminished by big business.
