Why high-quality automatic subtitling is harder than it looks

Subtitles are more than text on screen. Explore why segmentation, typography, and AI make automatic subtitling far more complex than it first appears.

At first glance, automatic subtitle generation seems straightforward. Speech recognition produces text, and the text is displayed on screen. Job done.

In practice, it’s far more complicated. Subtitles are not simply transcripts placed over a video. They are designed so that people can read them easily while still following what is happening on screen. The viewer should never feel like the text is competing with the story.

Getting that balance right turns automatic subtitling into a surprisingly difficult problem.

Subtitles are a visual experience

One of the most underestimated aspects of subtitles is their visual presentation. Viewers read them while watching the video, so readability has to be immediate and effortless.

That’s where segmentation comes in.

Segmentation is the way speech is divided into subtitle blocks and lines. It determines:

  • Where one subtitle ends and the next begins
  • How text is split across one or two lines
  • Which words remain together
  • How easily the subtitle can be read in the available time

When segmentation works well, subtitles feel almost invisible. When it does not, viewers slow down, reread lines, or lose the flow of the dialogue.

That’s why segmentation needs to be viewed as a core problem in subtitle generation, rather than a simple formatting step.

Different domains, different expectations

Subtitles should not look the same everywhere. There’s no universal standard, and different platforms and industries expect different styles.

For example:

  • Some applications prefer bottom-heavy subtitles, where the second line carries more text.
  • Others favour balanced lines, where both lines have similar lengths.
  • Some prefer top-heavy subtitles, where the first line contains more content.

These differences aren’t arbitrary. They usually reflect platform guidelines, accessibility requirements, or simply the viewing habits audiences have developed over time.

This means an automatic subtitling system has to adapt to different segmentation profiles while still keeping the text comfortable to read.
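As a rough illustration of what a segmentation profile could look like in code, the sketch below picks the line break for a two-line subtitle according to the desired shape. It is not the method used by any particular system: the function name `split_two_lines`, the 42-character line limit, and the scoring rule are all assumptions chosen for the example.

```python
def split_two_lines(text, max_chars=42, profile="balanced"):
    """Split text into at most two lines, choosing the break that best fits the profile.

    Hypothetical helper for illustration only. `max_chars` and the cost
    function are assumed values, not an industry standard.
    """
    if profile not in ("balanced", "bottom-heavy", "top-heavy"):
        raise ValueError(f"unknown profile: {profile}")

    words = text.split()
    joined = " ".join(words)
    if len(joined) <= max_chars:
        return [joined]  # fits on one line, no break needed

    best_cost, best_split = None, None
    for i in range(1, len(words)):
        top = " ".join(words[:i])
        bottom = " ".join(words[i:])
        if len(top) > max_chars or len(bottom) > max_chars:
            continue  # this break violates the line-length limit

        diff = len(top) - len(bottom)  # > 0 means the top line is longer
        cost = abs(diff)               # prefer lines of similar length
        if profile == "bottom-heavy" and diff > 0:
            cost += 1000               # strongly discourage a longer top line
        if profile == "top-heavy" and diff < 0:
            cost += 1000               # strongly discourage a longer bottom line

        if best_cost is None or cost < best_cost:
            best_cost, best_split = cost, [top, bottom]

    if best_split is None:
        raise ValueError("text does not fit on two lines")
    return best_split


sentence = "I never said that she stole my money yesterday"
print(split_two_lines(sentence, profile="bottom-heavy"))
print(split_two_lines(sentence, profile="top-heavy"))
```

The same sentence breaks after "she" under the bottom-heavy profile and after "stole" under the top-heavy one, which shows how a single setting can change the visual shape without touching the words.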

Every language has its own typographic rules

The challenge becomes even greater when multiple languages are involved.

Each language comes with its own typographic conventions and segmentation rules, such as:

  • Restrictions on where line breaks may occur
  • Rules about splitting grammatical structures
  • Punctuation placement
  • Handling of clitics, particles, or compounds
  • Natural reading rhythm

A segmentation strategy that works well for English may be inappropriate for German, Spanish, or Japanese. When those linguistic differences are ignored, subtitles often feel awkward or unnatural.

Handling these nuances properly requires both language-specific knowledge and specialised algorithms.
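To make the idea of language-specific break rules concrete, here is a deliberately small sketch that blocks a line break directly after an article. The word lists, the `NO_BREAK_AFTER` table, and the `allowed_break_indices` function are illustrative assumptions rather than a complete rule set, and the whitespace tokenisation does not apply to languages such as Japanese that are written without spaces.

```python
# Hypothetical, incomplete rule table: words that should not end a line.
NO_BREAK_AFTER = {
    "en": {"a", "an", "the"},
    "de": {"der", "die", "das", "ein", "eine"},
    "es": {"el", "la", "los", "las", "un", "una"},
}


def allowed_break_indices(words, lang):
    """Return indices i where a line break before words[i] is permitted.

    A break is rejected when the preceding word is in the language's
    no-break list. Unknown languages get no restrictions.
    """
    blocked = NO_BREAK_AFTER.get(lang, set())
    return [
        i
        for i in range(1, len(words))
        if words[i - 1].lower() not in blocked
    ]


words = "She opened the door".split()
print(allowed_break_indices(words, "en"))
```

For the English example, the break before "door" is excluded because it would leave "the" stranded at the end of the first line. A production system would need far richer linguistic information than a word list.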

Learning from real subtitle data

Theory alone is not enough to build good subtitle systems.

That is why we study large volumes of subtitle material across many languages and domains. This includes both publicly available datasets and carefully curated private collections.

Looking at real subtitles helps us understand how professional segmentation works in practice and how different domains structure their subtitles.

At the same time, we actively follow academic and industry research in automatic subtitle generation from audio. The field is evolving quickly, and keeping up with that research helps us keep improving our systems.

Combining multiple technologies

One common misconception about automatic subtitling is that a single model can solve the entire problem.

In reality, producing high-quality subtitles requires multiple specialised technologies, each optimised for a specific task.

AI models for fluency and readability

We use advanced AI models to refine subtitles so that the message reaches the reader as smoothly as possible.

These models help:

  • Improve sentence flow
  • Reduce awkward phrasing
  • Adapt spoken language for reading
  • Preserve meaning while enhancing clarity

This step ensures that subtitles feel natural to the viewer, rather than reading like raw transcriptions of speech.

Deep neural networks for speech tagging and localization

Modern deep neural network (DNN) models help us analyse the structure of speech and detect language-specific nuances.

These models support:

  • Accurate alignment between speech and text
  • Identification of linguistic patterns
  • Handling of multiple languages and dialects
  • Adaptation to locale-specific conventions

This allows us to treat subtitles as language-aware content rather than plain text.

Typographic algorithms for precise compliance

AI models provide flexibility, but subtitle formatting also requires precision.

We rely on traditional typographic algorithms that enforce specific subtitle profiles. These algorithms ensure:

  • Line length constraints are respected
  • Reading speeds remain comfortable
  • Segmentation follows typographic rules
  • Visual balance matches the desired style (bottom-heavy, balanced, or top-heavy)

By combining AI-driven refinement and deterministic typographic control, we can maintain natural language flow while still meeting strict formatting compliance.
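A deterministic compliance check of this kind could be sketched as follows. The `SubtitleProfile` class, the `check_subtitle` function, and the default limits (42 characters per line, two lines, 17 characters per second) are assumptions for illustration. Real profiles differ by platform, and conventions also vary on whether spaces count towards reading speed. Here they do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubtitleProfile:
    # Illustrative defaults only; actual limits depend on the target platform.
    max_chars_per_line: int = 42
    max_lines: int = 2
    max_chars_per_second: float = 17.0


def check_subtitle(lines, start_s, end_s, profile=SubtitleProfile()):
    """Return a list of human-readable violations for one subtitle block."""
    violations = []

    if len(lines) > profile.max_lines:
        violations.append(f"too many lines: {len(lines)}")

    for number, line in enumerate(lines, start=1):
        if len(line) > profile.max_chars_per_line:
            violations.append(f"line {number} too long: {len(line)} chars")

    duration = end_s - start_s
    if duration <= 0:
        violations.append("non-positive duration")
    else:
        # Reading speed in characters per second, spaces included.
        cps = sum(len(line) for line in lines) / duration
        if cps > profile.max_chars_per_second:
            violations.append(f"reading speed too high: {cps:.1f} cps")

    return violations


block = ["I never said that she", "stole my money yesterday"]
print(check_subtitle(block, start_s=0.0, end_s=3.0))  # 45 chars over 3 s = 15 cps
print(check_subtitle(block, start_s=0.0, end_s=2.0))  # 45 chars over 2 s = 22.5 cps
```

Because the check is rule-based, the same input always yields the same verdict, which is the property that makes it a useful complement to model-driven refinement.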

Rigorous evaluation is essential

Improving subtitle systems requires careful evaluation.

We test our systems across multiple domains, languages, and content types to make sure performance holds up in real-world conditions.

To do this, we rely on metrics used by both industry and academia.

Reference-based metrics

One important metric we use is SubER (Subtitle Edit Rate).

SubER measures how much editing would be required to transform our automatically generated subtitles into high-quality human references. It provides a clear indication of segmentation accuracy and formatting quality.
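To give a feel for how an edit-rate metric can be sensitive to segmentation, the sketch below computes a plain word-level edit distance over text in which line and block breaks are represented as tokens. It is a simplification and not SubER itself: it ignores timing and has no shift operation, and the function names and the `<eol>` and `<eob>` marker strings are used here as illustrative assumptions.

```python
def to_tokens(blocks):
    """Flatten subtitle blocks (lists of lines) into words plus break markers."""
    tokens = []
    for block in blocks:
        for i, line in enumerate(block):
            tokens.extend(line.split())
            # Mark the end of a line, or the end of the block on its last line.
            tokens.append("<eob>" if i == len(block) - 1 else "<eol>")
    return tokens


def edit_distance(hyp, ref):
    """Levenshtein distance between two token lists (insert, delete, substitute)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a hypothesis token
                curr[j - 1] + 1,         # insert a reference token
                prev[j - 1] + (h != r),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def simple_edit_rate(hyp_blocks, ref_blocks):
    """Edits needed to turn the hypothesis into the reference, per reference token."""
    hyp, ref = to_tokens(hyp_blocks), to_tokens(ref_blocks)
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(hyp, ref) / len(ref)


reference = [["I never said that she", "stole my money yesterday"]]
hypothesis = [["I never said that she stole", "my money yesterday"]]
print(simple_edit_rate(hypothesis, reference))
```

In this example the words are identical and only the line break has moved, yet the score is non-zero (2 edits over 11 reference tokens). That is the behaviour that lets a metric of this family penalise poor segmentation even when the transcript is perfect.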

Reference-free quality assessment

We also employ reference-free QA-style metrics that analyse subtitles without relying on human references.

These metrics let us examine specific aspects of quality:

  • Compliance with subtitle profiles
  • Segmentation quality
  • Typographic adherence
  • Formatting consistency

By combining reference-based and reference-free evaluation, we gain a more complete understanding of subtitle quality.

Why the problem is genuinely difficult

Automatic subtitling sits at the intersection of several complex fields:

  • Speech recognition
  • Linguistics
  • Typography
  • Artificial intelligence
  • Human readability research

Each of these areas brings its own challenges. Combining them into a system that produces fluent, readable subtitles across languages and domains is not straightforward.

Turning speech into text is only part of the task. The real challenge is producing subtitles that people can read comfortably, quickly, and without losing track of what is happening on screen.

When subtitles work well, viewers barely notice them… and that’s exactly the point.
