Why the post-editing paradigm is breaking down in the age of LLMs

Dr. Alon Lavie, VP of AI Research at Phrase

Strategic Advisor, AI

Last updated on May 22, 2026.

I recently had the opportunity to join Ana Guerber of Arenas on The Visible Art of Translation podcast, where we discussed the evolving role of machine translation and AI in literary and professional creative translation. What began as a broad conversation about quality and creativity quickly converged on a more fundamental question:

Are we still thinking about translation, and translation workflows, in the right way?

This two-part article picks up that conversation, focusing on the implications for how translation quality is evaluated and how workflows are designed.

Over the past two decades, the field has made remarkable progress. Machine translation systems have improved dramatically, and automated evaluation metrics have become far more sophisticated. Yet, many of the workflows and assumptions that underpin how we use these technologies have remained largely unchanged.

With the emergence of large language models (LLMs), that mismatch is becoming increasingly visible.

Rethinking what “quality” actually means

A useful place to start is with the concept of translation quality itself.

In much of the industry, quality is often treated, implicitly or explicitly, as a single measurable dimension. In reality, it is far more complex. At a minimum, a high-quality translation must be faithful to the source and grammatically correct and fluent in the target language. These are necessary conditions.

They are not sufficient.

What constitutes an outstanding translation depends heavily on the context in which it is used. In literary translation, more creative qualities such as voice, nuance, and reader engagement become central. In medical or technical domains, clarity, precision, and terminological consistency dominate. In marketing, tone, cultural alignment, and brand voice take precedence.

Translation quality is therefore not a single scalar value. It is multi-dimensional and use-case dependent.

This observation is not new. Frameworks such as MQM (Multidimensional Quality Metrics) were developed precisely to allow human evaluators to assess quality across multiple dimensions including accuracy, fluency, terminology and style, rather than collapsing everything into a single score. MQM made this multi-dimensionality operational for human evaluation.
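To make the contrast with a single score concrete, here is a minimal Python sketch of an MQM-style penalty profile. The severity weights, the `ErrorAnnotation` type, and the `penalty_profile` helper are assumptions made for illustration; they are not an official MQM scoring model.

```python
from collections import Counter
from dataclasses import dataclass

# Illustrative severity weights: an assumption for this sketch,
# not an official MQM scoring model.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5}


@dataclass(frozen=True)
class ErrorAnnotation:
    dimension: str  # e.g. "accuracy", "fluency", "terminology", "style"
    severity: str   # a key of SEVERITY_WEIGHTS


def penalty_profile(annotations):
    """Sum weighted penalties per dimension rather than into one number."""
    profile = Counter()
    for annotation in annotations:
        profile[annotation.dimension] += SEVERITY_WEIGHTS[annotation.severity]
    return dict(profile)


# Two hypothetical translations of the same source text.
translation_a = [ErrorAnnotation("accuracy", "major")]
translation_b = [ErrorAnnotation("style", "minor")] * 5

profile_a = penalty_profile(translation_a)
profile_b = penalty_profile(translation_b)

print(profile_a, sum(profile_a.values()))
print(profile_b, sum(profile_b.values()))
```

Both hypothetical translations collapse to the same total penalty of 5, yet one contains a single major accuracy error and the other five minor stylistic ones. Which of the two is acceptable depends on the use case, and that is exactly the information a scalar score discards.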

What is new is the growing tension between this reality and how quality is often operationalized in automated settings.

Progress in evaluation, and its limits

The field has made significant advances in automated translation quality evaluation. Metrics have evolved from surface-level comparisons to increasingly sophisticated neural, and now LLM-based, approaches that correlate better with human judgments.

A key milestone was the development of metrics such as COMET, which leverage neural representations to capture semantic adequacy and fluency more effectively, particularly at the system level. More recently, quality estimation (QE) approaches such as COMET-QE and production-oriented scores like Phrase Quality Performance Scores have extended this paradigm by estimating quality without the need for reference translations.

These advances have been essential. They have enabled large-scale system comparison, faster iteration, and more consistent evaluation practices. They have also supported dynamic routing of content, balancing risk and quality requirements, and the effective allocation of human resources to where translation quality is most lacking and human attention is most needed. I’ve addressed these advances in some of my earlier blog articles.
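As a rough illustration of QE-driven routing, the sketch below sends a segment either straight through or to a human reviewer, depending on its estimated quality and the risk level of the content. The 0-1 score scale, the threshold values, and the `route_segment` function are all hypothetical; they do not describe how any particular product or metric behaves.

```python
# Hypothetical minimum QE scores (on an assumed 0-1 scale) required to
# skip human review, per content risk level. Values are illustrative only.
AUTO_APPROVE_THRESHOLD = {"low": 0.70, "medium": 0.85, "high": 0.95}


def route_segment(qe_score: float, risk: str) -> str:
    """Return "auto_approve" or "human_review" for one segment."""
    if qe_score >= AUTO_APPROVE_THRESHOLD[risk]:
        return "auto_approve"
    return "human_review"


# Three made-up segments with the same QE score but different risk levels.
segments = [
    ("UI tooltip", 0.88, "low"),
    ("Marketing headline", 0.88, "medium"),
    ("Dosage instructions", 0.88, "high"),
]
routes = {name: route_segment(score, risk) for name, score, risk in segments}
print(routes)
```

With identical scores, only the high-risk segment is sent to a human in this example. The point of the design is that the threshold belongs to the use case, not to the metric.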

But important limitations remain.

Even the most advanced translation quality metrics are still approximating an abstract concept that is inherently contextual, nuanced, and sometimes subjective.

This becomes especially apparent at high levels of quality, where differences are more subtle and difficult to capture reliably at the segment level. QE systems can provide useful signals, but their reliability for fine-grained, context-dependent judgments, especially in creative domains, remains limited.

In other words, automated evaluation has improved dramatically, but it has not fully “solved” the task of translation quality assessment across the broad spectrum of use cases, content types, and languages. The gap between what we can measure and what we actually care about is still very real.

The legacy workflow: MTPE

Despite all the technological progress, one aspect of the translation pipeline has remained surprisingly stable.

The dominant workflow is commonly referred to as MTPE (Machine Translation Post-Editing):

MT generates a draft → the draft is handed to a human → the human post-edits it to final quality

This model emerged naturally from earlier MT systems, where machine output was imperfect but usable. Human post-editing provided a practical way to bridge the gap between MT and the desired levels of production quality.

Crucially, MTPE did not emerge in isolation. It was layered onto translation memory (TM)-based workflows, where translation has long been structured around retrieving segment-level matches and editing them. Machine translation effectively became another “match source,” fitting neatly into a paradigm of retrieval followed by editing.

Over time, this workflow became deeply embedded in pricing models, project management practices, and the very definition of the translator’s role.

It also embeds a set of assumptions:

The machine produces a static draft
The human’s role is to fix that draft
Efficiency comes from reducing post-editing effort

For a long time, these assumptions were reasonable.

They are increasingly questionable today.

Why is this model breaking down?

Large language models fundamentally change what is possible. They are not just better generators: they can incorporate context, follow complex instructions, and adapt dynamically.

Yet in practice, we often still use them as static text generators. Prompt once, generate a translation, then pass it to a human for revision.

In a rapidly increasing set of scenarios, this is the wrong abstraction.

Using LLMs this way underutilizes their most important capabilities. It also places an unnecessary burden on human translators, who end up correcting outputs that could have been improved earlier in the process.

The model is only one part of the equation. What ultimately determines output quality is the orchestration layer around it. This includes the context, instructions, quality signals, and feedback loops that shape what the model produces. That is where the real leverage sits, and it is largely absent from standard MTPE as it is practiced today.
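One way to picture such an orchestration layer is as a loop in which a quality signal feeds back into generation before any human sees the output. The Python sketch below is a simplified illustration under stated assumptions: `orchestrate`, `fake_translate`, and `fake_assess` are hypothetical names, and the two stand-in functions replace what would be an LLM call and a quality estimator in a real system.

```python
from typing import Callable, Tuple


def orchestrate(
    source: str,
    context: str,
    translate: Callable[[str, str, str], str],
    assess: Callable[[str, str], Tuple[float, str]],
    threshold: float = 0.8,
    max_rounds: int = 3,
) -> Tuple[str, float]:
    """Regenerate with feedback until the quality signal clears the threshold."""
    feedback = ""
    best_text, best_score = "", float("-inf")
    for _ in range(max_rounds):
        candidate = translate(source, context, feedback)
        score, feedback = assess(source, candidate)
        if score > best_score:
            best_text, best_score = candidate, score
        if score >= threshold:
            break
    return best_text, best_score


# Stand-ins for an LLM call and a quality signal, so the sketch runs offline.
def fake_translate(source: str, context: str, feedback: str) -> str:
    return "revised draft" if feedback else "first draft"


def fake_assess(source: str, candidate: str) -> Tuple[float, str]:
    if candidate == "first draft":
        return 0.6, "tone is too formal for the brand voice"
    return 0.9, ""


result = orchestrate("Hello", "casual brand voice", fake_translate, fake_assess)
print(result)
```

In this toy run the first draft scores below the threshold, its feedback is passed into a second generation round, and the revised draft is returned. The human translator would then start from the improved output rather than from the first attempt.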

In creative domains such as literary translation, the problem is even more acute. Post-editing can constrain the translators’ creative process, anchoring them to choices they would not have made independently.

More broadly, MTPE assumes a one-shot interaction between human and machine. But LLMs enable something fundamentally different: an ongoing, adaptive, collaborative process.

We are only beginning to explore what that might look like.

A transition moment for the field

These emerging new capabilities do not mean that current practices become obsolete overnight. Change will be gradual and uneven across domains, language pairs, and use cases.

But the direction is clear.

We are entering a phase where progress is no longer driven primarily by better models, but by how those models are applied and used. That includes how they are integrated into workflows and where human expertise is applied most effectively in the overall translation task.

What this points to is an orchestration gap: the difference between what an LLM produces when prompted in isolation and what it produces when supplied with the right context, quality criteria, and domain knowledge. Recent research consistently shows that orchestration quality matters just as much as model selection. A well-orchestrated system running on a weaker model often outperforms an isolated call to a state-of-the-art one. That gap only widens as models improve, and it is largely where the value in modern translation systems will be built.

If the traditional post-editing paradigm is no longer sufficient, the natural next question is: what should replace it?

That question leads to a different conceptualization of the automated translation process, built around intelligence rather than generation and correction: one in which context, quality signals, and domain knowledge are built into the process from the start, not retrofitted by a human at the end.

In Part 2, I’ll explore what that looks like in practice: the workflows, interaction points, and design principles that move beyond post-editing toward something more dynamic and adaptive.
