Forcing the fit: MQM in the age of automated evaluation

MQM was the most robust answer we had for human evaluation. Automated evaluation is a different problem, and the seams are starting to show.

I spent eight years building technology to make machines evaluate translations the way trained human reviewers do. COMET and the metrics that followed were, in large part, an effort to teach a model the MQM protocol.

I now think that was the wrong goal; MQM is sound, but it was built for human judgement, and a machine reproducing it inherits a job it cannot do well. Many of the AI LQA tools now arriving chase that same goal, and it is quietly holding the category back.

Why MQM looks the way it does

MQM exists to solve one hard problem: human subjectivity. Two qualified linguists can read the same segment and disagree about whether something is an error and how serious it is. A granular, well-defined typology constrains that disagreement and gives reviewers guardrails, which is why it has become the backbone of analytic quality evaluation across the industry.

Even so, inter-annotator agreement is famously hard to achieve; we have revised and reshaped the framework for years, and it remains difficult to get two people to score the same content the same way.
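To make "agreement" concrete, here is a minimal sketch of Cohen's kappa, one common way to measure how far two annotators agree beyond chance. It is an illustration, not the measure any particular MQM programme uses, and the reviewer labels below are invented for the example.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Share of items on which the two annotators gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical severity labels from two reviewers on the same eight segments.
reviewer_1 = ["minor", "major", "none", "minor", "none", "major", "minor", "none"]
reviewer_2 = ["minor", "minor", "none", "major", "none", "major", "none", "none"]

raw_agreement = sum(a == b for a, b in zip(reviewer_1, reviewer_2)) / len(reviewer_1)
kappa = cohens_kappa(reviewer_1, reviewer_2)
print(f"raw agreement: {raw_agreement:.3f}, kappa: {kappa:.3f}")
```

On this invented data the reviewers agree on five of eight segments (62.5%), but once chance agreement is discounted the kappa is about 0.43, which is why raw agreement figures tend to flatter a review programme.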

That is the framework we then handed to machines, on the reasonable assumption that the best automated evaluation is the one that best imitates an expert human.

There wasn’t a better option at the time. There is now, and that assumption is where the trouble starts.

What automated evaluation actually asks of a model

Ask a model to do MQM-style evaluation and you have really asked it to do two jobs. First, find what is wrong. Second, decide which category in the typology the problem belongs to.

The second job is harder than it looks. Handing the model the category definitions does not solve it, because applying a definition is an act of judgement, not a lookup.

Deciding which category an error belongs to is exactly the step where trained humans disagree with one another, and the model’s sense of how to apply those categories is shaped by the same inconsistent annotations it learned from.

The noise enters twice: once in the judgement itself, and once in the examples it learned that judgement from. You have stacked a judgement-heavy classification task on top of detection, and that classification rests on interpretations we already know to be unreliable.

The most recent evidence bears this out. The WMT25 shared task on automated translation evaluation found that accurate error detection, and the balance between precision and recall, remain persistent challenges, and that at the segment level reference-based baseline metrics outperformed large language models.

Segment-level judgement is exactly what LQA is. So even the first job, reliable detection, is not solved; the second job sits on top of it.

There is a sensible-looking response to this, and a lot of tools have reached for it: make the task easier.

Shrink the typology. Give the model a handful of categories instead of a hundred and its results look much better, because you have changed the question from “what kind of error is this?” to something closer to “is this one of these five things?”

The evaluation numbers improve. The problem is what you have quietly done to get there.

When you reduce the typology, you also reduce the coverage of what the customer cares about, and you do it on their behalf without asking them. A customer defines what good looks like in their style guide. They have thought about their brand, their market, the things that make their content land or fall flat.

Then we evaluate that content against a generic, pre-decided subset of error types that nobody mapped back to their style guide. Whatever they cared about that falls outside the subset is now invisible, and they never chose that blind spot.
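The blind spot can be made visible with very little machinery. The sketch below assumes someone has mapped each style-guide concern to its nearest typology category (or to nothing, where no category fits). The concerns, the category labels, and the reduced typology are all hypothetical examples, not an official MQM subset.

```python
# Hypothetical mapping from a customer's stated concerns to the nearest
# typology category. None means no category fits the concern at all.
CONCERN_TO_CATEGORY = {
    "approved product terms": "terminology",
    "local date and price formats": "locale convention",
    "reassuring tone on payment reminders": "style",
    "legal wording of auto-renew": None,
    "no mistranslated offers": "accuracy",
}

# A hypothetical reduced typology of the kind a tool might ship with.
REDUCED_TYPOLOGY = {"accuracy", "fluency", "terminology"}


def unmeasured_concerns(concern_to_category, typology):
    """Return the concerns that the given typology cannot report on."""
    return sorted(
        concern
        for concern, category in concern_to_category.items()
        if category not in typology
    )


blind_spots = unmeasured_concerns(CONCERN_TO_CATEGORY, REDUCED_TYPOLOGY)
for concern in blind_spots:
    print("not measured:", concern)
```

With these example values, three of the five concerns drop out of view. The point of the sketch is only that the gap is computable once the mapping exists; in practice, that mapping is the step nobody performs.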

Notice the bind. Keep the full typology and you overwhelm the model, which is what the shared-task results show. Shrink it and you quietly decide which of the customer’s concerns stop being measured. Both roads fail for the same reason: a generic catalogue of error types was never the right unit of evaluation.

The problem is not how many categories it has; it is that the catalogue is standing in for what this particular customer means by quality.

I have watched this happen. A customer has an expectation that is peripheral to the typology, or that you can only represent by squeezing it into a category that doesn’t quite fit. That is the glass slipper; you can force the foot in, but you shouldn’t be surprised when it pinches.

A concrete case makes it clearer. Picture a subscription business localising its checkout and renewal messaging for a new market.

What actually creates risk for them is rarely one tidy error type; it is a combination: a renewal date or price that reads wrong in the local format, a term like “free trial” or “auto-renew” that carries a slightly different connotation in the target language, and a tone on the payment reminder that lands as pushy rather than reassuring.

Taken one at a time, several of those might score as minor, or fall outside the typology altogether. Taken together, at the moment someone is deciding whether to keep paying, they are what produces cancellations and chargebacks.

The customer understands this about their own funnel. A generic typology, and especially a reduced one, has no way to encode “these particular things, in combination, in this context, are what hurt us.”

That intersection is specific to the customer, and it is exactly what gets flattened when evaluation is scoped around standard categories rather than around their own definition of what matters.
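As a rough illustration of what encoding that intersection could look like, the sketch below flags risk only when several customer-defined findings co-occur in a customer-defined context. The `Finding` type, the finding kinds, the context name, and the threshold are all hypothetical; this is one possible shape for such a rule, not a description of any product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    segment_id: str
    kind: str      # hypothetical kinds, e.g. "locale_format", "pushy_tone"
    severity: str  # "minor" or "major", as a per-item score would report it
    context: str   # where the content appears, e.g. "renewal_reminder"


# Customer-defined: the kinds of finding that hurt in combination, and where.
RISK_COMBO = frozenset({"locale_format", "term_connotation", "pushy_tone"})
RISK_CONTEXT = "renewal_reminder"


def combined_risk(findings, combo=RISK_COMBO, context=RISK_CONTEXT, threshold=2):
    """True if at least `threshold` distinct kinds from `combo` occur in `context`."""
    kinds = {f.kind for f in findings if f.context == context and f.kind in combo}
    return len(kinds) >= threshold


findings = [
    Finding("seg-1", "locale_format", "minor", "renewal_reminder"),
    Finding("seg-2", "term_connotation", "minor", "renewal_reminder"),
    Finding("seg-3", "pushy_tone", "minor", "renewal_reminder"),
]

print(combined_risk(findings))      # all three minor findings together
print(combined_risk(findings[:1]))  # one minor finding on its own
```

Each finding here is individually "minor", so a per-category score would barely move. The rule fires only because the findings occur together, in the context the customer has said matters.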

What customers are asking for now

Two things are changing at once. The first is that buyers increasingly care about outcomes rather than scores. Linguistic quality measured in the abstract is not always what matters most; what stakeholders need to demonstrate, often to people upstream who do not live in localization, is that a piece of content did its job. Did this version perform better, by our standards, than the last one?

The second is interpretability. Hand an MQM scorecard to someone who does not live and breathe quality frameworks and they will not want to read it.

They do not want a breakdown of penalty points across dimensions. They want to know that the things they said they cared about were checked, and whether those things passed.

A framework built to give expert reviewers granular, defensible detail is not the right interface for a marketer who needs a yes or a no.

A different starting question

So we stopped trying to make the machine imitate the human protocol, and we changed the question. Instead of asking which error types from a universal taxonomy we should check, we ask what quality means for this content, this audience, and this point in the workflow.

That is the thinking behind Quality Profiles. The premise is straightforward: you already have standards. Your style guide, your termbase, your sense of what your brand sounds like in each market. Rather than retrofitting all of that onto a generic typology and then interpreting noisy signals back into business meaning, you evaluate against the standards you already hold.

You get a profile shaped around your content rather than a global LQA instance you have to tune toward your needs. The evaluation understands your language, not a one-size-fits-all definition of good.

This is also where the frontier work on evaluation has been pointing. The discipline of building good evals, of defining clearly what you want a system to achieve and measuring directly against that, is exactly the muscle the wider AI field has been developing. It maps far more naturally onto purpose-defined quality than onto a fixed error taxonomy.

None of this means throwing away what works. Some checks are simple and deterministic (length, formatting, forbidden terms), and they are better handled as cheap rule-based checks than as a job for an expensive AI judge.

Others genuinely need reasoning, and that is where you spend your agentic budget. And where a customer wants MQM, for human review, for compliance or audit reasons, or for a programme already built around it, it is not going anywhere; MQM remains the right tool for human evaluation, and if you want its categories applied automatically, a Quality Profile can be configured to mirror them. We are not asking anyone to reinvent the wheel; we are saying the framework was built for a problem that is no longer the only one we have.
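The split between deterministic checks and reasoning-heavy ones can be sketched as follows. The function names, the placeholder syntax (`{name}`), the character limit, and the list of deferred checks are assumptions made for the example; the AI judge is deliberately left out, since the sketch only shows what can be settled without one.

```python
import re

# Assumed placeholder syntax for this example: {name}
PLACEHOLDER = re.compile(r"\{[A-Za-z_][A-Za-z0-9_]*\}")

# Checks that need reasoning and would go to a judge (not implemented here).
DEFERRED_TO_JUDGE = ("tone", "connotation")


def length_ok(target, max_chars):
    """Target fits within the character limit."""
    return len(target) <= max_chars


def placeholders_ok(source, target):
    """Target carries exactly the same placeholders as the source."""
    return sorted(PLACEHOLDER.findall(source)) == sorted(PLACEHOLDER.findall(target))


def forbidden_terms_found(target, forbidden):
    """Case-insensitive substring match against a forbidden-terms list."""
    lowered = target.lower()
    return [term for term in forbidden if term.lower() in lowered]


def run_rule_checks(source, target, max_chars, forbidden):
    """Run the cheap checks and report which checks remain for a judge."""
    return {
        "length": length_ok(target, max_chars),
        "placeholders": placeholders_ok(source, target),
        "forbidden_terms": not forbidden_terms_found(target, forbidden),
        "deferred": list(DEFERRED_TO_JUDGE),
    }


report = run_rule_checks(
    source="Your plan renews on {date} for {price}.",
    target="Ihr Tarif verlängert sich am {date} für {price}. Garantiert kostenlos!",
    max_chars=120,
    forbidden=["garantiert"],
)
print(report)
```

In this example the length and placeholder checks pass and the forbidden-terms check fails, all without a model call. Whether a failed rule check should also skip the judge is a design choice the sketch leaves open.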

We shipped an Auto LQA capability in 2024, and we are moving on from it. The ceiling we hit was in the approach, not the build; automating the steps of a human review process gives you a faster human review process, not a better answer to what quality means. Hitting that ceiling ourselves is the clearest reason we are confident about building differently now.

The industry does not need a more complete version of MQM-based automated LQA. It needs a clearer answer to an older question: what is this content for, and did it do that job?

I would not pretend we have finished answering it. But I am fairly sure that is the question, and I would rather work on the right problem than automate our way around the wrong one.
