
Enterprise-Specific Machine Translation Adaptation with LLM-based Technology

Discover how enterprise-specific machine translation is evolving with cutting-edge LLM technology. From static adaptation to on-the-fly dynamic adjustments powered by Generative AI, learn how these advancements are addressing enterprise requirements for precise terminology, branding, and style. See how Phrase’s Next GenMT and innovative fuzzy match retrieval transform translation accuracy and efficiency, paving the way for a new era in localization.

Over the past year, our mission at Phrase has increasingly focused on leveraging the rapid innovations emerging from Generative AI to unlock opportunities for translation automation at scale for our Enterprise customers. 

Machine Translation (MT) is of course a central pillar of our translation automation technology stack. Until recently, the primary challenges with MT were centered around basic linguistic translation accuracy.

This involved ensuring that MT-generated translations are grammatically correct and fluent in the target language and are an accurate reflection of the meaning of the source text.

With the latest generation of LLM-based MT technology, these issues have largely (if not completely) been resolved.

However, in the realm of Enterprise translation use-cases, a primary remaining challenge for MT lies in addressing the nuanced enterprise-specific requirements for “fit-for-purpose” translation.

This includes strict and consistent adherence to enterprise-specific terminology, branding, formality, and style preferences. Generative AI and large language models (LLMs) are emerging as pivotal new tools for adapting MT to meet these requirements.

The evolution of enterprise-specific MT adaptation

The challenge of adapting MT to generate enterprise-specific translations is not new.  Since the emergence of data-driven MT technology (Phrase-based MT and later Neural MT), it’s been well understood and established that the data used to train an MT system largely determines the language choices that it generates.

Since the archive of an enterprise’s previous translations largely reflects its preferences, it has become common practice to train dedicated, enterprise-specific MT systems that leverage these linguistic assets.

This approach is known as “Static Adaptation”. With LLMs and advancements in Generative AI, a new approach—dynamic on-the-fly adaptation—has gained prominence.

It leverages real-time retrieval of the most contextually relevant translations from existing linguistic assets and provides these as context to the MT generative model as it generates a translation.

Both of these adaptation methods have their merits and drawbacks, and understanding their differences is crucial for enterprises aiming to optimize their translation processes.

Static adaptation: The traditional approach

Static adaptation involves training dedicated custom MT models using enterprise-specific data, primarily translation memories (TMs) and curated termbases (TBs).

Neural MT technology is particularly well suited for training statically-adapted custom models. The typical starting point for the adaptation is a strong pre-trained general bilingual or multilingual MT model, previously trained on a large volume of general data.

In the secondary adaptation stage, this model is then “fine-tuned” on the collection of enterprise-specific data, resulting in a new model, specifically suited to translate content for the given enterprise.

To the extent that the enterprise-specific data resources adequately adhere to the enterprise’s desired terminology, branding, formality and style, the new model implicitly learns to generate language that largely adheres to these preferences.
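To make the fine-tuning step a bit more concrete, here is a minimal sketch in Python using the Hugging Face transformers library. The base model name, the enterprise_tm.jsonl file, and the hyperparameters are illustrative assumptions; this is a sketch of the general technique, not a description of Phrase’s actual training pipeline.

```python
# Sketch of static adaptation: fine-tuning a pre-trained MT model on
# enterprise translation-memory data. Model name, file path, and
# hyperparameters are assumptions for illustration only.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

base = "Helsinki-NLP/opus-mt-en-de"            # generic pre-trained MT model (assumed)
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSeq2SeqLM.from_pretrained(base)

# Enterprise TM exported as JSONL with "source" / "target" fields (assumed format)
tm = load_dataset("json", data_files="enterprise_tm.jsonl")["train"]

def preprocess(batch):
    enc = tokenizer(batch["source"], truncation=True, max_length=256)
    enc["labels"] = tokenizer(text_target=batch["target"],
                              truncation=True, max_length=256)["input_ids"]
    return enc

tm = tm.map(preprocess, batched=True, remove_columns=tm.column_names)

args = Seq2SeqTrainingArguments(output_dir="adapted-model",
                                num_train_epochs=3,
                                learning_rate=2e-5,
                                per_device_train_batch_size=16)

trainer = Seq2SeqTrainer(model=model, args=args, train_dataset=tm,
                         data_collator=DataCollatorForSeq2Seq(tokenizer, model=model))
trainer.train()                      # the fine-tuning (adaptation) step
trainer.save_model("adapted-model")  # static, enterprise-specific MT model
```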

This method ensures a high degree of customization, as the system learns from significant archives of previous translations.

Phrase NextMT, our neural-technology MT product, was largely developed to allow our enterprise customers and LSP partners to easily train, and use in production, custom MT models built through static adaptation. For details, see how Phrase Custom AI unlocks new machine translation possibilities.

Statically-adapted MT systems have several advantages, but also some clear drawbacks.

The primary value of a statically-adapted MT model is in its ability to generate translations that adhere far better to the enterprise’s language requirements.

The MT-generated translations from such custom models therefore require far less human editing. In some use cases, they are even suitable for publication, forgoing human editing and review altogether.

But the drawbacks of this approach are significant, for both the enterprise and the MT developer/provider:

  1. Training the custom models requires significant amounts of customer-specific data and is resource-intensive and time-consuming.
  2. The resulting custom model is – as the method’s name suggests – static. As enterprise content and language evolve over time, the model needs to be frequently retrained. 
  3. The consistency and effectiveness of the resulting custom MT model is largely dependent on the consistency and quality of the underlying enterprise training data.  
  4. The custom model cannot adapt to language requirements that are fluid or changing, or that are only available at translation time.
  5. Maintaining and operating a large collection of custom models in production can be a challenging and costly engineering endeavor.

In summary, despite its strengths, static adaptation often struggles to keep pace with the dynamic demands of modern localization businesses.

On-the-fly adaptation: The LLM revolution

On-the-fly dynamic adaptation, powered by LLMs, represents a more flexible and efficient approach to adapting MT to specific language requirements.

Through new techniques such as in-context and “few-shot” learning, LLMs can adapt translations in real time by leveraging a small set of contextually relevant examples.

This eliminates the need for exhaustive model training and retraining.

How it works:

Few-shot learning relies on presenting the model at translation time with a minimal set of relevant high-quality translation examples.

LLMs use this data to dynamically align their generated translation output with the enterprise’s specific linguistic preferences, as exhibited in the provided examples.

The examples themselves are retrieved at translation time from the enterprise’s available linguistic assets (TMs and TBs).
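As a rough illustration, the sketch below shows how retrieved TM examples and a dynamic requirement such as formality can be placed into an LLM prompt at translation time. The prompt wording, the model choice, and the retrieve_fuzzy_matches() helper (sketched later in this post) are assumptions for illustration, not Phrase Next GenMT’s implementation.

```python
# Sketch of on-the-fly few-shot adaptation: fuzzy matches retrieved from the
# enterprise TM are provided as in-context examples to a general-purpose LLM.
from openai import OpenAI

client = OpenAI()

def translate_with_examples(source: str, examples: list[tuple[str, str]],
                            formality: str = "formal") -> str:
    # Each (src, tgt) pair is a fuzzy match retrieved from the enterprise TM.
    shots = "\n".join(f"English: {s}\nGerman: {t}" for s, t in examples)
    prompt = (
        f"Translate from English to German, using a {formality} register and "
        "following the terminology and style of the examples below.\n\n"
        f"{shots}\n\nEnglish: {source}\nGerman:"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

# examples = retrieve_fuzzy_matches(source, tm)  # hypothetical retrieval step
# print(translate_with_examples(source, examples, formality="informal"))
```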

This approach has several clear advantages:

  1. No custom models to train. Adaptation is done on-the-fly with a single LLM-based MT model.
  2. No retraining required. The ongoing updating of the underlying enterprise assets (i.e. TMs) ensures that the most relevant and up-to-date examples will be retrieved.
  3. Incorporating dynamically changing requirements. Aspects of the translation that are dynamic in nature such as the desired level of formality (formal or informal language) can be specified for each requested translation.

However, on-the-fly few-shot adaptation also has some drawbacks of its own:

  1. Translation quality and adaptation effectiveness largely depend on the quality of the retrieved examples, and crucially, on the similarity of these examples to the content to be translated. When good examples are lacking or unavailable, translations can revert to being largely generic in nature.
  2. Optimal performance is dependent on the availability of a robust and effective example retrieval module.
  3. MT speed and latency may be impacted by the more complex run-time processing and the nature of the underlying LLM models, which can also adversely affect performance and costs.

The role of fuzzy match retrieval in on-the-fly adaptive MT

A cornerstone of on-the-fly adaptive MT is the ability to retrieve and leverage relevant examples.

“Fuzzy Match” retrieval, a concept borrowed from traditional computer-assisted translation (CAT) tools, can play a crucial role here.

Fuzzy matching has long been used to enhance translator productivity in CAT tools. This retrieval method is designed to search within translation memory archives for translation examples that are most similar but not identical to the current content.

Within CAT tools, fuzzy matches provide the translator with a highly similar previous translation that can then be adapted into a translation for the current segment at hand.

However, the same fuzzy-match retrieval method can be used as the retrieval step for on-the-fly few-shot MT adaptation.
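To make the retrieval step concrete, here is a minimal sketch of fuzzy-match retrieval over a translation memory using a simple character-level similarity ratio from Python’s standard library. Production systems typically use indexed, token-aware matching; this is illustrative only, and the helper name mirrors the hypothetical one referenced earlier.

```python
# Sketch of fuzzy-match retrieval: find TM entries whose source side is most
# similar (but not necessarily identical) to the segment being translated.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def retrieve_fuzzy_matches(source: str, tm: list[tuple[str, str]],
                           threshold: float = 0.75, k: int = 1):
    scored = [(similarity(source, tm_src), tm_src, tm_tgt)
              for tm_src, tm_tgt in tm]
    scored = [hit for hit in scored if hit[0] >= threshold]
    return sorted(scored, reverse=True)[:k]

tm = [("Click Save to apply your changes.",
       "Klicken Sie auf Speichern, um Ihre Änderungen zu übernehmen.")]
print(retrieve_fuzzy_matches("Click Save to apply the changes.", tm))
```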

At Phrase, this idea was identified and implemented early on as a capability of the neural Phrase NextMT solution. Phrase NextMT was specifically designed to retrieve and incorporate a single best fuzzy match example into its translation generation.

However, the methodology for doing so was far less advanced and effective than what is currently possible with LLM-based technology.

Our new LLM-based MT solution, Phrase Next GenMT, was designed and developed to leverage fuzzy-match retrieval for few-shot on-the-fly MT adaptation.

Benchmarking the impact of fuzzy match retrieval

So how well does fuzzy-match retrieval work in practice, when incorporated within a few-shot on-the-fly adaptation MT implementation? We recently conducted a benchmarking evaluation study specifically designed to answer this experimental question.

It compared the performance of baseline MT systems (without fuzzy-match retrieval) and systems integrated with fuzzy-match retrieval, for both Phrase NextMT (our neural MT solution) and Phrase Next GenMT (our LLM-based MT solution).

Data: 

The data for this benchmarking study was extracted from content recently translated on the Phrase Platform for five representative enterprise customers with extensive translation memories (TMs). It covers 13 language pairs (all with English as the source language) and multiple domains and content types.

For each segment in the data, we retrieved multiple examples from the customer TMs using fuzzy-match retrieval, and then categorized these retrieved examples into similarity buckets based on their source-side similarity score.

We then isolated a targeted subset of the data for which each segment has a fuzzy match example in each of four similarity-score buckets (0.75-0.80, 0.80-0.85, 0.85-0.90, 0.90-0.95). This final subset consists of 3,738 segments.
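For readers who want to picture the filtering step, here is a small sketch of the bucketing logic described above. The field names, data layout, and boundary handling are assumptions; the actual study tooling is not shown here.

```python
# Sketch of the bucketing step: keep only segments that have a retrieved
# fuzzy match in every one of the four similarity-score buckets.
BUCKETS = [(0.75, 0.80), (0.80, 0.85), (0.85, 0.90), (0.90, 0.95)]

def bucket_of(score: float):
    for low, high in BUCKETS:
        if low <= score < high:
            return (low, high)
    return None

def has_match_in_every_bucket(matches: list[dict]) -> bool:
    """matches: retrieved fuzzy matches for one segment, each with a 'score'."""
    covered = {bucket_of(m["score"]) for m in matches}
    return all(b in covered for b in BUCKETS)

# Segments passing this filter would form the benchmark subset.
```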

Experimental setup and evaluation:

The objective of the experiment was to contrastively compare the accuracy and quality of translations generated by Phrase NextMT and by Phrase Next GenMT.

We contrastively examined the impact of single-shot adaptation using the available fuzzy match example from each of the four similarity-score buckets, as well as with no adaptation at all.

For Phrase Next GenMT, we experimented with two different OpenAI LLM models: GPT-4o and GPT-4o-mini. Each of the segments in the benchmark set was therefore translated 15 times (three MT system variants × five fuzzy-match conditions).

We aggregated results for each of these 15 conditions and analyzed their translation quality with four commonly used reference-based MT evaluation metrics (COMET, BLEU, chrF and TER).

Additionally, we evaluated translation quality with Phrase QPS, our Quality Estimation metric (which does not require reference translations).
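As a rough illustration of how the reference-based metrics can be computed, here is a minimal sketch using the sacrebleu library. The example sentences are made up, and COMET and Phrase QPS rely on separate tooling that is not shown here.

```python
# Sketch of reference-based scoring for one experimental condition
# using sacrebleu (BLEU, chrF, TER).
import sacrebleu

hypotheses = ["Klicken Sie auf Speichern, um die Änderungen zu übernehmen."]
references = ["Klicken Sie auf Speichern, um Ihre Änderungen zu übernehmen."]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])
ter = sacrebleu.corpus_ter(hypotheses, [references])

print(f"BLEU: {bleu.score:.2f}  chrF: {chrf.score:.2f}  TER: {ter.score:.2f}")
```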

Results: 

The results for three of the analyzed metrics (COMET, BLEU and Phrase QPS) are depicted in the graphics below. Results for chrF and TER were very similar and are omitted for brevity.

Analysis:

The results according to all three metrics (COMET, BLEU and Phrase QPS) are largely consistent. They confirm that on-the-fly one-shot adaptation to retrieved fuzzy match examples is highly effective in improving MT quality and adherence to enterprise-specific language requirements, for all three MT models tested.

Furthermore, the similarity-level of the retrieved fuzzy-match has a significant impact on the effectiveness of the adaptation.

This is particularly evident from the COMET and BLEU results. These metrics assess quality by contrasting the MT-generated translation with a provided reference translation. Since the word choices in the reference explicitly expose the enterprise’s linguistic preferences, higher COMET and BLEU scores are largely reflective of improved adherence of the MT to these linguistic preferences.

As can be seen, in our experiment adaptation to a single fuzzy match example of higher similarity clearly resulted in dramatic improvements according to both COMET and BLEU scores.

The picture according to Phrase QPS is a bit more nuanced, however. 

As a quality-estimation metric, Phrase QPS has no access to a reference translation. 

Phrase QPS, as currently trained, is a predictor of MQM (Multidimensional Quality Metrics) scores and is designed to reflect general linguistic quality. It is therefore far less sensitive to the extent to which the MT-generated translation adheres to specific enterprise terminology and language preferences.

Consequently, Phrase QPS is also far less sensitive to the similarity score of the one-shot fuzzy match example. Single-shot adaptation clearly still helps, especially for Phrase Next GenMT with GPT-4o-mini: without adaptation it is clearly weaker than Phrase Next GenMT with GPT-4o, but it is fully on par with GPT-4o when adapted to highly similar examples.

Furthermore, for the neural Phrase NextMT system, single-shot adaptation to lower-scoring examples (in the 0.75-0.80 and 0.80-0.85 buckets) actually hurt translation quality according to Phrase QPS.

For Phrase NextMT, this type of on-the-fly adaptation thus appears to depend critically on the availability of a highly similar fuzzy match example.

Complementary LLM-based innovations: Automated LQA and document-level adaptation

Beyond their demonstrated effectiveness for on-the-fly MT adaptation, the Generative AI capabilities of LLMs offer further possibilities.

These include enabling additional automated steps in the translation workflow, such as automated translation error detection and correction and automated document-level translation adaptation.

Phrase’s recently launched Auto LQA and Auto Adapt implement these groundbreaking new capabilities.

These tools further refine translations, ensuring grammatical accuracy, style consistency, and brand alignment. Together, they form a cohesive ecosystem for enterprise translation management.

Conclusion

As enterprises continue to demand nuanced and high-quality translations, LLM-based MT adaptation offers transformative solutions.

Whether through static adaptation, on-the-fly adjustments, or complementary tools like automated document-level adaptation, the possibilities are vast.

By leveraging these advanced methodologies, enterprises can unlock unparalleled efficiency and precision in their translation automation processes, paving the way for seamless global communication.
