In the past few years, as the training of machine translation (MT) models and the adoption of open-source tools and technologies have become much easier, the translation industry has been focusing more attention on understanding the quality of MT output. Knowing the reliability and accuracy of your engines is critical to the timely and efficient delivery of high-quality translations, and the unique ways in which neural machine translation (NMT) gets things wrong have made this a much more challenging task than with older technologies.
It’s perhaps not surprising, therefore, that MT estimation and evaluation were hot topics at MT Summit 2021, a leading biennial event for the industry. TBSJ’s experts in MT, Paul O’Hare, CTO and co-founder, and Yury Sharshov, chief scientist, attended the summit in August and share their takeaways and insights in this report.
Evaluating and estimating – not the same thing
Evaluation of MT quality is critical for seeing whether engines are getting better, explains Paul. Translators can do this, of course, but the cost of repeated human evaluations is prohibitive, so a large number of segments (often thousands) are typically scored using automated metrics such as BLEU, RIBES, or METEOR, which compare the engine output against a set of reference translations prepared by trusted translators.
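For readers curious about the mechanics, here is a minimal, purely illustrative sketch of reference-based scoring using the open-source sacrebleu library. The example sentences are invented, and this is not a description of any particular vendor's pipeline.

```python
# Illustrative sketch only: corpus-level BLEU with the open-source sacrebleu
# library (pip install sacrebleu). The sentences below are invented.
import sacrebleu

# Engine output and trusted reference translations for the same segments.
hypotheses = [
    "The contract becomes effective on April 1.",
    "Both parties shall bear their own costs.",
]
references = [
    "The contract takes effect on April 1.",
    "Each party shall bear its own costs.",
]

# BLEU measures n-gram overlap between the engine output and the references
# (0-100 scale; higher is better).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.1f}")
```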
Evaluating engine quality can be painstaking due to the tools and processes involved, but it is a vital step in the modern translation workflow. Many presentations and discussions at MT Summit focused on new and better quality-evaluation metrics and processes. A team from translation firm Logrus Global presented its work on improved metrics, while representatives of fellow translation firm Welocalize provided an overview of some newly available MT metrics. Several companies, including localization management platform Crowdin, translation quality management company ContentQuo, and machine translation implementer Custom.MT, also demonstrated their approaches to evaluating MT quality.
And in recognition of the importance of this topic, the Award of Honor 2021 was presented to Dr. Alon Lavie of translation services provider Unbabel. He is one of the authors of the renowned metrics METEOR and COMET.
Meanwhile, estimation of the quality of individual translations as they emerge from a production engine allows the required level of post-editing work to be better understood. In some cases, the aim of estimation is to remove this human-review step entirely for those translations exceeding a certain threshold score. Estimation is carried out automatically on a sentence-by-sentence basis and without the use of a reference. Think of it as a score of confidence that the engine got the translation right.
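To make the idea concrete, here is a minimal sketch of how such a confidence threshold could route segments either straight through or into post-editing. The estimate_quality function is a placeholder heuristic rather than a real QE model or any vendor's API, and the threshold value is purely illustrative.

```python
# Sketch of threshold-based routing driven by sentence-level quality estimation (QE).
# estimate_quality() is a stand-in heuristic, not a real QE model; the threshold
# is purely illustrative and would be tuned per engine and content type.
from typing import List, Tuple

QE_THRESHOLD = 0.85


def estimate_quality(source: str, mt_output: str) -> float:
    """Stand-in scorer: a real deployment would call a trained QE model that
    scores the source/MT pair without any reference translation."""
    ratio = len(mt_output) / max(len(source), 1)
    return max(0.0, 1.0 - abs(1.0 - ratio))


def route_segments(pairs: List[Tuple[str, str]]):
    """Split (source, MT output) pairs into segments trusted as-is and
    segments that still need human post-editing."""
    trusted, needs_post_editing = [], []
    for source, mt_output in pairs:
        score = estimate_quality(source, mt_output)
        if score >= QE_THRESHOLD:
            trusted.append((source, mt_output, score))
        else:
            needs_post_editing.append((source, mt_output, score))
    return trusted, needs_post_editing
```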
At MT Summit, Aleš Tamchyna of translation management system provider Memsource presented the company’s approach to deploying MT Quality Estimation on a large scale and discussed the future role of quality estimation in translation. Meanwhile, many speakers mentioned that the source text is a critical factor in the outcome, with Alex Yanishevsky of Welocalize in particular presenting on how poorly written source material has a profound negative effect on the quality of MT.
TBSJ and evaluation metrics
At TBSJ, evaluation of MT is considered one of the most important aspects of our machine translation process, and performing this task to monitor the improvement of our engines has long been part of our operations. Our self-developed approach “facilitates important decisions during the post-editing stage, such as efficient reuse of MT suggestions and translation memory matches,” says Yury.
Furthermore, TBSJ has been using different approaches and automatic metrics “to train better machine translation engines and to ensure the highest quality final translations,” he adds. “Understanding the critical role of evaluating MT output, we recently released a free version of our quality evaluation tool to make it easier for translators and non-technical parties to compare and use automatic quality evaluation metrics.”
This tool is Sanbi, developed by TBSJ technologists who are also experienced in translation and linguistics. It automatically evaluates the quality of one or more machine-translated documents using a range of metrics that are used extensively in the industry.
The tool was initially released with just two metrics, BLEU and RIBES, but this week we updated it with two more that were widely discussed at the conference: METEOR and hLEPOR. Access to multiple metrics is important since each reflects different aspects of translation quality. METEOR, for example, takes synonyms and word stems into account, while hLEPOR is considered to be highly correlated with human evaluation.
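For anyone who wants to see synonym matching in action, the short sketch below scores a single sentence with NLTK's implementation of METEOR, which may differ in detail from the original METEOR tool; the sentences are invented for illustration.

```python
# Minimal sketch: sentence-level METEOR via NLTK's implementation, which may
# differ in detail from the original METEOR tool. Sentences are invented.
# Requires: pip install nltk, plus the WordNet data on first use.
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)

reference = "It is a big house near the station".split()
hypothesis = "It is a large house near the station".split()

# "large" is not an exact match for "big", but METEOR credits it as a
# WordNet synonym match, where a strict n-gram metric would simply penalise it.
score = meteor_score([reference], hypothesis)
print(f"METEOR: {score:.3f}")
```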
Thoughts on quality estimation
With excellent MT engines, as confirmed by metrics, human evaluation, and production use, we are keen to better understand how quality estimation can identify content that is not actually suitable for a specific engine. However, Paul recognizes that applying automatic estimation to bypass post-editing for certain segments may not be an option for TBSJ.
“This type of strategy is currently hard for us to imagine,” he explains. “Although we are successfully using custom-built engines for certain clients in fields like legal and equity research, the content is always reviewed and corrected by at least two expert translators. In the challenging Japanese–English language pair, we cannot see any other way to get to the levels of quality that our clients expect, despite the fact that our custom engines easily outperform their generic counterparts. Even removing one of the two expert translators seems like a stretch with the current technology.”
As seen at the conference, much research is still being done in the fields of estimation and evaluation, and we intend to monitor these developments closely and keep adopting the latest advances in our tools and processes.