TBSJ News: MT Summit: human evaluation of machine translation

The overwhelming consensus at MT Summit 2021, an event held in August for professionals in the field of machine translation (MT), was that effective evaluation of MT output is key to improving its quality. Some 700 specialists from around the world, in sectors as varied as academia, business, and government, discussed the progress and remaining challenges in assessing MT output.

In this event report, the follow-up to our previous article on automated evaluation of MT, TBSJ representatives Paul O’Hare (CTO and co-founder) and Yury Sharshov (chief scientist) focus on human evaluation. They explore the summit speakers’ remarks on human evaluation of MT as well as how TBSJ has approached the task.

Assessing MT output is fundamental not only to gauging the level of post-editing required but also to improving MT engines in the short and long term. By providing feedback on MT output as part of engine building, human linguists and technologists are key to this process.

“Automatic metrics provide specific information about translation quality. But the accuracy and reliability of these metrics can only be validated and confirmed by human evaluation,” says Yury.
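One common way to check an automated metric against human judgment is to correlate the two sets of scores over the same segments. The following is a minimal sketch of that idea, not a description of TBSJ's tooling: the function name and the choice of the Pearson coefficient are assumptions made for illustration, and the caller supplies the scores.

```python
from math import sqrt


def pearson(metric_scores, human_scores):
    """Pearson correlation between automated metric scores and human scores.

    Both lists must be aligned: index i refers to the same translated segment.
    (Hypothetical helper for illustration; not taken from the article.)
    """
    if len(metric_scores) != len(human_scores) or len(metric_scores) < 2:
        raise ValueError("need two aligned lists with at least two scores each")

    n = len(metric_scores)
    mean_m = sum(metric_scores) / n
    mean_h = sum(human_scores) / n

    # Covariance and variances (unnormalised; the 1/n factors cancel out).
    cov = sum((m - mean_m) * (h - mean_h)
              for m, h in zip(metric_scores, human_scores))
    var_m = sum((m - mean_m) ** 2 for m in metric_scores)
    var_h = sum((h - mean_h) ** 2 for h in human_scores)

    if var_m == 0 or var_h == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_m * var_h)
```

A coefficient close to 1 would suggest the metric ranks segments much as the human evaluators do, while a value near 0 would suggest it cannot be relied on for that content.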

Going beyond automated metrics means bringing in the human factor: skilled translators with subject-matter expertise review the output of two or more engines sentence by sentence and compare them on criteria such as grammar and fluency. Even this process has drawbacks, as it is subjective, time-consuming, and costly. Proper processes and clear, consistent rules can improve human evaluation, but the MT industry recognizes that more can still be done. Multiple summit presenters therefore discussed the current state and future prospects of human evaluation of MT.

Silvio Picinini of e-commerce platform eBay advised on how to be more confident in the quality of post-edited data, namely by checking specific aspects of quality systematically across the entire content, not just a sample. eBay’s approach offers insights that also help project managers, even if they are not native speakers of the target language.

James Phillips of the World Intellectual Property Organization (WIPO), a global forum for intellectual property services, gave a talk entitled “Human NMT Evaluation Approaches.” He explored how quality is measured in a large-scale, real-life scenario, and what attempts are being made to minimize the human effort involved, yielding substantial cost savings relative to conventional approaches without compromising quality. WIPO uses an advanced MT quality estimation (QE) algorithm trained on data that took six years to collect. Nevertheless, James emphasized that human evaluation remains a critical part of assessing the reliability of QE scores.

Paula Manzur of global content solutions provider Vistatec gave recommendations on how to improve the human evaluation procedure, thereby making this process more objective and accurate. She explained how important it is to set clear expectations for all evaluators, to set up the process properly (choosing engines, evaluators, and randomized, representative test sets), to run a full pilot, and to use an industry-established quality evaluation framework such as MQM.
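The randomized test set mentioned above can be illustrated with a short sketch. It draws a random sample of segments and shuffles the order of the engine outputs for each one, so that evaluators cannot tell which engine produced which candidate. This is an illustration under stated assumptions, not Vistatec's or TBSJ's actual procedure: the function name, the data layout, and the use of simple random sampling (rather than, say, sampling stratified by content type) are all hypothetical.

```python
import random


def build_blind_eval_set(segments, sample_size, seed=0):
    """Sample segments at random and hide which engine produced each output.

    `segments` is a list of dicts shaped like
        {"source": "...", "outputs": {"engine_x": "...", "engine_y": "..."}}
    (a hypothetical layout chosen for this example).

    Returns (tasks, answer_key): `tasks` go to the evaluators, while
    `answer_key` stays with the organiser and maps labels back to engines.
    """
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    sample = rng.sample(segments, min(sample_size, len(segments)))

    tasks, answer_key = [], []
    for seg in sample:
        engines = list(seg["outputs"])
        rng.shuffle(engines)  # new random order for every segment
        labels = [chr(ord("A") + i) for i in range(len(engines))]
        tasks.append({
            "source": seg["source"],
            "candidates": {label: seg["outputs"][engine]
                           for label, engine in zip(labels, engines)},
        })
        answer_key.append(dict(zip(labels, engines)))
    return tasks, answer_key
```

Evaluators would then score candidates A, B, and so on against whatever criteria the chosen framework defines, and the organiser would use the answer key afterwards to attribute scores to engines.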

At MT Summit, practically every presenter mentioned human evaluation as part of their research or production workflow. TBSJ is pleased to report that we have long followed the best practices identified at the conference. From the outset, we have engaged the right people, skilled in translation, linguistics, and the subject matter, and invested time, money, and effort in setting up an efficient, appropriate process for human evaluation of MT quality.

People are key to everything we do with language, so our goal has always been “to use not only specific automated metrics but also human evaluation tools, since feedback from linguists is crucial for MT evaluation,” explains Paul.

These efforts include developing our own tools. One that we use internally is Ginbi, the sister product of Sanbi, our automated MT evaluation tool, which we have just updated to include new metrics. As the MT industry strives to keep improving human evaluation tools, TBSJ hopes to contribute with the launch of Ginbi in the coming months.