If there were ever a shortlist of projects that had the potential to produce a Babel Fish-type translation device, this would probably be on it.
Backed by European academia, the private sector, and government, the project is called ELITR (pronounced “eh-lee-ter”), short for European Live Translator. The project was born out of the need to provide subtitles for a EUROSAI Congress back in May.
EUROSAI is the European Organization of Supreme Audit Institutions, and the Supreme Audit Office of the Czech Republic initiated the project to help translate speeches in real time from six source languages into 43 targets: the 24 EU languages plus 19 EUROSAI languages (e.g., Armenian, Russian, Bosnian, Georgian, Hebrew, Kazakh, Norwegian, Luxembourgish).
In an ELITR demo video, Charles University Assistant Professor Ondřej Bojar said the project also looks into the possibility of “going directly from the source speech into the target language with an end-to-end spoken language translation system.” In short: speech-to-speech translation.
True S2ST, as it is sometimes called, bypasses the text-translation step and has become a sort of brass ring in research and big tech, tackled by the likes of Apple, Google (via the so-called “Translatotron”; SlatorPro), and prominent Japanese researchers, who uploaded a toolkit for it on GitHub. Chinese search giant Baidu even drew some flak for claims around it; and, of course, there is a whole graveyard of translation gadgets from companies that tried to commercialize S2ST.
Admittedly, ELITR’s production pipeline currently relies on two independent steps: automatic speech recognition (ASR) and machine translation (MT). According to Bojar, “we are actually quite good in these two steps” (as evidenced by a paper published on June 17, 2021). However, end-to-end speech translation is part of the long-term vision.
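To illustrate what a cascaded (two-step) pipeline means in practice, here is a minimal sketch in Python. The function names and the toy lexicon are illustrative placeholders, not ELITR’s actual components; a real system would plug in trained ASR and NMT models at each step.

```python
def recognize_speech(audio_frames):
    """Placeholder ASR step: a real system would decode audio into text.

    For illustration, we pretend each frame already carries its word.
    """
    return " ".join(audio_frames)

def translate_text(text, target_lang):
    """Placeholder MT step: a real system would call a neural MT model.

    Here, a toy lexicon stands in for the model (hypothetical data).
    """
    lexicon = {("hello", "cs"): "ahoj", ("world", "cs"): "světe"}
    return " ".join(lexicon.get((word, target_lang), word)
                    for word in text.split())

def cascaded_slt(audio_frames, target_lang):
    """Cascade: ASR output feeds MT, with text as the intermediate step."""
    transcript = recognize_speech(audio_frames)      # step 1: ASR
    return translate_text(transcript, target_lang)   # step 2: MT

print(cascaded_slt(["hello", "world"], "cs"))  # -> ahoj světe
```

The key property of the cascade, and the source of the problems discussed below, is that intermediate step: MT sees only the transcript text, never the original audio. An end-to-end system would remove that boundary.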
“We’re also investigating the possibilities of going directly from the source speech into the target language with an end-to-end spoken language translation system” — Ondřej Bojar, Assistant Professor, Charles University
This vision was outlined in a recent paper published on the Association for Computational Linguistics portal. “The goal of a practically usable simultaneous spoken language translation (SLT) system is getting closer,” wrote the authors from Charles University, Karlsruhe Institute of Technology, the University of Edinburgh, and Italy-based ASR provider PerVoice. SLT also encompasses offline spoken language translation systems, the authors said.
The authors (Bojar among them) singled out two problems of the current system that have yet to be solved:
- Intonation cannot be factored in, as punctuation prediction has no access to sound; and
- Loss of topicalization: MT systems tend to “normalize word order,” thus reducing fluency in a stream of spoken sentences.
Hence, “for the future, we consider three approaches,” Bojar et al. added: (1) training MT on sentence chunks, (2) including sound input in punctuation prediction, or (3) end-to-end neural SLT.
Working alongside Charles University on ELITR were the University of Edinburgh and Karlsruhe Institute of Technology. ASR provider PerVoice and Germany-based video conferencing platform alfaview also participated in the project. Does this mean commercialization plans are on the drawing board?
Bojar told Slator, “For a research institute at a university, commercialization is always something that takes an unbearably long time, but we are definitely very open to many forms of collaboration.”