The ParaCrawl project is mining a petabyte of the web for translations, which it releases freely at https://paracrawl.eu/releases.html. But the web is a messy place, and there is a lot of data to sift through. To find translations, we translate everything into English or at least encode it with a neural encoder. A related project makes machine translation inference more efficient with optimizations ranging from assembly instructions to removing parts of the model architecture.
Kenneth Heafield is a lecturer who leads a machine translation group at the University of Edinburgh. He works on efficient neural networks, low-resource translation, mining petabytes for translations, and, occasionally, grammatical error correction. The ParaCrawl project (https://paracrawl.eu/) offers large, free parallel corpora pairing 24 languages with English.