How to Overcome the Need for Data for Low-Resource Languages

Neural machine translation (NMT) requires large amounts of data to deliver the best results. This presents a problem for low-resource languages, for which few corpora are publicly available. The MT community has developed many creative approaches to this problem, including massive crawling, alignment, back-translation, and monolingual training. This short talk presents a process that uses Python libraries to extract text from institutional Twitter accounts and build a monolingual corpus for Galician that can be used to train an NLP engine. Results evaluated by native speakers indicate that this approach has promise.
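
The abstract does not name the Python libraries or the cleaning steps used, so the following is only a minimal sketch of what the corpus-building stage might look like. It assumes the tweets have already been downloaded as plain strings (the download step is omitted), and the names clean_tweet, build_corpus, and min_tokens are hypothetical, chosen for illustration. It uses only the Python standard library.

```python
import html
import re

# Patterns for Twitter-specific noise (illustrative, not exhaustive).
RT_HEADER_RE = re.compile(r"^RT\s+@\w+:\s*")  # "RT @account: " prefix
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
WS_RE = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Strip Twitter-specific noise from one tweet, keeping the prose."""
    text = html.unescape(text)         # decode entities such as &amp;
    text = RT_HEADER_RE.sub("", text)  # drop a leading retweet header
    text = URL_RE.sub(" ", text)       # drop links
    text = MENTION_RE.sub(" ", text)   # drop remaining @mentions
    text = text.replace("#", "")       # keep hashtag words, drop the symbol
    return WS_RE.sub(" ", text).strip()


def build_corpus(tweets, min_tokens=3):
    """Return cleaned, de-duplicated lines in first-seen order.

    min_tokens is an assumed threshold for discarding very short tweets.
    """
    seen = set()
    corpus = []
    for raw in tweets:
        line = clean_tweet(raw)
        key = line.casefold()  # case-insensitive duplicate check
        if len(line.split()) < min_tokens or key in seen:
            continue
        seen.add(key)
        corpus.append(line)
    return corpus


if __name__ == "__main__":
    sample = [
        "RT @xunta: A Xunta presenta o novo plan de #emprego https://t.co/abc",
        "A Xunta presenta o novo plan de emprego",
        "Cultura &amp; lingua galega para todos",
    ]
    for line in build_corpus(sample):
        print(line)
```

Writing one cleaned sentence per line keeps the output in the plain-text form that most training toolkits accept for monolingual data. A real pipeline would likely also need language identification, since institutional accounts in Galicia may post in both Galician and Spanish.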

Conference Event Type

Session

Conference Speakers