Neural machine translation (NMT) requires a large amount of data to deliver the best results. This presents a problem for low-resource languages, for which few corpora are publicly available. The MT community has developed many creative approaches to this problem, including massive crawling, alignment, back-translation, and monolingual training. This short talk will present a process that uses Python libraries to extract text from institutional Twitter accounts and build a monolingual corpus for Galician that can be used to train an NLP engine. Results evaluated by native speakers indicate that this approach is promising.
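As a rough illustration of the corpus-building step, the sketch below cleans and deduplicates tweet text using only the Python standard library. It is not the pipeline presented in the talk: it assumes the tweets have already been retrieved as plain strings (the retrieval step and the libraries used for it are not specified here), and the names `clean_tweet`, `build_corpus`, and `min_tokens` are hypothetical.

```python
import re
import unicodedata

# Patterns for tweet-specific noise (assumed cleaning rules, not the talk's).
RETWEET_RE = re.compile(r"^RT\s+@\w+:?\s*")  # leading "RT @user:" prefix
URL_RE = re.compile(r"https?://\S+")          # links
MENTION_RE = re.compile(r"@\w+")              # @user mentions
SPACE_RE = re.compile(r"\s+")                 # runs of whitespace


def clean_tweet(text):
    """Return the tweet text with retweet prefix, URLs, and mentions removed."""
    text = unicodedata.normalize("NFC", text)  # keep accented letters composed
    text = RETWEET_RE.sub("", text)
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", "")               # keep the hashtag word, drop "#"
    return SPACE_RE.sub(" ", text).strip()


def build_corpus(tweets, min_tokens=3):
    """Clean each tweet, drop very short lines and case-insensitive duplicates."""
    seen = set()
    corpus = []
    for raw in tweets:
        line = clean_tweet(raw)
        key = line.casefold()
        if len(line.split()) < min_tokens or key in seen:
            continue
        seen.add(key)
        corpus.append(line)
    return corpus
```

Calling `build_corpus` on a list of raw tweet strings returns one cleaned sentence per list item, which can then be written to a text file, one line per entry, in the format most NMT toolkits expect for monolingual data. The minimum-length threshold is an arbitrary choice and would need tuning on real data.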