Alpage Seminar: Barbara Plank

Friday, 21 October 2016, 11:00 to 12:00
Organizer:
Djamé Seddah (LLF)
Location:

ODG – Room 357

Barbara Plank (University of Groningen)
Processing non-canonical data: Deep Learning Meets Fortuitous Data

Successful Natural Language Processing (NLP) depends on large amounts of annotated training data that is abundant, completely labeled, and preferably canonical. However, such data is only available to a limited degree. For parsing, for example, annotated treebank data exists only for a limited set of languages and domains. This is the fundamental problem of data sparsity.

In this talk, I review the notion of canonicity and how it shapes our community's approach to language. I argue for leveraging what I call fortuitous data, i.e., non-obvious data that has hitherto been neglected, is hidden in plain sight, or is raw data that needs to be refined. For example, keystroke dynamics have been studied extensively in psycholinguistics and writing research, but do keystroke logs contain actual signal that can be used to learn better NLP models? I will present recent work on using keystroke dynamics to improve shallow syntactic parsing. I will also present recent work on using bi-LSTMs for POS tagging, which combines the POS tagging loss with an auxiliary loss that accounts for rare words and achieves state-of-the-art performance across 22 languages.
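To make the multi-task idea behind [2] concrete, below is a minimal, hypothetical PyTorch sketch of a bi-LSTM tagger whose shared encoder feeds two output heads: the main POS-tag classifier and an auxiliary classifier. One way to "account for rare words", assumed here, is to let the auxiliary head predict a binned word-frequency class; the two cross-entropy losses are simply summed. All layer sizes, the binning scheme, and class names are illustrative assumptions, not the configuration used in the paper.

```python
import torch
import torch.nn as nn


class BiLSTMTagger(nn.Module):
    """Sketch of a bi-LSTM tagger with a main POS head and an auxiliary head."""

    def __init__(self, vocab_size, emb_dim, hidden_dim, n_tags, n_freq_bins):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.tag_head = nn.Linear(2 * hidden_dim, n_tags)        # main task: POS tags
        self.freq_head = nn.Linear(2 * hidden_dim, n_freq_bins)  # auxiliary task: frequency bins

    def forward(self, word_ids):
        # word_ids: (batch, seq_len) -> two sets of per-token logits
        hidden, _ = self.bilstm(self.embed(word_ids))
        return self.tag_head(hidden), self.freq_head(hidden)


def joint_loss(tag_logits, freq_logits, gold_tags, gold_bins):
    """Sum of the POS-tagging loss and the auxiliary (frequency-bin) loss."""
    ce = nn.CrossEntropyLoss()
    main = ce(tag_logits.view(-1, tag_logits.size(-1)), gold_tags.view(-1))
    aux = ce(freq_logits.view(-1, freq_logits.size(-1)), gold_bins.view(-1))
    return main + aux


if __name__ == "__main__":
    model = BiLSTMTagger(vocab_size=5000, emb_dim=64, hidden_dim=100,
                         n_tags=17, n_freq_bins=10)
    words = torch.randint(0, 5000, (2, 8))   # batch of 2 sentences, 8 tokens each
    tags = torch.randint(0, 17, (2, 8))      # gold POS tags
    bins = torch.randint(0, 10, (2, 8))      # gold frequency bins (auxiliary labels)
    tag_logits, freq_logits = model(words)
    joint_loss(tag_logits, freq_logits, tags, bins).backward()
```

The point of the design is that the auxiliary signal is free: frequency bins can be computed from raw counts without any extra annotation, and the shared encoder is pushed to represent rare words better because it must also predict their (low) frequency class.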

References

[1] Barbara Plank. What to do about non-standard (or non-canonical) language in NLP. In Proceedings of KONVENS 2016, Bochum, Germany.

[2] Barbara Plank, Anders Søgaard and Yoav Goldberg. Multilingual Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Models and Auxiliary Loss. In Proceedings of ACL 2016, Berlin, Germany.

[3] Barbara Plank. Keystroke dynamics as signal for shallow syntactic parsing. In Proceedings of the 26th International Conference on Computational Linguistics (COLING 2016), Osaka, Japan.