TopGuNN: Fast NLP Training Data Augmentation using Large Corpora



Acquiring training data for natural language processing systems can be expensive and time-consuming. Given a few training examples crafted by experts, large corpora can be mined for thousands of semantically similar examples that provide useful variability to improve model generalization. We present TopGuNN, a fast contextualized k-NN retrieval system that can efficiently index and search over contextual embeddings generated from large corpora. TopGuNN is demonstrated for a training data augmentation use case over the Gigaword corpus. Using approximate k-NN and an efficient architecture, TopGuNN performs queries over an embedding space of 4.63TB (approximately 1.5B embeddings) in less than a day.
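The retrieval step described above can be illustrated with a minimal sketch. It is not the TopGuNN implementation: it uses exact brute-force cosine search in place of the approximate k-NN index the abstract mentions, and the three-dimensional vectors, the labels, and the function names `cosine_similarity` and `top_k_neighbors` are all hypothetical stand-ins for real contextual embeddings and a real index.

```python
import heapq
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_neighbors(query, index, k):
    """Return the k (similarity, label) pairs most similar to the query.

    This is an exact linear scan; an approximate k-NN index would be
    substituted here to make search feasible over billions of embeddings.
    """
    scored = ((cosine_similarity(query, vec), label) for label, vec in index)
    return heapq.nlargest(k, scored)


# Toy "contextual embeddings": made-up vectors, one per word occurrence in context.
toy_index = [
    ("bank (river)", [0.9, 0.1, 0.0]),
    ("bank (finance)", [0.1, 0.9, 0.2]),
    ("shore", [0.8, 0.2, 0.1]),
    ("loan", [0.0, 0.8, 0.3]),
]

# Hypothetical embedding of a query word taken from an expert-written example.
query_vec = [1.0, 0.0, 0.0]
neighbors = top_k_neighbors(query_vec, toy_index, k=2)

for similarity, label in neighbors:
    print(f"{label}: {similarity:.3f}")
```

In an augmentation setting, the sentences containing the retrieved occurrences would then be candidates for new training examples. A linear scan like this costs time proportional to the number of stored embeddings per query, which is why an approximate index matters at the scale the abstract reports.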

Association for Computational Linguistics



My research interests include Cross-Lingual Information Retrieval, Multilingual Information Retrieval, Event Extraction, Narrative Event Schemas, Personality Profiling, and Persuasive Content-Messaging.