A light method for data generation: a combination of Markov Chains and Word Embeddings.

Publication date

2020

Publisher

Procesamiento del Lenguaje Natural

Abstract

Most current state-of-the-art Natural Language Processing (NLP) techniques are highly data-dependent. They require a significant amount of training data, and in some scenarios data is scarce. We present a hybrid method that generates new sentences to augment the training data. Our approach combines Markov Chains and word embeddings to produce high-quality data similar to an initial dataset. In contrast to neural generative methods, it does not need a large amount of training data. Results show that our approach can generate useful data for NLP tools. In particular, we validate it by building Transformer-based Language Models with data from three different domains, in the context of enriching general-purpose chatbots.
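
The abstract does not say how the two components are combined, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes a first-order (bigram) Markov chain for sentence generation and a nearest-neighbour word substitution driven by cosine similarity. The function names, the `threshold` and `swap_prob` parameters, the toy corpus, and the two-dimensional vectors standing in for real word embeddings are all hypothetical.

```python
import math
import random
from collections import defaultdict

START, END = "<s>", "</s>"  # assumed sentence boundary markers


def build_chain(sentences):
    """Map each token to the list of tokens observed right after it."""
    chain = defaultdict(list)
    for sentence in sentences:
        tokens = [START] + sentence.split() + [END]
        for current, following in zip(tokens, tokens[1:]):
            chain[current].append(following)
    return chain


def generate(chain, rng, max_len=20):
    """Random walk over the chain from START until END or max_len tokens."""
    token, output = START, []
    while len(output) < max_len:
        token = rng.choice(chain[token])
        if token == END:
            break
        output.append(token)
    return output


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(word, embeddings, threshold=0.9):
    """Most similar other word above the threshold, else the word itself."""
    if word not in embeddings:
        return word
    best, best_sim = word, threshold
    for other, vector in embeddings.items():
        if other != word:
            sim = cosine(embeddings[word], vector)
            if sim > best_sim:
                best, best_sim = other, sim
    return best


def augment(tokens, embeddings, rng, swap_prob=0.5):
    """Replace each token by its nearest neighbour with probability swap_prob."""
    return [nearest(t, embeddings) if rng.random() < swap_prob else t
            for t in tokens]


# Toy data, invented for illustration only.
corpus = [
    "book a hotel in madrid",
    "book a flight to madrid",
    "find a hotel near the station",
]
embeddings = {
    "hotel": [1.0, 0.1],
    "hostel": [0.9, 0.2],
    "flight": [0.1, 1.0],
    "train": [0.2, 0.9],
}

if __name__ == "__main__":
    rng = random.Random(0)
    chain = build_chain(corpus)
    sentence = generate(chain, rng)
    print(" ".join(augment(sentence, embeddings, rng)))
```

In this sketch the Markov chain recombines word sequences seen in the initial dataset, while the embedding step introduces vocabulary that is close in vector space but may be absent from that dataset. A real setup would load pretrained embeddings rather than hand-written vectors.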

Keywords

Generation, Hybrid, Markov Chains, Embeddings, Similarity
