A light method for data generation: a combination of Markov Chains and Word Embeddings.

Martínez García, Eva; Nogales Moyano, Alberto; Morales Escudero, Javier; García Tejedor, Álvaro José

Files

6199-5608-1-PB.pdf (1.75 MB)

Identifiers

URI: http://hdl.handle.net/10641/2327

ISSN: 1135-5948

DOI: 10.26342/2020-64-10

Publication date

2020

Authors

Martínez García, Eva

Nogales Moyano, Alberto

Morales Escudero, Javier

García Tejedor, Álvaro José

Publisher

Procesamiento del Lenguaje Natural

Abstract

Most current state-of-the-art Natural Language Processing (NLP) techniques are highly data-dependent: they require a significant amount of training data, and in some scenarios data is scarce. We present a hybrid method that generates new sentences to augment the training data. Our approach combines Markov Chains and word embeddings to produce high-quality data similar to an initial dataset. In contrast to neural generative methods, it does not need a large amount of training data. Results show that our approach can generate useful data for NLP tools. In particular, we validate it by building Transformer-based Language Models with data from three different domains, in the context of enriching general-purpose chatbots.
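
The abstract does not say how the Markov Chain and the word embeddings are combined, so the following is only a minimal sketch of one plausible reading, not the authors' method: a first-order Markov chain learned from the seed sentences proposes new word sequences, and some words are then swapped for their nearest neighbour in embedding space. The corpus, the three-dimensional vectors, the function names, the similarity threshold and the swap probability are all hypothetical and chosen for illustration.

```python
import math
import random
from collections import defaultdict

# Hypothetical seed corpus and toy embeddings (illustration only).
CORPUS = [
    "i want to book a flight",
    "i want to rent a car",
    "i need to book a hotel",
]
TOY_EMBEDDINGS = {
    "flight": [1.0, 0.0, 0.1],
    "hotel": [0.9, 0.1, 0.1],
    "car": [0.1, 1.0, 0.0],
    "want": [0.0, 0.1, 1.0],
    "need": [0.0, 0.2, 0.9],
}


def build_chain(sentences):
    """First-order Markov chain: map each token to the tokens seen right after it."""
    chain = defaultdict(list)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for current, following in zip(tokens, tokens[1:]):
            chain[current].append(following)
    return chain


def generate(chain, rng, max_len=20):
    """Walk the chain from the start marker until the end marker or max_len tokens."""
    word, out = "<s>", []
    for _ in range(max_len):
        word = rng.choice(chain[word])
        if word == "</s>":
            break
        out.append(word)
    return out


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(word, embeddings, threshold=0.9):
    """Return the most similar other word above the threshold, else the word itself."""
    if word not in embeddings:
        return word
    best, best_sim = word, threshold
    for other, vec in embeddings.items():
        if other == word:
            continue
        sim = cosine(embeddings[word], vec)
        if sim > best_sim:
            best, best_sim = other, sim
    return best


def augment(tokens, embeddings, rng, swap_prob=0.5):
    """Swap each token for its embedding neighbour with probability swap_prob."""
    return [nearest(t, embeddings) if rng.random() < swap_prob else t for t in tokens]


if __name__ == "__main__":
    rng = random.Random(0)
    chain = build_chain(CORPUS)
    for _ in range(3):
        print(" ".join(augment(generate(chain, rng), TOY_EMBEDDINGS, rng)))
```

In a realistic setting the toy vectors would be replaced by pretrained embeddings, and the threshold would control how far generated sentences may drift from the initial dataset.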

Keywords

Generation, Hybrid, Markov Chains, Embeddings, Similarity

Collections

INGENIERÍA
