A light method for data generation: a combination of Markov Chains and Word Embeddings.

Martínez García, Eva; Nogales Moyano, Alberto; Morales Escudero, Javier; García Tejedor, Álvaro José

doi:10.26342/2020-64-10

Abstract: Most current state-of-the-art Natural Language Processing (NLP) techniques are highly data-dependent: they require a significant amount of training data, which is scarce in some scenarios. We present a hybrid method that generates new sentences to augment the training data. Our approach combines Markov Chains and word embeddings to produce high-quality data similar to an initial dataset. Unlike neural generative methods, it does not need a large amount of training data. Results show that our approach can generate useful data for NLP tools. In particular, we validate it by building Transformer-based Language Models with data from three different domains, in the context of enriching general-purpose chatbots.
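The abstract does not say how the Markov Chain and the word embeddings are combined, so the following is only a minimal sketch of one plausible reading: a bigram Markov chain proposes new sentences from a seed corpus, and some words are then swapped for their nearest neighbour in embedding space. The function names, the toy corpus, the two-dimensional vectors, and the `swap_prob` parameter are all illustrative assumptions, not taken from the paper.

```python
import math
import random
from collections import defaultdict

START, END = "<s>", "</s>"  # hypothetical sentence-boundary markers


def build_chain(sentences):
    """Map each word to the list of words observed right after it."""
    chain = defaultdict(list)
    for sentence in sentences:
        tokens = [START] + sentence.split() + [END]
        for current, following in zip(tokens, tokens[1:]):
            chain[current].append(following)
    return chain


def generate(chain, rng, max_len=20):
    """Random-walk the chain from START until END or max_len words."""
    words, current = [], START
    while len(words) < max_len:
        current = rng.choice(chain[current])
        if current == END:
            break
        words.append(current)
    return words


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(word, embeddings):
    """Closest other word by cosine similarity; unknown words are returned as-is."""
    if word not in embeddings:
        return word
    others = [w for w in embeddings if w != word]
    if not others:
        return word
    return max(others, key=lambda w: cosine(embeddings[word], embeddings[w]))


def augment(words, embeddings, rng, swap_prob=0.3):
    """Replace each word by its nearest neighbour with probability swap_prob."""
    return [nearest(w, embeddings) if rng.random() < swap_prob else w
            for w in words]


# Toy seed data (assumed for illustration only).
corpus = [
    "the cat sleeps on the sofa",
    "the dog sleeps on the rug",
    "a cat plays on the rug",
]
embeddings = {
    "cat": [0.9, 0.1], "dog": [0.8, 0.2],
    "sofa": [0.1, 0.9], "rug": [0.2, 0.8],
}

rng = random.Random(0)
chain = build_chain(corpus)
sentence = generate(chain, rng)
print(" ".join(augment(sentence, embeddings, rng)))
```

In a realistic setting the toy vectors would be replaced by pretrained word embeddings and the chain would be estimated from the initial domain dataset; the sketch assumes a non-empty corpus.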

Universal identifier: http://hdl.handle.net/10641/2327

Date: 2020

Files in this item

File	Size	Format
6199-5608-1-PB.pdf	1.746 MB	PDF

This item appears in the following collection(s)

INGENIERÍA [137]

Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivs 3.0 Spain

Depósito Digital UFV
