
A beginner's guide to text processing for deep learning using TensorFlow


Text Processing for Deep Learning


Text preprocessing -- particularly understanding how to create representations from textual datasets -- is one of the first steps in building a deep-learning model capable of learning from large textual datasets for natural-language tasks like classification, prediction, intent identification, and text generation.

Text Vectorization

Text vectorization is the process of encoding textual content -- the strings and words in a dataset -- into sequences of numeric values (integers or floating-point numbers) that can be fed into a neural network. While there are several ways of creating text vectors, let's start with a simple one. One of the easiest ways to vectorize text for a large dataset is to first create a dictionary of all the words in the dataset -- i.e., build a vocabulary -- associate each word with an index, and convert each word in the dataset to its corresponding index in the dictionary.

For example, "This is a simple line" would be vectorized as [1, 2, 3, 4, 5], with the dictionary mapping each word as:
[This -> 1] [is -> 2] [a -> 3] [simple -> 4] [line -> 5]


We'll use a simple tokenizer for this. Let's tokenize a few lines of text by first building a dictionary -- the vocabulary -- from the provided textual dataset.

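The original interactive snippet isn't reproduced here, so below is a minimal sketch of the kind of word-level tokenizer this step assumes. The Tokenizer class, its wordIndex map, and the index-assignment scheme are illustrative choices, but the fitOnTexts and textsToSequences method names mirror the ones used in the full example at the end of this post.

// A minimal word-level tokenizer (illustrative sketch): fitOnTexts builds
// the word -> index dictionary, assigning indices in order of first
// appearance, starting at 1.
class Tokenizer {
    constructor() {
        this.wordIndex = {};
    }

    fitOnTexts(texts) {
        for (const text of texts) {
            for (const word of text.split(/\s+/)) {
                if (!(word in this.wordIndex)) {
                    this.wordIndex[word] = Object.keys(this.wordIndex).length + 1;
                }
            }
        }
    }

    // Look up each word and replace it with its index; unknown words map to 0.
    textsToSequences(words) {
        return words.map((word) => this.wordIndex[word] || 0);
    }
}

const tokenizer = new Tokenizer();
tokenizer.fitOnTexts(['This is a simple line']);
console.log(tokenizer.wordIndex);
// { This: 1, is: 2, a: 3, simple: 4, line: 5 }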

Once all the words in the dataset have been identified, let's vectorize a seed text. We do this by converting each word in the sample text into its corresponding index in the dictionary.
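With the vocabulary built, vectorization is just a dictionary lookup per word. Continuing with the hypothetical tokenizer sketched above:

const seed = ['This', 'is', 'a', 'line'];
const seedSeq = tokenizer.textsToSequences(seed);
console.log(seedSeq); // [1, 2, 3, 5]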

Text Embeddings

The next step in building a textual model is to add an embedding layer. An embedding layer creates dense vector representations of sequences of words, building a deeper understanding of a phrase or sequence of words; it helps the model recognize similarity between words and phrases and group similar words by context.
Let's create a sample text and see how an embedding layer interprets it across a few dimensions.

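Here is a small sketch of that step, assuming the @tensorflow/tfjs package; the vocabulary size (13) and embedding width (10) are arbitrary toy values. Each word index is mapped to a vector of (initially random) weights:

const tf = require('@tensorflow/tfjs');

// One sequence of four word indices, e.g. the vectorized seed from above.
const xs = tf.tensor2d([[1, 2, 3, 5]], [1, 4]);

// Project each index into a 10-dimensional embedding space. inputDim must
// exceed the largest word index in the vocabulary.
const embedded = tf.layers.embedding({
    inputDim: 13,
    outputDim: 10,
    inputLength: 4
}).apply(xs);

embedded.print(); // shape [1, 4, 10]: batch x words x embedding dimensions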


Putting it all together

const tf = require('@tensorflow/tfjs');

// `tokenizer` is the simple word-level tokenizer sketched earlier.
const tokenizer = new Tokenizer();

const textLines = ['This is a simple text file',
                   'with contents split into',
                   'several lines'];

// Build the vocabulary: each unique word gets an index (here, 1-12).
tokenizer.fitOnTexts(textLines);

// For a sample line of text, vectorize it.
const seed = ['This', 'is', 'a', 'file'];

// textsToSequences converts the sample line of text into a vector by looking up
// each word and replacing it with the corresponding index from the dictionary.
const seedSeq = tokenizer.textsToSequences(seed);
console.log(seedSeq); // [1, 2, 3, 6]

// Wrap the sequence in a batch of one: shape [1, 4].
const xs = tf.tensor([seedSeq], [1, 4]);

// inputDim must exceed the largest word index; the 12-word vocabulary uses
// indices 1-12, so 13 covers all of them.
const phraseEmbeddings = tf.layers.embedding({
    name: 'phrasesEmbeddings',
    inputDim: 13,
    outputDim: 10,
    inputLength: 4
}).apply(xs);

phraseEmbeddings.print();
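
The printed tensor has shape [1, 4, 10]: one sequence of four words, each mapped to a 10-dimensional vector. Those vectors are randomly initialized here; during training, the embedding layer learns values that place words used in similar contexts close together.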