
How Keras Imdb Dataset Data Is Preprocessed?

I'm working on a sentiment analysis problem and have a dataset that is very similar to the Keras imdb dataset. When I load Keras’s imdb dataset, it returned sequence of word in

Solution 1:

Each word in the imdb dataset is replaced with an integer that represents its frequency rank in the dataset: the lower the integer, the more frequently the word occurs. When you call the load_data function for the first time, it downloads the dataset.

To see how the values are calculated, let's look at a snippet from the source code (the link is provided at the end):

idx = len(x_train)
x_train, y_train = np.array(xs[:idx]), np.array(labels[:idx])
x_test, y_test = np.array(xs[idx:]), np.array(labels[idx:])

x_train is the numpy array built from the first idx entries of the list xs, where idx is the original length of x_train.

xs is the list formed from all the reviews in x_train and x_test, where each item (movie review) is itself a list of word indices. Each word index is offset by index_from, which specifies the actual index to start from (defaults to 3), and the start character (1 by default) is prepended to every sequence. The value 0 is left unused so that padding can be done with zeros.

The numpy arrays y_train, x_test and y_test are formed in a similar manner, and all four arrays are returned by the load_data function.
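The offset step described above can be sketched in a few lines of plain Python. This is a minimal illustration and not the library's code: the helper name shift_indices and the sample reviews are made up for the example, and only the default values (start character 1, index_from 3) come from the explanation above.

```python
def shift_indices(reviews, start_char=1, index_from=3):
    """Offset every word index by index_from and prepend start_char.

    Hypothetical helper for illustration; `reviews` is a list of
    reviews, each a list of frequency-rank word indices.
    """
    return [[start_char] + [w + index_from for w in review] for review in reviews]


# Two made-up reviews given as raw frequency ranks.
raw_reviews = [[1, 14, 2], [5, 1]]
encoded = shift_indices(raw_reviews)
print(encoded)  # [[1, 4, 17, 5], [1, 8, 4]]
```

With the defaults, the most frequent word (rank 1) ends up as 4 in the returned sequences, which is why the integers you see do not start at the raw ranks.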

The source code is available here.

https://github.com/keras-team/keras/blob/master/keras/datasets/imdb.py

Solution 2:

As explained here

  1. Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For example, a sentence is preprocessed like I am coming home => [1, 3, 11, 15]. Here 1 is the vocabulary index for the word I.

  2. Words are indexed by overall frequency in the dataset. For example, if you are using a CountVectorizer, you need to sort the vocabulary in descending order of frequency. A word's position in that sorted order is then its vocabulary index.
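To apply the same scheme to your own dataset, one possible approach is to count the words, rank them by descending frequency, and encode each text with those ranks. The sketch below uses collections.Counter instead of a CountVectorizer to stay dependency-free. The function names, the whitespace tokenization, and the sample texts are assumptions made for illustration; the defaults (start character 1, out-of-vocabulary value 2, offset 3) mirror the Keras conventions.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to its frequency rank (1 = most frequent)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by descending count; ties keep first-seen order.
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def encode(text, word_index, start_char=1, oov_char=2, index_from=3):
    """Encode a text as [start_char, rank + index_from, ...]; unknown words become oov_char."""
    seq = [start_char]
    for word in text.lower().split():
        rank = word_index.get(word)
        seq.append(oov_char if rank is None else rank + index_from)
    return seq


texts = ["the movie was great", "the movie was bad", "the plot was great"]
word_index = build_word_index(texts)
print(word_index)
# {'the': 1, 'was': 2, 'movie': 3, 'great': 4, 'bad': 5, 'plot': 6}
print(encode("the movie was awful", word_index))
# [1, 4, 6, 5, 2]
```

Real review text would need proper tokenization (punctuation stripping and so on) before counting; splitting on whitespace is only enough for this toy input.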
