How Is the Keras IMDB Dataset Preprocessed?
Solution 1:
The words in the IMDB dataset are replaced with integers that represent their frequency rank in the dataset: the more often a word occurs, the smaller its integer. The first time you call the load_data function, it downloads the dataset.
To see how the values are computed, let's look at a snippet from the source code (link provided at the end):
idx = len(x_train)
x_train, y_train = np.array(xs[:idx]), np.array(labels[:idx])
x_test, y_test = np.array(xs[idx:]), np.array(labels[idx:])
x_train is a NumPy array built from the first idx entries of the list xs, where idx is the number of training reviews.
xs is the list of all reviews from x_train and x_test combined, with each review (movie review) stored as a list of word indices. Each word index is shifted by index_from (3 by default), and start_char (1 by default) is prepended to every review to mark where it begins. Values therefore start from 1, which leaves 0 free for padding.
The NumPy arrays y_train, x_test and y_test are formed in the same way, and all four are returned by the load_data function.
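The offset and split steps described above can be sketched on toy data. This is a minimal illustration, not the library's code: the review values and the helper name offset_reviews are made up, plain lists stand in for NumPy arrays, and the shuffling and vocabulary filtering that load_data also performs are left out.

```python
# Toy stand-ins for the raw reviews (hypothetical values).
# Each review is already a list of frequency ranks (1 = most frequent word).
raw_train = [[1, 2, 3], [2, 5]]
raw_test = [[4, 1]]
labels_train = [1, 0]
labels_test = [1]


def offset_reviews(reviews, start_char=1, index_from=3):
    """Prepend start_char to each review and shift every word index by index_from."""
    return [[start_char] + [w + index_from for w in review] for review in reviews]


# Combine train and test, apply the offsets, then split back at the original boundary.
xs = offset_reviews(raw_train + raw_test)
labels = labels_train + labels_test

idx = len(raw_train)
x_train, y_train = xs[:idx], labels[:idx]
x_test, y_test = xs[idx:], labels[idx:]

print(x_train)  # [[1, 4, 5, 6], [1, 5, 8]]
print(x_test)   # [[1, 7, 4]]
```

With the default values, the most frequent word (rank 1) ends up encoded as 4, and every review begins with 1.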
The source code is available here.
https://github.com/keras-team/keras/blob/master/keras/datasets/imdb.py
Solution 2:
As explained in the Keras documentation:
Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For example, a sentence is preprocessed like this:
I am coming home => [1, 3, 11, 15]
Here, 1 is the vocabulary index for the word I.
Words are indexed by overall frequency in the dataset. That is, if you are using a CountVectorizer, you need to sort the vocabulary in descending order of frequency; the position of each word in that sorted order then corresponds to its vocabulary index.
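A frequency-ranked vocabulary of this kind can be sketched as follows. The sketch uses collections.Counter from the standard library instead of CountVectorizer, and the corpus, the alphabetical tie-breaking rule and the encode helper are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy corpus standing in for the movie reviews.
corpus = [
    "i am coming home",
    "i am home",
    "i am here",
    "i came",
]

# Count how often each word occurs across all reviews.
counts = Counter(word for review in corpus for word in review.split())

# Sort by descending frequency; ties are broken alphabetically (an assumption)
# so that the ranking is deterministic.
ranked = sorted(counts, key=lambda w: (-counts[w], w))

# Rank 1 goes to the most frequent word, so 0 is never assigned to a word.
word_index = {word: rank for rank, word in enumerate(ranked, start=1)}


def encode(sentence):
    """Replace each word of a sentence with its frequency rank."""
    return [word_index[w] for w in sentence.split()]


print(word_index["i"])             # 1
print(encode("i am coming home"))  # [1, 2, 5, 3]
```

The integers here are raw ranks; load_data shifts them by index_from by default, as described in Solution 1.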