Skip to content Skip to sidebar Skip to footer

On Training Lstms Efficiently But Well, Parallelism Vs Training Regime

For a model that I intend to spontaneously generate sequences I find that training it sample by sample and keeping state in between feels most natural. I've managed to construct th

Solution 1:

I do not know your specific application, however sending only one timestep of data in is surely not a good idea. You should instead, give the LSTM the entire sequence of previously given one-hot vectors (presumably words), and pre-pad (with zeros) if necessary as it appears you are working on sequences of varying length. Consider also using an embedding layer before your LSTM if these are indeed words. Read the documentation carefully.

The utilization of your GPU being low is not a problem. You simply do not have enough data to fully utilize all resources in each batch. Training with batches is a sequential process, there is no real way to parallelize this process, at least a way that is introductory and beneficial to what your goals appear to be. If you do give the LSTM more data per timestep, however, this surely will increase your utilization.

statefull in an LSTM does not do what you think it does. An LSTM always remembers the sequence it is iterating over as it updates it's internal hidden states, h and c. Furthermore, the weight transformations that "build" those internal states are learned during training. What stateful does is preserve the previous hidden state from the last batch index. Meaning, the final hidden state at the third element in the batch is sent as the initial hidden state in the third element of the next batch and so on. I do not believe this is useful for your applications.

There are downsides to training the LSTM with one sample per batch. In general, training with min-batches increases stability. However, you appear to not be training with one sample per batch but instead one timestep per sample.


Edit (from comments)

If you use stateful and send the next 'character' of your sequence in the same index of the previous batch this would be analogous to sending the full sequence timesteps per sample. I would still recommend the initial approach described above in order to improve the speed of the application and to be in more line with other LSTM applications. I see no disadvantages to the approach of sending the full sequence per sample instead of doing it along every batch. However, the advantage of speed, being able to shuffle your input data per batch, and being more readable/consistent would be worth the change IMO.

Post a Comment for "On Training Lstms Efficiently But Well, Parallelism Vs Training Regime"