Understanding LSTMs
Unlocking the Power of Long Short-Term Memory Networks for improved predictive models.
The Long Short-Term Memory (LSTM) architecture has become popular in recent machine learning projects, particularly for time-series forecasting models.
Types of LSTMs
A general background in LSTM variants helps build a broader understanding. The following models are summarized concisely, with their internal mechanisms explained and ordered from basic to advanced.
Vanilla LSTM (Basic LSTM)
The primary LSTM cell states are calculated using activation functions, namely the sigmoid and the hyperbolic tangent, tanh(x), together with learned weights and biases.
The hidden state and the input are first multiplied by their weights and summed with their biases. This pattern is repeated across the cell, and the output gate is multiplied by tanh(new cell state) to get the new hidden state.
The final cell and hidden state outputs can be used in your final prediction or for sequentially chaining LSTM units to achieve better results.
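To make that flow concrete, here is a minimal NumPy sketch of a single vanilla LSTM step; the function name, gate ordering, and shapes are assumptions made for illustration, not a reference implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One vanilla LSTM step: x_t has shape (n_in,), h_prev/c_prev (n_hidden,),
    W (4*n_hidden, n_in), U (4*n_hidden, n_hidden), b (4*n_hidden,)."""
    n = h_prev.shape[0]
    # Input and hidden state are multiplied by their weights and summed with the bias.
    z = W @ x_t + U @ h_prev + b
    f = sigmoid(z[0 * n:1 * n])      # forget gate
    i = sigmoid(z[1 * n:2 * n])      # input gate
    g = np.tanh(z[2 * n:3 * n])      # candidate cell values
    o = sigmoid(z[3 * n:4 * n])      # output gate
    c_new = f * c_prev + i * g       # new cell state (long-term memory)
    h_new = o * np.tanh(c_new)       # new hidden state = output gate * tanh(new cell state)
    return h_new, c_new
```

Chaining lstm_step over the time steps of a window and feeding the final hidden state into a dense layer gives the basic single-layer forecaster described above.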
Coupled Input-Forget Gate LSTM (Simplified LSTM)
Another simplified LSTM removes the input gate's sigmoid activation and instead scales the candidate values (i.e., the output of the first hyperbolic tangent) by one minus the forget gate's activation.
In the cell-state update below, that coupling appears as one minus the forget gate activation value, f_t.
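Written in standard notation, with \tilde{c}_t denoting the candidate values from the tanh, the coupled update is:

c_t = f_t \odot c_{t-1} + (1 - f_t) \odot \tilde{c}_t

so a single forget gate decides both how much of the old cell state to discard and how much of the new candidate to let in.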
Peephole LSTM
A peephole LSTM allows the previous and current cell states to be used in conjunction with the input data and hidden state, so the forget, input, and output gates also receive context from the long-term memory (the cell state).
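In one standard formulation, the gate pre-activations gain peephole terms (the peephole weights p_f, p_i, p_o are named here for illustration):

f_t = \sigma(W_f x_t + U_f h_{t-1} + p_f \odot c_{t-1} + b_f)
i_t = \sigma(W_i x_t + U_i h_{t-1} + p_i \odot c_{t-1} + b_i)
o_t = \sigma(W_o x_t + U_o h_{t-1} + p_o \odot c_t + b_o)

The forget and input gates peek at the previous cell state c_{t-1}, while the output gate peeks at the freshly updated cell state c_t.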
Residual LSTM
Residual learning is added through a residual weight W_s; however, if the input dimension and the output dimension are equal, then W_s = 1 (an identity shortcut).
The projection is generated by multiplying the projection weight W_p with the output-gated, post-activation hyperbolic tangent of the cell state.
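Putting that description together, one way to write the residual output (the exact placement of W_p and the shortcut term varies between formulations, so treat this as a sketch rather than the canonical form):

h_t = W_p (o_t \odot \tanh(c_t)) + W_s x_t

with W_s acting as the identity whenever the input and output dimensions already match.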
Advanced Processing Approaches
Bidirectional LSTM (BiLSTM)
You’ll notice a common theme in building advanced neural networks: processing the data in both the forward and backward directions. Here, two LSTMs run simultaneously, each with its own input ordering and its own hidden and cell states, and their outputs are combined.
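As a minimal Keras sketch of the idea (the 30-step window, single feature, and 64 units are illustrative assumptions, not values from this article):

```python
import tensorflow as tf

# Forward and backward LSTMs read the same window in opposite directions;
# Bidirectional concatenates their outputs before the dense forecast layer.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(30, 1)),                     # 30 time steps, 1 feature (assumed)
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),  # 64 units per direction
    tf.keras.layers.Dense(1),                                 # single-step forecast
])
model.compile(optimizer="adam", loss="mse")
```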
Stacked LSTM
This processing method stacks LSTM layers, where each layer's output sequence of hidden states becomes the input for the next layer. Finally, a dense layer condenses the output to the desired dimension.
This is usually the initial approach when building a time-series forecasting model: the series is windowed into input sequences, the stacked LSTM layers process each window, and the dense layer maps the result to the desired output dimension.
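A minimal Keras sketch of such a stacked setup (the window length, layer sizes, and one-step horizon are assumptions for illustration):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(30, 1)),             # windowed input: 30 time steps, 1 feature (assumed)
    tf.keras.layers.LSTM(64, return_sequences=True),  # pass the full hidden-state sequence to the next layer
    tf.keras.layers.LSTM(32),                         # final LSTM layer returns only the last hidden state
    tf.keras.layers.Dense(1),                         # dense layer condenses to the desired output dimension
])
model.compile(optimizer="adam", loss="mse")
```

return_sequences=True is what makes the stacking work: without it, the first LSTM would emit only its last hidden state and the second LSTM would have no sequence to consume.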
Not quite an LSTM
GRU (Gated Recurrent Unit)
Gaining popularity as an alternative to the LSTM, the GRU has become a common default for recurrent neural networks and is used for similar sequence-memory applications.
Here, r_t represents the output of the first sigmoid function (the reset gate) and z_t represents the output of the second sigmoid function (the update gate).
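Written out, one common formulation is (some references swap the roles of z_t and 1 - z_t in the final blend):

r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)
z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)
\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h)
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t

The reset gate r_t decides how much of the previous hidden state feeds the candidate, and the update gate z_t blends the old hidden state with that candidate; there is no separate cell state.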