Understanding LSTMs
Unlocking the Power of Long Short-Term Memory Networks for improved predictive models.
The Long Short-Term Memory (LSTM) architecture has become popular in recent machine learning projects, particularly for time-series forecasting models.
Types of LSTMs
A general background in LSTM variants helps build a broader understanding. The following models are summarized concisely, with their internal mechanisms explained and ordered from basic to advanced.
Vanilla LSTM (Basic LSTM)
The primary LSTM cell states are calculated using activation functions, namely the sigmoid and the hyperbolic tangent, tanh(x), together with learned weights and biases.
The hidden state and the input are first multiplied by their weights and summed with their biases. This pattern is repeated across the cell, and the output gate is multiplied by tanh(new cell state) to get the new hidden state.
The final cell and hidden state outputs can be used in your final prediction or for sequentially chaining LSTM units to achieve better results.
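To make that flow concrete, here is a minimal NumPy sketch of a single vanilla LSTM step; the function name, gate ordering, and shapes are assumptions made for illustration, not a reference implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One vanilla LSTM step: x_t has shape (n_in,), h_prev/c_prev (n_hidden,),
    W (4*n_hidden, n_in), U (4*n_hidden, n_hidden), b (4*n_hidden,)."""
    n = h_prev.shape[0]
    # Input and hidden state are multiplied by their weights and summed with the bias.
    z = W @ x_t + U @ h_prev + b
    f = sigmoid(z[0 * n:1 * n])      # forget gate
    i = sigmoid(z[1 * n:2 * n])      # input gate
    g = np.tanh(z[2 * n:3 * n])      # candidate cell values
    o = sigmoid(z[3 * n:4 * n])      # output gate
    c_new = f * c_prev + i * g       # new cell state (long-term memory)
    h_new = o * np.tanh(c_new)       # new hidden state = output gate * tanh(new cell state)
    return h_new, c_new
```

Chaining lstm_step over the time steps of a window and feeding the final hidden state into a dense layer gives the basic single-layer forecaster described above.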
Coupled Input-Forget Gate LSTM (Simplified LSTM)
Another simplified LSTM removes the input gate's sigmoid activation and instead scales the candidate values (i.e., the output of the first hyperbolic tangent) by one minus the forget gate's activation.
In the cell-state update below, that coupling appears as one minus the forget gate activation value, f_t.
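Written in standard notation, with \tilde{c}_t denoting the candidate values from the tanh, the coupled update is:

c_t = f_t \odot c_{t-1} + (1 - f_t) \odot \tilde{c}_t

so a single forget gate decides both how much of the old cell state to discard and how much of the new candidate to let in.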
Peephole LSTM
A peephole LSTM allows the previous and current cell states to be used in conjunction with the input data and hidden state, so the forget, input, and output gates also receive context from the long-term memory (the cell state).
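In one standard formulation, the gate pre-activations gain peephole terms (the peephole weights p_f, p_i, p_o are named here for illustration):

f_t = \sigma(W_f x_t + U_f h_{t-1} + p_f \odot c_{t-1} + b_f)
i_t = \sigma(W_i x_t + U_i h_{t-1} + p_i \odot c_{t-1} + b_i)
o_t = \sigma(W_o x_t + U_o h_{t-1} + p_o \odot c_t + b_o)

The forget and input gates peek at the previous cell state c_{t-1}, while the output gate peeks at the freshly updated cell state c_t.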
Residual LSTM
Residual learning is added through a residual weight W_s; however, if the input dimension and the output dimension are equal, then W_s = 1 (an identity shortcut).
The projection is generated by multiplying the projection weight W_p with the output-gated, post-activation hyperbolic tangent of the cell state.
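Putting that description together, one way to write the residual output (the exact placement of W_p and the shortcut term varies between formulations, so treat this as a sketch rather than the canonical form):

h_t = W_p (o_t \odot \tanh(c_t)) + W_s x_t

with W_s acting as the identity whenever the input and output dimensions already match.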
Advanced Processing Approaches
Bidirectional LSTM (BiLSTM)
You’ll notice a common theme in building advanced neural networks: processing the data in both the forward and backward directions. Here, two LSTMs run simultaneously, each with its own input ordering and its own hidden and cell states, and their outputs are combined.
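As a minimal Keras sketch of the idea (the 30-step window, single feature, and 64 units are illustrative assumptions, not values from this article):

```python
import tensorflow as tf

# Forward and backward LSTMs read the same window in opposite directions;
# Bidirectional concatenates their outputs before the dense forecast layer.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(30, 1)),                     # 30 time steps, 1 feature (assumed)
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),  # 64 units per direction
    tf.keras.layers.Dense(1),                                 # single-step forecast
])
model.compile(optimizer="adam", loss="mse")
```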
Stacked LSTM
This processing method stacks LSTM layers, where each layer's output sequence of hidden states becomes the input for the next layer. Finally, a dense layer condenses the output to the desired dimension.
This is usually the initial approach when building a time-series forecasting model: the series is windowed into input sequences, the stacked LSTM layers process each window, and the dense layer maps the result to the desired output dimension.
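A minimal Keras sketch of such a stacked setup (the window length, layer sizes, and one-step horizon are assumptions for illustration):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(30, 1)),             # windowed input: 30 time steps, 1 feature (assumed)
    tf.keras.layers.LSTM(64, return_sequences=True),  # pass the full hidden-state sequence to the next layer
    tf.keras.layers.LSTM(32),                         # final LSTM layer returns only the last hidden state
    tf.keras.layers.Dense(1),                         # dense layer condenses to the desired output dimension
])
model.compile(optimizer="adam", loss="mse")
```

return_sequences=True is what makes the stacking work: without it, the first LSTM would emit only its last hidden state and the second LSTM would have no sequence to consume.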
Not quite an LSTM
GRU (Gated Recurrent Unit)
Gaining popularity as an alternative to the LSTM, the GRU has become a common default for recurrent neural networks and is used for similar sequence-memory applications.
Here, r_t represents the output of the first sigmoid function (the reset gate) and z_t represents the output of the second sigmoid function (the update gate).
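Written out, one common formulation is (some references swap the roles of z_t and 1 - z_t in the final blend):

r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)
z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)
\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h)
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t

The reset gate r_t decides how much of the previous hidden state feeds the candidate, and the update gate z_t blends the old hidden state with that candidate; there is no separate cell state.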