LSTM Gives a Network Three Gates to Choose What to Forget; GRU Does It With Two
Vanilla RNNs struggle to retain information for more than a handful of timesteps: gradients shrink as they are backpropagated through time (the vanishing gradient problem), which defeats the point of using a recurrent network on long sequences. LSTM (Long Short-Term Memory) is the fix: an improved RNN cell explicitly designed to capture long-term dependencies by giving the network a separate memory pathway, the cell state, through which gradients can flow without being squashed at every step.
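The mechanism is three gates, each a small neural layer with a sigmoid output between 0 and 1, acting as a dial rather than a switch. The forget gate looks at the previous hidden state and current input and decides what to remove from the memory cell: a value near 0 means erase, near 1 means keep. In the standard formulation, $f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$, where $\sigma$ is the sigmoid and $[h_{t-1}, x_t]$ is the previous hidden state concatenated with the current input. The input gate decides what new information gets added: $i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$, paired with a candidate value $\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)$ that proposes what that new information actually is. The memory cell is then updated as $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$, with $\odot$ denoting elementwise multiplication. The output gate decides what part of the (now updated) memory gets exposed as the hidden state: $o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$, then $h_t = o_t \odot \tanh(C_t)$. Three gates, three separate decisions: what to erase, what to add, what to reveal.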
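The three decisions described above can be sketched as a single LSTM timestep in NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the function names, the `params` dictionary layout, and the random initialisation are all hypothetical, and it processes one example at a time with no batching.

```python
import numpy as np


def sigmoid(x):
    """Logistic function; squashes values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM timestep for a single example (no batching).

    `params` is a hypothetical dict holding W_f, W_i, W_c, W_o of shape
    (hidden, hidden + inputs) and b_f, b_i, b_c, b_o of shape (hidden,).
    """
    concat = np.concatenate([h_prev, x_t])  # previous hidden state + current input

    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])      # forget gate: what to erase
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])      # input gate: what to add
    c_tilde = np.tanh(params["W_c"] @ concat + params["b_c"])  # candidate new content
    c_t = f_t * c_prev + i_t * c_tilde                         # updated memory cell
    o_t = sigmoid(params["W_o"] @ concat + params["b_o"])      # output gate: what to reveal
    h_t = o_t * np.tanh(c_t)                                   # exposed hidden state
    return h_t, c_t


def init_lstm_params(inputs, hidden, seed=0):
    """Small random weights and zero biases, for illustration only."""
    rng = np.random.default_rng(seed)
    params = {}
    for name in ("f", "i", "c", "o"):
        params[f"W_{name}"] = rng.normal(scale=0.1, size=(hidden, hidden + inputs))
        params[f"b_{name}"] = np.zeros(hidden)
    return params


params = init_lstm_params(inputs=3, hidden=4)
h, c = np.zeros(4), np.zeros(4)
for x in np.ones((5, 3)):  # a toy sequence of 5 timesteps
    h, c = lstm_step(x, h, c, params)
print(h.shape, c.shape)  # (4,) (4,)
```

Note that `c_t` is only ever touched by elementwise scaling and addition, while `h_t` is a gated, squashed view of it. That separation is what the prose means by a dedicated memory pathway.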
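GRU (Gated Recurrent Unit) asks whether all three are necessary and answers no. It's a simplified LSTM with only two gates instead of three, and no separate cell state at all, just a single hidden state that does the job of both. The update gate decides how much of the previous hidden state to carry forward into the next timestep: $z_t = \sigma(W_z [h_{t-1}, x_t] + b_z)$. The reset gate controls how much of the past to ignore when computing a new candidate: $r_t = \sigma(W_r [h_{t-1}, x_t] + b_r)$. That candidate is $\tilde{h}_t = \tanh(W_h [r_t \odot h_{t-1}, x_t] + b_h)$, and the final hidden state blends old and new directly: $h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t$. Here $\sigma$ is the sigmoid, $[\cdot, \cdot]$ is concatenation, and $\odot$ is elementwise multiplication; some references swap the roles of $z_t$ and $1 - z_t$ in the final blend.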
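The GRU step described above can be sketched the same way. This is a minimal NumPy illustration with hypothetical names (`gru_step`, `init_gru_params`, the `params` layout), and it assumes the convention in which the update gate multiplies the previous hidden state, matching the description of the gate as deciding how much to carry forward; some references use the opposite convention.

```python
import numpy as np


def sigmoid(x):
    """Logistic function; squashes values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def gru_step(x_t, h_prev, params):
    """One GRU timestep for a single example (no batching).

    `params` is a hypothetical dict holding W_z, W_r, W_h of shape
    (hidden, hidden + inputs) and b_z, b_r, b_h of shape (hidden,).
    """
    concat = np.concatenate([h_prev, x_t])

    z_t = sigmoid(params["W_z"] @ concat + params["b_z"])  # update gate: how much past to keep
    r_t = sigmoid(params["W_r"] @ concat + params["b_r"])  # reset gate: how much past to ignore

    # The candidate sees the previous state only through the reset gate.
    gated = np.concatenate([r_t * h_prev, x_t])
    h_tilde = np.tanh(params["W_h"] @ gated + params["b_h"])

    # Blend old state and candidate; there is no separate cell state.
    return z_t * h_prev + (1.0 - z_t) * h_tilde


def init_gru_params(inputs, hidden, seed=0):
    """Small random weights and zero biases, for illustration only."""
    rng = np.random.default_rng(seed)
    params = {}
    for name in ("z", "r", "h"):
        params[f"W_{name}"] = rng.normal(scale=0.1, size=(hidden, hidden + inputs))
        params[f"b_{name}"] = np.zeros(hidden)
    return params


params = init_gru_params(inputs=3, hidden=4)
h = np.zeros(4)
for x in np.ones((5, 3)):  # a toy sequence of 5 timesteps
    h = gru_step(x, h, params)
print(h.shape)  # (4,)
```

With the update gate saturated near 1 the step returns the previous hidden state almost unchanged, and with it near 0 the state is replaced by the candidate, so one gate covers the roles that LSTM splits between forgetting and adding.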
| Aspect | LSTM | GRU |
|---|---|---|
| Gates | 3 (forget, input, output) | 2 (update, reset) |
| State | Separate cell state + hidden state | Single hidden state |
| Parameters | More | Fewer |
| Typical use | Longer, more complex sequences | Faster training, comparable results on many tasks |
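The "Parameters" row can be made concrete with a rough count. The sketch below assumes the textbook formulation in which every gate or candidate layer has one weight matrix over the concatenated hidden state and input plus one bias vector; library implementations may organise weights and biases differently and so report slightly different totals. The function names and the example sizes are hypothetical.

```python
def block_params(inputs, hidden):
    """Parameters in one gate or candidate layer: weight matrix plus bias."""
    return hidden * (hidden + inputs) + hidden


def lstm_param_count(inputs, hidden):
    """LSTM: forget, input, and output gates plus the candidate layer."""
    return 4 * block_params(inputs, hidden)


def gru_param_count(inputs, hidden):
    """GRU: update and reset gates plus the candidate layer."""
    return 3 * block_params(inputs, hidden)


inputs, hidden = 128, 256  # example sizes, chosen arbitrarily
lstm_n = lstm_param_count(inputs, hidden)
gru_n = gru_param_count(inputs, hidden)
print(lstm_n, gru_n, gru_n / lstm_n)  # 394240 295680 0.75
```

Under these assumptions a GRU layer has three quarters of the parameters of an LSTM layer of the same size, whatever the input and hidden dimensions.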
Neither gate count is arbitrary: LSTM's third gate buys it a dedicated, slow-changing memory lane that's genuinely separate from what gets output at each step. GRU folds those two roles together and loses some of that separation, but in exchange trains faster with fewer parameters, and in practice often gets within a hair of LSTM's accuracy anyway.