·2 min read
LSTM Gives a Network Three Gates to Choose What to Forget; GRU Does It With Two
Vanilla RNNs forget almost everything within a few timesteps because of vanishing gradients, which defeats the entire point of using a recurrent network on long