Chapter 10 — RNN, LSTM, GRU: Sequence Modeling
Unit III · Neural Networks in Practice
1. Recurrent Neural Networks
An RNN processes sequences \((x_1, x_2, \ldots, x_T)\) by maintaining a hidden state \(h_t\) that carries context from previous timesteps:
\[h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h), \quad \hat{y}_t = W_{hy} h_t + b_y\]
The same weight matrices \(W_{hh}, W_{xh}\) are shared across all timesteps — RNNs have the same number of parameters regardless of sequence length.
layers.SimpleRNN(64, return_sequences=True, input_shape=(T, features))
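The weight sharing can be made concrete with a minimal NumPy forward pass (toy sizes chosen here for illustration; not the Keras internals):

```python
import numpy as np

rng = np.random.default_rng(0)
features, hidden = 3, 4          # illustrative toy sizes

# Shared parameters -- the same matrices are reused at every timestep
W_hh = rng.normal(size=(hidden, hidden))
W_xh = rng.normal(size=(hidden, features))
b_h = np.zeros(hidden)

def rnn_forward(xs):
    """Run h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h) over a sequence."""
    h = np.zeros(hidden)
    for x in xs:                 # one step per timestep, same weights each time
        h = np.tanh(W_hh @ h + W_xh @ x + b_h)
    return h

short = rng.normal(size=(5, features))    # T = 5
long = rng.normal(size=(500, features))   # T = 500
h5, h500 = rnn_forward(short), rnn_forward(long)

# Parameter count depends only on the layer sizes, never on T
n_params = W_hh.size + W_xh.size + b_h.size  # -> 4*4 + 4*3 + 4 = 32
```

Both sequences reuse the same 32 parameters; only the number of loop iterations changes with T.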
2. Deep RNNs
Stack multiple RNN layers where the output sequence of one layer becomes the input to the next. Increases representational power for complex sequence patterns.
model = keras.Sequential([
    layers.LSTM(128, return_sequences=True, input_shape=(T, features)),
    layers.LSTM(64, return_sequences=True),
    layers.LSTM(32),
    layers.Dense(1)
])
Bidirectional RNNs: process the sequence forward and backward, concatenating the hidden states — useful when the full sequence is available (e.g., NLP classification, not real-time forecasting).
layers.Bidirectional(layers.LSTM(64))
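What the `Bidirectional` wrapper does can be sketched in NumPy with plain RNN cells (toy sizes, illustrative only): two independent parameter sets process the sequence in opposite directions, and their final states are concatenated.

```python
import numpy as np

rng = np.random.default_rng(1)
features, hidden = 3, 4

def rnn_last_state(xs, W_hh, W_xh, b_h):
    """Final hidden state of a simple tanh RNN over a sequence."""
    h = np.zeros(hidden)
    for x in xs:
        h = np.tanh(W_hh @ h + W_xh @ x + b_h)
    return h

# Two independent parameter sets: one per direction
shapes = [(hidden, hidden), (hidden, features), (hidden,)]
fwd = [rng.normal(size=s) for s in shapes]
bwd = [rng.normal(size=s) for s in shapes]

xs = rng.normal(size=(10, features))
h_fwd = rnn_last_state(xs, *fwd)        # left-to-right pass
h_bwd = rnn_last_state(xs[::-1], *bwd)  # right-to-left pass
h_bi = np.concatenate([h_fwd, h_bwd])   # 2 * hidden features
```

This is why a `Bidirectional(LSTM(64))` layer outputs 128 features, and why the backward pass requires the whole sequence up front.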
3. LSTM (Long Short-Term Memory)
LSTMs (Hochreiter & Schmidhuber, 1997) solve the vanishing gradient problem by introducing a cell state \(C_t\) — a memory highway protected by three learnable gates:
- Forget gate: \(f_t = \sigma(W_f[h_{t-1}, x_t] + b_f)\) — what to erase from cell state.
- Input gate: \(i_t = \sigma(W_i[h_{t-1}, x_t] + b_i)\), \(\tilde{C}_t = \tanh(W_C[h_{t-1}, x_t] + b_C)\) — what new info to add.
- Output gate: \(o_t = \sigma(W_o[h_{t-1}, x_t] + b_o)\); \(h_t = o_t \odot \tanh(C_t)\) — what to output.
\[C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t\]
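A single LSTM step can be written directly from these equations (a minimal NumPy sketch with toy sizes; real implementations fuse the four matrix products):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, C_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o):
    """One LSTM step; every W_* acts on the concatenation [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x])
    f = sigmoid(W_f @ z + b_f)        # forget gate: what to erase
    i = sigmoid(W_i @ z + b_i)        # input gate: what to write
    C_tilde = np.tanh(W_C @ z + b_C)  # candidate values
    C = f * C_prev + i * C_tilde      # cell-state "highway" update
    o = sigmoid(W_o @ z + b_o)        # output gate
    h = o * np.tanh(C)
    return h, C

# Saturating the gates exposes the memory highway: with f ~= 1 and i ~= 0
# the cell state passes through unchanged, no matter the input.
hidden, features = 4, 3
W0 = np.zeros((hidden, hidden + features))
C_prev = np.array([1.0, 2.0, 3.0, 4.0])
h, C = lstm_step(np.ones(features), np.zeros(hidden), C_prev,
                 W0, W0, W0, W0,
                 np.full(hidden, 100.0),    # forget bias -> f ~= 1
                 np.full(hidden, -100.0),   # input bias  -> i ~= 0
                 np.zeros(hidden), np.zeros(hidden))
# C equals C_prev here: nothing forgotten, nothing added
```

Because the cell-state update is additive (a sum, not a repeated matrix product), gradients flowing along \(C_t\) are not forced through a squashing nonlinearity at every step.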
4. GRU (Gated Recurrent Unit)
GRU (Cho et al., 2014) simplifies LSTM to two gates, merging cell state and hidden state:
- Reset gate: \(r_t = \sigma(W_r[h_{t-1}, x_t])\) — how much of previous hidden state to use.
- Update gate: \(z_t = \sigma(W_z[h_{t-1}, x_t])\) — interpolation between old and new state.
With candidate state \(\tilde{h}_t = \tanh(W_h[r_t \odot h_{t-1}, x_t])\), the new hidden state interpolates between old and new:
\[h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t\]
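The GRU step is correspondingly shorter in code (a minimal NumPy sketch with toy sizes, using the standard candidate \(\tanh(W_h[r_t \odot h_{t-1}, x_t])\)):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, W_r, W_z, W_h, b_r, b_z, b_h):
    """One GRU step; the weights act on the concatenation [h_{t-1}, x_t]."""
    hx = np.concatenate([h_prev, x])
    r = sigmoid(W_r @ hx + b_r)   # reset gate
    z = sigmoid(W_z @ hx + b_z)   # update gate
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x]) + b_h)  # candidate
    return (1.0 - z) * h_prev + z * h_tilde  # interpolate old vs new

# z ~= 0 copies the old state forward -- the GRU analogue of the LSTM's
# saturated forget gate, with no separate cell state needed.
hidden, features = 4, 3
W0 = np.zeros((hidden, hidden + features))
h_prev = np.array([0.5, -0.5, 0.25, 0.0])
h = gru_step(np.ones(features), h_prev, W0, W0, W0,
             np.zeros(hidden), np.full(hidden, -100.0), np.zeros(hidden))
```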
| Aspect | LSTM | GRU |
|---|---|---|
| Gates | 3 (forget, input, output) | 2 (reset, update) |
| States | h + C | h only |
| Parameters | More | ~25% fewer |
| Performance | Slightly better (long seq) | Comparable, faster |
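The "~25% fewer parameters" row follows directly from the gate counts: each gate (or candidate) contributes one weight matrix over \([h_{t-1}, x_t]\) plus a bias, so the GRU/LSTM ratio is 3/4. A quick arithmetic check for the classic formulations (no layer norm, one bias per gate; note Keras' GRU default `reset_after=True` adds a second bias vector, so its count is slightly higher):

```python
def lstm_params(n_in, n_hidden):
    # 4 gates, each an (n_hidden x (n_hidden + n_in)) weight matrix + bias
    return 4 * (n_hidden * (n_hidden + n_in) + n_hidden)

def gru_params(n_in, n_hidden):
    # 3 gates/candidates of the same shape
    return 3 * (n_hidden * (n_hidden + n_in) + n_hidden)

saving = 1 - gru_params(128, 128) / lstm_params(128, 128)  # -> 0.25
```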
Worked Example — LSTM for stock price prediction
import numpy as np, tensorflow as tf
# Create sliding windows: past 60 days → predict next day
def make_dataset(series, window=60):
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i:i+window])
        y.append(series[i+window])
    return np.array(X)[..., None], np.array(y)  # add a trailing feature axis

# prices_scaled: a 1-D array of closing prices, already scaled (e.g. to [0, 1])
X, y = make_dataset(prices_scaled)
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(50, return_sequences=True, input_shape=(60, 1)),
    tf.keras.layers.LSTM(50),
    tf.keras.layers.Dense(1)
])
model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=20, batch_size=32, validation_split=0.1)
Viva Questions
- Why do standard RNNs fail on long sequences?
- Describe the three gates of an LSTM and their roles.
- How does the cell state in LSTM mitigate vanishing gradients?
- When would you prefer a GRU over an LSTM?
- What is a bidirectional RNN and when is it appropriate?