Chapter 10 — RNN, LSTM, GRU: Sequence Modeling
Unit III · Neural Networks in Practice
1. Recurrent Neural Networks
An RNN processes sequences \((x_1, x_2, \ldots, x_T)\) by maintaining a hidden state \(h_t\) that carries context from previous timesteps:
\[h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h), \quad \hat{y}_t = W_{hy} h_t + b_y\]
The same weight matrices \(W_{hh}, W_{xh}\) are shared across all timesteps — RNNs have the same number of parameters regardless of sequence length.
layers.SimpleRNN(64, return_sequences=True, input_shape=(T, features))
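The weight sharing can be made concrete with a minimal NumPy forward pass (toy sizes chosen here for illustration; not the Keras internals):

```python
import numpy as np

rng = np.random.default_rng(0)
features, hidden = 3, 4          # illustrative toy sizes

# Shared parameters -- the same matrices are reused at every timestep
W_hh = rng.normal(size=(hidden, hidden))
W_xh = rng.normal(size=(hidden, features))
b_h = np.zeros(hidden)

def rnn_forward(xs):
    """Run h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h) over a sequence."""
    h = np.zeros(hidden)
    for x in xs:                 # one step per timestep, same weights each time
        h = np.tanh(W_hh @ h + W_xh @ x + b_h)
    return h

short = rng.normal(size=(5, features))    # T = 5
long = rng.normal(size=(500, features))   # T = 500
h5, h500 = rnn_forward(short), rnn_forward(long)

# Parameter count depends only on the layer sizes, never on T
n_params = W_hh.size + W_xh.size + b_h.size  # -> 4*4 + 4*3 + 4 = 32
```

Both sequences reuse the same 32 parameters; only the number of loop iterations changes with T.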
2. Deep RNNs
Stack multiple RNN layers where the output sequence of one layer becomes the input to the next. Increases representational power for complex sequence patterns.
model = keras.Sequential([
    layers.LSTM(128, return_sequences=True, input_shape=(T, features)),
    layers.LSTM(64, return_sequences=True),
    layers.LSTM(32),
    layers.Dense(1)
])
Bidirectional RNNs: process the sequence forward and backward, concatenating the hidden states — useful when the full sequence is available (e.g., NLP classification, not real-time forecasting).
layers.Bidirectional(layers.LSTM(64))
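What the `Bidirectional` wrapper does can be sketched in NumPy with plain RNN cells (toy sizes, illustrative only): two independent parameter sets process the sequence in opposite directions, and their final states are concatenated.

```python
import numpy as np

rng = np.random.default_rng(1)
features, hidden = 3, 4

def rnn_last_state(xs, W_hh, W_xh, b_h):
    """Final hidden state of a simple tanh RNN over a sequence."""
    h = np.zeros(hidden)
    for x in xs:
        h = np.tanh(W_hh @ h + W_xh @ x + b_h)
    return h

# Two independent parameter sets: one per direction
shapes = [(hidden, hidden), (hidden, features), (hidden,)]
fwd = [rng.normal(size=s) for s in shapes]
bwd = [rng.normal(size=s) for s in shapes]

xs = rng.normal(size=(10, features))
h_fwd = rnn_last_state(xs, *fwd)        # left-to-right pass
h_bwd = rnn_last_state(xs[::-1], *bwd)  # right-to-left pass
h_bi = np.concatenate([h_fwd, h_bwd])   # 2 * hidden features
```

This is why a `Bidirectional(LSTM(64))` layer outputs 128 features, and why the backward pass requires the whole sequence up front.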
3. LSTM (Long Short-Term Memory)
LSTMs (Hochreiter & Schmidhuber, 1997) solve the vanishing gradient problem by introducing a cell state \(C_t\) — a memory highway protected by three learnable gates:
- Forget gate: \(f_t = \sigma(W_f[h_{t-1}, x_t] + b_f)\) — what to erase from cell state.
- Input gate: \(i_t = \sigma(W_i[h_{t-1}, x_t] + b_i)\), \(\tilde{C}_t = \tanh(W_C[h_{t-1}, x_t] + b_C)\) — what new info to add.
- Output gate: \(o_t = \sigma(W_o[h_{t-1}, x_t] + b_o)\); \(h_t = o_t \odot \tanh(C_t)\) — what to output.
\[C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t\]
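A single LSTM step can be written directly from these equations (a minimal NumPy sketch with toy sizes; real implementations fuse the four matrix products):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, C_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o):
    """One LSTM step; every W_* acts on the concatenation [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x])
    f = sigmoid(W_f @ z + b_f)        # forget gate: what to erase
    i = sigmoid(W_i @ z + b_i)        # input gate: what to write
    C_tilde = np.tanh(W_C @ z + b_C)  # candidate values
    C = f * C_prev + i * C_tilde      # cell-state "highway" update
    o = sigmoid(W_o @ z + b_o)        # output gate
    h = o * np.tanh(C)
    return h, C

# Saturating the gates exposes the memory highway: with f ~= 1 and i ~= 0
# the cell state passes through unchanged, no matter the input.
hidden, features = 4, 3
W0 = np.zeros((hidden, hidden + features))
C_prev = np.array([1.0, 2.0, 3.0, 4.0])
h, C = lstm_step(np.ones(features), np.zeros(hidden), C_prev,
                 W0, W0, W0, W0,
                 np.full(hidden, 100.0),    # forget bias -> f ~= 1
                 np.full(hidden, -100.0),   # input bias  -> i ~= 0
                 np.zeros(hidden), np.zeros(hidden))
# C equals C_prev here: nothing forgotten, nothing added
```

Because the cell-state update is additive (a sum, not a repeated matrix product), gradients flowing along \(C_t\) are not forced through a squashing nonlinearity at every step.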
4. GRU (Gated Recurrent Unit)
GRU (Cho et al., 2014) simplifies LSTM to two gates, merging cell state and hidden state:
- Reset gate: \(r_t = \sigma(W_r[h_{t-1}, x_t])\) — how much of previous hidden state to use.
- Update gate: \(z_t = \sigma(W_z[h_{t-1}, x_t])\) — interpolation between old and new state.
With candidate state \(\tilde{h}_t = \tanh(W_h[r_t \odot h_{t-1}, x_t])\), the new hidden state interpolates between old and new:
\[h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t\]
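The GRU step is correspondingly shorter in code (a minimal NumPy sketch with toy sizes, using the standard candidate \(\tanh(W_h[r_t \odot h_{t-1}, x_t])\)):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, W_r, W_z, W_h, b_r, b_z, b_h):
    """One GRU step; the weights act on the concatenation [h_{t-1}, x_t]."""
    hx = np.concatenate([h_prev, x])
    r = sigmoid(W_r @ hx + b_r)   # reset gate
    z = sigmoid(W_z @ hx + b_z)   # update gate
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x]) + b_h)  # candidate
    return (1.0 - z) * h_prev + z * h_tilde  # interpolate old vs new

# z ~= 0 copies the old state forward -- the GRU analogue of the LSTM's
# saturated forget gate, with no separate cell state needed.
hidden, features = 4, 3
W0 = np.zeros((hidden, hidden + features))
h_prev = np.array([0.5, -0.5, 0.25, 0.0])
h = gru_step(np.ones(features), h_prev, W0, W0, W0,
             np.zeros(hidden), np.full(hidden, -100.0), np.zeros(hidden))
```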
| Aspect | LSTM | GRU |
|---|---|---|
| Gates | 3 (forget, input, output) | 2 (reset, update) |
| States | h + C | h only |
| Parameters | More | ~25% fewer |
| Performance | Slightly better (long seq) | Comparable, faster |
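The "~25% fewer parameters" row follows directly from the gate counts: each gate (or candidate) contributes one weight matrix over \([h_{t-1}, x_t]\) plus a bias, so the GRU/LSTM ratio is 3/4. A quick arithmetic check for the classic formulations (no layer norm, one bias per gate; note Keras' GRU default `reset_after=True` adds a second bias vector, so its count is slightly higher):

```python
def lstm_params(n_in, n_hidden):
    # 4 gates, each an (n_hidden x (n_hidden + n_in)) weight matrix + bias
    return 4 * (n_hidden * (n_hidden + n_in) + n_hidden)

def gru_params(n_in, n_hidden):
    # 3 gates/candidates of the same shape
    return 3 * (n_hidden * (n_hidden + n_in) + n_hidden)

saving = 1 - gru_params(128, 128) / lstm_params(128, 128)  # -> 0.25
```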
Worked Example — LSTM for stock price prediction
import numpy as np, tensorflow as tf
# Create sliding windows: past 60 days → predict next day
def make_dataset(series, window=60):
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i:i+window])
        y.append(series[i+window])
    return np.array(X)[..., None], np.array(y)  # add a trailing feature axis

# prices_scaled: a 1-D array of closing prices, already scaled (e.g. to [0, 1])
X, y = make_dataset(prices_scaled)
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(50, return_sequences=True, input_shape=(60, 1)),
    tf.keras.layers.LSTM(50),
    tf.keras.layers.Dense(1)
])
model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=20, batch_size=32, validation_split=0.1)
Viva Questions
- Why do standard RNNs fail on long sequences?
- Describe the three gates of an LSTM and their roles.
- How does the cell state in LSTM mitigate vanishing gradients?
- When would you prefer a GRU over an LSTM?
- What is a bidirectional RNN and when is it appropriate?