Ludo Reinforcement Learning — Full Pipeline
Train a competitive Ludo AI from scratch using Deep Q-Networks, self-play, and shaped rewards. This guide covers the complete pipeline: Gym environment, state representation, DQN architecture, training loop, and evaluation harness.
Why Reinforcement Learning for Ludo?
Ludo occupies a sweet spot for RL research: the action space is small (at most 4 movable tokens per turn), the board is fully observable, the only randomness comes from the dice, and self-play generates unlimited training data without any labeled datasets. Games are also short by RL standards, typically a few hundred turns with four players. Unlike chess or Go, where move trees are astronomically large, Ludo's much smaller state space means you can train a competitive agent on a single consumer GPU in under 8 hours.
Traditional heuristic-based Ludo bots use hand-tuned weight functions (safety bonuses, progress scores, capture incentives) that require expert domain knowledge and still leave exploitable blind spots. RL lets the agent discover strategies on its own, including ones a human designer might not consider. Self-play reinforcement learning also underpinned AlphaGo's 2016 victory over Lee Sedol and AlphaZero's subsequent dominance across chess, shogi, and Go, although those systems combine it with tree search rather than a plain DQN. For Ludo, the scale is far more manageable, making it an ideal RL proving ground.
This guide implements a complete pipeline: a Gym-compatible Ludo environment with rich state encoding, a Deep Q-Network (DQN) agent with experience replay and target networks, shaped reward functions, a self-play training loop, and a rigorous evaluation harness. By the end, you will have a trained model that you can export via TorchScript and connect to the Ludo API Python client for live gameplay evaluation.
The Ludo Gym Environment
Every RL pipeline starts with a well-designed environment. We implement a gym.Env-compatible Ludo environment that exposes reset(), step(action), and render(). The environment encodes the full board state as a flat observation vector — this encoding is the most consequential design decision because it determines what the neural network sees and therefore what strategies it can learn.
Our board representation uses 52 track positions (the standard cross-shaped Ludo track), one 5-square home column per player, and 4 base positions per player. The state vector stores each token's position as a single float that is normalized to [0, 1] before it reaches the network, which makes position-based value functions easy to learn. We include the dice value and current player index as additional features so the network has full context for decision-making.
import gym
import numpy as np
from gym import spaces
from typing import Tuple, Dict, Optional


class LudoEnv(gym.Env):
    """
    Gym-compatible Ludo environment.

    Observation: 62 floats (16 token positions, 16 home-progress values,
    16 in-base flags, 4 base counts, 4 finished counts, 4 current-player
    progress values, dice value, current player). See _observe() for the
    exact layout.
    Action: 0–3 (choose which of the 4 tokens to move).
    """
    metadata = {"render_modes": ["human"]}
    TRACK_SIZE = 52
    HOME_SIZE = 5
    BASE_SIZE = 4
    NUM_PLAYERS = 4
    NUM_TOKENS = 4
    SAFE_SQUARES = [1, 9, 14, 22, 27, 35, 40, 48]

    def __init__(self, variant="standard"):
        super().__init__()
        self.variant = variant
        self.observation_space = spaces.Box(
            low=-1, high=100,
            shape=(62,), dtype=np.float32
        )
        self.action_space = spaces.Discrete(4)
        self._state = None

    def reset(self, seed: Optional[int] = None) -> np.ndarray:
        super().reset(seed=seed)
        # Each player's token positions: -1=base, 0–51=track, 52–56=home, 57=finished
        self.tokens = np.array([-1] * 16, dtype=np.int8)
        self.player = 0
        # Roll for the first turn; a dice value of 0 would waste the opening move
        self.dice = np.random.randint(1, 7)
        self.done = False
        self.winner = None
        self.consecutive_sixes = 0
        return self._observe()

    def step(self, action: int) -> Tuple[np.ndarray, float, bool, Dict]:
        if self.done:
            return self._observe(), 0.0, True, {"winner": self.winner}
        reward = 0.0
        player_offset = self.player * 4
        token_idx = player_offset + action
        current_pos = int(self.tokens[token_idx])
        goal = self.TRACK_SIZE + self.HOME_SIZE  # 57 = finished
        if current_pos == -1:
            # Token still in base: it can only move out on a 6
            if self.dice == 6:
                # Assumption: positions count squares travelled from the
                # player's own start square, so a token enters at 0.
                self.tokens[token_idx] = 0
                reward = self._compute_reward(token_idx, entering_board=True)
            else:
                reward = -0.05
        else:
            new_pos = current_pos + self.dice
            if new_pos > goal:
                # Overshoots the end of the home column: invalid move, penalise
                reward = -0.05
            elif new_pos == goal:
                # Token reached home; the game may end
                self.tokens[token_idx] = goal
                reward = self._compute_reward(token_idx, reached_home=True)
                if self._check_winner():
                    self.done = True
            else:
                captured = False
                if new_pos < self.TRACK_SIZE:
                    # Captures are only possible on the shared track
                    captured = self._check_capture(new_pos, player_offset)
                self.tokens[token_idx] = new_pos
                reward = self._compute_reward(token_idx, captured=captured)
        # Advance to next player (or re-roll on 6)
        self._advance_player()
        return self._observe(), reward, self.done, {"winner": self.winner}

    def _observe(self) -> np.ndarray:
        """
        Build 62-dimensional observation vector.
        - Positions 0–15: Raw token positions (-1=base, 0–56 on board, 57=finished);
          the state encoder and baseline bots read these and normalize them
        - Positions 16–31: Token home-column progress (0–1)
        - Positions 32–47: Token in-base flags (0 or 1)
        - Positions 48–51: Fraction of each player's tokens in base
        - Positions 52–55: Fraction of each player's tokens finished
        - Positions 56–59: Progress of the current player's tokens (0–1)
        - Position 60: Dice value (normalized)
        - Position 61: Current player index (normalized)
        """
        obs = np.zeros(62, dtype=np.float32)
        for p in range(4):
            for t in range(4):
                idx = p * 4 + t
                pos = int(self.tokens[idx])
                obs[idx] = pos
                if pos == -1:
                    obs[32 + idx] = 1.0
                    obs[48 + p] += 1.0 / 4
                elif pos >= 57:
                    obs[16 + idx] = 1.0
                    obs[52 + p] += 1.0 / 4
                elif pos >= self.TRACK_SIZE:
                    # Home column squares 52–56 map to 1/6 … 5/6
                    obs[16 + idx] = (pos - self.TRACK_SIZE + 1) / (self.HOME_SIZE + 1)
                if p == self.player:
                    obs[56 + t] = max(pos, 0) / 57.0
        obs[60] = self.dice / 6.0
        obs[61] = self.player / 3.0
        return obs

    def _compute_reward(self, token_idx: int,
                        entering_board: bool = False,
                        captured: bool = False,
                        reached_home: bool = False) -> float:
        """
        Shaped reward function: combines sparse terminal rewards with
        dense intermediate shaping for faster convergence.
        """
        pos = self.tokens[token_idx]
        if reached_home:
            return 100.0
        if entering_board:
            return 0.5
        if captured:
            return 5.0
        if pos > 0:
            return pos / 56.0 * 0.3
        return 0.0
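
    def _check_capture(self, new_pos: int, player_offset: int) -> bool:
        # Assumption: positions are stored relative to each player's own start
        # square, and start squares sit 13 apart at 1, 14, 27, 40. Convert to
        # absolute track squares before comparing tokens of different players.
        def absolute(owner: int, rel_pos: int) -> int:
            return (1 + 13 * owner + rel_pos) % self.TRACK_SIZE

        target = absolute(self.player, new_pos)
        if target in self.SAFE_SQUARES:
            return False
        for opp in range(16):
            owner = opp // 4
            if owner == self.player:
                continue  # never capture our own tokens
            opp_pos = int(self.tokens[opp])
            if not 0 <= opp_pos < self.TRACK_SIZE:
                continue  # tokens in base, in a home column, or finished are safe
            if absolute(owner, opp_pos) == target:
                self.tokens[opp] = -1
                return True
        return False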

    def _check_winner(self) -> bool:
        for p in range(4):
            if all(self.tokens[p*4:p*4+4] >= 57):
                self.winner = p
                return True
        return False

    def _advance_player(self):
        if self.dice != 6:
            self.player = (self.player + 1) % 4
            self.consecutive_sixes = 0
        else:
            # A 6 grants another turn, but three in a row passes the turn on
            self.consecutive_sixes += 1
            if self.consecutive_sixes >= 3:
                self.consecutive_sixes = 0
                self.player = (self.player + 1) % 4
        self.dice = np.random.randint(1, 7)

    def render(self, mode="human"):
        print(f"Player: {self.player}, Dice: {self.dice}, Tokens: {self.tokens}")
State Encoder — From Raw Board to Feature Vector
The state encoder transforms the raw board configuration into a neural-network-friendly feature vector. A naive approach encodes only token positions as a flat array of 16 integers. Our encoder is more sophisticated — it captures relational information between tokens, player-relative positions, and game-phase indicators that help the network distinguish between early-game, mid-game, and endgame situations.
Player-relative encoding is critical. Absolute track positions are misleading across players because each player's starting square is different. By encoding each position relative to the current player's start, the network learns position-independent strategies that transfer across all four players. We also compute "threat indicators" — whether opponent tokens are within capture range of the current player's tokens — giving the network explicit danger signals without needing to learn them from raw positions.
import numpy as np


class LudoStateEncoder:
    """
    Encodes Ludo board state into a rich feature vector for neural networks.
    Produces a 128-dimensional embedding from the 62-dimensional raw observation.
    """
    TRACK_SIZE = 52
    HOME_SIZE = 5
    SAFE_SQUARES = [1, 9, 14, 22, 27, 35, 40, 48]

    def encode(self, obs: np.ndarray, player: int) -> np.ndarray:
        """
        Args:
            obs: 62-dim raw observation from LudoEnv; obs[0:16] are read as raw
                 token positions (-1=base, 0–56 on board, 57=finished)
            player: current player index (0–3)
        Returns:
            128-dim encoded state
        """
        features = np.zeros(128, dtype=np.float32)
        pos = 0
        # Token progress for each player (normalized 0–1)
        for p in range(4):
            for t in range(4):
                idx = p * 4 + t
                raw_pos = obs[idx]
                if raw_pos <= 0:
                    features[pos] = 0.0  # in base or on the start square
                elif raw_pos >= 57:
                    features[pos] = 1.0
                else:
                    features[pos] = raw_pos / 56.0
                pos += 1
        # Player-relative progress (how far ahead/behind each player is),
        # comparing mean progress with mean progress
        player_track_total = 0.0
        for t in range(4):
            player_track_total += features[(player * 4) + t]
        avg_progress = player_track_total / 4
        for p in range(4):
            player_total = 0.0
            for t in range(4):
                player_total += features[(p * 4) + t]
            features[16 + p] = player_total / 4 - avg_progress
        # Game phase indicator (0=opening, 0.5=mid, 1=endgame)
        finished_count = sum(1 for t in range(4)
                             if features[player * 4 + t] >= 1.0)
        features[20] = finished_count / 4.0
        # One-hot encoding of the dice result (6 features); round to undo
        # floating-point error from the normalized dice value
        dice_val = int(round(float(obs[60]) * 6.0))
        for d in range(6):
            features[21 + d] = 1.0 if dice_val == d + 1 else 0.0

        # Assumption: positions are relative to each player's own start square,
        # with start squares 13 apart at 1, 14, 27, 40. Cross-player distances
        # therefore need absolute track squares.
        def abs_square(owner: int, rel_pos: float) -> int:
            return (1 + 13 * owner + int(rel_pos)) % self.TRACK_SIZE

        # Threat indicators: opponent tokens 1–6 squares behind one of our
        # tokens, i.e. close enough to capture it with a single roll
        for p in range(4):
            if p == player:
                continue
            threats = 0
            for t in range(4):
                opp_pos = obs[p * 4 + t]
                if not 0 <= opp_pos < self.TRACK_SIZE:
                    continue
                for own_t in range(4):
                    own_pos = obs[player * 4 + own_t]
                    if not 0 <= own_pos < self.TRACK_SIZE:
                        continue
                    dist = (abs_square(player, own_pos)
                            - abs_square(p, opp_pos)) % self.TRACK_SIZE
                    if 1 <= dist <= 6:
                        threats += 1
            features[27 + p] = threats / 16.0
        # Safe square proximity indicators (1.0 on a safe square, 0.0 at 3 squares away)
        for t in range(4):
            token_pos = obs[player * 4 + t]
            if 0 <= token_pos < self.TRACK_SIZE:
                square = abs_square(player, token_pos)
                for safe in self.SAFE_SQUARES:
                    gap = abs(square - safe)
                    if gap <= 3:
                        features[31 + t] = 1.0 - gap / 3.0
                        break
        return features
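To make the player-relative idea concrete, the following is a minimal sketch of converting between absolute track squares and player-relative progress, together with a simple threat test. The start squares (1, 14, 27, 40) and the helper names `to_relative`, `to_absolute`, and `is_threatened` are assumptions made for illustration; they are not part of the encoder above.

```python
TRACK_SIZE = 52
# Assumed start squares, 13 apart (illustrative; adjust to your board layout).
START_SQUARES = [1, 14, 27, 40]


def to_relative(abs_square: int, player: int) -> int:
    """Squares travelled from `player`'s start square to reach `abs_square`."""
    return (abs_square - START_SQUARES[player]) % TRACK_SIZE


def to_absolute(rel_pos: int, player: int) -> int:
    """Absolute track square for a player-relative position."""
    return (START_SQUARES[player] + rel_pos) % TRACK_SIZE


def is_threatened(own_abs: int, opp_abs: int) -> bool:
    """True if the opponent sits 1-6 squares behind our token,
    so a single dice roll could land on it."""
    gap = (own_abs - opp_abs) % TRACK_SIZE
    return 1 <= gap <= 6


if __name__ == "__main__":
    print(to_relative(14, player=1))           # player 1's own start square
    print(is_threatened(own_abs=20, opp_abs=16))
    print(is_threatened(own_abs=16, opp_abs=20))
```

The modulo arithmetic handles the wrap-around at the end of the track, so a token on square 2 is still threatened by an opponent on square 50.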
Deep Q-Network Agent — Architecture and Training
The DQN agent estimates the action-value function Q(s, a): the expected cumulative future reward of taking action a in state s. A neural network approximates this function, enabling the agent to generalise across the large state space. We use two key stabilising techniques: a target network (which provides stable bootstrapped targets) and experience replay (which breaks temporal correlations in the training data). The implementation below also uses the Double DQN update, in which the online network selects the next action and the target network evaluates it.
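As a rough illustration of experience replay on its own, here is a minimal buffer sketch using only the standard library. The class name `ReplayBuffer` and the tuple layout are assumptions for this example; the agent in this guide stores transitions in a plain `collections.deque` instead.

```python
import random
from collections import deque
from typing import Deque, List, Tuple

# (state, action, reward, next_state, done); layout assumed for illustration
Transition = Tuple[list, int, float, list, bool]


class ReplayBuffer:
    """Fixed-size FIFO buffer with uniform random sampling."""

    def __init__(self, capacity: int) -> None:
        # deque(maxlen=...) silently drops the oldest item when full
        self._buffer: Deque[Transition] = deque(maxlen=capacity)

    def push(self, transition: Transition) -> None:
        self._buffer.append(transition)

    def sample(self, batch_size: int) -> List[Transition]:
        # Uniform sampling without replacement decorrelates consecutive moves
        return random.sample(self._buffer, batch_size)

    def __len__(self) -> int:
        return len(self._buffer)


if __name__ == "__main__":
    buf = ReplayBuffer(capacity=3)
    for i in range(5):
        buf.push(([i], 0, 0.0, [i + 1], False))
    print(len(buf))       # only the 3 newest transitions remain
    print(buf.sample(2))
```

Because sampling is uniform over the whole buffer, a mini-batch mixes moves from many different games, which is what breaks the temporal correlation between consecutive transitions.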
The network architecture uses three hidden layers with ReLU activations. The final layer outputs 4 Q-values — one per possible action. During training, we select actions using an epsilon-greedy policy that decays from 1.0 to 0.05 over the course of training, gradually shifting from exploration to exploitation. The learning rate of 1e-4 with the Adam optimizer provides stable convergence for this problem scale.
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import collections
import random


class QNetwork(nn.Module):
    """
    Feedforward network for Q-value estimation.
    Input: 128-dim encoded state
    Output: 4 Q-values (one per action)
    """
    def __init__(self, state_dim=128, action_dim=4, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.LayerNorm(hidden),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.LayerNorm(hidden),
            nn.Linear(hidden, hidden // 2),
            nn.ReLU(),
            nn.Linear(hidden // 2, action_dim)
        )
        # Orthogonal initialisation for every linear layer
        for m in self.net:
            if isinstance(m, nn.Linear):
                nn.init.orthogonal_(m.weight, gain=np.sqrt(2))
                nn.init.constant_(m.bias, 0.0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class DQNAgent:
    """
    DQN agent with experience replay, target network, and double-Q updates.
    """
    def __init__(self, state_dim=128, action_dim=4):
        self.q_net = QNetwork(state_dim, action_dim)
        self.target_net = QNetwork(state_dim, action_dim)
        self.target_net.load_state_dict(self.q_net.state_dict())
        self.target_net.eval()
        self.replay_buffer = collections.deque(maxlen=200_000)
        self.gamma = 0.99
        self.epsilon = 1.0
        self.epsilon_min = 0.05
        self.lr = 1e-4
        self.batch_size = 128
        self.target_update_freq = 500
        self.train_step_counter = 0
        self.optimizer = optim.Adam(self.q_net.parameters(), lr=self.lr)
        self.loss_fn = nn.SmoothL1Loss()
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.q_net.to(self.device)
        self.target_net.to(self.device)

    def select_action(self, state: np.ndarray, training: bool = True) -> int:
        # Epsilon-greedy: explore with probability epsilon while training
        if training and random.random() < self.epsilon:
            return random.randint(0, 3)
        with torch.no_grad():
            state_t = torch.FloatTensor(state).unsqueeze(0).to(self.device)
            q_vals = self.q_net(state_t)
            return int(q_vals.argmax(dim=1).item())

    def store_transition(self, s, a, r, s2, done):
        self.replay_buffer.append((s, a, r, s2, done))

    def train_step(self) -> float:
        if len(self.replay_buffer) < self.batch_size:
            return 0.0
        batch = random.sample(self.replay_buffer, self.batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        states_t = torch.FloatTensor(np.array(states)).to(self.device)
        actions_t = torch.LongTensor(actions).to(self.device)
        rewards_t = torch.FloatTensor(rewards).to(self.device)
        next_states_t = torch.FloatTensor(np.array(next_states)).to(self.device)
        dones_t = torch.FloatTensor(dones).to(self.device)
        # Double DQN: use online network to select action, target network to evaluate
        with torch.no_grad():
            next_actions = self.q_net(next_states_t).argmax(dim=1)
            next_q = self.target_net(next_states_t).gather(1, next_actions.unsqueeze(1)).squeeze(1)
            target = rewards_t + self.gamma * next_q * (1 - dones_t)
        # squeeze(1) keeps the batch dimension even for a batch of one
        current_q = self.q_net(states_t).gather(1, actions_t.unsqueeze(1)).squeeze(1)
        loss = self.loss_fn(current_q, target.detach())
        self.optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(self.q_net.parameters(), max_norm=10.0)
        self.optimizer.step()
        self.train_step_counter += 1
        # Periodically sync the target network with the online network
        if self.train_step_counter % self.target_update_freq == 0:
            self.target_net.load_state_dict(self.q_net.state_dict())
        return loss.item()

    def decay_epsilon(self, episode: int, total: int):
        # Linear decay from 1.0, clipped at epsilon_min
        self.epsilon = max(self.epsilon_min, 1.0 - episode / total)

    def save(self, path: str):
        torch.save({
            "q_net": self.q_net.state_dict(),
            "optimizer": self.optimizer.state_dict(),
            "epsilon": self.epsilon,
            "step": self.train_step_counter
        }, path)

    def load(self, path: str):
        ckpt = torch.load(path, map_location=self.device)
        self.q_net.load_state_dict(ckpt["q_net"])
        self.target_net.load_state_dict(ckpt["q_net"])
        self.optimizer.load_state_dict(ckpt["optimizer"])
        self.epsilon = ckpt["epsilon"]
        self.train_step_counter = ckpt["step"]
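The Double DQN target used in train_step can be checked by hand on a single transition. The sketch below is plain Python with made-up Q-values; the function names are illustrative and not part of the agent class.

```python
from typing import List


def double_dqn_target(reward: float, done: bool, gamma: float,
                      online_next_q: List[float],
                      target_next_q: List[float]) -> float:
    """Double DQN target for one transition.

    The online network picks the next action (argmax); the target
    network supplies the value of that action.
    """
    if done:
        return reward  # no bootstrapping past a terminal state
    best_action = max(range(len(online_next_q)), key=online_next_q.__getitem__)
    return reward + gamma * target_next_q[best_action]


def vanilla_dqn_target(reward: float, done: bool, gamma: float,
                       target_next_q: List[float]) -> float:
    """Standard DQN target: the target network both selects and evaluates."""
    if done:
        return reward
    return reward + gamma * max(target_next_q)


if __name__ == "__main__":
    online = [1.0, 3.0, 2.0, 0.5]   # made-up online-network Q-values for s'
    target = [4.0, 1.5, 2.5, 0.0]   # made-up target-network Q-values for s'
    print(double_dqn_target(0.5, False, 0.99, online, target))
    print(vanilla_dqn_target(0.5, False, 0.99, target))
```

With these numbers the online network prefers action 1, so the Double DQN target is 0.5 + 0.99 × 1.5 = 1.985, while the vanilla target takes the target network's own maximum and gives 0.5 + 0.99 × 4.0 = 4.46. Decoupling selection from evaluation is what reduces the overestimation bias of the max operator.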
Self-Play Training Loop
The training loop implements self-play: the agent competes against itself and keeps improving as it meets stronger versions of its own strategy. A single shared agent instance plays all four positions, which creates a natural curriculum because the opponents improve at the same rate as the learner. We also keep a periodic snapshot of the agent as a frozen opponent, which provides a consistent baseline for measuring progress.
Episode data (states, actions, rewards, next states) is collected into the replay buffer. After each episode we run four gradient updates, each on a freshly sampled mini-batch. We save a checkpoint every 10,000 episodes and record training metrics (loss, epsilon, win rate against the snapshot opponent, average game length), which can be forwarded to TensorBoard or Weights & Biases for monitoring.
import collections
import time


def run_self_play_episode(env, agent, encoder, opponent_agent=None):
    """
    Runs one full Ludo game with self-play.
    Each player uses the same agent unless opponent_agent is given, in which
    case seats 2 and 3 are played greedily by opponent_agent.
    Returns (transitions, winner, episode_reward, game_length), where
    transitions is a list of (state, action, reward, next_state, done) tuples.
    """
    state = env.reset()
    done = False
    transitions = []
    episode_reward = {p: 0.0 for p in range(4)}
    game_length = 0
    info = {}
    while not done:
        current_player = env.player
        encoded = encoder.encode(state, current_player)
        if opponent_agent is not None and current_player >= 2:
            action = opponent_agent.select_action(encoded, training=False)
        else:
            action = agent.select_action(encoded, training=True)
        next_state, reward, done, info = env.step(action)
        next_encoded = encoder.encode(next_state, env.player)
        transitions.append((encoded, action, reward, next_encoded, float(done)))
        episode_reward[current_player] += reward
        state = next_state
        game_length += 1
    winner = info.get("winner", None)
    return transitions, winner, episode_reward, game_length


def train_dqn(total_episodes=200_000, save_every=10_000, eval_every=1_000):
    env = LudoEnv()
    encoder = LudoStateEncoder()
    agent = DQNAgent(state_dim=128, action_dim=4)
    # Snapshot opponent for consistent evaluation
    snapshot_agent = DQNAgent(state_dim=128, action_dim=4)
    snapshot_agent.q_net.load_state_dict(agent.q_net.state_dict())
    snapshot_agent.target_net.load_state_dict(agent.target_net.state_dict())
    metrics = {"loss": [], "epsilon": [], "win_rate": [], "avg_game_length": []}
    wins = [0] * 4
    total_games = 0
    recent_lengths = collections.deque(maxlen=500)
    start_time = time.time()
    for episode in range(total_episodes):
        agent.decay_epsilon(episode, total_episodes)
        # Self-play episode: the agent plays all 4 positions
        transitions, winner, ep_rewards, game_len = run_self_play_episode(
            env, agent, encoder
        )
        if winner is not None:
            wins[winner] += 1
        total_games += 1
        recent_lengths.append(game_len)
        # Store all transitions in replay buffer
        for s, a, r, s2, done in transitions:
            agent.store_transition(s, a, r, s2, done)
        # Four gradient updates per episode
        losses = []
        for _ in range(4):
            loss = agent.train_step()
            if loss > 0:
                losses.append(loss)
        # Evaluation every eval_every episodes
        if episode % eval_every == 0 and episode > 0:
            eval_wins = [0] * 4
            for _ in range(500):
                _, w, _, _ = run_self_play_episode(env, agent, encoder,
                                                   opponent_agent=snapshot_agent)
                if w is not None:
                    eval_wins[w] += 1
            # Win rate of seat 0 (live agent); chance level is 25% with four seats
            win_rate = eval_wins[0] / 500
            metrics["win_rate"].append(win_rate)
            metrics["loss"].append(sum(losses) / len(losses) if losses else 0)
            metrics["epsilon"].append(agent.epsilon)
            metrics["avg_game_length"].append(sum(recent_lengths) / len(recent_lengths))
            elapsed = time.time() - start_time
            print(f"Ep {episode:>7d} | "
                  f"Loss: {metrics['loss'][-1]:.4f} | "
                  f"Eps: {agent.epsilon:.3f} | "
                  f"Win rate: {win_rate:.1%} | "
                  f"Games: {total_games} | "
                  f"Time: {elapsed:.0f}s")
        # Checkpoint every save_every episodes and refresh the snapshot opponent
        if episode % save_every == 0 and episode > 0:
            agent.save(f"ludo_dqn_ep{episode}.pt")
            snapshot_agent.q_net.load_state_dict(agent.q_net.state_dict())
            snapshot_agent.target_net.load_state_dict(agent.target_net.state_dict())
    agent.save("ludo_dqn_final.pt")
    print(f"Training complete in {time.time() - start_time:.0f}s")
    return agent, metrics


if __name__ == "__main__":
    agent, metrics = train_dqn(total_episodes=100_000)
Reward Shaping — Dense vs. Sparse Rewards
Reward shaping is one of the most impactful (and underappreciated) aspects of RL for board games. A sparse reward scheme gives +100 for winning and -100 for losing, with 0 everywhere else. Sparse rewards are theoretically sufficient for convergence, but they make learning extremely slow because the agent receives an informative signal only at the very end of each game.
Dense reward shaping addresses this by providing intermediate learning signals. Our reward function combines four components: progress rewards (small positive values for advancing tokens along the track), entry rewards (bonus for bringing a token onto the board with a 6), capture rewards (significant bonus for sending an opponent back to base), and home rewards (large bonus for getting a token home). We carefully balance these so the shaped rewards are consistent with the true game objective — if an agent could maximise shaped rewards while consistently losing games, it would learn the wrong thing.
The key principle is potential-based shaping (Ng, Harada, and Russell, 1999): a shaping term of the form F(s, s') = γΦ(s') − Φ(s), where Φ is a potential function over states and s' is the state reached after the move, leaves the optimal policy unchanged. Note that F depends on both the current state and the next state. The fixed bonuses in _compute_reward are not exactly of this form, so they should be treated as a heuristic: monitor evaluation win rates to confirm that the agent is not learning sub-optimal strategies that "game" the reward function.
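A potential-based shaping term can be sketched in a few lines. The potential used here (mean token progress towards a goal of 57 squares) and the function names are assumptions chosen for illustration; the environment's _compute_reward does not use this form.

```python
from typing import Sequence

GOAL = 57  # assumed number of squares from start to finish


def potential(token_positions: Sequence[int]) -> float:
    """Phi(s): mean progress of one player's tokens, in [0, 1].
    Tokens still in base (-1) count as zero progress."""
    progress = [max(p, 0) / GOAL for p in token_positions]
    return sum(progress) / len(progress)


def shaping_term(prev_positions: Sequence[int],
                 next_positions: Sequence[int],
                 gamma: float = 0.99) -> float:
    """F(s, s') = gamma * Phi(s') - Phi(s)."""
    return gamma * potential(next_positions) - potential(prev_positions)


if __name__ == "__main__":
    before = [-1, 10, 0, 57]
    after = [-1, 16, 0, 57]    # second token advanced by a roll of 6
    print(shaping_term(before, after))
    print(shaping_term(after, before))  # losing progress is penalised
```

With gamma set to 1, the shaping terms along any trajectory telescope to Phi(end) − Phi(start), so the agent cannot accumulate extra shaped reward by cycling between states. That telescoping property is the reason this form preserves the optimal policy.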
Evaluation Harness — Testing Trained Models
A trained model is only as good as its evaluation protocol. We implement a comprehensive evaluation harness that tests the agent across multiple dimensions: win rate against baseline opponents, consistency across game variants, and sensitivity to specific game situations.
The evaluation uses greedy action selection (epsilon=0) to measure the agent's true strategic quality without exploration noise. We consider three baseline opponents: random play (lower bound), a simple heuristic bot that prioritises home entries and safe squares (mid bound), and the snapshot opponent from training (a measure of self-improvement over time). The harness below covers the random and heuristic baselines; the snapshot comparison runs inside the training loop.
import random
import numpy as np


class HeuristicBot:
    """
    Baseline heuristic bot for evaluation comparison.
    Priority: home entry > safe square > advance furthest; leaving base
    on a 6 is the lowest-priority valid move.
    """
    SAFE_SQUARES = [1, 9, 14, 22, 27, 35, 40, 48]

    def __init__(self, player: int):
        self.player = player

    def select_action(self, obs: np.ndarray, dice: int) -> int:
        # Reads obs[0:16] as raw token positions
        # (-1=base, 0–56 on board, 57=finished)
        tokens = obs[self.player * 4:(self.player + 1) * 4]
        valid_actions = []
        for a in range(4):
            pos = int(tokens[a])
            if pos == -1:
                if dice == 6:
                    valid_actions.append((a, 3))
            elif pos >= 57:
                continue
            else:
                new_pos = pos + dice
                if new_pos <= 57:
                    # Assumption: positions are relative to the player's start
                    # square, with start squares at 1, 14, 27, 40
                    square = (1 + 13 * self.player + new_pos) % 52
                    if new_pos == 57:
                        score = 1000
                    elif new_pos < 52 and square in self.SAFE_SQUARES:
                        score = 50
                    else:
                        score = new_pos
                    valid_actions.append((a, score))
        if not valid_actions:
            return 0
        valid_actions.sort(key=lambda x: x[1], reverse=True)
        return valid_actions[0][0]


class RandomBot:
    def select_action(self, obs: np.ndarray, dice: int) -> int:
        return random.randint(0, 3)


def evaluate_agent(agent_path: str, num_games: int = 5000):
    env = LudoEnv()
    encoder = LudoStateEncoder()
    agent = DQNAgent(state_dim=128)
    agent.load(agent_path)
    agent.epsilon = 0.0
    baseline_bots = {
        "random": [RandomBot() for _ in range(4)],
        "heuristic": [HeuristicBot(i) for i in range(4)]
    }
    results = {}
    for baseline_name, bots in baseline_bots.items():
        wins = [0] * 4
        game_lengths = []
        for _ in range(num_games):
            state = env.reset()
            done = False
            length = 0
            while not done:
                current = env.player
                encoded = encoder.encode(state, current)
                if current == 0:
                    # Seat 0 is always the trained agent, playing greedily
                    action = agent.select_action(encoded, training=False)
                else:
                    action = bots[current].select_action(
                        state, env.dice
                    )
                state, _, done, info = env.step(action)
                length += 1
            if info["winner"] is not None:
                wins[info["winner"]] += 1
            game_lengths.append(length)
        results[baseline_name] = {
            "agent_wins": wins[0],
            "total_games": num_games,
            "win_rate": wins[0] / num_games,
            "avg_game_length": sum(game_lengths) / len(game_lengths),
            "wins_per_position": wins
        }
    print("=== Evaluation Results ===")
    for baseline, res in results.items():
        print(f"{baseline}: Win rate = {res['win_rate']:.1%} | "
              f"Avg length = {res['avg_game_length']:.1f}")
    return results


if __name__ == "__main__":
    results = evaluate_agent("ludo_dqn_final.pt")
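A raw win rate is easier to interpret with a confidence interval around it. The sketch below computes a Wilson score interval using only the standard library; the function name and the example counts are illustrative and not output from the harness above.

```python
import math
from typing import Tuple


def wilson_interval(wins: int, games: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a win rate (z=1.96 gives roughly 95% coverage)."""
    if games == 0:
        return (0.0, 1.0)  # no data: the rate could be anything
    p = wins / games
    denom = 1 + z * z / games
    centre = (p + z * z / (2 * games)) / denom
    half = z * math.sqrt(p * (1 - p) / games + z * z / (4 * games * games)) / denom
    return (centre - half, centre + half)


if __name__ == "__main__":
    # Hypothetical counts, purely to show the call
    low, high = wilson_interval(wins=1500, games=5000)
    chance = 0.25  # expected win rate of one seat among four equal players
    print(f"95% interval: [{low:.3f}, {high:.3f}]")
    print("better than chance" if low > chance else "not distinguishable from chance")
```

In a four-player game one seat wins about 25% of the time by chance, so a useful check is whether the lower bound of the interval clears 0.25 rather than whether the point estimate does.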
Exporting and Deploying the Trained Model
Once training is complete, export the model via TorchScript for deployment. TorchScript serialises the model's computation graph, so it can be loaded without the original Python class definitions and called from C++ or mobile runtimes. A TorchScript model can be loaded directly into the Ludo API Python client to power a live bot that plays against human opponents or other API-connected agents.
import numpy as np
import torch


def export_torchscript(agent_path: str, output_path: str):
    agent = DQNAgent(state_dim=128)
    ckpt = torch.load(agent_path, map_location="cpu")
    agent.q_net.load_state_dict(ckpt["q_net"])
    # DQNAgent may have placed the network on the GPU; trace on the CPU
    agent.q_net.to("cpu").eval()
    # Trace the model with a dummy batch of one state for TorchScript serialisation
    dummy_input = torch.zeros(1, 128)
    traced = torch.jit.trace(agent.q_net, dummy_input)
    traced.save(output_path)
    print(f"Exported TorchScript model to {output_path}")


# Use with LudoKingAPI Python client
def load_and_play(model_path: str, api_key: str):
    model = torch.jit.load(model_path)
    model.eval()

    # The loaded module no longer needs the QNetwork class definition
    def get_action(state: np.ndarray) -> int:
        with torch.no_grad():
            state_t = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
            q_vals = model(state_t)
        return int(q_vals.argmax(dim=1).item())

    return get_action


if __name__ == "__main__":
    export_torchscript("ludo_dqn_final.pt", "ludo_dqn.pt")
Frequently Asked Questions
Supporting a rule variant mostly means changing the LudoEnv class; the DQN agent and training loop remain unchanged. Quick Ludo (reduced token count per player, shorter track) similarly requires only environment parameter changes. The state encoder's home-size and track-size constants must be updated to match the variant. One consideration: RL policies learned for standard Ludo may not transfer directly to variants because the action-value landscape changes, so retraining from scratch with the modified environment is recommended. The game development tutorial covers implementing custom Ludo variants that could serve as the environment layer for this RL pipeline.
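One way to keep the environment and encoder constants in sync is a small shared configuration object. The sketch below is hypothetical: the `LudoVariant` class, its fields, and the example values are assumptions for illustration, and the real rules of any given variant may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LudoVariant:
    """Board parameters that an environment and encoder could share (hypothetical)."""
    track_size: int = 52
    home_size: int = 5
    num_players: int = 4
    tokens_per_player: int = 4

    @property
    def goal(self) -> int:
        """Position value that marks a finished token."""
        return self.track_size + self.home_size

    @property
    def num_tokens(self) -> int:
        """Total tokens on the board across all players."""
        return self.num_players * self.tokens_per_player

    @property
    def action_dim(self) -> int:
        """One action per token the current player could move."""
        return self.tokens_per_player


STANDARD = LudoVariant()
# Illustrative values only, not the official rules of any variant
REDUCED = LudoVariant(tokens_per_player=2)

if __name__ == "__main__":
    print(STANDARD.goal, STANDARD.num_tokens, STANDARD.action_dim)
    print(REDUCED.goal, REDUCED.num_tokens, REDUCED.action_dim)
```

Note that changing the token count also changes the size of the action space and of the observation vector, so the network's input and output dimensions would need to follow the configuration, which is another reason to retrain rather than reuse weights.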
Deploy Your Trained Ludo AI with LudoKingAPI
Connect your PyTorch model to the LudoKingAPI infrastructure for live gameplay evaluation, tournament matchmaking, and human-vs-AI competitions.