Ludo Reinforcement Learning — Train Your Own AI with PyTorch & Self-Play

Why Reinforcement Learning for Ludo?

Ludo occupies a sweet spot for RL research: the action space is small (at most 4 movable tokens per turn), game length is short (20–60 moves), the board is fully observable with the dice roll as the only source of randomness, and self-play generates unlimited training data without any labeled datasets. Unlike chess or Go, where move trees are astronomically large, Ludo's reduced state space means you can train a competitive agent on a single consumer GPU in under 8 hours.

Traditional heuristic-based Ludo bots use hand-tuned weight functions — safety bonuses, progress scores, capture incentives — that require expert domain knowledge and still leave exploitable blind spots. RL lets the agent discover strategies that humans never considered. The same approach powered AlphaGo's 2016 victory and AlphaZero's subsequent dominance across chess, shogi, and Go. For Ludo, the scale is far more manageable, making it an ideal RL proving ground.

This guide implements a complete pipeline: a Gym-compatible Ludo environment with rich state encoding, a Deep Q-Network (DQN) agent with experience replay and target networks, shaped reward functions, a self-play training loop, and a rigorous evaluation harness. By the end, you will have a trained model that you can export via TorchScript and connect to the Ludo API Python client for live gameplay evaluation.

The Ludo Gym Environment

Every RL pipeline starts with a well-designed environment. We implement a gym.Env-compatible Ludo environment that exposes reset(), step(action), and render(). The environment encodes the full board state as a flat observation vector — this encoding is the most consequential design decision because it determines what the neural network sees and therefore what strategies it can learn.

Our board representation uses 52 track positions (the standard cross-shaped Ludo track), one home column per player (5 squares long), and 4 base positions per player. The state vector encodes each token's position as a normalized float, making it easy for a neural network to learn position-based value functions. We include the dice value and current player index as additional features so the network has full context for decision-making.

import gym
import numpy as np
from gym import spaces
from typing import Tuple, Dict, Optional

class LudoEnv(gym.Env):
    """
    Gym-compatible Ludo environment.
    Observation: 62 floats (see _observe for the exact layout):
                 per-token position features and per-player counts,
                 plus the dice value and current player index.
    Action: 0–3 (choose which of the 4 tokens to move).
    """
    metadata = {"render_modes": ["human"]}
    
    TRACK_SIZE = 52
    HOME_SIZE = 5
    BASE_SIZE = 4
    NUM_PLAYERS = 4
    NUM_TOKENS = 4
    SAFE_SQUARES = [1, 9, 14, 22, 27, 35, 40, 48]
    
    def __init__(self, variant="standard"):
        super().__init__()
        self.variant = variant
        self.observation_space = spaces.Box(
            low=-1, high=100,
            shape=(62,), dtype=np.float32
        )
        self.action_space = spaces.Discrete(4)
        self._state = None

    def reset(self, seed: Optional[int] = None) -> np.ndarray:
        super().reset(seed=seed)
        # Each player's token positions: -1=base, 0–51=track, 52–56=home, 57=finished
        self.tokens = np.array([-1] * 16, dtype=np.int8)
        self.player = 0
        # Roll before the first move so step() never sees a dice value of 0
        self.dice = np.random.randint(1, 7)
        self.done = False
        self.winner = None
        self.consecutive_sixes = 0
        return self._observe()

    def step(self, action: int) -> Tuple[np.ndarray, float, bool, Dict]:
        if self.done:
            return self._observe(), 0.0, True, {}

        reward = 0.0
        player_offset = self.player * 4
        token_idx = player_offset + action
        current_pos = self.tokens[token_idx]

        if current_pos == -1:
            # Token still in base — can only move out on a 6
            if self.dice == 6:
                # Positions are stored as per-player progress, so every
                # player enters at progress 0 (their own start square)
                self.tokens[token_idx] = 0
                reward = self._compute_reward(token_idx, entering_board=True)
            else:
                reward = -0.05
        else:
            new_pos = current_pos + self.dice
            if new_pos > self.TRACK_SIZE + self.HOME_SIZE:
                # Exceeds home column — invalid move, penalise
                reward = -0.05
            elif new_pos == self.TRACK_SIZE + self.HOME_SIZE:
                # Token reached home — game may end
                self.tokens[token_idx] = 57
                reward = self._compute_reward(token_idx, reached_home=True)
                if self._check_winner():
                    self.done = True
            else:
                captured = False
                if new_pos < self.TRACK_SIZE:
                    # Captures can only happen on the shared track (0–51)
                    captured = self._check_capture(new_pos, player_offset)
                # Move first so the progress reward reflects the new position
                self.tokens[token_idx] = new_pos
                reward = self._compute_reward(token_idx, captured=captured)

        # Advance to next player (or re-roll on 6)
        self._advance_player()
        return self._observe(), reward, self.done, {"winner": self.winner}

    def _observe(self) -> np.ndarray:
        """
        Build 62-dimensional observation vector.
        - Positions 0–15:  Token progress, pos / 56 clipped to 0–1 (0 while in base)
        - Positions 16–31: Token home-column progress (0–1, 1.0 = finished)
        - Positions 32–47: Token in-base flags (0 or 1)
        - Positions 48–51: Player base tokens count
        - Positions 52–55: Fraction of each player's tokens that have finished
        - Positions 56–59: Reserved for relative track positions (left at zero here)
        - Position 60: Dice value (normalized)
        - Position 61: Current player index (normalized)
        """
        obs = np.zeros(62, dtype=np.float32)
        for p in range(4):
            for t in range(4):
                idx = p * 4 + t
                pos = int(self.tokens[idx])
                if pos == -1:
                    obs[32 + idx] = 1.0
                    obs[48 + p] += 1.0
                    continue
                obs[idx] = min(pos, 56) / 56
                if pos >= self.TRACK_SIZE:
                    # Home column squares 52–56 map to 1/6–5/6, finished (57) to 1.0
                    obs[16 + idx] = (pos - 51) / 6
                if pos >= 57:
                    obs[52 + p] += 1.0 / 4
        obs[60] = self.dice / 6.0
        obs[61] = self.player / 3.0
        return obs

    def _compute_reward(self, token_idx: int,
                         entering_board: bool = False,
                         captured: bool = False,
                         reached_home: bool = False) -> float:
        """
        Shaped reward function — combines sparse terminal rewards with
        dense intermediate shaping for faster convergence.
        """
        reward = 0.0
        pos = self.tokens[token_idx]
        if reached_home:
            return 100.0
        if entering_board:
            return 0.5
        if captured:
            return 5.0
        if pos > 0:
            return pos / 56.0 * 0.3
        return 0.0

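    def _check_capture(self, new_pos: int, player_offset: int) -> bool:
        # Assumption: start squares are 13 apart, with player 0 entering on
        # absolute square 1 (consistent with SAFE_SQUARES above)
        def absolute(progress: int, owner: int) -> int:
            return (1 + 13 * owner + progress) % self.TRACK_SIZE

        target = absolute(new_pos, self.player)
        if target in self.SAFE_SQUARES:
            return False
        for opp in range(16):
            owner = opp // 4
            opp_pos = int(self.tokens[opp])
            if owner == self.player:
                continue  # never capture our own tokens
            if not 0 <= opp_pos < self.TRACK_SIZE:
                continue  # tokens in base, in a home column, or finished are safe
            if absolute(opp_pos, owner) == target:
                self.tokens[opp] = -1  # send the captured token back to base
                return True
        return False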

    def _check_winner(self) -> bool:
        for p in range(4):
            if all(self.tokens[p*4:p*4+4] >= 57):
                self.winner = p
                return True
        return False

    def _advance_player(self):
        if self.dice != 6:
            self.player = (self.player + 1) % 4
            self.consecutive_sixes = 0
        else:
            self.consecutive_sixes += 1
            if self.consecutive_sixes >= 3:
                self.consecutive_sixes = 0
                self.player = (self.player + 1) % 4
        self.dice = np.random.randint(1, 7)

    def render(self, mode="human"):
        print(f"Player: {self.player}, Dice: {self.dice}, Tokens: {self.tokens}")

State Encoder — From Raw Board to Feature Vector

The state encoder transforms the raw board configuration into a neural-network-friendly feature vector. A naive approach encodes only token positions as a flat array of 16 integers. Our encoder is more sophisticated — it captures relational information between tokens, player-relative positions, and game-phase indicators that help the network distinguish between early-game, mid-game, and endgame situations.

Player-relative encoding is critical. Absolute track positions are misleading across players because each player's starting square is different. By encoding each position relative to the current player's start, the network learns position-independent strategies that transfer across all four players. We also compute "threat indicators" — whether opponent tokens are within capture range of the current player's tokens — giving the network explicit danger signals without needing to learn them from raw positions.

import numpy as np

class LudoStateEncoder:
    """
    Encodes Ludo board state into a rich feature vector for neural networks.
    Produces a 128-dimensional vector from the 62-dimensional raw observation;
    only the first 35 slots are populated, the rest are zero padding.
    """
    TRACK_SIZE = 52
    HOME_SIZE = 5
    SAFE_SQUARES = [1, 9, 14, 22,
                    27, 35, 40, 48]

    def encode(self, obs: np.ndarray, player: int) -> np.ndarray:
        """
        Args:
            obs: 62-dim raw observation from LudoEnv
            player: current player index (0–3)
        Returns:
            128-dim encoded state
        """
        features = np.zeros(128, dtype=np.float32)
        pos = 0

        # Token progress for each player (already normalized to 0–1 by LudoEnv)
        for p in range(4):
            for t in range(4):
                idx = p * 4 + t
                # obs[idx] is pos / 56, so it must not be divided by 56 again
                features[pos] = obs[idx]
                pos += 1

        # Player-relative positions (how far ahead/behind each player is)
        player_track_total = 0.0
        for t in range(4):
            player_track_total += features[(player * 4) + t]
        avg_progress = player_track_total / 4
        for p in range(4):
            player_total = 0.0
            for t in range(4):
                player_total += features[(p * 4) + t]
            features[16 + p] = player_total - avg_progress
        pos = 20

        # Game phase indicator: fraction of the current player's tokens that
        # have finished (0 = opening, 1 = all home), read from the observation
        features[20] = obs[52 + player]
        
        # One-hot encoding of the dice result (6 features); round() guards
        # against float error turning e.g. 5 into 4.9999
        dice_val = int(round(float(obs[60]) * 6.0))
        for d in range(6):
            features[21 + d] = 1.0 if dice_val == d + 1 else 0.0
        
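        # Threat indicators: opponent tokens 1–6 squares behind one of our
        # tokens, i.e. able to capture it with a single roll
        def track_square(p: int, t: int):
            # Absolute track square, or None if the token is in base or off the
            # shared track. Assumes player p enters on square 1 + 13 * p.
            idx = p * 4 + t
            progress = int(round(float(obs[idx]) * 56))
            if obs[32 + idx] == 1.0 or progress >= 52:
                return None
            return (1 + 13 * p + progress) % 52

        for p in range(4):
            if p == player:
                continue
            threats = 0
            for t in range(4):
                opp_sq = track_square(p, t)
                for own_t in range(4):
                    own_sq = track_square(player, own_t)
                    if opp_sq is None or own_sq is None:
                        continue
                    if 1 <= (own_sq - opp_sq) % 52 <= 6:
                        threats += 1
            features[27 + p] = threats / 16.0
        pos = 31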

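        # Safe square proximity: 1.0 on a safe square, falling to 0 at 3 squares away
        for t in range(4):
            idx = player * 4 + t
            progress = int(round(float(obs[idx]) * 56))
            if obs[32 + idx] == 1.0 or progress >= 52:
                continue  # in base or already off the shared track
            # Assumes player p enters on absolute square 1 + 13 * p
            square = (1 + 13 * player + progress) % 52
            nearest = min(abs(square - safe) for safe in self.SAFE_SQUARES)
            if nearest <= 3:
                features[31 + t] = 1.0 - nearest / 3.0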
        return features
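
To make the player-relative idea concrete, here is a minimal standalone sketch of converting a token's per-player progress into an absolute track square and checking whether an opponent sits within capture range. It assumes start squares spaced 13 apart with player 0 entering on square 1 (consistent with the safe squares 1, 14, 27, 40 listed above, but not confirmed by the environment code). The helper names `to_absolute`, `squares_behind`, and `is_threatened` are hypothetical.

```python
TRACK_SIZE = 52
START_SPACING = 13  # assumption: four start squares evenly spaced around the track
FIRST_START = 1     # assumption: player 0 enters on absolute square 1


def to_absolute(progress: int, player: int) -> int:
    """Map a player's own progress (0 = their start square) to a shared track square."""
    if not 0 <= progress < TRACK_SIZE:
        raise ValueError("only tokens on the shared track have an absolute square")
    return (FIRST_START + START_SPACING * player + progress) % TRACK_SIZE


def squares_behind(own_progress: int, own_player: int,
                   opp_progress: int, opp_player: int) -> int:
    """Squares the opponent must advance to land on our token (0 = same square)."""
    own_abs = to_absolute(own_progress, own_player)
    opp_abs = to_absolute(opp_progress, opp_player)
    return (own_abs - opp_abs) % TRACK_SIZE


def is_threatened(own_progress: int, own_player: int,
                  opp_progress: int, opp_player: int) -> bool:
    """True when a single die roll (1-6) would let the opponent reach our token."""
    gap = squares_behind(own_progress, own_player, opp_progress, opp_player)
    return 1 <= gap <= 6
```

This sketch ignores safe squares; a fuller threat check would also skip tokens standing on one of them.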

Deep Q-Network Agent — Architecture and Training

The DQN agent estimates the action-value function Q(s, a): the expected cumulative future reward of taking action a in state s. A neural network approximates this function, enabling the agent to generalise across the large state space. We use two key stabilising techniques: a target network (which provides stable bootstrapped targets) and experience replay (which breaks temporal correlations in the training data).

The network architecture uses three hidden layers with ReLU activations. The final layer outputs 4 Q-values — one per possible action. During training, we select actions using an epsilon-greedy policy that decays from 1.0 to 0.05 over the course of training, gradually shifting from exploration to exploitation. The learning rate of 1e-4 with the Adam optimizer provides stable convergence for this problem scale.

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import collections
import random

class QNetwork(nn.Module):
    """
    Feedforward network for Q-value estimation.
    Input: 128-dim encoded state
    Output: 4 Q-values (one per action)
    """
    def __init__(self, state_dim=128, action_dim=4, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.LayerNorm(hidden),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.LayerNorm(hidden),
            nn.Linear(hidden, hidden // 2),
            nn.ReLU(),
            nn.Linear(hidden // 2, action_dim)
        )
        for m in self.net:
            if isinstance(m, nn.Linear):
                nn.init.orthogonal_(m.weight, gain=np.sqrt(2))
                nn.init.constant_(m.bias, 0.0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class DQNAgent:
    """
    DQN agent with experience replay, target network, and double-Q updates.
    """
    def __init__(self, state_dim=128, action_dim=4):
        self.q_net = QNetwork(state_dim, action_dim)
        self.target_net = QNetwork(state_dim, action_dim)
        self.target_net.load_state_dict(self.q_net.state_dict())
        self.target_net.eval()

        self.replay_buffer = collections.deque(maxlen=200_000)
        self.gamma = 0.99
        self.epsilon = 1.0
        self.epsilon_min = 0.05
        self.lr = 1e-4
        self.batch_size = 128
        self.target_update_freq = 500
        self.train_step_counter = 0

        self.optimizer = optim.Adam(self.q_net.parameters(), lr=self.lr)
        self.loss_fn = nn.SmoothL1Loss()
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.q_net.to(self.device)
        self.target_net.to(self.device)

    def select_action(self, state: np.ndarray, training: bool = True) -> int:
        if training and random.random() < self.epsilon:
            return random.randint(0, 3)
        with torch.no_grad():
            state_t = torch.FloatTensor(state).unsqueeze(0).to(self.device)
            q_vals = self.q_net(state_t)
            return int(q_vals.argmax(dim=1).item())

    def store_transition(self, s, a, r, s2, done):
        self.replay_buffer.append((s, a, r, s2, done))

    def train_step(self) -> float:
        if len(self.replay_buffer) < self.batch_size:
            return 0.0

        batch = random.sample(self.replay_buffer, self.batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)

        states_t = torch.FloatTensor(np.array(states)).to(self.device)
        actions_t = torch.LongTensor(actions).to(self.device)
        rewards_t = torch.FloatTensor(rewards).to(self.device)
        next_states_t = torch.FloatTensor(np.array(next_states)).to(self.device)
        dones_t = torch.FloatTensor(dones).to(self.device)

        # Double DQN: use online network to select action, target network to evaluate
        with torch.no_grad():
            next_actions = self.q_net(next_states_t).argmax(dim=1)
            next_q = self.target_net(next_states_t).gather(1, next_actions.unsqueeze(1)).squeeze(1)
            target = rewards_t + self.gamma * next_q * (1 - dones_t)

        current_q = self.q_net(states_t).gather(1, actions_t.unsqueeze(1)).squeeze(1)
        loss = self.loss_fn(current_q, target.detach())

        self.optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(self.q_net.parameters(), max_norm=10.0)
        self.optimizer.step()

        self.train_step_counter += 1
        if self.train_step_counter % self.target_update_freq == 0:
            self.target_net.load_state_dict(self.q_net.state_dict())

        return loss.item()

    def decay_epsilon(self, episode: int, total: int):
        self.epsilon = max(self.epsilon_min, 1.0 - episode / total)

    def save(self, path: str):
        torch.save({
            "q_net": self.q_net.state_dict(),
            "optimizer": self.optimizer.state_dict(),
            "epsilon": self.epsilon,
            "step": self.train_step_counter
        }, path)

    def load(self, path: str):
        ckpt = torch.load(path, map_location=self.device)
        self.q_net.load_state_dict(ckpt["q_net"])
        self.target_net.load_state_dict(ckpt["q_net"])
        self.optimizer.load_state_dict(ckpt["optimizer"])
        self.epsilon = ckpt["epsilon"]
        self.train_step_counter = ckpt["step"]
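
The linear schedule in `decay_epsilon` reaches its floor before training ends. A small standalone sketch that mirrors the same formula shows where this happens; `linear_epsilon` is a hypothetical helper name, not part of the agent class.

```python
def linear_epsilon(episode: int, total: int, eps_min: float = 0.05) -> float:
    """Linear decay from 1.0 towards 0, clipped at eps_min (same formula as decay_epsilon)."""
    return max(eps_min, 1.0 - episode / total)


total = 100_000
for ep in (0, 25_000, 50_000, 95_000, 99_999):
    print(f"episode {ep:>6d}: epsilon = {linear_epsilon(ep, total):.3f}")
```

With a floor of 0.05, the schedule hits its minimum at 95% of the episode budget, so only the final 5% of training runs at the lowest exploration rate.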

Self-Play Training Loop

The training loop implements self-play: the agent competes against itself and keeps improving as it meets stronger versions of its own strategy. A single shared agent instance plays all four positions, which creates a natural curriculum because its opponents improve at the same rate it does. We also keep a snapshot of the agent, refreshed at every checkpoint, as a fixed opponent that provides a consistent baseline for measuring progress.

Episode data (states, actions, rewards, next states) is collected into the replay buffer. After each episode we sample four mini-batches and perform one gradient update on each. We save checkpoints every 10,000 episodes and record training metrics (loss, epsilon, win rate against the snapshot opponent) in a dictionary that can be forwarded to TensorBoard or Weights & Biases.

import collections
import time

def run_self_play_episode(env, agent, encoder, opponent_agent=None):
    """
    Runs one full Ludo game with self-play.
    Each player uses the same agent (self-play).
    Returns the list of (state, action, reward, next_state, done) transitions.
    """
    state = env.reset()
    done = False
    transitions = []
    episode_reward = {p: 0.0 for p in range(4)}
    game_length = 0

    while not done:
        current_player = env.player
        encoded = encoder.encode(state, current_player)

        if opponent_agent is not None and current_player >= 2:
            action = opponent_agent.select_action(encoded, training=False)
        else:
            # Explore only during pure self-play; evaluation games are greedy
            action = agent.select_action(encoded, training=opponent_agent is None)

        next_state, reward, done, info = env.step(action)
        next_encoded = encoder.encode(next_state, env.player)

        transitions.append((encoded, action, reward, next_encoded, float(done)))
        episode_reward[current_player] += reward
        state = next_state
        game_length += 1

    winner = info.get("winner", None)
    return transitions, winner, episode_reward, game_length

def train_dqn(total_episodes=200_000, save_every=10_000, eval_every=1_000):
    env = LudoEnv()
    encoder = LudoStateEncoder()
    agent = DQNAgent(state_dim=128, action_dim=4)

    # Snapshot opponent for consistent evaluation
    snapshot_agent = DQNAgent(state_dim=128, action_dim=4)
    snapshot_agent.q_net.load_state_dict(agent.q_net.state_dict())
    snapshot_agent.target_net.load_state_dict(agent.target_net.state_dict())

    metrics = {"loss": [], "epsilon": [], "win_rate": [], "avg_game_length": []}
    wins = [0] * 4
    total_games = 0
    recent_lengths = collections.deque(maxlen=500)

    start_time = time.time()
    for episode in range(total_episodes):
        agent.decay_epsilon(episode, total_episodes)

        # Self-play episode — agents play all 4 positions
        transitions, winner, ep_rewards, game_len = run_self_play_episode(
            env, agent, encoder
        )

        if winner is not None:
            wins[winner] += 1
        total_games += 1
        recent_lengths.append(game_len)

        # Store all transitions in replay buffer
        for s, a, r, s2, done in transitions:
            agent.store_transition(s, a, r, s2, done)

        # Run 4 gradient updates after each episode
        losses = []
        for _ in range(4):
            l = agent.train_step()
            if l > 0:
                losses.append(l)

        # Evaluation every eval_every episodes
        if episode % eval_every == 0 and episode > 0:
            eval_wins = [0] * 4
            for _ in range(500):
                _, w, _, _ = run_self_play_episode(env, agent, encoder,
                                                    opponent_agent=snapshot_agent)
                if w is not None:
                    eval_wins[w] += 1
            # With an opponent set, the agent controls seats 0 and 1
            win_rate = (eval_wins[0] + eval_wins[1]) / 500
            metrics["win_rate"].append(win_rate)
            metrics["loss"].append(sum(losses) / len(losses) if losses else 0)
            metrics["epsilon"].append(agent.epsilon)
            metrics["avg_game_length"].append(sum(recent_lengths) / len(recent_lengths))
            elapsed = time.time() - start_time
            print(f"Ep {episode:>7d} | "
                  f"Loss: {metrics['loss'][-1]:.4f} | "
                  f"Eps: {agent.epsilon:.3f} | "
                  f"Win rate: {win_rate:.1%} | "
                  f"Games: {total_games} | "
                  f"Time: {elapsed:.0f}s")

        # Checkpoint every save_every episodes
        if episode % save_every == 0 and episode > 0:
            agent.save(f"ludo_dqn_ep{episode}.pt")
            snapshot_agent.q_net.load_state_dict(agent.q_net.state_dict())
            snapshot_agent.target_net.load_state_dict(agent.target_net.state_dict())

    agent.save("ludo_dqn_final.pt")
    print(f"Training complete in {time.time() - start_time:.0f}s")
    return agent, metrics

if __name__ == "__main__":
    agent, metrics = train_dqn(total_episodes=100_000)

Reward Shaping — Dense vs. Sparse Rewards

Reward shaping is one of the most impactful (and underappreciated) aspects of RL for board games. A sparse reward scheme gives +100 for winning and -100 for losing, with 0 everywhere else. Sparse rewards define the true objective exactly, but they make learning extremely slow because the agent receives an informative signal only at the very end of a 30–60 move game.

Dense reward shaping addresses this by providing intermediate learning signals. Our reward function combines four components: progress rewards (small positive values for advancing tokens along the track), entry rewards (bonus for bringing a token onto the board with a 6), capture rewards (significant bonus for sending an opponent back to base), and home rewards (large bonus for getting a token home). We carefully balance these so the shaped rewards are consistent with the true game objective — if an agent could maximise shaped rewards while consistently losing games, it would learn the wrong thing.

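The key principle comes from potential-based reward shaping: if the shaping bonus has the form F(s, s') = γΦ(s') − Φ(s) for some potential function Φ over states, the optimal policy is unchanged. Note that such a bonus depends on both the current state and the next state. The entry and capture bonuses in _compute_reward are fixed event rewards rather than differences of a potential, so this guarantee does not automatically cover them. Treat them as a practical heuristic and verify that the agent's win rate, not just its shaped return, keeps improving, so that it does not learn strategies that "game" the reward function.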
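
For comparison, a strictly potential-based shaping term can be sketched as follows. The potential `phi` (mean token progress for one player) and the helper names are illustrative assumptions, not part of the environment above.

```python
GAMMA = 0.99
FINISH = 57  # progress value of a finished token, as in LudoEnv


def phi(tokens):
    """Potential of a state: mean normalised progress of one player's tokens.

    Tokens still in base (-1) count as zero progress.
    """
    return sum(max(t, 0) for t in tokens) / (FINISH * len(tokens))


def shaping_bonus(tokens_before, tokens_after, gamma=GAMMA):
    """Potential-based shaping term F(s, s') = gamma * phi(s') - phi(s)."""
    return gamma * phi(tokens_after) - phi(tokens_before)


def shaped_reward(env_reward, tokens_before, tokens_after, gamma=GAMMA):
    """Sparse environment reward plus the potential-based bonus."""
    return env_reward + shaping_bonus(tokens_before, tokens_after, gamma)


# Advancing one token from progress 10 to 14 earns a small positive bonus
print(shaping_bonus([10, -1, -1, -1], [14, -1, -1, -1]))
# Losing that token to a capture (back to base) earns a negative one
print(shaping_bonus([14, -1, -1, -1], [-1, -1, -1, -1]))
```

Because the bonus is a difference of potentials, a capture that sends a token back to base is penalised automatically, and with gamma set to 1 the bonuses along any trajectory sum to the change in potential between its first and last state.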

Evaluation Harness — Testing Trained Models

A trained model is only as good as its evaluation protocol. We implement a comprehensive evaluation harness that tests the agent across multiple dimensions: win rate against baseline opponents, consistency across game variants, and sensitivity to specific game situations.

The evaluation uses greedy action selection (epsilon=0) to measure the agent's true strategic quality without exploration noise. We consider three baseline opponents: random play (lower bound), a simple heuristic bot that prioritises home entries and safe squares (mid bound), and the snapshot opponent from training (a measure of self-improvement over time). The harness below implements the first two; the snapshot comparison runs inside the training loop.

import random
import numpy as np

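class HeuristicBot:
    """
    Baseline heuristic bot for evaluation comparison.
    Priority: home entry > landing on a safe square > advancing the furthest token.
    Capture detection is not implemented in this simple baseline.
    """
    SAFE_SQUARES = (1, 9, 14, 22, 27, 35, 40, 48)

    def __init__(self, player: int):
        self.player = player

    def select_action(self, obs: np.ndarray, dice: int) -> int:
        valid_actions = []
        for a in range(4):
            idx = self.player * 4 + a
            if obs[32 + idx] == 1.0:
                # Token is in base: it can only enter on a 6
                if dice == 6:
                    valid_actions.append((a, 3))
                continue
            if obs[16 + idx] >= 1.0:
                continue  # token has already finished
            # LudoEnv stores positions as pos / 56, so undo the normalisation
            pos = int(round(float(obs[idx]) * 56))
            new_pos = pos + dice
            if new_pos > 57:
                continue  # the roll overshoots home
            if new_pos == 57:
                score = 1000
            # Assumes player p enters on absolute square 1 + 13 * p
            elif new_pos < 52 and (1 + 13 * self.player + new_pos) % 52 in self.SAFE_SQUARES:
                score = 50
            else:
                score = new_pos
            valid_actions.append((a, score))
        if not valid_actions:
            return 0
        valid_actions.sort(key=lambda x: x[1], reverse=True)
        return valid_actions[0][0]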

class RandomBot:
    def select_action(self, obs: np.ndarray, dice: int) -> int:
        return random.randint(0, 3)

def evaluate_agent(agent_path: str, num_games: int = 5000):
    env = LudoEnv()
    encoder = LudoStateEncoder()
    agent = DQNAgent(state_dim=128)
    agent.load(agent_path)
    agent.epsilon = 0.0

    baseline_bots = {
        "random": [RandomBot() for _ in range(4)],
        "heuristic": [HeuristicBot(i) for i in range(4)]
    }

    results = {}
    for baseline_name, bots in baseline_bots.items():
        wins = [0] * 4
        game_lengths = []

        for _ in range(num_games):
            state = env.reset()
            done = False
            length = 0
            while not done:
                current = env.player
                encoded = encoder.encode(state, current)
                if current == 0:
                    action = agent.select_action(encoded, training=False)
                else:
                    action = bots[current].select_action(
                        state, env.dice
                    )
                state, _, done, info = env.step(action)
                length += 1
            if info["winner"] is not None:
                wins[info["winner"]] += 1
            game_lengths.append(length)

        results[baseline_name] = {
            "agent_wins": wins[0],
            "total_games": num_games,
            "win_rate": wins[0] / num_games,
            "avg_game_length": sum(game_lengths) / len(game_lengths),
            "wins_per_position": wins
        }

    print("=== Evaluation Results ===")
    for baseline, res in results.items():
        print(f"{baseline}: Win rate = {res['win_rate']:.1%} | "
              f"Avg length = {res['avg_game_length']:.1f}")
    return results

if __name__ == "__main__":
    results = evaluate_agent("ludo_dqn_final.pt")

Exporting and Deploying the Trained Model

Once training is complete, export the model via TorchScript for deployment. TorchScript serialises the model's computation graph, making it portable across Python environments and loadable from C++ or mobile runtimes without the original class definitions. A TorchScript model can be loaded directly into the Ludo API Python client to power a live bot that plays against human opponents or other API-connected agents.

import numpy as np
import torch

def export_torchscript(agent_path: str, output_path: str):
    agent = DQNAgent(state_dim=128)
    ckpt = torch.load(agent_path, map_location="cpu")
    agent.q_net.load_state_dict(ckpt["q_net"])
    agent.q_net.eval()

    # Trace the model with a dummy batch of one state for TorchScript serialisation
    dummy_input = torch.zeros(1, 128)
    traced = torch.jit.trace(agent.q_net, dummy_input)
    traced.save(output_path)
    print(f"Exported TorchScript model to {output_path}")

# Use with LudoKingAPI Python client
def load_and_play(model_path: str, api_key: str):
    model = torch.jit.load(model_path)
    # model.forward is now a TorchScript method callable from any runtime
    def get_action(state: np.ndarray) -> int:
        with torch.no_grad():
            q_vals = model(torch.FloatTensor(state))
            return int(q_vals.argmax().item())
    return get_action

export_torchscript("ludo_dqn_final.pt", "ludo_dqn.pt")

Frequently Asked Questions

How many training episodes does a Ludo DQN agent need?

A DQN typically requires 50,000–100,000 self-play episodes to reach parity with a simple heuristic bot (which prioritises home entries and safe squares), and 200,000–500,000 episodes to consistently outperform it. With GPU acceleration on a modern RTX 3080 or better, 100,000 episodes complete in 3–6 hours. The agent exhibits rapid initial improvement (the first 10,000 episodes show dramatic win-rate gains) followed by slower asymptotic improvement as it refines its strategy. Early stopping at 200,000 episodes is a reasonable default. Beyond that point, additional training yields diminishing returns unless you are chasing further gains against the snapshot opponent.

What is the difference between sparse and dense rewards?

Sparse rewards give the agent +100 for a win and -100 for a loss, with 0 at every other step. This is theoretically correct but practically devastating for Ludo: a typical game lasts 30–60 moves, meaning the agent receives a non-zero learning signal only at the very end. Dense reward shaping provides intermediate signals at every step: small positive rewards for advancing tokens, larger rewards for entering the board on a 6, significant rewards for capturing opponents, and large rewards for reaching home. Our implementation combines all four: progress rewards (0.3 max), entry bonuses (0.5), capture bonuses (5.0), and home bonuses (100.0). The key constraint is that shaping must stay aligned with the true objective. Potential-based shaping guarantees this, while fixed event bonuses like ours have to be checked empirically. If an agent could accumulate shaped rewards while losing every game, the shaping is broken.

Why use Double DQN instead of standard DQN?

Standard DQN uses the same network to both select and evaluate the next action, which creates an overestimation bias where Q-values become systematically inflated, leading to suboptimal policies. Double DQN (implemented in our train_step method) decouples selection and evaluation: the online network selects the best action for the next state, and the target network evaluates its Q-value. This reduces overestimation in empirical studies, producing more accurate value estimates and often faster convergence. The difference is especially relevant in stochastic environments and games with multiple near-equivalent actions, which describes Ludo well: choosing between two tokens that would both advance your position meaningfully is a common scenario where overestimation hurts most.
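
The difference between the two targets is easiest to see on a single transition. The sketch below uses plain Python lists and made-up Q-values purely for illustration; the function names are hypothetical and it is not the batched PyTorch code from the agent.

```python
def standard_dqn_target(reward, gamma, target_q_next, done):
    """Standard DQN: the target network both picks and scores the next action."""
    if done:
        return reward
    return reward + gamma * max(target_q_next)


def double_dqn_target(reward, gamma, online_q_next, target_q_next, done):
    """Double DQN: the online network picks the action, the target network scores it."""
    if done:
        return reward
    best_action = max(range(len(online_q_next)), key=lambda a: online_q_next[a])
    return reward + gamma * target_q_next[best_action]


# Made-up Q-values for the four token choices in some next state
online_q_next = [1.0, 2.5, 2.4, 0.3]  # online net prefers action 1
target_q_next = [1.1, 2.0, 3.0, 0.2]  # target net happens to overrate action 2

print(standard_dqn_target(0.5, 0.99, target_q_next, done=False))
print(double_dqn_target(0.5, 0.99, online_q_next, target_q_next, done=False))
```

Here the standard target follows the target network's inflated estimate for action 2, while the Double DQN target scores the action the online network actually prefers and comes out lower.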

Can I train against other bots instead of pure self-play?

Yes, and it is often beneficial. Our training loop supports a configurable opponent agent, so you can inject a minimax bot or heuristic player as the opponent instead of (or alongside) self-play. Approaches in the spirit of Fictitious Self-Play (FSP) train against a mixture of current and historical policies rather than only the latest one, producing more robust agents that do not overfit to a single opponent strategy. The Python bot tutorial covers implementing minimax and Monte Carlo Tree Search for Ludo, which make excellent opponent training partners. Mixing opponent types (for example self-play 50%, heuristic 30%, minimax 20%) is a reasonable starting point for a generally capable agent.
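
One simple way to realise such a mix is to draw the opponent type at the start of each training episode. The sketch below is an assumption about how this could be wired up: the `OPPONENT_MIX` table and `sample_opponent_type` helper are hypothetical names, and the weights are the example percentages from the paragraph above.

```python
import random

# Example mix from the text above; treat the weights as tunable
OPPONENT_MIX = {"self_play": 0.5, "heuristic": 0.3, "minimax": 0.2}


def sample_opponent_type(rng: random.Random, mix=OPPONENT_MIX) -> str:
    """Pick the opponent type for the next training episode according to the mix weights."""
    names = list(mix)
    weights = [mix[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded so the run is reproducible
counts = {name: 0 for name in OPPONENT_MIX}
for _ in range(10_000):
    counts[sample_opponent_type(rng)] += 1
print(counts)
```

The returned label would then decide which object is passed as `opponent_agent` for that episode.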

How do I know whether my agent is well trained?

Use our evaluation harness to measure win rate against the baselines over at least 1,000 games per baseline. A well-trained agent should beat random play in over 90% of games, beat the heuristic bot in 55–70% of games, and show consistent improvement over its own snapshot from 10,000 episodes earlier. Beyond win rate, examine game length: a strong agent typically finishes games faster because it prioritises home entries and avoids wasting moves. You should also test under distributional shift: does the agent still perform well when playing from a disadvantaged position (multiple tokens sent back to base)? An agent that collapses under adversity is overfitting to favorable game states. The ultimate test is deploying it against human opponents via the LudoKingAPI and collecting human-vs-AI game data for qualitative analysis.
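
When comparing win rates it helps to know how much of a difference is just sampling noise. A minimal sketch using the Wilson score interval for a binomial proportion follows; `win_rate_interval` is a hypothetical helper, not part of the harness above.

```python
import math


def win_rate_interval(wins: int, games: int, z: float = 1.96):
    """Wilson score interval for a win rate (z=1.96 gives roughly 95% confidence)."""
    if games <= 0:
        raise ValueError("games must be positive")
    p = wins / games
    denom = 1 + z * z / games
    centre = (p + z * z / (2 * games)) / denom
    half = z * math.sqrt(p * (1 - p) / games + z * z / (4 * games * games)) / denom
    return centre - half, centre + half


low, high = win_rate_interval(wins=600, games=1000)
print(f"600 wins in 1,000 games: 95% interval [{low:.3f}, {high:.3f}]")
```

At 1,000 games the interval is roughly three percentage points either side of the measured rate, so smaller differences between checkpoints should not be read as real improvement without more games.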

What hardware do I need for training?

The good news is that Ludo's small state space makes it trainable on modest hardware. A CPU-only setup with 8GB RAM trains 100,000 episodes in 12–18 hours, which is acceptable for experimentation but slow for rapid iteration. A modern GPU (RTX 3060 or better) reduces this to 3–6 hours. The memory requirement is dominated by the replay buffer (200,000 transitions, each storing a state and a next state of 128 float32 values, is roughly 205MB of raw array data) plus model parameters, comfortably under 1GB total. For production training runs of 500,000 episodes, GPU acceleration is strongly recommended. Cloud options like Google Colab (free T4 GPU) or Lambda Labs spot instances provide cost-effective GPU access without hardware investment.
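
The replay buffer estimate can be reproduced with a few lines of arithmetic. This is a rough sketch that counts only the raw state arrays; `replay_buffer_megabytes` is a hypothetical helper name.

```python
def replay_buffer_megabytes(capacity: int, state_dim: int, bytes_per_float: int = 4) -> float:
    """Approximate size of the stored state arrays in a replay buffer, in MB (10**6 bytes).

    Each transition holds two state vectors (state and next_state). The action,
    reward and done flag are negligible by comparison and are ignored here.
    """
    bytes_per_transition = 2 * state_dim * bytes_per_float
    return capacity * bytes_per_transition / 1e6


print(replay_buffer_megabytes(200_000, 128))
```

Python object overhead for the deque, tuples, and array headers comes on top of this figure, so the real footprint is somewhat higher.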

Can this pipeline be adapted to Ludo variants?

Yes, the pipeline is largely variant-agnostic. Speed Ludo (faster dice, different board) requires only modifying the environment's board size, home column length, and safe square definitions in the LudoEnv class; the DQN agent and training loop remain unchanged. Quick Ludo (reduced token count per player, shorter track) similarly requires only environment parameter changes. The state encoder's home-size and track-size constants must be updated to match the variant. One consideration: RL policies learned for standard Ludo may not transfer directly to variants because the action-value landscape changes, so retraining from scratch with the modified environment is recommended. The game development tutorial covers implementing custom Ludo variants that could serve as the environment layer for this RL pipeline.

Deploy Your Trained Ludo AI with LudoKingAPI

Connect your PyTorch model to the LudoKingAPI infrastructure for live gameplay evaluation, tournament matchmaking, and human-vs-AI competitions.
