Ludo Game Scaling — Horizontal Auto-Scaling, Redis & Prometheus

📅 Updated March 21, 2026 ⏱️ 22 min read 🛠️ Docker • Redis • Nginx • Prometheus • AWS ASG


Overview — Scaling Ludo Games to Thousands of Concurrent Players

A single-server Ludo deployment handles dozens of concurrent matches, but a production platform with thousands of daily active players demands a distributed architecture: game state decoupled from any individual server instance, real-time events crossing instance boundaries through a message broker, database load dramatically reduced by caching, and infrastructure that provisions or deprovisions capacity in response to live traffic. This guide covers every layer of that architecture: container orchestration with Docker Compose, cross-instance communication with the Redis Socket.IO adapter, request routing with Nginx sticky sessions, database resilience with read replicas, asset delivery via CDN, capacity management with AWS Auto Scaling Groups, and operational visibility with Prometheus and Grafana.

The architecture assumes you are running a Node.js Socket.IO server (or the Ludo Realtime API) behind a load balancer, persisting game state and player data to PostgreSQL, caching hot data in Redis, and serving static assets (board graphics, token sprites, audio files) from a CDN. The database schema guide covers the PostgreSQL schema for games, players, and leaderboards. The latency optimization guide covers CDN edge routing and regional deployment strategies for reducing round-trip times globally.

Before scaling, profile your baseline. Instrument your game server with request duration histograms, WebSocket connection counts, and message throughput meters. Without baseline metrics, you cannot configure meaningful scaling thresholds. Prometheus (covered in section 8) provides the instrumentation and collection layer; Grafana provides the visualization and alerting. Start there before adding any infrastructure components.
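For instance, a tail-latency threshold can be derived from raw duration samples before any Prometheus tooling is in place; a minimal sketch, where the `percentile` helper and the sample values are illustrative rather than from a real deployment:

```javascript
// Hypothetical helper: derive percentile thresholds from raw duration samples.
function percentile(samples, p) {
  if (samples.length === 0) return null;
  const sorted = [...samples].sort((a, b) => a - b);
  // Nearest-rank method: index of the p-th percentile in the sorted array
  const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[idx];
}

// Example: request durations in milliseconds collected over a profiling window
const durations = [12, 18, 25, 14, 90, 33, 21, 17, 45, 250];
console.log(percentile(durations, 50)); // → 21 (median)
console.log(percentile(durations, 95)); // → 250 (tail latency, the basis for alert thresholds)
```

Prometheus histograms give you the same numbers continuously; this just shows where a scaling threshold like "p95 under 100ms" actually comes from.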

Step 1 — Docker Compose Multi-Service Setup

Docker Compose defines and runs a multi-container application as a single unit. For a Ludo game backend, the Compose file orchestrates the game server, Redis broker, PostgreSQL primary, optional read replicas, Nginx load balancer, and Prometheus monitoring stack. Local development uses the same configuration as staging and production with environment-specific overrides, ensuring that what you test locally behaves identically in production.

Each service runs in its own container with resource limits, health checks, and restart policies. The shared ludo-network bridge network enables inter-service DNS resolution so containers can reference each other by service name:

# docker-compose.yml — Ludo Game Platform Multi-Service Architecture
version: "3.9"

services:
  # --- Application Services ---

  ludo-server:
    build:
      context: ./server
      dockerfile: Dockerfile
    image: ludoking/server:${IMAGE_TAG:-latest}
    container_name: ludo-server
    restart: unless-stopped
    environment:
      NODE_ENV: production
      PORT: 3000
      REDIS_HOST: redis
      REDIS_PORT: 6379
      DB_HOST: postgres-primary
      DB_PORT: 5432
      DB_NAME: ludoking
      DB_USER: ${DB_USER}
      DB_PASSWORD: ${DB_PASSWORD}
      JWT_SECRET: ${JWT_SECRET}
    ports:
      - "3000:3000"
    depends_on:
      redis:
        condition: service_healthy
      postgres-primary:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 15s
    deploy:
      resources:
        limits:
          cpus: "2.0"
          memory: 2G
        reservations:
          cpus: "0.5"
          memory: 512M
    networks:
      - ludo-network

  # --- Infrastructure Services ---

  redis:
    image: redis:7-alpine
    container_name: ludo-redis
    restart: unless-stopped
    command: redis-server --maxmemory 512mb --maxmemory-policy allkeys-lru --appendonly yes
    ports:
      - "6379:6379"
    volumes:
      - redis-data:/data
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 3
    networks:
      - ludo-network
    sysctls:
      - net.core.somaxconn=65535

  postgres-primary:
    image: postgres:16-alpine
    container_name: ludo-postgres-primary
    restart: unless-stopped
    environment:
      POSTGRES_DB: ludoking
      POSTGRES_USER: ${DB_USER}
      POSTGRES_PASSWORD: ${DB_PASSWORD}
    ports:
      - "5432:5432"
    volumes:
      - postgres-primary-data:/var/lib/postgresql/data
      - ./db/init.sql:/docker-entrypoint-initdb.d/init.sql:ro
    command: >
      postgres
      -c max_connections=200
      -c shared_buffers=256MB
      -c effective_cache_size=512MB
      -c maintenance_work_mem=64MB
      -c checkpoint_completion_target=0.9
      -c wal_buffers=16MB
      -c default_statistics_target=100
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${DB_USER} -d ludoking"]
      interval: 10s
      timeout: 5s
      retries: 3
    networks:
      - ludo-network

  postgres-replica:
    # NOTE: as written this is a standalone server with hot_standby enabled.
    # A true streaming replica also needs a base backup from the primary
    # (pg_basebackup) plus primary_conninfo/standby.signal configuration.
    image: postgres:16-alpine
    container_name: ludo-postgres-replica
    restart: unless-stopped
    environment:
      POSTGRES_DB: ludoking
      POSTGRES_USER: ${DB_USER}
      POSTGRES_PASSWORD: ${DB_PASSWORD}
    ports:
      - "5433:5432"
    volumes:
      - postgres-replica-data:/var/lib/postgresql/data
    command: >
      postgres
      -c hot_standby=on
      -c max_connections=200
      -c shared_buffers=256MB
    depends_on:
      - postgres-primary
    networks:
      - ludo-network

  # --- Load Balancer ---

  nginx:
    image: nginx:1.25-alpine
    container_name: ludo-nginx
    restart: unless-stopped
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
      - ./nginx/ssl:/etc/nginx/ssl:ro
    depends_on:
      ludo-server:
        condition: service_healthy
    networks:
      - ludo-network

  # --- Monitoring Stack ---

  prometheus:
    image: prom/prometheus:v2.47.0
    container_name: ludo-prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=15d'
      - '--web.enable-lifecycle'
    networks:
      - ludo-network

  grafana:
    image: grafana/grafana:10.2.0
    container_name: ludo-grafana
    restart: unless-stopped
    ports:
      - "3001:3000"
    environment:
      GF_SECURITY_ADMIN_USER: ${GRAFANA_USER}
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD}
      GF_USERS_ALLOW_SIGN_UP: "false"
    volumes:
      - grafana-data:/var/lib/grafana
    depends_on:
      - prometheus
    networks:
      - ludo-network

# =============================================================================
# Networks
# =============================================================================
networks:
  ludo-network:
    driver: bridge

# =============================================================================
# Volumes
# =============================================================================
volumes:
  redis-data:
  postgres-primary-data:
  postgres-replica-data:
  prometheus-data:
  grafana-data:

The Compose file uses resource limits (deploy.resources.limits) to cap CPU and memory per container, preventing any single service from monopolizing host resources. Health checks ensure that dependent services start in the correct order — the game server will not start until Redis reports healthy, and Nginx will not start until the game server is healthy. The prometheus and grafana containers form the monitoring stack, scraping metrics from the game server and visualizing them in dashboards.
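The environment-specific overrides mentioned earlier are usually expressed as a second Compose file merged over the base one. A minimal sketch, with illustrative filenames and values:

```yaml
# docker-compose.override.yml — example local-development overrides (illustrative)
services:
  ludo-server:
    environment:
      NODE_ENV: development
    volumes:
      - ./server/src:/app/src   # mount source for live reload during development
  redis:
    command: redis-server --maxmemory 128mb --maxmemory-policy allkeys-lru
```

`docker compose up` merges `docker-compose.override.yml` over the base file automatically; for staging or production you would pass an explicit file list, e.g. `docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d`.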

Step 2 — Redis Socket.IO Adapter for Cross-Instance Communication

When your game server runs on multiple instances behind a load balancer, a player connected to Instance A needs to receive events from a player connected to Instance B. Without a message broker, instances are isolated — a move made on Instance A never reaches players on Instance B. The Socket.IO Redis adapter solves this by using Redis pub/sub to bridge events across all server instances, and Redis itself as a distributed session store for live game state.

Install the adapter on the server side with npm install @socket.io/redis-adapter redis, then configure it to connect to your Redis instance:

// server/src/config/redis.js
const { createAdapter } = require('@socket.io/redis-adapter');
const { createClient } = require('redis');

const redisHost = process.env.REDIS_HOST || 'redis';
const redisPort = process.env.REDIS_PORT || 6379;

let pubClient = null;
let subClient = null;

/**
 * Create the Redis adapter for Socket.IO.
 * Returns { adapter, pubClient, subClient } for use in server setup.
 */
async function setupRedisAdapter() {
  // pubClient publishes events to Redis channels
  pubClient = createClient({ url: `redis://${redisHost}:${redisPort}` });
  pubClient.on('error', (err) => console.error('[Redis Pub] Error:', err));

  // subClient subscribes to the same channels and relays events to Socket.IO.
  // The adapter needs two separate connections; duplicate() clones the config.
  subClient = pubClient.duplicate();
  subClient.on('error', (err) => console.error('[Redis Sub] Error:', err));

  await Promise.all([pubClient.connect(), subClient.connect()]);

  console.log(`[Redis] Connected to ${redisHost}:${redisPort}`);

  return { pubClient, subClient };
}

/**
 * Cache game state in Redis with TTL.
 * Key pattern: game:{gameId} — stores full board state as JSON string.
 */
async function cacheGameState(gameId, state, ttlSeconds = 3600) {
  const key = `game:${gameId}`;
  await pubClient.setEx(key, ttlSeconds, JSON.stringify(state));
}

/**
 * Retrieve cached game state from Redis.
 * Returns null if the game is not in cache (expired or never cached).
 */
async function getCachedGameState(gameId) {
  const raw = await pubClient.get(`game:${gameId}`);
  return raw ? JSON.parse(raw) : null;
}

/**
 * Invalidate game state cache on state change.
 * Called immediately after every move to prevent stale reads.
 */
async function invalidateGameCache(gameId) {
  await pubClient.del(`game:${gameId}`);
}

/**
 * Store player session affinity — tracks which server instance a player is on.
 * Key: player:{playerId}:server — value: server instance ID.
 * TTL: 300 seconds, refreshed on every heartbeat.
 */
async function setPlayerSession(playerId, serverId, ttlSeconds = 300) {
  await pubClient.setEx(`player:${playerId}:server`, ttlSeconds, serverId);
}

/**
 * Get the server instance a player is connected to.
 * Used by the load balancer to route WebSocket upgrades to the correct instance.
 */
async function getPlayerServer(playerId) {
  return pubClient.get(`player:${playerId}:server`);
}

/**
 * Leaderboard using Redis sorted sets.
 * Score is incremented after each completed game.
 * O(log N) for both updates and top-N queries — far faster than SQL ORDER BY.
 */
async function updateLeaderboard(playerId, scoreDelta, variant = 'classic') {
  const key = `leaderboard:${variant}`;
  await pubClient.zIncrBy(key, scoreDelta, playerId);
}

async function getTopLeaderboard(variant = 'classic', count = 10) {
  const key = `leaderboard:${variant}`;
  // ZRANGE with the REV flag returns players sorted by score descending, with scores
  return pubClient.zRangeWithScores(key, 0, count - 1, { REV: true });
}

async function getPlayerRank(playerId, variant = 'classic') {
  const key = `leaderboard:${variant}`;
  // ZREVRANK returns 0-based rank (0 = highest score)
  const rank = await pubClient.zRevRank(key, playerId);
  return rank !== null ? rank + 1 : null;
}

module.exports = {
  setupRedisAdapter,
  cacheGameState,
  getCachedGameState,
  invalidateGameCache,
  setPlayerSession,
  getPlayerServer,
  updateLeaderboard,
  getTopLeaderboard,
  getPlayerRank,
  pubClient: () => pubClient,
  subClient: () => subClient,
};

In your Socket.IO server initialization, apply the adapter to the IO instance. The adapter automatically handles pub/sub channel management, broadcasting events to all instances subscribed to a room:

// server/src/index.js
const http = require('http');
const express = require('express');
const { Server } = require('socket.io');
const { createAdapter } = require('@socket.io/redis-adapter');
const {
  setupRedisAdapter,
  setPlayerSession,
  invalidateGameCache,
} = require('./config/redis');

const app = express();

async function startServer() {
  const httpServer = http.createServer(app);
  const io = new Server(httpServer, {
    cors: { origin: '*', methods: ['GET', 'POST'] },
    pingTimeout: 20000,
    pingInterval: 25000,
  });

  // Attach Redis adapter for cross-instance pub/sub
  const { pubClient, subClient } = await setupRedisAdapter();
  io.adapter(createAdapter(pubClient, subClient));

  // --- Socket.IO event handlers ---
  io.on('connection', (socket) => {
    const { playerId, roomId } = socket.handshake.auth;

    // Track player session affinity in Redis
    setPlayerSession(playerId, process.env.INSTANCE_ID || 'default');

    // Join game room — the Redis adapter broadcasts this to all instances
    socket.join(`room:${roomId}`);

    socket.on('dice_roll', async (data) => {
      // Broadcast to all players in the room (including sender for acknowledgment)
      io.to(`room:${roomId}`).emit('opponent_roll', {
        playerId,
        diceValue: data.value,
        timestamp: Date.now(),
      });
    });

    socket.on('token_move', async (data) => {
      // Persist move, invalidate cache, broadcast to room
      await invalidateGameCache(roomId);
      io.to(`room:${roomId}`).emit('opponent_move', {
        playerId,
        tokenId: data.tokenId,
        targetCell: data.targetCell,
        timestamp: Date.now(),
      });
    });

    socket.on('disconnect', () => {
      io.to(`room:${roomId}`).emit('player_disconnected', { playerId });
    });
  });

  httpServer.listen(3000, () => {
    console.log('[Server] Ludo game server running on port 3000');
  });
}

startServer().catch(console.error);

The Redis adapter handles all the complexity of cross-instance pub/sub. When a player on Instance A sends a token_move event, the adapter publishes it to a per-room Redis channel (socket.io#/#room:{roomId}# by default). All other instances subscribed to that channel receive the event and emit it to their local Socket.IO clients. You do not need to manage any pub/sub logic manually; the adapter abstracts it entirely.

Step 3 — Nginx Load Balancer with Sticky Sessions

WebSocket connections are persistent — once established, they remain open for the duration of the game session. Nginx routes incoming WebSocket upgrades to backend instances using the ip_hash or cookie-based sticky sessions strategy. Without sticky sessions, subsequent HTTP requests from the same player might be routed to different instances, breaking session affinity and causing authentication failures or stale game state reads.

The ip_hash directive hashes the client IP address and routes requests from the same IP to the same backend consistently. For mobile clients behind carrier-grade NAT (where many players share one IP), cookie-based stickiness is more reliable. Note that the sticky directive used below ships with NGINX Plus; on open-source Nginx, fall back to ip_hash or a third-party sticky-session module:

# nginx/nginx.conf — Ludo Game Load Balancer Configuration
worker_processes auto;
worker_rlimit_nofile 65535;

events {
    worker_connections 4096;
    multi_accept on;
    use epoll;
}

http {
    include       /etc/nginx/mime.types;
    default_type  application/octet-stream;

    # --- Logging ---
    log_format main '$remote_addr - $remote_user [$time_local] '
                    '"$request" $status $body_bytes_sent '
                    '"$http_referer" "$http_user_agent" '
                    'rt=$request_time uct=$upstream_connect_time '
                    'uht=$upstream_header_time urt=$upstream_response_time';

    access_log /var/log/nginx/access.log main;
    error_log  /var/log/nginx/error.log warn;

    # --- Performance ---
    sendfile           on;
    tcp_nopush         on;
    tcp_nodelay        on;
    keepalive_timeout  65;
    keepalive_requests 1000;
    gzip on;
    gzip_types text/plain application/json application/javascript text/css;
    gzip_min_length 256;

    # --- Rate Limiting ---
    limit_req_zone $binary_remote_addr zone=api_limit:10m rate=100r/s;
    limit_conn_zone $binary_remote_addr zone=addr_limit:10m;

    # --- Upstream: Game Server Pool ---
    upstream ludo_backend {
        # Sticky sessions via cookie: route each client's WebSocket upgrades to
        # the same instance. NOTE: `sticky` is an NGINX Plus directive; on
        # open-source Nginx use `ip_hash;` or a third-party sticky module.
        sticky cookie srv_id expires=1h domain=.ludokingapi.site path=/;

        # Multiple server instances (add more IPs/hosts as you scale)
        server ludo-server-1:3000 weight=5 max_fails=3 fail_timeout=30s;
        server ludo-server-2:3000 weight=5 max_fails=3 fail_timeout=30s;
        server ludo-server-3:3000 weight=5 max_fails=3 fail_timeout=30s;

        keepalive 64;
    }

    # --- Server Block ---
    server {
        listen 80;
        listen [::]:80;
        server_name ludokingapi.site www.ludokingapi.site;

        # Redirect all HTTP to HTTPS
        return 301 https://$server_name$request_uri;
    }

    server {
        listen 443 ssl;
        listen [::]:443 ssl;
        http2 on;   # `listen ... http2` is deprecated as of nginx 1.25.1
        server_name ludokingapi.site www.ludokingapi.site;

        # SSL/TLS configuration
        ssl_certificate     /etc/nginx/ssl/fullchain.pem;
        ssl_certificate_key /etc/nginx/ssl/privkey.pem;
        ssl_protocols       TLSv1.2 TLSv1.3;
        ssl_ciphers         ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384;
        ssl_prefer_server_ciphers off;
        ssl_session_cache   shared:SSL:10m;
        ssl_session_timeout 1d;

        # Security headers
        add_header Strict-Transport-Security "max-age=63072000; includeSubDomains; preload" always;
        add_header X-Frame-Options        SAMEORIGIN always;
        add_header X-Content-Type-Options nosniff always;
        add_header X-XSS-Protection       "1; mode=block" always;

        # --- REST API Endpoints ---
        location /api/ {
            limit_req zone=api_limit burst=200 nodelay;
            limit_conn addr_limit 10;

            proxy_pass         http://ludo_backend;
            proxy_http_version 1.1;
            proxy_set_header   Upgrade $http_upgrade;
            proxy_set_header   Connection "upgrade";
            proxy_set_header   Host $host;
            proxy_set_header   X-Real-IP $remote_addr;
            proxy_set_header   X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header   X-Forwarded-Proto $scheme;
            proxy_read_timeout 86400;
        }

        # --- WebSocket Endpoint ---
        location /socket.io/ {
            proxy_pass         http://ludo_backend;
            proxy_http_version 1.1;
            proxy_set_header   Upgrade $http_upgrade;
            proxy_set_header   Connection "upgrade";
            proxy_set_header   Host $host;
            proxy_set_header   X-Real-IP $remote_addr;
            proxy_set_header   X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header   X-Forwarded-Proto $scheme;
            # Critical: WebSocket timeouts must be long
            proxy_read_timeout 86400;
            proxy_send_timeout 86400;
        }

        # --- Health Check Endpoint ---
        location /health {
            proxy_pass         http://ludo_backend;
            proxy_http_version 1.1;
            proxy_set_header   Host $host;
            access_log off;
        }

        # --- Static Assets (proxied to CDN origin; CDN handles caching) ---
        location /assets/ {
            proxy_pass         http://ludo_backend;
            proxy_http_version 1.1;
            proxy_set_header   Host $host;
            expires 30d;
            add_header Cache-Control "public, immutable";
        }
    }
}

The sticky cookie directive generates a srv_id cookie on the first request, encoding the selected backend instance. Subsequent requests from the same client include the cookie, and Nginx routes to the same instance. The max_fails=3 and fail_timeout=30s directives remove unhealthy instances from the pool temporarily, providing automatic failover. WebSocket support relies on the Upgrade and Connection header forwarding — without these, Nginx terminates the WebSocket upgrade request and the connection fails.
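The routing idea behind ip_hash can be illustrated with a toy hash-to-bucket mapping; this is not nginx's actual hash function, only the affinity principle:

```javascript
// Toy stand-in for ip_hash: deterministically map a client IP to one backend.
// Demonstrates why the same IP always lands on the same upstream as long as
// the pool membership is unchanged.
const backends = ['ludo-server-1:3000', 'ludo-server-2:3000', 'ludo-server-3:3000'];

function hashIp(ip) {
  let h = 0;
  for (const ch of ip) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return h;
}

function pickBackend(ip) {
  return backends[hashIp(ip) % backends.length];
}

// Same client IP → same backend on every request (session affinity)
console.log(pickBackend('203.0.113.7') === pickBackend('203.0.113.7')); // → true
```

The weakness is also visible here: removing a backend changes `backends.length`, remapping most clients at once, which is why cookie-based stickiness or consistent hashing (`hash ... consistent`) is preferred for pools that change frequently.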

Step 4 — Database Read Replicas for Read-Heavy Workloads

Ludo game backends perform two categories of database queries: writes (move logging, player state updates, match results) and reads (leaderboards, player profiles, match history). Read operations vastly outnumber writes in most game workloads — players check leaderboards and view match history far more frequently than they complete games. PostgreSQL read replicas offload these read queries to one or more follower instances, reducing load on the primary and improving query latency for geographically distributed players.

Configure your application to route reads to replicas and writes to the primary. Use connection poolers like PgBouncer to multiplex hundreds of application connections onto a small pool of real database connections, preventing connection exhaustion as you scale to many server instances:

// server/src/config/database.js
const { Pool } = require('pg');

// Primary pool — all writes go here
const primaryPool = new Pool({
  host: process.env.DB_HOST || 'postgres-primary',
  port: process.env.DB_PORT || 5432,
  database: 'ludoking',
  user: process.env.DB_USER,
  password: process.env.DB_PASSWORD,
  max: 20,
  idleTimeoutMillis: 30000,
  connectionTimeoutMillis: 2000,
});

// Read replica pool — all reads go here
const replicaPool = new Pool({
  host: process.env.DB_REPLICA_HOST || 'postgres-replica',
  port: process.env.DB_PORT || 5432,
  database: 'ludoking',
  user: process.env.DB_USER,
  password: process.env.DB_PASSWORD,
  max: 30,
  idleTimeoutMillis: 30000,
  connectionTimeoutMillis: 2000,
});

// Read-write router
async function query(sql, params, options = {}) {
  if (options.write || !isReadOnlyQuery(sql)) {
    return primaryPool.query(sql, params);
  }
  // Route reads to replica
  return replicaPool.query(sql, params);
}

// Heuristic: detect SELECT queries that should go to replica
function isReadOnlyQuery(sql) {
  const trimmed = sql.trim().toUpperCase();
  return trimmed.startsWith('SELECT') &&
         !trimmed.includes('FOR UPDATE') &&
         !trimmed.includes('FOR SHARE');
}

// --- Read-heavy queries (routed to replica) ---

async function getPlayerProfile(playerId) {
  return query(
    `SELECT id, username, avatar_url, games_played, games_won, created_at
     FROM players WHERE id = $1`,
    [playerId]
  );
}

async function getMatchHistory(playerId, limit = 20) {
  return query(
    `SELECT m.id, m.started_at, m.ended_at, m.winner_id, m.variant,
            p.username as winner_name
     FROM matches m
     JOIN players p ON p.id = m.winner_id
     WHERE m.player_ids @> $1::uuid[]
     ORDER BY m.ended_at DESC
     LIMIT $2`,
    [[playerId], limit] // pg serializes the JS array to a Postgres uuid[] literal
  );
}

async function getGlobalLeaderboard(variant = 'classic', limit = 100) {
  return query(
    `SELECT p.id, p.username, p.avatar_url,
            s.games_won, s.games_played, s.win_rate
     FROM player_stats s
     JOIN players p ON p.id = s.player_id
     WHERE s.variant = $1
     ORDER BY s.games_won DESC
     LIMIT $2`,
    [variant, limit]
  );
}

// --- Write queries (always routed to primary) ---

async function logMove(gameId, playerId, tokenId, fromCell, toCell) {
  return query(
    `INSERT INTO move_log (game_id, player_id, token_id, from_cell, to_cell, logged_at)
     VALUES ($1, $2, $3, $4, $5, NOW())`,
    [gameId, playerId, tokenId, fromCell, toCell],
    { write: true }
  );
}

async function recordGameResult(gameId, playerResults) {
  const client = await primaryPool.connect();
  try {
    await client.query('BEGIN');
    await client.query(
      `UPDATE matches SET ended_at = NOW(), status = 'completed' WHERE id = $1`,
      [gameId]
    );
    for (const { playerId, won, score } of playerResults) {
      await client.query(
        `INSERT INTO game_results (match_id, player_id, won, score)
         VALUES ($1, $2, $3, $4)
         ON CONFLICT (match_id, player_id) DO UPDATE SET won = $3, score = $4`,
        [gameId, playerId, won, score]
      );
    }
    await client.query('COMMIT');
  } catch (e) {
    await client.query('ROLLBACK');
    throw e;
  } finally {
    client.release();
  }
}

module.exports = {
  primaryPool,
  replicaPool,
  query,
  getPlayerProfile,
  getMatchHistory,
  getGlobalLeaderboard,
  logMove,
  recordGameResult,
};

Read replicas in PostgreSQL use streaming replication — the primary WAL (Write-Ahead Log) is streamed to replicas with a typical lag of 5–50ms under normal load. For leaderboards and match history, this lag is imperceptible. For real-time match-making queries that check player availability, route directly to the primary or use Redis as the authoritative source for live player state. The database schema guide covers the full schema design including indexing strategies for the move_log and game_results tables.
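A common application-side guard against replication lag is read-your-writes routing: pin a player's reads to the primary for a short window after each of their writes. A minimal sketch; the window length and function names are illustrative assumptions:

```javascript
// Pin reads to the primary for a short window after each write, so a player
// never reads a replica that hasn't yet replayed their own move.
const PIN_WINDOW_MS = 500; // illustrative; tune to your observed replication lag
const lastWriteAt = new Map(); // playerId -> timestamp of last write

function recordWrite(playerId, now = Date.now()) {
  lastWriteAt.set(playerId, now);
}

function poolFor(playerId, now = Date.now()) {
  const last = lastWriteAt.get(playerId);
  return last !== undefined && now - last < PIN_WINDOW_MS ? 'primary' : 'replica';
}

recordWrite('p1', 1000);
console.log(poolFor('p1', 1200)); // → 'primary' (within the pin window)
console.log(poolFor('p1', 2000)); // → 'replica' (lag window has passed)
console.log(poolFor('p2', 1200)); // → 'replica' (never wrote)
```

This composes with the `query()` router above: call `recordWrite` after each write and consult `poolFor` before choosing a pool for a read.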

Step 5 — CDN for Static Assets

Game clients load board graphics, token sprites, dice animation frames, sound effects, and localization files on startup and during gameplay. Serving these from your origin server wastes bandwidth and increases latency for geographically distant players. A CDN (Cloudflare, AWS CloudFront, Fastly) caches static assets at edge locations worldwide, reducing time-to-first-byte and offloading the majority of HTTP traffic from your game servers.

The CDN configuration below uses CloudFront but the same principles apply to any provider. Assets are versioned by a hash or build number, enabling aggressive caching with long TTLs. Cache invalidation is triggered only when an asset actually changes, using the version prefix in the URL path:

// cdn-config.js — CDN Asset Manifest & Pre-warming Script

const assetManifest = {
  version: '2.4.0',
  buildHash: 'a3f7c291',
  baseUrl: 'https://cdn.ludokingapi.site/assets/v2.4.0/',
  baseUrlFallback: 'https://ludokingapi.site/assets/v2.4.0/',
  assets: {
    // Board textures
    'board.classic':        'board/classic.svg',
    'board.variant-a':      'board/variant-a.svg',
    'board.background':     'board/board-bg.png',

    // Token sprites (sprite sheet for efficient loading)
    'tokens.sheet':         'tokens/token-sheet@2x.webp',
    'tokens.sheet.mobile':  'tokens/token-sheet.webp',

    // Dice images
    'dice.set':             'dice/dice-set.webp',
    'dice.audio':           'audio/dice-roll.mp3',

    // Sound effects
    'sfx.token_move':       'audio/token-move.mp3',
    'sfx.token_capture':    'audio/capture.mp3',
    'sfx.dice_roll':        'audio/dice-roll.mp3',
    'sfx.game_win':         'audio/game-win.mp3',

    // Animation data (Lottie JSON files)
    'anim.dice_tumble':     'animations/dice-tumble.json',
    'anim.token_bounce':    'animations/token-bounce.json',
  },
};

// Resolve asset URL — appends version hash for cache busting
function getAssetUrl(assetKey) {
  const path = assetManifest.assets[assetKey];
  if (!path) throw new Error(`Unknown asset key: ${assetKey}`);
  return `${assetManifest.baseUrl}${path}?hash=${assetManifest.buildHash}`;
}

// Check CDN health before relying on it for critical assets
async function preloadFromCDN(assetKeys) {
  const results = await Promise.allSettled(
    assetKeys.map((key) => {
      const url = getAssetUrl(key);
      return fetch(url, { method: 'HEAD' }).then((res) => ({
        key,
        ok: res.ok,
        url,
        contentLength: res.headers.get('content-length'),
      }));
    })
  );
  const failed = results.filter((r) => r.status === 'rejected' || !r.value.ok);
  if (failed.length > 0) {
    console.warn(`[CDN] ${failed.length} critical assets failed; falling back to origin`);
  }
  return results;
}

// Preload critical game assets during splash screen
async function preloadCriticalAssets() {
  const criticalKeys = [
    'board.classic',
    'tokens.sheet',
    'dice.set',
    'sfx.dice_roll',
  ];
  await preloadFromCDN(criticalKeys);
  console.log('[CDN] Critical assets loaded');
}

module.exports = { assetManifest, getAssetUrl, preloadCriticalAssets };

Configure your CDN origin behavior to cache assets for 30 days (max-age=2592000) and serve from cache on origin failures. Enable Brotli compression for text-based assets (JSON, JS, CSS) to reduce transfer sizes by 20–30% compared to gzip. For WebP and AVIF image formats, configure the CDN to negotiate content encoding based on the Accept header, serving AVIF to supporting browsers, WebP to others, and fallback PNG/JPEG to legacy clients.
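The Accept-header negotiation described above reduces to a simple preference check. A sketch of the selection logic, where the preference order is an assumption rather than any specific provider's API:

```javascript
// Pick the best image format a client supports, in server preference order.
// Mirrors the CDN behavior described above, independent of provider.
function negotiateImageFormat(acceptHeader) {
  const accept = (acceptHeader || '').toLowerCase();
  if (accept.includes('image/avif')) return 'avif';
  if (accept.includes('image/webp')) return 'webp';
  return 'png'; // legacy fallback
}

console.log(negotiateImageFormat('image/avif,image/webp,image/*')); // → 'avif'
console.log(negotiateImageFormat('image/webp,image/*'));            // → 'webp'
console.log(negotiateImageFormat('image/png,image/*'));             // → 'png'
```

When the CDN does this at the edge, make sure it adds `Vary: Accept` to cached responses so AVIF-capable and legacy clients never share a cache entry.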

Step 6 — Auto-Scaling with AWS Auto Scaling Groups

AWS Auto Scaling Groups (ASG) automatically adjust the number of game server instances based on real-time demand metrics. During peak hours (evening gaming sessions), the ASG spawns additional instances behind the load balancer. During quiet periods (early morning), it terminates surplus instances to reduce costs. This elastic capacity model is fundamental to cost-efficient production Ludo game infrastructure.
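Target tracking is essentially proportional control: the ASG scales capacity by the ratio of the observed metric to the target. A sketch of the arithmetic; AWS additionally applies cooldowns and instance warm-up, omitted here:

```javascript
// Approximate target-tracking math: scale capacity by metric/target ratio,
// then clamp to the ASG's min/max bounds.
function desiredCapacity(current, observedMetric, targetValue, min, max) {
  const raw = Math.ceil(current * (observedMetric / targetValue));
  return Math.min(max, Math.max(min, raw));
}

// 4 instances averaging 90% CPU against a 60% target → grow to 6
console.log(desiredCapacity(4, 90, 60, 2, 20)); // → 6
// 4 instances averaging 15% CPU against a 60% target → shrink toward the floor
console.log(desiredCapacity(4, 15, 60, 2, 20)); // → 2
```

This is why a sensible TargetValue matters more than aggressive step policies: the proportional response automatically adds more instances for larger breaches.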

The scaling policy below uses a target tracking configuration that maintains an average CPU utilization of 60% across the ASG, and a custom metric (WebSocket connection count per instance) that triggers scaling when player concurrency increases:

# aws-asg-configuration.yaml
# AWS Auto Scaling Group Configuration for Ludo Game Servers
# Apply via AWS Console, CLI, or Terraform/CloudFormation

AWSTemplateFormatVersion: "2010-09-09"
Description: "Ludo Game Server Auto Scaling Group"

Resources:
  # --- Launch Template ---
  LudoServerLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateName: ludoking-server-v2
      LaunchTemplateData:
        ImageId: ami-0abcdef1234567890        # Replace with your custom AMI ID
        InstanceType: c6i.large               # 2 vCPU, 4GB RAM — good for Node.js + Socket.IO
        KeyName: ludoking-prod-key            # Replace with your key pair
        SecurityGroupIds:
          - !Ref LudoServerSecurityGroup
        IamInstanceProfile:
          Name: LudoServerInstanceProfile
        UserData:
          Fn::Base64: |
            #!/bin/bash
            set -e
            export INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
            export AWS_REGION=$(curl -s http://169.254.169.254/latest/dynamic/instance-identity/document | grep region | awk -F'"' '{print $4}')
            # Resolve the account ID at boot; UserData is a plain shell script,
            # so CloudFormation does not substitute ${AWS_ACCOUNT_ID} here
            export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
            ECR_REGISTRY="${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com"

            # Pull and run Docker container from ECR
            aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $ECR_REGISTRY
            docker pull $ECR_REGISTRY/ludoking/server:latest
            docker run -d \
              --name ludo-server \
              --restart unless-stopped \
              -p 3000:3000 \
              -e INSTANCE_ID=$INSTANCE_ID \
              -e REDIS_HOST=redis.internal \
              -e DB_HOST=postgres.internal \
              -e AWS_REGION=$AWS_REGION \
              --sysctl net.core.somaxconn=65535 \
              $ECR_REGISTRY/ludoking/server:latest

        Monitoring:
          Enabled: true                       # detailed (1-minute) CloudWatch monitoring

  # --- Auto Scaling Group ---
  LudoServerASG:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      AutoScalingGroupName: ludoking-game-servers
      MinSize: 2                               # Always keep at least 2 instances
      MaxSize: 20                             # Cap at 20 instances
      DesiredCapacity: 4                      # Default: 4 instances
      HealthCheckType: ELB                    # Use ELB health checks (HTTP /health endpoint)
      HealthCheckGracePeriod: 60
      DefaultCooldown: 300                    # 5 minutes between scaling actions
      TerminationPolicies:
        - OldestInstance                       # Terminate oldest instances first
      VPCZoneIdentifier:
        - subnet-abc1234
        - subnet-def5678
        - subnet-ghi9012
      LaunchTemplate:
        LaunchTemplateId: !Ref LudoServerLaunchTemplate
        Version: !GetAtt LudoServerLaunchTemplate.LatestVersionNumber
      TargetGroupARNs:
        - !Ref LudoTargetGroup
      MetricsCollection:
        - Granularity: 1Minute

  # --- Scaling Policies ---
  ScaleUpPolicy:
    Type: AWS::AutoScaling::ScalingPolicy
    Properties:
      AutoScalingGroupName: !Ref LudoServerASG
      PolicyType: TargetTrackingScaling
      TargetTrackingConfiguration:
        PredefinedMetricSpecification:
          PredefinedMetricType: ASGAverageCPUUtilization
        TargetValue: 60.0                      # Add capacity when avg CPU exceeds 60%
        DisableScaleIn: true                   # This policy only scales out; scale-in is handled by ScaleDownPolicy

  WebSocketScaleUpPolicy:
    Type: AWS::AutoScaling::ScalingPolicy
    Properties:
      AutoScalingGroupName: !Ref LudoServerASG
      PolicyType: TargetTrackingScaling
      TargetTrackingConfiguration:
        CustomizedMetricSpecification:
          MetricName: LudoWebSocketConnections
          Namespace: LudoKingAPI
          Statistic: Average
          Unit: Count
          Dimensions:
            - Name: AutoScalingGroupName       # Instances must publish the metric under this dimension
              Value: !Ref LudoServerASG
        TargetValue: 800.0                    # Add capacity when avg connections per instance exceed 800
        DisableScaleIn: true

  # Triggered by a CloudWatch alarm (e.g. avg CPU < 40% for 5 minutes) whose
  # alarm action references this policy — step policies do not embed their own
  # evaluation periods or statistics.
  ScaleDownPolicy:
    Type: AWS::AutoScaling::ScalingPolicy
    Properties:
      AutoScalingGroupName: !Ref LudoServerASG
      PolicyType: StepScaling
      AdjustmentType: ChangeInCapacity
      MetricAggregationType: Average
      StepAdjustments:
        - MetricIntervalLowerBound: -10        # CPU within 10 points below the alarm threshold
          MetricIntervalUpperBound: 0
          ScalingAdjustment: -1                # Remove 1 instance
        - MetricIntervalUpperBound: -10        # CPU more than 10 points below the threshold
          ScalingAdjustment: -2                # Remove 2 instances

  # --- Scheduled Actions ---
  MorningScaleUp:
    Type: AWS::AutoScaling::ScheduledAction
    Properties:
      AutoScalingGroupName: !Ref LudoServerASG
      ScheduledActionName: morning-scale-up
      MinSize: 4
      MaxSize: 20
      DesiredCapacity: 6
      Recurrence: "0 8 * * *"                  # 8:00 AM UTC every day

  NightScaleDown:
    Type: AWS::AutoScaling::ScheduledAction
    Properties:
      AutoScalingGroupName: !Ref LudoServerASG
      ScheduledActionName: night-scale-down
      MinSize: 2
      MaxSize: 20
      DesiredCapacity: 2
      Recurrence: "0 2 * * *"                  # 2:00 AM UTC every day (low-traffic period)

  # --- Security Group ---
  LudoServerSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: "Ludo game server security group"
      SecurityGroupEgress:
        - CidrIp: 0.0.0.0/0
          IpProtocol: -1
      SecurityGroupIngress:
        # In production, replace 0.0.0.0/0 with the load balancer's security
        # group so instances only accept traffic from the ALB
        - CidrIp: 0.0.0.0/0
          IpProtocol: tcp
          FromPort: 3000
          ToPort: 3000

  # --- Target Group ---
  LudoTargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      Name: ludoking-tg
      Port: 3000
      Protocol: HTTP
      VpcId: vpc-abc1234
      HealthCheckIntervalSeconds: 10
      HealthCheckPath: /health
      HealthCheckPort: 3000
      HealthCheckProtocol: HTTP
      HealthyThresholdCount: 2
      UnhealthyThresholdCount: 3
      TargetType: instance


The dual scaling policies work together: CPU utilization handles general compute demand, and the custom WebSocket connection metric handles Socket.IO-specific load. The custom metric requires publishing metrics to CloudWatch from your Socket.IO server — either directly via the AWS SDK or through the CloudWatch agent's StatsD listener. Scheduled actions provide predictable capacity adjustments for known traffic patterns: morning scale-up before peak hours, night scale-down during low-traffic periods.
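
The publishing side can be sketched as follows. This is an illustrative sketch, not a drop-in: it assumes `@aws-sdk/client-cloudwatch` is installed, that the instance role allows `cloudwatch:PutMetricData`, and it publishes under the `AutoScalingGroupName` dimension so the target-tracking policy can average one time series across instances.

```javascript
// Pure helper: build the PutMetricData payload for one sample of the
// LudoWebSocketConnections metric tracked by WebSocketScaleUpPolicy.
function buildConnectionMetric(asgName, connectionCount) {
  return {
    Namespace: 'LudoKingAPI',
    MetricData: [{
      MetricName: 'LudoWebSocketConnections',
      Dimensions: [{ Name: 'AutoScalingGroupName', Value: asgName }],
      Unit: 'Count',
      Value: connectionCount,
      Timestamp: new Date(),
    }],
  };
}

// Wire-up sketch (commented out — requires AWS credentials at runtime):
// const { CloudWatchClient, PutMetricDataCommand } = require('@aws-sdk/client-cloudwatch');
// const cw = new CloudWatchClient({});
// setInterval(() => {
//   cw.send(new PutMetricDataCommand(
//     buildConnectionMetric('ludoking-game-servers', io.engine.clientsCount)
//   )).catch(console.error);
// }, 60000);
```

`io.engine.clientsCount` is Socket.IO's live count of connected clients on this instance, which is exactly the per-instance value the scaling policy averages.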

For Railway or other platform-as-a-service deployments, configure auto-scaling through the platform's dashboard or CLI. Railway supports auto-scaling policies via railway.json or the dashboard, with CPU and memory-based scaling triggers similar to AWS ASG target tracking. See the Docker Deployment Guide for container configuration details.

Step 7 — Prometheus + Grafana Monitoring

Operational visibility is non-negotiable in a distributed game platform. Prometheus scrapes metrics from your game servers, Redis, PostgreSQL, and Nginx at regular intervals, storing time-series data for analysis and alerting. Grafana connects to Prometheus as a data source and renders dashboards for real-time monitoring, capacity planning, and incident investigation. Without this stack, you are blind to resource exhaustion, latency spikes, and connection drops that degrade player experience.

The Prometheus configuration below scrapes metrics from the game server, Redis, PostgreSQL, and Nginx exporter endpoints:

# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'ludoking-prod'
    environment: 'production'

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

rule_files:
  - '/etc/prometheus/rules/*.yml'

scrape_configs:
  # --- Prometheus self-monitoring ---
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # --- Ludo Game Server Metrics ---
  # Requires prometheus-client middleware in your Node.js server
  # Example: app.get('/metrics', promClientMiddleware)
  - job_name: 'ludo-server'
    static_configs:
      - targets: ['ludo-server:3000']
    metrics_path: /metrics
    scrape_interval: 10s
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: '(.+):\d+'
        replacement: '${1}'

  # --- Redis Metrics ---
  # Expose via redis_exporter (separate container or sidecar)
  - job_name: 'redis'
    static_configs:
      - targets: ['redis:9121']

  # --- PostgreSQL Metrics ---
  # Expose via postgres_exporter
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']

  # --- Nginx Metrics ---
  # stub_status output is not Prometheus format — run nginx-prometheus-exporter
  # pointed at the stub_status endpoint and scrape the exporter instead
  - job_name: 'nginx'
    static_configs:
      - targets: ['nginx-exporter:9113']       # exporter's default port

  # --- Docker Container Metrics ---
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

# =============================================================================
# prometheus/rules/ludo-alerts.yml — Alerting Rules
# =============================================================================

groups:
  - name: ludo-game-alerts
    interval: 30s
    rules:
      # High latency — players experiencing slow responses
      - alert: HighAPILatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High API latency detected"
          description: "95th percentile latency is {{ $value | printf \"%.2f\" }}s (threshold: 2s)"

      # WebSocket connection exhaustion
      # (both metrics are gauges — rate() over them would compare slopes, not levels)
      - alert: HighWebSocketConnections
        expr: socketio_connections > 0.9 * socketio_connections_limit
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "WebSocket connections approaching limit"
          description: "{{ $value | printf \"%.0f\" }} connections on {{ $labels.instance }} — scale up the ASG"

      # Redis memory pressure
      - alert: RedisMemoryHigh
        expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis memory usage above 85%"
          description: "Redis is at {{ $value | humanizePercentage }} of max memory — consider raising maxmemory"

      # Database connection pool exhaustion
      - alert: DatabaseConnectionsHigh
        expr: sum(pg_stat_activity_count) / max(pg_settings_max_connections) > 0.8
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL connection pool above 80% capacity"
          description: "{{ $value | humanizePercentage }} of connections in use"

      # Game server instance unhealthy
      - alert: GameServerDown
        expr: up{job="ludo-server"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Game server instance is down"
          description: "Instance {{ $labels.instance }} has been unreachable for 1 minute"


Key metrics to instrument in your game server include: WebSocket connection count (gauge), messages processed per second (counter), HTTP request duration histograms (histogram), active game sessions (gauge), dice rolls per minute (counter), token moves per minute (counter), Redis operation latency (histogram), and database query latency (histogram). These metrics feed the alerting rules and Grafana dashboards for capacity planning, anomaly detection, and SLA compliance reporting.
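
These can be exposed without any dependency, as a sketch of the text exposition format Prometheus scrapes from `/metrics` — in production you would register them with the prom-client package, which manages registries, labels, and histogram buckets for you. The metric names below are illustrative.

```javascript
// Render a snapshot of game-server metrics in the Prometheus text format.
function renderMetrics({ connections, activeSessions, diceRollsTotal, queryLatencyBuckets }) {
  const lines = [
    '# TYPE socketio_connections gauge',
    `socketio_connections ${connections}`,
    '# TYPE ludo_active_sessions gauge',
    `ludo_active_sessions ${activeSessions}`,
    '# TYPE ludo_dice_rolls_total counter',
    `ludo_dice_rolls_total ${diceRollsTotal}`,
    '# TYPE db_query_duration_seconds histogram',
  ];
  // Histogram buckets are cumulative counts of observations <= le
  let cumulative = 0;
  for (const [le, count] of queryLatencyBuckets) {
    cumulative += count;
    lines.push(`db_query_duration_seconds_bucket{le="${le}"} ${cumulative}`);
  }
  lines.push(`db_query_duration_seconds_bucket{le="+Inf"} ${cumulative}`);
  return lines.join('\n') + '\n';
}
```

Note the histogram convention: each bucket counts all observations at or below its `le` bound, which is what makes `histogram_quantile()` in the alerting rules above work.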

Grafana dashboard panels should display: real-time WebSocket connection count per instance and total across the ASG, API request latency percentiles (p50, p95, p99), Redis memory usage and command rate, PostgreSQL query latency and connection pool utilization, CPU and memory per container, and game session start/end rates. Set up Grafana alerting channels for Slack, PagerDuty, or email for the critical alerts defined above.
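
A few of those panels can be driven by queries like the following; the game-server metric name is an assumption, while the Redis and cAdvisor names are the exporters' standard ones.

```promql
# p95 API latency across all instances
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Total WebSocket connections across the ASG (gauge name assumed)
sum(socketio_connections)

# Redis memory utilization
redis_memory_used_bytes / redis_memory_max_bytes

# Per-container CPU from cAdvisor
sum by (name) (rate(container_cpu_usage_seconds_total[5m]))
```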

Frequently Asked Questions

How does a move made on one server instance reach players connected to other instances?

The Socket.IO Redis adapter replaces the default in-memory adapter with a Redis-backed pub/sub layer. When a player on any server instance calls io.to('room:123').emit('event', data), the adapter publishes the event to a Redis channel named socket.io#/#room:123#. All other server instances are subscribed to this channel (via a separate Redis subscriber connection) and receive the event, which they then emit to their locally connected Socket.IO clients. This means a move made by a player on Instance A is instantly broadcast to all other players in the room, regardless of which instance they are connected to. The adapter handles all channel management, subscription cleanup, and acknowledgment tracking automatically.
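
The mechanism can be modeled in a few lines without Redis at all — a shared in-process broker standing in for Redis pub/sub, and two "instances" standing in for game servers. The class names are illustrative; real code wires this up with @socket.io/redis-adapter.

```javascript
// Stand-in for Redis pub/sub: channel name -> set of subscriber callbacks.
class Broker {
  constructor() { this.subs = new Map(); }
  subscribe(channel, fn) {
    if (!this.subs.has(channel)) this.subs.set(channel, new Set());
    this.subs.get(channel).add(fn);
  }
  publish(channel, msg) {
    for (const fn of this.subs.get(channel) || []) fn(msg);
  }
}

// Stand-in for one game-server instance.
class Instance {
  constructor(name, broker) {
    this.name = name;
    this.broker = broker;
    this.delivered = [];                     // events forwarded to local clients
  }
  joinRoom(room) {
    // Mirrors the adapter subscribing to the channel socket.io#/#room:123#
    this.broker.subscribe(`socket.io#/#${room}#`, msg => this.delivered.push(msg));
  }
  emitToRoom(room, event, data) {
    // Mirrors io.to(room).emit(...) publishing through Redis
    this.broker.publish(`socket.io#/#${room}#`, { from: this.name, event, data });
  }
}
```

Running two Instance objects against one Broker shows an event emitted on instance A arriving at instance B — the same path a dice roll takes across the real Redis channel.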

Why do WebSocket connections need sticky sessions at the load balancer?

WebSocket upgrades are HTTP requests with an Upgrade header that transition the TCP connection from HTTP to the WebSocket protocol. By default, an Nginx load balancer routes each HTTP request independently — a player's HTTP requests might go to Instance A while their WebSocket upgrade goes to Instance B. If your Socket.IO server stores session data in local memory (rather than Redis), Instance B would have no knowledge of the player's existing session. Even with Redis, sticky sessions improve cache locality and reduce cross-instance traffic. Sticky routing — whether by cookie value or by client-IP hash — ensures the same client is consistently sent to the same backend instance across all requests.
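
On open-source nginx (where the commercial `sticky cookie` directive is unavailable), consistent hashing on the client address approximates sticky routing. A minimal sketch — the upstream server names are illustrative:

```nginx
upstream ludo_backend {
    hash $remote_addr consistent;              # same client IP -> same backend
    server ludo-server-1:3000;
    server ludo-server-2:3000;
}

server {
    location /socket.io/ {
        proxy_pass http://ludo_backend;
        proxy_http_version 1.1;                # required for WebSocket upgrades
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
```

The trade-off of IP hashing is that clients behind one NAT share a backend; cookie-based stickiness (NGINX Plus or a third-party module) avoids that.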

Which metric should drive auto-scaling for a Socket.IO game server?

For Socket.IO game servers, WebSocket connection count is the most direct metric — each instance has a practical ceiling (typically 1,000–5,000 concurrent connections depending on hardware). Supplement with CPU utilization as a secondary indicator and message queue depth (if using a message broker) as a leading indicator. Memory utilization is less useful as a primary metric because Node.js memory usage grows gradually rather than spiking during high-load events. For a Ludo game specifically, the number of active game sessions is more meaningful than raw connection count — a lobby full of idle players consumes connections but minimal CPU, while a full four-player game consumes both connections and CPU for event processing.
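
One way to combine those indicators is a single "pressure" signal — the highest utilization ratio among them. This is an illustrative sketch; every threshold here is an assumption to tune against your own baseline, not a benchmark.

```javascript
// Composite scale signal: connections are primary, CPU secondary, and
// active game sessions the Ludo-specific load indicator.
function scaleSignal({ connections, cpuPercent, activeSessions },
                     limits = { connections: 1000, cpuPercent: 60, activeSessions: 250 }) {
  const pressure = Math.max(
    connections / limits.connections,          // primary: socket ceiling
    cpuPercent / limits.cpuPercent,            // secondary: compute
    activeSessions / limits.activeSessions,    // game-specific load
  );
  if (pressure > 0.8) return 'scale-out';
  if (pressure < 0.3) return 'scale-in';
  return 'hold';
}
```

The dead band between 0.3 and 0.8 serves the same purpose as the ASG cooldown: it keeps the fleet from flapping between scale-out and scale-in on noisy metrics.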

How do PostgreSQL read replicas stay in sync, and how much lag should you expect?

PostgreSQL uses streaming replication to keep read replicas synchronized with the primary. The primary continuously ships WAL (Write-Ahead Log) records to replicas in real time via its walsender processes; replicas apply them via the walreceiver process and the recovery subsystem. Replication lag — the time between a write on the primary and its visibility on a replica — is typically 5–50ms under normal load but can grow to seconds during heavy write bursts or network interruptions. For leaderboard queries and match history, this lag is imperceptible to players. For real-time matchmaking that checks player availability, always query the primary or use Redis as the authoritative live state store.
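
The routing rule that answer implies can be made explicit. A minimal sketch — the query kinds are illustrative names, and real code would hold two pg.Pool instances rather than strings:

```javascript
// Lag-tolerant read kinds that may be served by a replica.
const REPLICA_SAFE = new Set(['leaderboard', 'match_history', 'player_profile']);

// Route a query kind to the right pool: replicas for lag-tolerant reads,
// primary for writes and anything lag-sensitive (default).
function poolFor(queryKind) {
  return REPLICA_SAFE.has(queryKind) ? 'replica' : 'primary';
}
```

Defaulting unknown kinds to the primary is the safe direction: a misrouted read costs a little primary load, while a misrouted lag-sensitive read costs correctness.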

What CDN cache TTLs should you set for game assets?

Set aggressive TTLs (30 days) for immutable assets like board graphics, token sprites, and audio files that are versioned by build hash in the URL path. These assets never change for a given version, so long TTLs maximize cache hit rates and minimize origin requests. Set short TTLs (5–60 seconds) for mutable or frequently updated assets. For the asset manifest itself (the JSON file listing asset URLs and versions), use a 5-minute TTL with a cache-busting query parameter so clients pick up new versions quickly after deployments. At the edge, configure stale-while-revalidate to serve cached content while fetching updates in the background, eliminating perceived latency from CDN cache misses.
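
Those tiers translate directly into Cache-Control headers set at the origin. A sketch — the hashed path convention (/assets/&lt;build-hash&gt;/…) and the manifest filename are assumptions about your build pipeline:

```javascript
// Choose a Cache-Control header by asset class.
function cacheControlFor(path) {
  if (/^\/assets\/[0-9a-f]{8,}\//.test(path)) {
    // Versioned by build hash -> safe to cache for 30 days and mark immutable
    return 'public, max-age=2592000, immutable';
  }
  if (path.endsWith('asset-manifest.json')) {
    // Manifest must roll over quickly after a deploy
    return 'public, max-age=300, stale-while-revalidate=60';
  }
  // Mutable default: short TTL with background revalidation at the edge
  return 'public, max-age=60, stale-while-revalidate=30';
}
```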

How do you prevent players from seeing stale game state from the Redis cache?

Stale reads are prevented by two practices: short TTLs and immediate invalidation on write. Set TTLs of 60–300 seconds for live game state — long enough to serve read-heavy traffic efficiently, short enough to limit stale reads to a few minutes at most. On every game state mutation (dice roll, token move, capture), delete the cache entry immediately with DEL game:{gameId} before or immediately after persisting to PostgreSQL. Do not rely on TTL expiration alone for invalidation — a stale entry served to a player joining mid-game creates a desync that is difficult to recover from. On game completion, explicitly delete the cache entry to free Redis memory for active games.
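
The invalidate-on-write rule fits in a few lines. A sketch with a Map standing in for Redis and a stub persistence layer — real code would call the Redis client's del() and your actual repository:

```javascript
// Apply a game mutation: persist first, then invalidate the cached state
// immediately rather than waiting for TTL expiry.
async function applyMove(gameId, move, db, cache) {
  await db.saveMove(gameId, move);       // 1. persist to PostgreSQL
  cache.delete(`game:${gameId}`);        // 2. DEL game:{gameId} — never rely on TTL alone
}
```

The next read misses the cache and repopulates it from PostgreSQL, so a player joining mid-game always sees the post-move state.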

Should you deploy on Railway or AWS Auto Scaling Groups?

Railway simplifies deployment by abstracting infrastructure — you push a Docker image and Railway handles container orchestration, scaling, and health checks through its dashboard or railway.json configuration. It supports CPU and memory-based auto-scaling triggers but lacks the fine-grained control of AWS ASG (custom CloudWatch metrics, scheduled actions, step scaling policies, and mixed instance types). For production Ludo platforms expecting thousands of concurrent players across multiple regions, AWS ASG provides the control and cost optimization needed for enterprise workloads. Railway is excellent for staging environments, rapid prototyping, and smaller-scale production deployments where the built-in scaling behavior is sufficient. Use the Docker Deployment Guide for container configuration that works on both Railway and AWS.

Need Help Scaling Your Ludo Game Platform?

From architecture reviews to full infrastructure implementation, the Ludo King API team provides consulting for production-grade Ludo game scaling. Get a free consultation over WhatsApp.

Chat on WhatsApp