Getting Started — Running Polarway Across Multiple Containers/VMs

Date: 2026-01-13

This guide shows the simplest way to distribute Polarway workloads across multiple containers/VMs.

1) Quick concept

You scale Polarway by:

  • Running N identical gRPC workers behind a gRPC-capable load balancer.
  • Turning on external handles so any worker can serve follow-up operations.

External handles mean:

  • The worker persists the DataFrame result into a shared store.
  • The handle returned to the client references that persisted artifact.
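The idea can be illustrated with a minimal sketch. `ExternalHandleStore`, its `put`/`get` methods, and the `.bin` file naming are hypothetical names chosen for illustration, not Polarway's actual API, and the payload is plain bytes standing in for a serialized DataFrame. Atomic writes are omitted here for brevity.

```python
import tempfile
import uuid
from pathlib import Path


class ExternalHandleStore:
    """Hypothetical store that persists payloads under a shared directory."""

    def __init__(self, state_dir: Path) -> None:
        self.state_dir = Path(state_dir)
        self.state_dir.mkdir(parents=True, exist_ok=True)

    def put(self, payload: bytes) -> str:
        # The handle is an opaque key; it carries no worker identity.
        handle = uuid.uuid4().hex
        (self.state_dir / f"{handle}.bin").write_bytes(payload)
        return handle

    def get(self, handle: str) -> bytes:
        # Any worker that mounts the same directory can resolve the handle.
        return (self.state_dir / f"{handle}.bin").read_bytes()


# Two store instances sharing one directory stand in for two worker containers.
shared_dir = Path(tempfile.mkdtemp())
worker_a = ExternalHandleStore(shared_dir)
worker_b = ExternalHandleStore(shared_dir)

handle = worker_a.put(b"serialized dataframe bytes")
payload_seen_by_b = worker_b.get(handle)
```

Because the handle refers to the stored artifact rather than to worker memory, the load balancer is free to send the follow-up call to either instance.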

2) Prerequisites

  • Docker (for container path) or systemd (for VM path)
  • A shared storage option accessible by all workers:
      • Development: shared host path
      • Multi-host: network filesystem (NFS/SMB)
      • Production: object store (recommended)

3) Environment variables

On each worker:

  • POLARWAY_BIND_ADDRESS=0.0.0.0:50051

Handle store modes:

  • In-memory (default):
      • POLARWAY_HANDLE_STORE=memory
  • External (distributed-safe):
      • POLARWAY_HANDLE_STORE=external
      • POLARWAY_STATE_DIR=/state (must be shared across workers)
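A deployment script might validate these variables before starting a worker. The sketch below is a hypothetical pre-flight check, not part of Polarway: `WorkerConfig` and `load_worker_config` are illustrative names, and the fallback bind address is an assumption copied from the example value above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class WorkerConfig:
    bind_address: str
    handle_store: str          # "memory" or "external"
    state_dir: Optional[str]   # only required for "external"


def load_worker_config(env: Optional[Mapping[str, str]] = None) -> WorkerConfig:
    env = os.environ if env is None else env
    # Assumption: fall back to the example address used in this guide.
    bind_address = env.get("POLARWAY_BIND_ADDRESS", "0.0.0.0:50051")
    # The guide states that in-memory is the default mode.
    handle_store = env.get("POLARWAY_HANDLE_STORE", "memory")
    state_dir = env.get("POLARWAY_STATE_DIR") or None

    if handle_store not in ("memory", "external"):
        raise ValueError(f"unknown POLARWAY_HANDLE_STORE: {handle_store!r}")
    if handle_store == "external" and state_dir is None:
        raise ValueError("POLARWAY_STATE_DIR is required for the external store")
    return WorkerConfig(bind_address, handle_store, state_dir)
```

Failing fast on a missing state directory is a design choice for this sketch: a worker that silently falls back to in-memory handles would break follow-up calls routed to other workers.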

4) Single-host multi-container (Phase 1)

Topology

  • 1 machine
  • 1 shared directory mounted into every worker container
  • 1 load balancer container in front

What to run

  • Worker containers (replicas): polarway-grpc
  • Load balancer: any HTTP/2 gRPC-capable LB

Key idea

All workers must see the same POLARWAY_STATE_DIR contents.

5) Multi-VM / multi-host (Phase 2)

Topology

  • Multiple VMs
  • Each VM runs one or more worker containers
  • A load balancer routes traffic to all worker endpoints
  • Shared external store:
      • network filesystem mounted at the same path on all nodes (e.g. /state), OR
      • object store backend (recommended next step)

Path consistency

If you use a filesystem store, ensure the mount path is identical on all machines:

  • Good: every worker uses /state
  • Risky: different paths per VM (handles will still work if key-to-path mapping is consistent, but ops/debugging becomes painful)
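One way to keep the key-to-path mapping consistent is to store only a relative key in the handle and resolve it against each worker's own mount point. This is a sketch under that assumption; `resolve_handle` and the example key layout are hypothetical.

```python
from pathlib import PurePosixPath


def resolve_handle(state_dir: str, handle_key: str) -> PurePosixPath:
    """Map a relative handle key to a path under this worker's state directory."""
    key = PurePosixPath(handle_key)
    # Reject absolute keys and parent references so a key cannot escape state_dir.
    if key.is_absolute() or ".." in key.parts:
        raise ValueError(f"invalid handle key: {handle_key!r}")
    return PurePosixPath(state_dir) / key


# The same key resolves correctly even when the mount path differs per VM.
path_on_vm1 = resolve_handle("/state", "ab/cdef.arrow")
path_on_vm2 = resolve_handle("/mnt/polarway", "ab/cdef.arrow")
```

Even with a mapping like this, identical mount paths remain the easier setup to operate, since file paths in logs and alerts then mean the same thing on every machine.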

6) Client usage patterns

Pattern A: Interactive pipelines (simplest)

  • Client calls gRPC operations sequentially.
  • Each call can land on any worker via LB.

Recommended client settings:

  • request timeout (e.g. 30–120s depending on operation)
  • retries on transient errors
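These settings can be wrapped in a small helper. The sketch below is library-agnostic: `TransientError` is a stand-in for whatever retryable failure the client library raises (for gRPC, typically a status such as UNAVAILABLE), and `call_with_retries` and its parameters are hypothetical names, not a Polarway client API.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Stand-in for a retryable failure reported by the client library."""


def call_with_retries(
    op: Callable[[float], T],
    *,
    timeout_s: float = 60.0,
    max_attempts: int = 4,
    base_delay_s: float = 0.2,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call op(timeout_s), retrying transient failures with jittered backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return op(timeout_s)
        except TransientError:
            if attempt == max_attempts:
                raise
            # Exponential backoff with full jitter to avoid synchronized retries.
            sleep(random.uniform(0, base_delay_s * 2 ** (attempt - 1)))
    raise AssertionError("unreachable")
```

The operation receives the timeout as an argument so that it can pass it on as the per-call deadline. Retrying is only safe when the operation itself is retry-safe.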

Pattern B: Batch jobs

  • Client enqueues jobs to a queue.
  • Workers pull and execute.

This is the right approach for long-running workloads where client timeouts are unacceptable.
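The pull model can be sketched in-process with the standard library. Here `queue.Queue` and threads stand in for a real job queue and separate worker processes, and `run_batch` and the `(job_id, payload)` job shape are assumptions made for illustration.

```python
import queue
import threading
from typing import Any, Callable, Dict, Iterable, Tuple


def run_batch(
    jobs: Iterable[Tuple[str, Any]],
    handler: Callable[[Any], Any],
    num_workers: int = 2,
) -> Dict[str, Tuple[str, Any]]:
    """Run (job_id, payload) jobs on worker threads and collect outcomes."""
    job_queue: "queue.Queue" = queue.Queue()
    results: Dict[str, Tuple[str, Any]] = {}
    lock = threading.Lock()

    def worker() -> None:
        while True:
            item = job_queue.get()
            if item is None:  # sentinel: no more work for this worker
                break
            job_id, payload = item
            try:
                outcome = ("ok", handler(payload))
            except Exception as exc:
                # Record the failure instead of killing the worker.
                outcome = ("error", repr(exc))
            with lock:
                results[job_id] = outcome

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for job in jobs:
        job_queue.put(job)
    for _ in threads:
        job_queue.put(None)
    for thread in threads:
        thread.join()
    return results
```

The client only enqueues and later collects results, so no single request has to stay open for the duration of the job.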

7) Good practices (production-minded)

Make operations retry-safe

  • Assume clients will retry.
  • Ensure “create handle” is atomic:
      • filesystem: write temp file then rename
      • object store: upload then return handle
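For the filesystem case, the write-then-rename step can be sketched as follows. `atomic_write` is a hypothetical helper. The rename is atomic on POSIX filesystems when the temp file and the target are on the same filesystem; guarantees on network filesystems vary and should be verified for the one in use.

```python
import os
import tempfile
from pathlib import Path


def atomic_write(path: Path, payload: bytes) -> None:
    """Write payload so readers see either the old file or the complete new one."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Create the temp file next to the target so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(payload)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before publishing
        os.replace(tmp_name, path)  # publish under the final name
    except BaseException:
        # Clean up the temp file if writing or renaming failed.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

A retried request that writes the same handle again simply replaces the file with identical content, so readers never observe a partially written artifact.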

Bound resources

  • Set maximum payload sizes (streaming batches, Arrow IPC bytes).
  • Limit concurrency per worker (CPU and memory are the bottleneck with DataFrames).
  • Add guardrails: max rows, max columns, max string lengths (if applicable).
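A worker could enforce such bounds before accepting work. In this sketch the limit values are placeholders, and `Limits`, `check_limits`, and `ConcurrencyGate` are hypothetical names rather than Polarway settings.

```python
import threading
from dataclasses import dataclass
from typing import Callable, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Limits:
    # Placeholder values; tune them to the memory available per worker.
    max_payload_bytes: int = 64 * 1024 * 1024
    max_rows: int = 10_000_000
    max_columns: int = 2_000


class LimitExceeded(ValueError):
    pass


class WorkerBusy(RuntimeError):
    pass


def check_limits(limits: Limits, *, payload_bytes: int, rows: int, columns: int) -> None:
    """Raise LimitExceeded if a request is larger than the configured bounds."""
    if payload_bytes > limits.max_payload_bytes:
        raise LimitExceeded("payload too large")
    if rows > limits.max_rows:
        raise LimitExceeded("too many rows")
    if columns > limits.max_columns:
        raise LimitExceeded("too many columns")


class ConcurrencyGate:
    """Reject work immediately once max_in_flight operations are running."""

    def __init__(self, max_in_flight: int) -> None:
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def try_run(self, fn: Callable[[], T]) -> T:
        if not self._slots.acquire(blocking=False):
            raise WorkerBusy("worker at capacity")
        try:
            return fn()
        finally:
            self._slots.release()
```

Rejecting quickly when at capacity, rather than queueing inside the worker, lets the client's retry logic move the request to a less loaded worker.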

Handle lifecycle and storage growth

  • External state grows without bound unless something cleans it up.
  • Plan one of:
      • TTL-based GC (delete objects older than N hours/days)
      • “pinning” important artifacts and expiring the rest
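For a filesystem store, both options can be combined in one sweep. This sketch assumes a flat state directory with one file per handle and uses file modification time as the artifact's age; `collect_garbage` and the `pinned` set are hypothetical.

```python
import time
from pathlib import Path
from typing import Collection, List, Optional


def collect_garbage(
    state_dir: Path,
    ttl_seconds: float,
    pinned: Collection[str] = frozenset(),
    now: Optional[float] = None,
) -> List[str]:
    """Delete unpinned files older than ttl_seconds; return the removed names."""
    now = time.time() if now is None else now
    removed = []
    for entry in Path(state_dir).iterdir():
        if entry.name in pinned:
            continue
        try:
            if not entry.is_file():
                continue
            if now - entry.stat().st_mtime > ttl_seconds:
                entry.unlink()
                removed.append(entry.name)
        except FileNotFoundError:
            # Another collector removed the file first; nothing left to do.
            continue
    return sorted(removed)
```

Running the sweep from a single scheduled job is simpler than running it in every worker, although the `FileNotFoundError` handling keeps concurrent sweeps harmless.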

Observability

At minimum, track:

  • requests/sec and latency per RPC method
  • error rate by status code
  • bytes written/read to state store
  • number of handles created per minute
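An in-process recorder for these signals might look like the sketch below. `RpcMetrics` is a hypothetical class and the status labels are simplified to "OK" and "ERROR"; a real deployment would more likely export the same counters through its metrics library of choice.

```python
import time
from collections import Counter, defaultdict
from contextlib import contextmanager
from typing import DefaultDict, Iterator, List


class RpcMetrics:
    """Hypothetical in-memory recorder for per-method request metrics."""

    def __init__(self) -> None:
        self.requests: Counter = Counter()  # (method, status) -> count
        self.latency_s: DefaultDict[str, List[float]] = defaultdict(list)
        self.bytes_written = 0
        self.handles_created = 0

    @contextmanager
    def observe(self, method: str) -> Iterator[None]:
        """Time one RPC and count it under its method and outcome."""
        start = time.perf_counter()
        status = "OK"
        try:
            yield
        except Exception:
            status = "ERROR"
            raise
        finally:
            self.requests[(method, status)] += 1
            self.latency_s[method].append(time.perf_counter() - start)

    def record_handle_write(self, num_bytes: int) -> None:
        """Count one persisted handle and the bytes written for it."""
        self.bytes_written += num_bytes
        self.handles_created += 1
```

Wrapping each RPC handler body in `with metrics.observe("MethodName"):` yields request counts, error counts, and latency samples per method from a single call site.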

Network and security

  • Run workers on a private network.
  • Put auth in front (mTLS or token auth).
  • Separate tenants/namespaces in handle keys.
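Tenant separation in handle keys can be sketched as a key prefix plus an ownership check. The `tenant/handle` layout, the allowed tenant-name pattern, and the function names are assumptions made for illustration.

```python
import re
import uuid

# Assumed tenant-name rule: lowercase letters, digits, "_" and "-", max 63 chars.
_TENANT_RE = re.compile(r"[a-z0-9][a-z0-9_-]{0,62}")


def new_handle_key(tenant: str) -> str:
    """Create a fresh handle key inside the tenant's namespace."""
    if not _TENANT_RE.fullmatch(tenant):
        raise ValueError(f"invalid tenant name: {tenant!r}")
    return f"{tenant}/{uuid.uuid4().hex}"


def tenant_of(handle_key: str) -> str:
    """Extract the tenant prefix from a handle key, rejecting malformed keys."""
    tenant, sep, rest = handle_key.partition("/")
    if not sep or not rest or not _TENANT_RE.fullmatch(tenant):
        raise ValueError(f"malformed handle key: {handle_key!r}")
    return tenant


def is_authorized(handle_key: str, caller_tenant: str) -> bool:
    """Allow access only to handles in the caller's own namespace."""
    return tenant_of(handle_key) == caller_tenant
```

The caller's tenant should come from the authenticated identity (for example the mTLS certificate or token), never from the request payload.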

Performance tips

  • Prefer Arrow IPC for fast serialization.
  • Consider request affinity (sticky routing) to reduce repeated reads from external store.
  • Add a local cache (best-effort) per worker if repeated reads are common.
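A best-effort local cache can be a small read-through LRU in front of the external store. This sketch assumes that a persisted artifact never changes once its handle has been returned, so cached bytes cannot go stale; `ReadThroughCache` is a hypothetical name and the class is not thread-safe as written.

```python
from collections import OrderedDict
from typing import Callable


class ReadThroughCache:
    """Small LRU cache in front of a slower fetch function (not thread-safe)."""

    def __init__(self, fetch: Callable[[str], bytes], max_entries: int = 128) -> None:
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self._fetch = fetch
        self._max_entries = max_entries
        self._entries: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, handle: str) -> bytes:
        if handle in self._entries:
            # Mark as most recently used.
            self._entries.move_to_end(handle)
            return self._entries[handle]
        payload = self._fetch(handle)  # cache miss: read from the external store
        self._entries[handle] = payload
        if len(self._entries) > self._max_entries:
            # Evict the least recently used entry.
            self._entries.popitem(last=False)
        return payload
```

Bounding the cache by entry count is the simplest option; with DataFrame-sized payloads a bound on total bytes is probably the safer choice in practice.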

8) Suggested next steps

  • Implement atomic writes + GC for the filesystem state store.
  • Add an object-store backed StateStore (same interface as filesystem).
  • Add a small client-side “router” helper for:
      • retries
      • optional handle-affinity routing
      • standardized timeouts
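For the handle-affinity part of such a helper, one possible technique is rendezvous (highest-random-weight) hashing, which maps each handle to a preferred worker without shared state. The guide does not prescribe this technique; `pick_worker` and the endpoint strings are hypothetical.

```python
import hashlib
from typing import Sequence


def pick_worker(handle: str, workers: Sequence[str]) -> str:
    """Pick a preferred worker for a handle using rendezvous hashing."""
    if not workers:
        raise ValueError("no workers available")

    def score(worker: str) -> int:
        # Each (worker, handle) pair gets a pseudo-random but stable score.
        digest = hashlib.sha256(f"{worker}|{handle}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    # The highest-scoring worker wins.
    return max(workers, key=score)


endpoints = ["10.0.0.1:50051", "10.0.0.2:50051", "10.0.0.3:50051"]
preferred = pick_worker("acme/1f3a", endpoints)
```

When a worker is removed, only the handles that preferred it move elsewhere. Because external handles let any worker serve the request, affinity here is purely an optimization and a miss costs one extra read from the store.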

9) Reference docs