Getting Started — Running Polarway Across Multiple Containers/VMs

Date: 2026-01-13

This guide shows the simplest way to distribute Polarway workloads across multiple containers/VMs.

1) Quick concept

You scale Polarway by: - Running N identical gRPC workers behind a gRPC-capable load balancer. - Turning on external handles so any worker can serve follow-up operations.

External handles mean: - The worker persists the DataFrame result into a shared store. - The handle returned to the client references that persisted artifact.

2) Prerequisites

Docker (for container path) or systemd (for VM path)
A shared storage option accessible by all workers:
Development: shared host path
Multi-host: network filesystem (NFS/SMB)
Production: object store (recommended)

3) Environment variables

On each worker: - POLARWAY_BIND_ADDRESS=0.0.0.0:50051

Handle store modes: - In-memory (default): - POLARWAY_HANDLE_STORE=memory - External (distributed-safe): - POLARWAY_HANDLE_STORE=external - POLARWAY_STATE_DIR=/state (must be shared across workers)

4) Single-host multi-container (Phase 1)

Topology

1 machine
1 shared directory mounted into every worker container
1 load balancer container in front

What to run

Worker containers (replicas): polarway-grpc
Load balancer: any HTTP/2 gRPC-capable LB

Key idea

All workers must see the same POLARWAY_STATE_DIR contents.

5) Multi-VM / multi-host (Phase 2)

Topology

Multiple VMs
Each VM runs one or more worker containers
A load balancer routes traffic to all worker endpoints
Shared external store:
network filesystem mounted at the same path on all nodes (e.g. /state), OR
object store backend (recommended next step)

Path consistency

If you use a filesystem store, ensure the mount path is identical on all machines: - Good: every worker uses /state - Risky: different paths per VM (handles will still work if key-to-path mapping is consistent, but ops/debugging becomes painful)

6) Client usage patterns

Pattern A: Interactive pipelines (simplest)

Client calls gRPC operations sequentially.
Each call can land on any worker via LB.

Recommended client settings: - request timeout (e.g. 30–120s depending on operation) - retries on transient errors

Pattern B: Batch jobs

Client enqueues jobs to a queue.
Workers pull and execute.

This is the right approach for long-running workloads where client timeouts are unacceptable.

7) Good practices (production-minded)

Make operations retry-safe

Assume clients will retry.
Ensure “create handle” is atomic:
filesystem: write temp file then rename
object store: upload then return handle

Bound resources

Set maximum payload sizes (streaming batches, Arrow IPC bytes).
Limit concurrency per worker (CPU and memory are the bottleneck with DataFrames).
Add guardrails: max rows, max columns, max string lengths (if applicable).

Handle lifecycle and storage growth

External state grows without cleanup.
Plan one of:
TTL-based GC (delete objects older than N hours/days)
“pinning” important artifacts and expiring the rest

Observability

At minimum, track: - requests/sec and latency per RPC method - error rate by status code - bytes written/read to state store - number of handles created per minute

Network and security

Run workers on a private network.
Put auth in front (mTLS or token auth).
Separate tenants/namespaces in handle keys.

Performance tips

Prefer Arrow IPC for fast serialization.
Consider request affinity (sticky routing) to reduce repeated reads from external store.
Add a local cache (best-effort) per worker if repeated reads are common.

8) Suggested next steps

Implement atomic writes + GC for the filesystem state store.
Add an object-store backed StateStore (same interface as filesystem).
Add a small client-side “router” helper for:
retries
optional handle-affinity routing
standardized timeouts

9) Reference docs

Implementation plan: DISTRIBUTED_IMPLEMENTATION_PLAN.md