Migration Guide: Polars to Polarway
A comprehensive guide to migrating your existing Polars code to Polarway.
Table of Contents
- Why Migrate
- Version-Specific Migrations
  - v0.53.0: Adaptive Streaming Sources
- General Polars → Polarway Migration
v0.53.0: Adaptive Streaming Sources (January 2026)
What's New
Polarway v0.53.0 introduces Generic Adaptive Streaming Sources with support for:
- CSV with adaptive chunking
- Cloud Storage: S3, Azure Blob, Google Cloud Storage
- Databases: DynamoDB, PostgreSQL, MySQL
- Streaming: Apache Kafka
- HTTP/REST APIs with automatic pagination
- Filesystem with memory mapping
Quick Migration
From Standard Polars CSV
Before:
import polars as pl
# Eager read: loads the whole file into memory
df = pl.read_csv("large.csv")
After (v0.53.0):
import polarway as pl
# Automatic memory management
df = pl.adaptive_scan_csv("large.csv", memory_limit="2GB")
# Or streaming
from polarway.streaming import CsvSource
source = CsvSource("large.csv", memory_limit="2GB")
for chunk in source:
    process(chunk)
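To make "adaptive chunking" more concrete, here is a minimal standard-library sketch of the general idea: rows are grouped into chunks whose estimated size stays under a byte budget. This is not Polarway's implementation. `iter_csv_chunks` and `budget_bytes` are hypothetical names, and the size estimate (sum of field lengths) is a deliberate simplification.

```python
import csv
import io
from typing import Iterator, List


def iter_csv_chunks(text_stream, budget_bytes: int) -> Iterator[List[List[str]]]:
    """Yield lists of CSV rows whose estimated size stays within budget_bytes.

    Hypothetical helper for illustration only; the header row is skipped.
    """
    reader = csv.reader(text_stream)
    next(reader, None)  # skip the header row
    chunk: List[List[str]] = []
    used = 0
    for row in reader:
        row_size = sum(len(field) for field in row)  # crude size estimate
        # Flush the current chunk before it would exceed the budget.
        if chunk and used + row_size > budget_bytes:
            yield chunk
            chunk, used = [], 0
        chunk.append(row)
        used += row_size
    if chunk:
        yield chunk


sample = io.StringIO("id,name\n1,aa\n2,bb\n3,cc\n")
chunks = list(iter_csv_chunks(sample, budget_bytes=6))
```

Because each chunk is yielded and then discarded by the caller, peak memory is bounded by the budget rather than by the file size.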
From Manual S3 Downloads
Before:
import boto3
import polars as pl
s3 = boto3.client('s3')
s3.download_file('bucket', 'data.parquet', '/tmp/data.parquet')
df = pl.read_parquet('/tmp/data.parquet')
After (v0.53.0):
from polarway.streaming import S3Source
# Stream directly from S3
source = S3Source(
    "s3://bucket/data.parquet",
    memory_limit="4GB",
)
for chunk in source:
    process(chunk)
From Manual API Pagination
Before:
import polars as pl
import requests
data = []
page = 1
while True:
    r = requests.get(f"https://api.example.com/data?page={page}")
    rows = r.json()  # parse the response body once
    if not rows:
        break
    data.extend(rows)
    page += 1
df = pl.DataFrame(data)
After (v0.53.0):
from polarway.streaming import HttpSource
# Automatic pagination
source = HttpSource(
    "https://api.example.com/data",
    paginated=True,
    memory_limit="1GB",
)
for chunk in source:
    process(chunk)
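How `paginated=True` works internally is not described in this guide. As a rough mental model only, automatic pagination can be pictured as a generator that yields one page at a time instead of accumulating every page in a list. In the sketch below, `iter_pages`, `fake_fetch` and `FAKE_API` are hypothetical stand-ins so the example runs offline.

```python
from typing import Callable, Iterator, List


def iter_pages(fetch_page: Callable[[int], List[dict]], start: int = 1) -> Iterator[List[dict]]:
    """Yield one page of records at a time until an empty page is returned."""
    page = start
    while True:
        records = fetch_page(page)
        if not records:
            return
        yield records
        page += 1


# Stand-in for an HTTP call so the sketch runs without a network.
FAKE_API = {1: [{"id": 1}, {"id": 2}], 2: [{"id": 3}]}


def fake_fetch(page: int) -> List[dict]:
    return FAKE_API.get(page, [])


pages = list(iter_pages(fake_fetch))
```

Compared with the manual loop in the "Before" snippet, the caller only ever holds one page in memory unless it chooses to keep more.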
Configuration
Create polarway.toml for default settings:
[sources.csv]
default_memory_limit = "2GB"
default_chunk_size = 10000
[sources.s3]
region = "us-east-1"
default_memory_limit = "4GB"
[sources.http]
timeout = 30
retry_attempts = 3
Memory Recommendations
| Environment | Memory Limit | Chunk Size |
|---|---|---|
| Laptop (8GB) | "2GB" | 10,000 |
| Desktop (16GB) | "4GB" | 50,000 |
| Server (32GB) | "8GB" | 100,000 |
| Azure B1s (1GB) | "400MB" | 5,000 |
| Azure B2s (4GB) | "1.5GB" | 20,000 |
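If you want to choose these values programmatically, one option is a small lookup helper. The sketch below is an illustration, not part of Polarway: `parse_size` and `recommend` are hypothetical names, and treating "GB" as 1024³ bytes is an assumption, since the guide does not say how Polarway interprets size strings.

```python
import re

# Assumption: binary units (1 GB = 1024**3 bytes).
_UNITS = {"KB": 1024, "MB": 1024**2, "GB": 1024**3}


def parse_size(text: str) -> int:
    """Convert a string such as "1.5GB" or "400MB" to a number of bytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(KB|MB|GB)\s*", text.upper())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    value, unit = match.groups()
    return int(float(value) * _UNITS[unit])


# (total RAM, memory limit, chunk size), copied from the recommendations table
# and sorted by total RAM.
RECOMMENDATIONS = [
    ("1GB", "400MB", 5_000),
    ("4GB", "1.5GB", 20_000),
    ("8GB", "2GB", 10_000),
    ("16GB", "4GB", 50_000),
    ("32GB", "8GB", 100_000),
]


def recommend(total_ram: str):
    """Return (memory_limit, chunk_size) for the largest listed machine that fits."""
    ram = parse_size(total_ram)
    best = RECOMMENDATIONS[0][1:]  # fall back to the smallest row
    for machine, limit, chunk in RECOMMENDATIONS:
        if parse_size(machine) <= ram:
            best = (limit, chunk)
    return best
```

For example, `recommend("12GB")` falls back to the 8GB laptop row, since that is the largest listed machine that does not exceed the available RAM.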
Custom Sources (Rust)
use polars_streaming_adaptive::sources::*;
use async_trait::async_trait;
#[derive(Debug)]
struct MySqlSource { /* ... */ }
#[async_trait]
impl StreamingSource for MySqlSource {
    async fn metadata(&self) -> SourceResult<SourceMetadata> { /* ... */ }
    async fn read_chunk(&mut self) -> SourceResult<Option<DataFrame>> { /* ... */ }
    fn stats(&self) -> StreamingStats { /* ... */ }
    async fn reset(&mut self) -> SourceResult<()> { /* ... */ }
    fn has_more(&self) -> bool { /* ... */ }
}
// Register
let mut registry = SourceRegistry::new();
registry.register("mysql", Box::new(MySqlSourceFactory));
See Also
- API Reference - Complete API documentation
- User Guide - Comprehensive usage guide
- Benchmarks - Performance comparisons
Why Migrate?
Performance Improvements
- Remote Execution: Process data on powerful servers from lightweight clients
- Streaming-First: Handle larger-than-RAM datasets with constant memory usage
- Zero-Copy Streaming: Arrow IPC eliminates serialization overhead
- True Parallelism: Async Tokio runtime enables concurrent operations
- Better Scalability: Multi-client access to shared datasets
New Features
- gRPC Architecture: Handle-based remote execution
- WebSocket Streaming: Real-time data ingestion with sub-ms latency
- REST API Integration: Built-in pagination and retries
- Time-Series Native: OHLCV resampling and rolling windows
- Distributed Computing: Process data across multiple nodes
- Advanced Async: First-class async/await support
Use Cases Perfect for Polarway
- ✅ Processing datasets larger than your machine's RAM
- ✅ Shared data access from multiple clients
- ✅ Real-time streaming data pipelines
- ✅ Financial data analysis with time-series operations
- ✅ Centralized data processing on powerful hardware
- ✅ Network-sourced data (WebSocket, REST APIs)
Compatibility Overview
✅ What's Compatible
Most Polars operations work identically in Polarway:
# These work the same in both Polars and Polarway
df.select(["col1", "col2"])
df.filter(pl.col("price") > 100)
df.group_by("symbol").agg(pl.col("price").mean())  # Polars aggregates with expressions, not dicts
df.join(df2, on="id", how="inner")
⚠️ Key Differences
1. Handle-Based Architecture
Polars (in-memory):
import polars as pl
df = pl.read_parquet("data.parquet") # Returns DataFrame with data
print(type(df)) # <class 'polars.DataFrame'>
Polarway (remote handles):
import polarway as pd
df = pd.read_parquet("data.parquet") # Returns Handle reference
print(type(df)) # <class 'polarway.DataFrame'> (just a UUID)
2. Explicit Collection
Polars:
df = pl.read_parquet("data.parquet")
result = df.select(["col1"]) # Result is immediately available
print(result) # Prints data
Polarway:
df = pd.read_parquet("data.parquet")
df2 = df.select(["col1"]) # Returns new handle
result = df2.collect() # Explicit collection needed
print(result) # PyArrow Table
3. Result Types
Polars (raises exceptions):
try:
    df = pl.read_parquet("missing.parquet")
except FileNotFoundError as e:
    print(f"Error: {e}")
Polarway (Result monad):
result = pd.read_parquet("missing.parquet").collect()
if result.is_ok():
    df = result.unwrap()
else:
    error = result.unwrap_err()
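For readers who have not used this pattern before, the sketch below shows what a Result type with `is_ok()`, `is_err()`, `unwrap()` and `unwrap_err()` could look like in plain Python. It is a teaching aid under stated assumptions, not Polarway's actual class: the `Result` dataclass and the `safe_div` helper are hypothetical.

```python
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Result(Generic[T]):
    """Holds either a success value or an error (illustrative only)."""
    value: Optional[T] = None
    error: Optional[Exception] = None

    def is_ok(self) -> bool:
        return self.error is None

    def is_err(self) -> bool:
        return self.error is not None

    def unwrap(self) -> T:
        # Re-raise the stored error if the operation failed.
        if self.error is not None:
            raise self.error
        return self.value

    def unwrap_err(self) -> Exception:
        if self.error is None:
            raise ValueError("called unwrap_err on an Ok result")
        return self.error


def safe_div(a: float, b: float) -> Result[float]:
    """Wrap a failing operation in a Result instead of raising."""
    try:
        return Result(value=a / b)
    except ZeroDivisionError as exc:
        return Result(error=exc)


ok = safe_div(6, 3)
bad = safe_div(1, 0)
```

The practical consequence for migrated code is that failures no longer interrupt control flow on their own: every call site has to check `is_ok()` or `is_err()`, or call `unwrap()` and accept that it may raise.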
Step-by-Step Migration
1. Update Dependencies
Python
Before (Polars):
After (Polarway):
Or with pip:
Rust
Before (Polars):
After (Polarway):
2. Update Imports
Before:
import polars as pl
from polars import col, lit
After:
import polarway as pd
from polarway import col, lit
# Connect to server
client = pd.connect("localhost:50051")
3. Add Server Connection
Polarway requires a running gRPC server:
# Start server with Docker
docker run -d -p 50051:50051 polarway/server:latest
# Or build from source
cd polarway
cargo run -p polarway-grpc
4. Update Code Patterns
Simple Operations
Before (Polars):
df = pl.read_parquet("data.parquet")
result = (
    df
    .filter(pl.col("price") > 100)
    .select(["symbol", "price"])
    .group_by("symbol")
    .agg(pl.col("price").mean())  # Polars aggregates with expressions, not dicts
)
print(result)
After (Polarway):
df = pd.read_parquet("data.parquet")
result = (
    df
    .filter(pd.col("price") > 100)
    .select(["symbol", "price"])
    .group_by("symbol")
    .agg({"price": "mean"})
    .collect()  # Explicit collection
)
print(result)  # PyArrow Table
Lazy Operations
Before (Polars):
lazy_df = pl.scan_parquet("data.parquet")
result = (
    lazy_df
    .filter(pl.col("price") > 100)
    .collect()
)
After (Polarway):
# Lazy by default! No need for scan_
df = pd.read_parquet("data.parquet")
result = (
    df
    .filter(pd.col("price") > 100)
    .collect()
)
Multiple DataFrames
Before (Polars):
df1 = pl.read_parquet("data1.parquet")
df2 = pl.read_parquet("data2.parquet")
joined = df1.join(df2, on="id")
After (Polarway):
df1 = pd.read_parquet("data1.parquet")
df2 = pd.read_parquet("data2.parquet")
joined = df1.join(df2, on="id")
result = joined.collect() # Explicit collection
5. Async Operations (New in Polarway!)
Polarway enables true parallel operations:
Sequential (slow):
results = []
for i in range(100):
    df = pd.read_parquet(f"file_{i}.parquet")
    result = df.filter(pd.col("value") > 0).collect()
    results.append(result)
Parallel (fast):
import asyncio
async def process_files():
    async with pd.AsyncClient("localhost:50051") as client:
        # Read all files in parallel
        handles = await asyncio.gather(*[
            client.read_parquet(f"file_{i}.parquet")
            for i in range(100)
        ])
        # Process all in parallel
        results = await asyncio.gather(*[
            h.filter(pd.col("value") > 0).collect()
            for h in handles
        ])
        return results
# In a script, start the event loop explicitly; a bare top-level await only works in notebooks and async REPLs
results = asyncio.run(process_files())
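The speedup comes from overlapping waits rather than from the loop itself. The standard-library sketch below illustrates this with a simulated I/O call. `fake_read` is a hypothetical stand-in that sleeps instead of contacting a server, so it demonstrates I/O concurrency in the client only and says nothing about server-side parallelism.

```python
import asyncio
import time


async def fake_read(i: int, delay: float = 0.05) -> int:
    """Stand-in for a network-bound read; sleeps instead of doing I/O."""
    await asyncio.sleep(delay)
    return i * 2


async def sequential(n: int) -> list:
    # Each read waits for the previous one to finish.
    return [await fake_read(i) for i in range(n)]


async def concurrent(n: int) -> list:
    # gather schedules all coroutines at once and preserves input order.
    return await asyncio.gather(*(fake_read(i) for i in range(n)))


def timed(coro_fn, n: int):
    """Run coro_fn(n) in a fresh event loop and return (result, seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn(n))
    return result, time.perf_counter() - start


seq_result, seq_time = timed(sequential, 10)
con_result, con_time = timed(concurrent, 10)
```

With ten simulated reads, the sequential version takes roughly ten times the per-read delay, while the concurrent version takes roughly one. Real gains depend on how much of each request is spent waiting.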
Feature Mapping
Core Operations
| Operation | Polars | Polarway | Notes |
|---|---|---|---|
| Read Parquet | pl.read_parquet() | pd.read_parquet() | ✅ Same |
| Read CSV | pl.read_csv() | pd.read_csv() | ✅ Same |
| Select | df.select() | df.select() | ✅ Same |
| Filter | df.filter() | df.filter() | ✅ Same |
| Group By | df.group_by() | df.group_by() | ✅ Same |
| Join | df.join() | df.join() | ✅ Same |
| Collect | df.collect() (lazy) | df.collect() (always) | ⚠️ Always needed |
I/O Operations
| Operation | Polars | Polarway | Notes |
|---|---|---|---|
| Read local file | pl.read_parquet("file.parquet") | pd.read_parquet("file.parquet") | ✅ Same |
| Scan lazy | pl.scan_parquet() | pd.read_parquet() | ⚠️ Lazy by default |
| Write Parquet | df.write_parquet() | df.write_parquet() | ✅ Same |
| Write CSV | df.write_csv() | df.write_csv() | ✅ Same |
Expressions
| Operation | Polars | Polarway | Notes |
|---|---|---|---|
| Column reference | pl.col("name") | pd.col("name") | ✅ Same |
| Literal value | pl.lit(100) | pd.lit(100) | ✅ Same |
| String ops | .str.contains() | .str.contains() | ✅ Same |
| Datetime ops | .dt.year() | .dt.year() | ✅ Same |
| Aggregations | .mean(), .sum() | .mean(), .sum() | ✅ Same |
New in Polarway
| Operation | Polars | Polarway | Notes |
|---|---|---|---|
| Async client | N/A | pd.AsyncClient() | ✨ New |
| WebSocket | N/A | pd.from_websocket() | ✨ New |
| REST API | N/A | pd.read_rest_api() | ✨ New |
| Time-series | Manual | df.as_timeseries() | ✨ New |
| OHLCV resample | Manual | .resample_ohlcv() | ✨ New |
| Result monad | Exceptions | .is_ok(), .unwrap() | ✨ New |
Testing Your Migration
Unit Testing Strategy
Before (Polars):
def test_data_processing():
    df = pl.read_parquet("test_data.parquet")
    result = df.filter(pl.col("value") > 0)
    assert result.shape[0] > 0
After (Polarway):
def test_data_processing():
    df = pd.read_parquet("test_data.parquet")
    result = df.filter(pd.col("value") > 0).collect()
    assert result.num_rows > 0
Performance Benchmarking
import time
# Polars
start = time.time()
df = pl.read_parquet("data.parquet")
result = df.filter(pl.col("price") > 100)
polars_time = time.time() - start
# Polarway
start = time.time()
df = pd.read_parquet("data.parquet")
result = df.filter(pd.col("price") > 100).collect()
polarway_time = time.time() - start
print(f"Polars: {polars_time:.3f}s")
print(f"Polarway: {polarway_time:.3f}s")
print(f"Overhead: {((polarway_time/polars_time)-1)*100:.1f}%")
Expected Results:
- Small datasets (<100MB): 5-10% overhead (gRPC network cost)
- Large datasets (>1GB): Similar or better performance
- Streaming (>RAM): Polarway wins (Polars OOM)
- Concurrent operations: Polarway 10-100x faster
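A single timed run is noisy, especially for small inputs where the expected overhead is only a few percent. One way to get steadier numbers is to repeat each measurement and keep the fastest run. The helpers below are a standard-library sketch; `best_of` and `overhead_percent` are hypothetical names, and you would pass in callables that wrap your own Polars and Polarway pipelines.

```python
import time
from typing import Callable


def best_of(fn: Callable[[], object], repeats: int = 5) -> float:
    """Return the fastest wall-clock time of several runs, in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def overhead_percent(baseline_s: float, candidate_s: float) -> float:
    """Relative slowdown of candidate versus baseline, as a percentage."""
    if baseline_s <= 0:
        raise ValueError("baseline must be positive")
    return (candidate_s / baseline_s - 1) * 100
```

A negative result from `overhead_percent` means the candidate was faster than the baseline.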
Validation Checklist
- [ ] All imports updated (polars → polarway)
- [ ] Server connection established
- [ ] .collect() added where needed
- [ ] Result types handled (.unwrap() or error checks)
- [ ] Async operations converted to AsyncClient
- [ ] Tests updated and passing
- [ ] Performance benchmarks acceptable
- [ ] Memory usage checked for large datasets
Common Migration Issues
Issue 1: Missing .collect()
Error:
Solution:
Issue 2: Immediate Evaluation Expected
Error:
df = pd.read_parquet("data.parquet")
print(df.shape) # AttributeError: 'DataFrame' object has no attribute 'shape'
Solution:
df = pd.read_parquet("data.parquet")
result = df.collect()
print(result.num_rows, result.num_columns)
Issue 3: Exception Handling
Before:
try:
    df = pl.read_parquet("data.parquet")
except Exception as e:
    print(f"Error: {e}")
After:
result = pd.read_parquet("data.parquet").collect()
if result.is_err():
    print(f"Error: {result.unwrap_err()}")
Issue 4: Type Differences
Polars returns a polars.DataFrame, whereas collecting a Polarway handle returns a pyarrow.Table:
# Polars
df = pl.read_parquet("data.parquet")
print(type(df)) # <class 'polars.DataFrame'>
# Polarway
result = pd.read_parquet("data.parquet").collect()
print(type(result)) # <class 'pyarrow.Table'>
# Convert to pandas if needed
pandas_df = result.to_pandas()
Migration Tools
Automated Import Replacement
# Find all Polars imports
find . -name "*.py" -exec grep -l "import polars" {} \;
# Replace imports (macOS)
find . -name "*.py" -exec sed -i '' 's/import polars as pl/import polarway as pd/g' {} \;
# Replace imports (Linux)
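find . -name "*.py" -exec sed -i 's/import polars as pl/import polarway as pd/g' {} \;
# Note: these commands rewrite only the import line; existing pl.-prefixed calls must also be renamed to pd.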
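The sed commands above change only the import line, so code that still calls `pl.read_parquet(...)` would fail with a NameError after the rename. The sketch below rewrites both the import and the alias prefix in one pass. It is regex-based and therefore approximate: `rewrite_source` is a hypothetical helper, and it will also touch `pl.` inside strings or comments, so review the resulting diff before committing.

```python
import re

# Matches the import line, preserving any leading indentation.
IMPORT_RE = re.compile(r"^(\s*)import polars as pl\b", re.MULTILINE)
# Matches the alias used as a prefix, e.g. pl.col or pl.read_parquet.
ALIAS_RE = re.compile(r"\bpl\.")


def rewrite_source(source: str) -> str:
    """Rewrite the Polars import and, only if it was present, the pl. alias."""
    rewritten, count = IMPORT_RE.subn(r"\1import polarway as pd", source)
    if count == 0:
        return source  # leave files that never imported Polars untouched
    return ALIAS_RE.sub("pd.", rewritten)


before = 'import polars as pl\ndf = pl.read_parquet("data.parquet")\n'
after = rewrite_source(before)
```

This handles naming only. Adding `.collect()` calls and Result handling still has to be done by hand.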
Linting Rules
Add to your .pylintrc:
Pre-commit Hook
# .pre-commit-config.yaml
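repos:
  - repo: local
    hooks:
      - id: check-polars-imports
        name: Check for Polars imports
        # Restrict the search to Python files so documentation that mentions the old import does not fail the hook
        entry: bash -c 'if grep -rn --include="*.py" "import polars" .; then exit 1; fi'
        language: system
        pass_filenames: false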
Case Study: Real Migration
Small Project (500 lines)
Project: Data analysis script
Changes needed:
1. Update imports (5 minutes)
2. Add .collect() calls (10 minutes)
3. Update tests (15 minutes)
4. Performance validation (10 minutes)
Total time: ~40 minutes
Result: +7% overhead, but gained streaming capability
Large Codebase (50K lines)
Project: Financial data platform
Strategy:
1. Phase 1: Migrate read/write operations (week 1)
2. Phase 2: Migrate transformations (week 2-3)
3. Phase 3: Add async operations (week 4)
4. Phase 4: Enable streaming for large datasets (week 5)
Challenges:
- 200+ files to update
- Complex test suite
- Performance regression concerns
Solutions:
- Automated import replacement
- Incremental migration by module
- Parallel testing (Polars vs Polarway)
- Gradual rollout with feature flags
Results:
- Migration completed in 5 weeks
- 15% performance improvement on large datasets
- 50% memory reduction for streaming workflows
- 20x speedup for concurrent operations
Getting Help
Community Support
- GitHub Issues: https://github.com/EnkiNudimmud/polarway/issues
- Discussions: https://github.com/EnkiNudimmud/polarway/discussions
- Discord: Join our community
Migration Assistance
Need help with migration? Open an issue with:
- Your Polars code snippet
- Expected behavior
- Any errors encountered
- Performance requirements
We'll help you find the best Polarway equivalent!
See Also:
- Quick Reference - Common operations cheat sheet
- API Documentation - Complete API reference
- Architecture Guide - Understanding Polarway's design