ziplime deep dive · market data · async python
data.history() in ziplime
Polars, Parquet & asyncio instead of bcolz
How the Zipline fork rethought market data storage — and what it means for writing trading algorithms today.
When Quantopian shut down in 2020, most of us moved to local Zipline. Same API, same concepts — it seemed like
a natural transition. But the longer you worked with it, the harder it became to ignore a fundamental problem
in how Zipline stored and loaded data. That problem had a name: bcolz.
Why bcolz was a problem
Zipline stored market data in bcolz — a columnar binary format spread across thousands of directories on the filesystem, one per asset. The format was completely opaque: you couldn't just open a file and look at what was inside. The library itself has been effectively abandoned and barely works on Python 3.10+. During ingestion, a small type mismatch or a timezone inconsistency would cause the bundle to silently fail or produce a corrupted result — something you'd only discover when running an actual backtest.
Zipline + bcolz
- Abandoned library, broken on Python 3.10+
- Thousands of directories — one per asset
- Opaque binary format, nothing to inspect
- Ingestion errors are hard to diagnose
- Synchronous, single-threaded pipeline
ziplime + Parquet
- Open standard, readable by any tool
- One file per bundle — all assets, full history
- Predicate pushdown — only needed data hits disk
- Ingestion errors are clear and predictable
- Async pipeline built on asyncio + aiofiles
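That first point about readability is literal: a bundle is a standard Parquet file, so you can open it directly with Polars, DuckDB, or pandas, outside of any backtest. A quick sketch (the path here is illustrative, not the actual bundle layout):

    import polars as pl

    # inspect a bundle's raw data with no ziplime machinery involved
    df = pl.read_parquet("/path/to/bundle/data.parquet")
    print(df.schema)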
Bundle storage: BundleRegistry and metadata
In ziplime, every bundle is registered in the BundleRegistry and described
by a JSON metadata file. This is the single source of truth: given a bundle name, the system knows where the
data lives, which storage adapter to use, what date range it covers, and at what frequency it was ingested.
Here is what a real bundle metadata file looks like:
    // limex_us_minute_data_1774172679.json
    {
      "name": "limex_us_minute_data",
      "version": "1774172679",  // unix timestamp of ingestion run
      "bundle_storage_class": "...FileSystemParquetBundleStorage",
      "bundle_storage_data": {
        "base_data_path": "/Users/vyacheslav/.ziplime/data"
      },
      "start_date": "2021-01-01T00:00:00Z",
      "end_date": "2026-03-03T00:00:00Z",
      "trading_calendar_name": "XNYS",
      "frequency_seconds": 86400.0,  // 1 day
      "data_type": "MARKET_DATA"
    }
A few things worth noting here. The bundle version is the unix timestamp of the ingestion run — each new
ingestion creates a new version without automatically removing the previous one, so you can roll back if
something goes wrong. The bundle_storage_class field holds the fully
qualified class name of the storage adapter: ziplime dynamically instantiates it at load time rather than
hardcoding it. This means storage is pluggable — filesystem, S3, a database — as long as the adapter
implements the right interface.
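A minimal sketch of how that dynamic instantiation can work (illustrative only; load_storage is a hypothetical helper, not ziplime's actual code, though the field names match the metadata file above):

    import importlib
    import json

    def load_storage(metadata_path: str):
        # read the bundle metadata JSON shown above
        with open(metadata_path) as f:
            meta = json.load(f)

        # split "pkg.module.ClassName" into module path and class name
        module_path, _, class_name = meta["bundle_storage_class"].rpartition(".")
        cls = getattr(importlib.import_module(module_path), class_name)

        # construct the adapter with its stored keyword arguments
        return cls(**meta["bundle_storage_data"])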
data.history(): what changed in the API
From a signature standpoint the method became more flexible. The three most important differences from Zipline are: everything is async, the return type is a Polars DataFrame, and frequency is no longer a plain string.
Async API
All data-access methods in ziplime are asynchronous. This means every call requires
await — including data.history(),
data.current(), and order functions like
order_target_percent(). Your
handle_data function must therefore be declared with
async def:
    async def handle_data(context, data):
        df = await data.history(assets=[asset], fields=["close"], bar_count=20)
        price = await data.current(asset, "close")
        await order_target_percent(asset, 0.5)
This is not cosmetic — it reflects that the entire I/O pipeline in ziplime is built on asyncio and aiofiles. Backtesting and live trading share the same codebase, and async is what makes that practical at scale.
Parameters
| Parameter | Type | Description |
|---|---|---|
| assets | list[Asset] | One or more assets to query |
| fields | list[str] | List of fields: open, high, low, close, volume, etc. |
| bar_count | int | Number of bars to return |
| frequency | timedelta / Period | Bar frequency. Defaults to 1 day. Can be set via a timedelta or a Period string |
One notable change from Zipline: fields now takes a list, not a string. In
Zipline, fetching multiple fields required a separate history() call per
field or a panel request. In ziplime you request everything in one call and get it back in a single DataFrame.
Return type — pl.DataFrame
The method returns a native Polars pl.DataFrame, not a pandas object. The
schema is fixed and predictable:
Returned DataFrame schema: a date column holding the bar timestamp, a sid column identifying the asset, and one column per requested field (close, volume, and so on).
Row count is always bar_count × len(assets). If you request 20 bars for 3
assets, you get 60 rows — all assets laid out sequentially, each sorted by date within its block.
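A quick way to confirm the layout (illustrative; exact dtypes depend on the bundle):

    df = await data.history(
        assets=[context.aapl, context.msft, context.goog],
        fields=["close"],
        bar_count=20,
    )
    assert df.height == 60  # 20 bars × 3 assets
    print(df.columns)       # ['date', 'sid', 'close']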
Polars vs the pandas ecosystem
Returning a pl.DataFrame means libraries that expect a numpy array or
pandas DataFrame — TA-Lib being the most common — require an explicit conversion via
.to_numpy() or .to_pandas(). The
alternative is to use libraries written natively for Polars, such as
polars-talib, which operate directly on Polars expressions without any
intermediate conversion and naturally benefit from vectorized execution.
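And for simple indicators you may not need an external library at all; a moving average, for instance, is a one-liner in native Polars expressions (a sketch shown as an alternative to the TA-Lib route, rather than a polars-talib example):

    import polars as pl

    df = await data.history(assets=[asset], fields=["close"], bar_count=200)

    # 50-bar simple moving average computed natively in Polars, no numpy round-trip
    df = df.with_columns(
        pl.col("close").rolling_mean(window_size=50).alias("sma50")
    )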
Usage examples
Basic — single asset, single field
The simplest case: one asset, one field, default daily frequency. The result is a DataFrame with columns
date, sid, and
close. Calling .to_numpy() on a single
column gives you a plain 1-D array that works with any numpy-based indicator library.
    import talib

    df = await data.history(assets=[asset], fields=["close"], bar_count=200)

    # extract as numpy for use with TA-Lib or similar
    prices = df["close"].to_numpy()
    sma = talib.SMA(prices, timeperiod=50)
Multiple fields in a single call
When your indicator needs more than one price series — ATR, Stochastic, Bollinger Bands — you can fetch all
required fields at once. There is no performance penalty for requesting additional columns: Polars reads only
the column chunks it needs from the Parquet file, so adding high and
low to the request does not read volume or
open from disk.
    import talib

    df = await data.history(
        assets=[asset],
        fields=["high", "low", "close"],
        bar_count=14,
    )

    # each column extracted separately for TA-Lib
    highs = df["high"].to_numpy()
    lows = df["low"].to_numpy()
    closes = df["close"].to_numpy()
    atr = talib.ATR(highs, lows, closes, timeperiod=14)
Intraday bars via timedelta
For minute or hourly strategies, pass a datetime.timedelta as the
frequency argument. If the bundle stores minute-level data, ziplime will
aggregate it to the requested frequency on the fly — OHLCV aggregation (first open, max high, min low, sum
volume) happens inside the Polars lazy plan before any data lands in Python memory.
    import datetime

    # last 60 one-minute bars
    df = await data.history(
        assets=[asset],
        fields=["close"],
        bar_count=60,
        frequency=datetime.timedelta(minutes=1),
    )

    # last 24 hourly bars with volume
    df = await data.history(
        assets=[asset],
        fields=["close", "volume"],
        bar_count=24,
        frequency=datetime.timedelta(hours=1),
    )
Weekly bars via Period
Period is a ziplime-specific type that accepts calendar-aligned frequency
strings. Use it when you need weekly or monthly bars rather than a fixed time delta — a week is not always 7 ×
24 hours when you account for holidays and session boundaries, and
Period handles that correctly by aligning to the trading calendar specified
in the bundle metadata.
    from ziplime.constants.period import Period

    # 52 weekly bars — one year of weekly closes
    df = await data.history(
        assets=[asset],
        fields=["close"],
        bar_count=52,
        frequency=Period("1w"),
    )
Multiple assets in one call
All assets are returned in a single flat DataFrame. This is a meaningful departure from the Zipline panel
model where each asset occupied its own axis. In ziplime, assets share the same row space and are
distinguished by the sid column. To work with one asset at a time, filter by
sid. This layout integrates naturally with Polars group operations if you
want to compute indicators across all assets at once.
    import polars as pl

    df = await data.history(
        assets=[context.aapl, context.msft, context.goog],
        fields=["close", "volume"],
        bar_count=20,
    )
    # 60 rows total (20 × 3), columns: date, sid, close, volume
    # sorted by ["sid", "date"]

    # isolate a single asset
    aapl_df = df.filter(pl.col("sid") == context.aapl.sid)

    # or compute something across all assets at once
    last_close = df.group_by("sid").agg(pl.col("close").last())
What happens under the hood
When a data.history() call comes in, ziplime opens the bundle's Parquet file
in lazy mode via pl.scan_parquet() and immediately attaches filters for the
requested symbols and date range. Polars pushes these filters down to the Parquet reader at the row-group
level — data that falls outside the query is never read from disk at all.
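Conceptually, the lazy read looks something like this (a sketch of the technique, not ziplime's exact code; the path, sids, and dates are illustrative):

    from datetime import datetime

    import polars as pl

    start, end = datetime(2024, 1, 1), datetime(2024, 6, 30)

    lf = (
        pl.scan_parquet("/path/to/bundle/data.parquet")   # lazy: nothing read yet
        .filter(pl.col("sid").is_in([24, 5061]))          # requested assets
        .filter(pl.col("date").is_between(start, end))    # requested date range
        .select(["date", "sid", "close"])                 # only requested fields
    )
    df = lf.collect()  # filters pushed down to the Parquet row groups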
If the requested frequency differs from the stored one — say, the bundle holds minute bars but the algorithm
asks for daily — resampling happens inside the same lazy plan, before
collect() is called. The entire chain from file open to Python DataFrame is
a single optimized query, not a sequence of Python steps with intermediate structures in between.
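In Polars terms, the minute-to-daily resample is roughly a group_by_dynamic inside that same lazy chain (again a sketch of the technique under illustrative names, not the exact ziplime code):

    import polars as pl

    daily = (
        pl.scan_parquet("/path/to/bundle/data.parquet")  # minute bars
        .sort("sid", "date")
        .group_by_dynamic("date", every="1d", group_by="sid")
        .agg(
            pl.col("open").first(),   # first open of the session
            pl.col("high").max(),     # session high
            pl.col("low").min(),      # session low
            pl.col("close").last(),   # last close of the session
            pl.col("volume").sum(),   # total volume
        )
        .collect()  # the whole chain runs as one optimized query
    )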
For the hot path inside the simulation loop — where handle_data fires on
every bar — ziplime maintains an in-memory index of each asset's row positions in the loaded DataFrame.
Fetching a slice for one asset is an O(1) lookup, not a scan.
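One way to picture that index (purely illustrative; the real structure in ziplime may differ):

    # hypothetical: each sid maps to its contiguous (start_row, n_rows) block
    row_index: dict[int, tuple[int, int]] = {
        24: (0, 20),     # asset 24 occupies rows 0..19
        5061: (20, 20),  # asset 5061 occupies rows 20..39
    }

    def bars_for(df, sid: int):
        start, n_rows = row_index[sid]
        return df.slice(start, n_rows)  # O(1) positional slice, no scan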
Migrating from Zipline
Strategy logic transfers with minimal changes: add async/await throughout
and adjust any code that expected a pandas object from data.history(). The
one thing that cannot be migrated automatically is the bundle itself — bcolz bundles are incompatible with the
new format and need to be re-ingested from source data. It is a one-time cost, and the upside is that your
data is now in a format you can actually inspect and reason about.
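As a concrete before-and-after (the first call is the classic Zipline signature; the second follows the examples above):

    # Zipline: sync, pandas Series, one field per call
    prices = data.history(asset, "close", 20, "1d")

    # ziplime: async, Polars DataFrame, multiple fields per call
    df = await data.history(assets=[asset], fields=["close"], bar_count=20)
    prices = df["close"].to_numpy()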
Summary
ziplime replaced bcolz with Polars + Parquet and rebuilt the entire pipeline on asyncio. A bundle is now a
single readable file with explicit, self-describing metadata.
data.history() returns a native
pl.DataFrame, accepts multiple fields in one call, and supports any
frequency through timedelta or Period.
The result is a more honest architecture — less magic, more predictability, and a storage format that will
still work five years from now.
Open source, actively maintained. Source code, docs, and examples all in one place.
The author worked with Quantopian from 2017 to 2020 and subsequently moved to local Zipline. Since 2024, ziplime has been the primary framework for production backtesting and live trading. Technical details are based on reading the ziplime source code and are accurate as of the time of writing.