
data.history() in ziplime: Polars, Parquet & asyncio instead of bcolz

How the Zipline fork rethought market data storage — and what it means for writing trading algorithms today.

When Quantopian shut down in 2020, most of us moved to local Zipline. Same API, same concepts — it seemed like a natural transition. But the longer you worked with it, the harder it became to ignore a fundamental problem in how Zipline stored and loaded data. That problem had a name: bcolz.


Why bcolz was a problem

Zipline stored market data in bcolz — a columnar binary format spread across thousands of directories on the filesystem, one per asset. The format was completely opaque: you couldn't just open a file and look at what was inside. The library itself has been effectively abandoned and barely works on Python 3.10+. During ingestion, a small type mismatch or a timezone inconsistency would cause the bundle to silently fail or produce a corrupted result — something you'd only discover when running an actual backtest.

Zipline + bcolz

  • Abandoned library, broken on Python 3.10+
  • Thousands of directories — one per asset
  • Opaque binary format, nothing to inspect
  • Ingestion errors are hard to diagnose
  • Synchronous, single-threaded pipeline

ziplime + Parquet

  • Open standard, readable by any tool
  • One file per bundle — all assets, full history
  • Predicate pushdown — only needed data hits disk
  • Ingestion errors are clear and predictable
  • Async pipeline built on asyncio + aiofiles

Bundle storage: BundleRegistry and metadata

In ziplime, every bundle is registered in the BundleRegistry and described by a JSON metadata file. This is the single source of truth: given a bundle name, the system knows where the data lives, which storage adapter to use, what date range it covers, and at what frequency it was ingested.

Here is what a real bundle metadata file looks like:

// limex_us_minute_data_1774172679.json
{
  "name": "limex_us_minute_data",
  "version": "1774172679",           // unix timestamp of ingestion run
  "bundle_storage_class": "...FileSystemParquetBundleStorage",
  "bundle_storage_data": {
    "base_data_path": "/Users/vyacheslav/.ziplime/data"
  },
  "start_date": "2021-01-01T00:00:00Z",
  "end_date": "2026-03-03T00:00:00Z",
  "trading_calendar_name": "XNYS",
  "frequency_seconds": 86400.0,      // 1 day
  "data_type": "MARKET_DATA"
}

A few things worth noting here. The bundle version is the unix timestamp of the ingestion run — each new ingestion creates a new version without automatically removing the previous one, so you can roll back if something goes wrong. The bundle_storage_class field holds the fully qualified class name of the storage adapter: ziplime dynamically instantiates it at load time rather than hardcoding it. This means storage is pluggable — filesystem, S3, a database — as long as the adapter implements the right interface.
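
As a rough illustration, dynamic instantiation from a fully qualified class name needs nothing more than importlib. This is a hypothetical sketch, not ziplime's actual loading code; the function name and structure are invented:

hypothetical · loading a storage adapter
import importlib
import json

def load_bundle_storage(metadata_path: str):
    """Instantiate the storage adapter named in a bundle's metadata."""
    with open(metadata_path) as f:
        meta = json.load(f)

    # "pkg.module.ClassName" -> ("pkg.module", "ClassName")
    module_path, class_name = meta["bundle_storage_class"].rsplit(".", 1)
    storage_cls = getattr(importlib.import_module(module_path), class_name)

    # storage-specific kwargs pass straight through from the metadata
    return storage_cls(**meta["bundle_storage_data"])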


data.history(): what changed in the API

From a signature standpoint the method became more flexible. The three most important differences from Zipline are: everything is async, the return type is a Polars DataFrame, and frequency is no longer a plain string.

Async API

All data-access methods in ziplime are asynchronous. This means every call requires await — including data.history(), data.current(), and order functions like order_target_percent(). Your handle_data function must therefore be declared with async def:

handle_data.py
async def handle_data(context, data):
    df    = await data.history(assets=[asset], fields=["close"], bar_count=20)
    price = await data.current(asset, "close")
    await order_target_percent(asset, 0.5)

This is not cosmetic — it reflects that the entire I/O pipeline in ziplime is built on asyncio and aiofiles. Backtesting and live trading share the same codebase, and async is what makes that practical at scale.

Parameters

Parameter   Type                 Description
assets      list[Asset]          One or more assets to query
fields      list[str]            Fields to fetch: open, high, low, close, volume, etc.
bar_count   int                  Number of bars to return
frequency   timedelta / Period   Bar frequency; defaults to 1 day. Accepts a datetime.timedelta or a Period string

One notable change from Zipline: fields now takes a list, not a string. In Zipline, fetching multiple fields required a separate history() call per field or a panel request. In ziplime you request everything in one call and get it back in a single DataFrame.
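
For comparison, the two call styles side by side (a sketch; assumes asset is in scope and the default daily frequency applies):

fields · Zipline vs ziplime
# Zipline: one string field per call, synchronous
closes  = data.history(asset, "close", 20, "1d")
volumes = data.history(asset, "volume", 20, "1d")

# ziplime: a list of fields, one awaited call, one DataFrame
df = await data.history(assets=[asset], fields=["close", "volume"], bar_count=20)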

Return type — pl.DataFrame

The method returns a native Polars pl.DataFrame, not a pandas object. The schema is fixed and predictable:

Returned DataFrame schema

Column   Type      Description
date     Date      trading date or bar timestamp
sid      Int64     numeric asset identifier
close    Float64   one column per requested field

Rows are sorted by ["sid", "date"].

Row count is always bar_count × len(assets). If you request 20 bars for 3 assets, you get 60 rows — all assets laid out sequentially, each sorted by date within its block.
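
If you prefer the classic wide layout (one column per asset), the long format pivots in one line. A sketch using the Polars 1.x pivot API:

long to wide · pivot
# one row per date, one close column per sid (column names are the sids)
wide = df.pivot(on="sid", index="date", values="close")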


Usage examples

Basic — single asset, single field

The simplest case: one asset, one field, default daily frequency. The result is a DataFrame with columns date, sid, and close. Calling .to_numpy() on a single column gives you a plain 1-D array that works with any numpy-based indicator library.

single asset · single field
import talib  # TA-Lib Python bindings

df = await data.history(assets=[asset], fields=["close"], bar_count=200)

# extract as numpy for use with TA-Lib or similar
prices = df["close"].to_numpy()
sma    = talib.SMA(prices, timeperiod=50)

Multiple fields in a single call

When your indicator needs more than one price series — ATR, Stochastic, Bollinger Bands — you can fetch all required fields at once. There is no performance penalty for requesting additional columns: Polars reads only the column chunks it needs from the Parquet file, so adding high and low to the request does not read volume or open from disk.

multiple fields · ATR example
df = await data.history(
    assets=[asset],
    fields=["high", "low", "close"],
    bar_count=14
)

# each column extracted separately for TA-Lib
highs  = df["high"].to_numpy()
lows   = df["low"].to_numpy()
closes = df["close"].to_numpy()

atr = talib.ATR(highs, lows, closes, timeperiod=14)

Intraday bars via timedelta

For minute or hourly strategies, pass a datetime.timedelta as the frequency argument. If the bundle stores minute-level data, ziplime will aggregate it to the requested frequency on the fly — OHLCV aggregation (first open, max high, min low, sum volume) happens inside the Polars lazy plan before any data lands in Python memory.

intraday · timedelta frequency
import datetime

# last 60 one-minute bars
df = await data.history(
    assets=[asset],
    fields=["close"],
    bar_count=60,
    frequency=datetime.timedelta(minutes=1)
)

# last 24 hourly bars with volume
df = await data.history(
    assets=[asset],
    fields=["close", "volume"],
    bar_count=24,
    frequency=datetime.timedelta(hours=1)
)
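
The aggregation rules described above map directly onto a Polars group_by_dynamic. The following is an illustrative sketch of that kind of resample, not ziplime's actual query; the file path is assumed, and the date column here is the bar timestamp (a Datetime in minute bundles):

sketch · minute-to-hour resample in a lazy plan
import polars as pl

hourly = (
    pl.scan_parquet("/path/to/bundle.parquet")   # assumed path
      .sort("sid", "date")
      .group_by_dynamic("date", every="1h", group_by="sid")
      .agg(
          pl.col("open").first(),    # first open of the hour
          pl.col("high").max(),      # max high
          pl.col("low").min(),       # min low
          pl.col("close").last(),    # last close
          pl.col("volume").sum(),    # summed volume
      )
      .collect()                     # nothing reaches Python until here
)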

Weekly bars via Period

Period is a ziplime-specific type that accepts calendar-aligned frequency strings. Use it when you need weekly or monthly bars rather than a fixed time delta — a week is not always 7 × 24 hours when you account for holidays and session boundaries, and Period handles that correctly by aligning to the trading calendar specified in the bundle metadata.

weekly bars · Period type
from ziplime.constants.period import Period

# 52 weekly bars — one year of weekly closes
df = await data.history(
    assets=[asset],
    fields=["close"],
    bar_count=52,
    frequency=Period("1w")
)

Multiple assets in one call

All assets are returned in a single flat DataFrame. This is a meaningful departure from the Zipline panel model where each asset occupied its own axis. In ziplime, assets share the same row space and are distinguished by the sid column. To work with one asset at a time, filter by sid. This layout integrates naturally with Polars group operations if you want to compute indicators across all assets at once.

multiple assets · flat DataFrame
import polars as pl

df = await data.history(
    assets=[context.aapl, context.msft, context.goog],
    fields=["close", "volume"],
    bar_count=20
)
# 60 rows total (20 × 3), columns: date, sid, close, volume
# sorted by ["sid", "date"]

# isolate a single asset
aapl_df = df.filter(pl.col("sid") == context.aapl.sid)

# or compute something across all assets at once
last_close = df.group_by("sid").agg(pl.col("close").last())

What happens under the hood

When a data.history() call comes in, ziplime opens the bundle's Parquet file in lazy mode via pl.scan_parquet() and immediately attaches filters for the requested symbols and date range. Polars pushes these filters down to the Parquet reader at the row-group level — data that falls outside the query is never read from disk at all.
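
In Polars terms, the query ziplime builds looks roughly like this (an illustrative sketch with assumed file layout and filter values, not the literal source):

sketch · lazy scan with pushdown filters
import datetime
import polars as pl

lf = (
    pl.scan_parquet("/path/to/bundle.parquet")   # assumed path
      .filter(
          pl.col("sid").is_in([24, 5061])        # requested assets
          & pl.col("date").is_between(
              datetime.date(2024, 1, 2),
              datetime.date(2024, 3, 1),
          )
      )
      .select("date", "sid", "close")            # only requested fields
)

# filters and projection are pushed into the Parquet reader;
# non-matching row groups are skipped entirely
df = lf.collect()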

If the requested frequency differs from the stored one — say, the bundle holds minute bars but the algorithm asks for daily — resampling happens inside the same lazy plan, before collect() is called. The entire chain from file open to Python DataFrame is a single optimized query, not a sequence of Python steps with intermediate structures in between.

For the hot path inside the simulation loop — where handle_data fires on every bar — ziplime maintains an in-memory index of each asset's row positions in the loaded DataFrame. Fetching a slice for one asset is an O(1) lookup, not a scan.
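
A minimal sketch of such an index (hypothetical, not ziplime's internals): since rows are sorted by ["sid", "date"], each asset occupies one contiguous block, so a dict of row offsets is enough:

sketch · O(1) per-asset slicing
import polars as pl

# build once after loading: sid -> (first_row, row_count)
offsets: dict[int, tuple[int, int]] = {}
start = 0
for part in df.partition_by("sid", maintain_order=True):
    offsets[part["sid"][0]] = (start, part.height)
    start += part.height

# per-bar hot path: constant-time slice, no scan
row, count = offsets[24]          # 24 = some sid
asset_df = df.slice(row, count)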

"The entire chain from file open to Python DataFrame is a single optimized Polars query — not a sequence of Python steps with intermediate structures in between."

Migrating from Zipline

Strategy logic transfers with minimal changes: add async/await throughout and adjust any code that expected a pandas object from data.history(). The one thing that cannot be migrated automatically is the bundle itself — bcolz bundles are incompatible with the new format and need to be re-ingested from source data. It is a one-time cost, and the upside is that your data is now in a format you can actually inspect and reason about.
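
If parts of your code still expect pandas, Polars can hand over a copy while you migrate incrementally (to_pandas() requires pandas and pyarrow to be installed):

migration · pandas bridge
df = await data.history(assets=[asset], fields=["close"], bar_count=20)
pdf = df.to_pandas()   # pandas.DataFrame copy for legacy code paths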

Summary

ziplime replaced bcolz with Polars + Parquet and rebuilt the entire pipeline on asyncio. A bundle is now a single readable file with explicit, self-describing metadata. data.history() returns a native pl.DataFrame, accepts multiple fields in one call, and supports any frequency through timedelta or Period. The result is a more honest architecture — less magic, more predictability, and a storage format that will still work five years from now.

github.com/Limex-com/ziplime

Open source, actively maintained. Source code, docs, and examples all in one place.

The author worked with Quantopian from 2017 to 2020 and subsequently moved to local Zipline. Since 2024, ziplime has been the primary framework for production backtesting and live trading. Technical details are based on reading the ziplime source code and are accurate as of the time of writing.