Introduction

Zem is a unified data pipeline framework built for high-performance data curation. It uses the **Model Context Protocol (MCP)** for modularity and **ZenML** for orchestration and visualization.

Data pipelines are traditionally hard to manage and scale. Zem addresses this by making pipelines entirely config-driven: steps are declared in a YAML file rather than wired together in code.

Installation

Zem is designed to work with uv, a fast Python package manager. The package itself can be installed with pip:

pip install xfmr-zem

YAML Configuration

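Define a pipeline by registering MCP servers under servers and listing steps under pipeline. Each step names a tool as server.tool (for example, nemo.pii_removal) and can supply its input. Zem handles the data flow between steps.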

name: my_pipeline
servers:
  nemo: src/xfmr_zem/servers/nemo_curator/server.py
pipeline:
  - nemo.pii_removal:
      input:
        data: [{"text": "Hello World"}]
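To illustrate how a step key such as nemo.pii_removal relates to the servers mapping, here is a minimal Python sketch. It is not Zem's loader: the function name resolve_steps is hypothetical, and the dictionary simply mirrors the YAML above as it would look after parsing.

```python
# Hypothetical sketch: resolve "server.tool" step keys against the servers map.
# The dict mirrors the YAML above after parsing; Zem's real loader may differ.
config = {
    "name": "my_pipeline",
    "servers": {"nemo": "src/xfmr_zem/servers/nemo_curator/server.py"},
    "pipeline": [
        {"nemo.pii_removal": {"input": {"data": [{"text": "Hello World"}]}}},
    ],
}


def resolve_steps(config):
    """Return (server_path, tool_name, step_input) for each pipeline step."""
    resolved = []
    for step in config["pipeline"]:
        # Each list item is a single-key mapping: {"server.tool": {...}}.
        (step_key, body), = step.items()
        server_alias, tool_name = step_key.split(".", 1)
        if server_alias not in config["servers"]:
            raise KeyError(f"unknown server alias: {server_alias!r}")
        step_input = (body or {}).get("input", {})
        resolved.append((config["servers"][server_alias], tool_name, step_input))
    return resolved


for server_path, tool, step_input in resolve_steps(config):
    print(server_path, tool, step_input)
```

Failing early on an unknown server alias is a design choice of this sketch; it surfaces typos in the config before any server is started.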

Running Pipelines

Use the zem CLI to run your pipelines. It automatically initializes MCP servers and tracks execution in ZenML.

zem run my_config.yaml

MCP Servers

MCP servers are modular units that contain the actual processing logic (e.g., NeMo Curator, DataJuicer). They communicate with the Zem client over standard input/output (stdio) using JSON-RPC.
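As a rough illustration of what travels over stdio, the sketch below builds a single JSON-RPC 2.0 request using MCP's tools/call method. The helper name build_tool_call is hypothetical, and the argument shape (a data list) is an assumption that mirrors the input in the YAML example; Zem's client most likely relies on an MCP SDK rather than hand-built messages, and a real session starts with an initialize handshake that is omitted here.

```python
import json


def build_tool_call(request_id, tool_name, arguments):
    """Serialize a JSON-RPC 2.0 'tools/call' request as one line of text."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    # The stdio transport is newline-delimited: one JSON message per line.
    # json.dumps escapes newlines inside strings, so the line stays intact.
    return json.dumps(message) + "\n"


# Assumed arguments, mirroring the "input" block of the YAML example.
line = build_tool_call(1, "pii_removal", {"data": [{"text": "Hello World"}]})
print(line, end="")
```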

ZenML Integration

Zem automatically converts your YAML config into a ZenML pipeline. To inspect runs in the ZenML dashboard, start it locally with:

uv run zenml up --port 8871
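The idea of turning an ordered list of steps into one runnable pipeline can be sketched in plain Python. This is only a conceptual illustration, not ZenML's API or Zem's conversion code: build_pipeline and the two step functions are hypothetical stand-ins, and real steps would call tools on MCP servers.

```python
from functools import reduce


def build_pipeline(steps):
    """Compose step callables so each one receives the previous step's output."""
    def run(data):
        return reduce(lambda current, step: step(current), steps, data)
    return run


# Hypothetical stand-ins for tools; real steps would call MCP servers.
def strip_whitespace(records):
    return [{**r, "text": r["text"].strip()} for r in records]


def lowercase(records):
    return [{**r, "text": r["text"].lower()} for r in records]


pipeline = build_pipeline([strip_whitespace, lowercase])
result = pipeline([{"text": "  Hello World "}])
print(result)
```

Keeping each step a function from records to records is what lets the order in the config fully determine the data flow.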