Introduction
Zem is a unified data pipeline framework built for high-performance data curation. It uses the **Model Context Protocol (MCP)** for modularity and **ZenML** for orchestration and visualization.
Data pipelines are traditionally hard to manage and scale. Zem addresses this by making pipelines entirely config-driven: each pipeline is declared in a YAML file instead of being written as orchestration code.
Installation
Zem is designed to work with uv, the fast Python package manager. The package can also be installed with plain pip:
pip install xfmr-zem
YAML Configuration
Define your pipeline steps by mapping servers to tools. Zem handles the data flow between steps.
name: my_pipeline
servers:
  nemo: src/xfmr_zem/servers/nemo_curator/server.py
pipeline:
  - nemo.pii_removal:
      input:
        data: [{"text": "Hello World"}]
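To make the server-to-tool mapping concrete, here is a minimal Python sketch of how a step key such as nemo.pii_removal could be resolved against the servers table. It is an illustration, not Zem's actual loader: the function name resolve_steps is hypothetical, the config is written as a Python dict mirroring the YAML example (so no YAML parser is required), and the nesting of input under the step is assumed from that example.

```python
# Hypothetical helper, not part of Zem's API. The dict below mirrors the
# YAML example; the nesting of "input" under the step is an assumption.
config = {
    "name": "my_pipeline",
    "servers": {"nemo": "src/xfmr_zem/servers/nemo_curator/server.py"},
    "pipeline": [
        {"nemo.pii_removal": {"input": {"data": [{"text": "Hello World"}]}}},
    ],
}


def resolve_steps(config):
    """Return one (server, tool, script, step_input) tuple per pipeline step."""
    servers = config["servers"]
    resolved = []
    for step in config["pipeline"]:
        # Each list item is a single-key mapping: "<server>.<tool>" -> options.
        (key, options), = step.items()
        server, _, tool = key.partition(".")
        if not tool:
            raise ValueError(f"step {key!r} is not of the form 'server.tool'")
        if server not in servers:
            raise KeyError(f"step {key!r} refers to undeclared server {server!r}")
        step_input = (options or {}).get("input", {})
        resolved.append((server, tool, servers[server], step_input))
    return resolved


steps = resolve_steps(config)
for server, tool, script, step_input in steps:
    print(f"{server}.{tool} -> {script}")
```

Validating the server prefix up front means a typo in a step name fails before any server process is started.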
Running Pipelines
Use the zem CLI to run your pipelines. It automatically initializes MCP servers and tracks execution in ZenML.
zem run my_config.yaml
MCP Servers
MCP servers are modular units that contain the actual processing logic (e.g., NeMo Curator, DataJuicer). They communicate with the Zem client over standard I/O (stdio) using JSON-RPC.
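For readers unfamiliar with the wire format, the sketch below shows what a tool invocation looks like as a JSON-RPC 2.0 message on MCP's stdio transport, where each message is a single newline-terminated line of JSON and tools are invoked with the tools/call method. The helper names make_tool_call and parse_response are hypothetical, and the pii_removal arguments are borrowed from the configuration example. Whether Zem builds messages by hand or through an MCP SDK is not stated here, so treat this as an illustration of the protocol rather than of Zem's internals.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC request ids must be unique within a session


def make_tool_call(tool, arguments):
    """Serialize an MCP `tools/call` request as one newline-terminated line."""
    request = {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    # json.dumps without `indent` emits no raw newlines, so the message
    # stays on a single line as the stdio transport requires.
    return json.dumps(request) + "\n"


def parse_response(line):
    """Return the `result` of a JSON-RPC response line, or raise on `error`."""
    message = json.loads(line)
    if "error" in message:
        err = message["error"]
        raise RuntimeError(f"JSON-RPC error {err['code']}: {err['message']}")
    return message["result"]


line = make_tool_call("pii_removal", {"data": [{"text": "Hello World"}]})
print(line, end="")
```

A real session also begins with an initialize handshake before any tools/call request is sent; that step is omitted here for brevity.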
ZenML Integration
Zem automatically converts your YAML config into a ZenML pipeline. You can visualize the results by running:
uv run zenml up --port 8871