Zem | Unified Data Pipeline Framework

Why choose Zem?

MCP Architecture

Domain logic lives in standalone, modular servers. They talk to the client over plain stdio, which avoids async complexity.
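As a rough illustration of stdio communication, the sketch below frames a JSON-RPC 2.0 request as a single newline-terminated line and unpacks the matching response. It is a minimal sketch, not Zem's actual client code: the helper names `encode_request` and `decode_response`, the tool name `ocr_extract`, and its `path` argument are all hypothetical.

```python
import json


def encode_request(request_id, method, params):
    """Serialize a JSON-RPC 2.0 request as one newline-terminated line."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": method,
        "params": params,
    }
    # json.dumps without indent never emits raw newlines, so one line == one message.
    return json.dumps(message) + "\n"


def decode_response(line):
    """Parse one response line and return its result, raising on a reported error."""
    message = json.loads(line)
    if "error" in message:
        raise RuntimeError(message["error"].get("message", "unknown error"))
    return message["result"]


# Hypothetical tool name and arguments, for illustration only.
line = encode_request(
    1, "tools/call", {"name": "ocr_extract", "arguments": {"path": "doc.pdf"}}
)
```

In a real client, the line would be written to the server subprocess's stdin and the reply read back from its stdout.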

ZenML Visualization

Every step is tracked and visualized automatically, so stakeholders can follow the whole pipeline as a graph.

Config-Driven

No more tangled code. Define or modify complex pipelines by simply editing a YAML file.
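To give a feel for what config-driven means, here is a minimal sketch that takes a pipeline definition (written as the Python dict a YAML loader might return) and derives the order in which steps would run. The schema is an assumption made for illustration: the keys `steps`, `server`, `tool`, and `needs`, and the step and tool names, are hypothetical and not Zem's documented format.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline config, shaped like the result of loading a YAML file.
PIPELINE = {
    "name": "pdf_to_sft",
    "steps": [
        {"name": "extract", "server": "ocr", "tool": "extract_text", "needs": []},
        {"name": "clean", "server": "nemo", "tool": "normalize", "needs": ["extract"]},
        {"name": "summarize", "server": "llm", "tool": "summarize", "needs": ["clean"]},
    ],
}


def execution_order(config):
    """Return step names in an order that respects each step's dependencies."""
    # Map every step to the set of steps that must finish before it starts.
    graph = {step["name"]: set(step.get("needs", [])) for step in config["steps"]}
    # static_order() raises graphlib.CycleError if the dependencies form a cycle.
    return list(TopologicalSorter(graph).static_order())


order = execution_order(PIPELINE)
```

Changing the pipeline then means editing the config entries rather than the code that walks them.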

GPU Accelerated

Leverage NVIDIA NeMo Curator for high-performance deduplication, PII removal, and text normalization.

Multimodal Ready

Process PDFs, images, audio, and unstructured documents with specialized OCR and voice engines.

Frontier LLMs

Integrate Ollama, OpenAI, or custom models for classification, summarization, and instruction generation.

10+ Production-Ready Modules

Each module is a standalone MCP server, battle-tested for real-world AI pipelines

NeMo Curator

GPU-accelerated deduplication, PII removal, and text normalization from NVIDIA

Tags: GPU, CUDA

Data-Juicer

Advanced cleaning, filtering, and quality assessment for massive datasets

Tags: Python, Dask

OCR Engine

Extract text from PDFs and images with Hugging Face VLMs and configurable preprocessing

Tags: Vision, PyMuPDF

Voice Transcription

Whisper-powered speech-to-text with automatic language detection

Tags: Whisper, Audio

LLM Integration

Classify, summarize, and extract insights using Ollama or OpenAI models

Tags: LLM, Ollama

Unstructured Parser

Parse complex documents including Word, PowerPoint, and HTML with layout preservation

Tags: Unstructured, Layout

Instruction Gen

Generate high-quality training instructions for supervised fine-tuning

Tags: SFT, Training

Data Sinks

Export to Hugging Face Hub, vector databases (Pinecone, Weaviate), or custom endpoints

Tags: HF Hub, VectorDB

I/O Operations

Load from S3, GCS, or local files, including Parquet datasets with partition-aware reading

Tags: S3, Parquet

Profiler

Deep performance insights with per-tool timing, memory usage, and execution graphs

Tags: Metrics, Profiling
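Per-tool timing and memory usage can be approximated with the standard library alone. The sketch below is not the Profiler module itself: the `profile` context manager and the record layout are hypothetical, and `tracemalloc` only sees allocations made by Python, not GPU or native memory.

```python
import time
import tracemalloc
from contextlib import contextmanager


@contextmanager
def profile(tool_name, records):
    """Append wall-clock time and peak Python allocation for one tool call."""
    # Not safe to nest: the inner block would stop tracing for the outer one.
    tracemalloc.start()
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        records.append({"tool": tool_name, "seconds": elapsed, "peak_bytes": peak})


records = []
with profile("normalize", records):
    # Stand-in workload; a real call would invoke a tool on an MCP server.
    data = [str(i).strip().lower() for i in range(10_000)]
```

Each entry in `records` then holds one tool's name, elapsed seconds, and peak bytes.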

Unified Orchestration

Zem bridges modular processing units (MCP servers) and production-grade orchestration (ZenML). Every execution is tracked, every artifact is versioned, and every step gets a descriptive label.

Dynamic Step Naming
Seamless Subprocess Management
JSON-RPC over Standard I/O
ZenML Stack Integration
Pass-by-Reference for Large Data
True Parallel DAG Execution
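Pass-by-reference for large data can be pictured as follows: instead of sending a large payload through the stdio channel, one step writes it to disk and hands the next step a small reference. This is a minimal sketch under assumed conventions; the helper names `to_reference` and `from_reference` and the reference fields `path`, `format`, and `count` are hypothetical.

```python
import json
import os
import tempfile


def to_reference(rows):
    """Write rows to a temporary JSON Lines file and return a small reference to it."""
    fd, path = tempfile.mkstemp(suffix=".jsonl")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row) + "\n")
    # Only this small dict would travel between steps, not the rows themselves.
    return {"path": path, "format": "jsonl", "count": len(rows)}


def from_reference(ref):
    """Load the rows that a reference points to."""
    with open(ref["path"], encoding="utf-8") as handle:
        return [json.loads(line) for line in handle]


ref = to_reference([{"text": "first document"}, {"text": "second document"}])
rows = from_reference(ref)
os.remove(ref["path"])  # Clean up the temporary file once the data is consumed.
```

The message stays a few dozen bytes however large the dataset is, which keeps the JSON-RPC messages small.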

[Architecture diagram: Zem Client, ZenML Wrapper, and the OCR, Voice, NeMo, and LLM servers]

The Future of Data Curation