The Future of Data Curation

Build, orchestrate, and visualize high-performance data pipelines with Zem, the first unified framework designed for the MCP (Model Context Protocol) era.

pipeline.yaml
name: multimodal_ai_pipeline

servers:
  ocr: servers/ocr
  voice: servers/voice
  llm: servers/llm
  nemo: servers/nemo_curator
  sinks: servers/sinks

pipeline:
  - ocr.extract_pdf:
      file_path: documents/report.pdf
  - voice.transcribe:
      file_path: audio/interview.wav
  - nemo.pii_removal:
      anonymize_names: true
  - llm.classify_domain:
      categories: [Medical, Legal, Finance]
  - sinks.to_huggingface:
      repo_id: your-org/dataset
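
To make the config format concrete, here is a minimal Python sketch of how the `pipeline:` entries could be resolved into explicit steps. It assumes each step key has the form `<server alias>.<tool>` and that the alias must be declared under `servers:`, which is what the example above suggests. The `Step` type and `resolve_steps` function are hypothetical names for illustration, not part of Zem's API, and the input is a plain dict of the kind a YAML parser would return.

```python
from typing import Any, NamedTuple


class Step(NamedTuple):
    server: str        # alias from the `servers:` mapping, e.g. "ocr"
    server_path: str   # path the alias points to, e.g. "servers/ocr"
    tool: str          # tool name on that server, e.g. "extract_pdf"
    args: dict[str, Any]


def resolve_steps(config: dict[str, Any]) -> list[Step]:
    """Turn the `pipeline:` list into explicit (server, tool, args) steps."""
    servers = config["servers"]
    steps = []
    for entry in config["pipeline"]:
        # Each entry is a one-key mapping: {"<alias>.<tool>": {args}}
        (qualified_name, args), = entry.items()
        alias, _, tool = qualified_name.partition(".")
        if alias not in servers:
            raise KeyError(
                f"step {qualified_name!r} uses undeclared server {alias!r}"
            )
        steps.append(Step(alias, servers[alias], tool, args or {}))
    return steps


# Trimmed copy of pipeline.yaml, written as the dict a YAML parser would produce.
config = {
    "name": "multimodal_ai_pipeline",
    "servers": {"ocr": "servers/ocr", "llm": "servers/llm"},
    "pipeline": [
        {"ocr.extract_pdf": {"file_path": "documents/report.pdf"}},
        {"llm.classify_domain": {"categories": ["Medical", "Legal", "Finance"]}},
    ],
}

for step in resolve_steps(config):
    print(f"{step.server_path} -> {step.tool}({step.args})")
```

Failing early on an undeclared alias keeps typos in the YAML from surfacing only when a step is about to run.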

Why choose Zem?

MCP Architecture

Standalone, modular servers hold the domain logic. Each one talks to the client over plain stdio, which avoids the complexity of async networking.
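
As a rough illustration of stdio communication, the Python sketch below starts a child process and exchanges one newline-delimited JSON-RPC 2.0 message with it. The child is a stand-in echo server written for this example, not a real Zem or MCP server. The `call_tool` and `demo` helpers are hypothetical, and the initialization handshake that a full MCP session requires is omitted.

```python
import json
import subprocess
import sys

# Stand-in server: reads one JSON-RPC request per line from stdin and
# answers on stdout. A real server would dispatch to actual tools.
SERVER_CODE = r'''
import json, sys
for line in iter(sys.stdin.readline, ""):
    req = json.loads(line)
    resp = {"jsonrpc": "2.0", "id": req["id"], "result": {"echo": req["params"]}}
    sys.stdout.write(json.dumps(resp) + "\n")
    sys.stdout.flush()
'''


def call_tool(proc, request_id, method, params):
    """Send one request to the child process and block until its reply."""
    request = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": method,
        "params": params,
    }
    proc.stdin.write(json.dumps(request) + "\n")
    proc.stdin.flush()
    return json.loads(proc.stdout.readline())


def demo():
    proc = subprocess.Popen(
        [sys.executable, "-c", SERVER_CODE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
    try:
        return call_tool(
            proc,
            1,
            "tools/call",
            {"name": "extract_pdf",
             "arguments": {"file_path": "documents/report.pdf"}},
        )
    finally:
        proc.stdin.close()      # EOF ends the server loop
        proc.wait(timeout=10)


if __name__ == "__main__":
    print(demo())
```

The client blocks on a single `readline()` per request, so this simple case needs no event loop or socket handling.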

ZenML Visualization

Automatic tracking and visualization of every step, with pipeline graphs that are easy to share with stakeholders.

Config-Driven

No more tangled code. Define or modify complex pipelines by simply editing a YAML file.

GPU Accelerated

Leverage NVIDIA NeMo Curator for high-performance deduplication, PII removal, and text normalization.

Multimodal Ready

Process PDFs, images, audio, and unstructured documents with specialized OCR and voice engines.

Frontier LLMs

Integrate Ollama, OpenAI, or custom models for classification, summarization, and instruction generation.

10+ Production-Ready Modules

Each module is a standalone MCP server, battle-tested for real-world AI pipelines.

NeMo Curator

GPU-accelerated deduplication, PII removal, and text normalization from NVIDIA.

Tags: GPU, CUDA
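
To show what deduplication and PII removal mean at the record level, here is a small CPU-only Python sketch using the standard library. It is a deliberately simplified stand-in and does not use the NeMo Curator API: deduplication is exact matching on a hash of normalized text, and PII removal covers email addresses only. Name anonymization, as in `anonymize_names`, would need an entity recognizer. All function names are hypothetical.

```python
import hashlib
import re

# Simple email pattern for illustration; not a full address validator.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return " ".join(text.lower().split())


def deduplicate(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document, compared after normalization."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def redact_emails(text: str) -> str:
    """Replace every email address with a fixed placeholder."""
    return EMAIL_RE.sub("[EMAIL]", text)


docs = [
    "Contact  jane@example.com today",
    "contact jane@example.com TODAY",   # duplicate after normalization
    "Another document",
]
cleaned = [redact_emails(doc) for doc in deduplicate(docs)]
print(cleaned)
```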

Data-Juicer

Advanced cleaning, filtering, and quality assessment for massive datasets.

Tags: Python, Dask

OCR Engine

Extract text from PDFs and images with HuggingFace VLMs and configurable preprocessing.

Tags: Vision, PyMuPDF

Voice Transcription

Whisper-powered speech-to-text with automatic language detection.

Tags: Whisper, Audio

LLM Integration

Classify, summarize, and extract insights using Ollama or OpenAI models.

Tags: LLM, Ollama

Unstructured Parser

Parse complex documents including Word, PowerPoint, and HTML with layout preservation.

Tags: Unstructured, Layout

Instruction Gen

Generate high-quality training instructions for supervised fine-tuning.

Tags: SFT, Training

Data Sinks

Export to HuggingFace Hub, Vector DBs (Pinecone, Weaviate), or custom endpoints.

Tags: HF Hub, VectorDB

I/O Operations

Load from S3, GCS, local files, or Parquet with partition-aware reading.

Tags: S3, Parquet

Profiler

Deep performance insights with per-tool timing, memory usage, and execution graphs.

Tags: Metrics, Profiling
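
For a sense of what per-tool timing and memory tracking involve, here is a minimal Python sketch built on `time.perf_counter` and `tracemalloc` from the standard library. The `profile` context manager and the `timings` list are hypothetical names, not the Profiler module's interface. The memory figure covers Python heap allocations only, so GPU memory and native buffers are not counted.

```python
import time
import tracemalloc
from contextlib import contextmanager

timings = []  # one record per profiled tool call


@contextmanager
def profile(tool_name):
    """Record wall-clock time and peak Python heap usage for one tool call.

    Not safe to nest: the inner block would stop tracemalloc for the outer one.
    """
    tracemalloc.start()
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        timings.append(
            {"tool": tool_name, "seconds": elapsed, "peak_bytes": peak}
        )


with profile("ocr.extract_pdf"):
    pages = ["page text"] * 10_000  # placeholder workload

for record in timings:
    print(record)
```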

Unified Orchestration

Zem acts as the bridge between modular processing units (MCP Servers) and professional orchestration (ZenML). Every execution is tracked, every artifact is versioned, and every step is descriptively labeled.

  • Dynamic Step Naming
  • Seamless Subprocess Management
  • JSON-RPC over Standard I/O
  • ZenML Stack Integration
  • Pass-by-Reference for Large Data
  • True Parallel DAG Execution
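
The pass-by-reference item above can be pictured with a short Python sketch: instead of sending a large payload through the stdio channel, a step writes it to a file and passes along only a small reference. The `put` and `get` helpers and the shape of the reference dict are assumptions made for this example. How Zem actually stores and addresses artifacts is not specified here.

```python
import json
import os
import tempfile


def put(records, directory):
    """Write records to a JSON Lines file and return a small reference to it."""
    fd, path = tempfile.mkstemp(suffix=".jsonl", dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
    return {"ref": path, "count": len(records)}


def get(reference):
    """Load the records that a reference points to."""
    with open(reference["ref"], encoding="utf-8") as handle:
        return [json.loads(line) for line in handle]


with tempfile.TemporaryDirectory() as workdir:
    records = [{"id": i, "text": f"document {i}"} for i in range(1000)]
    reference = put(records, workdir)   # upstream step stores the data
    message = json.dumps(reference)     # only this small string is sent
    restored = get(json.loads(message)) # downstream step loads it on demand
    print(len(message), "bytes sent for", len(restored), "records")
```

The message between steps stays small no matter how large the dataset grows, which keeps a line-oriented transport responsive.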

[Architecture diagram: Zem Client, ZenML Wrapper, and the MCP servers OCR, Voice, NeMo, and LLM]

Ready to Transform Your Data Pipeline?

Join developers building the next generation of AI-ready datasets.

  • 10+ Production Modules
  • GPU Accelerated
  • 100% Config-Driven