Metadata-Version: 2.4
Name: vmlx
Version: 1.4.1
Summary: Local AI inference for Apple Silicon — Text, Image, Video & Audio generation on Mac
Author-email: Jinho Jang <eric@jangq.ai>
License: Apache-2.0
Project-URL: Homepage, https://github.com/jjang-ai/vmlx
Project-URL: Documentation, https://github.com/jjang-ai/vmlx#readme
Project-URL: Repository, https://github.com/jjang-ai/vmlx
Project-URL: Downloads, https://github.com/jjang-ai/mlxstudio/releases
Project-URL: Issues, https://github.com/jjang-ai/vmlx/issues
Keywords: llm,mlx,apple-silicon,vllm,inference,transformers
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: MacOS
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: <3.15,>=3.11
Description-Content-Type: text/markdown
Requires-Dist: mlx>=0.29.0
Requires-Dist: mlx-lm>=0.31.2
Requires-Dist: mlx-vlm>=0.4.3
Requires-Dist: transformers>=4.40.0
Requires-Dist: tokenizers>=0.19.0
Requires-Dist: huggingface-hub>=0.23.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: pillow>=10.0.0
Requires-Dist: tqdm>=4.66.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: requests>=2.28.0
Requires-Dist: tabulate>=0.9.0
Requires-Dist: opencv-python-headless>=4.8.0
Requires-Dist: psutil>=5.9.0
Requires-Dist: fastapi>=0.100.0
Requires-Dist: uvicorn>=0.23.0
Requires-Dist: mcp>=1.0.0
Requires-Dist: jsonschema>=4.0.0
Requires-Dist: mlx-embeddings>=0.0.5
Provides-Extra: ui
Requires-Dist: gradio>=4.0.0; extra == "ui"
Requires-Dist: pytz>=2024.1; extra == "ui"
Provides-Extra: mxtq
Requires-Dist: jang-tools>=2.5.0; extra == "mxtq"
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Provides-Extra: vllm
Requires-Dist: vllm>=0.4.0; python_version < "3.13" and extra == "vllm"
Provides-Extra: vision
Requires-Dist: torch>=2.3.0; extra == "vision"
Requires-Dist: torchvision>=0.18.0; extra == "vision"
Provides-Extra: audio
Requires-Dist: mlx-audio>=0.2.9; extra == "audio"
Requires-Dist: sounddevice>=0.4.0; extra == "audio"
Requires-Dist: soundfile>=0.12.0; extra == "audio"
Requires-Dist: scipy>=1.10.0; extra == "audio"
Requires-Dist: numba>=0.57.0; extra == "audio"
Requires-Dist: tiktoken>=0.5.0; extra == "audio"
Requires-Dist: misaki[ja,zh]>=0.5.0; extra == "audio"
Requires-Dist: spacy>=3.7.0; extra == "audio"
Requires-Dist: num2words>=0.5.0; extra == "audio"
Requires-Dist: loguru>=0.7.0; extra == "audio"
Requires-Dist: phonemizer>=3.2.0; extra == "audio"
Requires-Dist: ordered_set>=4.1.0; extra == "audio"
Requires-Dist: cn2an>=0.5.0; extra == "audio"
Requires-Dist: fugashi>=1.3.0; extra == "audio"
Requires-Dist: unidic-lite>=1.0.0; extra == "audio"
Requires-Dist: jieba>=0.42.0; extra == "audio"
Provides-Extra: jang
Requires-Dist: jang-tools>=2.5.0; extra == "jang"
Provides-Extra: image
Requires-Dist: mflux>=0.16.0; extra == "image"

# vMLX

**The most complete MLX inference engine for Apple Silicon.**

Run local LLMs, VLMs, and image generation models with full GPU acceleration via MLX -- continuous batching, 5-layer cache stack, 14 tool call parsers, Anthropic + OpenAI API compatibility, vision/video/audio multimodal, image generation, and JANG adaptive quantization.

```bash
pip install vmlx
vmlx serve mlx-community/Qwen3-8B-4bit
```

<p align="center">
  <a href="https://pypi.org/project/vmlx/"><img src="https://img.shields.io/pypi/v/vmlx?style=flat-square&label=PyPI&color=%234B8BBE&logo=python&logoColor=white" alt="PyPI"></a>
  <a href="https://github.com/jjang-ai/mlxstudio/releases/latest"><img src="https://img.shields.io/github/v/release/jjang-ai/mlxstudio?style=flat-square&label=Desktop%20App&color=blue" alt="Desktop App"></a>
  <a href="https://github.com/jjang-ai/vmlx/blob/main/LICENSE"><img src="https://img.shields.io/badge/License-Apache%202.0-orange?style=flat-square" alt="License"></a>
</p>

> **Desktop app**: Download the full GUI experience from [MLX Studio](https://github.com/jjang-ai/mlxstudio/releases/latest) -- no terminal required.

---

## Stack layout (all-in-one, vendored)

vmlx now ships both the Swift and the Python implementations in one
repository. The Swift side controls the entire stack from the Metal
kernels up: one `Package.swift`, one `.build`, one `swift test` run:

```
/Users/eric/vmlx/
├── swift/                      ← Swift stack — SwiftPM, 21 local targets, 225 tests
│   ├── Package.swift           ← 21 targets, 5 external deps only
│   ├── Sources/
│   │   │ ─── MLX runtime (merged from mlx-swift @ vmlx-0.31.3) ───
│   │   ├── Cmlx/               ← 23 MB mlx + mlx-c submodule w/ Metal kernels
│   │   ├── MLX/                ← core tensor API
│   │   ├── MLXNN/              ← nn.Module + layers
│   │   ├── MLXFast/            ← SDPA, layer norm, rope
│   │   ├── MLXFFT/ MLXLinalg/ MLXOptimizers/ MLXRandom/
│   │   │
│   │   │ ─── vMLX layer (our code) ───
│   │   ├── vMLXLMCommon/       ← cache, batch, FlashMoE, TurboQuant
│   │   ├── vMLXLLM/            ← ~50 LLM models
│   │   ├── vMLXVLM/            ← ~15 VLM models
│   │   ├── vMLXEmbedders/      ← embedding models
│   │   ├── vMLXFlux*/          ← image/video diffusion
│   │   ├── vMLXEngine/         ← Engine, Settings, Stream, Cache, MCP, FlashMoE
│   │   ├── vMLXServer/         ← Hummingbird routes
│   │   ├── vMLXApp/            ← SwiftUI 5-mode app
│   │   ├── vMLXTheme/
│   │   └── vMLXCLI/            ← `vmlxctl` binary
│   └── PROGRESS.md             ← full multi-session changelog
├── engine/vmlx_engine → /Users/eric/mlx/vllm-mlx/vmlx_engine  (Python engine)
├── app/panel → /Users/eric/mlx/vllm-mlx/panel                 (Electron UI)
├── inference/                  ← benchmarks + configs
├── docs/                       ← architecture docs
├── tests/                      ← cross-matrix regression tests
└── PROGRESS-2026-04-13.md      ← top-level multi-session summary
```

**External Swift deps (5 only):** `swift-numerics`, `hummingbird`,
`swift-argument-parser`, `swift-transformers`, `Jinja`. Everything
else — including the MLX runtime — is vendored in-tree.

**Build the Swift stack:**

```bash
cd /Users/eric/vmlx/swift
swift build            # ~1 min clean, 21 targets (8 MLX + 13 vMLX)
swift test             # 225 tests, ~15s
swift run vmlxctl serve --model /path/to/model
```

**Install the Python stack:**

```bash
pip install vmlx
vmlx serve mlx-community/Qwen3-8B-4bit
```

See `PROGRESS-2026-04-13.md` for the full state of the Swift rewrite,
`swift/APP-SURFACE-AUDIT-2026-04-13.md` for per-surface REAL/STUB/MISSING
inventory, and `swift/SWIFT-ENGINE-ISSUES-AUDIT.md` for the GH issue
cross-reference against the Swift engine.

---

## Features

### Model Support (65+ Families)

- **Text LLMs** -- Qwen 2/2.5/3/3.5, Llama 3/3.1/3.2/3.3/4, Mistral/Mixtral/Codestral, Gemma 2/3, Phi-3/4, DeepSeek V2/V3/R1, GLM-4/4.7, Nemotron, MiniMax, Kimi, Step, and any mlx-lm model
- **Vision LLMs** -- Qwen-VL, Qwen2.5-VL, Qwen3.5-VL, Pixtral, InternVL, LLaVA, Gemma 3n, Phi-3-Vision
- **Mixture-of-Experts** -- Qwen 3.5 MoE, Mixtral, DeepSeek V2/V3, MiniMax M2.5, Llama 4
- **Hybrid SSM** -- Nemotron-H, Jamba, GatedDeltaNet (Mamba + Attention)
- **Image Generation** -- Flux Schnell/Dev/Kontext/Krea, Z-Image Turbo, Flux Klein (via mflux)
- **Audio** -- Kokoro TTS, Whisper STT, Qwen3-Audio (via mlx-audio)
- **JANG** -- Adaptive mixed-precision quantized models that stay quantized in GPU memory via native `QuantizedLinear`

### API Endpoints

OpenAI + Anthropic compatible -- point any SDK at your local server:

| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/v1/chat/completions` | OpenAI Chat Completions (streaming, tools, vision, structured output) |
| `POST` | `/v1/messages` | **Anthropic Messages API** -- drop-in Claude replacement |
| `POST` | `/v1/responses` | OpenAI Responses API (agentic format) |
| `POST` | `/v1/completions` | Text completions |
| `POST` | `/v1/images/generations` | Image generation (Flux/Z-Image, OpenAI format) |
| `POST` | `/v1/embeddings` | Text embeddings with dimension control |
| `POST` | `/v1/rerank` | Document reranking |
| `POST` | `/v1/audio/speech` | Text-to-speech (Kokoro) |
| `POST` | `/v1/audio/transcriptions` | Speech-to-text (Whisper) |
| `GET` | `/v1/models` | List loaded models |
| `GET` | `/health` | Server health, VRAM, queue length |
| `GET` | `/v1/cache/stats` | Cache hit rates and memory usage |
| `POST` | `/v1/cache/warm` | Pre-warm cache with prompts |
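
As a minimal sketch of calling the chat endpoint without any SDK, the helper below builds a `POST /v1/chat/completions` request with the standard library. It assumes the server listens on `localhost:8000` (the `--port` default shown in the CLI reference) and that `--api-key` is checked as a Bearer token, which this README does not state; `build_chat_request` and `send` are hypothetical helper names.

```python
import json
import urllib.request

# Assumption: the server was started with `vmlx serve ...` on port 8000.
BASE_URL = "http://localhost:8000"


def build_chat_request(prompt, model="local", max_tokens=256, api_key=None):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Assumption: the key passed to --api-key is sent as a Bearer token.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def send(request, timeout=120):
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With a server running, `send(build_chat_request("Hello!"))["choices"][0]["message"]["content"]` should return the reply text, provided the response follows the OpenAI shape.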

### Anthropic API Compatibility

Use the Anthropic Python/TypeScript SDK -- just change `base_url`:

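```python
from anthropic import Anthropic

# The SDK appends /v1/messages to base_url, so pass the server root.
client = Anthropic(base_url="http://localhost:8000", api_key="none")
response = client.messages.create(
    model="local",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello!"}],
)
# The reply is a list of content blocks; the first one holds the text.
print(response.content[0].text)
```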

- Full `/v1/messages` endpoint with streaming
- Anthropic tool calling format (auto-translated)
- Vision/multimodal via Anthropic content blocks
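
For clients that read the stream without the SDK, here is a rough sketch of collecting text from Anthropic-style server-sent events. It assumes the server emits the same event shapes as Anthropic's API (`content_block_delta` events carrying `text_delta` payloads), which this README implies but does not spell out; `iter_text_deltas` is a hypothetical helper and the sample lines are hand-written.

```python
import json


def iter_text_deltas(sse_lines):
    """Yield text fragments from Anthropic-style server-sent event lines."""
    for raw in sse_lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip "event:" lines and blank separators
        event = json.loads(line[len("data:"):])
        if event.get("type") != "content_block_delta":
            continue  # ignore message_start, message_stop, and similar events
        delta = event.get("delta", {})
        if delta.get("type") == "text_delta":
            yield delta.get("text", "")


# Hand-written sample stream, for illustration only.
sample = [
    'event: message_start',
    'data: {"type": "message_start", "message": {"role": "assistant"}}',
    '',
    'event: content_block_delta',
    'data: {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "Hel"}}',
    '',
    'event: content_block_delta',
    'data: {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "lo!"}}',
    '',
    'event: message_stop',
    'data: {"type": "message_stop"}',
]
text = "".join(iter_text_deltas(sample))
print(text)
```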

### Tool Calling (14 Parsers)

Auto-detected from model config -- no manual setup:

| Parser | Models |
|--------|--------|
| `qwen` | Qwen3, Qwen2.5, QwQ |
| `llama3` | Llama 3/3.1/3.2/3.3/4 |
| `mistral` | Mistral, Mixtral, Codestral |
| `hermes` | Hermes, NousResearch |
| `deepseek` | DeepSeek V2/V3 |
| `glm47` | GLM-4.7, ChatGLM4 |
| `minimax` | MiniMax M2.5 |
| `nemotron` | Nemotron, Llama-Nemotron |
| `granite` | IBM Granite |
| `functionary` | Functionary v3 |
| `xlam` | Salesforce xLAM |
| `kimi` | Moonshot Kimi |
| `step3p5` | StepFun Step-3.5 |
| `auto` | Auto-detect from config.json |
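
On the client side, a tool-calling round trip might look like the sketch below. It assumes the server accepts and returns the standard OpenAI `tools` / `tool_calls` shapes; `get_weather` is a made-up tool, the helper names are hypothetical, and the sample response is hand-written rather than captured from a server.

```python
import json

# Hypothetical tool definition for illustration; not part of vMLX.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_tool_request(prompt, tools, model="local"):
    """Build a chat completions payload that offers tools to the model."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": tools,
        "tool_choice": "auto",
    }


def extract_tool_calls(response):
    """Return (name, arguments) pairs from an OpenAI-format chat response."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        # In the OpenAI format, arguments arrive as a JSON string.
        calls.append((fn["name"], json.loads(fn["arguments"])))
    return calls


# Hand-written response in the OpenAI shape, for illustration only.
sample_response = {
    "choices": [{
        "message": {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "call_0",
                "type": "function",
                "function": {"name": "get_weather", "arguments": '{"city": "Seoul"}'},
            }],
        }
    }]
}
print(extract_tool_calls(sample_response))
```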

### Reasoning Models (4 Parsers)

- **Qwen3 / Qwen3.5** -- `<think>...</think>` blocks
- **DeepSeek-R1** -- DeepSeek reasoning format
- **GPT-OSS / GLM-4.7** -- thinking format
- **Phi-4-reasoning** -- reasoning content
- Enable/disable per request, reasoning effort control (low/medium/high)
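
To illustrate what separating `<think>...</think>` blocks from the visible answer involves, here is a small sketch. It is not the engine's parser: it works on a complete response string and does not handle streaming chunks or unclosed tags. `split_reasoning` is a hypothetical name.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text):
    """Separate <think>...</think> content from the visible answer."""
    reasoning = "\n".join(block.strip() for block in THINK_RE.findall(text))
    answer = THINK_RE.sub("", text).strip()
    return reasoning, answer


raw = "<think>2 + 2 is 4.</think>\nThe answer is 4."
reasoning, answer = split_reasoning(raw)
print(reasoning)
print(answer)
```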

### Vision & Multimodal

- **Images** -- PNG, JPEG, WebP via base64 or URL (up to 50 MB), detail levels (auto/low/high)
- **Video** -- MP4, MOV, WebM via base64 or URL (up to 200 MB), smart frame extraction (8-64 frames)
- **Audio** -- base64 or URL input (Qwen3-Audio)
- Dedicated MLLM cache for image/video embeddings
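
A base64 image can be attached to a chat message roughly as follows. This sketch assumes the OpenAI `image_url` content-part shape with a data URL, and it applies the 50 MB limit to the raw bytes using binary megabytes; both are assumptions, since this README does not say how the limit is measured. `image_part` and `vision_message` are hypothetical helpers.

```python
import base64

MAX_IMAGE_BYTES = 50 * 1024 * 1024  # assumption: limit counted on raw bytes


def image_part(data, mime="image/png", detail="auto"):
    """Wrap raw image bytes as an OpenAI-style image_url content part."""
    if len(data) > MAX_IMAGE_BYTES:
        raise ValueError(f"image is {len(data)} bytes; limit is {MAX_IMAGE_BYTES}")
    if detail not in ("auto", "low", "high"):
        raise ValueError(f"unknown detail level: {detail}")
    encoded = base64.b64encode(data).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime};base64,{encoded}", "detail": detail},
    }


def vision_message(prompt, image_bytes, **kwargs):
    """Build a user message with one text part and one image part."""
    return {
        "role": "user",
        "content": [{"type": "text", "text": prompt}, image_part(image_bytes, **kwargs)],
    }


# Placeholder bytes stand in for a real image file.
msg = vision_message("What is in this picture?", b"\x89PNG fake bytes", detail="low")
print(msg["content"][1]["image_url"]["url"][:30])
```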

### Continuous Batching

- Handle 32+ concurrent requests with dynamic slot allocation
- Configurable prefill and completion batch sizes
- Stream interval control
- Request pooling for shared GPU memory
- Rate limiting and API key authentication
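
The idea behind dynamic slot allocation can be shown with a toy simulation: each decode step emits one token per active sequence, and a freed slot is refilled before the next step. This is an illustration of the general technique, not the engine's scheduler; `run_continuous_batching` and the request sizes are made up.

```python
from collections import deque


def run_continuous_batching(requests, max_num_seqs):
    """Simulate decode steps; `requests` maps request id -> tokens to generate.

    Returns (total_steps, finish_step), where finish_step maps each request
    id to the step on which it produced its last token.
    """
    waiting = deque(requests.items())
    active = {}  # request id -> tokens still to generate
    finish_step = {}
    step = 0
    while waiting or active:
        # Fill any free slots before the next decode step.
        while waiting and len(active) < max_num_seqs:
            rid, remaining = waiting.popleft()
            active[rid] = remaining
        step += 1
        # One decode step emits one token for every active sequence.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] <= 0:
                del active[rid]
                finish_step[rid] = step
    return step, finish_step


steps, finished = run_continuous_batching({"a": 2, "b": 5, "c": 3, "d": 1}, max_num_seqs=2)
print(steps, finished)
```

In this example request `c` takes over the slot freed by `a` while `b` is still decoding, so all four requests finish in 6 steps. A static scheduler that waits for each pair to complete would need 5 + 3 = 8 steps.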

### 5-Layer Cache Stack

- **Prefix Cache** -- token-level semantic caching with LRU eviction
- **Paged KV Cache** -- block-aware, reduced fragmentation
- **Disk Cache** -- persistent spillover for large contexts
- **KV Quantization** -- q4/q8 compression at storage boundary (2-4x memory savings)
- **Hybrid SSM Cache** -- Mamba + Attention architectures
- Auto cache type selection, warming API, stats API
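
As a rough picture of prefix caching with LRU eviction, the toy class below stores state keyed by token sequences and returns the longest cached prefix of a new prompt. It is a sketch of the general technique only: the linear scan and the string placeholders for KV state are simplifications, and `PrefixCache` is not the engine's class.

```python
from collections import OrderedDict


class PrefixCache:
    """Toy token-level prefix cache with LRU eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()  # tuple(tokens) -> cached state

    def put(self, tokens, state):
        key = tuple(tokens)
        self._entries[key] = state
        self._entries.move_to_end(key)
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used

    def longest_prefix(self, tokens):
        """Return (matched_length, state) for the longest cached prefix."""
        tokens = tuple(tokens)
        best = None
        for key in self._entries:
            if len(key) <= len(tokens) and tokens[:len(key)] == key:
                if best is None or len(key) > len(best):
                    best = key
        if best is None:
            return 0, None
        self._entries.move_to_end(best)  # mark as recently used
        return len(best), self._entries[best]


cache = PrefixCache(capacity=2)
cache.put([1, 2, 3], "kv-for-system-prompt")
cache.put([9, 9], "kv-other")
hit_len, state = cache.longest_prefix([1, 2, 3, 4, 5])
cache.put([7], "kv-new")  # evicts [9, 9], now the least recently used entry
print(hit_len, state)
```

After a hit of length 3, only tokens 4 and 5 would need prefill; the cached state covers the rest.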

### Sampling Parameters

- Temperature, Top-P, Top-K, Min-P, Repetition Penalty
- Stop sequences, max tokens (up to 131072)
- Structured output (`json_object` and `json_schema` modes)
- Streaming with proper Unicode handling (emoji, CJK, Arabic)
- Usage stats in streaming (`stream_options.include_usage`)
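
A request combining structured output with streaming usage stats might be built like this. The sketch sticks to standard OpenAI field names (`response_format`, `stream_options`); request field names for Top-K, Min-P, and repetition penalty are not given in this README, so they are left out. `CITY_SCHEMA` and `build_structured_request` are hypothetical.

```python
import json

# Hypothetical schema for illustration.
CITY_SCHEMA = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["city", "population"],
}


def build_structured_request(prompt, schema, model="local", stream=False):
    """Build a chat payload that asks for JSON matching `schema`."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
        "top_p": 0.9,
        "max_tokens": 512,
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "city_info", "schema": schema},
        },
    }
    if stream:
        payload["stream"] = True
        # Ask for a final chunk that carries token usage.
        payload["stream_options"] = {"include_usage": True}
    return payload


payload = build_structured_request("Describe Seoul as JSON.", CITY_SCHEMA, stream=True)
print(json.dumps(payload, indent=2))
```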

### Image Generation

- Flux Schnell (4 steps), Dev (20 steps), Kontext, Krea, Klein
- Z-Image Turbo (4-bit, 8-bit, full precision)
- Configurable steps, guidance, size, seed, sampler
- Quantized model support (2-bit to 8-bit)
- OpenAI-compatible `/v1/images/generations` with `usage` field
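
A client for the image endpoint could look roughly like the sketch below. It assumes the OpenAI request fields (`prompt`, `size`, `n`, `response_format`) and a `b64_json` response; the field names for steps, guidance, seed, and sampler are not documented here, so they are omitted. The model name and helper names are placeholders.

```python
import base64


def build_image_request(prompt, model, size="1024x1024", n=1):
    """Build a payload for POST /v1/images/generations (OpenAI format)."""
    return {
        "model": model,
        "prompt": prompt,
        "size": size,
        "n": n,
        "response_format": "b64_json",  # assumption: base64 output is supported
    }


def decode_images(response):
    """Return raw image bytes for each item in an OpenAI-format images response."""
    return [base64.b64decode(item["b64_json"]) for item in response["data"]]


request_body = build_image_request("a lighthouse at dusk", model="your-flux-model")

# Hand-written response with placeholder bytes, for illustration only.
sample_response = {"data": [{"b64_json": base64.b64encode(b"fake-png").decode("ascii")}]}
images = decode_images(sample_response)
print(len(images), len(images[0]))
```

Each entry in `images` can then be written to disk with `open(path, "wb").write(...)`.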

### Model Conversion

- **16-bit to MLX** -- convert safetensors to MLX format
- **16-bit to quantized** -- 2/4/8-bit MLX quantization
- **GGUF to MLX** -- import GGUF models
- **MLX to JANG** -- adaptive mixed-precision (different bits per layer type)

---

## CLI Reference

```bash
vmlx serve <model> [OPTIONS]
  --port 8000
  --host 0.0.0.0
  --continuous-batching
  --enable-prefix-cache
  --cache-type [auto|kv|prefix|paged]
  --cache-memory-percent 0.30
  --max-num-seqs 32
  --prefill-batch-size 4
  --completion-batch-size 16
  --tool-call-parser [auto|qwen|llama|mistral|hermes|deepseek|glm47|minimax|nemotron|granite|functionary|xlam|kimi|step3p5]
  --reasoning-parser [auto|qwen3|deepseek_r1|gptoss]
  --enable-thinking
  --enable-auto-tool-choice
  --api-key <secret>
  --rate-limit 60
  --enable-jit
  --mcp-config mcp.json
  --served-model-name <alias>
  --log-level [INFO|DEBUG]

vmlx bench <model> [OPTIONS]
  --num-prompts 10
  --num-completions 50
  --batch-size 1
```
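
When launching the server from a script, the flags above can be assembled programmatically. The sketch below only uses flags listed in this reference; `serve_command` is a hypothetical helper, and the resulting list could be handed to `subprocess.Popen`.

```python
import shlex


def serve_command(model, port=8000, continuous_batching=False, max_num_seqs=None,
                  tool_call_parser=None, api_key=None):
    """Assemble a `vmlx serve` argv list from flags in the CLI reference."""
    argv = ["vmlx", "serve", model, "--port", str(port)]
    if continuous_batching:
        argv.append("--continuous-batching")
    if max_num_seqs is not None:
        argv += ["--max-num-seqs", str(max_num_seqs)]
    if tool_call_parser:
        argv += ["--tool-call-parser", tool_call_parser]
    if api_key:
        argv += ["--api-key", api_key]
    return argv


argv = serve_command("mlx-community/Qwen3-8B-4bit", continuous_batching=True,
                     max_num_seqs=32, tool_call_parser="auto")
print(shlex.join(argv))
```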

---

## Advanced Quantization

**JANG adaptive mixed-precision** assigns different bit widths per layer type for better quality at the same model size.

```bash
vmlx convert model --jang-profile JANG_3M
```

- Pre-quantized models: [JANGQ-AI on HuggingFace](https://huggingface.co/JANGQ-AI)
- Stays quantized in GPU memory via native `QuantizedLinear` + `quantized_matmul`
- Compatible with all cache layers (prefix, paged, disk, KV quant)
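
The "same model size" point can be checked with simple arithmetic: what matters for size is the parameter-weighted average bit width. The split below is a made-up 8B-parameter example, not a real JANG profile, and the calculation ignores the extra storage for quantization scales and biases.

```python
def average_bits(groups):
    """Weighted average bits per weight; `groups` is a list of (param_count, bits)."""
    total_params = sum(count for count, _ in groups)
    total_bits = sum(count * bits for count, bits in groups)
    return total_bits / total_params


def size_gib(groups):
    """Weight storage in GiB, ignoring scales and biases."""
    total_bits = sum(count * bits for count, bits in groups)
    return total_bits / 8 / 1024**3


# Hypothetical layer split for illustration only.
uniform_4bit = [(8_000_000_000, 4)]
mixed = [
    (1_000_000_000, 8),  # e.g. attention projections kept at higher precision
    (6_000_000_000, 3),  # e.g. MLP weights at lower precision
    (1_000_000_000, 6),  # e.g. embeddings and output head
]
print(average_bits(uniform_4bit), average_bits(mixed))
print(round(size_gib(uniform_4bit), 3), round(size_gib(mixed), 3))
```

Both configurations average 4 bits per weight and occupy the same space, but the mixed one spends more bits on the layers assumed to be most sensitive.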

---

## Project Structure

```
vmlx/
├── vmlx_engine/           # Python inference engine
├── panel/                 # Electron desktop app (MLX Studio)
│   ├── src/main/          # Main process (sessions, chat, tools, DB)
│   ├── src/renderer/      # React UI
│   └── bundled-python/    # Bundled Python 3.12 interpreter
├── tests/                 # Engine test suite (1894+ tests)
└── docs/                  # Documentation
```

---

## Links

| Resource | Link |
|---|---|
| **Desktop App** | [github.com/jjang-ai/mlxstudio](https://github.com/jjang-ai/mlxstudio) |
| **PyPI** | [pypi.org/project/vmlx](https://pypi.org/project/vmlx/) |
| **MLX Models** | [huggingface.co/mlx-community](https://huggingface.co/mlx-community) |
| **JANG Models** | [huggingface.co/JANGQ-AI](https://huggingface.co/JANGQ-AI) |
| **Website** | [vmlx.net](https://vmlx.net) |

---

## License

Apache License 2.0

---

<p align="center">
  Built by <a href="https://github.com/jjang-ai">Jinho Jang</a> &bull; <a href="mailto:eric@jangq.ai">eric@jangq.ai</a> &bull; <a href="https://jangq.ai">JANGQ AI</a>
</p>
