Metadata-Version: 2.2
Name: arnio
Version: 1.0.0
Summary: C++ accelerated CSV preprocessing and data cleaning for pandas
Keywords: pandas,csv,data-cleaning,preprocessing,c++,performance
Author-Email: Anish Raj <anishrajyadav97@gmail.com>
License: MIT
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: C++
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Project-URL: Homepage, https://github.com/im-anishraj/arnio
Project-URL: Repository, https://github.com/im-anishraj/arnio
Project-URL: Issues, https://github.com/im-anishraj/arnio/issues
Project-URL: Changelog, https://github.com/im-anishraj/arnio/blob/main/CHANGELOG.md
Requires-Python: >=3.9
Requires-Dist: pandas>=1.5
Requires-Dist: numpy>=1.23
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: black>=24.0; extra == "dev"
Requires-Dist: ruff>=0.4.0; extra == "dev"
Requires-Dist: pre-commit>=3.0; extra == "dev"
Description-Content-Type: text/markdown

<div align="center">
  <br />
  <h1>⚡ arnio</h1>
    <b>Arnio</b> is an open-source, C++-accelerated data preprocessing library for Python.<br />
    <i>Built for speed and memory efficiency, and actively being optimized during GSSoC 2026.</i>
  <br />

  [![CI](https://github.com/im-anishraj/arnio/actions/workflows/ci.yml/badge.svg?style=for-the-badge&color=2ea44f)](https://github.com/im-anishraj/arnio/actions/workflows/ci.yml)
  [![PyPI](https://img.shields.io/pypi/v/arnio?style=for-the-badge&color=blue)](https://pypi.org/project/arnio/)
  [![Python](https://img.shields.io/pypi/pyversions/arnio?style=for-the-badge&color=black)](https://pypi.org/project/arnio/)
  [![License](https://img.shields.io/badge/license-MIT-blue.svg?style=for-the-badge)](LICENSE)

  <p>
    <a href="#-the-problem">The Problem</a> •
    <a href="#-the-solution-arnio">The Solution</a> •
    <a href="#-benchmarks-arnio-vs-pandas">Benchmarks</a> •
    <a href="#-getting-started">Quickstart</a>
  </p>
</div>

---

> **Pandas is incredible for analysis. For ingesting and cleaning raw CSVs, it can be slow and memory-hungry.** <br/>
> Arnio exists to do exactly one thing: intercept your messy CSVs, clean them natively in C++, and hand you a clean pandas DataFrame. Speed parity with pandas is still in progress; the Benchmarks section has the current numbers.

<p align="center">
  <img src="intro.gif" alt="arnio demo" width="80%" style="border-radius: 12px; border: 1px solid #30363D; box-shadow: 0 10px 30px rgba(0,0,0,0.5);">
</p>

## 🧨 The Problem

Every data project starts the same way. You load a CSV. It exhausts your RAM. You load it again in chunks. You find random nulls, inconsistent capitalization, and trailing whitespace. You write a 15-line script chaining `.apply()`, `.dropna()`, and `.str.strip()`. You copy-paste this script into your next 5 Jupyter notebooks.

It's slow. It's unreadable. It's error-prone.
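
For reference, the kind of hand-written script described above might look like the sketch below. The sample data, the column names, and the function name `clean_with_pandas` are illustrative assumptions, not taken from a real project.

```python
import io

import pandas as pd

# Hypothetical messy input: padded strings, mixed case, nulls, a duplicate row.
RAW = """name,region,revenue
 Alice ,NORTH,100.5
Bob, south ,
 Alice ,NORTH,100.5
,East,42.0
"""


def clean_with_pandas(raw: str) -> pd.DataFrame:
    """Typical ad-hoc cleaning chain written directly against pandas."""
    df = pd.read_csv(io.StringIO(raw))

    # Trim whitespace and lowercase every text column, one column at a time.
    for col in ["name", "region"]:
        df[col] = df[col].str.strip().str.lower()

    # Patch missing revenue, then drop rows that still contain nulls.
    df["revenue"] = df["revenue"].fillna(0.0)
    df = df.dropna()

    # Remove exact duplicate rows and tidy the index.
    return df.drop_duplicates().reset_index(drop=True)


df = clean_with_pandas(RAW)
print(df)
```

Every step works, but the column lists and the order of operations live in imperative code that has to be copied and adjusted for each new file.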

## ✨ The Solution: Arnio

**Arnio** replaces your messy ingestion script with a high-performance, declarative pipeline powered by `pybind11` and C++. 

| ❌ The Old Way (Pandas) | ⚡ The Arnio Way |
| :--- | :--- |
| **Memory Spikes**: Messy columns often load as Python strings first and are cast later. | **C++ Native**: Parses and infers types directly into columnar memory. |
| **Spaghetti Code**: `.apply()` lambda functions scattered across cells. | **Declarative**: A strict, readable list of cleaning steps. |
| **Slow Execution**: Python loops over strings to strip whitespace. | **Native Execution**: Cleaning primitives run in compiled C++, with speed tuning still in progress. |

---

## 🚀 Getting Started

If you have Python 3.9+, you are one `pip install` away from trying Arnio.

```bash
pip install arnio
```

### The 3-Step Workflow

Drop Arnio into the very top of your Jupyter Notebook or Python script.

```python
import arnio as ar

# 1. Load the raw file using the C++ core (no Python overhead)
frame = ar.read_csv("messy_sales_data.csv")

# 2. Define a strict, readable cleaning pipeline
clean_frame = ar.pipeline(frame, [
    ("strip_whitespace",),
    ("normalize_case", {"case_type": "lower"}),
    ("fill_nulls", {"value": 0.0, "subset": ["revenue"]}),
    ("drop_nulls",),
    ("drop_duplicates",),
])

# 3. Export to a clean pandas DataFrame and start your analysis!
df = ar.to_pandas(clean_frame)

# -> Now, use `df` exactly like you always have.
```

---

## 🏎️ Benchmarks

> Tested on Ubuntu, Python 3.12, 1M row CSV.  
> Run `make benchmark` to reproduce on your machine.

| Metric | pandas | arnio v1.0.0 |
|--------|--------|--------------|
| Execution Time | 4.73s | 5.75s |
| Peak RAM | 211MB | 212MB |

**Current state:** arnio's C++ CSV reader matches pandas on memory (212MB vs. 211MB peak), but the run is about 1.02s (roughly 22%) slower.  
Speed parity is the active engineering goal for v0.2.0. Specifically,  
`drop_duplicates` and `strip_whitespace` are unoptimized C++ and are  
the primary contributors to the gap.
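
If you want a quick measurement of your own, a minimal timing harness could look like the sketch below. It is not what `make benchmark` runs; the `measure` helper and the sample workload are assumptions for illustration. Note that `tracemalloc` only sees allocations made through Python's allocator, so memory used inside a native C++ extension will not show up here. For that, a process-level measurement is needed.

```python
import time
import tracemalloc
from typing import Any, Callable, Tuple


def measure(fn: Callable[[], Any]) -> Tuple[float, float]:
    """Return (elapsed seconds, peak traced MiB) for a single call of fn."""
    tracemalloc.start()
    start = time.perf_counter()
    fn()
    elapsed = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return elapsed, peak_bytes / (1024 * 1024)


# Hypothetical workload; swap in your own load-and-clean function.
elapsed, peak_mib = measure(lambda: [str(i).strip() for i in range(100_000)])
print(f"elapsed={elapsed:.3f}s peak={peak_mib:.1f}MiB")

# Relative gap implied by the table above (arnio 5.75s vs. pandas 4.73s).
pandas_s, arnio_s = 4.73, 5.75
slowdown = arnio_s / pandas_s - 1
print(f"arnio is {slowdown:.1%} slower")  # arnio is 21.6% slower
```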

**[Help close the gap →](https://github.com/im-anishraj/arnio/issues)**

<details>
<summary><b>🔍 Want to peek at a massive file without loading it?</b></summary>
<br>

Arnio lets you scan a large CSV to infer its schema without loading the full dataset into memory.

```python
import arnio as ar

schema = ar.scan_csv("100GB_file.csv")
print(schema) 
# {'id': 'INT64', 'name': 'STRING', 'is_active': 'BOOL'}
```
</details>
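
To make the idea of schema scanning concrete, here is a pure-Python sketch that infers column types from a sample of rows. It is not Arnio's implementation: the sampling strategy, the `FLOAT64` type name, and the helper names `infer_type` and `scan_schema` are assumptions chosen for illustration.

```python
import csv
import io
import re
from itertools import islice
from typing import Dict, Iterable, List, TextIO

_INT_RE = re.compile(r"[+-]?\d+")


def _is_float(text: str) -> bool:
    try:
        float(text)
    except ValueError:
        return False
    return True


def infer_type(values: Iterable[str]) -> str:
    """Pick the narrowest type that fits every non-empty cell."""
    cells = [v.strip() for v in values if v.strip()]
    if not cells:
        return "STRING"
    if all(c.lower() in ("true", "false") for c in cells):
        return "BOOL"
    if all(_INT_RE.fullmatch(c) for c in cells):
        return "INT64"
    if all(_is_float(c) for c in cells):
        return "FLOAT64"
    return "STRING"


def scan_schema(stream: TextIO, sample_rows: int = 1000) -> Dict[str, str]:
    """Read the header plus at most sample_rows rows, never the whole file."""
    reader = csv.reader(stream)
    header = next(reader)
    columns: List[List[str]] = [[] for _ in header]
    for row in islice(reader, sample_rows):
        for column, cell in zip(columns, row):
            column.append(cell)
    return {name: infer_type(col) for name, col in zip(header, columns)}


sample = io.StringIO("id,name,is_active\n1,Ann,true\n2,Bob,false\n")
print(scan_schema(sample))
# {'id': 'INT64', 'name': 'STRING', 'is_active': 'BOOL'}
```

Because only a bounded number of rows is read, memory use stays flat regardless of file size, at the cost of possibly missing an odd value later in the file.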

---

## 🛠️ What's Inside?

Arnio ships with a growing library of C++ cleaning primitives:

- `drop_nulls`: Remove rows that contain null values.
- `fill_nulls`: Patch holes with scalar values.
- `drop_duplicates`: Deduplicate rows based on exact matches.
- `strip_whitespace`: Trim leading and trailing whitespace from string columns.
- `normalize_case`: Force `upper` or `lower` case on string columns.
- `rename_columns` & `cast_types`: Shape your data exactly how you need it.
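
As a rough guide to what these steps mean, the sketch below expresses five of them in pure Python over a list of dictionaries and dispatches them from the same tuple format used in the quickstart. It is a reference for the semantics only, under the assumption that the steps behave as their names suggest. The `STEPS` registry and `run_pipeline` are hypothetical names, and none of this is Arnio's C++ code.

```python
from typing import Any, Callable, Dict, List, Optional

Row = Dict[str, Any]
Rows = List[Row]


def strip_whitespace(rows: Rows) -> Rows:
    return [{k: v.strip() if isinstance(v, str) else v for k, v in r.items()}
            for r in rows]


def normalize_case(rows: Rows, case_type: str = "lower") -> Rows:
    if case_type not in ("lower", "upper"):
        raise ValueError(f"unsupported case_type: {case_type!r}")
    convert = str.lower if case_type == "lower" else str.upper
    return [{k: convert(v) if isinstance(v, str) else v for k, v in r.items()}
            for r in rows]


def fill_nulls(rows: Rows, value: Any, subset: Optional[List[str]] = None) -> Rows:
    # Only columns named in subset are filled; None means every column.
    return [{k: value if v is None and (subset is None or k in subset) else v
             for k, v in r.items()} for r in rows]


def drop_nulls(rows: Rows) -> Rows:
    return [r for r in rows if all(v is not None for v in r.values())]


def drop_duplicates(rows: Rows) -> Rows:
    seen, unique = set(), []
    for r in rows:
        key = tuple(r.items())  # exact match on every column, first one wins
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


STEPS: Dict[str, Callable[..., Rows]] = {
    "strip_whitespace": strip_whitespace,
    "normalize_case": normalize_case,
    "fill_nulls": fill_nulls,
    "drop_nulls": drop_nulls,
    "drop_duplicates": drop_duplicates,
}


def run_pipeline(rows: Rows, steps: list) -> Rows:
    """Apply each ("name",) or ("name", {kwargs}) step in order."""
    for name, *rest in steps:
        rows = STEPS[name](rows, **(rest[0] if rest else {}))
    return rows


data = [
    {"name": " Alice ", "revenue": 100.5},
    {"name": "BOB", "revenue": None},
    {"name": "alice", "revenue": 100.5},
    {"name": None, "revenue": 1.0},
]
cleaned = run_pipeline(data, [
    ("strip_whitespace",),
    ("normalize_case", {"case_type": "lower"}),
    ("fill_nulls", {"value": 0.0, "subset": ["revenue"]}),
    ("drop_nulls",),
    ("drop_duplicates",),
])
print(cleaned)
```

Step order matters: filling `revenue` before `drop_nulls` keeps the `bob` row, while the row with a missing `name` is still removed.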

---

## 🤝 Contributing

Arnio is a GSSoC 2026 project. We welcome contributors of all levels.

- **No C++ required**: Add pipeline steps in pure Python
- **C++ contributors**: Help optimize `drop_duplicates` and `strip_whitespace`, the current performance bottlenecks
- **Docs & examples**: Always needed

[Read the Contribution Guide →](CONTRIBUTING.md) | 
[Browse open issues →](https://github.com/im-anishraj/arnio/issues)

---

## 🗺️ Roadmap

| Version | Focus | Status |
|---------|-------|--------|
| v1.0.0 | Stable release, cross-platform wheels, Google Colab support, CI/CD pipeline | ✅ Released |
| v0.2.0 | C++ pipeline optimization, speed parity with pandas | 🔨 Active |
| v0.3.0 | Chunked processing, Parquet/JSON support | 📋 Planned |

<div align="center">
<br>
<b>Stop fighting your data. Let Arnio clean it.</b>
<br><br>
</div>
