Metadata-Version: 2.4
Name: qualipilot
Version: 2.0.0
Summary: Production-grade data quality checks with pluggable LLM reporting (AWS Bedrock, Ollama, OpenAI-compatible).
Project-URL: Homepage, https://github.com/Sarvesh-GanesanW/dataqualitychecker
Project-URL: Repository, https://github.com/Sarvesh-GanesanW/dataqualitychecker
Project-URL: Issues, https://github.com/Sarvesh-GanesanW/dataqualitychecker/issues
Author-email: Sarvesh Ganesan <sarveshganesan99@gmail.com>
License: Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
License-File: LICENSE
Keywords: bedrock,dask,data-quality,data-validation,ollama,pandas,polars
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Software Development :: Quality Assurance
Classifier: Typing :: Typed
Requires-Python: >=3.11
Requires-Dist: httpx>=0.27
Requires-Dist: pandas>=2.2
Requires-Dist: polars>=1.12
Requires-Dist: pyarrow>=17.0
Requires-Dist: pydantic-settings>=2.5
Requires-Dist: pydantic>=2.9
Requires-Dist: pyyaml>=6.0
Requires-Dist: rich>=13.8
Requires-Dist: tenacity>=9.0
Requires-Dist: typer>=0.12
Provides-Extra: all
Requires-Dist: boto3>=1.35; extra == 'all'
Requires-Dist: botocore>=1.35; extra == 'all'
Requires-Dist: dask[dataframe]>=2024.8; extra == 'all'
Requires-Dist: duckdb>=1.5; extra == 'all'
Requires-Dist: matplotlib>=3.9; extra == 'all'
Requires-Dist: numpy>=2.0; extra == 'all'
Requires-Dist: openai>=1.50; extra == 'all'
Requires-Dist: pyiceberg>=0.9; extra == 'all'
Requires-Dist: rapidfuzz>=3.9; extra == 'all'
Provides-Extra: bedrock
Requires-Dist: boto3>=1.35; extra == 'bedrock'
Requires-Dist: botocore>=1.35; extra == 'bedrock'
Provides-Extra: dask
Requires-Dist: dask[dataframe]>=2024.8; extra == 'dask'
Provides-Extra: dev
Requires-Dist: hypothesis>=6.112; extra == 'dev'
Requires-Dist: moto>=5.0; extra == 'dev'
Requires-Dist: mypy>=1.11; extra == 'dev'
Requires-Dist: pre-commit>=3.8; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.24; extra == 'dev'
Requires-Dist: pytest-cov>=5.0; extra == 'dev'
Requires-Dist: pytest>=8.3; extra == 'dev'
Requires-Dist: ruff>=0.6; extra == 'dev'
Requires-Dist: types-pyyaml; extra == 'dev'
Provides-Extra: duckdb
Requires-Dist: duckdb>=1.5; extra == 'duckdb'
Provides-Extra: iceberg
Requires-Dist: pyarrow>=17; extra == 'iceberg'
Requires-Dist: pyiceberg>=0.9; extra == 'iceberg'
Provides-Extra: linking
Requires-Dist: numpy>=2.0; extra == 'linking'
Requires-Dist: rapidfuzz>=3.9; extra == 'linking'
Provides-Extra: ollama
Provides-Extra: openai
Requires-Dist: openai>=1.50; extra == 'openai'
Provides-Extra: spark
Requires-Dist: pyspark>=3.5; extra == 'spark'
Provides-Extra: viz
Requires-Dist: matplotlib>=3.9; extra == 'viz'
Description-Content-Type: text/markdown

# qualipilot

Production-grade data quality checker for Python. Runs structural and
statistical checks on any tabular dataset (CSV / Parquet / JSON / Pandas /
Polars / Dask / cuDF) and, optionally, asks an LLM — **AWS Bedrock**,
**Ollama**, or any OpenAI-compatible endpoint — to narrate the findings.

* swap engines with one flag (Polars default, Pandas/Dask/cuDF on demand)
* swap LLM providers the same way (`--llm bedrock|ollama|openai|none`)
* one-click install, docker-compose for local runs, terraform for Lambda
* typed Pydantic results, deterministic JSON output, exit-code severity
  gate for CI pipelines

---

## Install

### One-click (recommended)

```bash
# macOS / Linux
./install.sh --all        # core + every optional extra
./install.sh --bedrock    # core + boto3
./install.sh --dev        # editable + dev + pre-commit

# Windows PowerShell
.\install.ps1 -Extras all
```

### Manual

```bash
pip install qualipilot                 # core
pip install "qualipilot[bedrock]"      # + boto3 for AWS Bedrock
pip install "qualipilot[ollama]"       # + httpx (already core)
pip install "qualipilot[dask]"         # + dask[dataframe]
pip install "qualipilot[all]"          # everything except cuDF
```

cuDF (GPU) needs the RAPIDS conda channel — see
[docs.rapids.ai/install](https://docs.rapids.ai/install).

---

## Quickstart (CLI)

```bash
qualipilot check data.csv \
    --engine polars \
    --range amount=0,100000 \
    --output reports/data.quality.html \
    --llm bedrock \
    --model anthropic.claude-3-5-haiku-20241022-v1:0 \
    --region us-east-1 \
    --fail-on warn
```

* `--output` can be `.json`, `.html`, or `.md`; format is inferred.
* `--fail-on {ok,warn,error}` decides when the CLI returns a non-zero
  exit code — wire it straight into CI.
* All flags have `--config` equivalents; see `examples/config.yaml`.

## Quickstart (Python)

```python
import pandas as pd
from qualipilot import DataQualityChecker, QualipilotConfig
from qualipilot.models.config import CheckConfig, ColumnRange, LLMConfig

df = pd.read_csv("orders.csv")

config = QualipilotConfig(
    engine="polars",
    checks=CheckConfig(
        column_ranges={"amount": ColumnRange(min=0, max=100_000)},
    ),
    llm=LLMConfig(
        provider="bedrock",
        model="anthropic.claude-3-5-haiku-20241022-v1:0",
        region="us-east-1",
    ),
)

report = DataQualityChecker(df, config).run()
print(report.to_json())
print(report.llm_report)
```

---

## What it checks

| Check | Default | Description |
|---|---|---|
| `missing_values` | on  | per-column null counts + percentage |
| `duplicates`     | on  | global duplicate rows (subset-aware) |
| `data_types`     | on  | dtype rollup per column |
| `outliers`       | on  | IQR rule, Q1/Q3 computed in one pass |
| `ranges`         | on  | user-supplied `[min, max]` per column |
| `cardinality`    | on  | distinct count + top-10 values |
| `freshness`      | off | max-timestamp vs `freshness_max_age_hours` |

Each check returns a typed `CheckResult` with severity `ok / warn /
error`, a duration, a JSON-safe payload, and any captured exception.

---

## Engines

| Engine | When to use |
|---|---|
| `polars` (default) | in-memory data up to ~10 GB — 8× faster than pandas |
| `pandas`  | legacy integrations that need pandas-native output |
| `dask`    | larger-than-memory data or multi-worker clusters |
| `cudf`    | single-node GPU acceleration (RAPIDS required) |

`--engine auto` inspects the input object and picks the fastest safe
backend (Polars for single-node, Dask for already-Dask frames, cuDF
when a GPU frame is handed in).

---

## LLM providers

| Provider | `--llm` | Required |
|---|---|---|
| None (default) | `none` | nothing |
| AWS Bedrock (Converse API) | `bedrock` | `boto3`, IAM `bedrock:Converse` |
| Ollama | `ollama` | running ollama server |
| OpenAI-compatible | `openai` | base URL + API key |

Bedrock uses the **Converse API**, so the same code path works for
Anthropic Claude, Meta Llama, Mistral, Cohere, etc. — you just change
`model=...`.

---

## Deploy

### Docker (local Ollama stack)

```bash
docker compose -f docker/docker-compose.yml up --build
```

This brings up `ollama` and a `qualipilot` container wired to it, and
runs the sample check end-to-end.

### AWS Lambda (container image)

```bash
cd deploy/terraform
terraform init
terraform apply -var project=qualipilot -var aws_profile=sre-tea

# build + push the image to the ECR repo terraform just made
aws ecr get-login-password | docker login --username AWS --password-stdin \
    $(terraform output -raw ecr_repository_url | cut -d/ -f1)
docker build -f ../../docker/Dockerfile.lambda -t qualipilot-lambda:latest ../..
docker tag qualipilot-lambda:latest "$(terraform output -raw ecr_repository_url):latest"
docker push "$(terraform output -raw ecr_repository_url):latest"

aws lambda update-function-code \
    --function-name qualipilot \
    --image-uri "$(terraform output -raw ecr_repository_url):latest"
```

Invoke with:

```bash
aws lambda invoke \
    --function-name qualipilot \
    --payload '{"s3_uri":"s3://my-bucket/events.parquet"}' \
    response.json
```

Report lands at `s3://my-bucket/reports/events.quality.json`.

---

## Development

```bash
./install.sh --dev
make lint typecheck test
```

* Ruff for lint + format, MyPy in strict mode, pytest with coverage.
* Pre-commit runs the same locally before every commit.
* `pytest -m integration` runs tests that need real AWS/Bedrock credentials.

---

## Record linkage / probabilistic dedup

Beyond exact duplicates, qualipilot ships an in-house Fellegi-Sunter
linker — no external splink dependency. Polars blocking, rapidfuzz
string distance, numpy EM. 1M rows in ~10 s on a laptop.

```bash
qualipilot link customers.csv \
    --id customer_id \
    --compare "name:fuzzy:0.92,0.75" \
    --compare "postcode:exact" \
    --block "postcode" \
    --threshold 0.9
```

Full details: [`docs/LINKING.md`](docs/LINKING.md).

## Docs

* [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md) — module layout + data flow
* [`docs/LINKING.md`](docs/LINKING.md) — probabilistic dedup / linkage
* [`docs/DEEP_DIVE.md`](docs/DEEP_DIVE.md) — audit of the v1 codebase
* [`docs/DEPLOY.md`](docs/DEPLOY.md) — cloud + on-prem deployment notes
* [`docs/MIGRATION.md`](docs/MIGRATION.md) — upgrading from v1.x
* [`docs/RUST_CONVERSION.md`](docs/RUST_CONVERSION.md) — should we port to Rust? (tldr: hybrid, not rewrite)

## License

MIT.
