Metadata-Version: 2.4
Name: ssondo
Version: 0.1.0
Summary: S-SONDO: Lightweight audio embeddings from self-supervised knowledge distillation
Author: Mohammed Ali El Adlouni, Aurian Quelennec, Pierre Chouteau, Geoffroy Peeters, Slim Essid
License-Expression: MIT
Project-URL: Homepage, https://github.com/MedAliAdlouni/ssondo_temp
Project-URL: Repository, https://github.com/MedAliAdlouni/ssondo_temp
Project-URL: Issues, https://github.com/MedAliAdlouni/ssondo_temp/issues
Keywords: audio,embeddings,knowledge-distillation,self-supervised,sound
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Multimedia :: Sound/Audio :: Analysis
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: torch>=2.0
Requires-Dist: torchaudio>=2.0
Requires-Dist: numpy>=1.24
Requires-Dist: einops>=0.7
Requires-Dist: timm>=0.4.12
Requires-Dist: huggingface-hub>=0.20

# S-SONDO

Lightweight audio embeddings from self-supervised knowledge distillation.

S-SONDO provides compact audio models (MobileNetV3, DyMN, ERes2Net) trained via knowledge distillation from large audio foundation models (MATPAC, M2D). Extract general-purpose audio embeddings with a single function call.

**Paper:** *S-SONDO: Self-Supervised Knowledge Distillation for General Audio Foundation Models* (ICASSP 2026)

## Installation

```bash
pip install ssondo
```

## Quick Start

```python
import torchaudio
from ssondo import get_ssondo

# Load a pretrained model (auto-downloads from Hugging Face Hub)
model = get_ssondo("matpac-mobilenetv3")

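# Load audio, downmix to mono, and resample to the expected 32 kHz
x, sr = torchaudio.load("audio.wav")
x = x.mean(dim=0, keepdim=True)  # (channels, time) -> (1, time)
if sr != 32000:
    x = torchaudio.functional.resample(x, orig_freq=sr, new_freq=32000)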

# Extract embeddings
embeddings = model(x)  # (1, n_segments, 960)
```

## Available Models

```python
from ssondo import list_models

for name, description in list_models().items():
    print(f"{name}: {description}")
```

| Model | Teacher | Student | Embedding Size |
|-------|---------|---------|---------------|
| `matpac-mobilenetv3` | MATPAC++ | MobileNetV3 | 960 |
| `matpac-dymn` | MATPAC++ | DyMN | 960 |
| `matpac-eres2net` | MATPAC++ | ERes2Net | varies |
| `m2d-mobilenetv3` | M2D | MobileNetV3 | 960 |
| `m2d-dymn` | M2D | DyMN | 960 |
| `m2d-eres2net` | M2D | ERes2Net | varies |

## Usage

### Extract Embeddings

```python
from ssondo import get_ssondo
model = get_ssondo("matpac-mobilenetv3")
# `audio`: mono waveform tensor at 32 kHz, shape (batch, n_samples)
embeddings = model(audio)  # (batch, n_segments, emb_size)
```

### Get Logits Too

```python
model = get_ssondo("matpac-mobilenetv3", return_logits=True)
embeddings, logits = model(audio)
```

### GPU Inference

```python
model = get_ssondo("matpac-mobilenetv3", device="cuda")
embeddings = model(audio.cuda())
```

### Load from Local Checkpoint

```python
model = get_ssondo("path/to/checkpoint.ckpt")
```

## Input Requirements

- **Mono audio** (single channel)
- **Sample rate**: 32,000 Hz
- Audio is internally sliced into 10-second segments and converted to 128-band log-mel spectrograms
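
The segment slicing described above determines the `n_segments` dimension of the output. The following is a minimal sketch of that arithmetic, not the library's implementation. The helper `expected_segments` and its `pad_last` flag are hypothetical: this README does not say whether a trailing partial segment is padded or dropped, so the sketch exposes both options.

```python
import math

SAMPLE_RATE = 32_000   # Hz, from the input requirements
SEGMENT_SECONDS = 10   # segment length, from the input requirements


def expected_segments(n_samples: int, pad_last: bool = True) -> int:
    """Estimate how many 10-second segments a clip of n_samples yields.

    ASSUMPTION: pad_last=True counts a trailing partial segment as if it
    were padded to full length; pad_last=False drops it. Which of the two
    ssondo actually does is not stated in this README.
    """
    if n_samples < 0:
        raise ValueError("n_samples must be non-negative")
    segment_samples = SAMPLE_RATE * SEGMENT_SECONDS  # 320,000 samples
    if pad_last:
        return math.ceil(n_samples / segment_samples)
    return n_samples // segment_samples


# A 25-second clip at 32 kHz has 800,000 samples.
print(expected_segments(800_000))                  # 3
print(expected_segments(800_000, pad_last=False))  # 2
```

Comparing this estimate with `embeddings.shape[1]` on a clip of known length is a quick way to find out which behavior applies.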

## How It Works

`get_ssondo()` reads the student backbone, preprocessing parameters, and classification head from the checkpoint itself, so no manual configuration is needed.

When you pass a model name (e.g., `"matpac-mobilenetv3"`), the checkpoint is automatically downloaded from [Hugging Face Hub](https://huggingface.co/mohammedali2501/ssondo) and cached locally.
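
To illustrate the name-or-path behavior described above, here is a rough sketch of how such a lookup could be written with `huggingface_hub`. It is not the code `get_ssondo()` uses. The helper `resolve_checkpoint` is hypothetical, and the `"<name>.ckpt"` file naming inside the Hub repository is an assumption; only the repository id comes from this README.

```python
from pathlib import Path

HF_REPO_ID = "mohammedali2501/ssondo"  # repository linked in this README


def resolve_checkpoint(name_or_path: str) -> Path:
    """Return a local checkpoint path for a model name or a file path.

    Hypothetical helper for illustration; not part of the ssondo API.
    """
    candidate = Path(name_or_path)
    if candidate.is_file():
        # An existing file is treated as a local checkpoint.
        return candidate

    # Otherwise treat the argument as a model name and fetch it from the Hub.
    # Imported lazily so the local-path branch works without the dependency.
    from huggingface_hub import hf_hub_download

    # ASSUMPTION: the "<name>.ckpt" file name is a guess; the actual file
    # layout of the repository may differ.
    local_file = hf_hub_download(
        repo_id=HF_REPO_ID,
        filename=f"{name_or_path}.ckpt",
    )
    return Path(local_file)
```

`hf_hub_download` stores files in the local Hugging Face cache, so repeated calls with the same name do not download the file again.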

## Citation

```bibtex
@inproceedings{eladlouni2026ssondo,
  title={S-SONDO: Self-Supervised Knowledge Distillation for General Audio Foundation Models},
  author={El Adlouni, Mohammed Ali and Quelennec, Aurian and Chouteau, Pierre and Peeters, Geoffroy and Essid, Slim},
  booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2026}
}
```

## License

MIT
