Smol-Audio: A Colab-Friendly Notebook Set for Fine-Tuning Whisper, Parakeet, Voxtral, Granite Speech, and Audio Flamingo 3


Voice AI has had a banner year. Automatic speech recognition has gotten dramatically better with models like OpenAI’s Whisper variants, NVIDIA’s Parakeet, and Mistral’s Voxtral. Audio understanding has advanced with models like NVIDIA’s Audio Flamingo 3. Text-to-speech reached the dialogue level with Dia-1.6B from Nari Labs. Meta shipped the Perceptual Audio-Visual Encoder (PE-AV), a multimodal encoder that learns a shared embedding space across audio, video, and text. The frontier has never moved faster.

The catch? The practical knowledge required to actually work with these models – how to fine-tune them, adapt them to new languages, or run efficient inference – is scattered across GitHub issues, research blogs, and private notebooks that never see the light of day. If you’re a machine learning engineer who just wants to fine-tune Whisper on a new domain or run zero-shot video classification with PE-AV, you’re often starting from scratch.

This is the gap smol-audio is designed to close.

What is smol-audio?

Released under the Apache-2.0 license by the Deep-unlearning team, smol-audio is a flat repository of standalone Jupyter notebooks, each focused on a single practical audio AI task. Each notebook is designed to open directly in Google Colab, requires no local GPU setup, and is built entirely on the Hugging Face ecosystem – specifically transformers, datasets, peft, and accelerate. Most recipes fit within a 16 GB GPU runtime, which means the free or standard Colab tier is sufficient for most tasks.
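
Getting started is just a matter of installing that stack in a fresh Colab cell. A minimal sketch (the notebooks themselves may pin specific versions):

```python
# First cell of a typical smol-audio-style Colab session (versions unpinned here).
!pip install -q transformers datasets peft accelerate
```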

The “flat repo” design is a deliberate choice. Instead of encapsulating recipes inside a framework or hiding complexity behind convenience functions, smol-audio reveals every step. You can read the training loop, understand the data path, and modify the configuration without reverse engineering the library. For early-career engineers, this transparency is truly educational.

ASR fine-tuning: Whisper, Parakeet, Voxtral, and Granite Speech

The largest group of notebooks in the repo today covers ASR fine-tuning across four distinct model families, each of which requires usefully different treatment.

The Whisper notebook covers fine-tuning with transformers and datasets, making it easy to adapt the encoder-decoder architecture to a custom language or narrow domain. Whisper uses a sequence-to-sequence approach, generating transcripts token by token – familiar territory for anyone who has worked with language models.
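
A minimal sketch of what a Whisper training step looks like with this stack (illustrative, not the notebook’s exact code; the checkpoint, audio, and target text are placeholders):

```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# A small Whisper checkpoint as a placeholder; the notebook may use another size.
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# One (audio, text) pair; in practice this comes from a `datasets` dataset.
audio = torch.randn(16000).numpy()  # 1 s of fake 16 kHz audio
inputs = processor(audio=audio, sampling_rate=16000, return_tensors="pt")
labels = processor.tokenizer("hello world", return_tensors="pt").input_ids

# Seq2seq training step: cross-entropy over the target transcription tokens.
outputs = model(input_features=inputs.input_features, labels=labels)
outputs.loss.backward()
```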

NVIDIA’s Parakeet uses a CTC (Connectionist Temporal Classification) architecture instead of a sequence-to-sequence setup. CTC is faster and lighter at inference but relies on alignment between audio frames and output tokens rather than autoregressive decoding. The smol-audio notebook covers both full fine-tuning and LoRA (low-rank adaptation) for Parakeet, which matters because full fine-tuning of large CTC models can be memory-intensive.
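
To make the CTC contrast concrete, here is a minimal CTC training step using Wav2Vec2ForCTC as a stand-in (the notebook targets Parakeet, but the CTC loss mechanics are the same):

```python
import torch
from transformers import AutoProcessor, Wav2Vec2ForCTC

# Wav2Vec2 stands in for Parakeet here; both train with a CTC head.
processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

audio = torch.randn(16000).numpy()  # 1 s of fake 16 kHz audio
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
labels = processor.tokenizer("HELLO WORLD", return_tensors="pt").input_ids

# A single non-autoregressive forward pass yields per-frame logits;
# the CTC loss aligns frames to the label sequence internally.
loss = model(input_values=inputs.input_values, labels=labels).loss
loss.backward()
```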

Mistral’s Voxtral is architecturally different from both Whisper and Parakeet. Instead of a traditional ASR encoder-decoder, Voxtral is built on top of a large language model – Ministral 3B for Voxtral Mini and Mistral Small 3.1 24B for Voxtral Small – making it an LLM-based speech understanding model. The smol-audio notebook handles ASR fine-tuning with prompt masking, supporting both full fine-tuning and LoRA. Prompt masking matters here precisely because of the LLM backbone: when the model receives text prompts alongside audio input, you typically don’t want to compute the loss on the prompt tokens themselves – only on the generated transcription. Getting this wrong degrades training dynamics, so a working reference implementation saves significant debugging time.
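
The core of prompt masking is easy to state in code. A schematic, model-agnostic example (the token ids below are purely illustrative):

```python
import torch

IGNORE_INDEX = -100  # positions with this label are excluded from the loss

# Hypothetical token ids for a prompt and its target transcription.
prompt_ids = [101, 2054, 2003]      # e.g. "Transcribe this audio:"
transcript_ids = [7592, 2088, 102]  # e.g. "hello world"

input_ids = torch.tensor([prompt_ids + transcript_ids])
labels = torch.tensor([[IGNORE_INDEX] * len(prompt_ids) + transcript_ids])

# Cross-entropy in transformers ignores positions labeled -100, so the
# model is trained to produce only the transcription, not the prompt.
```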

IBM’s Granite Speech gets its own notebook, focused on Italian ASR using the YODAS-Granary dataset. This is a useful example that goes beyond a single model: it demonstrates domain and language fine-tuning on a real multilingual speech corpus, a common production scenario.

Understanding audio with NVIDIA Audio Flamingo 3

Audio Flamingo 3, developed by NVIDIA, is a large audio language model (LALM) for reasoning and understanding across speech, sound, and music. The smol-audio notebook targets the task of audio captioning – generating a natural-language description of an audio clip – which is useful for accessibility tools, content indexing, and retrieval systems. The notebook covers both full fine-tuning and LoRA-based fine-tuning, giving practitioners a choice between maximum performance and memory efficiency.

LoRA, for those newer to parameter-efficient fine-tuning, works by freezing the original model’s weights and injecting small trainable rank-decomposition matrices into specific layers. For large multimodal models like Audio Flamingo 3, LoRA can reduce GPU memory requirements by an order of magnitude compared to full fine-tuning, making fine-tuning feasible on commodity hardware.
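
With peft, wrapping a model in LoRA adapters takes only a few lines. A minimal sketch (the base model and target module names are placeholders; they vary by architecture):

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder base model

config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                        # rank of the trainable decomposition matrices
    lora_alpha=16,              # scaling factor applied to the adapter output
    lora_dropout=0.05,
    target_modules=["c_attn"],  # attention projections; name depends on the model
)

model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of all weights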

Text-to-speech dialogue with Dia-1.6B

The Dia-1.6B notebook covers dialogue-style text-to-speech, where the goal is not just to synthesize a single voice but to generate natural conversational exchanges. Dia is a 1.6-billion-parameter text-to-speech (TTS) model from Nari Labs capable of producing multi-speaker dialogue, making it suitable for anyone building voice agents, podcast-generation tools, or conversational interfaces.
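
Generation follows the pattern of Nari Labs’ dia package. A rough sketch based on its published usage; the exact API and the 44.1 kHz output rate are assumptions and may differ from the version pinned in the notebook:

```python
# Sketch assuming the nari-labs `dia` package API; details may change.
import soundfile as sf
from dia.model import Dia

model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# [S1]/[S2] speaker tags mark dialogue turns for multi-speaker synthesis.
text = "[S1] Hey, have you tried smol-audio? [S2] Not yet, is it good?"
audio = model.generate(text)

sf.write("dialogue.wav", audio, 44100)  # assumed 44.1 kHz output
```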

Multimodal inference using Meta’s PE-AV

Perhaps the most forward-looking notebook in the current set covers inference with Meta’s Perceptual Audio-Visual Encoder (PE-AV). PE-AV is a multimodal encoder that learns a single shared embedding space across audio, video, and text – enabling zero-shot video classification without any task-specific fine-tuning, as well as audio-text retrieval on benchmarks such as AudioCaps. Since all three modalities map into the same embedding space, multimodal queries – such as retrieving an audio clip from a textual description – work through simple dot-product similarity.

The notebook shows how to run these inference pipelines directly, which is valuable because multimodal models with shared audio, video, and text encoders are architecturally more complex than single-modality models and typically require careful preprocessing of multiple input modalities.
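
Once embeddings are in hand, the retrieval logic itself reduces to a few lines. A model-agnostic sketch, where encode_audio and encode_text are hypothetical stand-ins for the PE-AV encoder calls:

```python
import torch
import torch.nn.functional as F

def encode_audio(clip: torch.Tensor) -> torch.Tensor:
    """Hypothetical stand-in for PE-AV's audio encoder."""
    return torch.randn(512)

def encode_text(caption: str) -> torch.Tensor:
    """Hypothetical stand-in for PE-AV's text encoder."""
    return torch.randn(512)

captions = ["a dog barking", "rain on a window", "a jazz trio playing"]
text_embs = F.normalize(torch.stack([encode_text(c) for c in captions]), dim=-1)
audio_emb = F.normalize(encode_audio(torch.randn(16000)), dim=-1)

# Shared embedding space => ranking is a single matrix-vector product.
scores = text_embs @ audio_emb
print(captions[scores.argmax().item()])
```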


Check out the repo on GitHub.



Michel Sutter is a data science professional who holds a Master’s degree in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michel excels at transforming complex datasets into actionable insights.
