Vaibhav Srivastav

reach-vb

AI & ML interests

TTS + LM performance prediction

Articles

Organizations

Posts 4

view post
Post
403
On-device AI framework ecosystem is blooming these days:

1. llama.cpp - All things Whisper, LLMs & VLMs - run across Metal, CUDA and other backends (AMD/ NPU etc)
https://github.com/ggerganov/llama.cpp

2. MLC - Deploy LLMs across platforms especially WebGPU (fastest WebGPU LLM implementation out there)
https://github.com/mlc-ai/web-llm

3. MLX - Arguably the fastest general purpose framework (Mac only) - Supports all major Image Generation (Flux, SDXL, etc), Transcription (Whisper), LLMs
https://github.com/ml-explore/mlx-examples

4. Candle - Cross-platform general purpose framework written in Rust - wide coverage across model categories
https://github.com/huggingface/candle

Honorable mentions:

1. Transformers.js - Javascript (WebGPU) implementation built on top of ONNXruntimeweb
https://github.com/xenova/transformers.js

2. Mistral rs - Rust implementation for LLMs & VLMs, built on top of Candle
https://github.com/EricLBuehler/mistral.rs

3. Ratchet - Cross platform, rust based WebGPU framework built for battle-tested deployments
https://github.com/huggingface/ratchet

4. Zml - Cross platform, Zig based ML framework
https://github.com/zml/zml

Looking forward to how the ecosystem would look 1 year from now - Quite bullish on the top 4 atm - but open source ecosystem changes quite a bit! 🤗

Also, which frameworks did I miss?
view post
Post
2730
Less than two days ago Kyutai Labs open sourced Moshi - an ~7.6B on-device Speech to Speech foundation model and Mimi - SoTA streaming speech codec! 🔥

The release includes:

1. Moshiko & Moshika - Moshi finetuned on synthetic data (CC-BY license) ( kyutai/moshi-v01-release-66eaeaf3302bef6bd9ad7acd)
2. Mimi - Streaiming Audio Codec, processes 24 kHz audio, down to a 12.5 Hz representation with a bandwidth of 1.1 kbps (CC-BY license) ( kyutai/mimi)
3. Model checkpoints & Inference codebase written in Rust (Candle), PyTorch & MLX (Apache license) (https://github.com/kyutai-labs/moshi)

How does Moshi work?

1. Moshi processes two audio streams: one for itself and one for the user, with the user's stream coming from audio input and Moshi's stream generated by the model.

2. Along with these audio streams, Moshi predicts text tokens for its speech, enhancing its generation quality.

3. The model uses a small Depth Transformer for codebook dependencies and a large 7B parameter Temporal Transformer for temporal dependencies.

4. The theoretical latency is 160ms, with a practical latency of around 200ms on an L4 GPU.

Model size & inference:

Moshiko/ka are 7.69B param models

bf16 ~16GB VRAM
8-bit ~8GB VRAM
4-bit ~4GB VRAM

You can run inference via Candle 🦀, PyTorch and MLX - based on your hardware.

The Kyutai team, @adefossez @lmz and team are cracked AF, they're bringing some serious firepower to the open source/ science AI scene, looking forward to what's next! 🐐