DashengTokenizer is a high-performance continuous audio tokenizer designed for audio understanding and generation tasks.
Compared to previous works, our framework trains a single linear layer to enable audio generation for semantically strong encoders.
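As a minimal sketch of this idea (all names and dimensions below are illustrative assumptions, not the actual DashengTokenizer internals), the adaptation amounts to learning one projection matrix that maps frozen encoder features into the space a generation decoder consumes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed shapes for illustration only: frames, encoder dim, decoder dim.
T, d_enc, d_dec = 250, 768, 512

# Stand-in for frozen encoder output, i.e. model.encode(audio).
features = rng.standard_normal((T, d_enc))

# The single trainable linear layer: one weight matrix plus a bias.
W = rng.standard_normal((d_enc, d_dec)) * 0.02
b = np.zeros(d_dec)

# What a generation decoder would consume after the projection.
decoder_inputs = features @ W + b
print(decoder_inputs.shape)  # (250, 512)
```

Because only `W` and `b` are trained while the encoder stays frozen, the encoder's semantic strength is preserved for understanding tasks.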
Achievements:
State-of-the-Art Audio Understanding: DashengTokenizer outperforms most previous self-supervised and supervised audio encoders on audio understanding benchmarks.
High-Fidelity Signal Reconstruction: Maintains exceptional signal integrity, ensuring that audio remains crisp and accurate after processing.
Accelerated Audio Generation Training: Reaches strong generation quality significantly faster than standard VAE baselines, reducing training time and costs.
Superior Speech Enhancement: Provides a more robust encoding foundation for isolating and clarifying speech in noisy environments.
```python
# Extract rich audio features for downstream tasks
features = model.encode(audio)
# Use features for classification, clustering, etc.
```
Limitations
Optimized for 16kHz mono audio
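Since the model expects 16 kHz mono input, audio in other formats should be downmixed and resampled first. A minimal sketch using linear interpolation (the helper name and use of plain NumPy are assumptions; a dedicated resampler such as torchaudio or librosa would give higher-quality output):

```python
import numpy as np

def to_16k_mono(audio: np.ndarray, sr: int, target_sr: int = 16000) -> np.ndarray:
    """Downmix to mono and resample to target_sr via linear interpolation."""
    if audio.ndim == 2:
        audio = audio.mean(axis=0)  # average channels -> mono
    if sr != target_sr:
        n_out = int(round(len(audio) * target_sr / sr))
        t_in = np.arange(len(audio)) / sr
        t_out = np.arange(n_out) / target_sr
        audio = np.interp(t_out, t_in, audio)
    return audio.astype(np.float32)

# e.g. one second of 44.1 kHz stereo noise -> 16 kHz mono
x = np.random.randn(2, 44100)
y = to_16k_mono(x, 44100)
print(y.shape)  # (16000,)
```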
Results
Citation
If you use DashengTokenizer in your research, please cite:
```bibtex
@misc{dinkel_dashengtokenizer_2026,
  title={DashengTokenizer: One layer is enough for unified audio understanding and generation},
  author={MiLM Plus, Xiaomi},
  year={2026},
  url={https://huggingface.co/mispeech/dashengtokenizer}
}
```
License
Apache 2.0 License