Last year, I spent hundreds of hours trying to train toy language models from scratch. None of them were "useful", but they weren't meant to be. Language models are fascinating, and I wanted to capture even a fraction of the magic.
While fine-tuning large language models (LLMs) is relatively straightforward, far fewer tools exist for pre-training them from scratch. I ended up using Andrej Karpathy's llama2.c, a fantastic educational project, but I soon realized its limitations: it wasn't designed to easily swap in new datasets or experiment with custom model architectures.
That's why, over the past few weeks, I built Sonic-ML: a simple command-line tool for training language models from scratch on a single machine, whether that machine runs a CPU, an Apple M-series chip, or a GPU. Sonic-ML streamlines the process, making it easy to train and evaluate models with a single command. Beyond that, it simplifies downloading datasets from Hugging Face and lets users train custom tokenizers on any dataset.
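To give a flavor of that workflow, here is a sketch of what a session might look like. The subcommand names, flags, and dataset/model names below (`download-dataset`, `train-tokenizer`, `tiny_shakespeare`, and so on) are purely illustrative assumptions, not the tool's exact interface; see the GitHub README for the real commands.

```bash
# Hypothetical Sonic-ML session; every command and flag name here is
# illustrative and may differ from the actual CLI.

# 1. Pull a dataset from Hugging Face.
sonic-ml download-dataset --dataset tiny_shakespeare

# 2. Train a custom tokenizer on that dataset.
sonic-ml train-tokenizer --dataset tiny_shakespeare --vocab-size 8192

# 3. Pre-train a small model from scratch on CPU, Apple Silicon, or GPU.
sonic-ml train --dataset tiny_shakespeare --model tiny-llama --steps 5000

# 4. Evaluate the trained model.
sonic-ml evaluate --model tiny-llama
```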
This project is open source and publicly available on GitHub. If you're excited about language models and want to contribute, check out the issues section; I'd love to hear your ideas and see your contributions!