May 15, 2026

Skip the subscription and build a TTS pipeline

I have subscriptions to various online journals. I mean, what isn’t a subscription these days? I often save articles I want to keep or use for writing, and store them as files. Some days, I want an article read aloud to me, either so I can read along or so I can go into listening mode while taking a walk.

With AI, we’re no longer stuck with that awful 2010 robotic voice reading text to us.

That being said, everything these days is a subscription. EVERYTHING (I mean, I just said I have subscriptions). I also just don’t want to listen to a robot, and I’m not converting to audio often enough to pay for it. Besides, one never knows when they’ll need to buy another journal subscription.

So I thought, “You know what? I’ll just build it myself.”

I fired up Claude Code and was prepared to dive into TTS models that were going to cost some money, figuring that since I don’t need this often, pay-per-use would still be cheaper than recurring monthly payments. This is when I learned about Kokoro-82M. It’s a tiny TTS (text-to-speech) model (82 million parameters, which works out to roughly 330 MB at 32-bit precision, tiny next to multi-gigabyte language models), and it sounds genuinely competitive with ElevenLabs to my ear. Also, it runs on a CPU. It's free, and there are around fifty voices to choose from.

The recipe for my setup was:

Python script on the Mac mini
Pull a Markdown article from Nextcloud (I self-host it and keep documents there) over WebDAV
Strip the Markdown down to clean prose
Run Kokoro chunk by chunk (paragraph-sized, because the model has a context limit, but it's fine as long as you don't shove a whole essay at it)
Stitch the audio together with a quarter-second silence between chunks
Encode to MP3 (FFmpeg, about two seconds)
Upload back into Nextcloud
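
The middle of that recipe (strip, chunk, stitch) can be sketched roughly like this. It is a minimal sketch and not the script from the gist. The function names are made up for illustration, `synthesize` is a hypothetical stand-in for whatever wraps the Kokoro call and returns 16-bit mono PCM bytes, and the 24 kHz sample rate is an assumption about Kokoro's output that is worth checking against your own setup.

```python
import re
import wave

SAMPLE_RATE = 24_000   # assumption: Kokoro's output rate in Hz
PAUSE_SECONDS = 0.25   # the quarter-second gap between chunks


def strip_markdown(md):
    """Rough pass that reduces common Markdown syntax to plain prose (not a full parser)."""
    text = re.sub(r"```.*?```", "", md, flags=re.DOTALL)    # drop fenced code blocks
    text = re.sub(r"!\[[^\]]*\]\([^)]*\)", "", text)        # drop images
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)    # keep link text, drop the URL
    # Remove heading, blockquote, and list markers at the start of a line.
    text = re.sub(r"^[ \t]{0,3}(?:#{1,6}|>|[-*+]|\d+\.)[ \t]+", "", text, flags=re.MULTILINE)
    text = re.sub(r"[*`]+", "", text)                       # emphasis and inline-code marks
    return text.strip()


def split_chunks(text):
    """Split on blank lines so each chunk is about one paragraph."""
    paragraphs = re.split(r"\n\s*\n", text)
    return [" ".join(p.split()) for p in paragraphs if p.strip()]


def stitch(pcm_chunks, sample_rate=SAMPLE_RATE, pause=PAUSE_SECONDS):
    """Join 16-bit mono PCM chunks with a short silence between them."""
    silence = b"\x00\x00" * int(sample_rate * pause)        # two zero bytes per sample
    return silence.join(pcm_chunks)


def narrate(md, synthesize):
    """Run a (hypothetical) synthesize(text) -> PCM bytes callable over each chunk."""
    chunks = split_chunks(strip_markdown(md))
    return stitch([synthesize(chunk) for chunk in chunks])


def write_wav(path, pcm, sample_rate=SAMPLE_RATE):
    """Save the stitched PCM as a mono 16-bit WAV, ready for encoding."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
```

Passing the synthesizer in as a plain callable keeps the model swappable. If the model hands back floating-point audio, it would need converting to 16-bit PCM before it reaches `stitch`.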

It took longer to pick the voice I wanted to use than to write and run the script.
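
The two ends of the recipe, Nextcloud on one side and FFmpeg on the other, are mostly a matter of getting a URL and a command line right. Here is a hedged sketch using only the standard library. The host, user, path, and function names are placeholders, the URL follows Nextcloud's usual `remote.php/dav/files/<user>/` layout, and the quality setting (`-q:a 4`) is an arbitrary choice rather than what the gist uses.

```python
import subprocess
import urllib.request
from base64 import b64encode
from urllib.parse import quote


def webdav_url(base_url, user, remote_path):
    """Build a Nextcloud WebDAV file URL (assumes the default remote.php/dav layout)."""
    base = base_url.rstrip("/")
    return f"{base}/remote.php/dav/files/{quote(user)}/{quote(remote_path.lstrip('/'))}"


def webdav_request(url, user, app_password, data=None):
    """GET when there is no body (download), PUT when there is one (upload)."""
    token = b64encode(f"{user}:{app_password}".encode()).decode()
    method = "GET" if data is None else "PUT"
    return urllib.request.Request(
        url, data=data, method=method, headers={"Authorization": f"Basic {token}"}
    )


def ffmpeg_mp3_command(wav_path, mp3_path):
    """Command line for encoding a WAV to MP3 with the LAME encoder."""
    return ["ffmpeg", "-y", "-i", wav_path, "-codec:a", "libmp3lame", "-q:a", "4", mp3_path]


def encode_mp3(wav_path, mp3_path):
    """Run FFmpeg and raise if it exits with an error."""
    subprocess.run(ffmpeg_mp3_command(wav_path, mp3_path), check=True)


# Usage, with placeholder values:
#   url = webdav_url("https://cloud.example.com", "me", "Articles/essay.md")
#   article = urllib.request.urlopen(webdav_request(url, "me", "app-password")).read()
```

Building the request and the command separately from running them makes both easy to inspect before anything touches the network or the disk.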

Having Kokoro is not having a product. Having Kokoro plus a place your articles already live, plus a one-word command from any harness you use, plus an audio destination your phone already syncs to is what becomes the product. So I wrapped it twice: once as a shell function, and once as a Claude skill that fuzzy-matches my request to figure out which Markdown file I mean.
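
The fuzzy matching doesn't need a model to be useful. As an illustration of the idea (a stand-in using Python's `difflib`, not how the skill actually resolves names), something like this picks a file from a loose description. The function name, file names, and the 0.4 cutoff are all hypothetical.

```python
import difflib
from pathlib import PurePosixPath


def pick_article(query, filenames, cutoff=0.4):
    """Return the filename that best matches a loose query, or None if nothing is close."""
    # Compare against readable stems: "notes-on-kokoro-tts.md" -> "notes on kokoro tts".
    stems = {PurePosixPath(name).stem.lower().replace("-", " "): name for name in filenames}
    q = query.lower().strip()

    # A unique substring hit is the most reliable signal, so try that first.
    hits = [name for stem, name in stems.items() if q in stem]
    if len(hits) == 1:
        return hits[0]

    # Otherwise fall back to similarity scoring, which tolerates typos.
    close = difflib.get_close_matches(q, list(stems), n=1, cutoff=cutoff)
    return stems[close[0]] if close else None


files = ["the-case-for-slow-software.md", "notes-on-kokoro-tts.md", "walking-and-thinking.md"]
print(pick_article("kokoro", files))
print(pick_article("slow sofware", files))
```

Returning `None` instead of guessing matters here: reading the wrong article aloud is a worse failure than asking again.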

This turned out great: it sounds great, and I didn’t have to pay for a subscription.

A couple of hours of plumbing beats $11/mo forever. The bigger thing I keep relearning is that the model is rarely the bottleneck anymore. The bottleneck is the system around it. Where does the input come from? Where does the output go? How do you trigger it without thinking? How does it not break the third time you use it? That’s building a product. Good thing I like building products.

I made a GitHub gist if you’d like to try building your own!

https://gist.github.com/bonus414/20f38d32d8c08c5ff32f516fb373bbfd
