A while ago I was listening to the Accidental Tech Podcast and the hosts were trying to recall something that had happened on a previous episode. I thought, wouldn’t it be nice to have a set of transcripts for all the episodes? Surely there were some reasonably easy-to-use open source audio transcription engines I could use? Surely this idea wouldn’t take more than a weekend or two!
While this was not the only side project I worked on between then and now, needless to say it took a while to evaluate the various solutions and figure out how to implement one. It was a nice side project to work on in between the rest of life, but its scope turned out to be unexpectedly large. It is now March 2021, and it is finally launched.
A major hat tip to David Smith, who did something similar a while ago. I borrowed some of the overall UI theme he pioneered for this sort of project, and then added a bunch more features and functionality to it.
Note that this project is completely unaffiliated with the hosts of the Accidental Tech Podcast outside of seeking their permission to release it.
While my approaches with Flashlight and then Whisper were state of the art when I originally implemented them (and my sliding window approach with Flashlight actually meant I beat many commercial options…), things have progressed since 2023, when I last modified the site. Recently I updated the existing pipeline, replacing the single ASR approach with an ensemble of multiple recognition engines that effectively check each other’s work.
On the website side, search is now Pagefind, which builds a static index that runs entirely in your browser. There is no search server to keep alive, or to forget to load the latest episode’s results into. The whole site is now just static files hosted on Cloudflare. The word-by-word highlight-as-it-plays feature has also been retained, though I did have to fix some AI-generated bugs (flashbacks to my day job, as even frontier models often struggle with embedded firmware code…).
Each transcript page notes exactly which engines and which judge produced it. It is very possible I might experiment with different models / techniques over time.
If you’re curious about where the ASR engines disagreed, there is also a View ASR model disagreements toggle in the sidebar. Flip it on, and on any episode transcribed with the new model architecture every word the two engines disagreed about is highlighted. Click or hover over one and you’ll see what each engine heard (along with how well the audio actually backed up each guess), what the third audio-input model heard when it listened to that clip, and which version the judge ultimately picked or, very occasionally, the replacement it wrote itself. It is interesting to see how the different models interpret certain phrases / words.
I also had Claude create a dark / light mode, and I added a “system” mode that follows whatever preference your browser reports. The page defaults to “system”.
In the more “fun” category, I added a graph in the statistics section showing where in each episode (using % of elapsed time) the episode title was spoken. Note there are a fair number of missing data points where the episode title wasn’t correctly identified. It is possible that as I regenerate older episodes using the new ensemble method, it will pick up additional episode titles and fill in the graph further.
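For a rough idea of how word-level disagreements between two transcripts can be located, here is a minimal sketch using Python's standard `difflib`. It is only an illustration, not necessarily how this site's pipeline does it: the function name `find_disagreements` and the sample sentences are made up for the example.

```python
from difflib import SequenceMatcher


def find_disagreements(words_a, words_b):
    """Return (a_words, b_words) pairs for each span where two transcripts differ."""
    # autojunk=False keeps difflib from ignoring very frequent words in long inputs.
    matcher = SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    disagreements = []
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        if tag != "equal":  # "replace", "delete" or "insert"
            disagreements.append((words_a[a0:a1], words_b[b0:b1]))
    return disagreements


# Hypothetical outputs from two engines for the same clip.
engine_a = "the human interface guidelines are great".split()
engine_b = "the human inner face guidelines are great".split()

for heard_a, heard_b in find_disagreements(engine_a, engine_b):
    print(f"engine A heard {heard_a}, engine B heard {heard_b}")
```

Each pair marks one span that could be highlighted, with both versions available to show on hover.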
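The "% of elapsed time" value is simple to compute once word timestamps are available. The sketch below assumes a transcript stored as `(word, start_seconds)` pairs and an exact, punctuation-insensitive match on the title. Both are assumptions made for illustration, and the function name is hypothetical.

```python
def title_position_percent(words, title, duration_s):
    """Percent of elapsed time at which the title is first spoken, or None if not found.

    words: list of (word, start_seconds) tuples (an assumed format).
    """
    def norm(word):
        # Lowercase and drop punctuation so "Project!" matches "project".
        return "".join(ch for ch in word.lower() if ch.isalnum())

    target = [t for t in (norm(w) for w in title.split()) if t]
    if not target or duration_s <= 0:
        return None
    spoken = [(norm(w), start) for w, start in words]
    n = len(target)
    for i in range(len(spoken) - n + 1):
        if [w for w, _ in spoken[i:i + n]] == target:
            return 100.0 * spoken[i][1] / duration_s
    return None  # a "missing data point" on the graph


# Hypothetical two-minute episode where the title is spoken 30 seconds in.
sample = [("So", 0.0), ("this", 1.0), ("is", 1.5), ("a", 2.0),
          ("Weekend", 30.0), ("Project!", 30.5)]
print(title_position_percent(sample, "Weekend Project", 120.0))
```

An exact match like this misses any title the recognizer transcribed even slightly differently, which is one plausible source of missing data points.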
The facts below describe earlier revisions of the project (the Flashlight and Whisper eras) and are kept for posterity — see The Current Approach above for how things work today. Notably, the per-line highlighting trick mentioned below was later reverted to one span per word, and the Tantivy search has been replaced with Pagefind.
Some fun facts about the setup of this project include:
Instead of using a <span> for each word in the transcript (which I did in early revisions of the site), I now have one <span> per transcript line and use JavaScript to select a specific word in the line to highlight/unhighlight. This helped reduce the size of the complete site (along with disabling Lanyon’s automatic pagination) in compressed 7z format from 214MB to 141MB. It also noticeably improves the initial load speed of transcripts, particularly on mobile networks. Note that I am not a web dev, and there are probably ways to make it even more efficient (such as splitting the transcripts into one JS file/dictionary per chapter), but I’m happy with how it’s performing now.
This comparison is from the original (2021) version of the project and reflects what was available then. It is not how transcripts are generated today (see The Current Approach). Still a fun historical snapshot.
Early on in this project I chose a short segment of audio to use as a ‘reference’ to determine how well different speech recognition systems could handle the tech-heavy speech of ATP. The segment I selected was John’s soliloquy about the Apple Human Interface Guidelines and UI Design from Accidental Tech Podcast #392. It has a duration of 200 seconds. I generated a ‘reference’ transcript of that segment and then ran the audio through a variety of speech recognition systems. This included both open source options and the built-in offline transcription found in iOS 14 and macOS Big Sur (which, interestingly, consistently provided slightly different results…). Note that for Flashlight it also involved some custom algorithms / techniques to merge overlapping transcripts together, as Flashlight’s accuracy with the model I was using dropped significantly when samples were longer than ~15 seconds.
For this project I did not examine cloud hosted speech recognition systems. With 400+ episodes of ATP, many of them 3+ hours long, transcribing it all would have been cost prohibitive for a side project. They likely exceed the performance of even the best performing open source project, but they were out of scope here.
For this project I wanted to maximize the number of Hits (number of words correctly found) while minimizing the number of Insertions (number of words inserted that did not exist in the original audio) and Substitutions (where a word was substituted for a different one). I used the Python module jiwer to help calculate all these values. I also removed all punctuation ('/"/./!/ etc) before running the transcripts through the comparison, to ensure I didn’t favor those that used punctuation over those that didn’t. Finally, do not take this performance comparison as gospel. It is a short snippet of a very tech oriented show hosted by three white dudes from the United States, and performance might be dramatically different for your particular voice recognition use case.
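To give a feel for what merging overlapping chunk transcripts can look like, here is a minimal sketch. It is not the actual merging code used for this project. It assumes consecutive chunks cover some of the same audio and joins them at the longest run of words they share, using Python's standard `difflib`. The function name and the `max_overlap` parameter are hypothetical.

```python
from difflib import SequenceMatcher


def merge_overlapping(prev_words, next_words, max_overlap=20):
    """Join two word lists where the end of one overlaps the start of the other."""
    tail = prev_words[-max_overlap:]  # only the end of the earlier chunk can overlap
    head = next_words[:max_overlap]   # only the start of the later chunk can overlap
    matcher = SequenceMatcher(a=tail, b=head, autojunk=False)
    match = matcher.find_longest_match(0, len(tail), 0, len(head))
    if match.size == 0:
        return prev_words + next_words  # nothing in common, so just concatenate
    tail_offset = len(prev_words) - len(tail)
    # Keep the earlier chunk through the shared run, then resume the later chunk
    # right after it. Words cut off at either chunk boundary are dropped.
    keep_prev = prev_words[:tail_offset + match.a + match.size]
    keep_next = next_words[match.b + match.size:]
    return keep_prev + keep_next


# Hypothetical chunk outputs: the first ends mid-word, the second starts mid-word.
chunk_1 = "the human interface guidelines for uh".split()
chunk_2 = "face guidelines for a very long time".split()
print(" ".join(merge_overlapping(chunk_1, chunk_2)))
```

Joining on the shared run discards the garbled words at the chunk edges ("uh" and "face" here). A repeated phrase inside the overlap could fool this simple version, so a real pipeline would likely also use word timestamps.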
| Speech Recognition System | Hits | Insertions | Substitutions | WER (Word Error Rate) |
|---|---|---|---|---|
| Reference Transcript | 800 | 0 | 0 | 0.0% |
| Flashlight (15 second chunking) | 740 | 8 | 39 | 8.5% |
| iOS 14 | 692 | 9 | 87 | 14.625% |
| macOS Big Sur | 666 | 14 | 107 | 18.5% |
| macOS Catalina | 662 | 12 | 111 | 18.75% |
| Wav2Letter (Flashlight predecessor) | 636 | 19 | 126 | 22.875% |
| Seq2Seq Jasper | 597 | 14 | 161 | 27.125% |
| Mozilla DeepSpeech 9 | 541 | 8 | 101 | 33.375% |
| PocketSphinx | 408 | 14 | 266 | 50.75% |
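The WER column can be reproduced from the other columns. The sketch below assumes the standard definition, WER = (Substitutions + Deletions + Insertions) / reference words, and that every reference word not counted as a hit or a substitution was a deletion. The function name is made up for illustration; the counts are copied from the table.

```python
def wer_percent(reference_words, hits, substitutions, insertions):
    """Word error rate, in percent, from hit/substitution/insertion counts."""
    # Every reference word is either a hit, a substitution, or a deletion.
    deletions = reference_words - hits - substitutions
    errors = substitutions + deletions + insertions
    return 100 * errors / reference_words


REFERENCE_WORDS = 800  # hits in the "Reference Transcript" row

# (hits, substitutions, insertions) for a few rows of the table
rows = {
    "Flashlight (15 second chunking)": (740, 39, 8),
    "iOS 14": (692, 87, 9),
    "PocketSphinx": (408, 266, 14),
}
for name, (hits, subs, ins) in rows.items():
    print(f"{name}: {wer_percent(REFERENCE_WORDS, hits, subs, ins):.3f}%")
```

For Flashlight this works out to 21 deletions and 68 errors in total, and 68 / 800 gives the 8.5% shown in the table.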
While I didn’t document it here, there were massive differences in the speed of the various options. Many of the poor performers were also extremely slow, while Flashlight was many times faster than real time, which is very convenient when you need to transcribe 400+ episodes. It also supported CUDA acceleration, which allowed me to use my NVIDIA gaming GPU to transcribe the many, many hours of Casey, Marco and John discussing technology.
Have questions or suggestions about catatp? Feel free to send me an e-mail. Do not expect an immediate response, as I have a day job and a family that keep me very busy.
This site uses a research-level voice recognition engine which, while impressive compared to historical standards, is still not close to a professional transcriber’s accuracy. But for transcribing hundreds of multi-hour podcast episodes as a side project, it is far more cost effective.
Nope. Just someone who enjoys listening to the show and was surprised no one had tried to automatically generate transcriptions for their show recently. After working on this project, though, I think I know why. I did, however, receive permission from the hosts before releasing this project.
Processing a new episode is now a single command (download, transcribe with the ensemble, identify speakers, build the site, and deploy), but it still needs my GPU free to run, so please be patient. I’ll try to get to it within a couple of days of the episode dropping.
I’m a software and firmware engineer in one of the most beautiful places in the world (Seattle, WA). I have a plan to write a personal blog which I’ll link here once that is ready.
Thanks for reading and stay safe!