A while ago I was listening to the Accidental Tech Podcast and the hosts were trying to recall something that had happened on a previous episode. I thought, wouldn’t it be nice to have a set of transcripts for all the episodes? Surely there were some reasonably easy to use open source audio transcription engines I could use? Surely this idea wouldn’t take more than a weekend or two!
While this was not the only side project I worked on between then and now, needless to say it took a bit of time to evaluate various solutions and figure out how to implement it. This was a nice but unexpectedly large side project to work on in between the rest of life. It is now March 2021, and it is finally launched.
A major hat tip to David Smith, who did something similar a while ago (but which has not been updated in years). I leveraged some of the overall UI theme he pioneered for this sort of project.
Note that this project is completely unaffiliated with the hosts of ATP outside of seeking their permission to release it.
Some fun facts about the setup of this project include:
Rather than using a `<span>` for each word in the transcript (which I did in early revisions of the site), I now have one `<span>` per transcript line and use JavaScript to then select a specific word in the line to highlight/unhighlight. This helped reduce the size of the complete site (along with disabling Lanyon’s automatic pagination) in compressed 7z format from 214MB to 141MB. It also noticeably improves the initial load speed of transcripts, particularly on mobile networks. Note that I am not a web dev, and there are probably more ways to make it even more efficient (such as splitting the transcripts into one JS file/dictionary per chapter), but I’m happy with how it’s performing now.

Early on in this project I chose a short segment of audio to use as a ‘reference’ to determine how well different speech recognition systems could handle the tech-heavy speech of ATP. The first segment I selected was John’s soliloquy about the Apple Human Interface Guidelines and UI Design from Accidental Tech Podcast #392, which has a duration of 200 seconds. I generated a ‘reference’ transcript for this segment and then ran the segment through a variety of speech recognition systems. This included both open source options and the built-in offline transcription found in iOS 14 and macOS Big Sur (which, interestingly, consistently provided slightly different results…). Note that for Flashlight it also involved some custom algorithms / techniques to merge overlapping transcripts together, as Flashlight’s accuracy with the model I was using dropped significantly when samples were longer than ~15 seconds.
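The merging technique used for Flashlight is not described here, so what follows is only a minimal Python sketch of one way overlapping chunk transcripts could be stitched together. It assumes consecutive chunks overlap in time and that the recognizer produced identical words inside the overlap. The function name `merge_overlapping` and the `max_overlap` parameter are hypothetical and are not part of Flashlight.

```python
from typing import List


def merge_overlapping(chunks: List[List[str]], max_overlap: int = 20) -> List[str]:
    """Stitch word lists from consecutive, overlapping audio chunks.

    Assumption: the tail of one chunk and the head of the next contain
    the same words, so the overlap can be found by exact matching.
    """
    merged: List[str] = []
    for words in chunks:
        # The overlap can never be longer than either side.
        limit = min(max_overlap, len(merged), len(words))
        overlap = 0
        # Try the longest candidate overlap first, then shrink.
        for size in range(limit, 0, -1):
            if merged[-size:] == words[:size]:
                overlap = size
                break
        # Append only the words that are new in this chunk.
        merged.extend(words[overlap:])
    return merged
```

Real recognizer output tends to disagree near chunk boundaries, so a practical version would likely need fuzzy matching or word timestamps rather than exact equality.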
For this project I did not examine cloud-hosted speech recognition systems: with 400+ episodes of ATP, often 3+ hours each, transcribing it all would have been cost-prohibitive for a side project. They likely exceed the performance of even the best-performing open source option, but were out of scope for this project.
For this project I wanted to maximize the number of Hits (number of words correctly found) while minimizing the number of Insertions (number of words inserted that did not exist in the original audio) and Substitutions (where a word was substituted for a different one). I used the Python module jiwer to help calculate all these values. I also removed all punctuation (' / " / . / ! / etc.) before running the transcripts through the comparison, to ensure I didn’t favor systems that used punctuation over those that didn’t. Finally, do not take this performance comparison as gospel: it is a short snippet of a very tech-oriented show hosted by three white dudes from the United States, and performance might be dramatically different for your particular voice recognition use case.
Speech Recognition System | Hits | Insertions | Substitutions | WER (Word Error Rate) |
---|---|---|---|---|
Reference Transcript | 800 | 0 | 0 | 0.0% |
Flashlight (15 second chunking) | 740 | 8 | 39 | 8.5% |
iOS 14 | 692 | 9 | 87 | 14.625% |
macOS Big Sur | 666 | 14 | 107 | 18.5% |
macOS Catalina | 662 | 12 | 111 | 18.75% |
Wav2Letter (Flashlight predecessor) | 636 | 19 | 126 | 22.875% |
Seq2Seq Jasper | 597 | 14 | 161 | 27.125% |
Mozilla DeepSpeech 9 | 541 | 8 | 101 | 33.375% |
PocketSphinx | 408 | 14 | 266 | 50.75% |
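The columns above are related by simple arithmetic: the reference length N equals hits + substitutions + deletions, and WER = (substitutions + deletions + insertions) / N. The sketch below is a standalone Python illustration of that calculation, not the jiwer code used to produce the table. The function names and the lowercasing step are assumptions added for illustration.

```python
import string


def normalize(text):
    """Strip ASCII punctuation and split into words.

    Lowercasing is an assumption of this sketch; curly quotes and other
    non-ASCII punctuation are not removed here.
    """
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def align_counts(reference, hypothesis):
    """Count hits, substitutions, deletions and insertions via word-level
    edit distance, then derive the word error rate."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")
    # cost[i][j] = edit distance between ref[:i] and hyp[:j]
    cost = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        cost[i][0] = i
    for j in range(len(hyp) + 1):
        cost[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            diff = 0 if ref[i - 1] == hyp[j - 1] else 1
            cost[i][j] = min(cost[i - 1][j - 1] + diff,  # hit or substitution
                             cost[i - 1][j] + 1,         # deletion
                             cost[i][j - 1] + 1)         # insertion
    # Walk back through the table to classify each step.
    counts = {"hits": 0, "substitutions": 0, "deletions": 0, "insertions": 0}
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            counts["hits" if ref[i - 1] == hyp[j - 1] else "substitutions"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            counts["deletions"] += 1
            i -= 1
        else:
            counts["insertions"] += 1
            j -= 1
    errors = counts["substitutions"] + counts["deletions"] + counts["insertions"]
    counts["wer"] = errors / len(ref)
    return counts


def wer_from_counts(n_ref, hits, substitutions, insertions):
    """Recover WER from a table row; deletions are implied by the other columns."""
    deletions = n_ref - hits - substitutions
    return (substitutions + deletions + insertions) / n_ref
```

For example, `wer_from_counts(800, 740, 39, 8)` implies 21 deletions and gives 0.085, matching the 8.5% in the Flashlight row.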
While I didn’t document it here, there were massive differences in the speed of the various options. Many of the poor performers were also extremely slow, while Flashlight was many times faster than real time, which is very convenient when you need to transcribe 400+ episodes. It also supported CUDA acceleration, which allowed me to use my NVIDIA gaming GPU to transcribe the many, many hours of Casey, Marco and John discussing technology.
While I added a beta-level search, I should probably rewrite some of the web interface components to be faster (hmm, Rust maybe?). Similarly, the local search implementation I’m using doesn’t support offsetting search results, so I plan on adding that feature in the future as well.
I also should add links from the stats page to the relevant episodes.
If I wanted to, there is some medium-hanging fruit for optimizing transcript highlight playback, such as splitting the dictionary containing the transcripts into one file per chapter, but I’m happy with how it’s doing for now.
Finally, there is a bug where the played audio and highlighted words are sometimes out of sync (primarily on Chrome browsers, from my brief testing). Not sure if there is a workaround for it.
Have questions or suggestions about catatp? Feel free to send me an e-mail. Do not expect an immediate response, as I have a day job and family that keep me very busy.
This site uses a research-level voice recognition engine which, while impressive compared to historical standards, is still not close to a professional transcriber’s accuracy. But for transcribing hundreds of multi-hour podcast episodes as a side project, it is far more cost-effective.
Nope. Just someone who enjoys listening to the show and was surprised no one had tried to automatically generate transcriptions for it recently. After working on this project, though, I think I know why. I did, however, receive permission from the hosts before releasing this project.
Yeah, I think this is due to bugs in some browsers’ (cough Chrome cough) audio playback implementations. I’ve seen it primarily in Chrome, and when I switch over to Safari it appears to be in sync even for the same episode/chapter. I find this odd, as I much prefer the speed handling engine for audio in YouTube on Chrome compared to Safari. Even within Chrome it does seem to be worse in some episodes than others. Honestly, fixing this is not a high priority; the feature was a ‘fun’ one, not core to the objective of the website. If this is bugging you, try a different browser, and hopefully the issue will improve.
Currently this involves a number of manually executed commands, plus it needs my GPU free to run the transcription. So patience. I’ll try to get to it within a couple days of the episode dropping.
I’m a software and firmware engineer in one of the most beautiful places in the world (Seattle, WA). I have a plan to write a personal blog which I’ll link here once that is ready.
Thanks for reading and stay safe!