About

Hey there! This is a side project I did for fun. Do not expect 100% accuracy from any of the unofficial transcripts, as they are entirely computer generated.

Background

A while ago I was listening to the Accidental Tech Podcast and the hosts were trying to recall something that had happened on a previous episode. I thought, wouldn’t it be nice to have a set of transcripts for all the episodes? Surely there were some reasonably easy-to-use open source audio transcription engines I could use? Surely this idea wouldn’t take more than a weekend or two!

While this was not the only side project I worked on between then and now, it took a fair bit of time to evaluate various solutions and figure out how to implement it. It was a nice, but unexpectedly large, side project to work on in between the rest of life. It is now March 2021, and it has finally launched.

A major hat tip to David Smith, who did something similar a while ago (though it has not been updated in years). I borrowed some of the overall UI theme he pioneered for this sort of project.

Note that this project is completely unaffiliated with the hosts of ATP outside of seeking their permission to release it.

I recently transitioned this project over to Whisper as the speech recognition engine. A write-up of that process and some lessons learned is available on my blog. I have not had a chance to update the text below, so treat the blog post as superseding it until I get a chance to update this page.

Project Fun Facts

Some fun facts about the setup of this project include:

Uses Facebook Research’s Flashlight speech recognition engine, combined with a bunch of Python scripts to intelligently split up files into overlapping maximum 15 second chunks, transcribe them, generate alignment information for them and finally merge them back together into a (hopefully) consistent transcript.
The current revision required splitting the ATP episodes into more than 500,000 files across all 400+ episodes.
Uses the Mutagen module for Python to extract metadata from all the MP3s.
Uses the Lanyon theme for Jekyll for document hosting (this website). I am not a web dev (and have no interest in being one professionally), so apologies about the number of things I’m likely doing horribly wrong. I also leveraged some of the copywriting the example already had in place. Note that this gave me an appreciation for the power of Markdown, as all the transcripts (and in fact all pages on this site) are written in Markdown and then converted to HTML etc. by Jekyll. Very cool.
Uses Tantivy as the local search engine for this site. It doesn’t use much RAM, returns results very quickly, and supports multi-word search unlike the first engine I tested out, Sonic (which admittedly targets even smaller installations than this one). Note that I tried out Google’s custom search but it was very slow to go through ATP’s enormous archive, and wasn’t processing results across lines split up by the play and pause buttons, so I decided to partially roll my own.
Performs asynchronous loading of the time alignment data after you click play. That way if you’re just examining the transcript you don’t need to load and/or process the massive dictionary used for fun visualization of what word is being spoken at a given moment.
Instead of having a <span> for each word in the transcript (which I did in early revisions of the site), I now have one <span> per transcript line and use JavaScript to select a specific word in the line to highlight/unhighlight. This helped reduce the size of the complete site (along with disabling Lanyon’s automatic pagination) in compressed 7z format from 214MB to 141MB. It also noticeably improves the initial load speed of transcripts, particularly on mobile networks. Note that I am not a web dev, and there are probably more ways to make it even more efficient (such as splitting the transcripts into one JS file/dictionary per chapter), but I’m happy with how it’s performing now.
Uses pyannote for speaker diarization (identification). I had speaker identification as a stretch goal in the original design doc for this site, but it was quickly dropped due to time constraints and not finding an option with sufficient accuracy. In advance of ATP’s 500th episode I did some more research, and discovered pyannote along with a couple of other options. I was impressed with pyannote’s accuracy and its ability to match speakers against existing samples (saving me from having to manually label 500 episodes’ worth of speakers, something I don’t have nearly enough time to do). A weekend of processing old episodes on two NVIDIA GPUs and a bunch of updates to my processing scripts later, I had labeled transcriptions. Note that, just like the speech recognition component, the accuracy is nowhere near 100%, but it is still surprisingly good. I plan on writing up my experiences with speaker diarization on my blog when I get a chance.
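The overlapping-chunk idea from the first item above can be sketched in a few lines of Python. This is only an illustration and not the actual scripts used for this site: the function name chunk_spans, the fixed 3 second overlap, and the fixed-step strategy are all assumptions (the real splitting is described as "intelligent", which this is not). Only the 15 second maximum comes from the text.

```python
def chunk_spans(total_seconds, max_len=15.0, overlap=3.0):
    """Return (start, end) times covering the audio with overlapping chunks.

    max_len mirrors the 15 second limit mentioned above; the overlap value
    is a hypothetical choice made only for this sketch.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be non-negative and smaller than max_len")
    step = max_len - overlap  # how far each new chunk starts after the last
    spans = []
    start = 0.0
    while True:
        end = min(start + max_len, total_seconds)
        spans.append((start, end))
        if end >= total_seconds:
            break  # the final chunk reaches the end of the audio
        start += step
    return spans


if __name__ == "__main__":
    for span in chunk_spans(40.0):
        print(span)
```

Each span shares its first few seconds with the end of the previous one, which gives a later merge step some repeated words to line the pieces up on.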

Speech Recognition Comparison

Early on in this project I chose an initial short segment of audio to use as a ‘reference’ to determine how different speech recognition systems could handle the tech-heavy speech of ATP. The first segment I selected was John’s soliloquy about the Apple Human Interface Guidelines and UI Design from Accidental Tech Podcast #392. It has a duration of 200 seconds. I then generated a ‘reference’ transcript for that 200 second segment and ran the audio through a variety of speech recognition systems. This included both open source options and the built-in offline transcription found in iOS 14 and macOS Big Sur (which interestingly consistently provided slightly different results…). Note that for Flashlight it also involved some custom algorithms / techniques to merge overlapping transcripts together, as Flashlight’s accuracy with the model I was using dropped significantly when samples were longer than ~15 seconds.
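To give a feel for what merging overlapping transcripts involves, here is a deliberately simple Python sketch. It is not the algorithm used for this site, which is not described here. The function name merge_transcripts, the max_overlap limit, and the exact-match rule are all assumptions, and the sample words are made up.

```python
def merge_transcripts(left, right, max_overlap=20):
    """Join two word lists, dropping words repeated at the seam.

    Looks for the longest run of words that ends `left` and also starts
    `right`. Exact matching is an assumption of this sketch; in practice
    two chunks may disagree on the words near a chunk boundary.
    """
    limit = min(len(left), len(right), max_overlap)
    for size in range(limit, 0, -1):
        if left[-size:] == right[:size]:
            return left + right[size:]
    # No shared words found, so fall back to plain concatenation.
    return left + right


if __name__ == "__main__":
    first = "the human interface guidelines are".split()
    second = "guidelines are a great document".split()
    print(" ".join(merge_transcripts(first, second)))
```

A more robust approach would use timing or alignment information rather than text alone, since a recognizer often gets the first and last word of a chunk wrong.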

For this project I did not examine cloud hosted speech recognition systems. With 400+ episodes of ATP, many of them over 3 hours long, transcribing it all would have been cost prohibitive for a side project. Note that they likely exceed the accuracy of even the best performing open source project, but were out of scope for this project.

For this project I wanted to maximize the number of Hits (number of words correctly found) while minimizing the number of Insertions (number of words inserted that did not exist in the original audio) and Substitutions (where a word was substituted for a different one). I used the Python module jiwer to help calculate all these values. I also removed all punctuation ('/"/./!/ etc) from the transcripts before running them through the comparison, to ensure I didn’t favor systems that used punctuation over those that didn’t. Finally, do not take this performance comparison as gospel. It is a short snippet of a very tech oriented show hosted by three white dudes from the United States, and performance might be dramatically different for your particular voice recognition use case.
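For readers curious how these counts come about, below is a small pure-Python sketch of a word-level edit distance alignment. It does not use jiwer and makes no claim to match its internals. The function names normalize and align_counts are made up for this example, and only ASCII punctuation is stripped.

```python
import string


def normalize(text):
    """Lowercase, strip ASCII punctuation, and split into words."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def align_counts(reference, hypothesis):
    """Count hits, substitutions, deletions and insertions for one pair."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    # Each cell holds (edits, hits, substitutions, deletions, insertions)
    # for the cheapest alignment of ref[:i] against hyp[:j].
    prev = [(j, 0, 0, 0, j) for j in range(len(hyp) + 1)]
    for i in range(1, len(ref) + 1):
        curr = [(i, 0, 0, i, 0)]
        for j in range(1, len(hyp) + 1):
            e, h, s, d, n = prev[j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (e, h + 1, s, d, n)      # words match: a hit
            else:
                diag = (e + 1, h, s + 1, d, n)  # words differ: substitution
            e, h, s, d, n = prev[j]
            delete = (e + 1, h, s, d + 1, n)    # reference word was missed
            e, h, s, d, n = curr[j - 1]
            insert = (e + 1, h, s, d, n + 1)    # extra word in hypothesis
            curr.append(min(diag, delete, insert, key=lambda c: c[0]))
        prev = curr
    edits, hits, subs, dels, ins = prev[-1]
    wer = edits / len(ref) if ref else 0.0
    return {"hits": hits, "substitutions": subs,
            "deletions": dels, "insertions": ins, "wer": wer}


if __name__ == "__main__":
    print(align_counts("The quick brown fox jumps.",
                       "the quick brown socks jumps over"))
```

When several alignments have the same total number of edits, the split between substitutions, deletions and insertions can differ between tools, but the WER itself does not.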

Speech Recognition System	Hits	Insertions	Substitutions	WER (Word Error Rate)
Reference Transcript	800	0	0	0.0%
Flashlight (15 second chunking)	740	8	39	8.5%
iOS 14	692	9	87	14.625%
macOS Big Sur	666	14	107	18.5%
macOS Catalina	662	12	111	18.75%
Wav2Letter (Flashlight predecessor)	636	19	126	22.875%
Seq2Seq Jasper	597	14	161	27.125%
Mozilla DeepSpeech 9	541	8	101	33.375%
PocketSphinx	408	14	266	50.75%
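The table does not list deletions (words in the reference that a system missed entirely), but they can be inferred. The sketch below assumes the reference has 800 words, as the Reference Transcript row suggests, and uses the standard definition of WER as (substitutions + deletions + insertions) divided by the number of reference words. The function name wer_from_counts is made up for this example. The WER column for every row in the table matches this calculation.

```python
def wer_from_counts(reference_words, hits, insertions, substitutions):
    """Recompute WER from the columns shown in the table above."""
    # Every reference word is either a hit, a substitution, or a deletion.
    deletions = reference_words - hits - substitutions
    return (substitutions + deletions + insertions) / reference_words


if __name__ == "__main__":
    # Values copied from the Flashlight and PocketSphinx rows.
    print(f"{wer_from_counts(800, 740, 8, 39):.3%}")
    print(f"{wer_from_counts(800, 408, 14, 266):.3%}")
```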

While I didn’t document it here, there were massive differences in the speed of the various options. Many of the poor performers were also extremely slow, while Flashlight was many times faster than real time, which is very convenient when you need to transcribe 400+ episodes. It also supported CUDA acceleration, which allowed me to use my NVIDIA gaming GPU to transcribe the many, many hours of Casey, Marco and John discussing technology.

TODO

While I added a beta-level search, I should probably rewrite some of the web interface components to be faster (hmm, Rust maybe?). Similarly, the local search implementation I’m using doesn’t have support for offsetting search results, so I plan on adding that feature in the future as well.

I also should add links from the stats page to the relevant episodes.

If I wanted to, there is some medium-hanging fruit for optimizing transcript highlight playback, such as splitting the dictionary containing the transcripts into one file per chapter, but I’m happy with how it’s doing for now.

Finally there is a bug where sometimes played audio and highlighted words are out of sync (primarily on Chrome browsers from my brief testing). Not sure if there is a workaround for it.

Feedback

Have questions or suggestions about catatp? Feel free to send me an e-mail. Do not expect an immediate response, as I do have a day job and family that keep me very busy.

Mini Q&A

Why isn’t the section I want to read accurate?

This site uses a research-level voice recognition engine, which, while impressive compared to historical standards, is still not close to a professional transcriber’s accuracy. But for transcribing hundreds of multi-hour podcast episodes as a side project, it is far more cost effective.

Are you affiliated with ATP at all?

Nope. Just someone who enjoys listening to the show and was surprised no one had tried to automatically generate transcriptions for it recently. After working on this project though, I think I know why. I did, however, receive permission from the hosts before releasing this project.

Why are the audio and the highlighted words sometimes out of sync?

Yeah, I think this is due to bugs in some browsers’ (cough Chrome cough) audio playback implementations. I’ve seen it primarily in Chrome, and when I switch over to Safari it appears to be in sync even for the same episode/chapter. I find this odd, as I much prefer how Chrome handles audio playback speed in YouTube compared to Safari. Even within Chrome it does seem to be worse in some episodes than others. Honestly, fixing this is not a high priority, as the feature was a ‘fun’ one and not core to the objective of the website. If this is bugging you, try a different browser, and hopefully the issue will improve.

Episode XXX dropped, and I don’t see a transcript yet?

Currently this involves a number of manually executed commands, plus it needs my GPU free to run the transcription. So patience. I’ll try to get to it within a couple days of the episode dropping.

What does a cat have to do with a transcript?

🐈

Who created this?

I’m a software and firmware engineer in one of the most beautiful places in the world (Seattle, WA). I have a plan to write a personal blog which I’ll link here once that is ready.

Thanks for reading and stay safe!

catatp.fm Unofficial Accidental Tech Podcast transcripts (generated by computer, so expect errors).