catatp.fm Unofficial Accidental Tech Podcast transcripts (generated by computer, so expect errors).

About

Hey there! This is a side project I did for fun. Do not expect 100% accuracy from any of the unofficial transcripts, as they are entirely computer generated.

Background

A while ago I was listening to the Accidental Tech Podcast and the hosts were trying to recall something that had happened on a previous episode. I thought, wouldn’t it be nice to have a set of transcripts for all the episodes? Surely there were some reasonably easy to use open source audio transcription engines I could use? Surely this idea wouldn’t take more than a weekend or two!

While this was not the only side project I worked on between then and now, needless to say it took a bit of time to evaluate various solutions and figure out how to implement it. It turned out to be a nice but unexpectedly large side project to work on in between the rest of life. It is now March 2021, and it is finally launched.

A major hat tip to David Smith, who did something similar a while ago. I leveraged some of the overall UI theme he pioneered for this sort of project, and then added a bunch more features and functionality to it.

Note that this project is completely unaffiliated with the hosts of the Accidental Tech Podcast outside of seeking their permission to release it.

The Current Approach

Update: in mid-2026 with Claude's assistance I made significant updates to this project. All sections below this section describe earlier revisions and are kept around for posterity. (I wrote up the earlier 2023 switch to Whisper on my blog.)

While my approaches with Flashlight and then Whisper were state of the art when I originally implemented them (and my sliding window approach with Flashlight actually meant I beat many commercial options…), things have progressed since 2023, when I last modified the site. Recently I updated the existing pipeline, replacing the single-ASR approach with an ensemble of multiple recognition engines that effectively check each other’s work:

  • NVIDIA Parakeet-TDT 0.6b v2 and WhisperX (using Whisper large-v3) each transcribe every episode independently. Both are (as has been the case with this project) run on a local NVIDIA GPU. Both models are solid performers, but definitely still not perfect. Since the two likely had different training sets and were built by different companies (NVIDIA and OpenAI), in my testing they have tended to make different mistakes. Wherever the two disagree on a word / spoken segment, I hand the dispute to a referee.
  • The referee is a local large language model, Google’s Gemma 4 31B, running on my own GPU via Ollama. For each disagreement it picks the reading that makes the most sense in context.
  • An LLM judging the text alone would effectively be making educated guesses, so I give it additional evidence to work with. Each candidate word is force-aligned against the actual audio with wav2vec2 and scored on how well it matches what was really said. In addition, a small audio-input model (Gemma 4 E4B) actually listens to the disputed clip and reports back what it heard. When both engines are clearly wrong, the judge is even allowed to propose a correction of its own, provided the wav2vec2 alignment data backs it.
  • While I thought up this approach independently, upon researching it, it looks like it isn’t unique, though I might be the first to leverage alignment ranking.
  • Speaker identification (“who is talking”) now uses pyannote’s community-1 model, matching voices against saved samples of Casey, Marco and John as before, just now with a slightly updated model.
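The disagreement step in the list above can be sketched roughly as follows. This is a minimal illustration, not the site's actual pipeline: it assumes each engine's output is already a plain list of words, uses Python's standard `difflib.SequenceMatcher` to find where the two lists differ, and swaps in a hypothetical `toy_judge` with made-up alignment scores in place of the LLM referee, the wav2vec2 scoring, and the audio-input model.

```python
from difflib import SequenceMatcher


def find_disagreements(words_a, words_b):
    """Return (a_words, b_words) pairs for every span where the two transcripts differ."""
    matcher = SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    return [
        (words_a[a0:a1], words_b[b0:b1])
        for tag, a0, a1, b0, b1 in matcher.get_opcodes()
        if tag != "equal"
    ]


def merge_with_judge(words_a, words_b, judge):
    """Keep agreed words as-is; let judge(a_words, b_words) settle each dispute."""
    matcher = SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    merged = []
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        if tag == "equal":
            merged.extend(words_a[a0:a1])
        else:
            merged.extend(judge(words_a[a0:a1], words_b[b0:b1]))
    return merged


# Made-up example transcripts and made-up alignment scores (higher = better audio match).
engine_a = "the new mac book pro has an m one chip".split()
engine_b = "the new macbook pro has an m1 chip".split()
toy_scores = {"mac book": 0.41, "macbook": 0.87, "m one": 0.35, "m1": 0.90}


def toy_judge(a_words, b_words):
    # Hypothetical stand-in for the LLM referee: pick the better-scoring candidate.
    score_a = toy_scores.get(" ".join(a_words), 0.0)
    score_b = toy_scores.get(" ".join(b_words), 0.0)
    return a_words if score_a >= score_b else b_words


disputes = find_disagreements(engine_a, engine_b)
merged = merge_with_judge(engine_a, engine_b, toy_judge)
print(disputes)
print(" ".join(merged))
```

Only the disputed spans ever reach the judge; everything the two engines agree on passes straight through, which keeps the number of referee calls small.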

On the website side, search is now Pagefind, which builds a static index that runs entirely in your browser. There is no search server to keep alive or to forget to load the latest episode’s results into. The whole site is now just static files hosted on Cloudflare. The word-by-word highlight-as-it-plays feature has also been retained, though I did have to fix some AI-generated bugs (flashbacks to my day job, as even frontier models often struggle with embedded firmware code…).

Each transcript page notes exactly which engines and which judge produced it. I may well experiment with different models / techniques over time.

If you’re curious about the details of ASR engine transcription disagreements, there is also a “View ASR model disagreements” toggle in the sidebar. Flip it on and, on any episode transcribed with the new ensemble pipeline, every word the two engines disagreed about is highlighted. Click or hover over one and you’ll see what each engine heard (along with how well the audio actually backed up each guess), what the third audio-input model heard when it listened to that clip, and which version the judge ultimately picked, or very occasionally, the replacement it wrote itself. It is interesting to see how the different models interpret certain phrases / words.

I also had Claude create a dark / light mode, plus a “system” mode that follows whatever preference your browser reports. The page defaults to “system”.

In the more “fun” category, I added a graph in the statistics section showing where in each episode (using % of elapsed time) the episode title was spoken. Note there are a fair number of missing data points where the episode title wasn’t correctly identified. It is possible that as I regenerate older episodes using the new ensemble method, it will pick up additional episode titles and fill in the graph further.
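As a rough sketch of how a data point on that graph could be computed: the function below looks for the title as a run of consecutive words in a timed transcript and reports where it starts as a percentage of the episode length. The data layout (a list of `(start_seconds, word)` pairs), the function name, and the example title are all assumptions made for illustration, not the site's actual code.

```python
import string


def _norm(word):
    # Lowercase and trim ASCII punctuation so "Title." matches "title".
    return word.lower().strip(string.punctuation)


def title_position_percent(timed_words, title, duration_s):
    """Return where the title is first spoken, as % of elapsed time, or None if not found.

    timed_words is a hypothetical list of (start_seconds, word) pairs.
    """
    target = [_norm(w) for w in title.split()]
    if not target:
        return None
    spoken = [_norm(w) for _, w in timed_words]
    n = len(target)
    for i in range(len(spoken) - n + 1):
        if spoken[i:i + n] == target:
            return 100.0 * timed_words[i][0] / duration_s
    return None  # a missing data point on the graph


# Made-up one hour episode where the (made-up) title is spoken at the 30 minute mark.
words = [(10.0, "so"), (1800.0, "A"), (1800.3, "hypothetical"), (1800.9, "title."), (1802.0, "anyway")]
print(title_position_percent(words, "A Hypothetical Title", 3600.0))
print(title_position_percent(words, "Never Spoken", 3600.0))
```

An exact word-for-word match like this fails whenever the recognizer mishears even one word of the title, which is one plausible reason for gaps in the graph.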

Project Fun Facts (OUTDATED)

The facts below describe earlier revisions of the project (the Flashlight and Whisper eras) and are kept for posterity — see The Current Approach above for how things work today. Notably, the per-line highlighting trick mentioned below was later reverted to one span per word, and the Tantivy search has been replaced with Pagefind.

Some fun facts about the setup of this project include:

  • Uses Facebook Research’s Flashlight speech recognition engine, combined with a bunch of Python scripts to intelligently split up files into overlapping maximum 15 second chunks, transcribe them, generate alignment information for them and finally merge them back together into a (hopefully) consistent transcript.
  • The current revision of this required splitting the ATP episodes into more than 500,000 files across all 400+ episodes.
  • Uses the Mutagen module for Python to extract metadata from all the MP3s.
  • Uses the Lanyon theme for Jekyll document hosting (this website). I am not a web dev (and have no interest in being one professionally), so apologies about the number of things I’m likely doing horribly wrong. I also leveraged some of the copywriting the example already had in place. Note that this gave me an appreciation for the power of Markdown, as all the transcripts (and in fact all pages on this site) are written in Markdown and then converted to HTML / etc by Jekyll. Very cool.
  • Uses Tantivy as the local search engine for this site. It doesn’t use much RAM, returns results very quickly, and supports multi-word search unlike the first engine I tested out, Sonic (which admittedly targets even smaller installations than this one). Note that I tried out Google’s custom search but it was very slow to go through ATP’s enormous archive, and wasn’t processing results across lines split up by the play and pause buttons, so I decided to partially roll my own.
  • Performs asynchronous loading of the time alignment data after you click play. That way if you’re just examining the transcript you don’t need to load and/or process the massive dictionary used for fun visualization of what word is being spoken at a given moment.
  • Instead of having a <span> for each word in the transcript (which I did in early revisions of the site), I now have one <span> per transcript line and use JavaScript to then select a specific word in the line to highlight/unhighlight. This helped reduce the size of the complete site (along with disabling Lanyon’s automatic pagination) in compressed 7z format from 214MB to 141MB. It also helps noticeably improve initial load speed of transcripts, particularly on mobile networks. Note that I am not a web dev, and there are probably more ways to make it even more efficient (such as splitting the transcripts into one JS file/dictionary per chapter), but I’m happy with how it’s performing now.
  • Uses pyannote for speaker diarization (identification). I had speaker identification as a stretch goal within the original design doc for this site, but it was quickly dropped due to time constraints and not finding an option with sufficient accuracy. In advance of ATP’s 500th episode I did some more research, and discovered pyannote along with a couple other options. I was impressed with pyannote’s accuracy and ability to perform speaker matching to existing samples (saving me from having to manually label 500 episodes worth of speakers, something I don’t have nearly enough time to do). One weekend of processing old episodes on two NVIDIA GPUs and a bunch of updates to my processing scripts later, I had labeled transcriptions. Note that, just like the speech recognition component, its accuracy is nowhere near 100%, but it is still surprisingly good. I plan on writing up my experiences with speaker diarization on my blog when I get a chance.
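The overlapping-chunk idea from the first bullet can be sketched as below. This is only the simplest fixed-step version: the 15 second maximum comes from the text above, but the 5 second overlap and the function name are assumptions, and the original scripts are described as splitting "intelligently", which this sketch does not attempt.

```python
def chunk_windows(duration_s, max_len_s=15.0, overlap_s=5.0):
    """Return (start, end) windows of at most max_len_s seconds that overlap by overlap_s."""
    if not 0 <= overlap_s < max_len_s:
        raise ValueError("overlap must be non-negative and shorter than the window")
    step = max_len_s - overlap_s
    windows = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_len_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:
            break  # the final window already reaches the end of the audio
        start += step
    return windows


# A made-up 40 second clip.
print(chunk_windows(40.0))
```

Because neighbouring windows share audio, a word cut off at the edge of one chunk appears whole in the next, which is what makes it possible to merge the per-chunk transcripts back into one consistent transcript.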

Speech Recognition Comparison (2021)

This comparison is from the original (2021) version of the project and reflects what was available then. It is not how transcripts are generated today (see The Current Approach). Still a fun historical snapshot.

Early on in this project I chose an initial short segment of audio to use as a ‘reference’ to determine how different speech recognition systems could handle the tech-heavy speech of ATP. The first segment I selected was John’s soliloquy about the Apple Human Interface Guidelines and UI Design from Accidental Tech Podcast #392. It has a duration of 200 seconds. I then generated a ‘reference’ transcript for that segment and ran the audio through a variety of speech recognition systems. This included both open source options and the built-in offline transcription found in iOS 14 and macOS Big Sur (which interestingly consistently provided slightly different results…). Note that for Flashlight it also involved some custom algorithms / techniques to merge overlapping transcripts together, as Flashlight’s accuracy with the model I was using dropped significantly when samples were longer than ~15 seconds.

For this project I did not examine cloud hosted speech recognition systems, since transcribing ATP’s 400+ episodes, many of them 3+ hours long, would have been cost prohibitive for a side project. Note that they likely exceed the performance of even the best performing open source option, but were out of scope for this project.

For this project I wanted to maximize the number of Hits (number of words correctly found) while minimizing the number of Insertions (number of words inserted that did not exist in original audio) and Substitutions (where a word was substituted for a different one). I used the Python module jiwer to help calculate all these values. I also removed all punctuation ('/"/./!/ etc) before running them through the comparison, to ensure I didn’t favor those that used punctuation over those that didn’t. Finally, do not take this performance comparison as gospel: it is a short snippet of a very tech oriented show hosted by three white dudes from the United States, and performance might be dramatically different for your particular voice recognition use case.
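To make those three counts concrete, here is a small standard-library sketch of the same idea: strip punctuation, then align the hypothesis against the reference word by word and count hits, substitutions, deletions, and insertions. It is a stand-in written for illustration and does not use or reproduce jiwer's API; the function names and the example sentences are made up.

```python
import string


def normalize(text):
    """Lowercase, drop ASCII punctuation, and split into words."""
    return text.lower().translate(str.maketrans("", "", string.punctuation)).split()


def align_counts(ref, hyp):
    """Word-level edit-distance alignment of hyp against ref, returning the four counts."""
    # Each cell holds (edits, hits, subs, dels, ins) for ref[:i] versus hyp[:j].
    prev = [(j, 0, 0, 0, j) for j in range(len(hyp) + 1)]
    for i, r in enumerate(ref, start=1):
        cur = [(i, 0, 0, i, 0)]
        for j, h in enumerate(hyp, start=1):
            e, hit, s, d, n = prev[j - 1]
            diag = (e, hit + 1, s, d, n) if r == h else (e + 1, hit, s + 1, d, n)
            e, hit, s, d, n = prev[j]
            delete = (e + 1, hit, s, d + 1, n)  # reference word missing from hypothesis
            e, hit, s, d, n = cur[j - 1]
            insert = (e + 1, hit, s, d, n + 1)  # extra word in hypothesis
            cur.append(min(diag, delete, insert, key=lambda cell: cell[0]))
        prev = cur
    _, hits, subs, dels, ins = prev[-1]
    return {"hits": hits, "substitutions": subs, "deletions": dels, "insertions": ins}


# Made-up reference and hypothesis sentences.
reference = normalize("It's the Human Interface Guidelines, right?")
hypothesis = normalize("its the human interface guideline right now")
counts = align_counts(reference, hypothesis)
print(counts)
```

Here "guideline" counts as a substitution for "guidelines" and the trailing "now" as an insertion, while the punctuation differences cost nothing because they are removed before the comparison.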

| Speech Recognition System | Hits | Insertions | Substitutions | WER (Word Error Rate) |
|---|---|---|---|---|
| Reference Transcript | 800 | 0 | 0 | 0.0% |
| Flashlight (15 second chunking) | 740 | 8 | 39 | 8.5% |
| iOS 14 | 692 | 9 | 87 | 14.625% |
| macOS Big Sur | 666 | 14 | 107 | 18.5% |
| macOS Catalina | 662 | 12 | 111 | 18.75% |
| Wav2Letter (Flashlight predecessor) | 636 | 19 | 126 | 22.875% |
| Seq2Seq Jasper | 597 | 14 | 161 | 27.125% |
| Mozilla DeepSpeech 9 | 541 | 8 | 101 | 33.375% |
| PocketSphinx | 408 | 14 | 266 | 50.75% |
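The WER column can be reproduced from the other three columns using the standard definition, WER = (substitutions + deletions + insertions) / reference words. The table does not list deletions, so the sketch below infers them on the assumption that every one of the 800 reference words is either a hit, a substitution, or a deletion. The function name is made up for illustration.

```python
REFERENCE_WORDS = 800  # hits in the "Reference Transcript" row


def wer_percent(hits, insertions, substitutions, ref_words=REFERENCE_WORDS):
    """Word error rate in percent, inferring deletions from the other counts."""
    deletions = ref_words - hits - substitutions
    return 100.0 * (substitutions + deletions + insertions) / ref_words


# (hits, insertions, substitutions) copied from three rows of the table above.
rows = {
    "Flashlight (15 second chunking)": (740, 8, 39),
    "iOS 14": (692, 9, 87),
    "PocketSphinx": (408, 14, 266),
}
for name, (hits, insertions, substitutions) in rows.items():
    print(f"{name}: {wer_percent(hits, insertions, substitutions):.3f}%")
```

For Flashlight this gives 800 - 740 - 39 = 21 deletions and (39 + 21 + 8) / 800 = 8.5%, matching the table; the other rows check out the same way.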

While I didn’t document it here, there were massive deltas in the speed of the various options. Many of the poor performers were also extremely slow, while Flashlight was many times faster than real time, which is very convenient when you need to transcribe 400+ episodes. It also supported CUDA acceleration, which allowed me to use my NVIDIA gaming GPU to transcribe the many, many hours of Casey, Marco and John discussing technology.

Feedback

Have questions or suggestions about catatp? Feel free to send me an e-mail. Do not expect an immediate response, as I have a day job and a family that keep me very busy.

Mini Q&A

Why isn’t the section I want to read accurate?

This site uses a research-level voice recognition engine, which, while impressive by historical standards, is still not close to a professional transcriber’s accuracy. But for transcribing hundreds of multi-hour podcast episodes as a side project, it is far more cost effective.

Are you affiliated with ATP at all?

Nope. Just someone who enjoys listening to the show and was surprised no one had tried to automatically generate transcriptions for their show recently. After working on this project though, I think I know why. I did, however, receive permission from the hosts before releasing this project.

Episode XXX dropped, and I don’t see a transcript yet?

Processing a new episode is now a single command (download, transcribe with the ensemble, identify speakers, build the site, and deploy), but it still needs my GPU free to run, so please be patient. I’ll try to get to it within a couple of days of the episode dropping.

What does a cat have to do with a transcript?

🐈

Who created this?

I’m a software and firmware engineer in one of the most beautiful places in the world (Seattle, WA). I have a plan to write a personal blog which I’ll link here once that is ready.

Thanks for reading and stay safe!