At UpdateAI, our mission is to help SaaS organizations create a shared understanding of their customers’ needs through existing customer conversations. We’re starting by capturing important action items from our users’ meetings, so that no customer need or request slips through the cracks. The model that powers this core use case is our proprietary Action Item Detection System. For this system to detect action items as reliably as possible, our audio transcription has to be as accurate as possible: the more errors a transcript contains, the less accurate our model will be at detecting action items. (Example: it is clearly easier to detect that “Please send me the contract” is an action item than a faulty transcription such as “Peas send meme contact.”) This is why the decision to “build vs. buy” our speech-to-text (transcription) technology was such an important one.
We briefly explored the idea of building our own ASR (Automatic Speech Recognition) engine, but it quickly became apparent that, despite the cost benefits of taking an open source model (such as from Google or AWS) and customizing it, in the near term it would not result in the highest quality experience for our users, and that’s what we’ve committed to. Instead, we sifted through the growing landscape of speech recognition providers to find the one that would give us the best overall results for our mission and initial use case. Our company makes a bold claim to have developed the most accurate action item detection technology, which is why we’ve decided to publicly share our approach to sourcing the highest quality transcription to help get us there.
Every transcription provider will have some errors, but it was important for us to find the provider with the fewest. We compared 7 of the leading transcription providers to see which performed best for our mission. In our comparison we evaluated 4 categories:
For our test, we used human-transcribed transcripts of 15 customer-facing calls to generate our “source of truth.” We ran those transcripts through our model and detected 22 action items, which were then confirmed by a human. We found that DeepGram had the lowest word error rate (WER), followed by Otter.ai, Gong.io, Supernormal, Zoom, Assembly.ai, then Fireflies.ai.
With DeepGram’s transcripts, this lower WER enabled our model to detect 17 of the 22 action items (about 77%).
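For readers who have not worked with WER before, here is a minimal sketch of how the metric is commonly computed: the number of word substitutions, deletions, and insertions needed to turn the provider’s transcript into the human reference, divided by the number of words in the reference. This is the textbook definition, not our actual evaluation pipeline; the function name `word_error_rate` and the lowercase, whitespace-based tokenization are assumptions made for illustration. It reuses the “Please send me the contract” example from above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in reference.

    Illustrative only: tokenization is a simple lowercase whitespace split,
    which is an assumption and not necessarily what any provider uses.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions needed to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word is missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "Please send me the contract"
hypothesis = "Peas send meme contact"
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.2f}")
```

In this example only “send” survives intact: three words are substituted and one is dropped, so the faulty transcript has 4 errors against a 5-word reference. A lower WER means fewer such errors reach the action item model.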
DeepGram outperformed the 6 other providers in our testing, so we’ve chosen them as our provider so that we can deliver the most accurate action items to our users.