AI-powered dictation, transcription, and workflow automation tools have existed for decades. They play essential roles in education, healthcare, finance, and customer service, not to mention journalism. They are also important for training new multi-modal Large Language Models (LLMs) to support various domains and use cases.
Yet many aspects of these transcription workflows are locked up inside vendor tools, each with tradeoffs in accuracy, simplicity, user experience, and integration. Stitching a workflow across disparate tools, each better suited to a different part of the process, brings tradeoffs of its own. Good luck trying to troubleshoot a questionable phrase in Word.
Some are better at automatically recording meetings across services, cleaning the audio, diarization (identifying who is speaking), accurately transcribing esoteric words (or at least letting you correct them), providing a good UX, or summarization. Many newer variants add workflows for specific industries like legal, UX research, proofreading, or curating trustworthy AI training datasets.
I started wondering more deeply about gaps in modern transcription workflows when my doctor complained about the extra work he had to do thanks to his fancy new transcription workflow. He said things seemed easier when he could send his audio and notes off to an admin assistant who made it all just work. Weren't these new tools supposed to save him time for more patient visits?
A few weeks later, I found myself troubleshooting a misquote in an article across three transcription tools. It took careful listening and context to distinguish "totally rational" from "totally irrational," which means the opposite. On top of that, most of these latest transcription tools seem to consistently mis-transcribe the word "LLMs" despite its wide use in their new gen AI summaries.
I once interviewed a very knowledgeable source with the habit of saying "no" as a filler word, where others might say "ah," "um," or "you know." I spent quite a bit of time cleaning that one up, because there were a few times he meant it.
Things get more complicated as these transcripts feed new gen AI-powered workflows that can amplify inaccuracies. Even when these tools produce a quality transcript, their summaries can make up facts that are not grounded in the actual text, as Chris Middleton reported a few months ago.
I first started investigating transcription and dictation tools over a decade ago. At the time, Nuance's Dragon Dictate was getting accurate enough to save time, particularly after investing in the right headset. It also worked as a plugin directly within Microsoft Word, which simplified the workflow compared to other tools. A particularly handy feature was that it let you add new words, such as industry acronyms, and would subsequently get them right. On the downside, it was a large program that slowed my computer's performance and seemed to get increasingly buggy over time.
I moved to Express Scribe, which handily let you control audio playback with a foot pedal. I could connect Dragon or Microsoft's speech recognition engine on the back end to jumpstart the process. Dragon was always better.
A few years ago, Microsoft finally added native dictation to Word, which at least worked within the app. It was not quite as accurate as Dragon, but it also did not slow my whole computer down. It still does not let you correct words, though. I hoped that might change once Microsoft bought Nuance, but it has not, and for that matter, neither do any of the other tools.
More recently, I find myself jumping between some of the newer transcription services for various reasons. Zoom's version is nice because it is built into the conferencing service, has a decent UI, and matches speakers using separate audio channels. The others sometimes get confused regarding who is speaking. But I don't keep those conversations on the server for long because I always seem to be hitting my 5 GB limit.
Otter does a great job of automatically showing up for meetings on the calendar across various services and keeping files around for later research. Speakers from previous meetings are also matched automatically based on the sound of their voices. It also lets you upload files. However, it struggles with new words, and it sometimes inexplicably drops important audio segments, necessitating a trip back to Zoom for the original audio.
Meanwhile, I find myself turning to Speechmatics when I am trying to transcribe a conversation with a lot of technical jargon, since it does the best job at this. It just can't attend meetings for you, since their priority seems to be selling API access for enterprises.
But then, none of these tools can help much when you start with poor audio, like something you might record in a reverberating conference hall or with lots of background noise. This is where Descript can clean up the audio file so you can zero in on important nuances.
None of these tools can plug directly into Microsoft Office, where I would prefer to do my writing. They also require a lot of clicks and configuration changes to correct a mistake. For questionable phrases, I often find myself editing the error in Word, which is quicker, rather than giving them the benefit of my feedback.
It would be great to record separate audio files in Zoom, clean up the audio with Descript, run the transcription through Speechmatics, and label the speakers with Otter. But then, how do you create a workflow across these kinds of tools?
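In principle, that dream workflow is just a pipeline of steps, each handing its output to the next. Here is a purely hypothetical sketch in Python; every function below is an invented placeholder standing in for a vendor call, since none of these services shares a common API:

```python
# Each function is a hypothetical stand-in for one vendor's step;
# none of these corresponds to a real API from Zoom, Descript,
# Speechmatics, or Otter.
def clean_audio(path):
    """Pretend noise-reduction pass (e.g. what Descript does)."""
    return path.replace(".wav", ".clean.wav")

def transcribe(path):
    """Pretend jargon-aware speech-to-text, returning timed segments."""
    return [{"text": "hello", "start": 0.0}]

def label_speakers(segments):
    """Pretend diarization pass that tags each segment with a speaker."""
    return [dict(s, speaker="Speaker 1") for s in segments]

def run_pipeline(path):
    """Chain the steps: clean the audio, transcribe, label speakers."""
    return label_speakers(transcribe(clean_audio(path)))

print(run_pipeline("meeting.wav"))
# → [{'text': 'hello', 'start': 0.0, 'speaker': 'Speaker 1'}]
```

The hard part, of course, is not the chaining; it is that each real tool exposes (or withholds) its inputs and outputs differently, which is exactly what the export formats below are about.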
They all support various export options. The SRT and VTT subtitle formats are good for surfacing time codes and speaker names, which helps in video and audio editing tools. However, none of the transcription tools makes it easy to import subtitle files aligned back with the audio. And the subtitle formats don't look pretty in Word.
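To make that concrete: a WebVTT cue carries start and end timestamps, and speaker names can ride along in voice (`<v>`) tags. A minimal sketch of pulling both out with Python's standard library, using an invented two-line sample transcript:

```python
import re

# A tiny WebVTT sample (invented text); real exports vary by vendor.
vtt = """WEBVTT

00:00:01.000 --> 00:00:04.500
<v Alice>Totally rational, in my view.

00:00:04.500 --> 00:00:07.200
<v Bob>The LLMs disagree.
"""

# Match "start --> end" timing lines followed by a <v Speaker>text cue.
cue_re = re.compile(
    r"(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})\n"
    r"<v ([^>]+)>(.*)"
)

for start, end, speaker, text in cue_re.findall(vtt):
    print(f"[{start} - {end}] {speaker}: {text}")
```

This is the information a video editor can use directly; the pain point is that the transcription tools themselves rarely accept it back in.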
Some also output JSON formats that attach metadata to each word. This is handy for sentiment analysis, coding sections for UX research, or adding tags for AI training. However, JSON formats differ across vendors and require integration experience to support industry-specific workflows, proofing, or AI data labeling.
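The integration work usually amounts to normalizing each vendor's word-level schema into one shape your downstream tools understand. A sketch under invented assumptions (both vendor schemas here are made up, but the mismatch they illustrate, such as different key names and end-time versus duration, is typical):

```python
import json

# Two invented vendor payloads carrying the same word-level information.
vendor_a = '{"words": [{"w": "totally", "start": 1.0, "end": 1.5, "conf": 0.98}]}'
vendor_b = '{"tokens": [{"text": "totally", "ts": 1.0, "dur": 0.5, "confidence": 0.98}]}'

def normalize_a(raw):
    """Map vendor A's keys onto a common {word, start, end, confidence} record."""
    return [{"word": w["w"], "start": w["start"], "end": w["end"],
             "confidence": w["conf"]} for w in json.loads(raw)["words"]]

def normalize_b(raw):
    """Map vendor B's timestamp-plus-duration layout onto the same record."""
    return [{"word": t["text"], "start": t["ts"], "end": t["ts"] + t["dur"],
             "confidence": t["confidence"]} for t in json.loads(raw)["tokens"]]

# Once normalized, sentiment analysis, UX coding, or AI labeling
# only ever has to handle one format.
print(normalize_a(vendor_a) == normalize_b(vendor_b))  # → True
```

Writing one such adapter per vendor is exactly the kind of glue work a journalist, or a doctor, is unlikely to take on.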
On the one hand, I recognize there is not a big market for journalist-specific tools and associated workflows. Besides, a good transcription workflow is just the starting point for other aspects of the job: organizing notes, tracking provenance, distilling insights, and writing a helpful story.
However, I cannot stop thinking about the doctor and why he found his new transcription workflow so frustrating. How might he reimagine something that actually saves time and that he can trust? I am sure he will not be doing any transcription workflow API integration across tools anytime soon.
Vendors have also pursued different approaches with tradeoffs, prioritizing their unique strengths, user experience design opinions, and monetization strategies. They all have a financial interest in keeping customers on their platforms as much as possible, even when the workflow or user experience suffers a bit here or there.
But then maybe there are some bigger long-term opportunities. Good transcription is increasingly becoming a commodity. It is also an essential ingredient in a potentially much larger market that uses best-of-breed AI components to re-imagine the future of work.