You Can't Match Human Transcription

By Jason Torres Altman

Technology Not Proving It Can Match Human Transcription

Image: “Antü Plasma Suite” used courtesy of Wikimedia Commons thanks to Fabián Alexis. Source available at:

We wanted to share our experience using three technology-based methods to get free and easy transcription, along with our overall sense of whether the effort was worth it given the quality of what we received. We detail the three methods below, from worst to best, and give each a letter grade for our relative success. The headline above, however, is very much true: at this point, we still feel that the best value comes from using a transcription service to get an accurate transcript.

3. Using YouTube


The Basics: You’ll need to complete a small amount of digital gymnastics before YouTube can do the rest of the work for you. First, you’ll need to convert any audio-only files that you have to MP3. Then, you’ll need to use an online service to reduce the file’s size. Last, you’ll need to upload the audio to YouTube through a third-party tool, as YouTube doesn’t accept audio-only files in its application. Also, in order to transcribe interviews longer than fifteen minutes, you’ll need to increase your video length limit if you haven’t already done so.

Resources Used: The first time you use this method will probably take ten times as long as the next, as you work out all of the steps mentioned above. After that, it probably becomes pretty simple. However, all of these moving parts do provide more opportunities to hit a hurdle that costs more time. In addition, YouTube will need up to an hour, or possibly more, to fully transcribe your audio. Finally, you’ll need to remove the timecodes that YouTube automatically places in its transcriptions before copying the text to a Word file. You will also need to remove all of the “returns” that the transcription automatically places in your text. We used a text editor named Sublime Text and were able to do this in just a second using multiple cursors. Doing the same thing in Microsoft Word, however, could take a ton of time.
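
For anyone who would rather script the cleanup than use multiple cursors, here is a minimal Python sketch of the same two steps: stripping the timecodes and removing the “returns.” It assumes that each timecode sits at the start of a line and looks like 0:04, 12:34 or 1:02:03; we have not confirmed that this matches YouTube’s exact output, so adjust the pattern to what you actually see. The names clean_transcript and TIMECODE are our own, made up for illustration.

```python
import re

# Assumed timecode shape: M:SS, MM:SS, or H:MM:SS at the start of a line.
# This is a guess at the format, not a documented YouTube specification.
TIMECODE = re.compile(r"^\s*(?:\d{1,2}:)?\d{1,2}:\d{2}\s*")


def clean_transcript(raw: str) -> str:
    """Strip leading timecodes and join all lines into one paragraph."""
    pieces = []
    for line in raw.splitlines():
        # Remove a timecode at the start of the line, if there is one.
        text = TIMECODE.sub("", line).strip()
        # Lines that held only a timecode (or nothing) are dropped.
        if text:
            pieces.append(text)
    # Joining with spaces removes the "returns" between lines.
    return " ".join(pieces)


if __name__ == "__main__":
    # Hypothetical sample text, not taken from our focus group.
    sample = (
        "0:00\n"
        "welcome everyone to the focus group\n"
        "0:04\n"
        "let's start with introductions\n"
    )
    print(clean_transcript(sample))
```

One caveat with this approach: a spoken line that happens to begin with a time, such as “3:30 was when we met,” would lose that time as well, so a quick read-through afterward is still worthwhile.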

Results: Given the large number of preparation steps noted above, the resulting narrative was pretty disappointing. There were entire sections where the narrative needed to be rewritten by hand prior to coding, as there were no “anchor” type words for us to code off of. You certainly wouldn’t be able to use this technique if you had not also been present during the interview, allowing you to use your memory to make adjustments. We do think that this technique might have worked better in a one-on-one interview, especially one conducted over the phone, or in a quiet space with a good voice recorder. The number of voices present in a focus group and the tendency for people to talk over each other probably did not help YouTube’s technology do its thing.

Overall Grade: D (with future growth potential)

2. Using IBM's Watson


The Basics: You’ll need your computer to be turned on, with Watson working in the background, for about the length of your audio recording. You’ll use the upload button on the online menu to connect your audio to the transcriber. If you are in a public space, you’ll need to mute your machine, as the tool plays your audio while it transcribes.

Resources Used: Almost none, as it takes just a few moments to upload the file and to copy and paste the results into a Word document.

Results: We had particularly poor results using this method. We were attempting to transcribe a twenty-minute segment from a focus group, and it seemed that either the tool would “time out” or our WiFi would cycle off for a moment and the tool would quit. Our guess is that this method works best on shorter segments. Accuracy appeared to be better than with YouTube but worse than with Google Voice.

Overall Grade: C-

1. Using Google Voice


The Basics: You’ll need a quiet space, a pair of headphones, your audio recording, and an open, blank Google Doc. Using Google’s voice dictation in that Doc, you will repeat aloud the narrative that you hear in your headphones as you go along.

Resources Used: Everything is free, assuming that you already own headphones. However, it will take approximately 1.25 to 1.5 times the length of your recording to finish your transcription. The Google Document that you are working on ends up being your narrative document.

Results: We had trouble with the Google voice recorder turning off (or perhaps getting bored with us) during our speech. However, we were pleased with the relative accuracy of the effort. Edits still needed to be made to the narrative, but it was close enough to finish coding, and quotes could be constructed using context clues and memory.

Overall Grade: C

Bonus Tip

While trying to use YouTube without going over the fifteen-minute limit (before following the steps to expand that limit), we used iTunes to split our audio into more bite-sized portions. This might be the tip that we are most excited to pass on from our exploration of these three techniques. We were able to quickly, easily and successfully split our audio into three segments using these instructions.
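
Before splitting, it helps to know where the cuts should fall. The short Python sketch below does not split any audio itself (we used iTunes for that); it only works out start and stop times so that every segment stays at or under the fifteen-minute limit. The function name plan_segments and the forty-minute example length are assumptions made up for illustration, not details from our own recording.

```python
import math


def plan_segments(total_seconds: int, limit_seconds: int = 15 * 60):
    """Return (start, end) pairs in seconds, each no longer than the limit."""
    if total_seconds <= 0:
        return []
    # Fewest segments that keep every piece at or under the limit.
    count = math.ceil(total_seconds / limit_seconds)
    # Spread the total as evenly as possible across the segments.
    base, extra = divmod(total_seconds, count)
    segments = []
    start = 0
    for i in range(count):
        # The first `extra` segments absorb one leftover second each.
        length = base + (1 if i < extra else 0)
        segments.append((start, start + length))
        start += length
    return segments


if __name__ == "__main__":
    # Hypothetical forty-minute recording.
    for start, end in plan_segments(40 * 60):
        print(f"{start // 60}:{start % 60:02d} to {end // 60}:{end % 60:02d}")
```

Equal-length cuts are only a starting point. In practice you would nudge each cut to the nearest pause so that no one is split mid-sentence.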