how to transcribe an interview
Total time: about 5 minutes of hands-on work, plus processing. You upload the file; SHRP does the rest.
// step 1: prepare your file
~30 seconds
Find your interview recording on your device. SHRP accepts MP3, WAV, M4A, MP4, MOV, OGG, FLAC, and WebM files. If you recorded on your phone, it is probably M4A (iPhone) or MP3/OGG (Android). If you used a dedicated recorder like a Zoom H6 or Tascam, you likely have WAV files.
Check the file size: Starter plan handles up to 100MB, Pro up to 500MB. A one-hour interview as MP3 at 128kbps runs about 57MB. If your file is too large, convert it from WAV to FLAC or MP3 first using any free audio converter. You will not lose meaningful accuracy.
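The 57MB figure follows from bitrate multiplied by duration. Here is a minimal Python sketch of that arithmetic, assuming constant-bitrate audio and decimal megabytes. The function names and the plan table are illustrative; only the limits themselves come from the text above.

```python
# Upload limits in MB as stated above; the dict itself is illustrative.
PLAN_LIMITS_MB = {"starter": 100, "pro": 500}


def estimate_size_mb(bitrate_kbps: float, duration_s: float) -> float:
    """Approximate size of a constant-bitrate file in decimal megabytes."""
    kilobits = bitrate_kbps * duration_s
    return kilobits / 8 / 1000  # kilobits -> kilobytes -> megabytes


def fits_plan(size_mb: float, plan: str) -> bool:
    """True if the file is within the upload limit for the given plan."""
    return size_mb <= PLAN_LIMITS_MB[plan]


one_hour_mp3 = estimate_size_mb(128, 60 * 60)
print(f"{one_hour_mp3:.1f} MB")            # 57.6 MB
print(fits_plan(one_hour_mp3, "starter"))  # True
```

By the same arithmetic, an hour of uncompressed 44.1kHz, 16-bit stereo WAV (about 1411kbps) comes to roughly 635MB, which is over both limits. That is why converting to FLAC or MP3 is usually the first step for WAV recordings.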
If you have not recorded yet: use an external microphone, sit in a quiet room, and keep the mic roughly 6-12 inches from the speaker. These three things matter more than any software setting.
// step 2: upload to shrp
~1 minute
Open shrp.app/speech-to-text in your browser. You will see two modes: microphone (for live dictation) and file upload (for pre-recorded audio). Click the file upload tab.
Either drag your file onto the upload area or click to browse your filesystem. Upload speed depends on your connection and file size. A 57MB MP3 on a typical home connection takes 15-30 seconds. Once uploaded, SHRP automatically sends the file to AssemblyAI for processing. You do not need to click anything else.
You need a paid plan (starting at $7/month) for file upload transcription. Live microphone mode is free.
// step 3: wait for processing
Varies: roughly 1 minute per 10 minutes of audio
A 15-minute interview processes in about 90 seconds. A 60-minute interview takes around 4-6 minutes. A two-hour deep dive might take 12-15 minutes. You will see a progress indicator while the transcript is being generated.
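If you want to plan around the wait, the rule of thumb can be written as a one-line estimate. This is only a sketch of the "1 minute per 10 minutes" guideline; the function name is made up for illustration, and real processing times will vary with server load and audio quality.

```python
def estimate_processing_minutes(audio_minutes: float) -> float:
    """Rough processing time: about 1 minute per 10 minutes of audio."""
    return audio_minutes / 10


for length in (15, 60, 120):
    estimate = estimate_processing_minutes(length)
    print(f"{length} min of audio -> about {estimate:.1f} min of processing")
```

The results (1.5, 6 and 12 minutes) sit at or near the edges of the ranges quoted above, so treat them as ballpark figures rather than guarantees.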
During processing, AssemblyAI runs automatic speech recognition, applies speaker diarization to separate each speaker's voice, and calculates word-level confidence scores. Everything happens server-side. You can close the tab and come back later; your transcript will be waiting in your dashboard.
Accuracy on clean audio is typically around 98%. Background noise, heavy accents, or overlapping speakers will reduce this, but the confidence heatmap will show you exactly where to look.
// step 4: review the transcript
~2-5 minutes
When the transcript loads, you will see the full text with speaker labels ("Speaker A," "Speaker B") and a confidence heatmap. Words are colour-coded: white means the AI is confident, yellow means it is less sure, and red means you should verify the word manually.
Start by scanning just the yellow and red words. These are usually proper nouns, technical terms, or moments where speakers talked over each other. Fix those first. Then do a quick read-through of the full transcript. For a 30-minute interview, this review step takes about 2-3 minutes. For longer interviews, budget 5 minutes.
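If you work with exported word-level data, the same triage can be scripted. The sketch below is an assumption-heavy illustration: the cut-off values, the `text` and `confidence` field names, and the sample words are all hypothetical, because the thresholds SHRP uses for its colours are not stated here.

```python
# Hypothetical cut-offs; SHRP's actual colour thresholds are not documented here.
YELLOW_BELOW = 0.90
RED_BELOW = 0.60


def colour_for(confidence: float) -> str:
    """Map a confidence score in [0, 1] to a heatmap colour."""
    if confidence < RED_BELOW:
        return "red"
    if confidence < YELLOW_BELOW:
        return "yellow"
    return "white"


def words_to_review(words: list[dict]) -> list[tuple[int, str, str]]:
    """Return (position, text, colour) for every non-white word, red first."""
    flagged = [
        (i, w["text"], colour_for(w["confidence"]))
        for i, w in enumerate(words)
    ]
    flagged = [item for item in flagged if item[2] != "white"]
    # Red before yellow, keeping transcript order within each colour.
    return sorted(flagged, key=lambda item: (item[2] != "red", item[0]))


# Made-up sample data for illustration only.
sample = [
    {"text": "We", "confidence": 0.99},
    {"text": "met", "confidence": 0.97},
    {"text": "Dr.", "confidence": 0.82},
    {"text": "Okonkwo", "confidence": 0.41},
    {"text": "yesterday", "confidence": 0.95},
]
for position, text, colour in words_to_review(sample):
    print(position, text, colour)
```

Listing red words first is a design choice: those are the ones most likely to be wrong, so they are worth checking against the audio before anything else.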
You can also run AI extraction to pull out a summary, key quotes, action items, or structured notes. These tools work on the full transcript in one click.
// step 5: export or use
~30 seconds
Download your transcript in whichever format you need. TXT and DOCX for documents and articles. SRT and VTT for video subtitles (both include timestamps). JSON for developers who want word-level data and confidence scores. All formats include speaker labels.
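SRT and VTT are close cousins: SRT writes milliseconds after a comma, while WebVTT uses a dot and starts with a `WEBVTT` header line. The Python sketch below shows the difference. The cue dictionary layout and the "Speaker A: text" line format are assumptions for illustration, not a description of SHRP's actual export.

```python
def format_timestamp(seconds: float, fmt: str = "srt") -> str:
    """Format seconds as HH:MM:SS,mmm (SRT) or HH:MM:SS.mmm (WebVTT)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    sep = "," if fmt == "srt" else "."
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{sep}{ms:03d}"


def render_cues(cues: list[dict], fmt: str = "srt") -> str:
    """Render cues (start, end, speaker, text) as SRT or WebVTT text."""
    blocks = []
    for number, cue in enumerate(cues, start=1):
        start = format_timestamp(cue["start"], fmt)
        end = format_timestamp(cue["end"], fmt)
        line = f"{cue['speaker']}: {cue['text']}"
        blocks.append(f"{number}\n{start} --> {end}\n{line}")
    header = "WEBVTT\n\n" if fmt == "vtt" else ""
    return header + "\n\n".join(blocks) + "\n"


# Made-up cues for illustration only.
cues = [
    {"start": 0.0, "end": 2.4, "speaker": "Speaker A", "text": "Thanks for joining me."},
    {"start": 2.4, "end": 5.05, "speaker": "Speaker B", "text": "Happy to be here."},
]
print(render_cues(cues, "srt"))
print(render_cues(cues, "vtt"))
```

Most video editors and players accept either format, so the choice usually comes down to what your publishing platform asks for.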
You can also copy the transcript to your clipboard, save it to My Projects for later, or use the AI tools to reformat it for a specific purpose. Your transcript stays in your SHRP dashboard as long as your account is active.
// tips for better results
use an external mic
Built-in laptop mics pick up fan noise and typing. Even a $20 USB mic is a major upgrade.
quiet room
Close the window, turn off the AC. Background noise is the number one cause of transcription errors.
one speaker at a time
When people talk over each other, no transcription engine can separate the words reliably.
record at 44.1kHz or higher
Higher sample rates give the AI more data to work with. Most recorders default to this.
keep mic distance consistent
6-12 inches from the speaker. Too close causes plosives, too far loses clarity.
state names clearly
At the start of the interview, have each person say their name. This helps you match speaker labels later.
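To confirm that a recording meets the sample-rate tip before uploading, you can read the rate straight from the file header. This is a minimal sketch using Python's standard `wave` module, which handles uncompressed PCM WAV files only; the function names are illustrative.

```python
import wave

MIN_SAMPLE_RATE_HZ = 44_100  # the recommendation from the tip above


def wav_sample_rate(path: str) -> int:
    """Read the sample rate in Hz from a WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def meets_recommendation(path: str) -> bool:
    """True if the WAV file was recorded at 44.1kHz or higher."""
    return wav_sample_rate(path) >= MIN_SAMPLE_RATE_HZ
```

For example, `meets_recommendation("interview.wav")` (a placeholder filename) returns `True` for a 48kHz recording and `False` for a 16kHz one. Compressed formats such as MP3 or M4A need a different tool to inspect.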
// what about real-time transcription?
If you are conducting the interview live (in person or on a call) and want a transcript as you go, use SHRP's free voice typing tool instead of file upload. Open shrp.app/speech-to-text, allow microphone access, and start recording. Words appear in real-time as each person speaks.
Live microphone mode is completely free, runs in your browser, and your audio never leaves your device. It does not have speaker diarization (all speech appears as one stream), but you can use AI Enhance afterwards to clean up formatting and grammar.
For the best of both worlds, record the interview with a dedicated app and run live voice typing simultaneously. You get a rough draft in real-time and a polished, speaker-labeled transcript after uploading the recording.