What began as a simple question from our CIO, “We just want to know how the help desk is doing,” quickly grew into a transformative initiative. Originally scoped as a quality assurance solution for IT support, our work evolved into a scalable AI framework that analyzes calls across multiple departments, including behavioral health crisis lines. Within a matter of weeks, we developed a custom pipeline that converts raw audio into actionable insight, enabling teams to make informed decisions that directly affect patient care.
Our approach demonstrated that when equipped with the right tools and clinical context, artificial intelligence can extract meaning from unstructured conversations and deliver insights that traditional processes cannot. The end result was not just a reporting mechanism, but a shift in how quality is measured and managed at scale.
The Challenge of Scale in Healthcare Call Centers
Healthcare systems operate a wide range of call centers, including help desks, access lines, skilled nursing support, and health plan services. These generate thousands of calls every month, spanning interactions that range from brief password resets to hour-long crisis interventions. Traditionally, quality review has relied on manual auditing, where managers or QA staff review select calls or transcripts. Given the volume, this approach fails to capture patterns across teams or ensure consistency across shifts.
Prior to our intervention, the only available structured feedback came from customer surveys. This left significant gaps. There was no way to surface systemic issues or trends across hundreds of agents operating in a 24/7 environment. We needed a model that could process high volumes of data, identify real patterns in behavior, and deliver department-specific insights without placing additional burden on staff.
Technically, we started with encrypted WAV files and limited metadata. The content of the conversations, which held the true signal, was locked in unstructured audio that nobody had time or tools to review effectively.
Why Standard Sentiment Analysis Was Not Enough
Most sentiment analysis solutions available today rely on lexicon-based models such as VADER, which classify words as positive or negative based on predefined lists. While these models perform well on structured domains like movie reviews or e-commerce feedback, they fall short in healthcare. The conversations we handle are nonlinear, emotionally complex, and often require interpreting tone, cadence, and context.
We initially tested three approaches. Text-only sentiment models using VADER were ineffective: the lexicon missed most of the conversational vocabulary, and neutral or technical speech was routinely misclassified. Even after customizing the lexicon to include healthcare terms, the model could not distinguish a frustrated speaker from one who was simply speaking quickly.
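For readers curious what that customization looked like, the sketch below shows the general pattern, assuming the vaderSentiment package; the healthcare terms and their valence weights are illustrative, not our production lexicon.

```python
# Sketch of the VADER customization we attempted. Requires the
# vaderSentiment package; the added terms are illustrative only.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

# Extend the stock lexicon with help-desk/healthcare vocabulary.
# Values follow VADER's valence scale (roughly -4 to +4).
analyzer.lexicon.update({
    "escalate": -1.5,
    "outage": -2.0,
    "resolved": 1.8,
    "locked": -1.0,
})

transcript = "My account is locked again and nobody has resolved it."
print(analyzer.polarity_scores(transcript))
# -> {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}
```

Even with tuning of this kind, compound scores clustered near neutral on technical speech, which is what pushed us toward transformer-based models.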
We then explored a distilled BERT model trained on Facebook conversations. This seemed more adaptable to informal speech patterns, but it failed to align with human evaluation. Calls labeled as negative often contained no signs of distress or dissatisfaction. The model produced an even distribution of sentiment categories that did not match what our teams were hearing.
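A minimal version of that experiment, using the Hugging Face transformers pipeline, looked roughly like the following; the checkpoint named here is a stand-in, since we are not identifying the exact model we evaluated.

```python
# Sketch of the transformer baseline we evaluated, via the Hugging Face
# transformers pipeline API. The checkpoint below is a stand-in, not the
# exact model from our tests.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",  # stand-in checkpoint
)

for utterance in ["I've reset it twice and it still won't log in.",
                  "Okay, that worked, thank you."]:
    print(utterance, "->", classifier(utterance))
```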
GPT-based models, by contrast, consistently understood context and could separate speaker roles automatically. The results aligned closely with internal survey data and clinical reviews. GPT could distinguish a calm but urgent request from an emotionally charged conversation, something previous models failed to capture. In our IT support environment, the model categorized 88 percent of calls as good or neutral, which correlated well with a 98 percent customer satisfaction rating from post-call surveys.
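The sketch below illustrates the shape of a GPT-based evaluation using the OpenAI Python SDK; the model name, prompt wording, and output format are simplified assumptions, not our production configuration.

```python
# Illustrative sketch of a GPT-based call evaluation using the OpenAI
# Python SDK (v1 style). Model choice and prompt are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You analyze help-desk call transcripts. Separate the agent and caller "
    "turns, then rate the overall call as good, neutral, or poor, with a "
    "one-sentence justification."
)

def score_call(transcript: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model choice
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": transcript},
        ],
        temperature=0,
    )
    return response.choices[0].message.content

print(score_call("Agent: IT help desk. Caller: My VPN keeps dropping..."))
```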
Leveraging Audio Features for Deeper Insight
We expanded the analysis beyond text. Using Librosa, a Python library traditionally used in music analytics, we extracted audio features such as pitch variation, volume changes, and zero crossing rates. These features provided additional signals on tone and engagement. A high zero crossing rate, for example, suggested rapid back-and-forth exchange, which could indicate a productive dialogue or a disagreement. Pitch variation revealed levels of emotional intensity or monotony. None of this could be inferred reliably from transcription alone.
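The extraction itself is straightforward. The following is a minimal sketch of the features described above; the pitch bounds and summary statistics are illustrative choices, not a prescription.

```python
# Minimal sketch of the audio features described above, extracted with
# librosa. Pitch bounds and summary statistics are illustrative.
import numpy as np
import librosa

y, sr = librosa.load("call_chunk.wav", sr=None, mono=True)

# Zero crossing rate: a rough proxy for rapid back-and-forth speech.
zcr = librosa.feature.zero_crossing_rate(y)

# RMS energy: tracks volume changes across the call.
rms = librosa.feature.rms(y=y)

# Fundamental frequency via YIN: pitch variation signals emotional
# intensity or monotony. Bounds chosen to bracket typical speech.
f0 = librosa.yin(y, fmin=65.0, fmax=400.0, sr=sr)

features = {
    "zcr_mean": float(np.mean(zcr)),
    "rms_std": float(np.std(rms)),   # volume variability
    "pitch_std": float(np.std(f0)),  # pitch variation
}
print(features)
```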
By combining transcript-level analysis with these spectral features, we created a multi-layered sentiment engine that could assess not just what was said, but how it was said. This was especially important for clinical conversations where tone can indicate de-escalation, emotional fatigue, or urgency.
Aligning with Clinical Standards
To ensure credibility with frontline managers and clinicians, we followed a rigorous validation process. For behavioral health, we used an existing 10-question monitoring framework and refined the model’s prompt structure until it produced outputs consistent with how clinicians themselves would score the same calls.
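To give a sense of the shape, though not the content, of that prompt structure, a simplified template might look like the following; the two items shown are placeholders, as we are not reproducing the actual ten-question framework here.

```python
# Hypothetical shape of the department-specific evaluation prompt we
# iterated on. The two sample items are placeholders; the real framework
# has ten clinician-authored questions that we do not reproduce here.
BEHAVIORAL_HEALTH_PROMPT = """
You are evaluating a behavioral health crisis-line call against a
structured monitoring framework. For each item, answer yes/no/partial
and cite the supporting portion of the transcript.

1. Did the clinician assess immediate safety, including suicidal ideation?
2. Did the clinician develop a clear success plan with the caller?
...

Transcript:
{transcript}
"""
```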
We conducted multiple review cycles. In early iterations, GPT’s scoring was too lenient, particularly in cases involving suicidal ideation. Clinicians flagged these as requiring stricter adherence to protocol. We modified our prompts and re-tested. After several cycles, model evaluations reached consistent agreement with human raters.
In the final validation round, we created two unlabeled groups of clinician names: one flagged as high performers by the AI and one flagged as low performers. Managers correctly identified the groupings without knowing how the list had been generated. Their unanimous agreement with the AI output confirmed the model’s alignment with clinical expectations.
This level of transparency and iterative development stands in contrast to most vendor models, which treat sentiment analysis as a one-size-fits-all exercise. We reviewed one commercial product priced at three hundred thousand dollars per year that offered no customization and no alignment with healthcare use cases. Whether analyzing a retail call or a crisis intervention, their model applied the same generic classification logic. That approach was fundamentally incompatible with our needs.
Measurable Impact Across Teams
The platform has had a tangible operational impact. In IT support, we discovered that calls longer than ten minutes almost always led to lower satisfaction, regardless of resolution quality. This insight led to a new escalation policy: if an issue cannot be resolved within ten minutes, the agent now offers to investigate further and call back. Satisfaction scores improved as a result.
We also observed that calls with a five-minute wait time were rated more negatively, even when resolved quickly. In response, leadership established a goal of answering all calls within one minute.
Analysis revealed that Microsoft-related issues were responsible for the longest and most dissatisfying calls. This enabled the team to reallocate resources and adjust training to address these problems more effectively.
In behavioral health, the insights were even more critical. The model identified that clinicians routinely failed to develop clear success plans with callers, a deficiency that was not being caught during manual review. As a result, management instituted new training focused on success plan development. The model also proved capable of evaluating whether protocols were properly followed in suicidal ideation cases, offering a level of nuance that lexicon-based systems cannot match.
Infrastructure and Workflow
To support this solution, we developed a robust architecture in Azure. Encrypted audio files are uploaded to a secure data lake, where a random sample of four hundred calls per month is selected for analysis. This provides statistical confidence while keeping compute costs manageable.
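The sampling step itself is simple. Here is a sketch, assuming the azure-storage-blob SDK and a container layout invented for illustration:

```python
# Sketch of the monthly sampling step, assuming the azure-storage-blob
# SDK. The connection string and container/prefix layout are placeholders.
import random
from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(
    conn_str="<connection-string>",    # placeholder
    container_name="call-recordings",  # hypothetical container
)

# List this month's encrypted WAV uploads, then draw the random sample.
blobs = [b.name for b in container.list_blobs(name_starts_with="2024-06/")]
sample = random.sample(blobs, k=min(400, len(blobs)))
```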
Calls are chunked if longer than five minutes, transcribed using Whisper, and processed for audio features using Librosa. The transcribed text and audio metrics are then sent to GPT with department-specific prompts, so each team (IT, behavioral health, or member services) receives a tailored analysis aligned to its operational goals.
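Putting the pieces together, the per-call flow looks roughly like the sketch below; the helper functions refer back to the earlier sketches, the Whisper model size is a placeholder, and the five-minute chunking threshold mirrors the description above.

```python
# End-to-end sketch of the per-call flow described above. Chunking uses
# simple sample slicing; transcription uses the open-source whisper
# package. extract_features() is a hypothetical wrapper around the
# librosa sketch shown earlier, and score_call() is the GPT sketch.
import librosa
import soundfile as sf
import whisper

CHUNK_SECONDS = 5 * 60
model = whisper.load_model("base")  # model size is a placeholder

def process_call(path: str) -> list[dict]:
    y, sr = librosa.load(path, sr=None, mono=True)
    step = CHUNK_SECONDS * sr
    results = []
    for start in range(0, len(y), step):
        chunk = y[start:start + step]
        sf.write("chunk.wav", chunk, sr)  # Whisper reads from a file
        transcript = model.transcribe("chunk.wav")["text"]
        results.append({
            "transcript": transcript,
            "audio_features": extract_features(chunk, sr),  # hypothetical helper
            "gpt_review": score_call(transcript),           # see earlier sketch
        })
    return results
```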
We encountered technical challenges with encoding and storage formats, but once resolved, the workflow became stable and efficient. We are now in the process of fully automating the monthly analysis cycle.
Conclusion
We began with a narrow inquiry about help desk quality and ended up building a comprehensive, AI-powered evaluation platform that spans multiple departments. This initiative has allowed us to extract actionable insights from unstructured conversations, validate those findings with clinical experts, and deploy policy changes that improve both operational efficiency and patient care.
By investing in custom architecture and iterative validation, we avoided the limitations of commercial black-box systems. The result is a flexible and scalable solution that meets the complex needs of healthcare.
Sometimes the most valuable innovations emerge not from ambitious planning, but from taking a single question seriously and being willing to follow it wherever it leads.
Brian Jacobson
Brian Jacobson is an industry leader in analytics, data science, AI, and healthcare.