The job in question

We have a Norwegian audio recording of a conversation, and we need to confirm one specific detail: did person A ask person B if he wanted coffee? Sounds straightforward, right?

The plan was to

  • use pyannote-audio to capture timestamps whenever the speaker switched between A and B,
  • then use nb-whisper-large to transcribe those segments in Norwegian,
  • and finally use an LLM with Ollama to answer the question: yes or no.

Let’s get into it

After multiple tests, I found three main issues with this approach:

  1. NB-Whisper sometimes skips entire sentences.
  2. The analysis hit rate can be unreliable; the LLM occasionally answers “yes” incorrectly.
  3. LLMs often struggle with Norwegian grammar and sometimes mix in Swedish words.

The tests I did used a set of audio files and an “authority” file with the correct answers.
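The exact layout of that file is not important; in the script at the end of this post (coffee_evaluations.json) it simply maps each conversation file to the expected answer category (1-4, explained further down). Something along these lines, with illustrative file names and values:

{
  "conversation_001.txt": { "choice": 1 },
  "conversation_002.txt": { "choice": 3 },
  "conversation_003.txt": { "choice": 4 }
}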

I ran multiple tests on several models; the best eventually reached 7 correct matches out of 11 evaluations, which surpassed the other models:

Model: gemma3:27b
Total evaluations: 11
Correct matches: 7 (63.6%)
Partial matches: 4 (36.4%)
Wrong matches: 0 (0.0%)
Cannot determine: 0 (0.0%)

I also tried translating the text into English with “deepseek-r1:14b”. Comparing the consensus results for Norwegian vs. English:

  • Norwegian had “Correct matches: 1 (9.1%)”
  • English had “Correct matches: 8 (72.7%)”

=== Model Performance Statistics (EN) ===

Model: deepseek-r1:14b
Total evaluations: 11
Correct matches: 8 (72.7%)
Partial matches: 2 (18.2%)
Wrong matches: 0 (0.0%)
Cannot determine: 1 (9.1%)

Model: gemma3:27b
Total evaluations: 11
Correct matches: 9 (81.8%)
Partial matches: 2 (18.2%)
Wrong matches: 0 (0.0%)
Cannot determine: 0 (0.0%)

Model: gemma3:12b
Total evaluations: 11
Correct matches: 4 (36.4%)
Partial matches: 5 (45.5%)
Wrong matches: 0 (0.0%)
Cannot determine: 2 (18.2%)

Model: phi4:14b
Total evaluations: 11
Correct matches: 3 (27.3%)
Partial matches: 5 (45.5%)
Wrong matches: 1 (9.1%)
Cannot determine: 2 (18.2%)

=== Consensus Performance (EN) ===
Total evaluations: 11
Correct matches: 8 (72.7%)
Partial matches: 2 (18.2%)
Wrong matches: 0 (0.0%)
Cannot determine: 1 (9.1%)
High agreement cases: 4 (36.4%)

=== Model Performance Statistics (NO) ===

Model: deepseek-r1:14b
Total evaluations: 11
Correct matches: 0 (0.0%)
Partial matches: 6 (54.5%)
Wrong matches: 1 (9.1%)
Cannot determine: 4 (36.4%)

Model: gemma3:27b
Total evaluations: 11
Correct matches: 4 (36.4%)
Partial matches: 4 (36.4%)
Wrong matches: 0 (0.0%)
Cannot determine: 3 (27.3%)

Model: gemma3:12b
Total evaluations: 11
Correct matches: 2 (18.2%)
Partial matches: 2 (18.2%)
Wrong matches: 0 (0.0%)
Cannot determine: 7 (63.6%)

Model: phi4:14b
Total evaluations: 11
Correct matches: 2 (18.2%)
Partial matches: 4 (36.4%)
Wrong matches: 0 (0.0%)
Cannot determine: 5 (45.5%)

=== Consensus Performance (NO) ===
Total evaluations: 11
Correct matches: 1 (9.1%)
Partial matches: 4 (36.4%)
Wrong matches: 0 (0.0%)
Cannot determine: 6 (54.5%)
High agreement cases: 6 (54.5%)

pyannote-audio

Pyannote works great; it outputs as advertised:

start=0.2s stop=1.5s speaker_0
start=1.8s stop=3.9s speaker_1
start=4.2s stop=5.7s speaker_0
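
Getting that output only takes a few lines. A minimal sketch, assuming pyannote.audio 3.x, the gated pyannote/speaker-diarization-3.1 pipeline, and a Hugging Face access token:

from pyannote.audio import Pipeline

# Load the pretrained diarization pipeline (gated model, requires a Hugging Face token)
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="hf_..."  # your Hugging Face access token
)

# Run diarization and print one line per speaker turn
diarization = pipeline("recording.wav")
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"start={turn.start:.1f}s stop={turn.end:.1f}s {speaker}")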

Next, we load the Whisper model.

nb-whisper-large

nb-whisper-large is based on OpenAI’s Whisper and is a Norwegian model for transcribing speech to text.

Using something similar to the example:

from transformers import pipeline

# Load the model
asr = pipeline("automatic-speech-recognition", "NbAiLabBeta/nb-whisper-large")

# Transcribe
asr("king.mp3", generate_kwargs={'task': 'transcribe', 'language': 'no'})

It outputs the text as one large string, but if we add “return_timestamps=True” we get timestamped chunks that can be mapped onto pyannote-audio’s speaker segments. The result looks like this:

A: Hallo, har du en bra dag?
B: Ja, jeg har en veldig bra dag.

(In English: A: Hello, are you having a good day? B: Yes, I’m having a very good day.)
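
A minimal sketch of that mapping, assuming the pyannote turns have been collected into (start, stop, speaker) tuples and using the “chunks” list that the transformers pipeline returns when return_timestamps=True:

# chunks: asr(...)["chunks"], each entry like {'timestamp': (0.2, 1.5), 'text': 'Hallo ...'}
# turns:  [(0.2, 1.5, "A"), (1.8, 3.9, "B"), ...] collected from pyannote-audio
def label_speakers(chunks, turns):
    lines = []
    for chunk in chunks:
        start, stop = chunk["timestamp"]
        midpoint = (start + (stop if stop is not None else start)) / 2
        # Assign the chunk to whichever speaker turn contains its midpoint
        speaker = next((s for (t0, t1, s) in turns if t0 <= midpoint <= t1), "?")
        lines.append(f"{speaker}: {chunk['text'].strip()}")
    return "\n".join(lines)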

LLMs

The initial testing was done by creating a new model called “judge_llama3.1:8b”. You can do that by calling the Ollama API /api/create:

curl http://localhost:11434/api/create -d '{
  "model": "judge_llama3.1:8b",
  "from": "llama3.1:8b",
  "system": "Du er en svært dyktig ekspert som spesialiserer deg på å vurdere samtaler. Gjennom over to tiår med erfaring er du i stand til å oppsummere, evaluere og tolke slike samtaler grundig. Målet ditt er å returnere et rent JSON-svar på tydelig, konsis og lett forståelig norsk, der du spesielt sjekker om setningen Har du lyst på kaffe? har blitt uttalt under samtalen. Din analyse skal være grundig."
}'

But the tests did not show that using the judge model was any better; in some cases it actually performed worse. (The Norwegian system prompt above, roughly translated, casts the model as an expert conversation evaluator that returns a clean JSON answer in clear, concise Norwegian and specifically checks whether the sentence “Har du lyst på kaffe?” (“Do you want coffee?”) was uttered.)

The prompt for the model must be simple and to the point, specifying exactly how to respond. For example:

prompt_template = """You are a JSON generator. Your task is to evaluate if person A asked person B about coffee.

Question: Did person A ask person B if they wanted coffee?

Categories and scoring rules:
1 (90-100): Clear coffee question if:
   - Person A directly asks B about wanting coffee
   - The question is explicit and clear

2 (60-89): Implied coffee question if:
   - Person A indirectly mentions coffee to B
   - The offer is made but not as a direct question
   - Coffee is discussed but the question is ambiguous

3 (30-59): No coffee question if:
   - Coffee is mentioned but not as an offer
   - No one asks about coffee
   - Wrong person asks the question

4 (0-29): Cannot determine if:
   - Conversation is incomplete or unclear
   - There is no mention of coffee
   - Cannot identify speakers clearly

Conversation:
{text}

Return ONLY this JSON format (no other text):
{{
    "coffee_question": 0-100
}}

STRICT RULES:
1. DO NOT write any explanations
2. DO NOT write any thoughts or reasoning
3. DO NOT use markdown or code blocks
4. DO NOT use quotes around numbers
5. DO NOT write anything before or after the JSON object"""

What makes this all possible is Ollama’s structured output support: passing a JSON schema in the request’s “format” field makes the model return a JSON object that conforms to it.

curl -X POST http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{
  "model": "model1:14b",
  "prompt": "You are a JSON generator. Your task is to evaluate if person A asked person B about coffee...",
  "stream": false,
  "temperature": 0.1,
  "format": {
    "type": "object",
    "properties": {
      "coffee_question": {
        "type": "integer",
        "description": "Score from 0-100 indicating if person A asked person B about coffee",
        "minimum": 0,
        "maximum": 100
      },
      "category": {
        "type": "integer",
        "description": "Category number (1-4) based on the score",
        "enum": [1, 2, 3, 4]
      },
      "reasoning": {
        "type": "string",
        "description": "Brief explanation of the score (optional)"
      }
    },
    "required": ["coffee_question", "category"]
  },
  "system": "You are a JSON generator. Your task is to analyze conversations and return structured JSON data according to the specified format."
}'
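
On the Python side, the structured answer comes back as a JSON string inside the “response” field of Ollama’s reply, so parsing it is only a couple of lines (a sketch, assuming the request body above is stored in a dict called payload):

import json
import requests

# Send the same request as the curl example above and parse the structured answer
reply = requests.post("http://localhost:11434/api/generate", json=payload)
reply.raise_for_status()
answer = json.loads(reply.json()["response"])  # e.g. {"coffee_question": 95, "category": 1}
print(answer["coffee_question"], answer["category"])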

Script

For reference, here is the full analysis script:

"""
Multi-Model Conversation Analysis System

This script implements a system for analyzing conversations using multiple language models
to determine if one speaker (A) asked another speaker (B) about coffee. It includes:

1. Conversation Translation:
   - Translates conversations between languages for consistent analysis
   - Caches translations to avoid redundant API calls

2. Multi-Model Analysis:
   - Uses multiple language models to analyze the same conversation
   - Implements a scoring system (0-100) with 4 categories
   - Calculates consensus between different models

3. Performance Evaluation:
   - Compares model outputs against ground truth
   - Generates detailed performance statistics
   - Creates human-readable reports

Requirements:
    - Python 3.6+
    - Ollama API running locally
    - Required packages: requests, tqdm (json, glob, os, and time are standard library)

Usage:
    Run the script directly:
    $ python analyze_with_authority.py

    The script expects:
    - Conversation files in ./texts/conversation_*.txt
    - Ground truth in ./texts/coffee_evaluations.json
    
    It will generate:
    - Translation cache in conversation_translations.json
    - Analysis results in coffee_analysis_results.json
    - Human-readable report in coffee_analysis_report.txt
"""

import json
import glob
import os
import requests
import time
from tqdm import tqdm

def translate_conversation(text, target_language):
    """
    Translate conversation text using the Ollama API.
    
    This function handles translation while preserving conversation structure,
    especially the speaker labels (A:, B:, etc.).
    
    Args:
        text (str): The conversation text to translate
        target_language (str): Target language code ('en' or 'other')
    
    Returns:
        str: Translated text, or None if translation fails
        
    Example conversation format:
        A: Hello, how are you?
        B: I'm good, thanks.
    """
    url = "http://localhost:11434/api/generate"
    
    # Define translation prompts for different languages
    system_prompts = {
        "en": "You are a professional translator. Translate the conversation from any language to English. Keep the same format and structure. Maintain speaker labels (A:, B:, etc).",
        "other": "You are a professional translator. Translate the conversation from English to the target language. Keep the same format and structure. Maintain speaker labels (A:, B:, etc)."
    }
    
    user_prompts = {
        "en": "Translate this conversation to English:\n\n{text}",
        "other": "Translate this conversation to the target language:\n\n{text}"
    }
    
    # Configure API request
    payload = {
        "model": "model1:14b",  # Using a general language model for translation
        "prompt": user_prompts["en" if target_language == "en" else "other"].format(text=text),
        "system": system_prompts["en" if target_language == "en" else "other"],
        "stream": False,
        "temperature": 0.1  # Low temperature for more consistent translations
    }
    
    try:
        response = requests.post(url, json=payload)
        response.raise_for_status()
        result = response.json()
        if 'response' in result:
            translated_text = result['response'].strip()
            return translated_text
    except Exception as e:
        print(f"Translation error: {e}")
        return None

def analyze_transcription(text, model_name, language="en"):
    """
    Analyze conversation to determine if person A asked person B about coffee.
    
    This function uses a language model to analyze the conversation and score
    the likelihood that person A asked person B about coffee. The scoring is
    based on specific criteria and categories.
    
    Scoring Categories:
    1. Clear coffee question (90-100):
       - Direct question about wanting coffee
       - Explicit and clear intent
    
    2. Implied coffee question (60-89):
       - Indirect mention of coffee
       - Ambiguous but present question
    
    3. No coffee question (30-59):
       - Coffee mentioned but not as question
       - Wrong person asking
    
    4. Cannot determine (0-29):
       - Unclear or incomplete
       - No mention of coffee
    
    Args:
        text (str): The conversation text to analyze
        model_name (str): Name of the model to use for analysis
        language (str): Language of analysis ('en' or 'other')
    
    Returns:
        dict: Analysis result with score, or None if analysis fails
    """
    url = "http://localhost:11434/api/generate"
    
    # Define analysis prompts for different languages
    prompt_template = """You are a JSON generator. Your task is to evaluate if person A asked person B about coffee.

Question: Did person A ask person B if they wanted coffee?

Categories and scoring rules:
1 (90-100): Clear coffee question if:
   - Person A directly asks B about wanting coffee
   - The question is explicit and clear

2 (60-89): Implied coffee question if:
   - Person A indirectly mentions coffee to B
   - The offer is made but not as a direct question
   - Coffee is discussed but the question is ambiguous

3 (30-59): No coffee question if:
   - Coffee is mentioned but not as an offer
   - No one asks about coffee
   - Wrong person asks the question

4 (0-29): Cannot determine if:
   - Conversation is incomplete or unclear
   - There is no mention of coffee
   - Cannot identify speakers clearly

Conversation:
{text}

Return ONLY this JSON format (no other text):
{{
    "coffee_question": 0-100
}}

STRICT RULES:
1. DO NOT write any explanations
2. DO NOT write any thoughts or reasoning
3. DO NOT use markdown or code blocks
4. DO NOT use quotes around numbers
5. DO NOT write anything before or after the JSON object"""

    prompts = {
        "en": prompt_template,
        "other": prompt_template
    }

    # Configure API request
    payload = {
        "model": model_name,
        "keep_alive": 10,
        "prompt": prompts[language].format(text=text),
        "stream": False,
        "temperature": 0.1,  # Low temperature for consistent analysis
        "system": "You are a JSON generator. Your only task is to generate JSON data. Do not write any explanations, thoughts, or other text. Return only the JSON object specified in the prompt."
    }
    
    try:
        # Make API request
        response = requests.post(url, json=payload)
        response.raise_for_status()
        result = response.json()
        
        if 'response' in result:
            response_text = result['response'].strip()
            if not response_text:
                return None
                
            try:
                # Clean and parse the response
                response_text = response_text.replace("```json", "").replace("```", "").strip()
                # Extract JSON object
                response_text = response_text[response_text.find("{"):response_text.rfind("}")+1]
                verification_data = json.loads(response_text)
                
                # Extract and validate score
                score = verification_data.get('coffee_question', 0)
                if not isinstance(score, (int, float)) or score < 0 or score > 100:
                    return None
                    
                return {'score': score}
            except json.JSONDecodeError:
                return None
                
        return None
    except (requests.exceptions.RequestException, json.JSONDecodeError):
        return None

def convert_score_to_category(score):
    """
    Convert numerical score to analysis category.
    
    Categories:
    1: Clear coffee question (90-100)
    2: Implied coffee question (60-89)
    3: No coffee question (30-59)
    4: Cannot determine (0-29)
    
    Args:
        score (float): Numerical score from 0 to 100
    
    Returns:
        int: Category number from 1 to 4
    """
    if score >= 90:
        return 1  # Clear coffee question
    elif score >= 60:
        return 2  # Implied coffee question
    elif score >= 30:
        return 3  # No coffee question
    else:
        return 4  # Cannot determine

def calculate_consensus(model_results):
    """
    Calculate consensus from multiple model results.
    
    This function combines results from multiple models to determine:
    1. Average score across all models
    2. Most common category
    3. Agreement percentage between models
    
    Args:
        model_results (dict): Dictionary of model results, each containing score and category
    
    Returns:
        dict: Consensus results with score, category, and agreement percentage
        None: If no valid results to analyze
    """
    if not model_results:
        return None
    
    # Extract scores and categories from all models
    scores = []
    categories = []
    for model_data in model_results.values():
        scores.append(model_data['score'])
        categories.append(model_data['category'])
    
    # Calculate average score
    avg_score = sum(scores) / len(scores)
    
    # Find most common category
    category_counts = {}
    for cat in categories:
        category_counts[cat] = category_counts.get(cat, 0) + 1
    
    consensus_category = max(category_counts.items(), key=lambda x: x[1])[0]
    agreement_percentage = (category_counts[consensus_category] / len(categories)) * 100
    
    return {
        'score': round(avg_score, 1),
        'category': consensus_category,
        'agreement_percentage': round(agreement_percentage, 1)
    }

def analyze_model_performance():
    """
    Main function to analyze model performance against ground truth.
    
    This function:
    1. Loads conversation files and ground truth
    2. Translates conversations if needed
    3. Runs analysis with multiple models
    4. Calculates consensus and agreement
    5. Generates performance reports
    
    The analysis is done in both the original language and English,
    allowing for comparison of model performance across languages.
    """
    print("\n=== Coffee Question Analysis ===")
    
    # Load ground truth evaluations
    try:
        with open(os.path.join("texts", "coffee_evaluations.json"), "r", encoding="utf-8") as f:
            ground_truth = json.load(f)
    except Exception as e:
        print(f"Error loading ground truth: {e}")
        return
    
    # Define models to use for analysis
    MODELS = [
        "model1:14b",
        "model2:27b",
        "model3:12b",
        "model4:14b"
    ]
    
    # Initialize results storage
    results = {
        'en': {},      # Results for English analysis
        'other': {}    # Results for original language
    }
    
    # Initialize translation cache
    translations_cache = {
        'en': {},      # English translations
        'other': {}    # Original texts
    }
    
    # Get all conversation files
    conversation_files = sorted(glob.glob(os.path.join("texts", "conversation_*.txt")))
    
    # First phase: Translate all conversations
    print("\nTranslating conversations...")
    for file in tqdm(conversation_files, desc="Translating", unit="conv"):
        file_name = os.path.basename(file)
        if file_name not in ground_truth:
            continue
            
        # Read and cache original text
        with open(file, "r", encoding="utf-8") as f:
            original_text = f.read()
        
        # Translate and cache results
        translations_cache['en'][file_name] = translate_conversation(original_text, "en")
        translations_cache['other'][file_name] = original_text
        
        time.sleep(1)  # Rate limiting
    
    # Second phase: Process with each model
    for model_name in MODELS:
        print(f"\nProcessing model: {model_name}")
        
        # Process each language version
        for language in ['en', 'other']:
            print(f"\nLanguage: {language.upper()}")
            pbar = tqdm(total=len(conversation_files), desc=f"Analyzing conversations", unit="conv")
            
            # Analyze each conversation
            for file in conversation_files:
                file_name = os.path.basename(file)
                if file_name not in ground_truth:
                    pbar.update(1)
                    continue
                    
                # Initialize results structure if needed
                if file_name not in results[language]:
                    results[language][file_name] = {
                        'ground_truth': ground_truth[file_name]['choice'],
                        'model_results': {},
                        'consensus': None
                    }
                
                # Get cached translation
                conversation_text = translations_cache[language][file_name]
                if conversation_text is None:
                    pbar.update(1)
                    continue
                
                # Run analysis
                result = analyze_transcription(conversation_text, model_name, language)
                
                if result is not None:
                    score = result['score']
                    category = convert_score_to_category(score)
                    results[language][file_name]['model_results'][model_name] = {
                        'score': score,
                        'category': category
                    }
                
                pbar.update(1)
                time.sleep(1)  # Rate limiting
            
            pbar.close()
    
    # Save translations for reference
    translations_file = os.path.join("texts", "conversation_translations.json")
    try:
        with open(translations_file, "w", encoding="utf-8") as f:
            json.dump(translations_cache, f, indent=2, ensure_ascii=False)
        print(f"\nTranslations saved to {translations_file}")
    except Exception as e:
        print(f"Error saving translations: {e}")

    # Calculate consensus for all results
    for language in results:
        for file_name, data in results[language].items():
            consensus = calculate_consensus(data['model_results'])
            if consensus:
                data['consensus'] = consensus
    
    # Save detailed results
    output_file = os.path.join("texts", "coffee_analysis_results.json")
    try:
        with open(output_file, "w", encoding="utf-8") as f:
            json.dump(results, f, indent=2, ensure_ascii=False)
        print(f"\nDetailed results saved to {output_file}")
    except Exception as e:
        print(f"Error saving results: {e}")
    
    # Generate reports
    generate_text_report(results)
    
    # Print performance statistics
    for language in ['en', 'other']:
        print(f"\n=== Model Performance Statistics ({language.upper()}) ===")
        
        # Calculate per-model statistics
        for model_name in MODELS:
            total = 0
            correct = 0
            partial = 0
            wrong = 0
            cannot_determine = 0
            
            for file_name, data in results[language].items():
                if model_name in data['model_results']:
                    total += 1
                    model_category = data['model_results'][model_name]['category']
                    ground_truth_category = data['ground_truth']
                    
                    if model_category == ground_truth_category:
                        correct += 1
                    elif abs(model_category - ground_truth_category) == 1:
                        partial += 1
                    elif model_category == 4:
                        cannot_determine += 1
                    else:
                        wrong += 1
            
            if total > 0:
                print(f"\nModel: {model_name}")
                print(f"Total evaluations: {total}")
                print(f"Correct matches: {correct} ({correct/total*100:.1f}%)")
                print(f"Partial matches: {partial} ({partial/total*100:.1f}%)")
                print f"Wrong matches: {wrong} ({wrong/total*100:.1f}%)")
                print f"Cannot determine: {cannot_determine} ({cannot_determine/total*100:.1f}%)")
        
        # Calculate consensus performance
        print f"\n=== Consensus Performance ({language.upper()}) ===")
        total = 0
        correct = 0
        partial = 0
        wrong = 0
        cannot_determine = 0
        high_agreement = 0  # Cases with >75% agreement
        
        for file_name, data in results[language].items():
            if data['consensus']:
                total += 1
                consensus_category = data['consensus']['category']
                ground_truth_category = data['ground_truth']
                
                if consensus_category == ground_truth_category:
                    correct += 1
                elif abs(consensus_category - ground_truth_category) == 1:
                    partial += 1
                elif consensus_category == 4:
                    cannot_determine += 1
                else:
                    wrong += 1
                
                if data['consensus']['agreement_percentage'] >= 75:
                    high_agreement += 1
        
        if total > 0:
            print f"Total evaluations: {total}"
            print f"Correct matches: {correct} ({correct/total*100:.1f}%)"
            print f"Partial matches: {partial} ({partial/total*100:.1f}%)"
            print f"Wrong matches: {wrong} ({wrong/total*100:.1f}%)"
            print f"Cannot determine: {cannot_determine} ({cannot_determine/total*100:.1f}%)"
            print f"High agreement cases: {high_agreement} ({high_agreement/total*100:.1f}%)"

def generate_text_report(results):
    """
    Generate a human-readable text report of the analysis results.
    
    This function creates a detailed report including:
    - Results for each language
    - Individual conversation analysis
    - Model-specific results
    - Consensus information
    
    Args:
        results (dict): Complete analysis results dictionary
    
    The report is saved to coffee_analysis_report.txt
    """
    report = "=== Coffee Question Analysis Report ===\n\n"
    
    for language in ['en', 'other']:
        report += f"=== {language.upper()} Analysis ===\n\n"
        
        # Process each conversation in sorted order
        conversation_files = sorted(results[language].keys())
        
        for file_name in conversation_files:
            data = results[language][file_name]
            report += f"Conversation: {file_name}\n"
            report += f"Ground Truth: {data['ground_truth']}\n"
            
            # Add consensus information
            if data['consensus']:
                report += f"\nConsensus:\n"
                report += f"  Category: {data['consensus']['category']}\n"
                report += f"  Score: {data['consensus']['score']}\n"
                report += f"  Agreement: {data['consensus']['agreement_percentage']}%\n"
            
            # Add individual model results
            report += "\nModel Results:\n"
            for model_name in sorted(data['model_results'].keys()):
                model_data = data['model_results'][model_name]
                report += f"  {model_name}: Category {model_data['category']} (score: {model_data['score']})\n"
            
            report += "\n" + "-"*50 + "\n\n"
    
    # Save the report
    output_file = os.path.join("texts", "coffee_analysis_report.txt")
    try:
        with open(output_file, "w", encoding="utf-8") as f:
            f.write(report)
        print f"Text report saved to {output_file}"
    except Exception as e:
        print f"Error saving text report: {e}"

if __name__ == "__main__":
    analyze_model_performance()