How to extract structured data from transcripts with Python

The real-world scenario

Imagine you are a **Technical Project Manager** or **DevOps Lead** juggling six hours of recorded meetings every day. You have the transcripts, but your project tracker needs structured data: who is doing what, by when, and what was decided. Manually combing a **5,000-word transcript** for three action items is like hunting for a needle in a haystack that keeps growing. This script acts as a digital filter, converting messy human conversation into a **clean JSON object** that you can pipe directly into **Jira**, **Trello**, or a **SQL database**.

The solution

We leverage the **OpenAI API** combined with **Pydantic**, a data validation library. By using **Structured Outputs**, we force the LLM to adhere to a specific schema. This eliminates the common problem of the AI returning conversational fluff or incorrectly formatted text. We use **pathlib** for robust file handling to ensure the script works across **Windows**, **macOS**, and **Linux** without path errors.

Prerequisites

  • Python 3.10 or higher installed
  • An **OpenAI API Key**
  • Install the required libraries: pip install openai pydantic python-dotenv
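The script reads your key from the environment. One common pattern (an assumption here, not a requirement of the script) is to keep it in a `.env` file next to the script, which **python-dotenv** can load at startup:

```shell
# .env — keep this file out of version control (add it to .gitignore)
OPENAI_API_KEY=sk-your-key-here
```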

The code


"""
-----------------------------------------------------------------------
Authors: Sharanam & Vaishali Shah
Recipe: AI Transcript Structurer
Intent: Extract validated JSON action items from raw meeting text.
-----------------------------------------------------------------------
"""
import json
from pathlib import Path
from typing import List

from dotenv import load_dotenv
from openai import OpenAI
from pydantic import BaseModel

# Load OPENAI_API_KEY from a local .env file, if one exists
load_dotenv()

# Define the data structure for validation
class ActionItem(BaseModel):
    task: str
    assignee: str
    priority: str

class MeetingSummary(BaseModel):
    decisions: List[str]
    action_items: List[ActionItem]
    sentiment: str

def extract_insights(file_path: str):
    # Initialize the client (ensure OPENAI_API_KEY is in your environment)
    client = OpenAI()
    
    # Read the transcript using pathlib
    target_file = Path(file_path)
    if not target_file.exists():
        print(f"Error: {target_file} not found.")
        return

    raw_text = target_file.read_text(encoding='utf-8')

    # Call the LLM with structured output response format
    completion = client.beta.chat.completions.parse(
        model='gpt-4o-2024-08-06',
        messages=[
            {'role': 'system', 'content': 'Extract structured insights from the meeting transcript.'},
            {'role': 'user', 'content': raw_text},
        ],
        response_format=MeetingSummary,
    )

    # Parse the validated response
    insights = completion.choices[0].message.parsed
    
    # Save the output as JSON
    output_file = target_file.with_suffix('.json')
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(insights.model_dump(), f, indent=4)
    
    print(f"Success! Insights saved to {output_file}")
    return insights

if __name__ == '__main__':
    # Create a dummy transcript for demonstration
    temp_file = Path('transcript.txt')
    temp_file.write_text(
        'Dave: We need to fix the login bug by Friday. Sarah, can you handle that? '
        'Sarah: Sure. We also decided to switch to Postgres for the database.',
        encoding='utf-8',
    )
    
    # Run the extraction
    extract_insights('transcript.txt')

Code walkthrough

The script begins by defining a **Pydantic schema**. The **ActionItem** and **MeetingSummary** classes tell Python exactly what fields to expect from the AI. This is the **source of truth** for your data. We then use **Path** from the **pathlib** module to handle the file input, which is more reliable than the older **os.path** approach.
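To see why the schema is the source of truth, you can exercise it directly: feeding Pydantic a payload that violates the schema raises a `ValidationError` immediately. This is a standalone sketch, independent of the API call, using the same classes as the script:

```python
from typing import List

from pydantic import BaseModel, ValidationError

class ActionItem(BaseModel):
    task: str
    assignee: str
    priority: str

class MeetingSummary(BaseModel):
    decisions: List[str]
    action_items: List[ActionItem]
    sentiment: str

# A well-formed payload validates into fully typed objects
good = MeetingSummary.model_validate({
    'decisions': ['Switch to Postgres'],
    'action_items': [
        {'task': 'Fix login bug', 'assignee': 'Sarah', 'priority': 'High'}
    ],
    'sentiment': 'Productive',
})
print(good.action_items[0].assignee)  # Sarah

# A payload missing a required field is rejected outright
try:
    MeetingSummary.model_validate({'decisions': [], 'action_items': []})
except ValidationError as exc:
    print('rejected:', exc.error_count(), 'error(s)')
```

The same validation runs inside the SDK when it parses the API response, which is what lets you trust the fields downstream.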

The core logic happens in the **client.beta.chat.completions.parse** method. By passing our **MeetingSummary** class as the **response_format** parameter, the SDK converts the model into a JSON Schema and the **OpenAI API** constrains generation to that schema, so the response matches our fields instead of arriving as conversational fluff or **broken JSON**. If the response still cannot be parsed into the model, the SDK raises an error rather than handing you malformed data. Finally, we use **model_dump** to convert the Python object back into a standard dictionary and save it as a **formatted JSON file**.
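The `model_dump` step matters because `json.dump` cannot serialize Pydantic objects directly; it recursively converts nested models into plain dicts and lists. A small sketch using the same `ActionItem` shape:

```python
import json

from pydantic import BaseModel

class ActionItem(BaseModel):
    task: str
    assignee: str
    priority: str

item = ActionItem(task='Fix login bug', assignee='Sarah', priority='High')

# model_dump yields a plain dict that json.dumps can handle
as_dict = item.model_dump()
print(json.dumps(as_dict, indent=4))

# Pydantic can also serialize straight to a JSON string,
# skipping the json module entirely
print(item.model_dump_json(indent=4))
```

Either route produces identical JSON; the script uses `model_dump` plus `json.dump` so the dictionary is available for any extra processing before it hits disk.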

Sample output

When you run the script with the sample transcript, the terminal will display a success message, and a new file named **transcript.json** will appear in your directory with this content:


{
    "decisions": [
        "Switch to Postgres for the database"
    ],
    "action_items": [
        {
            "task": "Fix the login bug",
            "assignee": "Sarah",
            "priority": "High"
        }
    ],
    "sentiment": "Productive"
}

Conclusion

Automating meeting summaries is no longer about simple keyword matching. By combining **Python types** with **LLM intelligence**, you create a robust pipeline that understands context and enforces data integrity. This script provides a solid foundation for building larger automation tools, such as auto-filling **Jira tickets** or sending **Slack alerts** based on meeting outcomes. You can now process hundreds of transcripts in seconds, ensuring that no critical decision ever falls through the cracks again.


🚀 Don’t Just Learn AI & LLMs — Master It.

This tutorial was just the tip of the iceberg. To truly advance your career and build professional-grade systems, you need the full architectural blueprint.

My book, Large Language Models Crash Course, takes you from “making it work” to “making it scale.” I cover advanced patterns, real-world case studies, and the industry best practices that senior engineers use daily.


📖 Grab Your Copy Now →