How to extract structured data from transcripts with Python

The real-world scenario

Imagine you are a **Technical Project Manager** or **DevOps Lead** juggling six hours of recorded meetings every day. You have the transcripts, but your project tracker needs structured data: who is doing what, by when, and what was decided. Manually combing a **5,000-word transcript** for three action items is like hunting for a needle in a haystack that keeps growing. This script acts as a digital filter, converting messy human conversation into a **clean JSON object** that you can pipe directly into **Jira**, **Trello**, or a **SQL database**.

The solution

We leverage the **OpenAI API** combined with **Pydantic**, a data validation library. By using **Structured Outputs**, we force the LLM to adhere to a specific schema. This eliminates the common problem of the AI returning conversational fluff or incorrectly formatted text. We use **pathlib** for robust file handling to ensure the script works across **Windows**, **macOS**, and **Linux** without path errors.

Prerequisites

  • Python 3.10 or higher installed
  • An **OpenAI API Key**
  • Install the required libraries: pip install openai pydantic python-dotenv
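The script reads your key from the environment. One common pattern (an assumption here, not a requirement of the script) is to keep it in a `.env` file next to the script, which **python-dotenv** can load at startup:

```shell
# .env — keep this file out of version control (add it to .gitignore)
OPENAI_API_KEY=sk-your-key-here
```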

The code


"""
-----------------------------------------------------------------------
Authors: Sharanam & Vaishali Shah
Recipe: AI Transcript Structurer
Intent: Extract validated JSON action items from raw meeting text.
-----------------------------------------------------------------------
"""
import json
from pathlib import Path
from typing import List

from dotenv import load_dotenv
from openai import OpenAI
from pydantic import BaseModel

# Load OPENAI_API_KEY from a local .env file, if one exists
load_dotenv()

# Define the data structure for validation
class ActionItem(BaseModel):
    task: str
    assignee: str
    priority: str

class MeetingSummary(BaseModel):
    decisions: List[str]
    action_items: List[ActionItem]
    sentiment: str

def extract_insights(file_path: str):
    # Initialize the client (ensure OPENAI_API_KEY is in your environment)
    client = OpenAI()
    
    # Read the transcript using pathlib
    target_file = Path(file_path)
    if not target_file.exists():
        print(f"Error: {target_file} not found.")
        return

    raw_text = target_file.read_text(encoding='utf-8')

    # Call the LLM with structured output response format
    completion = client.beta.chat.completions.parse(
        model='gpt-4o-2024-08-06',
        messages=[
            {'role': 'system', 'content': 'Extract structured insights from the meeting transcript.'},
            {'role': 'user', 'content': raw_text},
        ],
        response_format=MeetingSummary,
    )

    # Parse the validated response
    insights = completion.choices[0].message.parsed
    
    # Save the output as JSON
    output_file = target_file.with_suffix('.json')
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(insights.model_dump(), f, indent=4)
    
    print(f"Success! Insights saved to {output_file}")
    return insights

if __name__ == '__main__':
    # Create a dummy transcript for demonstration
    temp_file = Path('transcript.txt')
    temp_file.write_text(
        'Dave: We need to fix the login bug by Friday. Sarah, can you handle that? '
        'Sarah: Sure. We also decided to switch to Postgres for the database.',
        encoding='utf-8',
    )
    
    # Run the extraction
    extract_insights('transcript.txt')

Code walkthrough

The script begins by defining a **Pydantic schema**. The **ActionItem** and **MeetingSummary** classes tell Python exactly what fields to expect from the AI. This is the **source of truth** for your data. We then use **Path** from the **pathlib** module to handle the file input, which is more reliable than the older **os.path** approach.
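To see why the schema is the source of truth, you can exercise it directly: feeding Pydantic a payload that violates the schema raises a `ValidationError` immediately. This is a standalone sketch, independent of the API call, using the same classes as the script:

```python
from typing import List

from pydantic import BaseModel, ValidationError

class ActionItem(BaseModel):
    task: str
    assignee: str
    priority: str

class MeetingSummary(BaseModel):
    decisions: List[str]
    action_items: List[ActionItem]
    sentiment: str

# A well-formed payload validates into fully typed objects
good = MeetingSummary.model_validate({
    'decisions': ['Switch to Postgres'],
    'action_items': [
        {'task': 'Fix login bug', 'assignee': 'Sarah', 'priority': 'High'}
    ],
    'sentiment': 'Productive',
})
print(good.action_items[0].assignee)  # Sarah

# A payload missing a required field is rejected outright
try:
    MeetingSummary.model_validate({'decisions': [], 'action_items': []})
except ValidationError as exc:
    print('rejected:', exc.error_count(), 'error(s)')
```

The same validation runs inside the SDK when it parses the API response, which is what lets you trust the fields downstream.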

The core logic happens in the **client.beta.chat.completions.parse** method. By passing our **MeetingSummary** class as the **response_format** parameter, the SDK converts the model into a JSON Schema and the **OpenAI API** constrains generation to that schema, so the response matches our fields instead of arriving as conversational fluff or **broken JSON**. If the response still cannot be parsed into the model, the SDK raises an error rather than handing you malformed data. Finally, we use **model_dump** to convert the Python object back into a standard dictionary and save it as a **formatted JSON file**.
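The `model_dump` step matters because `json.dump` cannot serialize Pydantic objects directly; it recursively converts nested models into plain dicts and lists. A small sketch using the same `ActionItem` shape:

```python
import json

from pydantic import BaseModel

class ActionItem(BaseModel):
    task: str
    assignee: str
    priority: str

item = ActionItem(task='Fix login bug', assignee='Sarah', priority='High')

# model_dump yields a plain dict that json.dumps can handle
as_dict = item.model_dump()
print(json.dumps(as_dict, indent=4))

# Pydantic can also serialize straight to a JSON string,
# skipping the json module entirely
print(item.model_dump_json(indent=4))
```

Either route produces identical JSON; the script uses `model_dump` plus `json.dump` so the dictionary is available for any extra processing before it hits disk.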

Sample output

When you run the script with the sample transcript, the terminal will display a success message, and a new file named **transcript.json** will appear in your directory with this content:


{
    "decisions": [
        "Switch to Postgres for the database"
    ],
    "action_items": [
        {
            "task": "Fix the login bug",
            "assignee": "Sarah",
            "priority": "High"
        }
    ],
    "sentiment": "Productive"
}

Conclusion

Automating meeting summaries is no longer about simple keyword matching. By combining **Python types** with **LLM intelligence**, you create a robust pipeline that understands context and enforces data integrity. This script provides a solid foundation for building larger automation tools, such as auto-filling **Jira tickets** or sending **Slack alerts** based on meeting outcomes. You can now process hundreds of transcripts in seconds, ensuring that no critical decision ever falls through the cracks again.


🚀 Don’t Just Learn AI & LLMs — Master It.

This tutorial was just the tip of the iceberg. To truly advance your career and build professional-grade systems, you need the full architectural blueprint.

My book, Large Language Models Crash Course, takes you from “making it work” to “making it scale.” I cover advanced patterns, real-world case studies, and the industry best practices that senior engineers use daily.


📖 Grab Your Copy Now →