# How to extract structured data from transcripts with Python
## The real-world scenario
Imagine you are a **Technical Project Manager** or **DevOps Lead** juggling six hours of recorded meetings every day. You have the transcripts, but your project tracker requires structured data: who is doing what, by when, and what was decided. Manually reading through a **5,000-word transcript** to find three action items is like looking for a needle in a haystack while the haystack is growing. This script acts as a digital filter, instantly converting messy human conversation into a **clean JSON object** that you can pipe directly into **Jira**, **Trello**, or a **SQL database**.
## The solution
We leverage the **OpenAI API** combined with **Pydantic**, a data validation library. By using **Structured Outputs**, we force the LLM to adhere to a specific schema. This eliminates the common problem of the AI returning conversational fluff or incorrectly formatted text. We use **pathlib** for robust file handling to ensure the script works across **Windows**, **macOS**, and **Linux** without path errors.
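To see why **pathlib** is the safer choice, here is a minimal sketch (the `meetings/2024-06-01/standup.txt` path is a hypothetical example, not part of the script below). `Path` joins segments with the correct separator on every OS, so you never hand-build strings with `/` or `\\`:

```python
from pathlib import Path

# Hypothetical transcript location; Path inserts the right
# separator on Windows, macOS, and Linux automatically.
transcript = Path("meetings") / "2024-06-01" / "standup.txt"

print(transcript.suffix)  # '.txt'
print(transcript.stem)    # 'standup'
```

The same object also gives you existence checks (`transcript.exists()`) and easy extension swaps (`transcript.with_suffix('.json')`), both of which the full script below relies on.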
## Prerequisites
- Python 3.10 or higher installed
- An **OpenAI API Key**
- Install the required libraries:

```shell
pip install openai pydantic python-dotenv
```
## The code
"""
-----------------------------------------------------------------------
Authors: Sharanam & Vaishali Shah
Recipe: AI Transcript Structurer
Intent: Extract validated JSON action items from raw meeting text.
-----------------------------------------------------------------------
"""
import json
from pathlib import Path
from typing import List
from pydantic import BaseModel
from openai import OpenAI
# Define the data structure for validation
class ActionItem(BaseModel):
task: str
assignee: str
priority: str
class MeetingSummary(BaseModel):
decisions: List[str]
action_items: List[ActionItem]
sentiment: str
def extract_insights(file_path: str):
# Initialize the client (ensure OPENAI_API_KEY is in your environment)
client = OpenAI()
# Read the transcript using pathlib
target_file = Path(file_path)
if not target_file.exists():
print(f"Error: {target_file} not found.")
return
raw_text = target_file.read_text(encoding='utf-8')
# Call the LLM with structured output response format
completion = client.beta.chat.completions.parse(
model='gpt-4o-2024-08-06',
messages=[
{'role': 'system', 'content': 'Extract structured insights from the meeting transcript.'},
{'role': 'user', 'content': raw_text},
],
response_format=MeetingSummary,
)
# Parse the validated response
insights = completion.choices[0].message.parsed
# Save the output as JSON
output_file = target_file.with_suffix('.json')
with open(output_file, 'w') as f:
json.dump(insights.model_dump(), f, indent=4)
print(f"Success! Insights saved to {output_file}")
return insights
if __name__ == '__main__':
# Create a dummy transcript for demonstration
temp_file = Path('transcript.txt')
temp_file.write_text('Dave: We need to fix the login bug by Friday. Sarah, can you handle that? Sarah: Sure. We also decided to switch to Postgres for the database.')
# Run the extraction
extract_insights('transcript.txt')
## Code walkthrough
The script begins by defining a **Pydantic schema**. The **ActionItem** and **MeetingSummary** classes tell Python exactly what fields to expect from the AI. This is the **source of truth** for your data. We then use **Path** from the **pathlib** module to handle the file input, which is more reliable than the older **os.path** approach.
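To make the "source of truth" idea concrete, here is a small standalone sketch of how the same Pydantic model behaves on its own, outside the API call. A complete payload validates cleanly, while one with missing fields raises a `ValidationError` instead of silently producing a half-filled object:

```python
from pydantic import BaseModel, ValidationError


class ActionItem(BaseModel):
    task: str
    assignee: str
    priority: str


# A well-formed payload validates cleanly...
item = ActionItem(task="Fix the login bug", assignee="Sarah", priority="High")

# ...while a payload missing required fields is rejected outright.
try:
    ActionItem(task="Fix the login bug")
except ValidationError as e:
    # Both 'assignee' and 'priority' are reported as missing
    print(f"Rejected: {len(e.errors())} missing fields")  # Rejected: 2 missing fields
```

This is the same guarantee the full script gets for free: any response that does not match the schema never makes it into your pipeline.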
The core logic happens in the **client.beta.chat.completions.parse** method. By passing our **MeetingSummary** class into the **response_format** parameter, we tell the **OpenAI API** to constrain the model's output to our schema, and the SDK then parses the response into a validated **MeetingSummary** instance. If the AI tries to skip a field, the request fails, ensuring you never receive **broken JSON**. Finally, we use **model_dump** to convert the Python object back into a standard dictionary and save it as a **formatted JSON file**.
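The **model_dump** step can be illustrated in isolation with a simplified variant of the schema (here `action_items` is trimmed to a plain list so the sketch stays short). The method returns an ordinary dictionary that `json.dump` can serialize directly:

```python
import json
from typing import List

from pydantic import BaseModel


class MeetingSummary(BaseModel):
    decisions: List[str]
    action_items: list = []  # simplified for this sketch
    sentiment: str


summary = MeetingSummary(decisions=["Switch to Postgres"], sentiment="Productive")

# model_dump() yields a plain, JSON-serializable dict
as_dict = summary.model_dump()
print(json.dumps(as_dict, indent=4))
```

The `indent=4` argument is what produces the human-readable layout shown in the sample output below, rather than a single dense line.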
## Sample output
When you run the script with the sample transcript, the terminal will display a success message, and a new file named **transcript.json** will appear in your directory with this content:
```json
{
    "decisions": [
        "Switch to Postgres for the database"
    ],
    "action_items": [
        {
            "task": "Fix the login bug",
            "assignee": "Sarah",
            "priority": "High"
        }
    ],
    "sentiment": "Productive"
}
```
## Conclusion
Automating meeting summaries is no longer about simple keyword matching. By combining **Python types** with **LLM intelligence**, you create a robust pipeline that understands context and enforces data integrity. This script provides a solid foundation for building larger automation tools, such as auto-filling **Jira tickets** or sending **Slack alerts** based on meeting outcomes. You can now process hundreds of transcripts in seconds, ensuring that no critical decision ever falls through the cracks again.
🚀 Don’t Just Learn AI & LLMs — Master It.
This tutorial was just the tip of the iceberg. To truly advance your career and build professional-grade systems, you need the full architectural blueprint.
My book, Large Language Models Crash Course, takes you from “making it work” to “making it scale.” I cover advanced patterns, real-world case studies, and the industry best practices that senior engineers use daily.