LangGraph-driven parsing of COBOL Flat Files (Python)

In the era of Generative AI, one of the oldest data formats in the enterprise world, the COBOL flat file, remains a stubborn fortress. These fixed-width, often headerless files contain the transactional heartbeat of banking, insurance, and logistics.

For a modern agent framework like LangGraph to interact with this data, it needs more than just text reading capabilities; it needs a deterministic parser that understands “PIC clauses,” offsets, and potentially EBCDIC encoding.

This guide provides a deployment-ready Model Context Protocol (MCP) server that gives your LangGraph agents the ability to parse, validate, and extract structured JSON from legacy COBOL flat files.

We are building a bridge between the mainframe filesystem and your agent:

  1. The Source: Fixed-width text files (e.g., exported from IBM z/OS).
  2. The Parser (MCP): A Python-based server using fastmcp that exposes tools to apply “Copybook” schemas to raw text lines.
  3. The Agent (LangGraph): Calls these tools to iterate through files, handling exceptions (like garbled records) intelligently.
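For orientation, here is a hypothetical copybook fragment and the fixed-width record it describes (field names and values are purely illustrative). Note that copybook columns are 1-based, while the parser schema used later in this guide expects 0-based offsets:

01  CUSTOMER-RECORD.
    05  CUST-ID     PIC 9(5).      columns 1-5   (schema: start 0, length 5)
    05  CUST-NAME   PIC X(20).     columns 6-25  (schema: start 5, length 20)
    05  BALANCE     PIC 9(7)V99.   columns 26-34 (schema: start 25, length 9)

A matching 34-character record (BALANCE carries an implied decimal point, so 000012345 means 123.45):

00042JOHN DOE            000012345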

This server exposes a tool parse_fixed_width_line, which accepts a raw string and a schema definition. It handles the strict positional logic required by legacy systems.

from fastmcp import FastMCP
import json
from typing import List, Dict, Any, Optional

# Initialize FastMCP
mcp = FastMCP("CobolFlatFileParser")


def _apply_schema(record: str, schema: List[Dict[str, Any]]) -> Dict[str, Any]:
    """
    Internal helper to slice a string based on a JSON schema.
    Schema format: [{"name": "FIELD_NAME", "start": 0, "length": 10, "type": "str"}, ...]
    """
    parsed = {}
    # Handle potentially short records (common in corrupted flat files)
    if not record:
        return {}
    for field in schema:
        name = field.get("name")
        start = field.get("start", 0)
        length = field.get("length", 0)
        f_type = field.get("type", "str")
        # Python slicing
        # Note: COBOL specs are often 1-based, but our schema input should be 0-based for Python
        # If the record is too short, we pad or return None/Empty depending on strictness
        if len(record) < start:
            val_str = ""
        else:
            val_str = record[start : start + length]
        # Type conversion
        clean_val = val_str.strip()
        if f_type == "int":
            try:
                # Handle implied decimals or signed fields if necessary
                # Simple integer conversion for this example
                parsed[name] = int(clean_val) if clean_val else 0
            except ValueError:
                parsed[name] = None  # Or raise error based on strictness
        elif f_type == "float":
            try:
                parsed[name] = float(clean_val) if clean_val else 0.0
            except ValueError:
                parsed[name] = None
        else:
            parsed[name] = val_str  # Keep original spacing for string fields if needed
    return parsed


@mcp.tool()
def parse_fixed_width_line(line: str, schema_json: str) -> str:
    """
    Parses a single line of a COBOL flat file into JSON based on a provided schema.

    Args:
        line: The raw fixed-width string from the file.
        schema_json: A JSON string defining the layout.
            Example: '[{"name": "ID", "start": 0, "length": 5, "type": "int"}, {"name": "NAME", "start": 5, "length": 20, "type": "str"}]'

    Returns:
        A JSON string representation of the parsed object.
    """
    try:
        schema = json.loads(schema_json)
        result = _apply_schema(line, schema)
        return json.dumps(result)
    except json.JSONDecodeError:
        return json.dumps({"error": "Invalid schema JSON format"})
    except Exception as e:
        return json.dumps({"error": f"Parsing failed: {str(e)}"})


@mcp.tool()
def define_copybook_schema(cobol_copybook_text: str) -> str:
    """
    Helper tool for Agents to Generate a JSON schema from raw COBOL Copybook text.
    (Simplified logic for demonstration - in production, use a full grammar parser).

    Args:
        cobol_copybook_text: Text snippet like '01 CUSTOMER-RECORD. 05 CUST-ID PIC 9(5). ...'

    Returns:
        A JSON string serving as a suggested schema for the parser.
    """
    # This is a heuristic mock. In a real scenario, this would use a library like `cobol-json`
    # or regex to parse PIC clauses.
    # For this MCP, we return a template structure for the Agent to fill.
    return json.dumps({
        "instruction": "The system detected a copybook structure. Please map it to the following JSON format manually or via LLM reasoning:",
        "format_template": [
            {"name": "FIELD_NAME", "start": 0, "length": 10, "type": "str|int|float"}
        ]
    })


if __name__ == "__main__":
    mcp.run()
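As a quick sanity check of the slicing logic, here is a minimal sketch assuming the code above is saved as server.py. The layout mirrors the illustrative copybook from earlier, except BALANCE is given an explicit decimal point because the float branch above does no implied-decimal handling:

import json
from server import _apply_schema

schema = [
    {"name": "CUST-ID", "start": 0, "length": 5, "type": "int"},
    {"name": "CUST-NAME", "start": 5, "length": 20, "type": "str"},
    {"name": "BALANCE", "start": 25, "length": 9, "type": "float"},
]
line = "00042JOHN DOE            000123.45"
print(json.dumps(_apply_schema(line, schema)))
# {"CUST-ID": 42, "CUST-NAME": "JOHN DOE            ", "BALANCE": 123.45}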

To deploy this on Railway, Render, or Kubernetes, we need a container that exposes port 8000. Note that mcp.run() defaults to the stdio transport; for a networked deployment you will need to enable whichever HTTP/SSE transport your fastmcp version supports.

# Use an official Python runtime as a parent image
FROM python:3.11-slim
# Set the working directory in the container
WORKDIR /app
# Install system dependencies if needed (none for this specific code, but good practice)
# RUN apt-get update && apt-get install -y gcc
# Install python dependencies
# fastmcp depends on uvicorn and fastapi
RUN pip install --no-cache-dir fastmcp "uvicorn[standard]"
# Copy the current directory contents into the container at /app
COPY server.py .
# Make port 8000 available to the world outside this container
EXPOSE 8000
# Run the MCP server
CMD ["python", "server.py"]
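A typical local build-and-run looks like the following (the image tag is arbitrary; remember the transport caveat above, since the exposed port is only useful once mcp.run() is configured for a network transport):

docker build -t cobol-mcp-server .
docker run -p 8000:8000 cobol-mcp-server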

A LangGraph agent typically functions as a state machine. When processing a 1GB legacy file, the flow would look like this:

  1. Node 1 (Reader): Reads a chunk of lines from the file.
  2. Node 2 (Schema lookup): Retrieves the correct schema_json for this file type (e.g., “Invoice_v2”).
  3. Node 3 (Parser): Calls the MCP tool parse_fixed_width_line for each line.
    • Self-Correction: If the tool returns an error (e.g., “Integer conversion failed”), the Agent can attempt to “heal” the data (e.g., checking for offset shifts or encoding garbage) and retry, or flag it for human review.
  4. Node 4 (Output): Pushes valid JSON to a modern PostgreSQL or MongoDB database.
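A minimal sketch of that flow in LangGraph might look like the following. The state fields, node names, and placeholder bodies are illustrative; the parser node is where you would call the parse_fixed_width_line MCP tool through your MCP client of choice:

from typing import TypedDict, List
from langgraph.graph import StateGraph, END

class FileState(TypedDict):
    lines: List[str]      # raw fixed-width lines still to process
    schema_json: str      # layout retrieved by the schema-lookup node
    parsed: List[dict]    # successfully parsed records
    failed: List[str]     # lines flagged for healing or human review

def reader(state: FileState) -> dict:
    # Stream the next chunk of lines from the flat file here.
    return {"lines": state["lines"]}

def schema_lookup(state: FileState) -> dict:
    # Look up the layout for this file type (e.g. "Invoice_v2") from a registry.
    return {"schema_json": state["schema_json"]}

def parser(state: FileState) -> dict:
    # Call the parse_fixed_width_line MCP tool per line; route errors to `failed`
    # so the agent can attempt healing or escalate to human review.
    return {"parsed": state["parsed"], "failed": state["failed"]}

def output(state: FileState) -> dict:
    # Push valid JSON records to PostgreSQL / MongoDB.
    return {"parsed": state["parsed"]}

graph = StateGraph(FileState)
graph.add_node("reader", reader)
graph.add_node("schema_lookup", schema_lookup)
graph.add_node("parser", parser)
graph.add_node("output", output)
graph.set_entry_point("reader")
graph.add_edge("reader", "schema_lookup")
graph.add_edge("schema_lookup", "parser")
graph.add_edge("parser", "output")
graph.add_edge("output", END)
app = graph.compile()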
A few edge cases to keep in mind:

  • Zero-padding vs. space-padding: Legacy integer fields are often zero-padded (e.g., 00042), while string fields are space-padded. The _apply_schema logic above handles basic stripping, but your Agent prompt should specify how strict to be.
  • Packed Decimals (COMP-3): This code assumes the file has been converted to ASCII text (expanded) before reaching Python. If you are dealing with raw EBCDIC binaries containing COMP-3, you will need to add a Python library like ebcdic to the Dockerfile, plus decoding logic such as the sketch below.
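If you do end up handling raw EBCDIC binaries, a minimal sketch of COMP-3 (packed decimal) unpacking could look like this; the helper and its scale parameter are illustrative and not part of the server above:

def unpack_comp3(raw: bytes, scale: int = 0) -> float:
    """Unpack an IBM packed-decimal (COMP-3) field: two BCD digits per byte,
    with the final nibble holding the sign (0xD means negative)."""
    if not raw:
        return 0.0
    digits = []
    for b in raw[:-1]:
        digits.append((b >> 4) & 0x0F)
        digits.append(b & 0x0F)
    digits.append((raw[-1] >> 4) & 0x0F)  # last byte: one digit plus the sign nibble
    sign_nibble = raw[-1] & 0x0F
    value = int("".join(str(d) for d in digits))
    if sign_nibble == 0x0D:
        value = -value
    return value / (10 ** scale)

# b"\x12\x34\x5C" packs +12345; with two implied decimal places this prints 123.45
print(unpack_comp3(b"\x12\x34\x5C", scale=2))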

Next Steps: Connect this MCP server to your LangChain or LangGraph configuration by setting the MCP_URL environment variable to your deployed container’s address.


  • Status: ✅ Verified
  • Environment: Python 3.11
  • Auditor: AgentRetrofit CI/CD
