CrewAI for COBOL Flat File Data Transformation
Legacy mainframes often export data in “Flat File” formats: rigid, fixed-width text files with no delimiters such as commas or tabs. These files rely on COBOL Copybooks to define which byte range corresponds to which field.
Modern AI Agents (like CrewAI) struggle with this format natively. They expect structured JSON, CSV, or XML. This guide provides an MCP (Model Context Protocol) bridge that allows your CrewAI agents to parse, validate, and transform raw COBOL fixed-width data into usable JSON structures.
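To see why delimiter-based parsers fail here, consider a single payroll record sliced purely by byte position. The layout below is a hypothetical example, not a real copybook:

```python
# No commas or tabs: only the copybook's byte ranges tell us where
# one field ends and the next begins.
#   EMP_ID -> cols 1-5, NAME -> cols 6-15, SALARY -> cols 16-24
line = "00101JOHN DOE  005000.00"

emp_id = line[0:5]     # "00101"
name   = line[5:15]    # "JOHN DOE  " (space-padded to width 10)
salary = line[15:24]   # "005000.00"

print(emp_id, name.strip(), salary)
```

A CSV reader given this line would return a single unsplit field, which is exactly the problem the MCP bridge below solves.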
🏗️ Architecture
We will deploy a lightweight FastMCP server that acts as a transformation engine.
- Input: Raw fixed-width string data (simulating a file read from an FTP or mainframe dump).
- Processing: The MCP server applies a dynamic schema (start/end positions) to parse the bytes.
- Validation: It checks for data type integrity (e.g., ensuring numeric fields are actually numbers).
- Output: Clean JSON returned to the Agent’s context.
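The processing and validation steps can be sketched in a few lines. This is a simplified illustration of the pipeline, not the server code itself; the two-row payroll dump and field positions are hypothetical:

```python
import json

# Validate numeric fields before emitting JSON; garbage bytes go to an
# errors list instead of crashing the whole parse.
rows = ["00101JOHN DOE  005000.00", "0010XBAD REC   NOTNUMBR "]
records, errors = [], []

for i, line in enumerate(rows, start=1):
    raw_id, raw_salary = line[0:5], line[15:24]
    try:
        records.append({
            "EMP_ID": int(raw_id),          # raises ValueError on "0010X"
            "NAME": line[5:15].strip(),
            "SALARY": float(raw_salary),    # raises ValueError on "NOTNUMBR"
        })
    except ValueError:
        errors.append(f"Row {i}: non-numeric data '{raw_id}' / '{raw_salary}'")

print(json.dumps({"data": records, "errors": errors}))
```

The second row is rejected but reported, so the Agent still receives clean JSON for every valid record.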
The Stack
- Framework: FastMCP (Python)
- Parsing: Pandas (via `read_fwf` logic)
- Transport: SSE (Server-Sent Events) over HTTP
- Agent: CrewAI
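The core parsing primitive is pandas' `read_fwf`, which splits lines by column widths rather than delimiters. A quick standalone demo, with illustrative column names and widths:

```python
import io
import pandas as pd

raw = "00101JOHN DOE  005000.00\n00102JANE ROE  007500.50\n"

# widths=[5, 10, 9] maps each line to EMP_ID, NAME, SALARY byte ranges.
# dtype=str preserves leading zeros; type conversion happens later.
df = pd.read_fwf(
    io.StringIO(raw),
    widths=[5, 10, 9],
    header=None,
    names=["EMP_ID", "NAME", "SALARY"],
    dtype=str,
)

print(df.to_dict(orient="records"))
```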
💻 Server Implementation
This server provides a tool `parse_fixed_width_data` which takes the raw text and a schema definition. This allows the Agent to handle any flat file format as long as it knows the column specifications.
server.py
```python
from fastmcp import FastMCP
from pydantic import BaseModel, Field
from typing import List, Dict, Any, Optional
import pandas as pd
import io
import json

# Initialize FastMCP
mcp = FastMCP("CobolTransformer")

class ColumnSpec(BaseModel):
    name: str
    width: int
    dtype: str = "str"  # options: str, int, float

class ParsingResult(BaseModel):
    success: bool
    record_count: int
    data: List[Dict[str, Any]]
    errors: List[str]

@mcp.tool()
def parse_fixed_width_data(
    raw_content: str,
    columns: List[Dict[str, Any]]
) -> str:
    """
    Parses COBOL-style fixed-width text data into JSON.

    Args:
        raw_content: The raw string content of the flat file.
        columns: A list of dicts defining the schema. Each dict must have:
            'name' (field name), 'width' (number of characters), and
            optionally 'dtype' (int, float, str).
            Example: [{"name": "ID", "width": 5, "dtype": "int"}, ...]

    Returns:
        JSON string containing the parsed records and any validation errors.
    """
    try:
        # Prepare column specs for Pandas
        col_names = [c['name'] for c in columns]
        col_widths = [c['width'] for c in columns]

        # Use Pandas read_fwf for robust parsing.
        # We assume no header in the raw file (common in mainframe dumps).
        df = pd.read_fwf(
            io.StringIO(raw_content),
            widths=col_widths,
            header=None,
            names=col_names,
            dtype=str  # Read as string first to handle validation manually/safely
        )

        records = []
        errors = []

        # Row-by-row validation and type conversion
        for index, row in df.iterrows():
            record = {}
            row_valid = True

            for col in columns:
                field_name = col['name']
                field_val = row[field_name]
                target_type = col.get('dtype', 'str')

                # Handle NaN/None from pandas
                if pd.isna(field_val):
                    field_val = ""

                try:
                    if target_type == 'int':
                        record[field_name] = int(field_val.strip() or 0)
                    elif target_type == 'float':
                        record[field_name] = float(field_val.strip() or 0.0)
                    else:
                        record[field_name] = str(field_val).strip()
                except ValueError:
                    errors.append(
                        f"Row {index + 1}: Field '{field_name}' expected "
                        f"{target_type}, got '{field_val}'"
                    )
                    row_valid = False

            if row_valid:
                records.append(record)

        result = ParsingResult(
            success=len(errors) == 0,
            record_count=len(records),
            data=records,
            errors=errors
        )

        return result.model_dump_json()

    except Exception as e:
        return json.dumps({
            "success": False,
            "error": f"Critical parsing failure: {str(e)}",
            "data": []
        })

if __name__ == "__main__":
    # MANDATORY: Bind to 0.0.0.0 for Docker compatibility
    mcp.run(transport='sse', host='0.0.0.0', port=8000)
```

🐳 Docker Configuration
We use a slim Python image to keep the container lightweight while ensuring pandas is available for data processing.
Dockerfile
```dockerfile
# Base image
FROM python:3.11-slim

# Set working directory
WORKDIR /app

# Install system dependencies if needed (usually none for this stack)
# RUN apt-get update && apt-get install -y ...

# Install Python dependencies
# fastmcp: The MCP server framework
# pandas: For efficient fixed-width parsing
# uvicorn: ASGI server required by fastmcp[sse]
RUN pip install --no-cache-dir "fastmcp[sse]" pandas uvicorn

# Copy application code
COPY server.py .

# Expose the port for Railway/Docker networking
EXPOSE 8000

# Run the server
CMD ["python", "server.py"]
```

🔌 Connecting CrewAI
Your CrewAI agent needs to connect to the SSE endpoint exposed by the Docker container. Below is the configuration pattern to register the MCP tool.
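If both containers run under Docker Compose, a minimal compose file might look like the sketch below. The `cobol-mcp` service name matches the URL hint used when connecting from the agent; the `crew-agent` service and its build path are assumptions for illustration:

```yaml
services:
  cobol-mcp:
    build: .              # builds the Dockerfile above
    ports:
      - "8000:8000"

  # Hypothetical agent container; over the Compose network it would
  # reach the server at http://cobol-mcp:8000/sse.
  crew-agent:
    build: ./agent        # assumed directory containing agent.py
    depends_on:
      - cobol-mcp
```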
agent.py
```python
from crewai import Agent, Task, Crew

# 1. Define the connection to the MCP Server
# If running via Docker Compose, use the service name (e.g., http://cobol-mcp:8000/sse)
# If running locally with `docker run -p 8000:8000`, use localhost
mcp_sources = ["http://localhost:8000/sse"]

# 2. Define the Agent
legacy_data_specialist = Agent(
    role='Mainframe Data Analyst',
    goal='Convert raw legacy flat files into structured JSON for analysis',
    backstory="You are an expert in COBOL copybooks and data migration.",
    # CrewAI v0.100+ syntax for MCP integration
    mcps=mcp_sources,
    verbose=True
)

# 3. Define the Task
# Note: In a real scenario, the 'raw_content' might be read from a file tool first.
transform_task = Task(
    description="""
    I have a raw fixed-width string from a legacy payroll system.
    The layout is:
    - EMP_ID: 5 characters (Integer)
    - NAME: 10 characters (String)
    - SALARY: 9 characters (Float)

    Here is the raw data:
    00101JOHN DOE  005000.00
    00102JANE ROE  007500.50
    0010XBAD REC   NOTNUMBR

    Use the 'parse_fixed_width_data' tool to parse this.
    Report back the valid JSON records and identify any rows
    that failed validation.
    """,
    expected_output="A summary of valid records in JSON format and a list of parsing errors.",
    agent=legacy_data_specialist
)

# 4. Run the Crew
crew = Crew(
    agents=[legacy_data_specialist],
    tasks=[transform_task]
)

result = crew.kickoff()
print("### Transformation Result ###")
print(result)
```

🛠️ Deployment Notes
- Build the Image:

```shell
docker build -t agentretrofit/cobol-transform .
```

- Run the Container:

```shell
docker run -p 8000:8000 agentretrofit/cobol-transform
```

- Validation: The server provides built-in type checking. If the COBOL file contains garbage data (common in old systems), the `errors` list in the response allows the Agent to decide whether to discard the row or flag it for human review.
🛡️ Quality Assurance
- Status: ✅ Verified
- Environment: Python 3.11
- Auditor: AgentRetrofit CI/CD