AI Agents for COBOL Flat File Data Validation and Cleansing

The “Big Iron” Bottleneck
Legacy COBOL systems process millions of transactions nightly, dumping the results into Fixed-Width Text Files (often called “Flat Files”). Unlike modern JSON or CSV, these files have no delimiters. A customer name isn’t “Column B”; it is exactly bytes 10 through 40 of a row. If a single byte is off, the entire dataset breaks.
For modern AI Agents, these files are opaque blobs. Without a strict schema validator, an agent trying to read a flat file will hallucinate relationships that don’t exist.
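To make the byte-offset point concrete, here is a minimal sketch in plain Python; the 50-byte record layout is invented for illustration:

```python
# Hypothetical 50-byte record: account number (bytes 0-9),
# customer name (bytes 10-39), balance (bytes 40-49).
record = "0000123456" + "JOHN SMITH".ljust(30) + "0000099.99"

account = record[0:10]          # "0000123456"
name = record[10:40].strip()    # "JOHN SMITH" (space padding removed)
balance = float(record[40:50])  # 99.99 -- shift one byte and this breaks
```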
The Retrofit Solution
We will build a FastMCP Server that acts as a Validation Gateway. Instead of your agent trying to parse raw text, it hands the file and a “Copybook” (schema definition) to this tool. The tool returns structured, validated JSON, or specific error reports pinpointing exactly which row and field failed.
🛠️ What We’re Building
- A Fixed-Width Parser: Uses `pandas` to strictly enforce byte-level column definitions (see the copybook-to-schema sketch after this list).
- Schema Validator: Checks if numeric fields actually contain numbers and if dates match COBOL formats (e.g., `YYMMDD`).
- Data Cleanser: Trims the aggressive whitespace padding typical of mainframe exports.
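For a rough sense of how a copybook translates into the schema this server expects, consider this invented fragment (field names and PIC clauses are hypothetical):

```python
# Hypothetical COBOL copybook:
#   05 CUST-ID    PIC 9(5).     <- 5 numeric digits
#   05 CUST-NAME  PIC X(30).    <- 30 alphanumeric characters
#   05 BALANCE    PIC 9(7)V99.  <- 9 digits, implied decimal point
schema = [
    {"name": "CUST_ID", "length": 5, "dtype": "int"},
    {"name": "CUST_NAME", "length": 30, "dtype": "str"},
    # The implied decimal (V) is not handled by this server;
    # read it as an int and divide by 100 downstream.
    {"name": "BALANCE", "length": 9, "dtype": "int"},
]
```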
The Code (server.py)
This MCP server provides two tools: `parse_and_validate`, which checks data integrity and converts the legacy format into agent-readable JSON, and `generate_schema_template`, which returns a starter schema for the agent to fill in.
```python
import io
import json
from datetime import datetime
from typing import Any, Dict, List

import pandas as pd
from fastmcp import FastMCP
from pydantic import BaseModel, Field

# Initialize the FastMCP server
mcp = FastMCP("COBOL Flat File Validator")


class ColumnSpec(BaseModel):
    name: str = Field(..., description="The name of the field (e.g., 'CUSTOMER_ID')")
    length: int = Field(..., description="The fixed width length of the field in characters")
    dtype: str = Field(..., description="Expected data type: 'str', 'int', 'float', or 'date' (YYMMDD)")


@mcp.tool()
def parse_and_validate(file_content: str, schema: List[Dict[str, Any]]) -> str:
    """
    Parses a COBOL fixed-width flat file content based on a provided schema
    (list of column specs). Returns a JSON string containing valid records
    and a report of any validation errors.

    Args:
        file_content: The raw string content of the flat file.
        schema: A list of dicts, each containing 'name', 'length', and 'dtype'.
                Example: [{"name": "ID", "length": 5, "dtype": "int"}, ...]
    """
    try:
        # Convert schema dicts to ColumnSpec objects for validation
        specs = [ColumnSpec(**s) for s in schema]

        # Calculate (start, end) byte offsets for pandas read_fwf
        col_specs = []
        current_pos = 0
        for spec in specs:
            col_specs.append((current_pos, current_pos + spec.length))
            current_pos += spec.length

        names = [s.name for s in specs]

        # Read every field as a string first; we validate manually below
        # for better per-field error reporting.
        df = pd.read_fwf(
            io.StringIO(file_content),
            colspecs=col_specs,
            header=None,
            names=names,
            dtype=str,           # prevent pandas auto-conversion errors
            keep_default_na=False,
        )

        validation_report = {
            "total_rows": len(df),
            "valid_rows": 0,
            "errors": [],
            "data": [],
        }

        valid_data = []

        for index, row in df.iterrows():
            row_errors = []
            cleaned_row = {}

            for spec in specs:
                raw_val = row[spec.name]
                # COBOL CLEANSING: strip the space padding
                clean_val = raw_val.strip() if raw_val else ""

                # VALIDATION LOGIC
                if spec.dtype == 'int':
                    if clean_val.isdigit():
                        clean_val = int(clean_val)
                    elif clean_val == "":
                        # COBOL often zero-pads numeric fields; treat empty as 0
                        clean_val = 0
                    else:
                        row_errors.append(f"Field '{spec.name}' expected INT, got '{raw_val}'")
                elif spec.dtype == 'float':
                    try:
                        clean_val = float(clean_val)
                    except ValueError:
                        row_errors.append(f"Field '{spec.name}' expected FLOAT, got '{raw_val}'")
                elif spec.dtype == 'date':
                    # COBOL-style YYMMDD date check
                    try:
                        datetime.strptime(clean_val, "%y%m%d")
                    except ValueError:
                        row_errors.append(f"Field '{spec.name}' expected YYMMDD date, got '{raw_val}'")

                cleaned_row[spec.name] = clean_val

            if row_errors:
                validation_report["errors"].append({
                    "row_index": index,
                    "issues": row_errors,
                })
            else:
                valid_data.append(cleaned_row)

        validation_report["valid_rows"] = len(valid_data)
        validation_report["data"] = valid_data

        return json.dumps(validation_report, indent=2)

    except Exception as e:
        return json.dumps({"fatal_error": str(e)})


@mcp.tool()
def generate_schema_template() -> str:
    """Returns a template JSON schema for the user to fill out."""
    template = [
        {"name": "RECORD_TYPE", "length": 2, "dtype": "str"},
        {"name": "ACCOUNT_NUM", "length": 10, "dtype": "int"},
        {"name": "AMOUNT", "length": 12, "dtype": "float"},
        {"name": "DESCRIPTION", "length": 30, "dtype": "str"},
    ]
    return json.dumps(template, indent=2)


if __name__ == "__main__":
    # Defaults to the stdio transport; for hosted deployments (e.g., the
    # port-8000 setup in the Dockerfile below), configure an HTTP transport
    # per your FastMCP version.
    mcp.run()
```
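Before containerizing, you can sanity-check the tool in-process. The snippet below assumes a recent FastMCP release that ships an in-memory `Client`; the sample row and schema are invented:

```python
import asyncio

from fastmcp import Client

from server import mcp

# One 54-byte record matching the template schema
sample = "01" + "0001234567" + "000001050.00" + "Office supplies".ljust(30)
schema = [
    {"name": "RECORD_TYPE", "length": 2, "dtype": "str"},
    {"name": "ACCOUNT_NUM", "length": 10, "dtype": "int"},
    {"name": "AMOUNT", "length": 12, "dtype": "float"},
    {"name": "DESCRIPTION", "length": 30, "dtype": "str"},
]

async def main():
    async with Client(mcp) as client:
        result = await client.call_tool(
            "parse_and_validate",
            {"file_content": sample, "schema": schema},
        )
        print(result)

asyncio.run(main())
```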
The Container (Dockerfile)

This Dockerfile installs the data-processing dependencies and exposes port 8000 for Railway or similar cloud hosting.
```dockerfile
# Use a slim Python image to keep the container lightweight
FROM python:3.11-slim

# Set the working directory
WORKDIR /app

# Install system dependencies if needed (usually not for pure pandas/mcp)
# RUN apt-get update && apt-get install -y gcc

# Install Python dependencies:
# pandas for data processing, fastmcp for the server
RUN pip install pandas fastmcp pydantic

# Copy the server code
COPY server.py .

# Expose the port for the MCP server
EXPOSE 8000

# Run the server
CMD ["python", "server.py"]
```
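A typical local build-and-run looks like this, assuming the arbitrary image tag `cobol-validator` (note that `mcp.run()` defaults to stdio, so the port mapping only matters once an HTTP transport is configured):

```sh
docker build -t cobol-validator .
docker run -p 8000:8000 cobol-validator
```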
Agent Strategy: How to Use This

Once deployed, your AI Agent (CrewAI, LangGraph, or OpenAI) can use this tool to safely ingest legacy data.
1. The Discovery Phase
The agent doesn’t know the file structure yet. It asks the user for the “Copybook” or “File Layout”.
- Agent Prompt: “I see you uploaded `SALES_2024.TXT`. Please provide the column layout (Field Name, Length, Type).”
- User: “ID is 5 chars, Date is 8 chars, Amount is 10 chars.”
2. The Schema Construction
The agent constructs the schema list based on the user’s description.
- Agent Action: Calls `generate_schema_template` to see the format, then populates it, as shown below.
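For the layout the user described above, the populated schema might look like this (field names are illustrative; the 8-character date is kept as a string, since the validator’s `date` type expects 6-character `YYMMDD`):

```json
[
  {"name": "ID", "length": 5, "dtype": "int"},
  {"name": "DATE", "length": 8, "dtype": "str"},
  {"name": "AMOUNT", "length": 10, "dtype": "float"}
]
```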
3. The Validation Loop
The agent passes the file content and the schema to `parse_and_validate`.
- Scenario A (Success): The tool returns `{"valid_rows": 500, "errors": []}`. The agent proceeds to analyze the clean JSON data.
- Scenario B (Failure): The tool returns `{"errors": [{"row_index": 4, "issues": ["Field 'AMOUNT' expected INT..."]}]}`.
- Agent Response: “It looks like Row 4 has corrupted data in the Amount field. Should I skip this row or would you like to correct the file?” (A minimal triage sketch follows this list.)
This approach prevents the “Garbage In, Garbage Out” problem that plagues AI when dealing with strict legacy formats.
🛡️ Quality Assurance
- Status: ✅ Verified
- Environment: Python 3.11
- Auditor: AgentRetrofit CI/CD