
Building a Generic Knowledge Extraction AI Framework for Organization-Specific Use Cases


Note: The GitHub repo has been updated with an improved schema generator (schema.py). This article refers to the previous version of the schema generator (schema_basic.py). The new version preserves the exact field names (and their optionality), includes data type hints for schema generation, recognizes richer data types such as dates, datetimes, numeric quantities, and enums, and normalizes them in place. Date handling is also improved: the generator accepts many input formats and honors any format the user specifies.

TL;DR

This article demonstrates how to build a flexible knowledge extraction system that converts natural language requirements into Pydantic schemas and extracts structured data from documents. You will learn about two-stage schema generation, intelligent structure detection (flat vs. nested), context-aware document parsing with a custom vision parser, and structured knowledge extraction from parsed documents. The framework ensures type-safe extraction with automatic data normalization.

Table of Contents: · Introduction
· Requirement Parsing and Format Detection from Simple Language
· Document Parsing
· Schema Generation
· Data Extraction Engine
· Examples
· Potential Extensions


Introduction

Extracting structured data from unstructured documents using Large Language Models (LLMs) is in increasingly high demand. We are witnessing diverse industrial applications for extracting knowledge from invoices, research grants, medical records, legal contracts, business reports, customer documents, and more. Although LLMs can be applied directly to knowledge extraction tasks, accurate and reliable extraction requires constraining the models to return responses in a structured form, which in turn means designing structured models that force the LLM output into consistent, fixed formats. However, manually defining Pydantic models, validation rules, and field specifications requires domain expertise.

In my previous articles (e.g., the one cited below), I addressed the issue of parsing extraction requirements from simple language and dynamically generating schemas for LLM-based knowledge extraction. I created multiple extractors to deal with different types of knowledge extraction tasks. The idea of dynamic model generation was well received, and readers asked me to generate more robust models.

Building a (Enhanced) Generic Knowledge Extraction AI Tool with Multiple Extractors to Create… A generic, knowledge extraction AI tool that allows creating task/organization-specific use-cases by specifying… medium.com

The current article takes this topic to the next level. It defines methods for parsing extraction requirements expressed in plain language, automatically generating type-safe schemas, detecting data structures, and handling all the validation in a much simpler way than the methods in the previous articles.

This article walks through the architecture and implementation of such a system, which has five main components: i) requirement parser, ii) format detector, iii) document parsers, iv) schema generator, and v) extractor.

You will learn how to:

  • Convert natural language requirements into dynamic Pydantic models using a requirement parser.
  • Automatically detect whether data is flat (one record per document) or nested (multiple items per document) using a format detector.
  • Parse data from documents by selecting one of multiple parsers in the provided parser library, including a vision parser with context-aware processing for complex document structures; this parser merges tables that span across pages using LLM-powered post-processing.
  • Generate data models for extraction using a schema generator.
  • Extract and normalize data for production use using an extractor.

The framework is modular and designed for flexibility. Whether you are processing invoices, research proposals, or any document type, you can adapt it to your needs by simply describing what you want to extract.

The complete code is available on GitHub and can be adapted for any document extraction task.

You can also install the extraction framework as a Python package from the repository of our GAIK project, a research and development project that builds a business-oriented Generative AI (GenAI) toolkit for knowledge management. The toolkit is currently under development, with new modules being added. The extraction framework discussed in this article is one of its modular components. See the toolkit repo at the following link:

GitHub - GAIK-project/gaik-toolkit: Python toolkit providing reusable AI/ML utilities: schema… Python toolkit providing reusable AI/ML utilities: schema extraction, structured outputs, and production-ready… github.com

The extractor package can be installed from the GAIK repo as follows:

pip install gaik[extractor]

See the examples and README of the extractor’s repo to use the package. The complete workflow, starting from requirement specification to knowledge extraction, is shown in the following diagram:

Complete workflow from plain language requirements to validated structured data. The system automatically detects data structure, selects appropriate parsers, generates schemas, and extracts normalized data.

Let us dive into the individual steps of the workflow.

Requirement Parsing and Format Detection from Simple Language

The first step in the extraction pipeline is understanding what the user wants to extract. Instead of requiring users to write Pydantic models manually, the system accepts plain English descriptions of extraction requirements and automatically converts them into structured field specifications.

Parsing User Requirements

The requirement parser analyzes natural language input and extracts field definitions, including their names, types, descriptions, and validation rules. This is handled by the parse_user_requirements() function in schema.py, which sends the raw prompt to the LLM and asks for a structured response that fits the FieldSpec schema. Each FieldSpec contains everything the model generator needs to later build a Pydantic field.

For example, a user might provide requirements like:

requirements = """
Extract project information from research grant proposals:
- Project title (string, required)
- Total budget in EUR (decimal, required)
- Start date (date, format: iso-date)
- Project status (enum: active, completed, pending)
- Principal investigator name (string)
"""

The requirement parser converts this into a list of FieldSpec objects, where each field specification contains:

class FieldSpec(BaseModel):
    field_name: str                     # snake_case via validator
    field_type: AllowedTypes            # "str", "int", "decimal", "list[str]", etc.
    description: str
    required: bool
    enum: Optional[list[str]] = None
    pattern: Optional[str] = None
    format: Optional[Literal["iso-date", "currency-eur"]] = None

The FieldSpec class in schema.py includes validators to ensure field names follow Python naming conventions and that enum lists are non-empty. This structured representation makes it easy to dynamically generate Pydantic models later in the pipeline.

The requirement parser emits a list of FieldSpec objects representing those fields, plus a use_case_name, all bundled inside ExtractionRequirements:

class ExtractionRequirements(BaseModel):
    use_case_name: str
    fields: list[FieldSpec]
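
To make the mapping concrete, here is a hedged sketch of the kind of ExtractionRequirements object the parser might produce for the grant-proposal requirements above. The exact field names, types, and descriptions are chosen by the LLM, the types here are a guess based on the AllowedTypes hints shown earlier, and the import path is an assumption:

# Illustrative only: actual parser output may differ in naming and detail
from extractors.schema import ExtractionRequirements, FieldSpec  # assumed import path

parsed = ExtractionRequirements(
    use_case_name="research_grant_proposals",
    fields=[
        FieldSpec(field_name="project_title", field_type="str",
                  description="Project title", required=True),
        FieldSpec(field_name="total_budget_eur", field_type="decimal",
                  description="Total budget in EUR", required=True,
                  format="currency-eur"),
        FieldSpec(field_name="start_date", field_type="str",
                  description="Project start date", required=False,
                  format="iso-date"),
        FieldSpec(field_name="project_status", field_type="str",
                  description="Project status", required=False,
                  enum=["active", "completed", "pending"]),
        FieldSpec(field_name="principal_investigator_name", field_type="str",
                  description="Principal investigator name", required=False),
    ],
)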

Format Detection: Flat vs. Nested Structures

Next, the detect_structure_type() function in schema.py examines the same natural-language prompt to decide whether the schema should be flat (one record per document) or nested (multiple items within a document). It returns a StructureAnalysis object:

class StructureAnalysis(BaseModel):
    structure_type: Literal["flat", "nested_list"]
    parent_container_name: str  # e.g., "line_items"
    parent_description: str
    item_description: str
    reasoning: str

For example:

  • A statement like “Extract invoice number, date, total amount for each invoice” stays flat.
  • A statement like “For each line item in the invoice, extract item number, description, quantity” becomes nested_list with a container such as items.
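
For the nested case above, the returned object might look roughly like the following. This is a hedged illustration; the actual container name, descriptions, and reasoning are generated by the LLM:

# Illustrative StructureAnalysis for the line-item prompt (not actual output)
analysis = StructureAnalysis(
    structure_type="nested_list",
    parent_container_name="items",
    parent_description="Collection of line items extracted from a single invoice",
    item_description="For each line item, extract item number, description, and quantity",
    reasoning="The request targets multiple similar items within one document",
)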

The detect_structure_type() function uses the following prompt to analyze user requirements:

prompt = """
Analyze the following extraction requirements and determine the output structure.

Use NESTED_LIST when:
- The DOCUMENT contains multiple items/records/rows to extract
- Instructions mention 'multiple items IN THE DOCUMENT', 'list of items', 'table of records'
- 'one line per item', 'one row per record', 'repeat for each entry'
- Document is structured as a table, list, or collection of similar items
- Example: Extract all products from an invoice (multiple products in one invoice)

Use FLAT when:
- ONE record per document (even if processing multiple documents)
- 'for each document', 'from each document', 'per document'
- Document describes a SINGLE entity (e.g., one project, one invoice, one person)
- Extracting summary/aggregate information from the document
- Example: Extract project details from grant document (one project per document)

IMPORTANT: 'For each X, extract...' means FLAT if X is the document itself,
NESTED if X refers to multiple items within the document.

Requirements:
{user_description}
"""

The parse_nested_requirements() function orchestrates both steps. If the analysis says “flat,” it simply builds a single model. If it says “nested_list,” it parses the item-level requirements using StructureAnalysis.item_description, creates an item model, and then wraps it inside a parent model with a List[ItemModel] field named after parent_container_name.
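
Conceptually, the nested branch reduces to wrapping the item model inside a container model. The following is a minimal sketch of that wrapping step using Pydantic's create_model; the helper name is hypothetical, not the toolkit's actual function:

from typing import List
from pydantic import BaseModel, ConfigDict, Field, create_model

def wrap_items_in_container(item_model: type[BaseModel],
                            container_name: str,
                            container_description: str) -> type[BaseModel]:
    # Put a list of item models behind a single field named after
    # parent_container_name, mirroring the nested_list structure.
    return create_model(
        f"{item_model.__name__}_Collection",
        __config__=ConfigDict(extra="forbid"),
        **{container_name: (List[item_model],
                            Field(description=container_description))},
    )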

After requirement parsing, we are now ready for schema generation.

Why Separate Parsing from Schema Generation?

Keeping requirement parsing (ExtractionRequirements) and structure detection (StructureAnalysis) separate from model generation buys flexibility:

  • You can capture the LLM-derived specs once, inspect or edit them, and even persist them (as shown in examples/extraction_example_4.py).
  • If you already know your fields, you can skip the LLM entirely by constructing ExtractionRequirements manually (see examples/extraction_example_3.py and the sketch after this list).
  • This decoupling is what enables both dynamic schema creation and deterministic manual schemas within the same toolkit.
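
As a quick illustration of the manual path, here is a hedged sketch, mirroring the spirit of examples/extraction_example_3.py rather than its actual code, of building ExtractionRequirements by hand and feeding it straight to the toolkit's model builder (create_extraction_model(), covered in the Schema Generation section):

# Deterministic path: no LLM call is needed to define the fields
manual_reqs = ExtractionRequirements(
    use_case_name="invoice_header",
    fields=[
        FieldSpec(field_name="invoice_number", field_type="str",
                  description="Invoice number", required=True),
        FieldSpec(field_name="total_amount", field_type="decimal",
                  description="Invoice total amount", required=True),
    ],
)
InvoiceHeader = create_extraction_model(manual_reqs)  # same builder used by SchemaGenerator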

Document Parsing

Before extracting structured data, documents must first be converted into a format that LLMs can process effectively. The framework provides a library of parsers in the parsers/ module, each suited to different document complexities and parsing requirements, and users can select the appropriate parser for their task, from simple text extraction to complex multi-page tables with intricate layouts. The most notable, customized parser is VisionParser (vision.py), which converts PDFs to markdown using a Vision API with context-aware processing for complex layouts. Other standard parsers include PyMuPDFParser (pymupdf.py), DoclingParser (docling.py), and DocxParser (docx.py).

VisionParser is specially designed to deal with complex document structures, such as tables spanning multiple pages, intricate layouts, or documents where preserving visual context is critical.

VisionParser: Context-Aware PDF Processing

The VisionParser class in vision.py converts PDF pages to high-resolution images and uses OpenAI’s Vision API to extract markdown content while maintaining awareness of document structure across pages.

Core Workflow:

The VisionParser follows a multi-stage process:

  • PDF to Image Conversion: Uses PyMuPDF to render each PDF page as a high-resolution image
  • Vision API Processing: Sends images to the Vision API with carefully engineered prompts
  • Context Injection: Passes previous page content to maintain continuity
  • Table Merging: Post-processes markdown to merge tables split across pages

Here is how to initialize and use the VisionParser:

from extractors import get_openai_config
from extractors.parsers import VisionParser

config = get_openai_config(use_azure=True)
parser = VisionParser(
    openai_config=config,
    use_context=True,      # Enable inter-page context
    dpi=300,               # Image resolution (200-300 recommended)
    clean_output=True      # Enable LLM-powered table merging
)

# Convert PDF to markdown
markdown_pages = parser.convert_pdf("document.pdf")
parser.save_markdown(markdown_pages, "output/document.md")

Context-Aware Processing:

One of the most challenging aspects of document parsing is handling content that spans multiple pages, particularly tables. Traditional parsers treat each page independently, often breaking tables at page boundaries and losing context. The VisionParser uses context injection. When processing page N, it includes the markdown content from page N-1 as context in the Vision API prompt. This allows the model to understand when a table continues from the previous page and avoid repeating headers or introducing artificial breaks.

The _parse_image_with_vision() method in vision.py implements this context passing:

def _parse_image_with_vision(
    self,
    image: Image.Image,
    page_num: int,
    previous_context: Optional[str] = None
) -> str:
    messages = [...]

    if previous_context and self.use_context:
        context_text = f"""
CONTEXT FROM PREVIOUS PAGE:
The previous page has the following content:
```
{previous_context}
```

If this page continues a table or section from the previous page,
continue it seamlessly without repeating headers.
"""
        messages[0]["content"].append({"type": "text", "text": context_text})

    # Add the current page image and process
    ...

This context-aware approach improves the accuracy of multi-page table extraction, preventing the model from hallucinating new table structures or duplicating content.

The VisionParser uses a prompt (DEFAULT_TABLE_PROMPT in vision.py) to guide the Vision API:

DEFAULT_TABLE_PROMPT = """
Convert this document page to markdown format.

CRITICAL RULES:
1. Preserve ALL tables exactly as shown with proper markdown table syntax
2. Do NOT add empty rows that are not visible in the image
3. If a table continues from a previous page, continue it without repeating headers
4. Use | for table columns and separate header row with |---|---|
5. Extract all text content accurately
6. Maintain document structure (headings, lists, paragraphs)

Do not hallucinate content. Only extract what is actually visible.
"""

The prompt emphasizes accuracy over creativity, explicitly preventing the model from adding empty rows or inventing content — common issues when using vision models for structured data extraction.

LLM-Powered Table Merging: Even with context-aware processing, multi-page tables can have artifacts like incomplete rows at page breaks or empty rows introduced during extraction. The VisionParser addresses this with post-processing via the _clean_markdown_with_llm() method.

When clean_output=True, the parser takes the extracted markdown from all pages and asks an LLM to:

  • Merge tables that span across ---PAGE_BREAK--- markers
  • Remove hallucinated empty table rows
  • Handle incomplete rows at page boundaries
  • Consolidate repeated headers

def _clean_markdown_with_llm(self, markdown_pages: List[str]) -> str:
    # Combine pages with separators
    combined = "\n\n---PAGE_BREAK---\n\n".join(markdown_pages)

    # LLM prompt to clean and merge
    cleaning_prompt = """
You are given a multi-page markdown document with tables that may span pages.

Tasks:
1. Merge tables that are split across PAGE_BREAK markers
2. Remove empty table rows (rows with only | | | structure)
3. Handle incomplete rows at page boundaries
4. Keep all other content unchanged

Return the cleaned markdown.
"""
    # ... LLM processing
    return cleaned_markdown

This two-stage approach (context-aware extraction followed by merging) produces cleaner output than single-pass methods, especially for complex financial documents or technical reports with extensive tabular data.
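
The parsed markdown then feeds directly into the extraction step described later. One simple way to hand it over, assuming the markdown_pages list from the earlier VisionParser snippet, is:

# Join the per-page markdown into a single document string for extraction
document_text = "\n\n".join(markdown_pages)
documents = [document_text]   # DataExtractor.extract() takes a list of documents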

Schema Generation

Once we have parsed requirements (ExtractionRequirements) and detected the structure type (StructureAnalysis), the system generates type-safe Pydantic models that enforce the LLM to output data in the exact format we need.

The SchemaGenerator Class

The SchemaGenerator class in schema.py handles the entire schema generation workflow. It is the main user-facing interface that ties together requirement parsing, structure detection, and model creation.

Here is how to use it:

from extractors import get_openai_config, SchemaGenerator

config = get_openai_config(use_azure=True)
generator = SchemaGenerator(config=config)

requirements = """
Extract invoice line items:
- Item number (string, required)
- Description (string, required)
- Quantity (integer, required)
- Unit price (decimal, required)
- Total amount (decimal, required)
"""

# Generate the complete schema
schema = generator.generate_schema(user_requirements=requirements)

# Access the generated components
print(generator.extraction_model)      # The Pydantic model class
print(generator.item_requirements)     # Field specifications
print(generator.structure_analysis)    # Structure detection result

The generate_schema() method performs three operations internally:

  • Structure Detection: Calls detect_structure_type() to determine if the data is flat or nested
  • Requirement Parsing: Calls parse_user_requirements() or parse_nested_requirements() to extract field specifications
  • Model Creation: Calls create_extraction_model() to build the actual Pydantic model

The returned schema is a dynamically generated Pydantic model class that can be used directly for validation and extraction.
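
Because the result is an ordinary Pydantic class, you can inspect it with standard Pydantic tooling before running any extraction. For example, dumping its JSON schema is a quick sanity check of the field names and constraints the generator produced:

import json

# model_json_schema() is standard Pydantic v2; it works on any generated class
print(json.dumps(schema.model_json_schema(), indent=2))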

Dynamic Model Creation

The create_extraction_model() function in schema.py is the core of schema generation. It takes ExtractionRequirements and produces a Pydantic model with proper type constraints, validators, and configuration.

For each FieldSpec, the function determines the appropriate Pydantic field type:

def create_extraction_model(requirements: ExtractionRequirements) -> type[BaseModel]:
    """
    Dynamically create a Pydantic model from field specifications.
    """
    fields = {}

    for spec in requirements.fields:
        # Map field types
        if spec.field_type == "str":
            if spec.enum:
                # Create Literal type for enums
                field_type = Literal[tuple(spec.enum)]
            elif spec.pattern:
                # Add regex pattern constraint
                field_type = Annotated[str, constr(pattern=spec.pattern)]
            else:
                field_type = str
        elif spec.field_type == "int":
            field_type = int
        elif spec.field_type == "decimal":
            field_type = Decimal
        # ... more type mappings

        # Create Field with description
        default = ... if spec.required else None
        fields[spec.field_name] = (
            field_type,
            Field(default=default, description=spec.description)
        )

    # Create model with strict validation
    model = create_model(
        sanitize_model_name(requirements.use_case_name),
        __config__=ConfigDict(extra='forbid'),
        **fields
    )

    return model

Key features of the generated models:

  • Type Safety: Each field has the correct Python type (str, int, Decimal, etc.)
  • Enum Validation: Enum fields use Literal types for strict value checking
  • Pattern Validation: String fields can have regex pattern constraints
  • Required vs Optional: Uses Pydantic’s … (Ellipsis) for required fields, None for optional
  • Extra Fields Forbidden: ConfigDict(extra='forbid') rejects any unexpected fields in the LLM output

Handling Nested Structures

When the structure analysis indicates nested_list, the schema generator creates two models:

  • Item Model: Represents individual items (e.g., InvoiceLineItem_Extraction)
  • Container Model: Wraps items in a list (e.g., InvoiceLineItems_Collection)

For example, given the invoice line items requirements above, the generator creates:

class InvoiceLineItem_Extraction(BaseModel):
    model_config = ConfigDict(extra='forbid')
    item_number: str = Field(description="Item number")
    description: str = Field(description="Description")
    quantity: int = Field(description="Quantity")
    unit_price: Decimal = Field(description="Unit price")
    total_amount: Decimal = Field(description="Total amount")

class InvoiceLineItems_Collection(BaseModel):
    model_config = ConfigDict(extra='forbid')

    items: List[InvoiceLineItem_Extraction] = Field(
        description="List of invoice line items"
    )

The container model is what gets passed to the extractor. When the LLM processes a document, it returns data that matches this nested structure, ensuring all items are captured in a single extraction call.
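
To see what that strictness buys, here is a short sketch of how the container model above behaves at validation time (illustrative data, not extractor output):

from decimal import Decimal
from pydantic import ValidationError

payload = {
    "items": [
        {"item_number": "1", "description": "Widget A", "quantity": 5,
         "unit_price": "19.99", "total_amount": "99.95"},
    ]
}
order = InvoiceLineItems_Collection.model_validate(payload)
assert order.items[0].unit_price == Decimal("19.99")   # strings are coerced to Decimal

# extra='forbid' rejects unexpected keys instead of silently keeping them
try:
    InvoiceLineItems_Collection.model_validate({"items": [], "currency": "EUR"})
except ValidationError as err:
    print(err.errors()[0]["type"])  # 'extra_forbidden'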

Generated schemas can be saved and reused across sessions. The SchemaGenerator stores:

  • generator.extraction_model: The Pydantic model class
  • generator.item_requirements: The ExtractionRequirements object
  • generator.structure_analysis: The StructureAnalysis object

These can be saved (see examples/extraction_example_4.py) and reloaded later, avoiding the need to regenerate schemas for repeated extraction tasks.
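
One lightweight way to do this for a flat schema is to round-trip the requirements object as JSON and rebuild the model from it on the next run (nested schemas additionally need the structure analysis). This is a hedged sketch rather than the exact approach in extraction_example_4.py, which also serializes the generated model to a Python module:

from pathlib import Path

# Save once, after schema generation
Path("grant_requirements.json").write_text(
    generator.item_requirements.model_dump_json(indent=2)
)

# Later session: reload the specs and rebuild the model without calling the LLM
reqs = ExtractionRequirements.model_validate_json(
    Path("grant_requirements.json").read_text()
)
schema = create_extraction_model(reqs)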

Data Extraction Engine

With a validated Pydantic schema in hand, the system is now ready to extract structured data from documents. The DataExtractor class in extractor.py handles this process, using OpenAI’s structured outputs API to ensure the LLM returns data that exactly matches the generated schema.

The DataExtractor Class

The DataExtractor takes parsed documents (as text or markdown) and the generated schema, then uses an LLM to extract structured information. Here is how to use it:

from extractors import get_openai_config, DataExtractor

config = get_openai_config(use_azure=True)
extractor = DataExtractor(config=config)

# Assume 'schema' and 'requirements' from SchemaGenerator
documents = [
    """
    Invoice #INV-2024-001
    Date: 2024-01-15

    Items:
    1. Widget A - Qty: 5 - Price: $19.99 - Total: $99.95
    2. Gadget B - Qty: 3 - Price: $29.99 - Total: $89.97

    Total: $189.92
    """
]

results = extractor.extract(
    extraction_model=schema,                    # Generated Pydantic model
    requirements=generator.item_requirements,   # Field specifications
    user_requirements=requirements,             # Original text requirements
    documents=documents,
    save_json=True,
    json_path="invoice_extraction.json"
)

print(results)
# [{'items': [{'item_number': '1', 'description': 'Widget A', ...}, ...]}]

The extract() method processes each document and returns a list of dictionaries containing the extracted data.

Structured Outputs

The extraction engine uses OpenAI’s structured outputs feature via the _parse_with() helper function in schema.py. This ensures the LLM’s response conforms to the Pydantic schema:

def _parse_with(client, model, messages, response_format):
    """
    Call OpenAI with structured output format using Pydantic schema.
    """
    response = client.beta.chat.completions.parse(
        model=model,
        messages=messages,
        response_format=response_format,  # Pydantic model class
        temperature=0,     # Deterministic outputs
        top_p=1.0,
        seed=12345,        # Fixed seed for reproducibility
        timeout=30
    )
    return response

The structured outputs API guarantees that the LLM’s response will match the schema or fail with a clear error — no more parsing JSON that might be malformed or have unexpected fields.
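
On the caller's side, the parse() helper returns an already-validated instance of the Pydantic class passed as response_format. A minimal consumption sketch (the toolkit's extractor may differ in detail) looks like this:

response = _parse_with(client, model, messages, response_format=schema)
parsed = response.choices[0].message.parsed   # instance of `schema`, or None on refusal
if parsed is not None:
    record = parsed.model_dump()              # plain dict, ready for normalization/saving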

Extraction Prompt Design

The DataExtractor uses a carefully designed system prompt (SYSTEM_PARSER in schema.py) to guide the LLM:

SYSTEM_PARSER = """
You are a data extraction assistant. Your task is to extract structured information
from documents according to the provided schema.

Instructions:
1. Extract data exactly as specified in the requirements
2. Preserve original values when possible
3. For dates, extract in the format present in the document
4. For lists, extract all items mentioned
5. If a field is not present, leave it as null (if optional)
6. Do not invent or hallucinate data
7. Be precise with numbers and decimals

The schema will enforce the output format. Extract accurately.
"""

This prompt emphasizes accuracy and prevents hallucination, which is critical for production extraction tasks where reliability matters.

The extraction engine also includes helper functions in schema.py for post-extraction normalization (_normalize_record(), _to_iso_date()), error handling with exponential backoff retries (_with_retries()), and batch processing support for extracting data from multiple documents of the same type in a single call.
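
To give a feel for the retry behavior, here is a minimal generic exponential-backoff wrapper. It is a sketch only; the toolkit's _with_retries() may differ in signature and in the errors it catches:

import random
import time

def with_retries_sketch(fn, max_attempts=3, base_delay=1.0):
    # Retry fn() with delays of roughly 1s, 2s, 4s, ... plus a little jitter
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.5))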

Examples

The code includes four examples in the examples/ directory that demonstrate different usage patterns and capabilities.

extraction_example_1.py demonstrates the basic workflow: defining requirements in natural language, generating a schema, and extracting data from text documents.

extraction_example_2.py shows how to handle nested structures where multiple items exist within a single document. It demonstrates a hierarchical extraction workflow in which line items from a purchase order are extracted and then matched against several bills of materials to extract and combine fields from both. It also uses document classification to route different document types to the appropriate schemas.

extraction_example_3.py shows how to manually define a Pydantic model and create the corresponding ExtractionRequirements object, then use them with the DataExtractor.

extraction_example_4.py shows how to serialize a generated schema to a Python module file and save the requirements as JSON, then reload them later for extraction.

To run any example:

# Make sure you have configured .env with your API keys
cd examples
python extraction_example_1.py

Each example includes inline comments explaining the code and prints results to the console. You can modify the requirements and input documents in each file to experiment with different extraction scenarios.

Potential Extensions

There are several directions for extending the capabilities of this framework to handle more complex scenarios and production requirements.

  • Confidence scoring & uncertainty tagging for each field so low-confidence extractions can be routed to humans.
  • Streaming extraction for very large documents by chunking, preserving context, and merging partial results.
  • Active-learning loop that captures user corrections to refine prompts, schemas, or custom models over time.
  • Pluggable validation hooks for domain-specific business rules beyond the base Pydantic checks.
  • LLM-as-judge workflow where each extraction is verified, feedback is logged, and necessary revisions are triggered.

The complete code is available on GitHub: https://github.com/umairalipathan1980/Knowledge-Extraction-Using-Dynamic-Schema-Generation

Read the full article here: https://medium.com/data-science-collective/building-a-generic-knowledge-extraction-ai-framework-for-organization-specific-use-cases-cbb52ce93e48