Tender Document Processing Pipeline

Overview

The tender document processing pipeline transforms raw tender documents into structured, searchable data that supports bid analysis, question extraction, and automated response planning. It combines multiple technologies and processing stages to extract maximum value from tender documentation.

What We Process

Document Types

  • Primary: Official tender documentation from buyers (ITTs, RFPs, RFQs)
  • Secondary: Supporting documents (specifications, contracts, schedules)
  • Contextual: Planning notes, client research, market analysis
  • Reference: Previous proposals, templates, company resources

Supported Formats

  • Documents: PDF, DOCX, XLSX, PPTX
  • Images: PNG, JPG, JPEG (with OCR capabilities)
  • Text: Plain text files for preprocessed content

Why We Process Documents This Way

Business Objectives

  1. Automated Intelligence: Extract structured information from unstructured documents
  2. Time Efficiency: Reduce manual document review time from hours to minutes
  3. Consistency: Ensure no critical requirements or deadlines are missed
  4. Searchability: Enable semantic search across all tender content
  5. Answer Planning: Identify questions that require responses and map relevant context
  6. Risk Mitigation: Highlight compliance requirements and potential issues

Technical Benefits

  1. Scalability: Process multiple documents simultaneously
  2. Accuracy: Combine AI/ML with human oversight for reliable extraction
  3. Retrievability: Enable fast, contextual search across all content
  4. Integration: Seamlessly connect with other platform modules
  5. Persistence: Store processed data for future analysis and reuse

Processing Architecture

graph TB
    Input[Document Upload] --> Validation[Document Validation]
    Validation --> Parse[Azure Document Intelligence]
    Parse --> Chunk[Text Chunking Pipeline]

    subgraph "Python Backend Processing"
        Chunk --> TagEngine[ML Tagging Engine]
        TagEngine --> QuestionEngine[Question Extraction]
        QuestionEngine --> RefEngine[Reference Processing]
    end

    subgraph "NestJS Backend Processing"
        RefEngine --> Analysis[LLM Analysis]
        Analysis --> Embed[Embedding Generation]
        Embed --> Vector[Vector Storage]
    end

    Vector --> Cache[Redis Cache]
    Cache --> Persist[MongoDB Persistence]

    subgraph "Storage Layer"
        Cache
        Persist
        Vector
    end

Stage 1: Document Ingestion and Validation

Input Processing

Documents are received as base64-encoded content with metadata including:

  • Document name and type
  • Project association
  • User context
  • Company identification

Validation Steps

  1. Format Verification: Ensure supported file types
  2. Size Limits: Check document size constraints
  3. Base64 Validation: Verify encoding integrity
  4. Content Detection: Identify document structure
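
These checks can run as a small guard before parsing. A minimal sketch is shown below; the helper name, supported-type list, and 50 MB size limit are illustrative assumptions rather than values from the codebase:

// Hypothetical validation guard; names and limits are illustrative assumptions.
const SUPPORTED_TYPES = ["pdf", "docx", "xlsx", "pptx", "png", "jpg", "jpeg", "txt"];
const MAX_SIZE_BYTES = 50 * 1024 * 1024; // assumed 50 MB limit

function validateIncomingDocument(doc: { name: string; type: string; content: string }): void {
  // Format verification
  if (!SUPPORTED_TYPES.includes(doc.type.toLowerCase())) {
    throw new Error(`Unsupported document type: ${doc.type}`);
  }

  // Base64 validation: decode and re-encode to confirm the payload is intact
  const decoded = Buffer.from(doc.content, "base64");
  const stripPadding = (value: string) => value.replace(/=+$/, "");
  if (decoded.length === 0 || stripPadding(decoded.toString("base64")) !== stripPadding(doc.content)) {
    throw new Error(`Document ${doc.name} is not valid base64`);
  }

  // Size limit check on the decoded payload
  if (decoded.length > MAX_SIZE_BYTES) {
    throw new Error(`Document ${doc.name} exceeds the size limit`);
  }
}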

Azure Document Intelligence Integration

We leverage Azure's Document Intelligence service for robust text extraction:

const processedDocs = await this.documentParseService.processDocuments([
  {
    name: document.name,
    type: document.type,
    content: base64Content,
    format: "per_page", // Enable page-level extraction
  },
]);

Benefits of Azure Document Intelligence:

  • Multi-format Support: Handles various document types seamlessly
  • OCR Capabilities: Extracts text from scanned documents and images
  • Layout Preservation: Maintains document structure and formatting context
  • Table Recognition: Identifies and extracts tabular data
  • Form Processing: Recognizes structured form elements

Stage 2: Text Chunking Pipeline

Chunking Strategy

The Python backend implements an intelligent chunking system that adapts to document structure:

class TextChunker:
    def __init__(self,
                 max_tokens: int = 800,
                 min_tokens: int = 100,
                 overlap_tokens: int = 400):  # 50% overlap for context preservation
        self.max_tokens = max_tokens
        self.min_tokens = min_tokens
        self.overlap_tokens = overlap_tokens

Multi-Level Chunking Approach

1. Section-Based Chunking (Primary Strategy)

  • Identifies document sections using pattern recognition
  • Preserves logical document structure
  • Maintains context within sections

Pattern Recognition:

header_patterns = [
    r'^\d+\.\s+[A-Z][^\.]+$',  # Numbered sections (1. Section Title)
    r'^[A-Z][A-Z\s]+$',        # ALL CAPS headings
    r'^[A-Z][^\.]+:$',         # Title ending with colon
    r'^[^\n]+\n[=\-]{3,}$'     # Underlined titles
]

2. Paragraph-Based Chunking (Fallback)

  • Splits by paragraph boundaries when sections aren't clear
  • Combines small paragraphs to meet token requirements
  • Maintains narrative flow

3. Token-Based Chunking (Final Fallback)

  • Simple token-count splitting for edge cases
  • Ensures processing of all content regardless of structure

Overlap Strategy

Purpose: Maintain context between chunks to prevent information loss at boundaries.

Implementation:

  • 50% overlap between chunks (400 tokens out of 800)
  • Smart overlap that prioritizes sentence boundaries
  • Prevents question or requirement fragmentation
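
The chunker itself lives in the Python backend, but the overlap idea can be illustrated with a short, language-agnostic TypeScript sketch that carries roughly `overlap_tokens` worth of trailing sentences into the next chunk (token counts are approximated by word counts here):

// Illustrative sketch of sentence-aware overlap; not the production Python implementation.
function chunkWithOverlap(sentences: string[], maxTokens = 800, overlapTokens = 400): string[] {
  const tokenCount = (s: string) => s.split(/\s+/).filter(Boolean).length;
  const chunks: string[] = [];
  let current: string[] = [];
  let currentTokens = 0;

  for (const sentence of sentences) {
    if (currentTokens + tokenCount(sentence) > maxTokens && current.length > 0) {
      chunks.push(current.join(" "));
      // Carry roughly `overlapTokens` worth of trailing sentences into the next chunk
      const carried: string[] = [];
      let carriedTokens = 0;
      for (let i = current.length - 1; i >= 0 && carriedTokens < overlapTokens; i--) {
        carried.unshift(current[i]);
        carriedTokens += tokenCount(current[i]);
      }
      current = carried;
      currentTokens = carriedTokens;
    }
    current.push(sentence);
    currentTokens += tokenCount(sentence);
  }
  if (current.length > 0) chunks.push(current.join(" "));
  return chunks;
}

Because overlap is resolved at sentence boundaries, a question or requirement that straddles a chunk boundary appears intact in at least one chunk.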

Chunk Metadata Generation

Each chunk receives comprehensive metadata:

chunk_metadata = ChunkMetadata(
    chunk_id=str(uuid.uuid4()),
    chunk_index=chunk_index,
    chunk_size=token_count,
    text=chunk_text,
    tags=ml_generated_tags,
    content_type=detected_type,  # question, requirement, specification
    page_number=page_mapping,
    section_number=extracted_section,
    section_title=extracted_title,
    quality_score=calculated_quality,
    contains_references=reference_detection,
    referenced_documents=extracted_refs,
    overlap_tokens=overlap_count,
    metadata={
        "processing_timestamp": time.time(),
        "token_count": token_count,
        "character_count": len(chunk_text),
        "word_count": len(chunk_text.split()),
        "chunking_method": strategy_used
    }
)

Stage 3: ML-Powered Content Analysis

Tagging Engine

The tagging engine uses machine learning models to categorise content:

Tag Categories:

  • Respondent Questions: Content requiring bidder responses
  • Requirements: Mandatory specifications and criteria
  • Deadlines: Time-sensitive information
  • Compliance: Regulatory and legal requirements
  • Evaluation Criteria: Scoring and assessment methods
  • Technical Specifications: Detailed technical requirements
  • Commercial Terms: Pricing and contract conditions
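
These categories surface downstream as plain string tags on each chunk (see `VectorChunk.tags` in Stage 4). One possible way to pin the vocabulary down is an assumed TypeScript union type; the exact tag strings emitted by the ML engine may differ:

// Assumed illustration of the tag vocabulary; actual tag strings may differ.
type TagCategory =
  | "respondent_question"
  | "requirement"
  | "deadline"
  | "compliance"
  | "evaluation_criteria"
  | "technical_specification"
  | "commercial_terms";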

Question Extraction Engine

The question extraction engine applies natural language processing to identify questions that require responses; the snippet below shows a simplified, pattern-based version of the extraction entry point:

import re
from typing import Any, Dict

def extract_questions(text: str) -> Dict[str, Any]:
    """
    Extract respondent questions from text using NLP patterns
    Returns structured question data with metadata
    """
    matches = re.findall(r"^(?:What|How|When|Describe|Provide|Explain|Confirm)\b[^\n]*", text, re.I | re.M)  # simplified pattern set
    return {"questions": [m.strip() for m in matches], "count": len(matches)}

Question Detection Patterns:

  • Direct interrogatives ("What is...", "How will...", "When does...")
  • Instruction-based questions ("Describe...", "Provide...", "Explain...")
  • Compliance questions ("Confirm that...", "Demonstrate...")
  • Specification requests ("Detail your approach to...")

Reference Processing Engine

The reference processing engine identifies cross-references between documents and sections:

Reference Types:

  • Document references (e.g., "See Appendix A", "As per Schedule 2")
  • Section references (e.g., "Section 3.2", "Clause 4.1.1")
  • External references (e.g., regulatory standards, industry guidelines)
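
As an illustration of the pattern matching involved, the sketch below detects the three reference types with simple regular expressions; the patterns are assumptions, not the production rule set:

// Illustrative reference-detection patterns; the real engine uses a richer rule set.
const referencePatterns: Record<string, RegExp> = {
  document: /\b(?:Appendix|Schedule|Annex)\s+[A-Z0-9]+\b/gi,
  section: /\b(?:Section|Clause)\s+\d+(?:\.\d+)*\b/gi,
  external: /\b(?:ISO|IEC|BS EN)\s*\d{3,5}\b/gi,
};

function findReferences(text: string): Record<string, string[]> {
  const found: Record<string, string[]> = {};
  for (const [type, pattern] of Object.entries(referencePatterns)) {
    found[type] = [...new Set(text.match(pattern) ?? [])];
  }
  return found;
}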

Stage 4: Embedding Generation and Vector Storage

Embedding Strategy

Each processed chunk is converted into a high-dimensional vector representation using Azure OpenAI's embedding models:

// Generate embedding for semantic search
const embedding = await this.embeddings.embedQuery(cleanedText);
const normalisedEmbedding = this.normaliseEmbedding(embedding, 1536);

Embedding Benefits:

  • Semantic Search: Find conceptually similar content
  • Question Matching: Identify relevant chunks for specific questions
  • Context Retrieval: Surface related information across documents
  • Answer Planning: Match questions to potential answer sources

Vector Database Structure

Embeddings are stored with comprehensive metadata for efficient retrieval:

interface VectorChunk {
  text: string;
  embedding: number[]; // 1536-dimensional vector
  documentId: string;
  projectId: string;
  chunkIndex: number;
  tags: string[];
  contentType: string;
  hasRespondentQuestions: boolean;
  questionIds: string[];
  sectionNumber: string;
  pageNumber: number;
  qualityScore: number;
  metadata: ChunkMetadata;
}

Storage in PostgreSQL with pgvector

We use PostgreSQL with the pgvector extension for efficient vector operations:

CREATE TABLE tender_chunks (
  id SERIAL PRIMARY KEY,
  text TEXT NOT NULL,
  embedding vector(1536),
  document_id TEXT NOT NULL,
  project_id TEXT,
  chunk_index INTEGER NOT NULL,
  tags JSONB,
  content_type TEXT,
  has_respondent_questions BOOLEAN DEFAULT FALSE,
  quality_score INTEGER,
  created_at TIMESTAMP DEFAULT NOW()
);

CREATE INDEX ON tender_chunks USING hnsw (embedding vector_cosine_ops);
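
Retrieval is then a cosine-distance query against the HNSW index. The sketch below assumes the `pg` driver and the table definition above; `searchChunks` is an illustrative helper rather than an existing service method:

import { Pool } from "pg";

const pool = new Pool(); // connection settings come from the standard PG* environment variables

// Return the chunks most semantically similar to a query embedding within one project.
async function searchChunks(projectId: string, queryEmbedding: number[], limit = 10) {
  const result = await pool.query(
    `SELECT id, text, content_type, 1 - (embedding <=> $1::vector) AS similarity
       FROM tender_chunks
      WHERE project_id = $2
      ORDER BY embedding <=> $1::vector
      LIMIT $3`,
    [JSON.stringify(queryEmbedding), projectId, limit]
  );
  return result.rows;
}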

Stage 5: Intelligent Caching and Persistence

Redis Caching Strategy

Purpose: Provide fast access to frequently accessed documents during active editing sessions.

Implementation:

// Cache key structure
const cacheKey = `tenderdoc:${projectId}:${documentId}:data`;

// TTL of 15 minutes ensures fresh data
await redis.setex(cacheKey, 900, JSON.stringify(documentData));
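
Reads can follow a cache-aside pattern consistent with this setup: check Redis first, fall back to the database, then repopulate the cache. A minimal sketch, assuming an ioredis client and a caller-supplied loader function:

import Redis from "ioredis";

const redis = new Redis(); // defaults to localhost:6379

// Cache-aside read: try Redis first, fall back to the database and repopulate the cache.
async function getTenderDocument(
  projectId: string,
  documentId: string,
  loadFromDb: (projectId: string, documentId: string) => Promise<unknown> // illustrative loader
) {
  const cacheKey = `tenderdoc:${projectId}:${documentId}:data`;

  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached);

  const documentData = await loadFromDb(projectId, documentId);
  await redis.setex(cacheKey, 900, JSON.stringify(documentData)); // same 15-minute TTL
  return documentData;
}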

Cache Benefits:

  • Sub-millisecond access times for active documents
  • Reduced database load during intensive editing sessions
  • Automatic expiration prevents stale data issues

MongoDB Persistence

Purpose: Durable, long-term storage with rich querying capabilities.

Document Structure:

{
  _id: ObjectId,
  documentId: String,
  projectId: String,
  name: String,
  type: String,
  content: String,
  tags: [String],
  respondentQuestions: {
    questionId: {
      question: String,
      section: String,
      priority: String,
      answerPlanningStatus: Boolean
    }
  },
  chunks: [ChunkReference],
  processingMetadata: {
    version: String,
    processedAt: Date,
    totalChunks: Number,
    qualityScore: Number
  },
  createdAt: Date,
  updatedAt: Date
}
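
Writing the processed document might look like the following sketch, using the official `mongodb` driver; the database and collection names are assumptions, and an upsert keyed on document and project IDs keeps re-processing idempotent:

import { MongoClient } from "mongodb";

// Upsert so that re-processing a document replaces the previous version instead of duplicating it.
async function persistProcessedDocument(
  client: MongoClient,
  doc: { documentId: string; projectId: string; [key: string]: unknown }
) {
  const { createdAt, updatedAt, ...fields } = doc; // timestamps are managed here, not by the caller
  const collection = client.db("bidscript").collection("tender_documents"); // assumed names
  await collection.updateOne(
    { documentId: doc.documentId, projectId: doc.projectId },
    { $set: { ...fields, updatedAt: new Date() }, $setOnInsert: { createdAt: new Date() } },
    { upsert: true }
  );
}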

Stage 6: Analysis and Intelligence Layer

LLM-Powered Analysis

The NestJS backend orchestrates comprehensive document analysis using LangChain and Azure OpenAI:

async parseTenderDocument(documentContent: string) {
  const llm = await this.langchainService.getLLM();

  const modelWithFunctions = llm.bind({
    functions: [{
      name: 'parseTenderDocument',
      description: 'Parse tender document information into structured format',
      parameters: zodToJsonSchema(TenderPackSchema),
    }],
    function_call: { name: 'parseTenderDocument' },
  });

  return await modelWithFunctions.invoke(analysisPrompt);
}

Structured Extraction

The LLM extracts key information including:

Key Information Categories:

  • Deadlines: Submission dates, milestone dates, contract start dates
  • Requirements: Technical, commercial, and compliance requirements
  • Evaluation Criteria: Scoring methodology and weightings
  • Key Dates: Important timeline events
  • Risks: Potential compliance or delivery risks
  • Opportunities: Value-add possibilities and differentiators

Response Formatting

Analysis results are structured for consumption by other platform modules:

interface TenderAnalysis {
  projectId: string;
  keyInformation: {
    deadlines: Deadline[];
    requirements: Requirement[];
    keyDates: KeyDate[];
    evaluationCriteria: EvaluationCriteria[];
  };
  summary: {
    keyPoints: string[];
    risks: string[];
    opportunities: string[];
  };
  processingMetadata: {
    totalDocuments: number;
    successfullyProcessed: number;
    failed: number;
    processingTime: number;
  };
}

Integration Points

Frontend Integration

The processed data is consumed by various frontend components:

  1. Document Browser: Displays categorised documents with intelligent filtering
  2. Question Manager: Shows extracted questions with answer planning status
  3. Search Interface: Enables semantic search across all processed content
  4. Analysis Dashboard: Presents key insights and extracted information
  5. Answer Editor: Provides contextual information while writing responses

API Endpoints

Key endpoints for accessing processed data:

// Process new tender documents
POST /api/tender-parsing/tag-documents
POST /api/tender-parsing/process-bid-analysis

// Retrieve processed data
GET /api/tender-parsing/project-documentation/:projectId
GET /api/tender-parsing/search-documents

// Update processing status
PUT /api/tender-parsing/update-tags
PUT /api/tender-parsing/update-answer-planning
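
As an example of consuming these endpoints, a frontend module could load a project's processed documentation as follows (authentication and error handling are simplified):

// Illustrative client call; auth headers and richer error handling are omitted for brevity.
async function fetchProjectDocumentation(projectId: string) {
  const response = await fetch(`/api/tender-parsing/project-documentation/${projectId}`);
  if (!response.ok) {
    throw new Error(`Failed to load project documentation: ${response.status}`);
  }
  return response.json();
}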

Background Processing

The pipeline utilises BullMQ for non-blocking background operations:

// Queue linked references update for background processing
await this.linkedReferencesQueue.add(
  "updateLinkedReferences",
  { projectId, documents },
  {
    delay: 5000, // Process after initial response
    attempts: 3,
    backoff: "exponential",
  }
);
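
On the consuming side, a BullMQ worker picks the job off the queue. A minimal sketch follows; the queue name, connection details, and the `updateLinkedReferences` service call are assumptions based on the producer code above:

import { Worker } from "bullmq";

// Illustrative service function; the real implementation lives in the references service.
declare function updateLinkedReferences(projectId: string, documents: unknown[]): Promise<void>;

// Background worker that consumes queued linked-reference updates.
const linkedReferencesWorker = new Worker(
  "linkedReferences", // assumed queue name for the queue injected as linkedReferencesQueue
  async (job) => {
    const { projectId, documents } = job.data;
    await updateLinkedReferences(projectId, documents);
  },
  { connection: { host: "localhost", port: 6379 } }
);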

Performance Optimisations

Chunking Optimisations

  1. Adaptive Chunk Sizes: Adjust based on content density
  2. Smart Overlap: Preserve context without excessive duplication
  3. Parallel Processing: Use ThreadPoolExecutor for concurrent chunk processing
  4. Quality Scoring: Prioritise high-quality chunks for analysis

Embedding Optimisations

  1. Batch Processing: Generate embeddings in batches of 5 to avoid timeouts
  2. Retry Logic: Implement exponential backoff for transient failures
  3. Dimension Normalisation: Ensure consistent vector dimensions
  4. Caching: Store embeddings to avoid regeneration
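
A sketch of the batching and retry behaviour described above, using a batch size of 5 and exponential backoff; the embeddings client is assumed to expose a LangChain-style `embedDocuments` method:

// Generate embeddings in batches of 5 and retry transient failures with exponential backoff.
async function embedInBatches(
  embeddings: { embedDocuments(texts: string[]): Promise<number[][]> },
  texts: string[],
  batchSize = 5,
  maxAttempts = 3
): Promise<number[][]> {
  const results: number[][] = [];
  for (let start = 0; start < texts.length; start += batchSize) {
    const batch = texts.slice(start, start + batchSize);
    for (let attempt = 1; ; attempt++) {
      try {
        results.push(...(await embeddings.embedDocuments(batch)));
        break;
      } catch (error) {
        if (attempt >= maxAttempts) throw error;
        await new Promise((resolve) => setTimeout(resolve, 1000 * 2 ** (attempt - 1))); // 1s, 2s, 4s, ...
      }
    }
  }
  return results;
}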

Storage Optimisations

  1. Indexed Queries: Optimise database queries with proper indexing
  2. Batch Insertions: Insert multiple chunks in single transactions
  3. Connection Pooling: Efficiently manage database connections
  4. TTL Management: Automatic cleanup of expired cache entries

Quality Assurance

Data Validation

  1. Input Validation: Comprehensive validation of incoming documents
  2. Processing Validation: Verify each processing stage completion
  3. Output Validation: Ensure generated data meets quality standards
  4. Error Recovery: Graceful handling of processing failures

Monitoring and Logging

  1. Processing Metrics: Track success rates, processing times, and error rates
  2. Quality Metrics: Monitor chunk quality scores and extraction accuracy
  3. Usage Metrics: Track API usage patterns and performance
  4. Alert Systems: Notify of processing failures or quality degradation

Future Enhancements

Planned Improvements

  1. Enhanced ML Models: Improve tagging and question extraction accuracy
  2. Multi-language Support: Process documents in multiple languages
  3. Visual Element Extraction: Better handling of charts, diagrams, and images
  4. Collaborative Annotations: Enable manual corrections and improvements
  5. Version Control: Track document processing versions and changes
  6. Advanced Analytics: Provide deeper insights into tender patterns and trends

Scalability Roadmap

  1. Horizontal Scaling: Support processing across multiple servers
  2. Streaming Processing: Handle very large documents efficiently
  3. Real-time Updates: Enable live document processing and updates
  4. Advanced Caching: Implement multi-tier caching strategies
  5. API Rate Limiting: Protect against abuse and ensure fair usage

Conclusion

The tender document processing pipeline represents a sophisticated integration of multiple technologies and methodologies designed to extract maximum value from tender documentation. By combining Azure's document intelligence capabilities with custom ML models, intelligent chunking strategies, and comprehensive storage solutions, we create a system that not only processes documents efficiently but also provides the intelligence layer necessary for effective bid management.

This pipeline serves as the foundation for all downstream activities including question answering, response planning, compliance checking, and competitive analysis, making it a critical component of the overall BidScript platform.