Tender Document Processing Pipeline¶
Overview¶
The tender document processing pipeline is a comprehensive system designed to transform raw tender documents into structured, searchable, and intelligent data that can be leveraged for bid analysis, question extraction, and automated response planning. This pipeline combines multiple technologies and processing stages to extract maximum value from tender documentation.
What We Process¶
Document Types¶
- Primary: Official tender documentation from buyers (ITTs, RFPs, RFQs)
- Secondary: Supporting documents (specifications, contracts, schedules)
- Contextual: Planning notes, client research, market analysis
- Reference: Previous proposals, templates, company resources
Supported Formats¶
- Documents: PDF, DOCX, XLSX, PPTX
- Images: PNG, JPG, JPEG (with OCR capabilities)
- Text: Plain text files for preprocessed content
Why We Process Documents This Way¶
Business Objectives¶
- Automated Intelligence: Extract structured information from unstructured documents
- Time Efficiency: Reduce manual document review time from hours to minutes
- Consistency: Ensure no critical requirements or deadlines are missed
- Searchability: Enable semantic search across all tender content
- Answer Planning: Identify questions that require responses and map relevant context
- Risk Mitigation: Highlight compliance requirements and potential issues
Technical Benefits¶
- Scalability: Process multiple documents simultaneously
- Accuracy: Combine AI/ML with human oversight for reliable extraction
- Retrievability: Enable fast, contextual search across all content
- Integration: Seamlessly connect with other platform modules
- Persistence: Store processed data for future analysis and reuse
Processing Architecture¶
graph TB
    Input[Document Upload] --> Validation[Document Validation]
    Validation --> Parse[Azure Document Intelligence]
    Parse --> Chunk[Text Chunking Pipeline]

    subgraph "Python Backend Processing"
        Chunk --> TagEngine[ML Tagging Engine]
        TagEngine --> QuestionEngine[Question Extraction]
        QuestionEngine --> RefEngine[Reference Processing]
    end

    subgraph "NestJS Backend Processing"
        RefEngine --> Analysis[LLM Analysis]
        Analysis --> Embed[Embedding Generation]
        Embed --> Vector[Vector Storage]
    end

    Vector --> Cache[Redis Cache]
    Cache --> Persist[MongoDB Persistence]

    subgraph "Storage Layer"
        Cache
        Persist
        Vector
    end
Stage 1: Document Ingestion and Validation¶
Input Processing¶
Documents are received as base64-encoded content with metadata including:
- Document name and type
- Project association
- User context
- Company identification
Validation Steps¶
- Format Verification: Ensure supported file types
- Size Limits: Check document size constraints
- Base64 Validation: Verify encoding integrity
- Content Detection: Identify document structure (a sketch of these checks appears below)
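A minimal sketch of these checks in TypeScript. The constant names, the 50 MB cap, and the error messages are illustrative assumptions, not the platform's actual implementation:

const SUPPORTED_EXTENSIONS = new Set([".pdf", ".docx", ".xlsx", ".pptx", ".png", ".jpg", ".jpeg", ".txt"]);
const MAX_SIZE_BYTES = 50 * 1024 * 1024; // assumed 50 MB cap

interface IncomingDocument {
  name: string;
  type: string;
  content: string; // base64-encoded payload
}

function validateDocument(doc: IncomingDocument): void {
  // Format verification
  const extension = doc.name.slice(doc.name.lastIndexOf(".")).toLowerCase();
  if (!SUPPORTED_EXTENSIONS.has(extension)) {
    throw new Error(`Unsupported format: ${extension}`);
  }
  // Base64 validation: decode, then re-encode and compare (ignoring padding)
  const decoded = Buffer.from(doc.content, "base64");
  if (decoded.toString("base64").replace(/=+$/, "") !== doc.content.replace(/=+$/, "")) {
    throw new Error("Invalid base64 content");
  }
  // Size limit
  if (decoded.length > MAX_SIZE_BYTES) {
    throw new Error("Document exceeds size limit");
  }
}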
Azure Document Intelligence Integration¶
We leverage Azure's Document Intelligence service for robust text extraction:
const processedDocs = await this.documentParseService.processDocuments([
  {
    name: document.name,
    type: document.type,
    content: base64Content,
    format: "per_page", // Enable page-level extraction
  },
]);
Benefits of Azure Document Intelligence:
- Multi-format Support: Handles various document types seamlessly
- OCR Capabilities: Extracts text from scanned documents and images
- Layout Preservation: Maintains document structure and formatting context
- Table Recognition: Identifies and extracts tabular data
- Form Processing: Recognizes structured form elements
Stage 2: Text Chunking Pipeline¶
Chunking Strategy¶
The Python backend implements an intelligent chunking system that adapts to document structure:
class TextChunker:
    def __init__(self,
                 max_tokens: int = 800,
                 min_tokens: int = 100,
                 overlap_tokens: int = 400):  # 50% overlap for context preservation
Multi-Level Chunking Approach¶
1. Section-Based Chunking (Primary Strategy)¶
- Identifies document sections using pattern recognition
- Preserves logical document structure
- Maintains context within sections
Pattern Recognition:
header_patterns = [
    r'^\d+\.\s+[A-Z][^\.]+$',   # Numbered sections (1. Section Title)
    r'^[A-Z][A-Z\s]+$',         # ALL CAPS headings
    r'^[A-Z][^\.]+:$',          # Title ending with colon
    r'^[^\n]+\n[=\-]{3,}$'      # Underlined titles
]
2. Paragraph-Based Chunking (Fallback)¶
- Splits by paragraph boundaries when sections aren't clear
- Combines small paragraphs to meet token requirements
- Maintains narrative flow
3. Token-Based Chunking (Final Fallback)¶
- Simple token-count splitting for edge cases
- Ensures processing of all content regardless of structure (the fallback cascade is sketched below)
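The selection cascade can be illustrated with a short sketch (TypeScript here for brevity; the production logic lives in the Python TextChunker, and this version ignores overlap handling and minimum-size merging):

function chunkDocument(text: string, maxTokens = 800): { text: string; method: string }[] {
  // 1. Section-based: split where a numbered heading such as "3. Evaluation Criteria" starts a line
  const sections = text.split(/\n(?=\d+\.\s+[A-Z])/);
  if (sections.length > 1) return sections.map((t) => ({ text: t, method: "section" }));

  // 2. Paragraph-based: fall back to blank-line boundaries
  const paragraphs = text.split(/\n\s*\n/).filter((p) => p.trim().length > 0);
  if (paragraphs.length > 1) return paragraphs.map((t) => ({ text: t, method: "paragraph" }));

  // 3. Token-based: last resort, split on a rough word-count budget
  const words = text.split(/\s+/);
  const chunks: { text: string; method: string }[] = [];
  for (let i = 0; i < words.length; i += maxTokens) {
    chunks.push({ text: words.slice(i, i + maxTokens).join(" "), method: "token" });
  }
  return chunks;
}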
Overlap Strategy¶
Purpose: Maintain context between chunks to prevent information loss at boundaries.
Implementation:
- 50% overlap between chunks (400 tokens out of 800)
- Smart overlap that prioritizes sentence boundaries
- Prevents question or requirement fragmentation (see the overlap sketch below)
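A rough sketch of the boundary-aware overlap, using a word count as a stand-in for tokens (the actual Python implementation counts tokens and handles edge cases differently):

function buildOverlap(previousChunk: string, overlapTokens = 400): string {
  const words = previousChunk.split(/\s+/);
  const tail = words.slice(Math.max(0, words.length - overlapTokens)).join(" ");
  // Trim back to the first full sentence boundary so the overlap does not start mid-sentence
  const boundary = tail.search(/(?<=[.!?])\s+[A-Z]/);
  return boundary >= 0 ? tail.slice(boundary).trimStart() : tail;
}

// Usage: the next chunk is prefixed with the overlap before new content is appended, e.g.
// const nextChunkText = buildOverlap(previousChunkText) + "\n" + nextSectionText;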
Chunk Metadata Generation¶
Each chunk receives comprehensive metadata:
chunk_metadata = ChunkMetadata(
    chunk_id=str(uuid.uuid4()),
    chunk_index=chunk_index,
    chunk_size=token_count,
    text=chunk_text,
    tags=ml_generated_tags,
    content_type=detected_type,  # question, requirement, specification
    page_number=page_mapping,
    section_number=extracted_section,
    section_title=extracted_title,
    quality_score=calculated_quality,
    contains_references=reference_detection,
    referenced_documents=extracted_refs,
    overlap_tokens=overlap_count,
    metadata={
        "processing_timestamp": time.time(),
        "token_count": token_count,
        "character_count": len(chunk_text),
        "word_count": len(chunk_text.split()),
        "chunking_method": strategy_used
    }
)
Stage 3: ML-Powered Content Analysis¶
Tagging Engine¶
The tagging engine uses machine learning models to categorise content.
Tag Categories (captured as a type in the sketch after this list):
- Respondent Questions: Content requiring bidder responses
- Requirements: Mandatory specifications and criteria
- Deadlines: Time-sensitive information
- Compliance: Regulatory and legal requirements
- Evaluation Criteria: Scoring and assessment methods
- Technical Specifications: Detailed technical requirements
- Commercial Terms: Pricing and contract conditions
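These categories can be captured as a simple type for downstream consumers. The snake_case values below are illustrative, not the tagging engine's actual label strings:

type TagCategory =
  | "respondent_question"
  | "requirement"
  | "deadline"
  | "compliance"
  | "evaluation_criteria"
  | "technical_specification"
  | "commercial_terms";

interface TaggedChunk {
  chunkId: string;
  tags: TagCategory[];
  confidence: number; // assumed: model confidence for the assigned tags, 0-1
}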
Question Extraction Engine¶
The question extraction engine applies natural language processing to identify questions that require responses:
def extract_questions(text: str) -> Dict[str, Any]:
    """
    Extract respondent questions from text using NLP patterns.
    Returns structured question data with metadata.
    """
Question Detection Patterns (illustrated in the sketch after this list):
- Direct interrogatives ("What is...", "How will...", "When does...")
- Instruction-based questions ("Describe...", "Provide...", "Explain...")
- Compliance questions ("Confirm that...", "Demonstrate...")
- Specification requests ("Detail your approach to...")
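The production engine is part of the Python backend's NLP pipeline; as a rough illustration only, the pattern families above map onto regexes along these lines:

const QUESTION_PATTERNS: RegExp[] = [
  /\b(?:What|How|When|Where|Why|Who)\b[^?\n]*\?/g,    // direct interrogatives
  /\b(?:Describe|Provide|Explain|Detail)\b[^.\n]*/g,  // instruction-based questions
  /\b(?:Confirm that|Demonstrate)\b[^.\n]*/g,         // compliance questions
];

function findCandidateQuestions(text: string): string[] {
  return QUESTION_PATTERNS.flatMap((pattern) => text.match(pattern) ?? []).map((q) => q.trim());
}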
Reference Processing Engine¶
The reference processing engine identifies cross-references between documents and sections.
Reference Types (detection is sketched after this list):
- Document references (e.g., "See Appendix A", "As per Schedule 2")
- Section references (e.g., "Section 3.2", "Clause 4.1.1")
- External references (e.g., regulatory standards, industry guidelines)
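A sketch of how such references might be detected; the regexes and return shape are assumptions, not the reference engine's actual rules:

const REFERENCE_PATTERNS = {
  document: /\b(?:See|As per|Refer to)\s+(?:Appendix\s+[A-Z]|Schedule\s+\d+|Annex\s+[A-Z0-9]+)/gi,
  section: /\b(?:Section|Clause)\s+\d+(?:\.\d+)*/gi,
};

function extractReferences(text: string): { documents: string[]; sections: string[] } {
  return {
    documents: text.match(REFERENCE_PATTERNS.document) ?? [],
    sections: text.match(REFERENCE_PATTERNS.section) ?? [],
  };
}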
Stage 4: Embedding Generation and Vector Storage¶
Embedding Strategy¶
Each processed chunk is converted into a high-dimensional vector representation using Azure OpenAI's embedding models:
// Generate embedding for semantic search
const embedding = await this.embeddings.embedQuery(cleanedText);
const normalisedEmbedding = this.normaliseEmbedding(embedding, 1536);
Embedding Benefits:
- Semantic Search: Find conceptually similar content
- Question Matching: Identify relevant chunks for specific questions
- Context Retrieval: Surface related information across documents
- Answer Planning: Match questions to potential answer sources
Vector Database Structure¶
Embeddings are stored with comprehensive metadata for efficient retrieval:
interface VectorChunk {
  text: string;
  embedding: number[]; // 1536-dimensional vector
  documentId: string;
  projectId: string;
  chunkIndex: number;
  tags: string[];
  contentType: string;
  hasRespondentQuestions: boolean;
  questionIds: string[];
  sectionNumber: string;
  pageNumber: number;
  qualityScore: number;
  metadata: ChunkMetadata;
}
Storage in PostgreSQL with pgvector¶
We use PostgreSQL with the pgvector extension for efficient vector operations:
CREATE TABLE tender_chunks (
    id SERIAL PRIMARY KEY,
    text TEXT NOT NULL,
    embedding vector(1536),
    document_id TEXT NOT NULL,
    project_id TEXT,
    chunk_index INTEGER NOT NULL,
    tags JSONB,
    content_type TEXT,
    has_respondent_questions BOOLEAN DEFAULT FALSE,
    quality_score INTEGER,
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE INDEX ON tender_chunks USING hnsw (embedding vector_cosine_ops);
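For retrieval, the HNSW index above supports cosine-distance queries via pgvector's <=> operator. A sketch using node-postgres follows; the connection handling and result limit are assumptions:

import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

async function findSimilarChunks(queryEmbedding: number[], projectId: string, limit = 10) {
  const vectorLiteral = `[${queryEmbedding.join(",")}]`; // pgvector accepts a bracketed text literal
  const { rows } = await pool.query(
    `SELECT id, text, 1 - (embedding <=> $1::vector) AS similarity
       FROM tender_chunks
      WHERE project_id = $2
      ORDER BY embedding <=> $1::vector
      LIMIT $3`,
    [vectorLiteral, projectId, limit],
  );
  return rows;
}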
Stage 5: Intelligent Caching and Persistence¶
Redis Caching Strategy¶
Purpose: Provide fast access to frequently accessed documents during active editing sessions.
Implementation:
// Cache key structure
const cacheKey = `tenderdoc:${projectId}:${documentId}:data`;
// TTL of 15 minutes ensures fresh data
await redis.setex(cacheKey, 900, JSON.stringify(documentData));
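A read-through helper built on the same pattern might look like this, assuming ioredis; fetchFromMongo is a hypothetical loader standing in for the MongoDB read:

import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

async function getTenderDocument(
  projectId: string,
  documentId: string,
  fetchFromMongo: () => Promise<unknown>,
): Promise<unknown> {
  const cacheKey = `tenderdoc:${projectId}:${documentId}:data`;
  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached);                   // cache hit: fast path
  const fresh = await fetchFromMongo();                     // cache miss: load from persistence
  await redis.setex(cacheKey, 900, JSON.stringify(fresh));  // 15-minute TTL
  return fresh;
}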
Cache Benefits:
- Sub-millisecond access times for active documents
- Reduced database load during intensive editing sessions
- Automatic expiration prevents stale data issues
MongoDB Persistence¶
Purpose: Durable, long-term storage with rich querying capabilities.
Document Structure:
{
  _id: ObjectId,
  documentId: String,
  projectId: String,
  name: String,
  type: String,
  content: String,
  tags: [String],
  respondentQuestions: {
    questionId: {
      question: String,
      section: String,
      priority: String,
      answerPlanningStatus: Boolean
    }
  },
  chunks: [ChunkReference],
  processingMetadata: {
    version: String,
    processedAt: Date,
    totalChunks: Number,
    qualityScore: Number
  },
  createdAt: Date,
  updatedAt: Date
}
Stage 6: Analysis and Intelligence Layer¶
LLM-Powered Analysis¶
The NestJS backend orchestrates comprehensive document analysis using LangChain and Azure OpenAI:
async parseTenderDocument(documentContent: string) {
  const llm = await this.langchainService.getLLM();
  const modelWithFunctions = llm.bind({
    functions: [{
      name: 'parseTenderDocument',
      description: 'Parse tender document information into structured format',
      parameters: zodToJsonSchema(TenderPackSchema),
    }],
    function_call: { name: 'parseTenderDocument' },
  });
  return await modelWithFunctions.invoke(analysisPrompt);
}
Structured Extraction¶
The LLM extracts key information including:
Key Information Categories (a schema sketch follows this list):
- Deadlines: Submission dates, milestone dates, contract start dates
- Requirements: Technical, commercial, and compliance requirements
- Evaluation Criteria: Scoring methodology and weightings
- Key Dates: Important timeline events
- Risks: Potential compliance or delivery risks
- Opportunities: Value-add possibilities and differentiators
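The analysis code above passes TenderPackSchema (via zodToJsonSchema) to the function-calling LLM. Its real definition is not reproduced here, but a Zod schema covering these categories might look roughly like the following; all field names are assumptions:

import { z } from "zod";

const TenderPackSchemaSketch = z.object({
  deadlines: z.array(z.object({ label: z.string(), date: z.string() })),
  requirements: z.array(z.object({
    description: z.string(),
    category: z.enum(["technical", "commercial", "compliance"]),
  })),
  evaluationCriteria: z.array(z.object({ criterion: z.string(), weighting: z.number().optional() })),
  keyDates: z.array(z.object({ event: z.string(), date: z.string() })),
  risks: z.array(z.string()),
  opportunities: z.array(z.string()),
});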
Response Formatting¶
Analysis results are structured for consumption by other platform modules:
interface TenderAnalysis {
  projectId: string;
  keyInformation: {
    deadlines: Deadline[];
    requirements: Requirement[];
    keyDates: KeyDate[];
    evaluationCriteria: EvaluationCriteria[];
  };
  summary: {
    keyPoints: string[];
    risks: string[];
    opportunities: string[];
  };
  processingMetadata: {
    totalDocuments: number;
    successfullyProcessed: number;
    failed: number;
    processingTime: number;
  };
}
Integration Points¶
Frontend Integration¶
The processed data is consumed by various frontend components:
- Document Browser: Displays categorised documents with intelligent filtering
- Question Manager: Shows extracted questions with answer planning status
- Search Interface: Enables semantic search across all processed content
- Analysis Dashboard: Presents key insights and extracted information
- Answer Editor: Provides contextual information while writing responses
API Endpoints¶
Key endpoints for accessing processed data:
// Process new tender documents
POST /api/tender-parsing/tag-documents
POST /api/tender-parsing/process-bid-analysis
// Retrieve processed data
GET /api/tender-parsing/project-documentation/:projectId
GET /api/tender-parsing/search-documents
// Update processing status
PUT /api/tender-parsing/update-tags
PUT /api/tender-parsing/update-answer-planning
Background Processing¶
The pipeline uses BullMQ for non-blocking background operations:
// Queue linked references update for background processing
await this.linkedReferencesQueue.add(
  "updateLinkedReferences",
  { projectId, documents },
  {
    delay: 5000, // Process after initial response
    attempts: 3,
    backoff: { type: "exponential", delay: 1000 }, // BullMQ expects an object here; the base delay is illustrative
  }
);
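On the consumer side, a BullMQ Worker picks the job up after the delay and performs the update out of band. The queue name, connection details, and handler body below are assumptions:

import { Worker } from "bullmq";

const linkedReferencesWorker = new Worker(
  "linkedReferences",
  async (job) => {
    const { projectId, documents } = job.data;
    console.log(`Updating linked references for project ${projectId} (${documents.length} documents)`);
    // ... recompute cross-document references and persist them
  },
  { connection: { host: "localhost", port: 6379 } },
);

linkedReferencesWorker.on("failed", (job, err) => {
  console.error(`Linked reference update failed for job ${job?.id}:`, err.message);
});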
Performance Optimisations¶
Chunking Optimisations¶
- Adaptive Chunk Sizes: Adjust based on content density
- Smart Overlap: Preserve context without excessive duplication
- Parallel Processing: Use ThreadPoolExecutor for concurrent chunk processing
- Quality Scoring: Prioritise high-quality chunks for analysis
Embedding Optimisations¶
- Batch Processing: Generate embeddings in batches of 5 to avoid timeouts (sketched with retry handling after this list)
- Retry Logic: Implement exponential backoff for transient failures
- Dimension Normalisation: Ensure consistent vector dimensions
- Caching: Store embeddings to avoid regeneration
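A sketch combining the batching and retry behaviour described above. The embedding function is injected so the example stays client-agnostic; the batch size mirrors the figure above, while the retry parameters are assumptions:

async function embedInBatches(
  texts: string[],
  embedQuery: (text: string) => Promise<number[]>,
  batchSize = 5,
  maxAttempts = 3,
): Promise<number[][]> {
  const results: number[][] = [];
  for (let i = 0; i < texts.length; i += batchSize) {
    const batch = texts.slice(i, i + batchSize);
    for (let attempt = 1; ; attempt++) {
      try {
        results.push(...(await Promise.all(batch.map((t) => embedQuery(t)))));
        break;
      } catch (err) {
        if (attempt >= maxAttempts) throw err;
        await new Promise((resolve) => setTimeout(resolve, 1000 * 2 ** attempt)); // exponential backoff
      }
    }
  }
  return results;
}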
Storage Optimisations¶
- Indexed Queries: Optimise database queries with proper indexing
- Batch Insertions: Insert multiple chunks in single transactions
- Connection Pooling: Efficiently manage database connections
- TTL Management: Automatic cleanup of expired cache entries
Quality Assurance¶
Data Validation¶
- Input Validation: Comprehensive validation of incoming documents
- Processing Validation: Verify each processing stage completion
- Output Validation: Ensure generated data meets quality standards
- Error Recovery: Graceful handling of processing failures
Monitoring and Logging¶
- Processing Metrics: Track success rates, processing times, and error rates
- Quality Metrics: Monitor chunk quality scores and extraction accuracy
- Usage Metrics: Track API usage patterns and performance
- Alert Systems: Notify of processing failures or quality degradation
Future Enhancements¶
Planned Improvements¶
- Enhanced ML Models: Improve tagging and question extraction accuracy
- Multi-language Support: Process documents in multiple languages
- Visual Element Extraction: Better handling of charts, diagrams, and images
- Collaborative Annotations: Enable manual corrections and improvements
- Version Control: Track document processing versions and changes
- Advanced Analytics: Provide deeper insights into tender patterns and trends
Scalability Roadmap¶
- Horizontal Scaling: Support processing across multiple servers
- Streaming Processing: Handle very large documents efficiently
- Real-time Updates: Enable live document processing and updates
- Advanced Caching: Implement multi-tier caching strategies
- API Rate Limiting: Protect against abuse and ensure fair usage
Conclusion¶
The tender document processing pipeline represents a sophisticated integration of multiple technologies and methodologies designed to extract maximum value from tender documentation. By combining Azure's document intelligence capabilities with custom ML models, intelligent chunking strategies, and comprehensive storage solutions, we create a system that not only processes documents efficiently but also provides the intelligence layer necessary for effective bid management.
This pipeline serves as the foundation for all downstream activities including question answering, response planning, compliance checking, and competitive analysis, making it a critical component of the overall BidScript platform.