Chunking and Embedding Strategy¶
Overview¶
The chunking and embedding process is a core stage of the tender document processing pipeline: it transforms unstructured document text into searchable, contextual chunks with semantic vector representations. This enables semantic search, question matching, and targeted content retrieval across all processed tender documents.
Why Chunking is Essential¶
The Challenge of Large Documents¶
Tender documents often contain:
- 50-200+ pages of content
- Multiple sections with different content types
- Complex cross-references between sections
- Mixed content (text, tables, lists, specifications)
- Questions scattered throughout the document
Limitations Without Chunking¶
- Context Loss: LLMs have token limits that prevent processing entire documents
- Poor Search Results: Large text blocks reduce search precision
- Inefficient Processing: Cannot parallelize analysis of document sections
- Memory Constraints: Embedding entire documents is computationally expensive
- Retrieval Issues: Finding specific information becomes a needle-in-a-haystack problem
Benefits of Intelligent Chunking¶
- Preserved Context: Maintains logical relationships within document sections
- Efficient Search: Enables targeted retrieval of relevant content
- Parallel Processing: Multiple chunks can be analyzed simultaneously
- Resource Optimization: Better memory usage and processing efficiency
- Enhanced Accuracy: Improved precision in question matching and answer sourcing
Chunking Strategy Implementation¶
Multi-Tier Approach¶
The system employs a sophisticated multi-tier chunking strategy that adapts to document structure:
def chunk_document(self, text: str) -> List[Dict[str, Any]]:
    """
    Chunk document using multiple strategies and select the best result
    """
    # Tier 1: Section-based chunking (preferred)
    section_chunks = self.chunk_by_sections(text)

    # Tier 2: Paragraph-based chunking (fallback)
    if len(section_chunks) <= 1 and self.count_tokens(text) > self.max_tokens:
        chunks_text = self.chunk_by_paragraphs(text)
    else:
        chunks_text = section_chunks

    # Tier 3: Token-based chunking (final fallback)
    if len(chunks_text) <= 1 and self.count_tokens(text) > self.max_tokens:
        chunks_text = self.simple_token_chunking(text)

    return self.create_chunk_metadata(chunks_text)
Tier 1: Section-Based Chunking¶
Objective: Preserve document structure and logical organization.
Pattern Recognition:
header_patterns = [
    r'^\d+\.\s+[A-Z][^\.]+$',   # "1. Introduction"
    r'^[A-Z][A-Z\s]+$',         # "TECHNICAL REQUIREMENTS"
    r'^[A-Z][^\.]+:$',          # "Evaluation Criteria:"
    r'^[^\n]+\n[=\-]{3,}$',     # Underlined headings
]
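The chunk_by_sections implementation itself is not reproduced in this document; the following is a minimal sketch of how the single-line patterns above could drive a line-by-line split (the underlined-heading pattern and the merging of undersized sections are omitted for brevity):

import re
from typing import List

# Shown as a free function for brevity; in the pipeline this is a TextChunker method.
def chunk_by_sections(text: str) -> List[str]:
    """Sketch: start a new section whenever a line looks like a heading."""
    header_regexes = [re.compile(p) for p in (
        r'^\d+\.\s+[A-Z][^\.]+$',   # "1. Introduction"
        r'^[A-Z][A-Z\s]+$',         # "TECHNICAL REQUIREMENTS"
        r'^[A-Z][^\.]+:$',          # "Evaluation Criteria:"
    )]
    sections: List[str] = []
    current: List[str] = []
    for line in text.split("\n"):
        # Flush the accumulated section when a new heading starts
        if current and any(rx.match(line.strip()) for rx in header_regexes):
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [s for s in sections if s]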
Benefits:
- Maintains semantic coherence within sections
- Preserves context for requirements and specifications
- Enables section-level search and analysis
- Respects document author's organization intent
Example Output:
Chunk 1: "1. Introduction\nThis tender invites proposals for..."
Chunk 2: "2. Technical Requirements\n2.1 System Architecture..."
Chunk 3: "3. Evaluation Criteria\nProposals will be assessed..."
Tier 2: Paragraph-Based Chunking¶
Objective: Maintain narrative flow when clear sections aren't detected.
Strategy:
- Split on paragraph boundaries (double newlines)
- Combine small paragraphs to meet minimum token requirements
- Preserve conversational and contextual flow
Implementation:
def chunk_by_paragraphs(self, text: str) -> List[str]:
    paragraphs = [p.strip() for p in re.split(r'\n\s*\n', text) if p.strip()]
    chunks = []
    current_chunk = ""
    current_tokens = 0

    for paragraph in paragraphs:
        paragraph_tokens = self.count_tokens(paragraph)
        if current_tokens + paragraph_tokens > self.max_tokens and current_chunk:
            chunks.append(current_chunk)
            # Start new chunk with overlap from the end of the previous one
            current_chunk = self.create_overlap(current_chunk) + "\n\n" + paragraph
        else:
            current_chunk += "\n\n" + paragraph if current_chunk else paragraph
        current_tokens = self.count_tokens(current_chunk)

    # Keep the final, partially filled chunk
    if current_chunk:
        chunks.append(current_chunk)
    return chunks
Tier 3: Token-Based Chunking¶
Objective: Ensure all content is processed when structure detection fails.
When Used:
- Highly unstructured documents
- Documents with inconsistent formatting
- Scanned documents with OCR artifacts
- Emergency fallback for any content
Implementation:
def simple_token_chunking(self, text: str) -> List[str]:
    tokens = self.encoding.encode(text)
    chunks = []
    for i in range(0, len(tokens), self.max_tokens - self.overlap_tokens):
        chunk_tokens = tokens[i:i + self.max_tokens]
        if len(chunk_tokens) >= self.min_tokens:
            chunk = self.encoding.decode(chunk_tokens)
            chunks.append(chunk)
    return chunks
Overlap Strategy¶
Why Overlap Matters¶
Context Preservation:
- Questions might span chunk boundaries
- Requirements could be split across chunks
- References might lose context without overlap
Example Without Overlap:
Chunk 1: "...suppliers must demonstrate compliance with ISO"
Chunk 2: "27001 certification and provide evidence of..."
Example With Overlap:
Chunk 1: "...suppliers must demonstrate compliance with ISO 27001 certification..."
Chunk 2: "...ISO 27001 certification and provide evidence of regular audits..."
Overlap Configuration¶
class TextChunker:
    def __init__(self,
                 max_tokens: int = 800,
                 min_tokens: int = 100,
                 overlap_tokens: int = 400):  # 50% overlap
        self.max_tokens = max_tokens
        self.min_tokens = min_tokens
        self.overlap_tokens = overlap_tokens
Overlap Calculation:
- 50% overlap greatly reduces the risk of losing critical information at chunk boundaries
- Sentence-boundary awareness prevents mid-sentence cuts
- Dynamic adjustment based on content density
Smart Overlap Implementation¶
def create_overlap(self, current_chunk: str) -> str:
    """Create intelligent overlap that preserves sentence boundaries"""
    tokens = self.encoding.encode(current_chunk)
    overlap_tokens = tokens[-min(self.overlap_tokens, len(tokens)):]
    overlap_text = self.encoding.decode(overlap_tokens)

    # Try to start overlap at a sentence boundary
    sentences = re.split(r'[.!?]+', overlap_text)
    if len(sentences) > 1:
        # Start from the beginning of the last complete sentence
        return sentences[-2] + sentences[-1] if len(sentences) > 2 else overlap_text
    return overlap_text
Chunk Metadata Generation¶
Comprehensive Metadata Structure¶
Each chunk receives rich metadata for enhanced searchability and context:
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

@dataclass
class ChunkMetadata:
    # Core identification (fields without defaults must precede defaulted fields)
    chunk_index: int
    chunk_size: int
    text: str
    chunk_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    # Content classification
    tags: List[str] = field(default_factory=list)
    content_type: Optional[str] = None  # question, requirement, specification
    has_respondent_questions: bool = False
    question_ids: List[str] = field(default_factory=list)
    # Document structure
    section_number: Optional[str] = None
    section_title: Optional[str] = None
    page_number: Optional[int] = None
    # Reference tracking
    contains_references: bool = False
    referenced_documents: List[str] = field(default_factory=list)
    referenced_sections: List[str] = field(default_factory=list)
    # Quality indicators
    quality_score: Optional[int] = None  # 1-100
    overlap_tokens: int = 50
    # Processing metadata
    metadata: Dict[str, Any] = field(default_factory=dict)
Quality Score Calculation¶
def calculate_chunk_quality(text: str, tags: List[str], has_questions: bool) -> int:
    """Calculate quality score for a chunk (1-100)"""
    score = 50  # Base score

    # Text length factors
    word_count = len(text.split())
    if 50 <= word_count <= 300:
        score += 20   # Optimal length
    elif word_count < 20:
        score -= 20   # Too short, likely incomplete
    elif word_count > 500:
        score -= 10   # Too long, may need further chunking

    # Content value factors
    if has_questions:
        score += 15   # High-value content requiring responses
    if tags:
        score += len(tags) * 2  # More tags indicate structured content

    # Text quality indicators
    if re.search(r'[.!?]', text):
        score += 5    # Proper punctuation
    if len(re.findall(r'[A-Z][a-z]+', text)) > 5:
        score += 5    # Proper capitalization

    # Structural content indicators
    structure_keywords = r'\b(?:section|clause|requirement|specification|deadline|evaluation)\b'
    if re.search(structure_keywords, text.lower()):
        score += 10   # Contains structural keywords

    return max(1, min(100, score))  # Clamp between 1-100
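The create_chunk_metadata step referenced in chunk_document is not shown above. A minimal sketch, assuming the ChunkMetadata dataclass and calculate_chunk_quality defined here, with extract_tags and detect_questions as hypothetical stand-ins for the real tag and question classifiers:

def create_chunk_metadata(self, chunks_text: List[str]) -> List[ChunkMetadata]:
    """Sketch: wrap raw chunk strings in ChunkMetadata with a quality score."""
    chunk_metadata = []
    for idx, chunk_text in enumerate(chunks_text):
        tags = self.extract_tags(chunk_text)               # hypothetical tagger
        has_questions = self.detect_questions(chunk_text)  # hypothetical detector
        chunk_metadata.append(ChunkMetadata(
            chunk_index=idx,
            chunk_size=self.count_tokens(chunk_text),
            text=chunk_text,
            tags=tags,
            has_respondent_questions=has_questions,
            quality_score=calculate_chunk_quality(chunk_text, tags, has_questions),
            overlap_tokens=self.overlap_tokens,
        ))
    return chunk_metadata

In the actual pipeline these records may be returned as plain dicts (as chunk_document's annotation suggests); the dataclass form is used here for readability.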
Embedding Generation Process¶
Azure OpenAI Integration¶
The system uses Azure OpenAI's text-embedding-ada-002 model for generating high-quality embeddings:
async generateEmbedding(text: string): Promise<number[]> {
  // Clean text for optimal embedding quality
  const cleanedText = text
    .replace(/\s+/g, ' ')    // Normalize whitespace
    .trim()                  // Remove leading/trailing spaces
    .substring(0, 8191);     // Stay within the model's input limit (character-based approximation)

  // Generate embedding with retry logic
  let retries = 3;
  while (retries > 0) {
    try {
      const embedding = await this.embeddings.embedQuery(cleanedText);
      return this.normalizeEmbedding(embedding, 1536);
    } catch (error) {
      retries--;
      if (retries === 0) throw error;
      await this.delay(1000 * (4 - retries)); // Increasing backoff: 2s, then 3s
    }
  }
  // Unreachable, but satisfies the compiler's return-path analysis
  throw new Error('Embedding generation failed');
}
Embedding Quality Optimizations¶
Text Preprocessing:
function preprocessForEmbedding(text: string): string {
  return text
    .replace(/\s+/g, " ")              // Normalize whitespace
    .replace(/[^\w\s.,!?-]/g, "")      // Remove special characters
    .replace(/\b\d{4,}\b/g, "[NUM]")   // Normalize long numbers
    .trim();
}
Batch Processing:
// Runs as a method of the embedding service (uses this.generateEmbedding / this.delay)
async batchGenerateEmbeddings(chunks: string[]): Promise<number[][]> {
  const BATCH_SIZE = 5; // Prevent API timeouts
  const results: number[][] = [];

  for (let i = 0; i < chunks.length; i += BATCH_SIZE) {
    const batch = chunks.slice(i, i + BATCH_SIZE);
    const batchResults = await Promise.all(
      batch.map((chunk) => this.generateEmbedding(chunk))
    );
    results.push(...batchResults);

    // Rate limiting
    if (i + BATCH_SIZE < chunks.length) {
      await this.delay(100);
    }
  }
  return results;
}
Vector Normalization¶
private normalizeEmbedding(
  embedding: number[],
  targetDimension: number = 1536
): number[] {
  if (embedding.length === targetDimension) {
    return embedding;
  }
  if (embedding.length > targetDimension) {
    // Truncate if too large
    this.logger.warn(`Truncating embedding from ${embedding.length} to ${targetDimension}`);
    return embedding.slice(0, targetDimension);
  } else {
    // Pad with zeros if too small
    this.logger.warn(`Padding embedding from ${embedding.length} to ${targetDimension}`);
    return [...embedding, ...Array(targetDimension - embedding.length).fill(0)];
  }
}
Vector Storage Architecture¶
PostgreSQL with pgvector¶
The system uses PostgreSQL with the pgvector extension for efficient vector operations:
-- Table structure optimized for vector operations
CREATE TABLE tender_chunks (
    id SERIAL PRIMARY KEY,
    text TEXT NOT NULL,
    embedding vector(1536),

    -- Document identification
    document_id TEXT NOT NULL,
    project_id TEXT,

    -- Chunk metadata
    chunk_index INTEGER NOT NULL,
    chunk_size INTEGER NOT NULL,
    overlap_tokens INTEGER DEFAULT 50,

    -- Content classification
    tags JSONB,
    content_type TEXT,
    has_respondent_questions BOOLEAN DEFAULT FALSE,
    question_ids JSONB,

    -- Document structure
    section_number TEXT,
    section_title TEXT,
    page_number INTEGER,

    -- Quality and references
    quality_score INTEGER,
    contains_references BOOLEAN DEFAULT FALSE,
    referenced_documents JSONB,
    referenced_sections JSONB,

    -- Processing metadata
    processing_version TEXT DEFAULT '1.0',
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);
-- Optimized indexes for vector operations
CREATE INDEX tender_chunks_embedding_idx
ON tender_chunks
USING hnsw (embedding vector_cosine_ops);
-- Additional indexes for metadata queries
CREATE INDEX tender_chunks_document_idx ON tender_chunks(document_id);
CREATE INDEX tender_chunks_project_idx ON tender_chunks(project_id);
CREATE INDEX tender_chunks_content_type_idx ON tender_chunks(content_type);
CREATE INDEX tender_chunks_quality_idx ON tender_chunks(quality_score);
CREATE INDEX tender_chunks_questions_idx ON tender_chunks(has_respondent_questions);
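On the application side, chunk rows and their embeddings can be written with a standard PostgreSQL client. Below is a minimal sketch using psycopg 3 and the pgvector Python adapter; the connection string, the store_chunk helper, and the choice of columns to populate are illustrative, not the platform's actual data-access layer:

import numpy as np
import psycopg
from pgvector.psycopg import register_vector

def store_chunk(conn, document_id: str, project_id: str, chunk, embedding) -> None:
    """Sketch: persist one chunk (a ChunkMetadata-like object) and its embedding."""
    conn.execute(
        """
        INSERT INTO tender_chunks
            (text, embedding, document_id, project_id, chunk_index, chunk_size, quality_score)
        VALUES (%s, %s, %s, %s, %s, %s, %s)
        """,
        (chunk.text, np.array(embedding), document_id, project_id,
         chunk.chunk_index, chunk.chunk_size, chunk.quality_score),
    )

# Usage sketch
with psycopg.connect("dbname=tenders") as conn:  # hypothetical connection string
    register_vector(conn)  # lets psycopg adapt numpy arrays to the vector(1536) column
    # for chunk, embedding in zip(chunks, embeddings):
    #     store_chunk(conn, document_id, project_id, chunk, embedding)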
Vector Search Optimization¶
Similarity Search with Metadata Filtering:
-- Find similar chunks with quality filtering
SELECT
id,
text,
document_id,
section_number,
quality_score,
1 - (embedding <=> $1::vector) AS similarity_score
FROM tender_chunks
WHERE
project_id = $2
AND quality_score >= 70
AND (content_type = 'question' OR has_respondent_questions = true)
ORDER BY embedding <=> $1::vector
LIMIT 10;
Hybrid Search Combining Vector and Text:
-- Combine vector similarity with text search
WITH vector_search AS (
SELECT *, 1 - (embedding <=> $1::vector) AS vector_score
FROM tender_chunks
WHERE project_id = $2
ORDER BY embedding <=> $1::vector
LIMIT 20
),
text_search AS (
SELECT *, ts_rank(to_tsvector('english', text), plainto_tsquery($3)) AS text_score
FROM tender_chunks
WHERE
project_id = $2
AND to_tsvector('english', text) @@ plainto_tsquery($3)
)
SELECT
v.*,
COALESCE(t.text_score, 0) AS text_score,
(v.vector_score * 0.7 + COALESCE(t.text_score, 0) * 0.3) AS combined_score
FROM vector_search v
LEFT JOIN text_search t ON v.id = t.id
ORDER BY combined_score DESC
LIMIT 10;
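The same queries can be issued from application code. A minimal sketch of the metadata-filtered similarity search, assuming a psycopg 3 connection with register_vector applied as in the storage sketch above (find_similar_chunks is an illustrative helper, not an existing API):

import numpy as np

def find_similar_chunks(conn, query_embedding, project_id: str, limit: int = 10):
    """Sketch: run the quality- and type-filtered similarity search and return rows."""
    # conn: psycopg connection with register_vector(conn) already applied
    query_vec = np.array(query_embedding)
    return conn.execute(
        """
        SELECT id, text, document_id, section_number, quality_score,
               1 - (embedding <=> %s) AS similarity_score
        FROM tender_chunks
        WHERE project_id = %s
          AND quality_score >= 70
          AND (content_type = 'question' OR has_respondent_questions = true)
        ORDER BY embedding <=> %s
        LIMIT %s
        """,
        (query_vec, project_id, query_vec, limit),
    ).fetchall()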
Performance Considerations¶
Chunking Performance¶
Parallel Processing:
def process_chunks_parallel(self, chunks: List[Dict]) -> List[ChunkMetadata]:
    """Process chunks in parallel using ThreadPoolExecutor"""
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
        futures = [
            executor.submit(self.process_single_chunk, chunk, idx)
            for idx, chunk in enumerate(chunks)
        ]
        results = []
        for future in concurrent.futures.as_completed(futures):
            results.append(future.result())

    # Sort by chunk_index to maintain order
    return sorted(results, key=lambda x: x.chunk_index)
Memory Management:
def chunk_large_document(self, text: str) -> Iterator[Dict[str, Any]]:
    """Stream chunks for very large documents to manage memory"""
    if len(text) > 1_000_000:  # > 1MB text
        # Process in streaming fashion
        for chunk_text in self.stream_chunks(text):
            yield self.create_chunk_metadata(chunk_text)
    else:
        # Standard processing
        for chunk in self.chunk_document(text):
            yield chunk
Embedding Performance¶
Connection Pooling:
// Efficient connection management
class EmbeddingService {
  private connectionPool: Pool;

  constructor() {
    this.connectionPool = new Pool({
      max: 10,  // Maximum connections
      min: 2,   // Minimum connections
      acquireTimeoutMillis: 30000,
      createTimeoutMillis: 30000,
      idleTimeoutMillis: 30000,
    });
  }
}
Caching Strategy:
// Cache embeddings to avoid regeneration
import { createHash } from "crypto";

const embeddingCache = new Map<string, number[]>();

// Runs as a method of the embedding service (uses this.generateEmbedding)
async getCachedEmbedding(text: string): Promise<number[]> {
  const textHash = createHash("sha256").update(text).digest("hex");

  if (embeddingCache.has(textHash)) {
    return embeddingCache.get(textHash)!;
  }

  const embedding = await this.generateEmbedding(text);
  embeddingCache.set(textHash, embedding);
  return embedding;
}
Quality Assurance and Monitoring¶
Chunk Quality Metrics¶
interface ChunkQualityMetrics {
  averageChunkSize: number;
  chunkSizeDistribution: number[];
  overlapEfficiency: number;
  contentTypeDistribution: Record<string, number>;
  questionDetectionRate: number;
  averageQualityScore: number;
  embeddingSuccessRate: number;
}

// Runs as a method of the processing service (uses this.getProjectChunks, etc.)
async calculateQualityMetrics(projectId: string): Promise<ChunkQualityMetrics> {
  const chunks = await this.getProjectChunks(projectId);

  return {
    averageChunkSize:
      chunks.reduce((sum, c) => sum + c.chunk_size, 0) / chunks.length,
    chunkSizeDistribution: this.calculateDistribution(
      chunks.map((c) => c.chunk_size)
    ),
    overlapEfficiency: this.calculateOverlapEfficiency(chunks),
    contentTypeDistribution: this.groupBy(chunks, "content_type"),
    questionDetectionRate:
      chunks.filter((c) => c.has_respondent_questions).length / chunks.length,
    averageQualityScore:
      chunks.reduce((sum, c) => sum + c.quality_score, 0) / chunks.length,
    embeddingSuccessRate:
      chunks.filter((c) => c.embedding).length / chunks.length,
  };
}
Monitoring and Alerting¶
// Quality thresholds for alerting
const QUALITY_THRESHOLDS = {
  MIN_AVERAGE_QUALITY_SCORE: 70,
  MIN_EMBEDDING_SUCCESS_RATE: 0.95,
  MIN_QUESTION_DETECTION_RATE: 0.1, // At least 10% of chunks should contain questions
  MAX_PROCESSING_TIME_MS: 30000,    // 30 seconds per document
};

// Runs as a method of the processing service (uses this.calculateQualityMetrics / this.alerting)
async monitorProcessingQuality(projectId: string): Promise<void> {
  const metrics = await this.calculateQualityMetrics(projectId);

  if (metrics.averageQualityScore < QUALITY_THRESHOLDS.MIN_AVERAGE_QUALITY_SCORE) {
    await this.alerting.sendAlert({
      type: "QUALITY_WARNING",
      message: `Low average quality score: ${metrics.averageQualityScore}`,
      projectId,
    });
  }

  if (metrics.embeddingSuccessRate < QUALITY_THRESHOLDS.MIN_EMBEDDING_SUCCESS_RATE) {
    await this.alerting.sendAlert({
      type: "EMBEDDING_WARNING",
      message: `Low embedding success rate: ${metrics.embeddingSuccessRate}`,
      projectId,
    });
  }
}
Best Practices and Recommendations¶
Chunking Best Practices¶
- Respect Document Structure: Always prefer section-based chunking when possible
- Maintain Context: Use adequate overlap to preserve contextual relationships
- Quality over Quantity: Prefer fewer, higher-quality chunks over many small fragments
- Content-Aware Processing: Adjust strategies based on detected content types
- Validation: Always validate chunk quality before embedding generation (see the sketch below)
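As a concrete illustration of that validation step, a minimal sketch; the threshold value and the filter_low_quality_chunks name are illustrative, not part of the pipeline's actual API:

from typing import List, Tuple

MIN_QUALITY_FOR_EMBEDDING = 40  # illustrative threshold, not a pipeline constant

def filter_low_quality_chunks(
    chunks: List["ChunkMetadata"],
    min_score: int = MIN_QUALITY_FOR_EMBEDDING,
) -> Tuple[List["ChunkMetadata"], List["ChunkMetadata"]]:
    """Sketch: split chunks into those worth embedding and those to review."""
    accepted = [c for c in chunks if (c.quality_score or 0) >= min_score]
    rejected = [c for c in chunks if (c.quality_score or 0) < min_score]
    return accepted, rejected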
Embedding Best Practices¶
- Text Preprocessing: Clean and normalize text before embedding generation
- Batch Processing: Process embeddings in batches to optimize API usage
- Error Handling: Implement robust retry mechanisms for API failures
- Dimension Consistency: Ensure all embeddings have consistent dimensions
- Cache Management: Cache embeddings to avoid unnecessary regeneration
Storage Best Practices¶
- Index Optimization: Maintain proper indexes for both vector and metadata queries
- Data Partitioning: Consider partitioning large tables by project or date
- Backup Strategy: Regular backups of both vector data and metadata
- Performance Monitoring: Monitor query performance and optimize as needed
- Cleanup Procedures: Remove outdated embeddings and maintain data hygiene
Conclusion¶
The chunking and embedding strategy forms the foundation of intelligent document processing in the BidScript platform. By combining sophisticated chunking algorithms with high-quality embedding generation and efficient vector storage, we create a system that can understand, search, and analyze tender documents with remarkable precision and speed.
This approach enables advanced features like semantic search, intelligent question matching, automated answer sourcing, and contextual document analysis, making it an essential component of the overall tender processing pipeline.