Chunking and Embedding Strategy¶
Overview¶
The chunking and embedding process is a core stage of the tender document processing pipeline: it transforms unstructured document text into searchable, contextual chunks with semantic vector representations. This enables semantic search, question matching, and targeted content retrieval across all processed tender documents.
Why Chunking is Essential¶
The Challenge of Large Documents¶
Tender documents often contain:
- 50-200+ pages of content
- Multiple sections with different content types
- Complex cross-references between sections
- Mixed content (text, tables, lists, specifications)
- Questions scattered throughout the document
Limitations Without Chunking¶
- Context Loss: LLMs have token limits that prevent processing entire documents
- Poor Search Results: Large text blocks reduce search precision
- Inefficient Processing: Cannot parallelize analysis of document sections
- Memory Constraints: Embedding entire documents is computationally expensive
- Retrieval Issues: Finding specific information becomes a needle-in-a-haystack problem
Benefits of Intelligent Chunking¶
- Preserved Context: Maintains logical relationships within document sections
- Efficient Search: Enables targeted retrieval of relevant content
- Parallel Processing: Multiple chunks can be analyzed simultaneously
- Resource Optimization: Better memory usage and processing efficiency
- Enhanced Accuracy: Improved precision in question matching and answer sourcing
Chunking Strategy Implementation¶
Multi-Tier Approach¶
The system employs a sophisticated multi-tier chunking strategy that adapts to document structure:
def chunk_document(self, text: str) -> List[Dict[str, Any]]:
    """
    Chunk document using multiple strategies and select the best result
    """
    # Tier 1: Section-based chunking (preferred)
    section_chunks = self.chunk_by_sections(text)

    # Tier 2: Paragraph-based chunking (fallback)
    if len(section_chunks) <= 1 and self.count_tokens(text) > self.max_tokens:
        chunks_text = self.chunk_by_paragraphs(text)
    else:
        chunks_text = section_chunks

    # Tier 3: Token-based chunking (final fallback)
    if len(chunks_text) <= 1 and self.count_tokens(text) > self.max_tokens:
        chunks_text = self.simple_token_chunking(text)

    return self.create_chunk_metadata(chunks_text)
Tier 1: Section-Based Chunking¶
Objective: Preserve document structure and logical organization.
Pattern Recognition:
header_patterns = [
    r'^\d+\.\s+[A-Z][^\.]+$',   # "1. Introduction"
    r'^[A-Z][A-Z\s]+$',         # "TECHNICAL REQUIREMENTS"
    r'^[A-Z][^\.]+:$',          # "Evaluation Criteria:"
    r'^[^\n]+\n[=\-]{3,}$',     # Underlined headings
]
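The chunk_by_sections implementation itself is not reproduced in this document; the following is a minimal sketch of how the single-line patterns above could drive a line-by-line split (the underlined-heading pattern and the merging of undersized sections are omitted for brevity):

import re
from typing import List

# Shown as a free function for brevity; in the pipeline this is a TextChunker method.
def chunk_by_sections(text: str) -> List[str]:
    """Sketch: start a new section whenever a line looks like a heading."""
    header_regexes = [re.compile(p) for p in (
        r'^\d+\.\s+[A-Z][^\.]+$',   # "1. Introduction"
        r'^[A-Z][A-Z\s]+$',         # "TECHNICAL REQUIREMENTS"
        r'^[A-Z][^\.]+:$',          # "Evaluation Criteria:"
    )]
    sections: List[str] = []
    current: List[str] = []
    for line in text.split("\n"):
        # Flush the accumulated section when a new heading starts
        if current and any(rx.match(line.strip()) for rx in header_regexes):
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [s for s in sections if s]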
Benefits:
- Maintains semantic coherence within sections
- Preserves context for requirements and specifications
- Enables section-level search and analysis
- Respects document author's organization intent
Example Output:
Chunk 1: "1. Introduction\nThis tender invites proposals for..."
Chunk 2: "2. Technical Requirements\n2.1 System Architecture..."
Chunk 3: "3. Evaluation Criteria\nProposals will be assessed..."
Tier 2: Paragraph-Based Chunking¶
Objective: Maintain narrative flow when clear sections aren't detected.
Strategy:
- Split on paragraph boundaries (double newlines)
- Combine small paragraphs to meet minimum token requirements
- Preserve conversational and contextual flow
Implementation:
def chunk_by_paragraphs(self, text: str) -> List[str]:
    paragraphs = [p.strip() for p in re.split(r'\n\s*\n', text) if p.strip()]
    chunks = []
    current_chunk = ""
    current_tokens = 0

    for paragraph in paragraphs:
        paragraph_tokens = self.count_tokens(paragraph)
        if current_tokens + paragraph_tokens > self.max_tokens and current_chunk:
            chunks.append(current_chunk)
            # Start new chunk with overlap from the end of the previous one
            current_chunk = self.create_overlap(current_chunk) + "\n\n" + paragraph
        else:
            current_chunk += "\n\n" + paragraph if current_chunk else paragraph
        current_tokens = self.count_tokens(current_chunk)

    # Keep the final, partially filled chunk
    if current_chunk:
        chunks.append(current_chunk)
    return chunks
Tier 3: Token-Based Chunking¶
Objective: Ensure all content is processed when structure detection fails.
When Used:
- Highly unstructured documents
- Documents with inconsistent formatting
- Scanned documents with OCR artifacts
- Emergency fallback for any content
Implementation:
def simple_token_chunking(self, text: str) -> List[str]:
    tokens = self.encoding.encode(text)
    chunks = []
    for i in range(0, len(tokens), self.max_tokens - self.overlap_tokens):
        chunk_tokens = tokens[i:i + self.max_tokens]
        if len(chunk_tokens) >= self.min_tokens:
            chunk = self.encoding.decode(chunk_tokens)
            chunks.append(chunk)
    return chunks
Overlap Strategy¶
Why Overlap Matters¶
Context Preservation:
- Questions might span chunk boundaries
- Requirements could be split across chunks
- References might lose context without overlap
Example Without Overlap:
Chunk 1: "...suppliers must demonstrate compliance with ISO"
Chunk 2: "27001 certification and provide evidence of..."
Example With Overlap:
Chunk 1: "...suppliers must demonstrate compliance with ISO 27001 certification..."
Chunk 2: "...ISO 27001 certification and provide evidence of regular audits..."
Overlap Configuration¶
class TextChunker:
    def __init__(self,
                 max_tokens: int = 800,
                 min_tokens: int = 100,
                 overlap_tokens: int = 400):  # 50% overlap
        self.max_tokens = max_tokens
        self.min_tokens = min_tokens
        self.overlap_tokens = overlap_tokens
Overlap Calculation:
- 50% overlap greatly reduces the risk of losing critical information at chunk boundaries
- Sentence-boundary awareness prevents mid-sentence cuts
- Dynamic adjustment based on content density
Smart Overlap Implementation¶
def create_overlap(self, current_chunk: str) -> str:
    """Create intelligent overlap that preserves sentence boundaries"""
    tokens = self.encoding.encode(current_chunk)
    overlap_tokens = tokens[-min(self.overlap_tokens, len(tokens)):]
    overlap_text = self.encoding.decode(overlap_tokens)

    # Try to start overlap at a sentence boundary
    sentences = re.split(r'[.!?]+', overlap_text)
    if len(sentences) > 1:
        # Start from the beginning of the last complete sentence
        return sentences[-2] + sentences[-1] if len(sentences) > 2 else overlap_text
    return overlap_text
Chunk Metadata Generation¶
Comprehensive Metadata Structure¶
Each chunk receives rich metadata for enhanced searchability and context:
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

@dataclass
class ChunkMetadata:
    # Core identification (fields without defaults must precede defaulted fields)
    chunk_index: int
    chunk_size: int
    text: str
    chunk_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    # Content classification
    tags: List[str] = field(default_factory=list)
    content_type: Optional[str] = None  # question, requirement, specification
    has_respondent_questions: bool = False
    question_ids: List[str] = field(default_factory=list)
    # Document structure
    section_number: Optional[str] = None
    section_title: Optional[str] = None
    page_number: Optional[int] = None
    # Reference tracking
    contains_references: bool = False
    referenced_documents: List[str] = field(default_factory=list)
    referenced_sections: List[str] = field(default_factory=list)
    # Quality indicators
    quality_score: Optional[int] = None  # 1-100
    overlap_tokens: int = 50
    # Processing metadata
    metadata: Dict[str, Any] = field(default_factory=dict)
Quality Score Calculation¶
def calculate_chunk_quality(text: str, tags: List[str], has_questions: bool) -> int:
    """Calculate quality score for a chunk (1-100)"""
    score = 50  # Base score

    # Text length factors
    word_count = len(text.split())
    if 50 <= word_count <= 300:
        score += 20   # Optimal length
    elif word_count < 20:
        score -= 20   # Too short, likely incomplete
    elif word_count > 500:
        score -= 10   # Too long, may need further chunking

    # Content value factors
    if has_questions:
        score += 15   # High-value content requiring responses
    if tags:
        score += len(tags) * 2  # More tags indicate structured content

    # Text quality indicators
    if re.search(r'[.!?]', text):
        score += 5    # Proper punctuation
    if len(re.findall(r'[A-Z][a-z]+', text)) > 5:
        score += 5    # Proper capitalization

    # Structural content indicators
    structure_keywords = r'\b(?:section|clause|requirement|specification|deadline|evaluation)\b'
    if re.search(structure_keywords, text.lower()):
        score += 10   # Contains structural keywords

    return max(1, min(100, score))  # Clamp between 1-100
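The create_chunk_metadata step referenced in chunk_document is not shown above. A minimal sketch, assuming the ChunkMetadata dataclass and calculate_chunk_quality defined here, with extract_tags and detect_questions as hypothetical stand-ins for the real tag and question classifiers:

def create_chunk_metadata(self, chunks_text: List[str]) -> List[ChunkMetadata]:
    """Sketch: wrap raw chunk strings in ChunkMetadata with a quality score."""
    chunk_metadata = []
    for idx, chunk_text in enumerate(chunks_text):
        tags = self.extract_tags(chunk_text)               # hypothetical tagger
        has_questions = self.detect_questions(chunk_text)  # hypothetical detector
        chunk_metadata.append(ChunkMetadata(
            chunk_index=idx,
            chunk_size=self.count_tokens(chunk_text),
            text=chunk_text,
            tags=tags,
            has_respondent_questions=has_questions,
            quality_score=calculate_chunk_quality(chunk_text, tags, has_questions),
            overlap_tokens=self.overlap_tokens,
        ))
    return chunk_metadata

In the actual pipeline these records may be returned as plain dicts (as chunk_document's annotation suggests); the dataclass form is used here for readability.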
Embedding Generation Process¶
Azure OpenAI Integration¶
The system uses Azure OpenAI's text-embedding-ada-002 model for generating high-quality embeddings:
async generateEmbedding(text: string): Promise<number[]> {
  // Clean text for optimal embedding quality
  const cleanedText = text
    .replace(/\s+/g, ' ')    // Normalize whitespace
    .trim()                  // Remove leading/trailing spaces
    .substring(0, 8191);     // Stay within the model's input limit (character-based approximation)

  // Generate embedding with retry logic
  let retries = 3;
  while (retries > 0) {
    try {
      const embedding = await this.embeddings.embedQuery(cleanedText);
      return this.normalizeEmbedding(embedding, 1536);
    } catch (error) {
      retries--;
      if (retries === 0) throw error;
      await this.delay(1000 * (4 - retries)); // Increasing backoff: 2s, then 3s
    }
  }
  // Unreachable, but satisfies the compiler's return-path analysis
  throw new Error('Embedding generation failed');
}
Embedding Quality Optimizations¶
Text Preprocessing:
function preprocessForEmbedding(text: string): string {
  return text
    .replace(/\s+/g, " ")              // Normalize whitespace
    .replace(/[^\w\s.,!?-]/g, "")      // Remove special characters
    .replace(/\b\d{4,}\b/g, "[NUM]")   // Normalize long numbers
    .trim();
}
Batch Processing:
// Runs as a method of the embedding service (uses this.generateEmbedding / this.delay)
async batchGenerateEmbeddings(chunks: string[]): Promise<number[][]> {
  const BATCH_SIZE = 5; // Prevent API timeouts
  const results: number[][] = [];

  for (let i = 0; i < chunks.length; i += BATCH_SIZE) {
    const batch = chunks.slice(i, i + BATCH_SIZE);
    const batchResults = await Promise.all(
      batch.map((chunk) => this.generateEmbedding(chunk))
    );
    results.push(...batchResults);

    // Rate limiting
    if (i + BATCH_SIZE < chunks.length) {
      await this.delay(100);
    }
  }
  return results;
}
Vector Normalization¶
private normalizeEmbedding(
  embedding: number[],
  targetDimension: number = 1536
): number[] {
  if (embedding.length === targetDimension) {
    return embedding;
  }
  if (embedding.length > targetDimension) {
    // Truncate if too large
    this.logger.warn(`Truncating embedding from ${embedding.length} to ${targetDimension}`);
    return embedding.slice(0, targetDimension);
  } else {
    // Pad with zeros if too small
    this.logger.warn(`Padding embedding from ${embedding.length} to ${targetDimension}`);
    return [...embedding, ...Array(targetDimension - embedding.length).fill(0)];
  }
}
Vector Storage Architecture¶
PostgreSQL with pgvector¶
The system uses PostgreSQL with the pgvector extension for efficient vector operations:
-- Table structure optimized for vector operations
CREATE TABLE tender_chunks (
    id SERIAL PRIMARY KEY,
    text TEXT NOT NULL,
    embedding vector(1536),

    -- Document identification
    document_id TEXT NOT NULL,
    project_id TEXT,

    -- Chunk metadata
    chunk_index INTEGER NOT NULL,
    chunk_size INTEGER NOT NULL,
    overlap_tokens INTEGER DEFAULT 50,

    -- Content classification
    tags JSONB,
    content_type TEXT,
    has_respondent_questions BOOLEAN DEFAULT FALSE,
    question_ids JSONB,

    -- Document structure
    section_number TEXT,
    section_title TEXT,
    page_number INTEGER,

    -- Quality and references
    quality_score INTEGER,
    contains_references BOOLEAN DEFAULT FALSE,
    referenced_documents JSONB,
    referenced_sections JSONB,

    -- Processing metadata
    processing_version TEXT DEFAULT '1.0',
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);
-- Optimized indexes for vector operations
CREATE INDEX tender_chunks_embedding_idx
ON tender_chunks
USING hnsw (embedding vector_cosine_ops);
-- Additional indexes for metadata queries
CREATE INDEX tender_chunks_document_idx ON tender_chunks(document_id);
CREATE INDEX tender_chunks_project_idx ON tender_chunks(project_id);
CREATE INDEX tender_chunks_content_type_idx ON tender_chunks(content_type);
CREATE INDEX tender_chunks_quality_idx ON tender_chunks(quality_score);
CREATE INDEX tender_chunks_questions_idx ON tender_chunks(has_respondent_questions);
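On the application side, chunk rows and their embeddings can be written with a standard PostgreSQL client. Below is a minimal sketch using psycopg 3 and the pgvector Python adapter; the connection string, the store_chunk helper, and the choice of columns to populate are illustrative, not the platform's actual data-access layer:

import numpy as np
import psycopg
from pgvector.psycopg import register_vector

def store_chunk(conn, document_id: str, project_id: str, chunk, embedding) -> None:
    """Sketch: persist one chunk (a ChunkMetadata-like object) and its embedding."""
    conn.execute(
        """
        INSERT INTO tender_chunks
            (text, embedding, document_id, project_id, chunk_index, chunk_size, quality_score)
        VALUES (%s, %s, %s, %s, %s, %s, %s)
        """,
        (chunk.text, np.array(embedding), document_id, project_id,
         chunk.chunk_index, chunk.chunk_size, chunk.quality_score),
    )

# Usage sketch
with psycopg.connect("dbname=tenders") as conn:  # hypothetical connection string
    register_vector(conn)  # lets psycopg adapt numpy arrays to the vector(1536) column
    # for chunk, embedding in zip(chunks, embeddings):
    #     store_chunk(conn, document_id, project_id, chunk, embedding)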
Vector Search Optimization¶
Similarity Search with Metadata Filtering:
-- Find similar chunks with quality filtering
SELECT
id,
text,
document_id,
section_number,
quality_score,
1 - (embedding <=> $1::vector) AS similarity_score
FROM tender_chunks
WHERE
project_id = $2
AND quality_score >= 70
AND (content_type = 'question' OR has_respondent_questions = true)
ORDER BY embedding <=> $1::vector
LIMIT 10;
Hybrid Search Combining Vector and Text:
-- Combine vector similarity with text search
WITH vector_search AS (
SELECT *, 1 - (embedding <=> $1::vector) AS vector_score
FROM tender_chunks
WHERE project_id = $2
ORDER BY embedding <=> $1::vector
LIMIT 20
),
text_search AS (
SELECT *, ts_rank(to_tsvector('english', text), plainto_tsquery($3)) AS text_score
FROM tender_chunks
WHERE
project_id = $2
AND to_tsvector('english', text) @@ plainto_tsquery($3)
)
SELECT
v.*,
COALESCE(t.text_score, 0) AS text_score,
(v.vector_score * 0.7 + COALESCE(t.text_score, 0) * 0.3) AS combined_score
FROM vector_search v
LEFT JOIN text_search t ON v.id = t.id
ORDER BY combined_score DESC
LIMIT 10;
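The same queries can be issued from application code. A minimal sketch of the metadata-filtered similarity search, assuming a psycopg 3 connection with register_vector applied as in the storage sketch above (find_similar_chunks is an illustrative helper, not an existing API):

import numpy as np

def find_similar_chunks(conn, query_embedding, project_id: str, limit: int = 10):
    """Sketch: run the quality- and type-filtered similarity search and return rows."""
    # conn: psycopg connection with register_vector(conn) already applied
    query_vec = np.array(query_embedding)
    return conn.execute(
        """
        SELECT id, text, document_id, section_number, quality_score,
               1 - (embedding <=> %s) AS similarity_score
        FROM tender_chunks
        WHERE project_id = %s
          AND quality_score >= 70
          AND (content_type = 'question' OR has_respondent_questions = true)
        ORDER BY embedding <=> %s
        LIMIT %s
        """,
        (query_vec, project_id, query_vec, limit),
    ).fetchall()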
Performance Considerations¶
Chunking Performance¶
Parallel Processing:
def process_chunks_parallel(self, chunks: List[Dict]) -> List[ChunkMetadata]:
    """Process chunks in parallel using ThreadPoolExecutor"""
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
        futures = [
            executor.submit(self.process_single_chunk, chunk, idx)
            for idx, chunk in enumerate(chunks)
        ]
        results = []
        for future in concurrent.futures.as_completed(futures):
            results.append(future.result())

    # Sort by chunk_index to maintain order
    return sorted(results, key=lambda x: x.chunk_index)
Memory Management:
def chunk_large_document(self, text: str) -> Iterator[Dict[str, Any]]:
    """Stream chunks for very large documents to manage memory"""
    if len(text) > 1_000_000:  # > 1MB text
        # Process in streaming fashion
        for chunk_text in self.stream_chunks(text):
            yield self.create_chunk_metadata(chunk_text)
    else:
        # Standard processing
        for chunk in self.chunk_document(text):
            yield chunk
Embedding Performance¶
Connection Pooling:
// Efficient connection management
class EmbeddingService {
  private connectionPool: Pool;

  constructor() {
    this.connectionPool = new Pool({
      max: 10,  // Maximum connections
      min: 2,   // Minimum connections
      acquireTimeoutMillis: 30000,
      createTimeoutMillis: 30000,
      idleTimeoutMillis: 30000,
    });
  }
}
Caching Strategy:
// Cache embeddings to avoid regeneration
import { createHash } from "crypto";

const embeddingCache = new Map<string, number[]>();

// Runs as a method of the embedding service (uses this.generateEmbedding)
async getCachedEmbedding(text: string): Promise<number[]> {
  const textHash = createHash("sha256").update(text).digest("hex");

  if (embeddingCache.has(textHash)) {
    return embeddingCache.get(textHash)!;
  }

  const embedding = await this.generateEmbedding(text);
  embeddingCache.set(textHash, embedding);
  return embedding;
}
Quality Assurance and Monitoring¶
Chunk Quality Metrics¶
interface ChunkQualityMetrics {
  averageChunkSize: number;
  chunkSizeDistribution: number[];
  overlapEfficiency: number;
  contentTypeDistribution: Record<string, number>;
  questionDetectionRate: number;
  averageQualityScore: number;
  embeddingSuccessRate: number;
}

// Runs as a method of the processing service (uses this.getProjectChunks, etc.)
async calculateQualityMetrics(projectId: string): Promise<ChunkQualityMetrics> {
  const chunks = await this.getProjectChunks(projectId);

  return {
    averageChunkSize:
      chunks.reduce((sum, c) => sum + c.chunk_size, 0) / chunks.length,
    chunkSizeDistribution: this.calculateDistribution(
      chunks.map((c) => c.chunk_size)
    ),
    overlapEfficiency: this.calculateOverlapEfficiency(chunks),
    contentTypeDistribution: this.groupBy(chunks, "content_type"),
    questionDetectionRate:
      chunks.filter((c) => c.has_respondent_questions).length / chunks.length,
    averageQualityScore:
      chunks.reduce((sum, c) => sum + c.quality_score, 0) / chunks.length,
    embeddingSuccessRate:
      chunks.filter((c) => c.embedding).length / chunks.length,
  };
}
Monitoring and Alerting¶
// Quality thresholds for alerting
const QUALITY_THRESHOLDS = {
  MIN_AVERAGE_QUALITY_SCORE: 70,
  MIN_EMBEDDING_SUCCESS_RATE: 0.95,
  MIN_QUESTION_DETECTION_RATE: 0.1, // At least 10% of chunks should contain questions
  MAX_PROCESSING_TIME_MS: 30000,    // 30 seconds per document
};

// Runs as a method of the processing service (uses this.calculateQualityMetrics / this.alerting)
async monitorProcessingQuality(projectId: string): Promise<void> {
  const metrics = await this.calculateQualityMetrics(projectId);

  if (metrics.averageQualityScore < QUALITY_THRESHOLDS.MIN_AVERAGE_QUALITY_SCORE) {
    await this.alerting.sendAlert({
      type: "QUALITY_WARNING",
      message: `Low average quality score: ${metrics.averageQualityScore}`,
      projectId,
    });
  }

  if (metrics.embeddingSuccessRate < QUALITY_THRESHOLDS.MIN_EMBEDDING_SUCCESS_RATE) {
    await this.alerting.sendAlert({
      type: "EMBEDDING_WARNING",
      message: `Low embedding success rate: ${metrics.embeddingSuccessRate}`,
      projectId,
    });
  }
}
Best Practices and Recommendations¶
Chunking Best Practices¶
- Respect Document Structure: Always prefer section-based chunking when possible
- Maintain Context: Use adequate overlap to preserve contextual relationships
- Quality over Quantity: Prefer fewer, higher-quality chunks over many small fragments
- Content-Aware Processing: Adjust strategies based on detected content types
- Validation: Always validate chunk quality before embedding generation (see the sketch below)
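As a concrete illustration of that validation step, a minimal sketch; the threshold value and the filter_low_quality_chunks name are illustrative, not part of the pipeline's actual API:

from typing import List, Tuple

MIN_QUALITY_FOR_EMBEDDING = 40  # illustrative threshold, not a pipeline constant

def filter_low_quality_chunks(
    chunks: List["ChunkMetadata"],
    min_score: int = MIN_QUALITY_FOR_EMBEDDING,
) -> Tuple[List["ChunkMetadata"], List["ChunkMetadata"]]:
    """Sketch: split chunks into those worth embedding and those to review."""
    accepted = [c for c in chunks if (c.quality_score or 0) >= min_score]
    rejected = [c for c in chunks if (c.quality_score or 0) < min_score]
    return accepted, rejected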
Embedding Best Practices¶
- Text Preprocessing: Clean and normalize text before embedding generation
- Batch Processing: Process embeddings in batches to optimize API usage
- Error Handling: Implement robust retry mechanisms for API failures
- Dimension Consistency: Ensure all embeddings have consistent dimensions
- Cache Management: Cache embeddings to avoid unnecessary regeneration
Storage Best Practices¶
- Index Optimization: Maintain proper indexes for both vector and metadata queries
- Data Partitioning: Consider partitioning large tables by project or date
- Backup Strategy: Regular backups of both vector data and metadata
- Performance Monitoring: Monitor query performance and optimize as needed
- Cleanup Procedures: Remove outdated embeddings and maintain data hygiene
Conclusion¶
The chunking and embedding strategy forms the foundation of intelligent document processing in the BidScript platform. By combining sophisticated chunking algorithms with high-quality embedding generation and efficient vector storage, we create a system that can understand, search, and analyze tender documents with remarkable precision and speed.
This approach enables advanced features like semantic search, intelligent question matching, automated answer sourcing, and contextual document analysis, making it an essential component of the overall tender processing pipeline.