document-parse.service.ts

Overview

The DocumentParseService extracts text and structure from documents in several formats using Azure Form Recognizer. It sits in the document processing pipeline, converting raw documents into structured data that the BidScript platform can analyze and process.

Dependencies

import { Injectable, Logger } from '@nestjs/common';
import { ConfigService } from '@nestjs/config';
import { AzureKeyCredential, DocumentAnalysisClient } from '@azure/ai-form-recognizer';
import { BlobService } from './blob.service';
import { DocumentParseResult, ParsedTable, ParsedPage } from '../types/interfaces/parse.interface';

Key Features

- Document text extraction with layout preservation
- Table detection and structured extraction
- Key-value pair recognition
- Support for multiple document formats (PDF, DOCX, PPTX, images)
- Handling of large documents through pagination
- OCR capabilities for scanned documents
- Structure preservation with formatting information
- Document metadata extraction

Core Methods

parseDocument

async parseDocument(
  documentId: string,
  options?: {
    modelId?: string;
    containerName?: string;
    includeRawResponse?: boolean;
  }
): Promise<DocumentParseResult>

Parses a stored document, identified by its ID, using Azure Form Recognizer.

Parameters:
- documentId: The ID of the document to parse
- options: Parsing options
  - modelId: The Form Recognizer model to use (default: 'prebuilt-document')
  - containerName: The container name for the document
  - includeRawResponse: Whether to include the raw API response

Returns:
- DocumentParseResult containing:
  - documentId: The original document ID
  - content: Extracted text content
  - pages: Array of parsed pages with text, tables, and elements
  - tables: Array of detected tables
  - metadata: Document metadata
  - rawResponse: (optional) Raw API response

Example:

const parseResult = await documentParseService.parseDocument('invoice-123');
console.log(`Parsed ${parseResult.pages.length} pages with ${parseResult.tables.length} tables`);
console.log(`Content length: ${parseResult.content.length}`);

parseDocumentFromUrl

async parseDocumentFromUrl(
  documentUrl: string,
  options?: {
    modelId?: string;
    includeRawResponse?: boolean;
  }
): Promise<DocumentParseResult>

Parses a document using a publicly accessible URL.

Parameters:
- documentUrl: The URL of the document to parse
- options: Parsing options
  - modelId: The Form Recognizer model to use
  - includeRawResponse: Whether to include the raw API response

Returns:
- DocumentParseResult as described above

Example:

const url = 'https://example.com/documents/contract.pdf';
const parseResult = await documentParseService.parseDocumentFromUrl(url);
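
A minimal sketch of how URL-based parsing might delegate to the SDK, assuming the service calls `beginAnalyzeDocumentFromUrl` on `DocumentAnalysisClient` and waits on the returned poller. The helper name `analyzeFromUrl`, the `AnalysisClientLike` interface, and the URL scheme check are hypothetical and are not taken from the service's source. The client is typed structurally so the sketch runs without an Azure resource.

```typescript
// Structural subset of the SDK's DocumentAnalysisClient (assumption: only this call is needed).
interface AnalysisClientLike<R> {
  beginAnalyzeDocumentFromUrl(
    modelId: string,
    documentUrl: string,
  ): Promise<{ pollUntilDone(): Promise<R> }>;
}

// Hypothetical helper: validate the URL, start the analysis, and wait for the result.
async function analyzeFromUrl<R>(
  client: AnalysisClientLike<R>,
  documentUrl: string,
  modelId = 'prebuilt-document',
): Promise<R> {
  // Reject anything that is not an absolute http(s) URL before calling the service.
  const { protocol } = new URL(documentUrl);
  if (protocol !== 'https:' && protocol !== 'http:') {
    throw new Error(`Unsupported URL scheme: ${protocol}`);
  }
  // The analysis is a long-running operation; the poller resolves once it completes.
  const poller = await client.beginAnalyzeDocumentFromUrl(modelId, documentUrl);
  return poller.pollUntilDone();
}
```

Because the document is fetched by the Azure service rather than by the caller, the URL has to be reachable from Azure, which is why the method requires a publicly accessible address.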

parseDocumentContent

async parseDocumentContent(
  content: Buffer,
  options?: {
    filename?: string;
    contentType?: string;
    modelId?: string;
    includeRawResponse?: boolean;
  }
): Promise<DocumentParseResult>

Parses document content provided as a buffer.

Parameters:
- content: The document content as a Buffer
- options: Parsing options
  - filename: Optional filename for mime-type detection
  - contentType: The content type of the document
  - modelId: The Form Recognizer model to use
  - includeRawResponse: Whether to include the raw API response

Returns:
- DocumentParseResult as described above

Example:

import * as fs from 'fs';

const fileBuffer = fs.readFileSync('contract.pdf');
const parseResult = await documentParseService.parseDocumentContent(fileBuffer, {
  filename: 'contract.pdf',
  contentType: 'application/pdf'
});
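
The `filename` option is described as a hint for mime-type detection. A minimal sketch of how such a fallback might work, assuming an explicit `contentType` takes precedence over the filename. The helper `resolveContentType` and the extension map are hypothetical, and the formats the service actually accepts may differ.

```typescript
// Hypothetical extension-to-MIME map (illustrative only).
const MIME_BY_EXTENSION: Record<string, string> = {
  pdf: 'application/pdf',
  docx: 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
  pptx: 'application/vnd.openxmlformats-officedocument.presentationml.presentation',
  png: 'image/png',
  jpg: 'image/jpeg',
  jpeg: 'image/jpeg',
  tiff: 'image/tiff',
};

// Prefer an explicit contentType; otherwise infer it from the filename extension.
function resolveContentType(
  options: { filename?: string; contentType?: string } = {},
): string | undefined {
  if (options.contentType) {
    return options.contentType;
  }
  // Capture the text after the last dot, ignoring dots in directory names.
  const match = /\.([^.\\/]+)$/.exec(options.filename ?? '');
  const extension = match?.[1];
  if (!extension) {
    return undefined;
  }
  return MIME_BY_EXTENSION[extension.toLowerCase()];
}
```

Returning `undefined` for an unknown extension leaves the decision to the caller, which can then reject the upload or let the service attempt its own detection.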

extractTablesFromDocument

async extractTablesFromDocument(
  documentId: string,
  options?: {
    containerName?: string;
    format?: 'markdown' | 'csv' | 'json';
  }
): Promise<{
  tables: string[];
  rawTables: ParsedTable[];
}>

Extracts the tables from a stored document and returns them both as formatted strings and as structured objects.

Parameters:
- documentId: The ID of the document
- options: Extraction options
  - containerName: The container name
  - format: Output format for tables (default: 'markdown')

Returns:
- Object containing:
  - tables: Array of formatted table strings
  - rawTables: Array of structured table objects

Example:

import * as fs from 'fs';

// Extract tables as CSV
const tableResults = await documentParseService.extractTablesFromDocument('financial-report', {
  format: 'csv'
});

console.log(`Extracted ${tableResults.tables.length} tables`);
tableResults.tables.forEach((table, index) => {
  fs.writeFileSync(`table-${index}.csv`, table);
});

Implementation Details

Azure Form Recognizer Client Initialization

private documentAnalysisClient: DocumentAnalysisClient;

constructor(
  private configService: ConfigService,
  private blobService: BlobService
) {
  const endpoint = this.configService.get<string>('AZURE_FORM_RECOGNIZER_ENDPOINT');
  const apiKey = this.configService.get<string>('AZURE_FORM_RECOGNIZER_API_KEY');

  if (!endpoint || !apiKey) {
    throw new Error('Azure Form Recognizer not properly configured');
  }

  this.documentAnalysisClient = new DocumentAnalysisClient(
    endpoint,
    new AzureKeyCredential(apiKey)
  );
}

Processing the Form Recognizer Response

private processAnalyzeResult(analyzeResult: any): DocumentParseResult {
  const pages: ParsedPage[] = [];
  const tables: ParsedTable[] = [];
  let content = '';

  // Process pages
  for (const page of analyzeResult.pages || []) {
    const parsedPage: ParsedPage = {
      pageNumber: page.pageNumber,
      width: page.width,
      height: page.height,
      unit: page.unit,
      text: '',
      tables: [],
      elements: []
    };

    // Process content and elements
    // ...implementation details...

    pages.push(parsedPage);
  }

  // Process tables
  for (const table of analyzeResult.tables || []) {
    const parsedTable: ParsedTable = {
      rowCount: table.rowCount,
      columnCount: table.columnCount,
      cells: table.cells.map(cell => ({
        rowIndex: cell.rowIndex,
        columnIndex: cell.columnIndex,
        rowSpan: cell.rowSpan || 1,
        columnSpan: cell.columnSpan || 1,
        content: cell.content,
        isHeader: cell.kind === 'columnHeader' || cell.kind === 'rowHeader'
      }))
    };

    tables.push(parsedTable);
  }

  // Generate full content by combining all pages
  content = pages.map(page => page.text).join('\n\n');

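  return {
    content,
    pages,
    tables,
    // NOTE: the SDK's AnalyzeResult type declares neither `metadata` nor `documentId`,
    // so these resolve to {} and undefined unless the caller attaches them beforehand.
    metadata: analyzeResult.metadata || {},
    documentId: analyzeResult.documentId
  };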
}
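
The per-page step elided in the listing above could be sketched as follows, assuming each page exposes its recognized text as `lines` with a `content` string and each table carries `boundingRegions` with a `pageNumber`, as the Form Recognizer response does. The helper names `pageText` and `tablesByPage` and the `*Like` types are hypothetical; the service's real logic is not shown in the source.

```typescript
// Minimal structural types covering only the fields this sketch reads.
interface LineLike { content: string; }
interface PageLike { pageNumber: number; lines?: LineLike[]; }
interface TableLike { boundingRegions?: { pageNumber: number }[]; }

// Join a page's recognized lines into a single text block.
function pageText(page: PageLike): string {
  return (page.lines ?? []).map(line => line.content).join('\n');
}

// Group tables by the page of their first bounding region.
// Tables without a bounding region are skipped.
function tablesByPage<T extends TableLike>(tables: T[]): Map<number, T[]> {
  const byPage = new Map<number, T[]>();
  for (const table of tables) {
    const pageNumber = table.boundingRegions?.[0]?.pageNumber;
    if (pageNumber === undefined) continue;
    const bucket = byPage.get(pageNumber) ?? [];
    bucket.push(table);
    byPage.set(pageNumber, bucket);
  }
  return byPage;
}
```

With these two helpers, `parsedPage.text` could be filled from `pageText(page)` and `parsedPage.tables` from `tablesByPage(...).get(page.pageNumber)`. A table that spans several pages is attributed only to its first page in this sketch.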

Table Formatting

private formatTableAsMarkdown(table: ParsedTable): string {
  // Implementation for converting the table to Markdown format
  // ...
}

private formatTableAsCsv(table: ParsedTable): string {
  // Implementation for converting the table to CSV format
  // ...
}

private formatTableAsJson(table: ParsedTable): string {
  // Implementation for converting the table to JSON format
  // ...
}
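
The formatter bodies are elided in the source. A minimal sketch of how the Markdown and CSV variants might work, assuming the `ParsedTable` fields shown in `processAnalyzeResult`. Treating the first row as the Markdown header, ignoring row and column spans, and the `toGrid` helper are all assumptions of this sketch.

```typescript
// Subset of the ParsedTable shape used by the formatters (assumed from the mapping above).
interface ParsedCell { rowIndex: number; columnIndex: number; content: string; }
interface ParsedTable { rowCount: number; columnCount: number; cells: ParsedCell[]; }

// Lay the flat cell list out as a rowCount x columnCount grid; missing cells stay empty.
function toGrid(table: ParsedTable): string[][] {
  const grid = Array.from({ length: table.rowCount }, () =>
    new Array<string>(table.columnCount).fill(''));
  for (const cell of table.cells) {
    const row = grid[cell.rowIndex];
    if (row && cell.columnIndex < table.columnCount) {
      row[cell.columnIndex] = cell.content;
    }
  }
  return grid;
}

function formatTableAsMarkdown(table: ParsedTable): string {
  // Pipes would break the column layout and newlines would break the row.
  const escape = (s: string) => s.replace(/\|/g, '\\|').replace(/\r?\n/g, ' ');
  const rows = toGrid(table).map(r => `| ${r.map(escape).join(' | ')} |`);
  if (rows.length === 0) return '';
  const separator = `| ${new Array<string>(table.columnCount).fill('---').join(' | ')} |`;
  // First row is used as the header row (assumption).
  return [rows[0], separator, ...rows.slice(1)].join('\n');
}

function formatTableAsCsv(table: ParsedTable): string {
  // Quote fields containing commas, quotes, or line breaks; double embedded quotes.
  const quote = (s: string) => (/[",\r\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s);
  return toGrid(table).map(r => r.map(quote).join(',')).join('\n');
}
```

Building the grid first keeps both formatters independent of the order in which cells arrive and makes gaps in the cell list show up as empty fields rather than shifted columns.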

Integration with Other Services

The DocumentParseService integrates with:

- BlobService: To retrieve documents for parsing
- RAG Module: For document processing and indexing
- Azure OpenAI: For further processing of extracted content

Error Handling

The service includes specialized error handling for:

- Service unavailability: When Azure Form Recognizer is unavailable
- Rate limiting: When API rate limits are exceeded
- Document format issues: When documents are in unsupported formats
- Parsing failures: When the service cannot parse a document
- Timeout errors: When a parse operation takes too long
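
Rate limiting and temporary unavailability are usually handled by retrying with a growing delay. A minimal sketch of that pattern, assuming thrown errors carry an HTTP `statusCode` as Azure SDK REST errors typically do. The helpers `isRetryable` and `withRetry` and their default values are hypothetical. The Azure SDK also applies its own retry policy internally, so the service's actual handling may look different.

```typescript
// Assumption: 429 (rate limited) and 5xx (service unavailable) are worth retrying.
function isRetryable(error: unknown): boolean {
  const status = (error as { statusCode?: number } | null)?.statusCode;
  return status === 429 || (status !== undefined && status >= 500);
}

// Run `operation`, retrying retryable failures with exponential backoff.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 500,
  // Injectable so callers (and tests) can avoid real waiting.
  sleep: (ms: number) => Promise<void> =
    ms => new Promise<void>(resolve => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await operation();
    } catch (error) {
      // Give up on the last attempt or on errors that a retry cannot fix.
      if (attempt >= maxAttempts || !isRetryable(error)) throw error;
      // Delay doubles each time: baseDelayMs, 2x, 4x, ...
      await sleep(baseDelayMs * 2 ** (attempt - 1));
    }
  }
}
```

Format and parsing failures are deliberately not retried here, since sending the same document again would fail the same way.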

Logging

The service uses NestJS Logger for detailed logging:

private readonly logger = new Logger(DocumentParseService.name);

// Usage examples
this.logger.log(`Parsing document with ID: ${documentId}`);
this.logger.debug(`Using model: ${modelId}`);
this.logger.error(`Error parsing document: ${error.message}`, error.stack);

Configuration

Required environment variables:

AZURE_FORM_RECOGNIZER_ENDPOINT=https://your-instance.cognitiveservices.azure.com/
AZURE_FORM_RECOGNIZER_API_KEY=your-api-key

Optional configuration:

AZURE_FORM_RECOGNIZER_DEFAULT_MODEL=prebuilt-document
AZURE_FORM_RECOGNIZER_TIMEOUT_MS=120000
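
A minimal sketch of how these variables might be resolved into one settings object, assuming the optional values shown above are also the fallbacks when a variable is unset. The `prebuilt-document` default is documented for `parseDocument`; treating 120000 ms as the default timeout is an assumption. The helper `loadFormRecognizerSettings` and the `FormRecognizerSettings` type are hypothetical.

```typescript
interface FormRecognizerSettings {
  endpoint: string;
  apiKey: string;
  defaultModel: string;
  timeoutMs: number;
}

// Resolve settings from an environment-like map, e.g. process.env.
function loadFormRecognizerSettings(
  env: Record<string, string | undefined>,
): FormRecognizerSettings {
  const endpoint = env['AZURE_FORM_RECOGNIZER_ENDPOINT'];
  const apiKey = env['AZURE_FORM_RECOGNIZER_API_KEY'];
  if (!endpoint || !apiKey) {
    // Same failure mode as the service constructor.
    throw new Error('Azure Form Recognizer not properly configured');
  }
  // Assumed fallback: 120000 ms when the variable is unset.
  const timeoutMs = Number(env['AZURE_FORM_RECOGNIZER_TIMEOUT_MS'] ?? 120000);
  if (!Number.isFinite(timeoutMs) || timeoutMs <= 0) {
    throw new Error('AZURE_FORM_RECOGNIZER_TIMEOUT_MS must be a positive number');
  }
  return {
    endpoint,
    apiKey,
    defaultModel: env['AZURE_FORM_RECOGNIZER_DEFAULT_MODEL'] ?? 'prebuilt-document',
    timeoutMs,
  };
}
```

Validating the timeout at startup surfaces a malformed value immediately instead of at the first parse call.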