Document Parse Service (src/azure/document-parse.service.ts)

Overview

The DocumentParseService extracts text and structure from documents in several formats using Azure Form Recognizer. Within the document processing pipeline, it converts raw documents into structured data that the BidScript platform can analyze and process.

Dependencies

import { Injectable, Logger } from '@nestjs/common';
import { ConfigService } from '@nestjs/config';
import { AzureKeyCredential, DocumentAnalysisClient } from '@azure/ai-form-recognizer';
import { BlobService } from './blob.service';
import { DocumentParseResult, ParsedTable, ParsedPage } from '../types/interfaces/parse.interface';

Key Features

  • Document text extraction with layout preservation
  • Table detection and structured extraction
  • Key-value pair recognition
  • Support for multiple document formats (PDF, DOCX, PPTX, images)
  • Handling of large documents through pagination
  • OCR capabilities for scanned documents
  • Structure preservation with formatting information
  • Document metadata extraction

Core Methods

parseDocument

async parseDocument(
  documentId: string,
  options?: {
    modelId?: string;
    containerName?: string;
    includeRawResponse?: boolean;
  }
): Promise<DocumentParseResult>

Parses a document using Azure Form Recognizer.

Parameters:

  • documentId: The ID of the document to parse
  • options: Parsing options
      • modelId: The Form Recognizer model to use (default: 'prebuilt-document')
      • containerName: The container name for the document
      • includeRawResponse: Whether to include the raw API response

Returns:

  • DocumentParseResult containing:
      • documentId: The original document ID
      • content: Extracted text content
      • pages: Array of parsed pages with text, tables, and elements
      • tables: Array of detected tables
      • metadata: Document metadata
      • rawResponse: (optional) Raw API response

Example:

const parseResult = await documentParseService.parseDocument('invoice-123');
console.log(`Parsed ${parseResult.pages.length} pages with ${parseResult.tables.length} tables`);
console.log(`Content length: ${parseResult.content.length}`);
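
The interfaces imported from parse.interface are not reproduced on this page. As a rough sketch, the shapes below are inferred from the fields that the return description and processAnalyzeResult mention. The ParsedCell name, the element type, the metadata type, and the summarize helper are assumptions for illustration, not the actual definitions.

```typescript
// Hypothetical shapes inferred from the fields used on this page;
// the real definitions live in ../types/interfaces/parse.interface.
interface ParsedCell {            // interface name is an assumption
  rowIndex: number;
  columnIndex: number;
  rowSpan: number;
  columnSpan: number;
  content: string;
  isHeader: boolean;
}

interface ParsedTable {
  rowCount: number;
  columnCount: number;
  cells: ParsedCell[];
}

interface ParsedPage {
  pageNumber: number;
  width: number;
  height: number;
  unit: string;
  text: string;
  tables: ParsedTable[];
  elements: unknown[];            // element shape is not documented here
}

interface DocumentParseResult {
  documentId: string;
  content: string;
  pages: ParsedPage[];
  tables: ParsedTable[];
  metadata: Record<string, unknown>;
  rawResponse?: unknown;          // present only when includeRawResponse is set
}

// Illustrative helper (not part of the service): one-line summary of a result.
function summarize(result: DocumentParseResult): string {
  const cellCount = result.tables.reduce((n, t) => n + t.cells.length, 0);
  return `${result.documentId}: ${result.pages.length} page(s), ` +
    `${result.tables.length} table(s), ${cellCount} cell(s)`;
}
```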

parseDocumentFromUrl

async parseDocumentFromUrl(
  documentUrl: string,
  options?: {
    modelId?: string;
    includeRawResponse?: boolean;
  }
): Promise<DocumentParseResult>

Parses a document from a URL. The URL must be reachable by the Azure service, for example a public link or a blob URL carrying a SAS token.

Parameters:

  • documentUrl: The URL of the document to parse
  • options: Parsing options
      • modelId: The Form Recognizer model to use
      • includeRawResponse: Whether to include the raw API response

Returns:

  • DocumentParseResult as described above

Example:

const url = 'https://example.com/documents/contract.pdf';
const parseResult = await documentParseService.parseDocumentFromUrl(url);

parseDocumentContent

async parseDocumentContent(
  content: Buffer,
  options?: {
    filename?: string;
    contentType?: string;
    modelId?: string;
    includeRawResponse?: boolean;
  }
): Promise<DocumentParseResult>

Parses document content provided as a buffer.

Parameters:

  • content: The document content as a Buffer
  • options: Parsing options
      • filename: Optional filename for mime-type detection
      • contentType: The content type of the document
      • modelId: The Form Recognizer model to use
      • includeRawResponse: Whether to include the raw API response

Returns:

  • DocumentParseResult as described above

Example:

import * as fs from 'fs';

const fileBuffer = fs.readFileSync('contract.pdf');
const parseResult = await documentParseService.parseDocumentContent(fileBuffer, {
  filename: 'contract.pdf',
  contentType: 'application/pdf'
});

extractTablesFromDocument

async extractTablesFromDocument(
  documentId: string,
  options?: {
    containerName?: string;
    format?: 'markdown' | 'csv' | 'json';
  }
): Promise<{
  tables: string[];
  rawTables: ParsedTable[];
}>

Extracts tables from a document.

Parameters:

  • documentId: The ID of the document
  • options: Extraction options
      • containerName: The container name
      • format: Output format for tables (default: 'markdown')

Returns:

  • Object containing:
      • tables: Array of formatted table strings
      • rawTables: Array of structured table objects

Example:

import * as fs from 'fs';

// Extract tables as CSV and write each one to its own file
const tableResults = await documentParseService.extractTablesFromDocument('financial-report', {
  format: 'csv'
});

console.log(`Extracted ${tableResults.tables.length} tables`);
tableResults.tables.forEach((table, index) => {
  fs.writeFileSync(`table-${index}.csv`, table);
});

Implementation Details

Azure Form Recognizer Client Initialization

private documentAnalysisClient: DocumentAnalysisClient;

constructor(
  private configService: ConfigService,
  private blobService: BlobService
) {
  const endpoint = this.configService.get<string>('AZURE_FORM_RECOGNIZER_ENDPOINT');
  const apiKey = this.configService.get<string>('AZURE_FORM_RECOGNIZER_API_KEY');

  if (!endpoint || !apiKey) {
    throw new Error('Azure Form Recognizer not properly configured');
  }

  this.documentAnalysisClient = new DocumentAnalysisClient(
    endpoint,
    new AzureKeyCredential(apiKey)
  );
}
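
The page does not show how the parse methods call the client. As a minimal sketch, assuming the service uses the SDK's beginAnalyzeDocument and beginAnalyzeDocumentFromUrl methods and waits on the returned poller, the call could look like the following. The AnalysisPollerLike and AnalysisClientLike interfaces are local stand-ins for the SDK types so the snippet is self-contained, and the analyze function name is hypothetical.

```typescript
// Structural stand-ins for the SDK's poller and DocumentAnalysisClient,
// reduced to the members this sketch calls.
interface AnalysisPollerLike<T> {
  pollUntilDone(): Promise<T>;
}

interface AnalysisClientLike<T> {
  beginAnalyzeDocument(modelId: string, document: Uint8Array): Promise<AnalysisPollerLike<T>>;
  beginAnalyzeDocumentFromUrl(modelId: string, documentUrl: string): Promise<AnalysisPollerLike<T>>;
}

// Hypothetical helper: start an analysis from bytes or from a URL,
// then wait for the long-running operation to finish.
async function analyze<T>(
  client: AnalysisClientLike<T>,
  source: Uint8Array | string,
  modelId = 'prebuilt-document',
): Promise<T> {
  const poller =
    typeof source === 'string'
      ? await client.beginAnalyzeDocumentFromUrl(modelId, source)
      : await client.beginAnalyzeDocument(modelId, source);
  return poller.pollUntilDone();
}
```

Because the stand-in interfaces are structural, a real DocumentAnalysisClient instance should satisfy them for these two calls, and a fake can be substituted in unit tests.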

Processing the Form Recognizer Response

private processAnalyzeResult(analyzeResult: any): DocumentParseResult {
  const pages: ParsedPage[] = [];
  const tables: ParsedTable[] = [];
  let content = '';

  // Process pages
  for (const page of analyzeResult.pages || []) {
    const parsedPage: ParsedPage = {
      pageNumber: page.pageNumber,
      width: page.width,
      height: page.height,
      unit: page.unit,
      text: '',
      tables: [],
      elements: []
    };

    // Process content and elements
    // ...implementation details...

    pages.push(parsedPage);
  }

  // Process tables
  for (const table of analyzeResult.tables || []) {
    const parsedTable: ParsedTable = {
      rowCount: table.rowCount,
      columnCount: table.columnCount,
      cells: table.cells.map(cell => ({
        rowIndex: cell.rowIndex,
        columnIndex: cell.columnIndex,
        rowSpan: cell.rowSpan || 1,
        columnSpan: cell.columnSpan || 1,
        content: cell.content,
        isHeader: cell.kind === 'columnHeader' || cell.kind === 'rowHeader'
      }))
    };

    tables.push(parsedTable);
  }

  // Generate full content by combining all pages
  content = pages.map(page => page.text).join('\n\n');

  return {
    content,
    pages,
    tables,
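    // Caveat: `metadata` and `documentId` are not fields of the SDK's
    // AnalyzeResult type; unless the caller attaches them, these resolve
    // to {} and undefined respectively.
    metadata: analyzeResult.metadata || {},
    documentId: analyzeResult.documentId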
  };
}

Table Formatting

private formatTableAsMarkdown(table: ParsedTable): string {
  // Implementation for converting the table to Markdown format
  // ...
}

private formatTableAsCsv(table: ParsedTable): string {
  // Implementation for converting the table to CSV format
  // ...
}

private formatTableAsJson(table: ParsedTable): string {
  // Implementation for converting the table to JSON format
  // ...
}
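
The formatter bodies are elided above. One possible approach, sketched here under stated assumptions, is to lay the flat cell list out as a grid and then serialize it. The local ParsedCell and ParsedTable types, the toGrid helper, the choice to treat the first row as the Markdown header, and the handling of spanned cells (content in the top-left position only) are all assumptions, not the service's actual behavior.

```typescript
// Local stand-ins for the types in parse.interface (assumed shapes).
interface ParsedCell {
  rowIndex: number;
  columnIndex: number;
  content: string;
}

interface ParsedTable {
  rowCount: number;
  columnCount: number;
  cells: ParsedCell[];
}

// Hypothetical helper: place each cell into a rowCount x columnCount grid.
// Positions without a cell (for example, covered by a span) stay empty.
function toGrid(table: ParsedTable): string[][] {
  const grid = Array.from({ length: table.rowCount }, () =>
    new Array<string>(table.columnCount).fill(''),
  );
  for (const cell of table.cells) {
    const row = grid[cell.rowIndex];
    if (row && cell.columnIndex < table.columnCount) {
      row[cell.columnIndex] = cell.content;
    }
  }
  return grid;
}

function formatTableAsMarkdown(table: ParsedTable): string {
  const grid = toGrid(table);
  if (grid.length === 0) return '';
  // Escape pipes and flatten line breaks so a cell cannot break its row.
  const esc = (s: string) => s.replace(/\|/g, '\\|').replace(/\r?\n/g, ' ');
  const rows = grid.map(r => `| ${r.map(esc).join(' | ')} |`);
  const separator = `| ${grid[0].map(() => '---').join(' | ')} |`;
  // The first row is treated as the header row.
  return [rows[0], separator, ...rows.slice(1)].join('\n');
}

function formatTableAsCsv(table: ParsedTable): string {
  // Quote fields that contain commas, quotes, or line breaks (RFC 4180 style).
  const esc = (s: string) => (/[",\r\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s);
  return toGrid(table)
    .map(r => r.map(esc).join(','))
    .join('\n');
}
```

Building the grid first keeps the two serializers short and makes the escaping rules the only format-specific part.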

Integration with Other Services

The DocumentParseService integrates with:

  • BlobService: To retrieve documents for parsing
  • RAG Module: For document processing and indexing
  • Azure OpenAI: For further processing of extracted content

Error Handling

The service includes specialized error handling for:

  • Service unavailability: When Azure Form Recognizer is unavailable
  • Rate limiting: When API rate limits are exceeded
  • Document format issues: When documents are in unsupported formats
  • Parsing failures: When the service cannot parse a document
  • Timeout errors: When a parse operation takes too long
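
The handling code for these categories is not shown on this page. A minimal sketch of one common pattern follows: classify the failure, then retry only the transient kinds (rate limiting and unavailability) with exponential backoff. The classifyError and withRetry names, the ParseErrorKind union, and the status-code mapping are assumptions; in particular, mapping 400 and 415 to format problems is a guess. The statusCode property mirrors the field carried by the Azure SDK's RestError. The SDK pipeline may already retry some failures on its own, so this only illustrates the classification idea.

```typescript
// Hypothetical error categories matching the list above.
type ParseErrorKind =
  | 'rate-limit'
  | 'unavailable'
  | 'timeout'
  | 'unsupported-format'
  | 'parse-failure';

// Map an unknown error to a category (the mapping itself is an assumption).
function classifyError(err: unknown): ParseErrorKind {
  const e = err as { statusCode?: number; name?: string } | null | undefined;
  if (e?.statusCode === 429) return 'rate-limit';
  if (e?.statusCode === 503) return 'unavailable';
  if (e?.name === 'AbortError' || e?.statusCode === 408) return 'timeout';
  if (e?.statusCode === 400 || e?.statusCode === 415) return 'unsupported-format';
  return 'parse-failure';
}

// Retry transient failures with exponential backoff; rethrow everything else.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      const kind = classifyError(err);
      const retryable = kind === 'rate-limit' || kind === 'unavailable';
      if (!retryable || attempt >= maxAttempts) throw err;
      // Delay doubles on each attempt: baseDelayMs, 2x, 4x, ...
      const delayMs = baseDelayMs * 2 ** (attempt - 1);
      await new Promise<void>(resolve => setTimeout(resolve, delayMs));
    }
  }
}
```

A caller would wrap the analysis call, for example withRetry(() => service.parseDocument(id)), and log the classified kind when the final attempt fails.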

Logging

The service uses NestJS Logger for detailed logging:

private readonly logger = new Logger(DocumentParseService.name);

// Usage examples
this.logger.log(`Parsing document with ID: ${documentId}`);
this.logger.debug(`Using model: ${modelId}`);
this.logger.error(`Error parsing document: ${error.message}`, error.stack);

Configuration

Required environment variables:

AZURE_FORM_RECOGNIZER_ENDPOINT=https://your-instance.cognitiveservices.azure.com/
AZURE_FORM_RECOGNIZER_API_KEY=your-api-key

Optional configuration:

AZURE_FORM_RECOGNIZER_DEFAULT_MODEL=prebuilt-document
AZURE_FORM_RECOGNIZER_TIMEOUT_MS=120000
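
As an illustration of how these variables might be resolved at startup, the sketch below validates the required values and applies fallbacks for the optional ones. The resolveFormRecognizerConfig function and FormRecognizerConfig type are hypothetical, the env argument stands in for ConfigService or process.env, and treating the values listed above (prebuilt-document, 120000) as defaults is an assumption.

```typescript
interface FormRecognizerConfig {
  endpoint: string;
  apiKey: string;
  defaultModel: string;
  timeoutMs: number;
}

// Hypothetical helper: `env` stands in for ConfigService or process.env.
function resolveFormRecognizerConfig(
  env: Record<string, string | undefined>,
): FormRecognizerConfig {
  const endpoint = env['AZURE_FORM_RECOGNIZER_ENDPOINT'];
  const apiKey = env['AZURE_FORM_RECOGNIZER_API_KEY'];
  if (!endpoint || !apiKey) {
    // Same failure mode as the constructor shown earlier on this page.
    throw new Error('Azure Form Recognizer not properly configured');
  }

  // Optional values fall back to assumed defaults.
  const rawTimeout = env['AZURE_FORM_RECOGNIZER_TIMEOUT_MS'];
  const timeoutMs = rawTimeout === undefined ? 120000 : Number(rawTimeout);
  if (!Number.isFinite(timeoutMs) || timeoutMs <= 0) {
    throw new Error(`Invalid AZURE_FORM_RECOGNIZER_TIMEOUT_MS: ${rawTimeout}`);
  }

  return {
    endpoint,
    apiKey,
    defaultModel: env['AZURE_FORM_RECOGNIZER_DEFAULT_MODEL'] ?? 'prebuilt-document',
    timeoutMs,
  };
}
```

Parsing the timeout once at startup surfaces a malformed value immediately instead of at the first parse call.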