Document Parse Service (src/azure/document-parse.service.ts)
Overview
The DocumentParseService extracts text and structure from documents in several formats using Azure's Form Recognizer service. Within the document processing pipeline, it converts raw documents into structured data (text content, pages, tables, and metadata) that the BidScript platform can analyze and process.
Dependencies
import { Injectable, Logger } from '@nestjs/common';
import { ConfigService } from '@nestjs/config';
import { AzureKeyCredential, DocumentAnalysisClient } from '@azure/ai-form-recognizer';
import { BlobService } from './blob.service';
import { DocumentParseResult, ParsedTable, ParsedPage } from '../types/interfaces/parse.interface';
Key Features
- Document text extraction with layout preservation
- Table detection and structured extraction
- Key-value pair recognition
- Support for multiple document formats (PDF, DOCX, PPTX, images)
- Handling of large documents through pagination
- OCR capabilities for scanned documents
- Structure preservation with formatting information
- Document metadata extraction
Core Methods
parseDocument
async parseDocument(
  documentId: string,
  options?: {
    modelId?: string;
    containerName?: string;
    includeRawResponse?: boolean;
  }
): Promise<DocumentParseResult>
Parses a document using Azure Form Recognizer.
Parameters:
- documentId: The ID of the document to parse
- options: Parsing options
  - modelId: The Form Recognizer model to use (default: 'prebuilt-document')
  - containerName: The container name for the document
  - includeRawResponse: Whether to include the raw API response
Returns:
- DocumentParseResult containing:
  - documentId: The original document ID
  - content: Extracted text content
  - pages: Array of parsed pages with text, tables, and elements
  - tables: Array of detected tables
  - metadata: Document metadata
  - rawResponse: (optional) Raw API response
Example:
const parseResult = await documentParseService.parseDocument('invoice-123');
console.log(`Parsed ${parseResult.pages.length} pages with ${parseResult.tables.length} tables`);
console.log(`Content length: ${parseResult.content.length}`);
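The field list above suggests roughly the following result shape. This is a sketch inferred from this page only: the real interfaces live in ../types/interfaces/parse.interface and may differ, the ParsedCell name is an assumption, and summarizeParseResult is an illustrative helper that is not part of the service.

```typescript
// Hypothetical shapes inferred from the field lists on this page; the actual
// definitions in ../types/interfaces/parse.interface may differ.
interface ParsedCell {
  rowIndex: number;
  columnIndex: number;
  rowSpan: number;
  columnSpan: number;
  content: string;
  isHeader: boolean;
}

interface ParsedTable {
  rowCount: number;
  columnCount: number;
  cells: ParsedCell[];
}

interface ParsedPage {
  pageNumber: number;
  width: number;
  height: number;
  unit: string;
  text: string;
  tables: ParsedTable[];
  elements: unknown[];
}

interface DocumentParseResult {
  documentId: string;
  content: string;
  pages: ParsedPage[];
  tables: ParsedTable[];
  metadata: Record<string, unknown>;
  rawResponse?: unknown; // only present when includeRawResponse is set
}

// Illustrative helper (not part of the service): one-line summary of a result.
function summarizeParseResult(result: DocumentParseResult): string {
  return (
    `${result.documentId}: ${result.pages.length} page(s), ` +
    `${result.tables.length} table(s), ${result.content.length} characters`
  );
}
```

Typing rawResponse as optional keeps callers from relying on it unless they explicitly asked for it.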
parseDocumentFromUrl
async parseDocumentFromUrl(
  documentUrl: string,
  options?: {
    modelId?: string;
    includeRawResponse?: boolean;
  }
): Promise<DocumentParseResult>
Parses a document using a publicly accessible URL.
Parameters:
- documentUrl: The URL of the document to parse
- options: Parsing options
  - modelId: The Form Recognizer model to use
  - includeRawResponse: Whether to include the raw API response
Returns:
- DocumentParseResult as described above
Example:
const url = 'https://example.com/documents/contract.pdf';
const parseResult = await documentParseService.parseDocumentFromUrl(url);
parseDocumentContent
async parseDocumentContent(
  content: Buffer,
  options?: {
    filename?: string;
    contentType?: string;
    modelId?: string;
    includeRawResponse?: boolean;
  }
): Promise<DocumentParseResult>
Parses document content provided as a buffer.
Parameters:
- content: The document content as a Buffer
- options: Parsing options
  - filename: Optional filename for mime-type detection
  - contentType: The content type of the document
  - modelId: The Form Recognizer model to use
  - includeRawResponse: Whether to include the raw API response
Returns:
- DocumentParseResult as described above
Example:
import * as fs from 'fs';
const fileBuffer = fs.readFileSync('contract.pdf');
const parseResult = await documentParseService.parseDocumentContent(fileBuffer, {
  filename: 'contract.pdf',
  contentType: 'application/pdf'
});
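The filename option is described as an input to mime-type detection. One way such a lookup could work is sketched below, assuming a simple extension table and that an explicit contentType takes precedence. The function name resolveContentType, the table contents, and the fallback value are illustrative assumptions, not taken from the service.

```typescript
// Illustrative extension-to-MIME table; the service's real mapping is not
// shown in this document. A Map avoids accidental hits on Object.prototype keys.
const CONTENT_TYPES = new Map<string, string>([
  ['pdf', 'application/pdf'],
  ['docx', 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'],
  ['pptx', 'application/vnd.openxmlformats-officedocument.presentationml.presentation'],
  ['png', 'image/png'],
  ['jpg', 'image/jpeg'],
  ['jpeg', 'image/jpeg'],
  ['tif', 'image/tiff'],
  ['tiff', 'image/tiff'],
  ['bmp', 'image/bmp'],
]);

const FALLBACK_CONTENT_TYPE = 'application/octet-stream';

// Hypothetical helper: prefer an explicit contentType, otherwise guess from
// the filename extension, otherwise fall back to a generic binary type.
function resolveContentType(filename?: string, contentType?: string): string {
  if (contentType) return contentType;
  if (!filename) return FALLBACK_CONTENT_TYPE;
  const dot = filename.lastIndexOf('.');
  if (dot < 0) return FALLBACK_CONTENT_TYPE;
  const extension = filename.slice(dot + 1).toLowerCase();
  return CONTENT_TYPES.get(extension) ?? FALLBACK_CONTENT_TYPE;
}
```

Passing both filename and contentType, as in the example above, would make the explicit value win under this sketch.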
extractTablesFromDocument
async extractTablesFromDocument(
  documentId: string,
  options?: {
    containerName?: string;
    format?: 'markdown' | 'csv' | 'json';
  }
): Promise<{
  tables: string[];
  rawTables: ParsedTable[];
}>
Extracts tables from a document.
Parameters:
- documentId: The ID of the document
- options: Extraction options
  - containerName: The container name
  - format: Output format for tables (default: 'markdown')
Returns:
- Object containing:
  - tables: Array of formatted table strings
  - rawTables: Array of structured table objects
Example:
import * as fs from 'fs';
// Extract tables as CSV and write each one to its own file
const tableResults = await documentParseService.extractTablesFromDocument('financial-report', {
  format: 'csv'
});
console.log(`Extracted ${tableResults.tables.length} tables`);
tableResults.tables.forEach((table, index) => {
  fs.writeFileSync(`table-${index}.csv`, table);
});
Implementation Details
Azure Form Recognizer Client Initialization
private documentAnalysisClient: DocumentAnalysisClient;
constructor(
  private configService: ConfigService,
  private blobService: BlobService
) {
  const endpoint = this.configService.get<string>('AZURE_FORM_RECOGNIZER_ENDPOINT');
  const apiKey = this.configService.get<string>('AZURE_FORM_RECOGNIZER_API_KEY');
  if (!endpoint || !apiKey) {
    throw new Error('Azure Form Recognizer not properly configured');
  }
  this.documentAnalysisClient = new DocumentAnalysisClient(
    endpoint,
    new AzureKeyCredential(apiKey)
  );
}
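The public parse methods presumably hand the document to this client and wait for the analysis to finish. The sketch below shows that flow for a buffer, assuming the client's beginAnalyzeDocument call and the poller's pollUntilDone method. To stay runnable without the Azure package, it uses minimal structural stand-ins (AnalysisClientLike, PollerLike) that list only the members used here; those interface names and the analyzeBuffer helper are hypothetical.

```typescript
// Minimal structural stand-ins for the SDK types so the sketch runs without
// @azure/ai-form-recognizer installed. Only the members used here are listed.
interface PollerLike<T> {
  pollUntilDone(): Promise<T>;
}

interface AnalysisClientLike {
  beginAnalyzeDocument(modelId: string, document: Uint8Array): Promise<PollerLike<unknown>>;
}

// Default model named in the parseDocument parameter list.
const DEFAULT_MODEL_ID = 'prebuilt-document';

// Hypothetical helper: start an analysis and wait for the long-running
// operation to complete, returning the raw analyze result.
async function analyzeBuffer(
  client: AnalysisClientLike,
  content: Uint8Array, // a Node Buffer is also a Uint8Array
  modelId: string = DEFAULT_MODEL_ID,
): Promise<unknown> {
  const poller = await client.beginAnalyzeDocument(modelId, content);
  return poller.pollUntilDone();
}
```

Accepting the client as a parameter also makes the flow easy to exercise with a fake in unit tests.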
Processing the Form Recognizer Response
private processAnalyzeResult(analyzeResult: any): DocumentParseResult {
  const pages: ParsedPage[] = [];
  const tables: ParsedTable[] = [];
  let content = '';
  // Process pages
  for (const page of analyzeResult.pages || []) {
    const parsedPage: ParsedPage = {
      pageNumber: page.pageNumber,
      width: page.width,
      height: page.height,
      unit: page.unit,
      text: '',
      tables: [],
      elements: []
    };
    // Process content and elements
    // ...implementation details...
    pages.push(parsedPage);
  }
  // Process tables
  for (const table of analyzeResult.tables || []) {
    const parsedTable: ParsedTable = {
      rowCount: table.rowCount,
      columnCount: table.columnCount,
      // Explicit `any` keeps this compiling under noImplicitAny
      cells: table.cells.map((cell: any) => ({
        rowIndex: cell.rowIndex,
        columnIndex: cell.columnIndex,
        rowSpan: cell.rowSpan || 1,
        columnSpan: cell.columnSpan || 1,
        content: cell.content,
        isHeader: cell.kind === 'columnHeader' || cell.kind === 'rowHeader'
      }))
    };
    tables.push(parsedTable);
  }
  // Generate full content by combining all pages
  content = pages.map(page => page.text).join('\n\n');
  return {
    content,
    pages,
    tables,
    metadata: analyzeResult.metadata || {},
    documentId: analyzeResult.documentId
  };
}
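The listing above elides how each page's text is filled in. One plausible approach, assuming each page in the response exposes a lines array whose entries carry a content string, is to join the lines per page and then join the pages with blank lines, as the final step of the method does. PageLike, LineLike, pageText, and documentText are illustrative names, not the service's own.

```typescript
// Minimal stand-ins for the parts of the response this sketch reads.
interface LineLike {
  content: string;
}

interface PageLike {
  pageNumber: number;
  lines?: LineLike[]; // treated as optional; a page may have no text lines
}

// Hypothetical helper: text of a single page, one line per detected line.
function pageText(page: PageLike): string {
  return (page.lines ?? []).map(line => line.content).join('\n');
}

// Hypothetical helper: full document text, pages separated by a blank line,
// matching the '\n\n' join used when the full content is generated.
function documentText(pages: PageLike[]): string {
  return pages.map(page => pageText(page)).join('\n\n');
}
```

A page without lines contributes an empty string, so page boundaries stay visible in the combined content.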
Table Formatting
private formatTableAsMarkdown(table: ParsedTable): string {
  // Implementation for converting the table to Markdown format
  // ...
}
private formatTableAsCsv(table: ParsedTable): string {
  // Implementation for converting the table to CSV format
  // ...
}
private formatTableAsJson(table: ParsedTable): string {
  // Implementation for converting the table to JSON format
  // ...
}
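The formatter bodies are elided above. As a rough sketch of what the Markdown and CSV conversions could look like, the standalone functions below first lay the cells out on a row-by-column grid and then render it. They are not the service's implementation: the ParsedCell name, the toGrid helper, the choice to treat the first row as the Markdown header, and the handling of merged cells (content is placed in the top-left position only) are all assumptions.

```typescript
// Shapes mirror the fields populated in processAnalyzeResult.
interface ParsedCell {
  rowIndex: number;
  columnIndex: number;
  rowSpan: number;
  columnSpan: number;
  content: string;
  isHeader: boolean;
}

interface ParsedTable {
  rowCount: number;
  columnCount: number;
  cells: ParsedCell[];
}

// Hypothetical helper: place each cell's content at its top-left position.
// Positions covered only by a span, or missing entirely, stay empty.
function toGrid(table: ParsedTable): string[][] {
  const grid: string[][] = [];
  for (let r = 0; r < table.rowCount; r++) {
    grid.push(new Array<string>(table.columnCount).fill(''));
  }
  for (const cell of table.cells) {
    const row = grid[cell.rowIndex];
    if (row && cell.columnIndex >= 0 && cell.columnIndex < table.columnCount) {
      row[cell.columnIndex] = cell.content;
    }
  }
  return grid;
}

// Sketch: first grid row becomes the Markdown header row.
function formatTableAsMarkdown(table: ParsedTable): string {
  const grid = toGrid(table);
  const header = grid[0];
  if (!header) return '';
  // Escape pipes and flatten line breaks so cells cannot break the table.
  const escapeCell = (text: string) => text.replace(/\|/g, '\\|').replace(/\r?\n/g, ' ');
  const toRow = (cells: string[]) => `| ${cells.map(c => escapeCell(c)).join(' | ')} |`;
  const separator = `| ${header.map(() => '---').join(' | ')} |`;
  return [toRow(header), separator, ...grid.slice(1).map(r => toRow(r))].join('\n');
}

// Sketch: quote fields containing commas, quotes, or line breaks.
function formatTableAsCsv(table: ParsedTable): string {
  const quote = (text: string) =>
    /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
  return toGrid(table)
    .map(row => row.map(c => quote(c)).join(','))
    .join('\n');
}
```

Building the grid first keeps both formatters independent of the order in which cells arrive.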
Integration with Other Services
The DocumentParseService integrates with:
- BlobService: To retrieve documents for parsing
- RAG Module: For document processing and indexing
- Azure OpenAI: For further processing of extracted content
Error Handling
The service includes specialized error handling for:
- Service unavailability: When Azure Form Recognizer is unavailable
- Rate limiting: When API rate limits are exceeded
- Document format issues: When documents are in unsupported formats
- Parsing failures: When the service cannot parse a document
- Timeout errors: When a parse operation takes too long
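The error categories above could be handled with a small classifier plus a retry wrapper, sketched here under stated assumptions: that failures surface as error objects carrying an HTTP statusCode, that 429, 503, and 408/504 map to rate limiting, unavailability, and timeouts, and that only those three are worth retrying. The status mapping, the names classifyError, backoffDelayMs, and withRetry, and the delay values are illustrative, not the service's actual behavior.

```typescript
type ParseErrorKind = 'rate-limited' | 'unavailable' | 'timeout' | 'unsupported-format' | 'unknown';

// Illustrative mapping from an HTTP status on the error to a category.
function classifyError(err: unknown): ParseErrorKind {
  const status = (err as { statusCode?: number } | null | undefined)?.statusCode;
  if (status === 429) return 'rate-limited';
  if (status === 503) return 'unavailable';
  if (status === 408 || status === 504) return 'timeout';
  if (status === 415) return 'unsupported-format';
  return 'unknown';
}

// Exponential backoff: baseMs, 2x, 4x, ... capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 8000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

const defaultSleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

// Hypothetical wrapper: retry transient failures, rethrow everything else.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts = 3,
  sleep: (ms: number) => Promise<void> = defaultSleep,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      const kind = classifyError(err);
      const retryable = kind === 'rate-limited' || kind === 'unavailable' || kind === 'timeout';
      // Give up on non-transient errors or once the attempt budget is spent.
      if (!retryable || attempt + 1 >= maxAttempts) throw err;
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

Format and parsing failures are rethrown immediately in this sketch, since repeating the same request would not change the outcome.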
Logging
The service uses NestJS Logger for detailed logging:
private readonly logger = new Logger(DocumentParseService.name);
// Usage examples
this.logger.log(`Parsing document with ID: ${documentId}`);
this.logger.debug(`Using model: ${modelId}`);
this.logger.error(`Error parsing document: ${error.message}`, error.stack);
Configuration
Required environment variables:
AZURE_FORM_RECOGNIZER_ENDPOINT=https://your-instance.cognitiveservices.azure.com/
AZURE_FORM_RECOGNIZER_API_KEY=your-api-key
Optional configuration: