Feed Azure OCR JSON into the docling pipeline
#21, opened by jufrit
Hi,
I want to feed docling my original PDF (for visual analysis) together with my Azure OCR JSON (as the text source) and have it fuse the two.
Is that possible? Alternatively, can the Azure OCR JSON be parsed into a format docling accepts?
Thanks in advance.
You can use the docling library pipeline and implement a custom OCR model.
Something along these lines:
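import logging

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, ImageFormatOption, PdfFormatOption
from docling.models.factories import get_ocr_factory

log = logging.getLogger(__name__)

# TextractOcrModel and TextractPrecomputedOcrOptions are your own classes, not part of docling.
# Register the custom model with the OCR factory, using the same
# allow_external_plugins value as in the pipeline options below.
factory = get_ocr_factory(allow_external_plugins=False)
factory.register(TextractOcrModel, "custom", "textract_ocr_model")

pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    ocr_options=TextractPrecomputedOcrOptions(force_full_page_ocr=True),
    allow_external_plugins=False,
)
log.info("Creating DocumentConverter")
# This snippet lives inside a wrapper class; use a plain variable otherwise.
self.converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options),
        InputFormat.IMAGE: ImageFormatOption(pipeline_options=pipeline_options),
    }
)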
Just make sure that your custom class inherits from docling.models.base_ocr_model.BaseOcrModel, implements the constructor properly, and implements the __call__ method with the expected signature:
def __call__(
    self, conv_res: ConversionResult, page_batch: Iterable[Page]
) -> Iterable[Page]: ...
You will probably need to convert your OCR JSON manually into docling's Page class. Also watch out for differences in coordinate origin, axis scaling, and rotation, since the OCR output may not use the same frame of reference as docling.
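To illustrate the scaling part of that conversion, here is a minimal sketch in plain Python with no docling imports. It assumes the Azure JSON follows the Document Intelligence read/layout shape, where each page has "width", "height", and a "words" list whose entries carry "content", "polygon" (eight numbers, x/y pairs), and "confidence"; older Read API responses use different keys. WordBox and azure_page_to_word_boxes are hypothetical names for this example, not part of docling or the Azure SDK. Rotation (the page "angle" field) is not compensated here.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class WordBox:
    """Axis-aligned word box with a top-left origin (hypothetical helper type)."""
    text: str
    confidence: float
    l: float
    t: float
    r: float
    b: float


def azure_page_to_word_boxes(
    page: Dict[str, Any], target_width: float, target_height: float
) -> List[WordBox]:
    """Rescale Azure word polygons into the target page's coordinate space.

    Scaling by the ratio of page sizes avoids hard-coding the Azure unit
    (assumed to be whatever unit page["width"] and page["height"] use).
    """
    sx = target_width / page["width"]
    sy = target_height / page["height"]
    boxes: List[WordBox] = []
    for word in page.get("words", []):
        poly = word["polygon"]
        xs = poly[0::2]  # x coordinates of the four corners
        ys = poly[1::2]  # y coordinates of the four corners
        boxes.append(
            WordBox(
                text=word["content"],
                confidence=word.get("confidence", 1.0),
                l=min(xs) * sx,
                t=min(ys) * sy,
                r=max(xs) * sx,
                b=max(ys) * sy,
            )
        )
    return boxes
```

Inside your __call__ you would pass the size of the docling page as the target and then turn each WordBox into whatever cell type your docling version expects. If the docling side uses a bottom-left origin, the y values need to be flipped as well.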