feed Azure OCR json with docling pipeline

#21
by jufrit - opened

Hi,

I want to feed docling my original PDF (for visual analysis) and my Azure OCR json (as the text source) to fuse them.
Is that possible? Or is it possible to parse the Azure OCR json into a compatible format?

Thanks in advance

You can use the docling library's pipeline and implement a custom OCR model that returns your precomputed OCR results.
Something along these lines:

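        from docling.datamodel.base_models import InputFormat
        from docling.datamodel.pipeline_options import PdfPipelineOptions
        from docling.document_converter import (
            DocumentConverter,
            ImageFormatOption,
            PdfFormatOption,
        )
        from docling.models.factories import get_ocr_factory

        # Excerpt from a method of my own wrapper class; `log` is a logger.
        # TextractOcrModel and TextractPrecomputedOcrOptions are my own classes
        # (not part of docling); substitute your Azure equivalents.
        # Use the same allow_external_plugins value as in the pipeline options.
        factory = get_ocr_factory(allow_external_plugins=False)
        factory.register(
            TextractOcrModel, "custom", "textract_ocr_model"
        )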
        pipeline_options = PdfPipelineOptions(
            do_ocr=True,
            ocr_options=TextractPrecomputedOcrOptions(force_full_page_ocr=True),
            allow_external_plugins=False,
        )
        log.info("Creating DocumentConverter")
        self.converter = DocumentConverter(
            format_options={
                InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options),
                InputFormat.IMAGE: ImageFormatOption(pipeline_options=pipeline_options),
            }
        )

Just make sure that your custom class inherits from docling.models.base_ocr_model.BaseOcrModel, implements the constructor properly, and implements the __call__ method with the correct interface:

        def __call__(
            self, conv_res: ConversionResult, page_batch: Iterable[Page]
        ) -> Iterable[Page]

You may need to do some manual conversion from your OCR JSON to docling's Page class. Also, watch out for differences in coordinate origin, axis scaling, and rotation, as the OCR output may not use the same frame of reference as docling.
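As a rough illustration of the scaling part, here is a minimal sketch that turns Azure word polygons into axis-aligned boxes in PDF points, grouped by page number. It is not docling code. It assumes the JSON follows the Azure Document Intelligence layout (`analyzeResult.pages[]` with `pageNumber`, `unit`, and `words[]` carrying `content`, `confidence`, and a flat 8-number `polygon`), that PDF pages are reported in inches with a top-left origin, and that the page is not rotated. `WordBox` and `words_by_page` are hypothetical names, and older Read API output uses different keys, so adjust to your file.

```python
from dataclasses import dataclass
from typing import Any

POINTS_PER_INCH = 72.0  # PDF user space: 72 points per inch


@dataclass
class WordBox:
    """Hypothetical container: one OCR word with a box in PDF points (top-left origin)."""
    text: str
    confidence: float
    left: float
    top: float
    right: float
    bottom: float


def words_by_page(azure_result: dict[str, Any]) -> dict[int, list[WordBox]]:
    """Group Azure OCR words by 1-based page number, scaled from inches to points.

    Assumes the Document Intelligence JSON layout described above; rotation
    (the page-level `angle` field) is ignored, so each polygon is reduced to
    its axis-aligned envelope.
    """
    result: dict[int, list[WordBox]] = {}
    for page in azure_result["analyzeResult"]["pages"]:
        if page.get("unit") != "inch":
            # Image inputs are reported in pixels and need a different scale.
            raise ValueError(f"unsupported unit: {page.get('unit')!r}")
        boxes: list[WordBox] = []
        for word in page.get("words", []):
            polygon = word["polygon"]  # [x1, y1, x2, y2, x3, y3, x4, y4]
            xs, ys = polygon[0::2], polygon[1::2]
            boxes.append(
                WordBox(
                    text=word["content"],
                    confidence=word.get("confidence", 1.0),
                    left=min(xs) * POINTS_PER_INCH,
                    top=min(ys) * POINTS_PER_INCH,
                    right=max(xs) * POINTS_PER_INCH,
                    bottom=max(ys) * POINTS_PER_INCH,
                )
            )
        result[page["pageNumber"]] = boxes
    return result
```

Inside your custom model's __call__ you would then look up the boxes for each page and build docling's text cells from them. Check which coordinate origin and cell type your docling version expects before copying the values over.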
