
OCR Parsing of PDF Files and Images

1. AI-powered parser (Extract structured data)

Parsio can automatically extract structured data, tables and handwritten text from text-based PDFs, scanned PDFs and images. It uses machine learning for both OCR and data extraction.

A set of prebuilt AI models extracts data automatically from commonly used document types:

  • Invoices
  • Receipts
  • Bank statements
  • Business cards
  • Identity documents: passports, driving licenses, ID cards, etc.
  • Contracts
  • Credit cards
  • Pay stubs
  • Bank checks
  • W-2 tax forms (US)
  • 1098 tax forms (US)
  • Health insurance cards (US)
  • Marriage certificates (US)
  • General documents and forms, including handwritten text in different languages.

How to use the AI engine

1. Create a mailbox, choose "AI-powered PDF parser" and select a pre-built model.

2. Send an email with attachments as usual, upload files manually or use our API to import PDF files.

3. Parsio will automatically identify fields, tables and data to extract.

4. After this, you can export the parsed data as usual (Google Sheets, automation platforms, webhooks or files).
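As an illustration of the webhook export in step 4, here is a minimal sketch of turning one received JSON body into a CSV line. The payload shape and the field names (invoice_number, total, currency, vendor) are assumptions made for this example, not Parsio's documented schema; check a real webhook delivery for the actual keys.

```python
import csv
import io
import json


def payload_to_csv_row(body: bytes, columns: list[str]) -> str:
    """Turn one JSON webhook body into a single CSV line with a fixed column order.

    Columns missing from the payload are written as empty cells; payload keys
    that are not listed in `columns` are skipped.
    """
    data = json.loads(body)
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns, lineterminator="\n")
    writer.writerow({name: data.get(name, "") for name in columns})
    return buf.getvalue()


# Hypothetical webhook body; the field names are illustrative only.
sample = b'{"invoice_number": "INV-001", "total": 120.5, "currency": "EUR"}'
row = payload_to_csv_row(sample, ["invoice_number", "total", "vendor"])
print(row, end="")
```

Fixing the column order up front keeps rows aligned even when a document lacks some fields, which is common when different document layouts land in the same mailbox.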

Raw data

By default, Parsio only includes structured data based on the document type. However, customers can choose to include raw data by enabling it in the mailbox settings.

The raw data is stored in the __raw__ field, which contains the full extracted content as plain text and an array of paragraphs. Each paragraph carries attributes such as page number, type (title, footer, etc.) and polygon (the coordinates of the paragraph on the page).
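To make the structure concrete, the sketch below groups raw paragraphs by page and skips footers. Only the __raw__ field name comes from this article; the nested key names (text, paragraphs, page, type, polygon) and the sample values are assumptions, so compare them with an actual parsed document before relying on them.

```python
from collections import defaultdict

# Hypothetical parsed document. Only "__raw__" is named in the article;
# every nested key below is an assumption made for illustration.
parsed_doc = {
    "invoice_number": "INV-001",
    "__raw__": {
        "text": "ACME Corp\nInvoice INV-001\nPage 1 of 1",
        "paragraphs": [
            {"text": "ACME Corp", "page": 1, "type": "title",
             "polygon": [[0.10, 0.05], [0.40, 0.05], [0.40, 0.09], [0.10, 0.09]]},
            {"text": "Invoice INV-001", "page": 1, "type": "text",
             "polygon": [[0.10, 0.12], [0.45, 0.12], [0.45, 0.15], [0.10, 0.15]]},
            {"text": "Page 1 of 1", "page": 1, "type": "footer",
             "polygon": [[0.40, 0.95], [0.60, 0.95], [0.60, 0.97], [0.40, 0.97]]},
        ],
    },
}


def paragraphs_by_page(doc, skip_types=("footer",)):
    """Group raw paragraph texts by page number, skipping unwanted types.

    Returns an empty dict when raw data is not enabled for the mailbox,
    i.e. when the "__raw__" field is absent.
    """
    raw = doc.get("__raw__")
    if raw is None:
        return {}
    pages = defaultdict(list)
    for paragraph in raw.get("paragraphs", []):
        if paragraph.get("type") in skip_types:
            continue
        pages[paragraph["page"]].append(paragraph["text"])
    return dict(pages)


print(paragraphs_by_page(parsed_doc))
```

Because raw data is off by default, the function treats a missing __raw__ field as "no paragraphs" instead of raising an error.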

You can enable or disable raw data on the Mailbox Settings page.

2. OCR converter (Extract text)

Parsio’s OCR Converter engine is designed for converting PDFs and images into editable text formats, while preserving the original layout as closely as possible.

Key Features

  • Text Extraction from Scanned Documents: The OCR Converter can process PDFs, images, and scanned documents in formats such as PDF, JPG, PNG, and TIFF.
  • Table Extraction: Parsio’s OCR Converter accurately captures tables and other structured data from scanned documents and images, which can then be exported into formats like Excel, CSV, Markdown, TXT and HTML.
  • Layout Preservation: This engine preserves the original document layout, so text, tables and other elements keep their structure when converted to editable formats.

Supported Formats

The OCR Converter engine supports the following formats for text extraction: PDF, JPG, PNG, TIFF.
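If you import files programmatically, a quick pre-check can filter out unsupported files before sending them. This is a minimal sketch based only on the format list above; treating .jpeg and .tif as aliases of JPG and TIFF is an assumption, and the function name is hypothetical.

```python
from pathlib import Path

# Extensions taken from the supported formats listed in this article.
SUPPORTED_EXTENSIONS = {".pdf", ".jpg", ".png", ".tiff"}

# Assumption: common alternative spellings are accepted as well.
EXTENSION_ALIASES = {".jpeg": ".jpg", ".tif": ".tiff"}


def is_supported(filename: str) -> bool:
    """Return True if the file extension matches a supported OCR format."""
    ext = Path(filename).suffix.lower()
    ext = EXTENSION_ALIASES.get(ext, ext)
    return ext in SUPPORTED_EXTENSIONS


for name in ["scan.PDF", "photo.jpeg", "notes.docx"]:
    print(name, is_supported(name))
```

The check looks at the file name only; it does not inspect the file contents, so a renamed file would still pass.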

How to Use the OCR Converter

  1. Create a Mailbox: Select "OCR Converter" as the engine for your mailbox.
  2. Import Files: You can send an email with attachments, upload files manually, or use our API to import documents in supported formats.
  3. Automatic Extraction: Parsio’s OCR engine will automatically process the text, tables, and layout, converting the content into editable formats.
  4. Export Data: Once the extraction is complete, you can export the data to Google Sheets, CSV or JSON, or use webhooks and automation platforms for integration.

Use Cases

The OCR Converter is ideal for cases where users need to:

  • Convert printed or scanned documents to editable formats for digital archiving.
  • Extract tables and text from paper forms or receipts for data analysis.
  • Digitize documents while preserving the original layout, such as contracts or business reports.

© 2024 Parsio Knowledge Base