
OCR Parsing of PDF Files and Images

Parsio can automatically extract structured data, tables and handwritten text from text PDFs, scanned PDFs and images. It uses machine learning for both OCR and data extraction.

A set of prebuilt AI models automatically extracts data from commonly used document types:

  • Invoices
  • Receipts
  • Business cards
  • Identity documents: passports, driving licenses, ID cards, etc.
  • W-2 forms (US)
  • General documents and forms, including handwritten text in different languages

How to use the AI engine

1. Create a mailbox, choose "AI-powered PDF parser" and select a prebuilt model.

2. Send emails with attachments as usual, upload files manually, or use our API to import PDF files.

3. Parsio will automatically identify fields, tables and data to extract.

4. You can then export the parsed data as usual (to Google Sheets, automation platforms, webhooks or files).
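
As an illustration of the export step, here is a minimal Python sketch that turns one parsed-document JSON payload (for example, the body delivered to a webhook) into CSV text. The payload shape and the field names `vendor`, `total` and `items` are assumptions made for this example, not Parsio's documented output. Inspect a real parsed document to see which fields your model returns.

```python
import csv
import io
import json


def parsed_json_to_csv(body: str) -> str:
    """Convert one parsed-document JSON object into CSV text (header + one row)."""
    doc = json.loads(body)
    # Keep only scalar fields; nested values such as tables (lists or dicts)
    # are skipped here and would need their own sheet or file.
    flat = {k: v for k, v in doc.items() if not isinstance(v, (list, dict))}
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=sorted(flat), lineterminator="\n")
    writer.writeheader()
    writer.writerow(flat)
    return out.getvalue()


# Hypothetical payload: the keys below are invented for illustration only.
sample = json.dumps(
    {
        "vendor": "Acme Ltd",
        "total": 120.5,
        "items": [{"name": "Widget", "qty": 2}],
    }
)
csv_text = parsed_json_to_csv(sample)
print(csv_text)
```

Skipping nested values keeps the sketch short. In practice, extracted tables would usually be written to a separate sheet or file with one row per table line.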

Raw data

By default, Parsio only includes structured data based on the document type. However, you can choose to include raw data by enabling it in the mailbox settings.

The raw data is stored in the __raw__ field, which consists of the fully extracted content in plain text and an array of paragraphs with associated attributes such as page number, type (title, footer, etc.), and polygon (coordinates of the paragraph on the page).
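
To make the structure described above more concrete, here is a minimal Python sketch that groups raw paragraphs by page and computes a bounding box from a polygon. The key names (`content`, `paragraphs`, `text`, `page`, `type`, `polygon`) and the representation of a polygon as a list of [x, y] pairs are assumptions made for this example. Check the actual __raw__ field of one of your parsed documents for the exact names and coordinate format.

```python
from collections import defaultdict


def paragraphs_by_page(raw: dict, skip_types=("footer",)) -> dict:
    """Group paragraph texts by page number, skipping unwanted paragraph types."""
    pages = defaultdict(list)
    for para in raw.get("paragraphs", []):
        if para.get("type") in skip_types:
            continue
        pages[para["page"]].append(para["text"])
    return dict(pages)


def bounding_box(polygon):
    """Return (min_x, min_y, max_x, max_y) for a polygon given as [x, y] pairs."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


# Hypothetical raw data: every key and value below is invented for illustration.
sample_raw = {
    "content": "Invoice 42\nThank you\nPage 1",
    "paragraphs": [
        {"text": "Invoice 42", "page": 1, "type": "title",
         "polygon": [[0.1, 0.1], [0.5, 0.1], [0.5, 0.2], [0.1, 0.2]]},
        {"text": "Thank you", "page": 1, "type": "paragraph",
         "polygon": [[0.1, 0.3], [0.6, 0.3], [0.6, 0.4], [0.1, 0.4]]},
        {"text": "Page 1", "page": 1, "type": "footer",
         "polygon": [[0.4, 0.9], [0.6, 0.9], [0.6, 0.95], [0.4, 0.95]]},
    ],
}

by_page = paragraphs_by_page(sample_raw)
title_box = bounding_box(sample_raw["paragraphs"][0]["polygon"])
```

Filtering on the paragraph type is one way to drop repeated page footers before passing the text to a downstream step.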

You can enable or disable the raw data from the Mailbox Settings page.

 

 


© 2024 Parsio Knowledge Base