OCR Parsing of PDF Files and Images

1. AI-powered parser (Extract structured data)

Parsio can automatically extract structured data, tables and handwritten text from text-based PDFs, scanned PDFs and images. It uses machine learning for both OCR and data extraction.

Parsio provides a set of pre-built AI models that automatically extract data from commonly used document types:

Invoices
Receipts
Bank statements
Business cards
Identity documents: passports, driving licenses, ID cards, etc.
Contracts
Credit cards
Pay stubs
Bank checks
W-2 tax forms (US)
1098 tax forms (US)
Health insurance cards (US)
Marriage certificates (US)
General documents and forms, including handwritten text in different languages.

How to use the AI engine

1. Create a mailbox, choose "AI-powered PDF parser" and select a pre-built model.

2. Send an email with attachments as usual, upload files manually, or use our API to import PDF files.

3. Parsio will automatically identify fields, tables and data to extract.

4. After this, you can export the parsed data as usual (Google Sheets, automation platforms, webhooks or files).

Raw data

By default, Parsio only includes structured data based on the document type. However, customers can choose to include raw data by enabling it in the mailbox settings.

The raw data is stored in the __raw__ field, which contains the fully extracted content as plain text and an array of paragraphs with associated attributes such as page number, type (title, footer, etc.), and polygon (the coordinates of the paragraph on the page).
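
To make the shape of the raw data more concrete, here is a minimal Python sketch that groups raw paragraphs by page and skips footers. Only the __raw__ field name and the attributes named above (page number, type, polygon) come from this article. The exact key names used here ("content", "paragraphs", "text", "page", "type", "polygon"), the polygon layout, and the sample values are assumptions for illustration, so check a real parsed document from your mailbox before relying on them.

```python
from collections import defaultdict

# Hypothetical parsed document. Key names other than "__raw__" are assumptions.
sample = {
    "__raw__": {
        "content": "ACME Corp\nInvoice 001\nPage 1\nTotal: 42.00",
        "paragraphs": [
            {"text": "ACME Corp", "page": 1, "type": "title",
             "polygon": [[0.1, 0.1], [0.5, 0.1], [0.5, 0.2], [0.1, 0.2]]},
            {"text": "Invoice 001", "page": 1, "type": "paragraph",
             "polygon": [[0.1, 0.3], [0.5, 0.3], [0.5, 0.4], [0.1, 0.4]]},
            {"text": "Page 1", "page": 1, "type": "footer",
             "polygon": [[0.4, 0.9], [0.6, 0.9], [0.6, 1.0], [0.4, 1.0]]},
            {"text": "Total: 42.00", "page": 2, "type": "paragraph",
             "polygon": [[0.1, 0.1], [0.5, 0.1], [0.5, 0.2], [0.1, 0.2]]},
        ],
    }
}


def paragraphs_by_page(parsed_doc, skip_types=("footer",)):
    """Return {page_number: [paragraph text, ...]}, ignoring skipped types."""
    raw = parsed_doc.get("__raw__")
    if raw is None:
        # Raw data is off by default, so the field may be absent.
        return {}
    pages = defaultdict(list)
    for para in raw.get("paragraphs", []):
        if para.get("type") in skip_types:
            continue
        pages[para.get("page")].append(para.get("text", ""))
    return dict(pages)


print(paragraphs_by_page(sample))
```

The function returns an empty dictionary when the __raw__ field is missing, which is the expected case when raw data has not been enabled for the mailbox.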

You can enable or disable raw data on the Mailbox Settings page.

2. OCR converter (Extract text)

Parsio’s OCR Converter engine is designed for converting PDFs and images into editable text formats, while preserving the original layout as closely as possible.

Key Features

Text Extraction from Scanned Documents: The OCR Converter can process scanned documents and images in PDF, JPG, PNG, and TIFF formats.
Table Extraction: Parsio’s OCR Converter accurately captures tables and other structured data from scanned documents and images, which can then be exported into formats like Excel, CSV, Markdown, TXT and HTML.
Layout Preservation: This engine preserves the original document layout, so that text, tables, and other elements retain their structure when converted to editable formats.

Supported Formats

The OCR Converter engine supports the following formats for text extraction: PDF, JPG, PNG, TIFF.

How to Use the OCR Converter

Create a Mailbox: Select "OCR Converter" as the engine for your mailbox.
Import Files: You can send an email with attachments, upload files manually, or use our API to import documents in supported formats.
Automatic Extraction: Parsio’s OCR engine will automatically process the text, tables, and layout, converting the content into editable formats.
Export Data: Once the extraction is complete, you can export the data to Google Sheets, CSV, or JSON, or use webhooks and automation platforms for integration.

Use Cases

The OCR Converter is ideal for cases where users need to:

Convert printed or scanned documents to editable formats for digital archiving.
Extract tables and text from paper forms or receipts for data analysis.
Digitize documents while preserving the original layout, such as contracts or business reports.

Choosing the Right OCR Model

OCR models have their strengths and weaknesses. Some work better on certain document types than others. Don't hesitate to test and see which one works best for your needs.

Parsio currently supports two AI-powered OCR models, each optimized for different use cases.

Default OCR

Supports: PDF, JPG, PNG, TIFF
Can extract tables and save them in Excel, CSV, and other formats
Sometimes more precise than Mistral OCR
Can be relatively slow on larger PDFs

Mistral OCR

Supports: PDF, JPG, PNG
Faster processing
Extracts images from PDFs
Better for scientific papers and formulas
Better for mixed content (text, tables, images), such as magazine articles
Mistral claims it is "the world’s best document understanding model"

Choose the model based on your document type and processing needs.
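
As a quick first filter, the format support listed above can be written down as a small lookup. The Python sketch below is only an illustration: the model names and formats are taken from the two lists above, while the function name and the extension-based check are assumptions of this example, not part of Parsio itself.

```python
from pathlib import Path

# Formats per model, as listed in this article.
MODEL_FORMATS = {
    "Default OCR": {"pdf", "jpg", "png", "tiff"},
    "Mistral OCR": {"pdf", "jpg", "png"},
}


def models_supporting(filename):
    """Return the names of the OCR models that list this file's extension."""
    ext = Path(filename).suffix.lower().lstrip(".")
    return [name for name, formats in MODEL_FORMATS.items() if ext in formats]


print(models_supporting("scan.tiff"))   # only Default OCR lists TIFF
print(models_supporting("paper.pdf"))   # both models list PDF
```

When both models support the format, the choice comes down to the trade-offs above, so testing each model on a few representative documents remains the most reliable way to decide.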