OCR Parsing of PDF Files and Images
Parsio can automatically extract structured data, tables and handwritten text from text-based PDFs, scanned PDFs and images. It uses machine learning for OCR and data extraction.
A set of prebuilt AI models extracts data automatically from commonly used document types:
- Business cards
- Identity documents: passports, driving licenses, ID cards, etc.
- W-2 forms (US)
- General documents and forms, including handwritten text in different languages
How to use the AI engine
1. Create a mailbox, choose "AI-powered PDF parser" and select a prebuilt model.
2. Send emails with attachments as usual, upload files manually, or use our API to import PDF files.
3. Parsio will automatically identify fields, tables and data to extract.
4. After this, you can export the parsed data as usual (Google Sheets, automation platforms, webhooks or files).
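As a rough illustration of step 4, the sketch below receives parsed data through a webhook using only Python's standard library. The payload shape (a single JSON object), the port and the names parse_webhook_body, ParsedDataHandler and serve are assumptions made for illustration, not Parsio's documented format; inspect a real webhook delivery before relying on any of them.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def parse_webhook_body(body: bytes) -> dict:
    """Decode a webhook request body into a dict.

    Assumption: the body is a UTF-8 encoded JSON object.
    Raises ValueError for anything else.
    """
    data = json.loads(body.decode("utf-8"))
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    return data


class ParsedDataHandler(BaseHTTPRequestHandler):
    """Hypothetical receiver for parsed-document webhooks."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            parsed = parse_webhook_body(self.rfile.read(length))
        except ValueError:
            # Malformed or non-object payload: reject it.
            self.send_response(400)
            self.end_headers()
            return
        # Placeholder processing: list the top-level field names received.
        print(sorted(parsed))
        self.send_response(200)
        self.end_headers()


def serve(port: int = 8000) -> None:
    """Listen for webhook deliveries until interrupted."""
    HTTPServer(("", port), ParsedDataHandler).serve_forever()
```

Calling serve() starts the listener; the URL of the machine running it would then be entered as the webhook destination.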
By default, Parsio includes only structured data based on the document type. However, you can include raw data as well by enabling it in the mailbox settings.
The raw data is stored in the __raw__ field, which consists of the fully extracted content in plain text and an array of paragraphs with associated attributes such as type (title, footer, etc.) and polygon (coordinates of the paragraph on the page).
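To make this structure concrete, here is a minimal Python sketch that reads the raw data from one parsed document. The field names __raw__, content, paragraphs, type and polygon come from the description above; the sample values, the representation of polygon as a list of [x, y] points, and the helper name paragraphs_of_type are assumptions for illustration only.

```python
from typing import Any


def paragraphs_of_type(document: dict[str, Any], wanted: str) -> list[dict[str, Any]]:
    """Return the raw paragraphs whose type matches `wanted`.

    Returns an empty list when raw data is absent, for example because
    it is disabled in the mailbox settings.
    """
    raw = document.get("__raw__") or {}
    return [p for p in raw.get("paragraphs", []) if p.get("type") == wanted]


# Made-up sample; the real layout of a parsed document may differ.
sample = {
    "__raw__": {
        "content": "Invoice 2024\nThank you for your business",
        "paragraphs": [
            {
                "type": "title",
                # Assumed format: corner points of the paragraph on the page.
                "polygon": [[0.1, 0.1], [0.9, 0.1], [0.9, 0.2], [0.1, 0.2]],
            },
            {
                "type": "footer",
                "polygon": [[0.1, 0.9], [0.9, 0.9], [0.9, 0.95], [0.1, 0.95]],
            },
        ],
    }
}

titles = paragraphs_of_type(sample, "title")
full_text = sample["__raw__"]["content"]
```

Reading __raw__ through get() keeps the helper usable for documents parsed while raw data was disabled.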
You can enable or disable raw data on the Mailbox Settings page.