Field Types. Data Formatting and Normalization

Data normalization is the process of structuring parsed data. For example, you may want to remove unwanted spaces and special characters, or convert numbers that use a decimal comma to a decimal point.

This can be done using Field Types. Parsio supports 9 built-in types:

  • Multi-line Text
  • Single-line Text
  • Raw ("Keep as Source")
  • Downloadable Document (URL)
  • Date
  • Time
  • Date and Time
  • Number
  • Address

Note that these settings are defined at the Mailbox level, i.e. all the templates in a mailbox share the same type for a given field.

1. Multi-line Text

The Multi-line Text type comes with a couple of post-processing operations:

    • Remove all the HTML tags from the found text.
    • Decode HTML entities to their corresponding characters (e.g. &amp; will become &).

Example: an anchor tag <a href="https://example.com" target="_blank">Click Here</a> will be converted to "Click Here".

This field type preserves paragraphs.
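
The two post-processing operations above can be sketched in a few lines of Python. This is only an illustration built on the standard library, not Parsio's actual implementation; the function name clean_multiline is hypothetical, and the sketch keeps line breaks that are already present in the text.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores every tag."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; automatically.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def clean_multiline(raw: str) -> str:
    """Hypothetical helper: strip HTML tags and decode HTML entities."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


print(clean_multiline('<a href="https://example.com" target="_blank">Click Here</a>'))
print(clean_multiline("Tom &amp; Jerry"))
```

The first call prints Click Here and the second prints Tom & Jerry.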

2. Single-line Text

The Single-line Text type is similar to Multi-line Text, but it also removes all line breaks, keeping the value as a single paragraph.
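
As a rough illustration, removing line breaks could look like the Python sketch below. The function name to_single_line is hypothetical, and replacing each run of line breaks with a single space is an assumption: the article does not say what Parsio puts in their place.

```python
import re


def to_single_line(text: str) -> str:
    """Hypothetical helper: join all lines of a value into one line."""
    # Replace each run of line breaks (plus surrounding whitespace) with one space.
    return re.sub(r"\s*[\r\n]+\s*", " ", text).strip()


print(to_single_line("Hello\nworld\r\n\r\nagain"))  # Hello world again
```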

3. Keep as Source

The Raw field type ("Keep as Source") keeps the found value as is, without any additional formatting.

4. Downloadable Document

The Downloadable Document type lets you parse external documents. If the parsed value is a valid URL, Parsio will try to download the linked document (a PDF file, HTML page, Excel, CSV, DOCX, etc.).

The file must be publicly available, i.e. it must be accessible without a login or password.
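
If you want to check your own values before sending them to a Downloadable Document field, a simple pre-check could look like this Python sketch. The function name looks_like_downloadable_url is hypothetical, and treating only http and https links as valid is an assumption; the article does not describe how Parsio validates URLs.

```python
from urllib.parse import urlparse


def looks_like_downloadable_url(value: str) -> bool:
    """Hypothetical pre-check: is the value an absolute http(s) URL?"""
    parsed = urlparse(value.strip())
    # Assumption: only http/https links with a host name are considered.
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


print(looks_like_downloadable_url("https://example.com/invoice.pdf"))  # True
print(looks_like_downloadable_url("invoice.pdf"))                      # False
```

Note that this only checks the shape of the value. It does not tell you whether the file is publicly accessible.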

5-7. Date, Time, Date and Time

Dates and times arrive in your mailbox in a variety of formats, and some of them are ambiguous:

  • 11/07/2022. In different countries this can be either the 11th of July or the 7th of November.

To resolve this ambiguity, Parsio lets you specify the input and output formats for your Date, Time and Date and Time field types. You can find these settings on your Mailbox > Settings page.

If you leave the input fields empty, Parsio will try to "guess" the correct date and time formats. If Parsio can't parse a field, its value is kept unchanged.
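
To see why an explicit input format matters, the Python sketch below parses the ambiguous value 11/07/2022 with two different input formats. It uses Python's own format directives (%d, %m, %Y) rather than Parsio's tokens, and the function name reformat_date is hypothetical. It also mimics the "kept unchanged" fallback described above.

```python
from datetime import datetime


def reformat_date(value: str, input_format: str, output_format: str) -> str:
    """Parse value with input_format and render it with output_format.

    Returns the value unchanged when it does not match input_format.
    """
    try:
        parsed = datetime.strptime(value, input_format)
    except ValueError:
        return value
    return parsed.strftime(output_format)


raw = "11/07/2022"
print(reformat_date(raw, "%d/%m/%Y", "%Y-%m-%d"))  # day first:   2022-07-11
print(reformat_date(raw, "%m/%d/%Y", "%Y-%m-%d"))  # month first: 2022-11-07
print(reformat_date("next Tuesday", "%d/%m/%Y", "%Y-%m-%d"))  # unchanged
```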

Date Formats

Weekday

    • d 0..6
    • dd Su    
    • ddd Sun    
    • dddd Sunday

Year

    • YY 13
    • YYYY 2013

Month

    • M 1..12 (Jan is 1)
    • Mo 1st..12th    
    • MM 01..12 (Jan is 1)    
    • MMM Jan    
    • MMMM January

Quarter

    • Q 1..4
    • Qo 1st..4th

Day

    • D 1..31
    • Do 1st..31st
    • DD 01..31

Day of year

    • DDD 1..365
    • DDDo 1st..365th
    • DDDD 001..365

Week of year

    • w 1..53
    • wo 1st..53rd
    • ww 01..53

Time Formats

24h hour

    • H 0..23
    • HH 00..23

12h hour

    • h 1..12
    • hh 01..12

Minutes

    • m 0..59
    • mm 00..59

Seconds

    • s 0..59
    • ss 00..59

AM/PM

  • a am
  • A AM

Timezone offset

  • Z +07:00
  • ZZ +0730

 

Unix timestamp

  • X Unix timestamp
  • x Millisecond Unix timestamp

Here are some examples:

  • YYYY-MM-DD → 2014-01-01
  • dddd, MMMM Do YYYY → Friday, May 16th 2014
  • dddd [the] Do [of] MMMM → Friday the 16th of May
  • hh:mm a → 12:30 pm
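
If you post-process parsed dates in your own scripts, you may need to translate these tokens into another library's notation. The Python sketch below maps a small subset of the tokens above to strftime directives and treats text in square brackets as literal text. The names TOKEN_MAP and to_strftime are hypothetical, and tokens outside the subset (for example Do, D or M) are passed through untouched.

```python
import re
from datetime import datetime

# Subset of the tokens listed above, mapped to Python strftime directives.
TOKEN_MAP = {
    "YYYY": "%Y", "YY": "%y",
    "MMMM": "%B", "MMM": "%b", "MM": "%m",
    "DD": "%d",
    "dddd": "%A", "ddd": "%a",
    "HH": "%H", "hh": "%I",
    "mm": "%M", "ss": "%S",
    "A": "%p",
}

# Bracketed literals come first; longer tokens come before shorter ones
# so that "YYYY" is not read as "YY" twice.
_TOKEN_RE = re.compile(
    r"\[([^\]]*)\]|" + "|".join(sorted(TOKEN_MAP, key=len, reverse=True))
)


def to_strftime(pattern: str) -> str:
    """Translate a token pattern into a strftime format string."""

    def replace(match):
        if match.group(1) is not None:
            # Text inside [...] is literal; escape % for strftime.
            return match.group(1).replace("%", "%%")
        return TOKEN_MAP[match.group(0)]

    return _TOKEN_RE.sub(replace, pattern)


moment = datetime(2014, 5, 16, 12, 30)
print(moment.strftime(to_strftime("YYYY-MM-DD")))
print(moment.strftime(to_strftime("dddd [the] DD [of] MMMM")))
```

With Python's default locale settings, the two calls print 2014-05-16 and Friday the 16 of May.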

8. Number

The Number type removes spaces and special characters from the parsed value. You can also convert numbers that use a decimal comma to a decimal point, and vice versa. You can find these settings on your Mailbox > Settings page.

If Parsio can't parse a numeric field, its value is kept unchanged.
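
As an illustration of this kind of clean-up, here is a minimal Python sketch. It is not Parsio's implementation: the function name normalize_number and its decimal_separator parameter are hypothetical, and the sketch always outputs a decimal point. Values that still don't look like a number after cleaning are returned unchanged.

```python
import re


def normalize_number(value: str, decimal_separator: str = ",") -> str:
    """Hypothetical helper: clean a number and output it with a decimal point."""
    # Keep only digits, the minus sign and the two possible separators.
    cleaned = re.sub(r"[^0-9,.\-]", "", value)
    if decimal_separator == ",":
        # "1.234,56": dots group thousands, the comma marks decimals.
        candidate = cleaned.replace(".", "").replace(",", ".")
    else:
        # "1,234.56": commas group thousands, the dot marks decimals.
        candidate = cleaned.replace(",", "")
    try:
        float(candidate)
    except ValueError:
        return value  # not a number: keep the original value
    return candidate


print(normalize_number("€ 1 234,56"))                        # 1234.56
print(normalize_number("$1,234.56", decimal_separator="."))  # 1234.56
print(normalize_number("n/a"))                               # n/a
```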

9. Address

The Address field format converts a freeform text address to a structured format.

Example. Let's say you captured the following address: 1 Piazza del Colosseo 00184 Roma.

Parsio will store it as a field with multiple properties:

If, however, Parsio is unable to convert the parsed text into a structured address format, the resulting value will be:

 
