How Document Extraction Works

A plain-language walkthrough of what happens from the moment you upload a document to when it's ready for review.

When you upload a document to CargoLint, a lot happens behind the scenes in just a few seconds. Here’s how the extraction pipeline works.

Step 1: Upload and Ingestion

You upload a PDF, PNG, JPEG, or TIFF file (up to 5 MB) through the CargoLint dashboard. The file is securely stored and queued for processing.
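As an illustration of the upload limits above, here is a minimal Python sketch of a pre-upload check you could run on your own side. It is not CargoLint's actual code: the function name, the check by file extension, and the reading of 5 MB as 5 * 1024 * 1024 bytes are all assumptions.

```python
from pathlib import Path

# Formats and size limit as stated above. Treating 5 MB as 5 * 1024 * 1024
# bytes, and checking by file extension, are assumptions of this sketch.
ALLOWED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff"}
MAX_UPLOAD_BYTES = 5 * 1024 * 1024


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    extension = Path(filename).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported file type: {extension or '(none)'}")
    if size_bytes > MAX_UPLOAD_BYTES:
        problems.append(f"file is larger than {MAX_UPLOAD_BYTES} bytes")
    return problems
```

Calling check_upload("invoice.PDF", 1024) returns an empty list, while a .docx file or a 6 MB scan returns one problem each.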

Step 2: Document Type Classification

The first thing CargoLint does is figure out what kind of document it is. Our AI analyzes the page layout, header text, and structural patterns to classify the document as a Commercial Invoice, Packing List, Bill of Lading, or Certificate of Origin.

If the document doesn’t match any of these types, it’s marked for manual review.
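The classify-or-fall-back behavior can be pictured with a small Python sketch. It swaps in a simple header keyword match for the AI classifier, which also weighs page layout and structural patterns, so treat it only as an illustration of the control flow. All names and keyword lists here are hypothetical.

```python
from typing import Optional

# Hypothetical header phrases. The real classifier also uses page layout
# and structural patterns, which this keyword sketch ignores.
HEADER_KEYWORDS = {
    "commercial_invoice": ("commercial invoice",),
    "packing_list": ("packing list",),
    "bill_of_lading": ("bill of lading",),
    "certificate_of_origin": ("certificate of origin",),
}


def classify_header(header_text: str) -> Optional[str]:
    """Return a document type, or None when no known type matches."""
    text = header_text.lower()
    for doc_type, phrases in HEADER_KEYWORDS.items():
        if any(phrase in text for phrase in phrases):
            return doc_type
    return None


def next_step(header_text: str) -> str:
    """Send unrecognized documents to manual review, as described above."""
    doc_type = classify_header(header_text)
    return "manual_review" if doc_type is None else f"extract:{doc_type}"
```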

Step 3: Language Detection

Next, CargoLint detects the language of the document. We support:

English
Simplified Chinese
Traditional Chinese
Spanish

Language detection is important because our extraction models are language-specific. Once we know the language, we apply the right extraction rules.
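One way to picture language-specific extraction is a lookup from the detected language to a rule set. The Python sketch below uses standard BCP 47 language tags; the rule-set names are placeholders, and raising an error for an unsupported language is an assumption, since this page does not say what CargoLint does in that case.

```python
# BCP 47 tags for the four supported languages. The rule-set names are
# hypothetical placeholders, not real CargoLint identifiers.
EXTRACTION_RULES = {
    "en": "rules_english",
    "zh-Hans": "rules_simplified_chinese",
    "zh-Hant": "rules_traditional_chinese",
    "es": "rules_spanish",
}


def rules_for(language_tag: str) -> str:
    """Pick the language-specific rule set for a detected language."""
    try:
        return EXTRACTION_RULES[language_tag]
    except KeyError:
        # Behavior for other languages is not documented; raising is an assumption.
        raise ValueError(f"no extraction rules for language: {language_tag}") from None
```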

Step 4: Field Extraction

Now the AI extracts the specific fields relevant to the document type. For example:

From an invoice: seller name, buyer name, invoice number, line items, total amount, payment terms
From a packing list: item descriptions, quantities, weights, box numbers
From a bill of lading: shipper, consignee, container numbers, ports
From a certificate of origin: exporter, importer, country of origin, product descriptions

Each field is identified, located in the document, and read by our optical character recognition (OCR) system.
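To make the idea of type-specific fields concrete, here is a hedged Python sketch of how the example fields above could be represented. The snake_case keys, the ExtractedField structure, and the missing_fields helper are illustrative assumptions, and the field lists only mirror the examples given here rather than a complete schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Field names follow the examples above; the snake_case keys are illustrative.
FIELDS_BY_TYPE = {
    "commercial_invoice": ["seller_name", "buyer_name", "invoice_number",
                           "line_items", "total_amount", "payment_terms"],
    "packing_list": ["item_descriptions", "quantities", "weights", "box_numbers"],
    "bill_of_lading": ["shipper", "consignee", "container_numbers", "ports"],
    "certificate_of_origin": ["exporter", "importer", "country_of_origin",
                              "product_descriptions"],
}


@dataclass
class ExtractedField:
    name: str
    value: Optional[str]        # text read by OCR, or None if nothing was found
    page: Optional[int] = None  # page where the field was located
    box: Optional[Tuple[float, float, float, float]] = None  # x0, y0, x1, y1


def missing_fields(doc_type: str, extracted: list[ExtractedField]) -> list[str]:
    """List the expected fields that have no extracted value."""
    found = {field.name for field in extracted if field.value}
    return [name for name in FIELDS_BY_TYPE[doc_type] if name not in found]
```

For a bill of lading where only the shipper was read, missing_fields reports consignee, container_numbers, and ports.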

Step 5: Confidence Scoring

As the AI extracts each field, it calculates a confidence score (0 to 1) based on:

How clear the text is
How well it matches expected formats
Whether it’s in the expected location
How well extracted values validate against each other

The confidence score reflects the AI’s certainty. A score of 0.95 means we’re very confident; a score of 0.55 means we’re uncertain and the field should be reviewed.
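The exact scoring formula is not described here. As a rough mental model only, the Python sketch below combines the four signals listed above as a weighted average, with each signal on a 0 to 1 scale. The function name, the equal weights, and the averaging itself are assumptions.

```python
def field_confidence(text_clarity: float, format_match: float,
                     location_match: float, cross_check: float,
                     weights=(0.25, 0.25, 0.25, 0.25)) -> float:
    """Combine four 0-to-1 signals into one 0-to-1 score (hypothetical formula)."""
    signals = (text_clarity, format_match, location_match, cross_check)
    if any(not 0.0 <= signal <= 1.0 for signal in signals):
        raise ValueError("each signal must be between 0 and 1")
    # Weighted average, normalized so the weights need not sum to 1.
    score = sum(w * s for w, s in zip(weights, signals)) / sum(weights)
    return round(score, 2)
```

With equal weights, signals of 1.0, 1.0, 0.9, and 0.9 give 0.95, while 0.4, 0.6, 0.6, and 0.6 give 0.55.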

Step 6: Validation Checks

CargoLint performs cross-field validation to catch obvious errors:

Do line item totals add up to the invoice total?
Do package counts match the shipment summary?
Are dates in logical order (invoice date before due date)?
Are required fields present?

If validation checks fail, confidence scores are adjusted down automatically.
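Here is a minimal Python sketch of three of these checks for an invoice, followed by a score adjustment. The dictionary keys, the fixed penalty per failed check, and the choice to accept a due date equal to the invoice date are assumptions; how much CargoLint actually lowers a score is not stated here.

```python
from datetime import date
from decimal import Decimal

REQUIRED_KEYS = ("invoice_number", "invoice_date", "due_date",
                 "line_totals", "total_amount")


def validate_invoice(invoice: dict) -> list[str]:
    """Return the names of failed checks for a dict of extracted invoice values."""
    if any(invoice.get(key) is None for key in REQUIRED_KEYS):
        return ["required_fields"]
    failures = []
    # Decimal avoids floating-point rounding when adding money amounts.
    if sum(invoice["line_totals"], Decimal("0")) != invoice["total_amount"]:
        failures.append("line_totals_match")
    # A due date equal to the invoice date is accepted (assumption).
    if invoice["invoice_date"] > invoice["due_date"]:
        failures.append("date_order")
    return failures


def adjust_confidence(score: float, failures: list[str],
                      penalty: float = 0.15) -> float:
    """Hypothetical rule: subtract a fixed penalty per failed check, floor at 0."""
    return max(0.0, round(score - penalty * len(failures), 2))
```

An invoice whose line totals do not sum to the total and whose due date precedes the invoice date fails two checks, so a score of 0.9 would drop to 0.6 under this illustrative penalty.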

Step 7: Review Queue Routing

After scoring and validation, the document is routed based on its overall confidence:

High confidence (70% or higher, which is a score of 0.70 or more): The extracted data is sent directly to your completed documents, ready to use or export.
Lower confidence (below 70%): The document goes to your review queue. A team member with sufficient permissions can open it, check the extractions, make corrections if needed, and then mark it complete.
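The routing rule itself is a single threshold comparison. The Python sketch below shows it with 70% written on the 0 to 1 score scale; the function and queue names are hypothetical, and how the overall confidence is derived from the field scores is not covered here.

```python
REVIEW_THRESHOLD = 0.70  # 70% expressed on the 0-to-1 score scale


def route(overall_confidence: float) -> str:
    """Return the destination for a document given its overall confidence."""
    if not 0.0 <= overall_confidence <= 1.0:
        raise ValueError("overall_confidence must be between 0 and 1")
    if overall_confidence >= REVIEW_THRESHOLD:
        return "completed"
    return "review_queue"
```

A document scoring exactly 0.70 goes straight to completed documents, while one scoring 0.69 lands in the review queue.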

Why does processing take a few seconds?

The entire pipeline (classification, language detection, extraction, scoring, and validation) typically takes 3 to 10 seconds per document. That time covers:

The time needed for the AI to analyze each page
The time to perform validation checks and adjust scores
Network latency for secure transmission

Why do some documents need review and others don’t?

Documents that skip review are those with consistently clear text, standard layouts, and extractions that validate well. These typically score 70% or higher.

Documents that need review are those with:

Low-quality scans or handwriting
Non-standard layouts or missing fields
Extracted values that don’t validate (e.g., math doesn’t add up)
Languages or formats the AI is less familiar with

Review isn’t a sign of failure; it’s a designed part of the workflow. Some documents are genuinely ambiguous and benefit from human judgment.

Real-time monitoring

You can monitor the extraction status of all your documents from the dashboard. You’ll see:

Total documents processed
Documents completed
Documents requiring review
Average confidence across all documents
30-day trend charts showing volume and confidence over time

This visibility helps you understand document quality patterns and plan your review resources.
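If you export your document list and want to reproduce the headline numbers yourself, a small aggregation like the Python sketch below would do it. The record keys and status values are hypothetical, not a documented export format.

```python
from statistics import mean


def summarize(documents: list[dict]) -> dict:
    """Summarize records that carry hypothetical 'status' and 'confidence' keys."""
    completed = sum(1 for doc in documents if doc["status"] == "completed")
    in_review = sum(1 for doc in documents if doc["status"] == "review_queue")
    average = round(mean(doc["confidence"] for doc in documents), 2) if documents else None
    return {
        "total_processed": len(documents),
        "completed": completed,
        "requiring_review": in_review,
        "average_confidence": average,
    }
```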