How Document Extraction Works
A plain-language walkthrough of what happens from the moment you upload a document to when it's ready for review.
When you upload a document to CargoLint, a lot happens behind the scenes in just a few seconds. Here’s how the extraction pipeline works.
Step 1: Upload and Ingestion
You upload a PDF, PNG, JPEG, or TIFF file (up to 5 MB) through the CargoLint dashboard. The file is securely stored and queued for processing.
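The upload constraints above can be checked before a file is ever sent. The following is a minimal sketch, not CargoLint's own code: the function name `check_upload`, the extension list, and the reading of "5 MB" as 5 × 1024 × 1024 bytes are all assumptions made for illustration.

```python
from pathlib import Path

# Limits taken from the step above; all names here are hypothetical.
ALLOWED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff"}
MAX_BYTES = 5 * 1024 * 1024  # assumes "5 MB" means 5 MiB


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    # Compare the extension case-insensitively so "SCAN.TIFF" is accepted.
    if Path(filename).suffix.lower() not in ALLOWED_EXTENSIONS:
        problems.append("unsupported file type")
    if size_bytes > MAX_BYTES:
        problems.append("file larger than 5 MB")
    return problems
```

Running a check like this on your side before uploading lets you catch oversized or unsupported files early.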
Step 2: Document Type Classification
The first thing CargoLint does is figure out what kind of document it is. Our AI analyzes the page layout, header text, and structural patterns to classify the document as a Commercial Invoice, Packing List, Bill of Lading, or Certificate of Origin.
If the document doesn’t match any of these types, it’s marked for manual review.
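To illustrate the "classify or fall back to manual review" behavior, here is a deliberately simplified sketch. It matches on header text only, whereas the classifier described above also uses page layout and structural patterns. The keyword table and the function name `classify_by_header` are hypothetical.

```python
from typing import Optional

# Hypothetical header keywords. This is a toy stand-in, not the real classifier.
HEADER_KEYWORDS = {
    "commercial_invoice": ("commercial invoice",),
    "packing_list": ("packing list",),
    "bill_of_lading": ("bill of lading",),
    "certificate_of_origin": ("certificate of origin",),
}


def classify_by_header(header_text: str) -> Optional[str]:
    """Return a document type, or None when nothing matches (manual review)."""
    text = header_text.lower()
    for doc_type, keywords in HEADER_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return doc_type
    return None
```

A return value of `None` corresponds to the "marked for manual review" case.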
Step 3: Language Detection
Next, CargoLint detects the language of the document. We support:
- English
- Simplified Chinese
- Traditional Chinese
- Spanish
Language detection is important because our extraction models are language-specific. Once we know the language, we apply the right extraction rules.
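Choosing language-specific rules amounts to a lookup keyed on the detected language. The sketch below assumes BCP 47 style language tags and a made-up rule-set naming scheme. How the pipeline handles an unsupported language is not stated here, so raising an error is also an assumption.

```python
# Supported languages from the list above, keyed by assumed BCP 47 tags.
SUPPORTED_LANGUAGES = {
    "en": "English",
    "zh-Hans": "Simplified Chinese",
    "zh-Hant": "Traditional Chinese",
    "es": "Spanish",
}


def rules_for(language_code: str) -> str:
    """Return the (hypothetical) name of the rule set for a detected language."""
    if language_code not in SUPPORTED_LANGUAGES:
        # Assumption: unsupported languages are rejected rather than guessed at.
        raise ValueError(f"unsupported language: {language_code}")
    return "extraction_rules_" + language_code.lower().replace("-", "_")
```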
Step 4: Field Extraction
Now the AI extracts the specific fields relevant to the document type. For example:
- From an invoice: seller name, buyer name, invoice number, line items, total amount, payment terms
- From a packing list: item descriptions, quantities, weights, box numbers
- From a bill of lading: shipper, consignee, container numbers, ports
- From a certificate of origin: exporter, importer, country of origin, product descriptions
Each field is identified, located in the document, and read by our optical character recognition (OCR) system.
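One way to picture the per-type field lists is as a simple mapping from document type to expected fields. This is a sketch only: the snake_case keys and the helper `missing_fields` are hypothetical, and the lists repeat just the examples given above, not a complete schema.

```python
# Field names mirror the examples above; the keys are hypothetical.
FIELDS_BY_TYPE = {
    "commercial_invoice": [
        "seller_name", "buyer_name", "invoice_number",
        "line_items", "total_amount", "payment_terms",
    ],
    "packing_list": ["item_descriptions", "quantities", "weights", "box_numbers"],
    "bill_of_lading": ["shipper", "consignee", "container_numbers", "ports"],
    "certificate_of_origin": [
        "exporter", "importer", "country_of_origin", "product_descriptions",
    ],
}


def missing_fields(doc_type: str, extracted: dict) -> list[str]:
    """List the expected fields that were not found in the extraction result."""
    return [name for name in FIELDS_BY_TYPE[doc_type] if name not in extracted]
```

For example, `missing_fields("bill_of_lading", {"shipper": "A", "consignee": "B"})` returns `["container_numbers", "ports"]`.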
Step 5: Confidence Scoring
As the AI extracts each field, it calculates a confidence score (0 to 1) based on:
- How clear the text is
- How well it matches expected formats
- Whether it’s in the expected location
- How well extracted values validate against each other
The confidence score reflects the AI’s certainty. A score of 0.95 means we’re very confident; a score of 0.55 means we’re uncertain and the field should be reviewed.
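The exact scoring formula is not described here. As a rough mental model only, the four signals listed above could be combined as a weighted average, as in the sketch below. The equal weights, the `FieldSignals` class, and the function name are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class FieldSignals:
    """Hypothetical per-field signals, each on a 0 to 1 scale."""
    text_clarity: float
    format_match: float
    location_match: float
    cross_validation: float


# Assumed equal weights; the real weighting is not documented here.
WEIGHTS = {
    "text_clarity": 0.25,
    "format_match": 0.25,
    "location_match": 0.25,
    "cross_validation": 0.25,
}


def field_confidence(signals: FieldSignals) -> float:
    """Combine the signals into a single score clamped to the range 0 to 1."""
    score = sum(weight * getattr(signals, name) for name, weight in WEIGHTS.items())
    return round(min(1.0, max(0.0, score)), 2)
```

Under these assumed weights, a field with clear text and a good format match but a slightly unusual location would still land near the top of the scale.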
Step 6: Validation Checks
CargoLint performs cross-field validation to catch obvious errors:
- Do line item totals add up to the invoice total?
- Do package counts match the shipment summary?
- Are dates in logical order (invoice date before due date)?
- Are required fields present?
If validation checks fail, confidence scores are adjusted down automatically.
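Three of the checks above (required fields, line item totals, and date order) can be sketched for an invoice as follows. The field names, the dictionary layout, and the fixed penalty per failed check are assumptions. The text only says that scores are adjusted down, not by how much.

```python
from datetime import date
from decimal import Decimal

# Hypothetical required fields and penalty size.
REQUIRED_FIELDS = ("invoice_number", "invoice_date", "due_date", "line_items", "total_amount")
PENALTY_PER_FAILURE = 0.15


def validate_invoice(invoice: dict) -> list[str]:
    """Return a description of every failed check; an empty list means all passed."""
    missing = [name for name in REQUIRED_FIELDS if invoice.get(name) is None]
    if missing:
        # The remaining checks need these fields, so stop here.
        return ["missing required fields: " + ", ".join(missing)]
    failures = []
    # Decimal avoids floating-point rounding when summing currency amounts.
    line_total = sum((item["amount"] for item in invoice["line_items"]), Decimal("0"))
    if line_total != invoice["total_amount"]:
        failures.append("line items do not add up to the invoice total")
    # Assumption: a due date equal to the invoice date is treated as valid.
    if invoice["invoice_date"] > invoice["due_date"]:
        failures.append("invoice date is after due date")
    return failures


def adjust_confidence(confidence: float, failures: list[str]) -> float:
    """Lower the score by a fixed amount per failed check, never below 0."""
    return max(0.0, confidence - PENALTY_PER_FAILURE * len(failures))
```

With these assumed values, an invoice that fails both the totals check and the date check would drop from 0.90 to roughly 0.60, which is enough to send it to review.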
Step 7: Review Queue Routing
After scoring and validation, the document is routed based on its overall confidence:
- High confidence (≥70%, a score of 0.70 or higher): The extracted data goes directly to your completed documents, ready to use or export.
- Lower confidence (<70%, a score below 0.70): The document goes to your review queue. A team member with sufficient permissions can open it, check the extractions, make corrections if needed, and then mark it complete.
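The routing rule reduces to a single threshold comparison. In this sketch the 70% threshold is written as 0.70 on the 0 to 1 scale used for confidence scores. The function name and the destination labels are hypothetical, and the way the overall confidence is derived from individual field scores is left out because it is not described here.

```python
REVIEW_THRESHOLD = 0.70  # the 70% cutoff, expressed on the 0 to 1 scale


def route_document(overall_confidence: float) -> str:
    """Return the (hypothetical) destination label for a scored document."""
    if not 0.0 <= overall_confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    # A score exactly at the threshold counts as high confidence.
    if overall_confidence >= REVIEW_THRESHOLD:
        return "completed"
    return "review_queue"
```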
Why does processing take a few seconds?
The entire pipeline (classification, language detection, extraction, scoring, and validation) typically takes 3 to 10 seconds per document. That time covers:
- The time needed for the AI to analyze each page
- The time to perform validation checks and adjust scores
- Network latency for secure transmission
Why do some documents need review and others don’t?
Documents that skip review are those with consistently clear text, standard layouts, and extractions that validate well. These typically score 70% or higher.
Documents that need review are those with:
- Low-quality scans or handwriting
- Non-standard layouts or missing fields
- Extracted values that don’t validate (e.g., math doesn’t add up)
- Languages or formats the AI is less familiar with
Review isn’t a sign of failure; it’s a designed part of the workflow. Some documents are genuinely ambiguous and benefit from human judgment.
Real-time monitoring
You can monitor the extraction status of all your documents from the dashboard. You’ll see:
- Total documents processed
- Documents completed
- Documents requiring review
- Average confidence across all documents
- 30-day trend charts showing volume and confidence over time
This visibility helps you understand document quality patterns and plan your review resources.
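If you export your document list, the headline dashboard numbers are straightforward to reproduce. The sketch below assumes each document is a dictionary with a `status` of either `"completed"` or `"review_queue"` and a `confidence` between 0 and 1. These field names are hypothetical and not taken from a CargoLint export format.

```python
from statistics import mean


def summarize(documents: list[dict]) -> dict:
    """Compute dashboard-style totals from a list of document records."""
    completed = sum(1 for doc in documents if doc["status"] == "completed")
    needs_review = sum(1 for doc in documents if doc["status"] == "review_queue")
    return {
        "total_processed": len(documents),
        "completed": completed,
        "requiring_review": needs_review,
        # mean() raises on empty input, so report None when there are no documents.
        "average_confidence": mean(doc["confidence"] for doc in documents) if documents else None,
    }
```

Grouping the same records by upload date would give the data behind a trend chart.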