Multilingual Document Processing
Handle documents in English, Chinese, and Spanish with CargoLint's multilingual extraction engine.
Supported Languages
CargoLint’s multilingual engine supports documents in the following languages:
- English - Full support with highest accuracy
- Simplified Chinese - Common in mainland China operations
- Traditional Chinese - Used in Hong Kong, Taiwan, and regional partners
- Spanish - European and Latin American trade documents
Documents in other languages will fall back to English processing.
Automatic Language Detection
CargoLint automatically detects the document language and applies the appropriate extraction model:
{
  "document_id": "doc_789abc",
  "detected_language": "zh-Hans",
  "language_confidence": 0.99,
  "extracted_data": {
    "shipper": "深圳市技术有限公司",
    "invoice_number": "20260301-0001"
  }
}
You don’t need to specify the language. To distinguish Simplified from Traditional Chinese, the system combines language detection with Unicode character frequency analysis. For mixed-language documents, it identifies the primary language and adapts extraction accordingly.
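To make the character-frequency idea concrete, here is a minimal sketch of how Simplified and Traditional text could be told apart by counting characters that exist in only one of the two forms. It is not CargoLint's implementation: the function name `guess_chinese_variant` and the two small character sets are assumptions chosen for illustration, and a production detector would use far larger tables.

```python
from typing import Optional

# Illustrative character sets only: each simplified character below has a
# distinct traditional counterpart at the same position in the other string.
SIMPLIFIED_ONLY = set("国书发门东车马个这们为时会对开关术说货单号运")
TRADITIONAL_ONLY = set("國書發門東車馬個這們為時會對開關術說貨單號運")


def guess_chinese_variant(text: str) -> Optional[str]:
    """Return 'zh-Hans', 'zh-Hant', or None when there is no clear evidence."""
    simplified = sum(ch in SIMPLIFIED_ONLY for ch in text)
    traditional = sum(ch in TRADITIONAL_ONLY for ch in text)
    if simplified == traditional:
        return None  # no distinguishing characters found, or a tie
    return "zh-Hans" if simplified > traditional else "zh-Hant"
```

With the shipper name from the sample response, `guess_chinese_variant("深圳市技术有限公司")` returns `"zh-Hans"` because 术 appears only in the simplified set. Text with no distinguishing characters returns `None`, which is why a heuristic like this is best combined with a general language detector.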
Mixed-Language Documents
Many trade documents contain multiple languages (e.g., English headers with Chinese item descriptions). CargoLint handles these documents at two levels:
- Primary Language Detection - Identifies the main document language
- Field-Level Language Handling - Processes each field in its detected language
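If you want to see which extracted fields contain Chinese text on the client side, a rough per-field script check is often enough. The sketch below is an assumption-laden illustration, not CargoLint's field-level logic: `contains_cjk` and `tag_fields_by_script` are hypothetical helper names, and the check covers only the main CJK Unified Ideographs block (U+4E00 to U+9FFF).

```python
from typing import Dict


def contains_cjk(text: str) -> bool:
    """True if any character falls in the CJK Unified Ideographs block."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


def tag_fields_by_script(fields: Dict[str, str]) -> Dict[str, str]:
    """Label each field 'cjk' or 'non-cjk' as a rough routing hint."""
    return {
        name: ("cjk" if contains_cjk(value) else "non-cjk")
        for name, value in fields.items()
    }


fields = {
    "shipper": "深圳市技术有限公司",
    "invoice_number": "20260301-0001",
    "description": "Integrated circuits",
}
tags = tag_fields_by_script(fields)
```

Here `tags["shipper"]` is `"cjk"` and the other two fields are `"non-cjk"`. A script check like this cannot separate English from Spanish, so it only complements the detected language reported for the document.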
Confidence Scores by Language
Typical extraction confidence varies by language:
| Language | Typical Confidence | Notes |
|---|---|---|
| English | 92-98% | Highest accuracy, well-trained model |
| Spanish | 90-96% | Excellent support for European/Latin American documents |
| Simplified Chinese | 88-94% | Strong performance, especially for invoices |
| Traditional Chinese | 86-92% | Slightly lower due to character complexity |
Chinese documents typically score a few points lower than English or Spanish documents because of character complexity, but extraction remains reliable for business documents.
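One way to use these ranges is as per-language floors for manual review. The sketch below assumes you have an extraction confidence expressed as a fraction between 0 and 1; the sample response on this page shows only `language_confidence`, so the `confidence` argument, the `needs_review` name, and the choice of each range's lower bound as the threshold are all assumptions.

```python
# Lower bound of each "Typical Confidence" range from the table above.
TYPICAL_MINIMUM = {"en": 0.92, "es": 0.90, "zh-Hans": 0.88, "zh-Hant": 0.86}


def needs_review(language: str, confidence: float) -> bool:
    """Flag an extraction whose confidence is below its language's typical range."""
    # Other languages fall back to English processing, so this sketch
    # applies the English floor to them (an assumption).
    floor = TYPICAL_MINIMUM.get(language, TYPICAL_MINIMUM["en"])
    return confidence < floor
```

For example, a confidence of 0.87 is within the typical range for Traditional Chinese, so `needs_review("zh-Hant", 0.87)` is `False`, while the same score for an English document is flagged.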
Best Practices for Multilingual Documents
For Chinese Documents
- Ensure document images have high resolution (≥300 DPI)
- Avoid rotated or skewed text
- Maintain consistent font sizing
- Use clear black text on white background
- Avoid stylized or decorative fonts
- Use standard Simplified or Traditional characters (avoid mixing the two)
For Mixed-Language Documents
- Keep language sections clearly separated when possible
- Use consistent formatting across languages
- Avoid embedding text in images
- Ensure country-specific symbols are crisp and clear
Language-Specific Field Handling
Certain fields are normalized regardless of source language:
- HS Codes - Always extracted in standard format
- Currency Codes - Normalized to ISO 4217 codes
- Country Codes - Converted to ISO 3166 codes
- Dates - Standardized to ISO 8601 format
{
  "currency_original": "人民币",
  "currency_code": "CNY",
  "date_original": "2026年3月2日",
  "date_normalized": "2026-03-02",
  "hs_code": "8542.31.00"
}
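The mapping from original to normalized values shown above can be reproduced with a small lookup table and a date pattern. This is only a sketch of the idea, not CargoLint's normalizer: the helper names, the four-entry currency table, and the restriction to the `YYYY年M月D日` date form are assumptions made for illustration.

```python
import re
from datetime import date
from typing import Optional

# Tiny illustrative lookup; a real mapping would cover many more names.
CURRENCY_NAMES = {"人民币": "CNY", "人民幣": "CNY", "美元": "USD", "欧元": "EUR"}

# Matches dates written as year 年, month 月, day 日.
CHINESE_DATE = re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})日")


def normalize_currency(name: str) -> Optional[str]:
    """Map a currency name to its ISO 4217 code, or None if unknown."""
    return CURRENCY_NAMES.get(name.strip())


def normalize_chinese_date(text: str) -> Optional[str]:
    """Convert a Chinese-format date to ISO 8601, or None if it is not valid."""
    match = CHINESE_DATE.search(text)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day).isoformat()
    except ValueError:
        return None  # e.g. month 13 or day 32
```

Calling `normalize_chinese_date("2026年3月2日")` gives `"2026-03-02"` and `normalize_currency("人民币")` gives `"CNY"`, matching the sample. Returning `None` for unknown or impossible values keeps bad input visible instead of silently guessing.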
Improving Extraction Accuracy
For documents in Chinese:
- Use high-quality scans - 300 DPI minimum
- Enable manual review - Flag low-confidence extractions for human verification
- Provide context - Use document classification to help the model
- Submit corrections - Corrections improve model accuracy for the languages in your documents
Language codes used by CargoLint: en, zh-Hans, zh-Hant, es.