Multilingual Document Processing

Handle documents in English, Chinese, and Spanish with CargoLint's multilingual extraction engine.

Supported Languages

CargoLint’s multilingual engine supports documents in the following languages:

English - Full support with highest accuracy
Simplified Chinese - Common in mainland China operations
Traditional Chinese - Used in Hong Kong, Taiwan, and regional partners
Spanish - European and Latin American trade documents

Documents in other languages will fall back to English processing.
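The fallback rule can be pictured with a minimal sketch. The function name `processing_language` is hypothetical and not part of CargoLint's API; the set of codes is taken from the language codes listed at the end of this page.

```python
# Language codes documented for CargoLint on this page.
SUPPORTED_LANGUAGES = {"en", "zh-Hans", "zh-Hant", "es"}


def processing_language(detected: str) -> str:
    """Return the language whose extraction model would apply.

    Hypothetical helper: any language outside the supported set
    falls back to English, mirroring the rule described above.
    """
    return detected if detected in SUPPORTED_LANGUAGES else "en"
```

Under this sketch, a document detected as `fr` would be processed as `en`, while `zh-Hant` is kept as-is.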

Automatic Language Detection

CargoLint automatically detects the document language and applies the appropriate extraction model:

{
  "document_id": "doc_789abc",
  "detected_language": "zh-Hans",
  "language_confidence": 0.99,
  "extracted_data": {
    "shipper": "深圳市技术有限公司",
    "invoice_number": "20260301-0001"
  }
}

You don’t need to specify the language when submitting a document. To distinguish Simplified from Traditional Chinese, the system combines general language detection with Unicode character frequency analysis. For mixed-language documents, it identifies the primary language and adapts extraction accordingly.
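To make the character frequency idea concrete, here is a minimal sketch of how Simplified and Traditional Chinese could be told apart. It is not CargoLint's implementation: the function name and the two small character sets are assumptions chosen for illustration, and a production system would rely on much larger tables.

```python
from typing import Optional

# Tiny illustrative samples (an assumption, not CargoLint's tables).
# Each Simplified character has its Traditional counterpart at the same position.
SIMPLIFIED_ONLY = set("发国这个们来时书东车门开关长马语术")
TRADITIONAL_ONLY = set("發國這個們來時書東車門開關長馬語術")


def guess_chinese_variant(text: str) -> Optional[str]:
    """Guess zh-Hans or zh-Hant by counting variant-specific characters."""
    simplified = sum(ch in SIMPLIFIED_ONLY for ch in text)
    traditional = sum(ch in TRADITIONAL_ONLY for ch in text)
    if simplified == traditional:
        return None  # no evidence either way, or a tie
    return "zh-Hans" if simplified > traditional else "zh-Hant"
```

With these sample sets, the shipper name from the response above (`深圳市技术有限公司`) is classified as `zh-Hans` because of the character 术, while text with no variant-specific characters returns `None`.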

Mixed-Language Documents

Many trade documents contain multiple languages (e.g., English headers with Chinese item descriptions). CargoLint handles these at two levels:

Primary Language Detection - Identifies the main document language
Field-Level Language Handling - Processes each field in its detected language
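As a rough illustration of field-level handling, the sketch below labels each extracted field by its dominant script using Unicode code point ranges. The function names and the field names in the usage note are hypothetical, and script detection is only a coarse stand-in for real per-field language detection.

```python
from typing import Dict


def dominant_script(value: str) -> str:
    """Classify a field value as 'han', 'latin', or 'unknown'."""
    # CJK Unified Ideographs block: U+4E00 to U+9FFF.
    han = sum(1 for ch in value if "\u4e00" <= ch <= "\u9fff")
    # Alphabetic characters in the Latin blocks below U+0250,
    # which covers accented Spanish letters such as ñ and í.
    latin = sum(1 for ch in value if ch.isalpha() and ord(ch) < 0x250)
    if han == 0 and latin == 0:
        return "unknown"  # digits and punctuation only
    return "han" if han >= latin else "latin"


def scripts_by_field(fields: Dict[str, str]) -> Dict[str, str]:
    """Map each field name to the dominant script of its value."""
    return {name: dominant_script(value) for name, value in fields.items()}
```

For example, `scripts_by_field({"title": "Commercial Invoice", "shipper": "深圳市技术有限公司"})` labels the title as `latin` and the shipper as `han`, so each field could then be routed to a matching model.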

Confidence Scores by Language

Typical extraction confidence varies by language:

Language	Typical Confidence	Notes
English	92-98%	Highest accuracy, well-trained model
Spanish	90-96%	Excellent support for European/Latin American documents
Simplified Chinese	88-94%	Strong performance, especially for invoices
Traditional Chinese	86-92%	Slightly lower due to character complexity

Confidence for Chinese documents typically runs a few points lower than for English or Spanish, but extraction remains highly reliable for business documents.

Best Practices for Multilingual Documents

For Chinese Documents

Ensure document images have high resolution (≥300 DPI)
Avoid rotated or skewed text
Maintain consistent font sizing
Use clear black text on white background
Avoid stylized or decorative fonts
Use standard Simplified or Traditional characters (avoid mixing the two)

For Mixed-Language Documents

Keep language sections clearly separated when possible
Use consistent formatting across languages
Avoid embedding text in images
Ensure country-specific symbols are crisp and clear

Language-Specific Field Handling

Certain fields are normalized regardless of source language:

HS Codes - Always extracted in standard format
Currency Codes - Normalized to ISO 4217 codes
Country Codes - Converted to ISO 3166 codes
Dates - Standardized to ISO 8601 format

{
  "currency_original": "人民币",
  "currency_code": "CNY",
  "date_original": "2026年3月2日",
  "date_normalized": "2026-03-02",
  "hs_code": "8542.31.00"
}
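The kind of normalization shown above can be sketched in a few lines. This is an illustration only: the function name, the regular expression, and the small currency table are assumptions, not CargoLint internals, and they cover just the Chinese date format and a few currency names.

```python
import re
from datetime import date
from typing import Optional

# Matches dates written as <year>年<month>月<day>日.
CN_DATE = re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})日")

# Illustrative mapping from Chinese currency names to ISO 4217 codes.
CURRENCY_NAMES = {"人民币": "CNY", "美元": "USD", "欧元": "EUR"}


def normalize_cn_date(text: str) -> Optional[str]:
    """Convert a Chinese-format date to ISO 8601, or return None."""
    match = CN_DATE.fullmatch(text.strip())
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day).isoformat()
    except ValueError:
        return None  # e.g. month 13 or day 32
```

Applied to the sample above, `normalize_cn_date("2026年3月2日")` returns `"2026-03-02"` and `CURRENCY_NAMES["人民币"]` is `"CNY"`.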

Improving Extraction Accuracy

For documents in Chinese:

Use high-quality scans - 300 DPI minimum
Enable manual review - Flag low-confidence extractions for human verification
Provide context - Use document classification to help the model
Submit corrections - Corrections improve model accuracy for the languages in your documents
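One way to act on the manual review recommendation is to compare each extraction's confidence with a per-language threshold. The sketch below is hypothetical: it reuses the lower bounds of the typical confidence ranges on this page as thresholds, and it assumes an extraction confidence between 0 and 1 is available to the caller (the sample responses on this page only show `language_confidence`).

```python
from typing import Dict

# Lower bounds of the typical confidence ranges, as fractions.
# Using them as review thresholds is an assumption for illustration.
REVIEW_THRESHOLDS: Dict[str, float] = {
    "en": 0.92,
    "es": 0.90,
    "zh-Hans": 0.88,
    "zh-Hant": 0.86,
}


def needs_review(language: str, confidence: float) -> bool:
    """Return True when an extraction should go to human verification."""
    threshold = REVIEW_THRESHOLDS.get(language)
    if threshold is None:
        # Conservative choice: always review languages without a threshold.
        return True
    return confidence < threshold
```

With these values, a Traditional Chinese extraction at 0.85 would be flagged, while the same score at 0.90 would pass. Thresholds should be tuned against your own correction data rather than taken from this sketch.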

Language codes used by CargoLint: en, zh-Hans, zh-Hant, es.