Multilingual Document Processing

Handle documents in English, Chinese, and Spanish with CargoLint's multilingual extraction engine.

Supported Languages

CargoLint’s multilingual engine supports documents in the following languages:

  • English - Full support with highest accuracy
  • Simplified Chinese - Common in mainland China operations
  • Traditional Chinese - Used in Hong Kong, Taiwan, and regional partners
  • Spanish - European and Latin American trade documents

Documents in other languages will fall back to English processing.
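
If you want to anticipate this fallback on the client side, the rule can be sketched as below. This is only an illustration of the documented behavior: the function name `resolve_processing_language` is hypothetical and not part of any CargoLint SDK.

```python
# Language codes listed in this document as supported by CargoLint.
SUPPORTED_LANGUAGES = {"en", "zh-Hans", "zh-Hant", "es"}


def resolve_processing_language(detected: str) -> str:
    """Return the language whose model would process the document.

    Hypothetical helper: any unsupported language falls back to English.
    """
    return detected if detected in SUPPORTED_LANGUAGES else "en"


print(resolve_processing_language("zh-Hant"))  # supported, kept as-is
print(resolve_processing_language("fr"))       # unsupported, falls back
```

A caller could use this to warn users before upload that a French or German document will be processed with the English model.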

Automatic Language Detection

CargoLint automatically detects the document language and applies the appropriate extraction model:

{
  "document_id": "doc_789abc",
  "detected_language": "zh-Hans",
  "language_confidence": 0.99,
  "extracted_data": {
    "shipper": "深圳市技术有限公司",
    "invoice_number": "20260301-0001"
  }
}

You don’t need to specify the language; CargoLint detects it automatically. To distinguish Simplified from Traditional Chinese, the system combines language detection with Unicode character frequency analysis. For mixed-language documents, it identifies the primary language and adapts extraction accordingly.
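
To give a feel for what character frequency analysis means here, the following is a minimal sketch, not CargoLint's actual detector. It counts characters that typically appear only in one variant. The two hint sets are a tiny illustrative sample chosen for this example, and `guess_chinese_variant` is a hypothetical name.

```python
from typing import Optional

# Tiny illustrative sample of characters that differ between the variants.
# A real detector would rely on a far larger table.
SIMPLIFIED_HINTS = set("国发货单运术业这们为东车电号门")
TRADITIONAL_HINTS = set("國發貨單運術業這們為東車電號門")


def guess_chinese_variant(text: str) -> Optional[str]:
    """Guess zh-Hans or zh-Hant by counting variant-specific characters."""
    simplified = sum(ch in SIMPLIFIED_HINTS for ch in text)
    traditional = sum(ch in TRADITIONAL_HINTS for ch in text)
    if simplified == traditional:
        return None  # no evidence either way, or evenly mixed
    return "zh-Hans" if simplified > traditional else "zh-Hant"


print(guess_chinese_variant("深圳市技术有限公司"))
print(guess_chinese_variant("深圳市技術有限公司"))
```

Most characters are shared between the two variants, so short fields may carry no evidence at all. That is why the sketch returns `None` instead of guessing.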

Mixed-Language Documents

Many trade documents contain multiple languages (e.g., English headers with Chinese item descriptions). CargoLint processes these documents in two steps:

  1. Primary Language Detection - Identifies the main document language
  2. Field-Level Language Handling - Processes each field in its detected language
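
As a rough illustration of the field-level step, the sketch below classifies each field value by its dominant script using Python's standard `unicodedata` module. It is an assumption about how such a step could look, not a description of CargoLint's internals. `field_script` and the `description` field are hypothetical.

```python
import unicodedata


def field_script(value: str) -> str:
    """Classify a field value as 'cjk', 'latin', or 'other' by character counts."""
    names = [unicodedata.name(ch, "") for ch in value]
    cjk = sum(name.startswith("CJK") for name in names)
    latin = sum(name.startswith("LATIN") for name in names)
    if cjk == 0 and latin == 0:
        return "other"  # digits, punctuation, or scripts not handled here
    return "cjk" if cjk >= latin else "latin"


fields = {
    "shipper": "深圳市技术有限公司",
    "invoice_number": "20260301-0001",
    "description": "Integrated circuits",  # hypothetical field for illustration
}
for name, value in fields.items():
    print(name, field_script(value))
```

Fields made up only of digits and punctuation, such as invoice numbers, carry no language signal, so the sketch reports them as `other`.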

Confidence Scores by Language

Extraction confidence varies by language complexity:

Language              Typical Confidence   Notes
English               92-98%               Highest accuracy, well-trained model
Spanish               90-96%               Excellent support for European/Latin American documents
Simplified Chinese    88-94%               Strong performance, especially for invoices
Traditional Chinese   86-92%               Slightly lower due to character complexity

Chinese documents typically score a few points lower than English because of character complexity, but extraction remains reliable for business documents.
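
One way to act on these ranges is to set a per-language review threshold, so a score that is normal for Traditional Chinese is not held to the English standard. The sketch below uses the lower bound of each typical range listed above as the threshold. That choice, the name `needs_review`, and the use of the English threshold for unlisted languages are all assumptions for illustration. This page does not show how per-field extraction confidence is exposed in the response, so the function simply takes a number.

```python
# Lower bounds of the typical confidence ranges listed above,
# used here as illustrative per-language review thresholds.
REVIEW_THRESHOLDS = {
    "en": 0.92,
    "es": 0.90,
    "zh-Hans": 0.88,
    "zh-Hant": 0.86,
}


def needs_review(language: str, confidence: float) -> bool:
    """Return True when confidence falls below the typical range for the language.

    Unlisted languages use the English threshold, mirroring the
    documented fallback to English processing.
    """
    threshold = REVIEW_THRESHOLDS.get(language, REVIEW_THRESHOLDS["en"])
    return confidence < threshold


print(needs_review("zh-Hant", 0.90))  # within the typical range
print(needs_review("en", 0.90))       # below the typical English range
```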

Best Practices for Multilingual Documents

For Chinese Documents

  • Ensure document images have high resolution (≥300 DPI)
  • Avoid rotated or skewed text
  • Maintain consistent font sizing
  • Use clear black text on white background
  • Avoid stylized or decorative fonts
  • Use standard Simplified or Traditional characters (avoid mixing the two)

For Mixed-Language Documents

  • Keep language sections clearly separated when possible
  • Use consistent formatting across languages
  • Avoid embedding text in images
  • Ensure country-specific symbols are crisp and clear

Language-Specific Field Handling

Certain fields are normalized regardless of source language:

  • HS Codes - Always extracted in standard format
  • Currency Codes - Normalized to ISO 4217 codes
  • Country Codes - Converted to ISO 3166 codes
  • Dates - Standardized to ISO 8601 format

For example:

{
  "currency_original": "人民币",
  "currency_code": "CNY",
  "date_original": "2026年3月2日",
  "date_normalized": "2026-03-02",
  "hs_code": "8542.31.00"
}
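
The same kind of normalization can be approximated on the client side, for example to cross-check results. The following is a minimal sketch under stated assumptions: the currency lookup is a small hand-written sample, not CargoLint's mapping table, and `normalize_currency` and `normalize_date` are hypothetical names.

```python
import re
from datetime import date
from typing import Optional

# Small illustrative lookup from Chinese currency names to ISO 4217 codes.
CURRENCY_NAMES = {"人民币": "CNY", "美元": "USD", "欧元": "EUR"}

# Matches dates written as <year>年<month>月<day>日.
CHINESE_DATE = re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})日")


def normalize_currency(name: str) -> Optional[str]:
    """Map a currency name to its ISO 4217 code, or None if unknown."""
    return CURRENCY_NAMES.get(name.strip())


def normalize_date(text: str) -> Optional[str]:
    """Convert a Chinese-format date to ISO 8601, or None if it cannot be parsed."""
    match = CHINESE_DATE.fullmatch(text.strip())
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day).isoformat()
    except ValueError:
        return None  # for example, a day that does not exist in that month


print(normalize_currency("人民币"))
print(normalize_date("2026年3月2日"))
```

Returning `None` for unparseable input keeps the original value available for manual review instead of silently producing a wrong date.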

Improving Extraction Accuracy

For documents in Chinese:

  1. Use high-quality scans - 300 DPI minimum
  2. Enable manual review - Flag low-confidence extractions for human verification
  3. Provide context - Use document classification to help the model
  4. Submit corrections - Corrections improve model accuracy for the languages you process

Language codes used by CargoLint: en, zh-Hans, zh-Hant, es.