CLIP Model Document Classification Category Overlap Issue
The current CLIP model implementation does not reliably distinguish between similar document categories, particularly receipts and invoices. When processing documents such as detailed sales reports, the confidence scores for these categories overlap significantly, even though the categories should be clearly separable.
Current Categories
CLIP_CATEGORIES = [
    "receipt",
    "invoice (table with bought items and their prices)",
    "cheque",
    "logo",
    "document",
    "blank document",
    "form",
    "contract",
    "letter",
    "chart",
    "graph",
]
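For context, CLIP-style zero-shot classification scores an image against each category string by taking the cosine similarity between the image embedding and each text embedding, scaling the similarities, and applying a softmax. The sketch below reproduces only that scoring step in plain Python, assuming the embeddings were computed elsewhere. The function names, the toy 3-dimensional vectors, and the `logit_scale` default of 100.0 are illustrative assumptions, not taken from this project's code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Map each label to a probability (hypothetical helper).

    text_embs maps label -> embedding; logit_scale is an assumed value.
    """
    labels = list(text_embs)
    logits = [logit_scale * cosine(image_emb, text_embs[label]) for label in labels]
    return dict(zip(labels, softmax(logits)))

# Toy embeddings, made up for illustration: "receipt" and "invoice"
# point in nearly the same direction, "letter" does not.
toy_text = {
    "receipt": [0.9, 0.1, 0.0],
    "invoice": [0.88, 0.15, 0.0],
    "letter": [0.0, 0.2, 0.9],
}
toy_image = [0.9, 0.12, 0.05]
probs = zero_shot_probs(toy_image, toy_text)
```

With these toy vectors, the two nearly parallel text embeddings split the probability mass almost evenly, which is the kind of overlap this issue describes.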
Problem
- Category Overlap:
  - Receipt vs Invoice: Model struggles to differentiate between these when processing sales documents
- Example Case: When processing a restaurant sales report containing:
  - Itemized sales data
  - Payment summaries
  - Tax calculations
  - Business information
  The model produces ambiguous confidence scores between "receipt" and "invoice" categories.
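One way to make "ambiguous confidence scores" measurable is to look at the margin between the two highest-scoring categories. The following is a minimal sketch under that assumption. The helper names, the `min_margin` threshold of 0.15, and the example scores are all hypothetical; the scores are invented for illustration and are not measured output from the model.

```python
def top_two_margin(scores):
    """Return (best_label, runner_up_label, margin) for a label -> score dict."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best_label, best), (second_label, second) = ranked[0], ranked[1]
    return best_label, second_label, best - second

def is_ambiguous(scores, min_margin=0.15):
    """Flag a result whose top two scores are closer than min_margin (assumed threshold)."""
    _, _, margin = top_two_margin(scores)
    return margin < min_margin

# Invented scores resembling the receipt/invoice split described above.
example_scores = {"receipt": 0.46, "invoice": 0.41, "document": 0.08, "form": 0.05}
best, runner_up, margin = top_two_margin(example_scores)
needs_review = is_ambiguous(example_scores)
```

A flag like `needs_review` could route a document to manual categorization instead of accepting a near-tie as a confident result.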
Impact
- Unreliable classification results
- High uncertainty in document type determination
- Reduced accuracy in automated document processing
- Manual intervention often needed for correct categorization
Proposed Solutions
- Refine Category Definitions:
  - Add more specific categories like "sales_report", "financial_statement"
  - Create subcategories for business-specific documents
  - Include composite categories for hybrid documents
- Training Improvements:
  - Enhance training data with more diverse document examples
  - Include more restaurant-specific financial documents
  - Add clear distinguishing features between receipts and invoices
- Category Refinement:
REFINED_CATEGORIES = [
    "simple_receipt",         # Basic transaction receipts
    "detailed_sales_report",  # Comprehensive business sales data
    "commercial_invoice",     # Formal billing documents
    "financial_statement",    # Detailed financial reports
    "business_document",      # General business documentation
    "form",
    "contract",
    "letter",
    "chart",
    "graph",
]
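CLIP's text encoder works on natural-language text, so identifier-style keys such as `detailed_sales_report` may score less reliably than descriptive phrases. One possible approach, sketched below, keeps the refined keys as internal names and maps each to one or more descriptive prompts, then sums the per-prompt probabilities back into a per-category score. The prompt wording, `CATEGORY_PROMPTS`, and both helper functions are assumptions made for illustration and would need tuning against real documents.

```python
from collections import defaultdict

# Hypothetical prompt wording for a subset of the refined categories.
CATEGORY_PROMPTS = {
    "simple_receipt": [
        "a scan of a short point-of-sale receipt",
        "a photo of a till receipt for a single purchase",
    ],
    "detailed_sales_report": [
        "a scan of a daily sales report with totals and tax breakdowns",
    ],
    "commercial_invoice": [
        "a scan of an invoice billing a customer, with a due date",
    ],
}

def flatten_prompts(category_prompts):
    """Return parallel lists: prompt texts and the category each belongs to."""
    prompts, owners = [], []
    for category, texts in category_prompts.items():
        for text in texts:
            prompts.append(text)
            owners.append(category)
    return prompts, owners

def aggregate_by_category(prompt_probs, owners):
    """Sum per-prompt probabilities into one score per category."""
    totals = defaultdict(float)
    for prob, category in zip(prompt_probs, owners):
        totals[category] += prob
    return dict(totals)

prompts, owners = flatten_prompts(CATEGORY_PROMPTS)
# Invented per-prompt probabilities, in the same order as `prompts`.
category_scores = aggregate_by_category([0.2, 0.1, 0.6, 0.1], owners)
```

Here `prompts` is what would be passed to the text encoder, and `category_scores` is what downstream code would compare. Summing is only one aggregation choice; taking the maximum per category is an alternative.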
Additional Context
Test document: Restaurant daily sales report containing detailed financial breakdowns, which received split classifications between receipt and invoice categories.
Labels
- enhancement
- machine-learning
- document-classification
- CLIP-model
- accuracy