
Incorrect Extraction Alignment

Open derrick56007 opened this issue 8 months ago • 3 comments

Hello, I've been implementing a custom provider using gemma3:27b. The raw responses look correct, but I'm seeing odd extraction results from the LangExtract library (main branch).

Below is what my LLM API instance returns:

{
  "extractions": [
    {
      "extraction_class": "search_news",
      "extraction_text": "articles about Tesla",
      "attributes": {
        "query": "articles about Tesla"
      }
    }
  ]
}

DEBUG logs

DEBUG:absl:Processing chunk: TextChunk(
  interval=[start_index: 0, end_index: 4],
  Document ID: doc_8e5527ad,
  Chunk Text: 'get articles about Tesla'
)
DEBUG:absl:Top inference result: {
  "extractions": [
    {
      "extraction_class": "search_news",
      "extraction_text": "articles about Tesla",
      "attributes": {
        "query": "articles about Tesla"
      }
    }
  ]
}
INFO:absl:Starting resolver process for input text.
DEBUG:absl:Input Text: {
  "extractions": [
    {
      "extraction_class": "search_news",
      "extraction_text": "articles about Tesla",
      "attributes": {
        "query": "articles about Tesla"
      }
    }
  ]
}
INFO:absl:Starting string parsing.
DEBUG:absl:input_string: {
  "extractions": [
    {
      "extraction_class": "search_news",
      "extraction_text": "articles about Tesla",
      "attributes": {
        "query": "articles about Tesla"
      }
    }
  ]
}
DEBUG:absl:Successfully parsed content.
INFO:absl:Completed parsing of string.
DEBUG:absl:Parsed content: [{'extraction_class': 'search_news', 'extraction_text': 'articles about Tesla', 'attributes': {'query': 'articles about Tesla'}}]
INFO:absl:Starting to extract and order extractions from data.
INFO:absl:Completed extraction and ordering of extractions.
DEBUG:absl:Completed the resolver process.
INFO:absl:Starting alignment process for provided chunk text.
DEBUG:absl:WordAligner: Starting alignment of extractions with the source text. Extraction groups to align: [[Extraction(extraction_class='extraction_class', extraction_text='search_news', char_interval=None, alignment_status=None, extraction_index=1, group_index=0, description=None, attributes=None), Extraction(extraction_class='extraction_text', extraction_text='articles about Tesla', char_interval=None, alignment_status=None, extraction_index=2, group_index=0, description=None, attributes=None), Extraction(extraction_class='attributes', extraction_text="{'query': 'articles about Tesla'}", char_interval=None, alignment_status=None, extraction_index=3, group_index=0, description=None, attributes=None)]]
2025-08-14 12:59:34,302 - langextract.debug - DEBUG - [langextract.tokenizer] CALL: tokenize(text='get articles about Tesla')
2025-08-14 12:59:34,302 - langextract.debug - DEBUG - [langextract.tokenizer] RETURN: tokenize -> TokenizedText...wline=False)]) (0.1 ms)
2025-08-14 12:59:34,303 - langextract.debug - DEBUG - [langextract.tokenizer] CALL: tokenize(text='␟')
2025-08-14 12:59:34,303 - langextract.debug - DEBUG - [langextract.tokenizer] RETURN: tokenize -> TokenizedText...wline=False)]) (0.0 ms)
DEBUG:absl:Using delimiter '␟' for extraction alignment
2025-08-14 12:59:34,303 - langextract.debug - DEBUG - [langextract.tokenizer] CALL: tokenize(text="search_news ␟ articles about Tesla ␟ {'query': 'articles about Tesla'}")
2025-08-14 12:59:34,303 - langextract.debug - DEBUG - [langextract.tokenizer] RETURN: tokenize -> TokenizedText...wline=False)]) (0.1 ms)
DEBUG:absl:Processing extraction group 0 with 3 extractions.
2025-08-14 12:59:34,303 - langextract.debug - DEBUG - [langextract.tokenizer] CALL: tokenize(text='search_news')
2025-08-14 12:59:34,303 - langextract.debug - DEBUG - [langextract.tokenizer] RETURN: tokenize -> TokenizedText...wline=False)]) (0.0 ms)
2025-08-14 12:59:34,303 - langextract.debug - DEBUG - [langextract.tokenizer] CALL: tokenize(text='articles about Tesla')
2025-08-14 12:59:34,303 - langextract.debug - DEBUG - [langextract.tokenizer] RETURN: tokenize -> TokenizedText...wline=False)]) (0.0 ms)
2025-08-14 12:59:34,303 - langextract.debug - DEBUG - [langextract.tokenizer] CALL: tokenize(text="{'query': 'articles about Tesla'}")
2025-08-14 12:59:34,303 - langextract.debug - DEBUG - [langextract.tokenizer] RETURN: tokenize -> TokenizedText...wline=False)]) (0.1 ms)
2025-08-14 12:59:34,303 - langextract.debug - DEBUG - [langextract.tokenizer] CALL: tokenize(text='get articles about Tesla')
2025-08-14 12:59:34,304 - langextract.debug - DEBUG - [langextract.tokenizer] RETURN: tokenize -> TokenizedText...wline=False)]) (0.0 ms)
2025-08-14 12:59:34,304 - langextract.debug - DEBUG - [langextract.tokenizer] CALL: tokenize(text='articles about Tesla')
2025-08-14 12:59:34,304 - langextract.debug - DEBUG - [langextract.tokenizer] RETURN: tokenize -> TokenizedText...wline=False)]) (0.0 ms)
DEBUG:absl:Starting fuzzy alignment for 2 unaligned extractions
2025-08-14 12:59:34,304 - langextract.debug - DEBUG - [langextract.tokenizer] CALL: tokenize(text='search_news')
2025-08-14 12:59:34,304 - langextract.debug - DEBUG - [langextract.tokenizer] RETURN: tokenize -> TokenizedText...wline=False)]) (0.0 ms)
DEBUG:absl:Fuzzy aligning 'search_news' (3 tokens)
2025-08-14 12:59:34,304 - langextract.debug - DEBUG - [langextract.tokenizer] CALL: tokenize(text="{'query': 'articles about Tesla'}")
2025-08-14 12:59:34,304 - langextract.debug - DEBUG - [langextract.tokenizer] RETURN: tokenize -> TokenizedText...wline=False)]) (0.1 ms)
DEBUG:absl:Fuzzy aligning "{'query': 'articles about Tesla'}" (8 tokens)
DEBUG:absl:Final aligned extraction groups: [[Extraction(extraction_class='extraction_class', extraction_text='search_news', char_interval=None, alignment_status=None, extraction_index=1, group_index=0, description=None, attributes=None), Extraction(extraction_class='extraction_text', extraction_text='articles about Tesla', char_interval=CharInterval(start_pos=4, end_pos=24), alignment_status=<AlignmentStatus.MATCH_EXACT: 'match_exact'>, extraction_index=2, group_index=0, description=None, attributes=None), Extraction(extraction_class='attributes', extraction_text="{'query': 'articles about Tesla'}", char_interval=None, alignment_status=None, extraction_index=3, group_index=0, description=None, attributes=None)]]
DEBUG:absl:Aligned extractions count: 3
DEBUG:absl:Yielding aligned extraction: Extraction(extraction_class='extraction_class', extraction_text='search_news', char_interval=None, alignment_status=None, extraction_index=1, group_index=0, description=None, attributes=None)
DEBUG:absl:Yielding aligned extraction: Extraction(extraction_class='extraction_text', extraction_text='articles about Tesla', char_interval=CharInterval(start_pos=4, end_pos=24), alignment_status=<AlignmentStatus.MATCH_EXACT: 'match_exact'>, extraction_index=2, group_index=0, description=None, attributes=None)
DEBUG:absl:Yielding aligned extraction: Extraction(extraction_class='attributes', extraction_text="{'query': 'articles about Tesla'}", char_interval=None, alignment_status=None, extraction_index=3, group_index=0, description=None, attributes=None)
INFO:absl:Completed alignment process for the provided source_text.
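
The one extraction that did align can be checked by hand. The sketch below uses plain Python string search and whitespace splitting as a stand-in for LangExtract's tokenizer (an assumption; the real tokenizer splits differently, as the "3 tokens" reported for 'search_news' shows). The helper name `locate` is hypothetical.

```python
def locate(source: str, phrase: str):
    """Return (char_start, char_end, token_start, token_end) of phrase in source.

    Whitespace splitting stands in for the real tokenizer (an assumption).
    Returns None when the phrase does not occur in the source text.
    """
    char_start = source.find(phrase)
    if char_start == -1:
        return None
    char_end = char_start + len(phrase)
    # Count the whole whitespace-separated tokens that precede the match.
    token_start = len(source[:char_start].split())
    token_end = token_start + len(phrase.split())
    return char_start, char_end, token_start, token_end


chunk = "get articles about Tesla"
print(locate(chunk, "articles about Tesla"))  # (4, 24, 1, 4)
print(locate(chunk, "search_news"))           # None
```

The first result agrees with the CharInterval(start_pos=4, end_pos=24) in the log, and 'search_news' never appears in the chunk text, which is consistent with its char_interval of None.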

However, printing result.extractions produces the following:

{
  "extractions": [
    {
      "extraction_class": "extraction_class",
      "extraction_text": "search_news",
      "char_interval": null,
      "alignment_status": null,
      "extraction_index": 1,
      "group_index": 0,
      "description": null,
      "attributes": null,
      "_token_interval": null
    },
    {
      "extraction_class": "extraction_text",
      "extraction_text": "articles about Tesla",
      "char_interval": {
        "start_pos": 4,
        "end_pos": 24
      },
      "alignment_status": "match_exact",
      "extraction_index": 2,
      "group_index": 0,
      "description": null,
      "attributes": null,
      "_token_interval": {
        "start_index": 1,
        "end_index": 4
      }
    },
    {
      "extraction_class": "attributes",
      "extraction_text": "{'query': 'articles about Tesla'}",
      "char_interval": null,
      "alignment_status": null,
      "extraction_index": 3,
      "group_index": 0,
      "description": null,
      "attributes": null,
      "_token_interval": null
    }
  ]
}
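
The output has the shape you would get if every key of the returned object were read as an extraction class and its value as the extraction text. The sketch below is a hypothetical reconstruction of that reading, not LangExtract's resolver code; the function name `keys_as_classes` is made up for illustration.

```python
def keys_as_classes(parsed):
    """Treat each key/value pair of each item as one extraction.

    This mirrors the behavior suggested by the logs; it is an assumption,
    not the library's actual implementation.
    """
    rows = []
    for group_index, item in enumerate(parsed):
        for extraction_index, (key, value) in enumerate(item.items(), start=1):
            rows.append({
                "extraction_class": key,
                "extraction_text": str(value),
                "extraction_index": extraction_index,
                "group_index": group_index,
            })
    return rows


# The "Parsed content" value from the DEBUG logs.
parsed = [{
    "extraction_class": "search_news",
    "extraction_text": "articles about Tesla",
    "attributes": {"query": "articles about Tesla"},
}]

for row in keys_as_classes(parsed):
    print(row)
```

The three rows carry the classes extraction_class, extraction_text and attributes, with the attributes dict stringified as "{'query': 'articles about Tesla'}", which matches the printed result.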

Any thoughts on what could be causing this?

derrick56007 • Aug 14 '25 20:08

Source Code

  import textwrap

  import langextract as lx

  prompt = textwrap.dedent(
      """\
      From the user's request, extract the correct tool to use and its parameters.
      - Use 'search_news' for getting relevant news given a query.
  """
  )

  examples = [
      lx.data.ExampleData(
          text="Events related to ESG issues.",
          extractions=[
              lx.data.Extraction(
                  extraction_class="search_news",
                  extraction_text="Events related to ESG issues.",
                  attributes={"query": "Events related to ESG issues."},
              )
          ],
      ),
      lx.data.ExampleData(
          text="Any news about Microsoft",
          extractions=[
              lx.data.Extraction(
                  extraction_class="search_news",
                  extraction_text="news about Microsoft",
                  attributes={"query": "news about Microsoft"},
              )
          ],
      ),
  ]

  config = lx.factory.ModelConfig(
      model_id="gemma3:27b",
      provider="CustomProvider",
  )
  model = lx.factory.create_model(
      config,
      examples=examples,
      use_schema_constraints=True,
  )

  result = lx.extract(
      text_or_documents="get articles about Tesla",
      prompt_description=prompt,
      examples=examples,
      model_id="gemma3:27b",
      model=model,
      extraction_passes=1,
      fence_output=False,
      # use_schema_constraints=True, # unused, passed from model config
      max_workers=20,
      max_char_buffer=2000,
      debug=True,
  )

derrick56007 • Aug 14 '25 20:08

POST data sent to the LLM API endpoint

{
    "service_name": "mcp",
    "model_parameters": {
        "name": "gemma3:27b",
        "internal": true,
        "temperature": 1.0,
        "top_k": 1,
        "top_p": 0.9
    },
    "json_schema": {
        "type": "object",
        "properties": {
            "extractions": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "extraction_class": {
                            "type": "string",
                            "enum": [
                                "search_news"
                            ]
                        },
                        "extraction_text": {
                            "type": "string"
                        },
                        "attributes": {
                            "type": "object",
                            "properties": {
                                "query": {
                                    "type": "string"
                                }
                            }
                        }
                    },
                    "required": [
                        "extraction_class",
                        "extraction_text"
                    ]
                }
            }
        },
        "required": [
            "extractions"
        ]
    },
    "prompt": "From the user's request, extract the correct tool to use and its parameters.\r\n- Use 'search_news' for getting relevant news given a query.\r\n\r\n\r\nExamples\r\nQ: Events related to ESG issues.\r\nA: {\r\n  \"extractions\": [\r\n    {\r\n      \"search_news\": \"Events related to ESG issues.\",\r\n      \"search_news_attributes\": {\r\n        \"query\": \"Events related to ESG issues.\"\r\n      }\r\n    }\r\n  ]\r\n}\r\n\r\nQ: Any news about Microsoft\r\nA: {\r\n  \"extractions\": [\r\n    {\r\n      \"search_news\": \"news about Microsoft\",\r\n      \"search_news_attributes\": {\r\n        \"query\": \"news about Microsoft\"\r\n      }\r\n    }\r\n  ]\r\n}\r\n\r\nQ: get articles about Tesla\r\nA: "
}
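
The json_schema in this request asks for extraction_class, extraction_text and attributes fields, while the few-shot examples in the prompt use the class name itself as the key plus a "<class>_attributes" key. A minimal sketch of reshaping a schema-style response into the prompt-style layout, for instance inside the custom provider before it returns the text. The function name `to_prompt_layout` is hypothetical, and whether this resolves the issue is an assumption.

```python
import json


def to_prompt_layout(raw: str) -> str:
    """Rewrite schema-style extractions into the layout used by the prompt examples."""
    data = json.loads(raw)
    reshaped = []
    for item in data.get("extractions", []):
        cls = item["extraction_class"]
        # The class name becomes the key; the extracted text becomes its value.
        entry = {cls: item["extraction_text"]}
        attributes = item.get("attributes")
        if attributes:
            entry[f"{cls}_attributes"] = attributes
        reshaped.append(entry)
    return json.dumps({"extractions": reshaped}, indent=2)


raw = json.dumps({
    "extractions": [{
        "extraction_class": "search_news",
        "extraction_text": "articles about Tesla",
        "attributes": {"query": "articles about Tesla"},
    }]
})
print(to_prompt_layout(raw))
```

The alternative is to leave the response alone and make the schema sent to the model match the layout of the prompt examples, so that the model produces that shape directly.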

derrick56007 • Aug 14 '25 20:08

@aksg87 Also, why is langextract now printing debug logs too? This wasn't there in the earlier version I was using. It is helpful, but was it added deliberately?
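
For anyone who wants less output in the meantime, a minimal sketch using only the standard library. The logger names "absl" and "langextract.debug" are taken from the log lines earlier in this thread; that both are reachable through the stdlib logging module is an assumption, and `quiet_loggers` is a hypothetical helper. Note that the lx.extract call in the original post also passes debug=True.

```python
import logging


def quiet_loggers(names, level=logging.WARNING):
    """Raise the threshold of the named loggers so DEBUG/INFO records are dropped."""
    for name in names:
        logging.getLogger(name).setLevel(level)


# Logger names as they appear in the log output above (assumed to be stdlib loggers).
quiet_loggers(["absl", "langextract.debug"])
```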

Kishlay-notabot • Aug 19 '25 07:08