langextract
Incorrect Extraction Alignment
Hello, I've been implementing a custom provider using gemma3:27b. It produces correct-looking raw responses, but I'm seeing odd extraction results from the LangExtract library (main branch).
Below is the response returned by my LLM API instance:
{
  "extractions": [
    {
      "extraction_class": "search_news",
      "extraction_text": "articles about Tesla",
      "attributes": {
        "query": "articles about Tesla"
      }
    }
  ]
}
DEBUG logs
DEBUG:absl:Processing chunk: TextChunk(
interval=[start_index: 0, end_index: 4],
Document ID: doc_8e5527ad,
Chunk Text: 'get articles about Tesla'
)
DEBUG:absl:Top inference result: {
"extractions": [
{
"extraction_class": "search_news",
"extraction_text": "articles about Tesla",
"attributes": {
"query": "articles about Tesla"
}
}
]
}
INFO:absl:Starting resolver process for input text.
DEBUG:absl:Input Text: {
"extractions": [
{
"extraction_class": "search_news",
"extraction_text": "articles about Tesla",
"attributes": {
"query": "articles about Tesla"
}
}
]
}
INFO:absl:Starting string parsing.
DEBUG:absl:input_string: {
"extractions": [
{
"extraction_class": "search_news",
"extraction_text": "articles about Tesla",
"attributes": {
"query": "articles about Tesla"
}
}
]
}
DEBUG:absl:Successfully parsed content.
INFO:absl:Completed parsing of string.
DEBUG:absl:Parsed content: [{'extraction_class': 'search_news', 'extraction_text': 'articles about Tesla', 'attributes': {'query': 'articles about Tesla'}}]
INFO:absl:Starting to extract and order extractions from data.
INFO:absl:Completed extraction and ordering of extractions.
DEBUG:absl:Completed the resolver process.
INFO:absl:Starting alignment process for provided chunk text.
DEBUG:absl:WordAligner: Starting alignment of extractions with the source text. Extraction groups to align: [[Extraction(extraction_class='extraction_class', extraction_text='search_news', char_interval=None, alignment_status=None, extraction_index=1, group_index=0, description=None, attributes=None), Extraction(extraction_class='extraction_text', extraction_text='articles about Tesla', char_interval=None, alignment_status=None, extraction_index=2, group_index=0, description=None, attributes=None), Extraction(extraction_class='attributes', extraction_text="{'query': 'articles about Tesla'}", char_interval=None, alignment_status=None, extraction_index=3, group_index=0, description=None, attributes=None)]]
2025-08-14 12:59:34,302 - langextract.debug - DEBUG - [langextract.tokenizer] CALL: tokenize(text='get articles about Tesla')
2025-08-14 12:59:34,302 - langextract.debug - DEBUG - [langextract.tokenizer] RETURN: tokenize -> TokenizedText...wline=False)]) (0.1 ms)
2025-08-14 12:59:34,303 - langextract.debug - DEBUG - [langextract.tokenizer] CALL: tokenize(text='␟')
2025-08-14 12:59:34,303 - langextract.debug - DEBUG - [langextract.tokenizer] RETURN: tokenize -> TokenizedText...wline=False)]) (0.0 ms)
DEBUG:absl:Using delimiter '␟' for extraction alignment
2025-08-14 12:59:34,303 - langextract.debug - DEBUG - [langextract.tokenizer] CALL: tokenize(text="search_news ␟ articles about Tesla ␟ {'query': 'articles about Tesla'}")
2025-08-14 12:59:34,303 - langextract.debug - DEBUG - [langextract.tokenizer] RETURN: tokenize -> TokenizedText...wline=False)]) (0.1 ms)
DEBUG:absl:Processing extraction group 0 with 3 extractions.
2025-08-14 12:59:34,303 - langextract.debug - DEBUG - [langextract.tokenizer] CALL: tokenize(text='search_news')
2025-08-14 12:59:34,303 - langextract.debug - DEBUG - [langextract.tokenizer] RETURN: tokenize -> TokenizedText...wline=False)]) (0.0 ms)
2025-08-14 12:59:34,303 - langextract.debug - DEBUG - [langextract.tokenizer] CALL: tokenize(text='articles about Tesla')
2025-08-14 12:59:34,303 - langextract.debug - DEBUG - [langextract.tokenizer] RETURN: tokenize -> TokenizedText...wline=False)]) (0.0 ms)
2025-08-14 12:59:34,303 - langextract.debug - DEBUG - [langextract.tokenizer] CALL: tokenize(text="{'query': 'articles about Tesla'}")
2025-08-14 12:59:34,303 - langextract.debug - DEBUG - [langextract.tokenizer] RETURN: tokenize -> TokenizedText...wline=False)]) (0.1 ms)
2025-08-14 12:59:34,303 - langextract.debug - DEBUG - [langextract.tokenizer] CALL: tokenize(text='get articles about Tesla')
2025-08-14 12:59:34,304 - langextract.debug - DEBUG - [langextract.tokenizer] RETURN: tokenize -> TokenizedText...wline=False)]) (0.0 ms)
2025-08-14 12:59:34,304 - langextract.debug - DEBUG - [langextract.tokenizer] CALL: tokenize(text='articles about Tesla')
2025-08-14 12:59:34,304 - langextract.debug - DEBUG - [langextract.tokenizer] RETURN: tokenize -> TokenizedText...wline=False)]) (0.0 ms)
DEBUG:absl:Starting fuzzy alignment for 2 unaligned extractions
2025-08-14 12:59:34,304 - langextract.debug - DEBUG - [langextract.tokenizer] CALL: tokenize(text='search_news')
2025-08-14 12:59:34,304 - langextract.debug - DEBUG - [langextract.tokenizer] RETURN: tokenize -> TokenizedText...wline=False)]) (0.0 ms)
DEBUG:absl:Fuzzy aligning 'search_news' (3 tokens)
2025-08-14 12:59:34,304 - langextract.debug - DEBUG - [langextract.tokenizer] CALL: tokenize(text="{'query': 'articles about Tesla'}")
2025-08-14 12:59:34,304 - langextract.debug - DEBUG - [langextract.tokenizer] RETURN: tokenize -> TokenizedText...wline=False)]) (0.1 ms)
DEBUG:absl:Fuzzy aligning "{'query': 'articles about Tesla'}" (8 tokens)
DEBUG:absl:Final aligned extraction groups: [[Extraction(extraction_class='extraction_class', extraction_text='search_news', char_interval=None, alignment_status=None, extraction_index=1, group_index=0, description=None, attributes=None), Extraction(extraction_class='extraction_text', extraction_text='articles about Tesla', char_interval=CharInterval(start_pos=4, end_pos=24), alignment_status=<AlignmentStatus.MATCH_EXACT: 'match_exact'>, extraction_index=2, group_index=0, description=None, attributes=None), Extraction(extraction_class='attributes', extraction_text="{'query': 'articles about Tesla'}", char_interval=None, alignment_status=None, extraction_index=3, group_index=0, description=None, attributes=None)]]
DEBUG:absl:Aligned extractions count: 3
DEBUG:absl:Yielding aligned extraction: Extraction(extraction_class='extraction_class', extraction_text='search_news', char_interval=None, alignment_status=None, extraction_index=1, group_index=0, description=None, attributes=None)
DEBUG:absl:Yielding aligned extraction: Extraction(extraction_class='extraction_text', extraction_text='articles about Tesla', char_interval=CharInterval(start_pos=4, end_pos=24), alignment_status=<AlignmentStatus.MATCH_EXACT: 'match_exact'>, extraction_index=2, group_index=0, description=None, attributes=None)
DEBUG:absl:Yielding aligned extraction: Extraction(extraction_class='attributes', extraction_text="{'query': 'articles about Tesla'}", char_interval=None, alignment_status=None, extraction_index=3, group_index=0, description=None, attributes=None)
INFO:absl:Completed alignment process for the provided source_text.
However, printing result.extractions produces the following:
{
  "extractions": [
    {
      "extraction_class": "extraction_class",
      "extraction_text": "search_news",
      "char_interval": null,
      "alignment_status": null,
      "extraction_index": 1,
      "group_index": 0,
      "description": null,
      "attributes": null,
      "_token_interval": null
    },
    {
      "extraction_class": "extraction_text",
      "extraction_text": "articles about Tesla",
      "char_interval": {
        "start_pos": 4,
        "end_pos": 24
      },
      "alignment_status": "match_exact",
      "extraction_index": 2,
      "group_index": 0,
      "description": null,
      "attributes": null,
      "_token_interval": {
        "start_index": 1,
        "end_index": 4
      }
    },
    {
      "extraction_class": "attributes",
      "extraction_text": "{'query': 'articles about Tesla'}",
      "char_interval": null,
      "alignment_status": null,
      "extraction_index": 3,
      "group_index": 0,
      "description": null,
      "attributes": null,
      "_token_interval": null
    }
  ]
}
Any thoughts on what could be causing this?
Source Code
import textwrap

import langextract as lx

prompt = textwrap.dedent(
    """\
    From the user's request, extract the correct tool to use and its parameters.
    - Use 'search_news' for getting relevant news given a query.
    """
)
examples = [
    lx.data.ExampleData(
        text="Events related to ESG issues.",
        extractions=[
            lx.data.Extraction(
                extraction_class="search_news",
                extraction_text="Events related to ESG issues.",
                attributes={"query": "Events related to ESG issues."},
            )
        ],
    ),
    lx.data.ExampleData(
        text="Any news about Microsoft",
        extractions=[
            lx.data.Extraction(
                extraction_class="search_news",
                extraction_text="news about Microsoft",
                attributes={"query": "news about Microsoft"},
            )
        ],
    ),
]
config = lx.factory.ModelConfig(
    model_id="gemma3:27b",
    provider="CustomProvider",
)
model = lx.factory.create_model(
    config,
    examples=examples,
    use_schema_constraints=True,
)
result = lx.extract(
    text_or_documents="get articles about Tesla",
    prompt_description=prompt,
    examples=examples,
    model_id="gemma3:27b",
    model=model,
    extraction_passes=1,
    fence_output=False,
    # use_schema_constraints=True,  # unused, passed from model config
    max_workers=20,
    max_char_buffer=2000,
    debug=True,
)
POST data sent to the LLM API endpoint
{
  "service_name": "mcp",
  "model_parameters": {
    "name": "gemma3:27b",
    "internal": true,
    "temperature": 1.0,
    "top_k": 1,
    "top_p": 0.9
  },
  "json_schema": {
    "type": "object",
    "properties": {
      "extractions": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "extraction_class": {
              "type": "string",
              "enum": [
                "search_news"
              ]
            },
            "extraction_text": {
              "type": "string"
            },
            "attributes": {
              "type": "object",
              "properties": {
                "query": {
                  "type": "string"
                }
              }
            }
          },
          "required": [
            "extraction_class",
            "extraction_text"
          ]
        }
      }
    },
    "required": [
      "extractions"
    ]
  },
  "prompt": "From the user's request, extract the correct tool to use and its parameters.\r\n- Use 'search_news' for getting relevant news given a query.\r\n\r\n\r\nExamples\r\nQ: Events related to ESG issues.\r\nA: {\r\n \"extractions\": [\r\n {\r\n \"search_news\": \"Events related to ESG issues.\",\r\n \"search_news_attributes\": {\r\n \"query\": \"Events related to ESG issues.\"\r\n }\r\n }\r\n ]\r\n}\r\n\r\nQ: Any news about Microsoft\r\nA: {\r\n \"extractions\": [\r\n {\r\n \"search_news\": \"news about Microsoft\",\r\n \"search_news_attributes\": {\r\n \"query\": \"news about Microsoft\"\r\n }\r\n }\r\n ]\r\n}\r\n\r\nQ: get articles about Tesla\r\nA: "
}
@aksg87 Also, why is langextract now printing debug logs? That wasn't happening in the earlier version I was using. It's helpful, but was it added deliberately?