
examples simple-rag bug: split_paragraphs isn't working correctly

Open Kael-DWT opened this issue 1 year ago • 4 comments

livekit.agents.tokenize._basic_paragraph.split_paragraphs

def split_paragraphs(text: str) -> list[tuple[str, int, int]]:
    """
    Split the text into paragraphs.
    Returns a list of paragraphs with their start and end indices of the original text.
    """
    matches = re.finditer(r"\n{2,}", text)
    paragraphs = []

    for match in matches:
        paragraph = match.group(0)
        start_pos = match.start()
        end_pos = match.end()
        paragraphs.append((paragraph.strip(), start_pos, end_pos))

    return paragraphs

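Is this regex written incorrectly? As written it matches only the runs of newlines between paragraphs, so every returned paragraph is an empty string after strip(). I think it should be like this: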

matches = re.finditer(r".+\n{2,}", text)
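For comparison, here is a minimal sketch of one way to get the paragraphs themselves, offered as an illustration rather than as the fix that was eventually merged. It keeps the original `\n{2,}` separator pattern but collects the text between separators, because the `.+\n{2,}` variant has two gaps of its own: `.` does not match a newline, so only the last line of a multi-line paragraph is captured, and a final paragraph with no trailing blank line is dropped. The end-of-text sentinel and the choice to report offsets of the stripped paragraph are assumptions of this sketch, not behavior confirmed by the library.

```python
import re


def split_paragraphs(text: str) -> list[tuple[str, int, int]]:
    """Split text on runs of two or more newlines.

    Returns (paragraph, start, end) tuples, where start and end index the
    stripped paragraph inside the original text.
    """
    paragraphs = []
    cursor = 0  # start of the chunk currently being collected

    # Separator spans, plus a sentinel at end-of-text so the last
    # paragraph is kept even without a trailing blank line.
    boundaries = [(m.start(), m.end()) for m in re.finditer(r"\n{2,}", text)]
    boundaries.append((len(text), len(text)))

    for sep_start, sep_end in boundaries:
        chunk = text[cursor:sep_start]
        stripped = chunk.strip()
        if stripped:
            # Skip leading whitespace so the offsets match the stripped text.
            start = cursor + len(chunk) - len(chunk.lstrip())
            paragraphs.append((stripped, start, start + len(stripped)))
        cursor = sep_end

    return paragraphs
```

With this version, `text[start:end]` equals the returned paragraph for every tuple, which makes the offsets easy to check in a unit test.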

Kael-DWT, Oct 09 '24 03:10

I also encountered the same problem

LIHUA919, Oct 09 '24 07:10

Kindly assign this to me, I would like to work on this.

m-tabish, Oct 11 '24 12:10

@m-tabish of course, please go ahead and submit a PR. it'd be great if you are able to add test coverage here as well, so we won't break it unintentionally again.

davidzhao, Oct 11 '24 16:10

Aren't you guys participating in Hacktoberfest? If you are, kindly add a tag here.

m-tabish, Oct 11 '24 17:10

fixed in #896

davidzhao, Oct 11 '24 19:10