Sentences within a dataset

Open gabrer opened this issue 8 years ago • 2 comments

I am working on a quite "noisy" dataset, so it is very difficult to detect sentence boundaries exactly (for example, I have a lot of abbreviations containing periods, and these periods are detected as sentence endings).

Do you think that having many short sentences (often just 3 words long) could compromise the algorithm's performance? Is it important to preserve the information about which words belong to the same sentence?

PS: Furthermore, if the punctuation is filtered out, the information about a "phrase" is completely lost, as each document becomes a bag of words. Could it also work in this case?

gabrer, May 17 '17 17:05

The sentence information is actually not used, so it should not affect performance. Do you mean that the dots are part of the abbreviations? In that case, you could modify the regular expression used to extract tokens from the text.
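
To illustrate the suggestion above, here is a minimal sketch of a token-extraction regular expression that keeps dotted abbreviations together instead of splitting on their periods. The pattern and the `tokenize` helper are hypothetical and are not taken from the topicvec code; they only assume that tokens are pulled from raw text with a single regex.

```python
import re

# Hypothetical pattern (an assumption, not topicvec's own regex).
# Alternatives are tried left to right:
#   1. dotted abbreviations such as "U.S." or "e.g."
#   2. plain words, optionally with an apostrophe ("don't")
#   3. integers and decimals ("3.5")
TOKEN_RE = re.compile(
    r"(?:[A-Za-z]\.){2,}"
    r"|[A-Za-z]+(?:'[A-Za-z]+)?"
    r"|\d+(?:\.\d+)?"
)


def tokenize(text):
    """Return lowercased tokens; sentence-ending punctuation is dropped."""
    return [m.group(0).lower() for m in TOKEN_RE.finditer(text)]


print(tokenize("Shipped to the U.S. office, i.e. the main one."))
# ['shipped', 'to', 'the', 'u.s.', 'office', 'i.e.', 'the', 'main', 'one']
```

This sketch only covers abbreviations made of single letters followed by periods; forms such as "Dr." or periods inserted by mistake would need extra rules or a cleanup pass.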

askerlee, May 18 '17 06:05

Oh, thank you for confirming this! I've already modified the regular expression, but unfortunately the stray periods come not only from abbreviations but also from "mistakes" in the text.

Thank you anyway!

gabrer, May 18 '17 11:05