Documentation on e-mail-address processing

Open clhunsen opened this issue 10 years ago • 1 comment

Referring to issue #34, the behavior and abilities of Codeface need to be documented. What kinds of From-line formats are supported when supplying mbox files to the mailing-list analysis of Codeface?

With the patch from issue #34, the following "abominations" are supported in addition to the standard format Hans Huber <[email protected]> (according to @wolfgangmauerer on the mailing list):

  • Hans Huber [email protected]
  • Hans Huber huber at hubercorp.com
  • Hans Huber ("AT" instead of "at" also works)
  • [email protected] Hans Huber
  • hans huber @ hubercorp.com Hans Huber
  • hans huber @ hubercorp.com (Hans Huber)

Furthermore, we have the via pattern (such as Hans Huber via corp-dev <[email protected]>) and likely others. Documentation on the treatment would help users (e.g., "The via pattern gets treated as follows: Remove the 'via ...' part and use the mail address as is." [I am not sure that this is actually the way it is handled, hence, this ticket...]).
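
To make the open question concrete, here is a minimal Python sketch of the treatment guessed at above: drop the "via ..." part, rewrite a spelled-out "at"/"AT" inside the angle brackets, and keep the address as is. It is not Codeface's actual implementation; the function name `normalise_from`, the three regular expressions, and the sample address `huber@hubercorp.com` are assumptions made for illustration. Only `email.utils.parseaddr` is a real standard-library call.

```python
import re
from email.utils import parseaddr

# Hypothetical patterns for illustration; not taken from Codeface.
VIA_PART = re.compile(r"\s+via\s+[^<]*(?=<)")      # " via corp-dev " before "<"
SPELLED_AT = re.compile(r"\s+(?:at|AT)\s+")        # " at " or " AT "
BRACKETED = re.compile(r"<([^<>]*)>")              # text inside <...>


def normalise_from(from_line):
    """Return (name, address) for a From line, under the assumptions above."""
    # "Hans Huber via corp-dev <addr>" -> "Hans Huber <addr>"
    cleaned = VIA_PART.sub(" ", from_line)
    # Inside the angle brackets only:
    # "huber at hubercorp.com" -> "huber@hubercorp.com"
    cleaned = BRACKETED.sub(
        lambda m: "<" + SPELLED_AT.sub("@", m.group(1)) + ">", cleaned
    )
    return parseaddr(cleaned)
```

With these assumptions, `normalise_from("Hans Huber via corp-dev <huber@hubercorp.com>")` and `normalise_from("Hans Huber <huber AT hubercorp.com>")` both yield `("Hans Huber", "huber@hubercorp.com")`. Restricting the "at" rewrite to the bracketed part is a design choice so that a name containing the word "at" is left alone.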


Things to do

  • [ ] Document the various formats (abominations or not) that are supported by Codeface.
  • [ ] Factor out the processing routines and make them independent of document processing.
  • [ ] Implement a unit test case for all possibilities
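
One possible shape for the last two items is a pure function that knows nothing about document processing, plus a table-driven test with one row per supported format. The sketch below assumes a hypothetical stand-in routine, `collapse_at_spacing`, that only removes whitespace around "@"; the routine, the test class name, and the sample address are illustrative and do not come from Codeface.

```python
import re
import unittest


def collapse_at_spacing(address):
    """Remove whitespace around '@' (hypothetical stand-in routine)."""
    return re.sub(r"\s*@\s*", "@", address.strip())


# (raw input, expected cleaned address); add one row per supported format.
CASES = [
    ("huber@hubercorp.com", "huber@hubercorp.com"),
    ("huber @ hubercorp.com", "huber@hubercorp.com"),
    ("  huber@ hubercorp.com ", "huber@hubercorp.com"),
]


class AddressCleanupTest(unittest.TestCase):
    def test_all_formats(self):
        for raw, expected in CASES:
            # subTest reports each failing row separately.
            with self.subTest(raw=raw):
                self.assertEqual(collapse_at_spacing(raw), expected)
```

Running `python -m unittest` on a module containing this class executes every row; a new format then needs only a new entry in `CASES`.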

clhunsen, Nov 17 '15 15:11

On 17/11/2015 at 17:01, Andreas Ringlstetter wrote:

Which of these edge cases are specific to transforming incompatible mbox formats, which are specific to the ML analysis, and which possibly also affect the parsing of Sign-Off patterns in the VCS analysis?

None of them is specific to anything -- it's just that the amount of creativity that goes into coming up with bogus formats for email addresses in mails considerably exceeds the amount found in tags.

As I suggested in the corresponding thread, it is surely useful to separate the cleanup operations from document processing and make the routines generically available.

There is also the |Huber, Hans| variation of names for all patterns. This is already handled in the idManager.py, but not in the ML analysis.

Thanks for catching this -- I was discussing this with Mitchell in this thread, and he's currently looking into what the most common bogus use cases for this pattern are.
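
For reference, the "Huber, Hans" variation can be handled by a small standalone helper such as the following Python sketch. It is an assumption about one reasonable treatment, not the logic in idManager.py; the name `reorder_name` is hypothetical, and names with more than one comma are not handled specially.

```python
def reorder_name(name):
    """Turn 'Last, First' into 'First Last'; leave other names unchanged."""
    last, sep, first = name.partition(",")
    # No comma, or nothing on one side of it: return the name as given.
    if not sep or not first.strip() or not last.strip():
        return name.strip()
    return "{} {}".format(first.strip(), last.strip())
```

Under these assumptions, `reorder_name("Huber, Hans")` returns `"Hans Huber"`, while `reorder_name("Hans Huber")` is returned unchanged.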


wolfgangmauerer, Nov 17 '15 18:11