Improve e-mail-address processing

Open clhunsen opened this issue 10 years ago • 1 comments

Problem

Codeface currently mishandles the following two From lines from mbox files:

From: ambrus at math.bme.hu (=?UTF-8?Q?Zsb=C3=A1n_Ambrus?=)
From: Hans Huber <huber at hubercorp.com>

The string " at " is replaced by "@" in the first case, but not in the second. In the first case, the name is also not parsed properly: it is actually Zsbán Ambrus, but the encoded string is stored as-is in the database.

mysql> select name, email1 from person;
+-------------------------------+------------------------+
| name                          | email1                 |
+-------------------------------+------------------------+
| =?UTF-8?Q?Zsb=C3=A1n_Ambrus?= | ambrus@math.bme.hu     |
| Hans Huber                    | huber at hubercorp.com |
+-------------------------------+------------------------+

Fix for case two

The following patch by @wolfgangmauerer (taken from the mailing list, tested by me) handles e-mail addresses more robustly and transforms the second case correctly.

diff --git a/codeface/R/ml/analysis.r b/codeface/R/ml/analysis.r
index 53a7335..98895c7 100644
--- a/codeface/R/ml/analysis.r
+++ b/codeface/R/ml/analysis.r
@@ -226,6 +226,11 @@ check.corpus.precon <- function(corp.base) {
     ## Trim trailing and leading whitespace
     author <- str_trim(author)

+    ## Replace textual ' at  ' with @, sometimes
+    ## we can recover an email
+    author <- sub(' at ', '@', author)
+    author <- sub(' AT ', '@', author)
+
     ## Check if email exists
     email.exists <- grepl("<.+>", author, TRUE)

@@ -234,11 +239,6 @@ check.corpus.precon <- function(corp.base) {
                    "<[email protected]>); attempting to recover from: ", author)
       logdevinfo(msg, logger="ml.analysis")

-      ## Replace textual ' at  ' with @, sometimes
-      ## we can recover an email
-      author <- sub(' at ', '@', author)
-      author <- sub(' AT ', '@', author)
-
       ## Check for @ symbol
       r <- regexpr("\\S+@\\S+", author, TRUE)
       email <- substr(author, r, r + attr(r,"match.length")-1)
@@ -258,7 +258,7 @@ check.corpus.precon <- function(corp.base) {
         ## string minus the new email part as name, and construct
         ## a valid name/email combination
         name <- sub(email, "", author, fixed=TRUE)
-        name <- str_trim(name)
+        name <- fix.name(name)
       }

       ## Name and author are now given in both cases, construct
@@ -266,13 +266,15 @@ check.corpus.precon <- function(corp.base) {
       author <- paste(name, ' <', email, '>', sep="")
     }
     else {
-      ## Verify that the order is correct
+      ## There is a correct email address. Ensure that the order is correct
+      ## and fix cases like "<[email protected]> Hans Huber"
+
       ## Get email and name parts
       r <- regexpr("<.+>", author, TRUE)
       if(r[[1]] == 1) {
         email <- substr(author, r, r + attr(r,"match.length")-1)
         name <- sub(email, "", author, fixed=TRUE)
-        name <- str_trim(name)
+        name <- fix.name(name)
         email <- str_trim(email)
         author <- paste(name,email)
       }
diff --git a/codeface/R/ml/ml_utils.r b/codeface/R/ml/ml_utils.r
index 963cd2d..f596829 100644
--- a/codeface/R/ml/ml_utils.r
+++ b/codeface/R/ml/ml_utils.r
@@ -445,3 +445,15 @@ ml.thread.loc.to.glob <- function(ml.id.map, loc.id) {

   return(global.id)
 }
+
+## Given a name with leading and pending whitespace that is possibly
+## surrounded by braces, return the name proper.
+fix.name <- function(name) {
+    name <- str_trim(name)
+    if (substr(name, 1, 1) == "(" && substr(name, str_length(name),
+                                            str_length(name)) == ")") {
+        name <- substr(name, 2, str_length(name)-1)
+    }
+
+    return (name)
+}

After applying the patch, the database contains:

mysql> select name, email1 from person;
+-------------------------------+----------------------+
| name                          | email1               |
+-------------------------------+----------------------+
| =?UTF-8?Q?Zsb=C3=A1n_Ambrus?= | ambrus@math.bme.hu   |
| Hans Huber                    | huber@hubercorp.com  |
+-------------------------------+----------------------+
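
Why moving the replacement matters can be shown with a small base-R sketch. It reuses the two regular expressions from the patch; the helper name has.bracketed.email is made up for this illustration and does not exist in Codeface.

```r
## Hypothetical helper: the same "<...>" check the patch uses to decide
## whether an author string already carries an e-mail address.
has.bracketed.email <- function(author) grepl("<.+>", author)

author <- "Hans Huber <huber at hubercorp.com>"

## Before the patch: this check succeeds, so the recovery branch (the only
## place where ' at ' used to be replaced) is never entered.
has.bracketed.email(author)

## After the patch: the replacement runs first, for every author string.
fixed <- sub(' at ', '@', author)
fixed
## "Hans Huber <huber@hubercorp.com>"
```

Note that the unconditional replacement also touches the first " at " in a name, should a name ever contain one.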

Things to do

[x] Incorporate the patch into Codeface core
[ ] Fix the encoding problem somehow
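
For the open encoding item, one possible direction is to decode RFC 2047 encoded words before the name is stored. The following is only a minimal base-R sketch under stated assumptions: it handles a single Q-encoded word (not B-encoding, not several words in a row), and decode.q.word is a hypothetical helper, not existing Codeface code.

```r
## Hypothetical helper (not part of Codeface): decode one RFC 2047
## "Q"-encoded word such as "=?UTF-8?Q?Zsb=C3=A1n_Ambrus?=".
## Strings that do not have this shape are returned unchanged.
decode.q.word <- function(word) {
  m <- regmatches(word,
                  regexec("^=\\?([^?]+)\\?[Qq]\\?([^?]*)\\?=$", word))[[1]]
  if (length(m) != 3) {
    return(word)
  }
  charset <- m[2]
  chars <- strsplit(m[3], "")[[1]]

  bytes <- raw(0)
  i <- 1
  while (i <= length(chars)) {
    hex <- ""
    if (i + 2 <= length(chars)) {
      hex <- paste(chars[i + 1], chars[i + 2], sep="")
    }
    if (chars[i] == "=" && grepl("^[0-9A-Fa-f]{2}$", hex)) {
      ## "=XX" encodes one raw byte
      bytes <- c(bytes, as.raw(strtoi(hex, 16L)))
      i <- i + 3
    } else if (chars[i] == "_") {
      ## "_" encodes a space
      bytes <- c(bytes, charToRaw(" "))
      i <- i + 1
    } else {
      bytes <- c(bytes, charToRaw(chars[i]))
      i <- i + 1
    }
  }

  ## Interpret the bytes in the declared charset and convert to UTF-8
  return(iconv(rawToChar(bytes), from=charset, to="UTF-8"))
}

decode.q.word("=?UTF-8?Q?Zsb=C3=A1n_Ambrus?=")
## "Zsbán Ambrus"
```

A real fix would also have to cope with B-encoding, unknown charsets, and names that mix plain and encoded parts.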

Nov 17 '15 13:11 clhunsen

I had another half-formed thought on this: what about making the patterns that Codeface supports explicit in the source code?

To be more specific, I thought about having one explicit pattern definition (i.e., a regex) and one transformation (i.e., the regex replacement or rewriting) for each pattern Codeface supports, so that each of them is transformed into the one form we want (Hans Huber <huber@hubercorp.com>). On top of that, we would need a fallback routine for malformed strings, to handle missing e-mail addresses, missing names, and similar cases.

This way, we only need to take care of one pattern for extraction, and the different patterns would be explicit and transparent in the source code, too.
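
To make the idea more concrete, here is a rough base-R sketch. Everything in it is an assumption for illustration: the names author.patterns and normalize.author, the three patterns, and the choice to return NA for strings that no pattern matches. It is not how Codeface currently works.

```r
## Hypothetical pattern table: one detection regex and one rewrite per
## supported input form. All rewrites produce "Name <email>".
author.patterns <- list(
  ## "Hans Huber <huber@hubercorp.com>" (already in the target form)
  list(pattern="^([^<>]*?)\\s*<([^<>\\s]+@[^<>\\s]+)>$",
       replacement="\\1 <\\2>"),
  ## "<huber@hubercorp.com> Hans Huber"
  list(pattern="^<([^<>\\s]+@[^<>\\s]+)>\\s*(.*)$",
       replacement="\\2 <\\1>"),
  ## "huber@hubercorp.com (Hans Huber)"
  list(pattern="^([^<>()\\s]+@[^<>()\\s]+)\\s*\\((.*)\\)$",
       replacement="\\2 <\\1>")
)

normalize.author <- function(author) {
  author <- trimws(author)

  ## Same recovery step as in the patch: textual ' at ' becomes '@'
  author <- sub(" at ", "@", author, fixed=TRUE)
  author <- sub(" AT ", "@", author, fixed=TRUE)

  ## Apply the first pattern that matches
  for (p in author.patterns) {
    if (grepl(p$pattern, author, perl=TRUE)) {
      return(trimws(sub(p$pattern, p$replacement, author, perl=TRUE)))
    }
  }

  ## Fallback for malformed strings (missing address, missing name, ...):
  ## this sketch only signals the problem and leaves handling to the caller
  return(NA_character_)
}

normalize.author("ambrus at math.bme.hu (=?UTF-8?Q?Zsb=C3=A1n_Ambrus?=)")
normalize.author("Hans Huber <huber at hubercorp.com>")
normalize.author("<huber@hubercorp.com> Hans Huber")
```

Supporting a new input form would then mean adding one entry to the list, and the extraction code downstream would only ever see the one target form.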

Nov 17 '15 15:11 clhunsen