
HTML body not extracted

Open ghost opened this issue 8 years ago • 6 comments

Bug description: when importing .msg files, the HTML body is not extracted.

Minimal replicable example

The attached email has an HTML body encoded in CP-1252 (Western European): ex_no_html_body.zip

In R, I did the following:

msg <- msgxtractr::read_msg("ex_no_html_body.msg")
str(msg)
List of 8
 $ headers         : NULL
 $ sender          : list()
 $ recipients      :List of 1
  ..$ :List of 3
  .. ..$ display_name : NULL
  .. ..$ address_type : NULL
  .. ..$ email_address: NULL
 $ subject         : NULL
 $ body            :List of 2
  ..$ text: chr " \r\nTest table\r\n \r\n1\r\n2\r\n3\r\n4\r\n5\r\n6\r\n7\r\n8\r\n9\r\n \r\n"
  ..$ html: NULL
 $ attachments     : list()
 $ display_envelope:List of 1
  ..$ display_to: chr "[email protected]"
 $ times           :List of 3
  ..$ creation_time: NULL
  ..$ last_mod_time: NULL
 - attr(*, "class")= chr "msg"

The HTML element is empty. I checked that my message was not in RTF.

Any ideas?

By the way: thank you for this package!

ghost avatar Jan 03 '18 10:01 ghost

ZOMGOSH thank you for adding a .msg file! I have no Outlook instances anywhere and we use gmail at work so I've been devoid of samples!

I shall take a look later today.

While bugs are never awesome, this bug report made my day!

hrbrmstr avatar Jan 03 '18 15:01 hrbrmstr

Thank you! If you need more samples (e.g. different encodings or formatting, or added RTF content), no problem. It's the first time I've made someone's day with a bug report :-) A pleasure. I'm very impressed that you made this package without an Outlook instance. If you have any clue where the bug might come from, I could try to handle it.

ghost avatar Jan 03 '18 16:01 ghost

Progress! (sort of)

Your content is very very very likely in the mess of hex digits down 👇 somewhere :-)

I can decode some binary streams but need to go on the hunt for the 10090102, 80090102 and 800A0102 types (one is a compressed RTF stream and I'm not sure what the others are yet).
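For anyone following along, here is a minimal sketch of how those eight hex digits can be read. It assumes the usual .msg convention that a top-level `__substg1.0_XXXXYYYY` stream name carries a property ID (`XXXX`) followed by a property type (`YYYY`). The helper names `parse_substg` and `describe_substg` and the small lookup tables are hypothetical and not part of msgxtractr. Streams under `/__nameid_version1.0/` hold the named-property mapping instead, so the property-ID reading only applies to the message-level streams.

```r
# Split a stream name such as "/__substg1.0_10090102" into its
# property ID (first four hex digits) and property type (last four).
parse_substg <- function(stream_name) {
  tag <- sub("^.*__substg1\\.0_", "", stream_name)  # keep the trailing hex tag
  stopifnot(grepl("^[0-9A-Fa-f]{8}$", tag))
  list(
    prop_id   = toupper(substr(tag, 1, 4)),
    prop_type = toupper(substr(tag, 5, 8))
  )
}

# A few well-known values, listed for illustration only (not exhaustive).
known_ids   <- c("1000" = "PidTagBody",
                 "1009" = "PidTagRtfCompressed",
                 "1013" = "PidTagHtml")
known_types <- c("0102" = "PtypBinary",
                 "001E" = "PtypString8",
                 "001F" = "PtypString")

describe_substg <- function(stream_name) {
  p <- parse_substg(stream_name)
  id_name <- unname(known_ids[p$prop_id])
  if (is.na(id_name)) {
    # IDs from 0x8000 upwards are named properties, resolved through
    # the __nameid_version1.0 storage rather than a fixed table.
    id_name <- if (strtoi(p$prop_id, 16L) >= 0x8000) "named property" else "unknown"
  }
  type_name <- unname(known_types[p$prop_type])
  if (is.na(type_name)) type_name <- "unknown"
  sprintf("0x%s (%s), type 0x%s (%s)", p$prop_id, id_name, p$prop_type, type_name)
}

describe_substg("/__substg1.0_10090102")
describe_substg("/__substg1.0_80090102")
```

Read this way, 10090102 would be the compressed RTF body stored as binary, while 80090102 and 800A0102 fall in the named-property range, which may explain why they are harder to identify from the ID alone.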

I'll see what I can cook up in the next cpl days.

Thx again for this.

When I've got this figured out is it OK if I add it to the included example files?

$`/__nameid_version1.0/__substg1.0_10010102`
 [1] 10 85 00 00 06 00 00 00 52 85 00 00 06 00 01 00

$`/__nameid_version1.0/__substg1.0_00030102`
 [1] 10 85 00 00 06 00 00 00 52 85 00 00 06 00 01 00 03 85 00 00 06 00 02 00 01 85 00 00 06 00 03 00 06 85 00
[36] 00 06 00 04 00 54 85 00 00 06 00 05 00 0e 85 00 00 06 00 06 00 18 85 00 00 06 00 07 00 bf 85 00 00 06 00
[71] 08 00 c2 85 00 00 06 00 09 00 c3 85 00 00 06 00 0a 00

$`/__nameid_version1.0/__substg1.0_00020102`
 [1] 08 20 06 00 00 00 00 00 c0 00 00 00 00 00 00 46

$`/__nameid_version1.0/__substg1.0_10110102`
[1] 01 85 00 00 06 00 03 00

$`/__nameid_version1.0/__substg1.0_100A0102`
[1] 06 85 00 00 06 00 04 00

$`/__nameid_version1.0/__substg1.0_10090102`
 [1] 18 85 00 00 06 00 07 00 bf 85 00 00 06 00 08 00

$`/__nameid_version1.0/__substg1.0_100F0102`
[1] 03 85 00 00 06 00 02 00

$`/__nameid_version1.0/__substg1.0_10140102`
[1] c2 85 00 00 06 00 09 00

$`/__nameid_version1.0/__substg1.0_10120102`
[1] 0e 85 00 00 06 00 06 00

$`/__nameid_version1.0/__substg1.0_10150102`
[1] c3 85 00 00 06 00 0a 00

$`/__nameid_version1.0/__substg1.0_101E0102`
[1] 54 85 00 00 06 00 05 00

$`/__substg1.0_300B0102`
 [1] 44 a1 d6 a4 20 d4 d4 42 a9 26 b4 ff ea ad 47 63

$`/__substg1.0_10090102`
   [1] 16 19 00 00 08 74 00 00 4c 5a 46 75 da cb 54 ef 03 00 0a 00 72 63 70 67 31 32 35 82 32 03 43 68 74 6d
  [35] 6c 31 03 31 f8 62 69 64 04 00 03 30 01 03 01 f7 0a 80 27 02 a4 03 e3 02 00 63 68 0a c0 73 65 f8 74 30
  [69] 20 07 13 02 80 10 83 00 50 04 56 bf 08 55 07 b2 12 55 0e 51 03 01 11 57 32 06 00 fb 06 c3 12 55 33 04
 [103] 46 11 59 13 6b 12 63 08 ef 6d 09 f7 3b 19 4f 0e 30 35 12 52 0c 60 63 67 00 50 0b 09 01 64 33 36 11 e0
 [137] 0b a5 34 72 20 10 82 2a 5c 0e b2 01 90 0e 10 39 64 20 3c 0e b2 20 78 0e d0 00 80 3a 48 76 3d 22 08 70
 [171] 6e 3a 04 f0 68 62 65 00 c0 73 2d 6d 0d e0 03 60 73 6a 6f 01 80 2d 05 a0 6d 1f d0 0e d0 22 5d 1f 75 6f
 [205] 1f ff 21 09 21 30 66 0d e0 65 7b 24 25 21 e6 77 22 7f 23 8f 25 70 05 b0 64 0d 21 e6 6d 25 90 0e b0 74
 [239] 70 3a 2f aa 2f 25 f5 2e 20 d7 2e 21 71 2f 24 34 62 2f 01 d0 30 34 2f 0e 20 2a 70 6d 0b 21 c7 28 97 77
 [273] 2c b0 2e 77 33 2e 01 05 b0 67 2f 54 52 2f 52 45 84 43 2d 0e b2 34 30 22 3e 12 63 8d 1e 67 33 1e 00 1f
 [307] 20 65 61 64 2e 4d 34 31 36 0e f0 3c 07 80 01 90 20 6e 12 61 07 80 3d 50 03 60 67 49 64 8e 20 05 a0 02
 [341] 30 09 f0 74 3d 57 27 d1 e0 2e 44 6f 63 75 07 80 02 30 2f bf cd 30 cb 47 09 f0 04 90 61 74 05 b1 32 06
 [375] 14 22 4d 20 e6 20 32 82 20 31 34 eb 2e 3e 30 ad 4f 05 10 67 0b 80 35 8f 36 9f 49 37 af 3c 6c 0b 80 6b
 [409] 20 19 50 6c c4 3d 46 03 10 65 2d 4c 04 00 05 40 aa 68 19 50 66 25 90 63 0f 40 3a 24 50 17 3d b0 3d 00
 [443] 3e 00 2e 1f 81 40 30 31 00 44 33 38 34 38 32 2e 38 e1 01 c0 35 37 34 33 2e 2f 1d f0 30 c1 90 21 2d 2d
 [477] 5b 06 90 20 67 32 30 04 20 6d 21 20 20 39 5d 3e 3c 85 1f 81 3e 0a a3 3c 6f 3a 4f 24 43 4f 32 d6 06 60
 [511] 02 40 0b 80 67 73 43 67 41 82 6c 18 e0 77 50 4e 47 2f 43 65 b7 2a 70 43 ef 44 fb 2f 43 32 41 f0 5b 09
 [545] f0 b9 0f 50 66 5d 42 10 3b cf 3c da 74 26 11 4f 47 60 39 c0 31 20 3e 34 7e 7e 4b b3 64 5f 4c 11 4c b0
 [579] 3b bf 3c cb 18 c3 53 26 02 65 f0 4d 61 70 70 45 11 4c 47 18 c3 25 f3 1f 26 21 50 33 4d 6f 41 af 42 bf
 [613] 3c 77 3a 67 32 82 32 d7 55 16 53 70 3d 60 3d 01 67 e6 53 01 90 32 30 3e 43 3d b0 00 70 46 d0 0b 56 ce
 [647] 55 16 54 35 70 63 6b 4d 6f ee 76 07 90 46 66 59 95 46 05 b0 00 c0 44 f3 b1 5a 58 48 79 70 26 10 39 b1
 [681] 69 02 20 e6 5a 02 20 57 a0 32 31 58 12 5c 8e 55 16 dc 45 6e 5a 20 18 e0 56 f0 56 04 00 5a 58 40 50 75
 [715] 6e 63 74 75 5c e3 4b 3b 04 91 5b bb 56 07 40 0f 40 57 81 41 67 7f 0b 71 3e 00 4f b3 26 40 5a 58 06 10
 [749] 5a 20 49 60 66 58 4d 4c 49 5f 70 62 82 3e 3e 66 07 40 11 b0 58 13 64 8e 55 16 49 67 e6 6e 05 b0 50 00
 [783] 69 78 09 80 08 50 32 23 0f 65 78 67 af 56 48 45 e0 77 61 79 73 bc 53 68 46 11 0b 60 24 70 6b 80 6c 04
 [817] 81 f8 54 65 78 68 b9 6b 1f 6c 27 55 16 32 d0 38 4e 6f 74 31 91 04 60 32 30 51 46 53 5a 58 3d e0 64 54
 [851] 4b c2 4f 4b b1 72 b8 3e 46 52 58 12 70 ec 70 6e 41 00 90 01 00 70 3e 58 2d 4e 4f 4e 45 ef 71 ea 74 04
 [885] 70 6e 08 50 6d 0b 50 6c 70 4f b0 ff 05 12 74 5f 76 ce 55 16 76 e2 5c e1 0f 30 3d 00 14 74 79 55 07 42
 [919] 19 50 61 6b 57 f7 35 70 50 30 09 80 54 01 a0 3d b0 63 ca 31 40 50 70 54 6f 47 05 10 64 65 00 43 af 57
 [953] 01 5a 58 7b f2 6c 62 57 7a e0 68 60 b3 9d 5a 58 55 11 b0 74 03 7b a3 52 75 7c 9b cf 32 d0 02 30 7d c0
 [987] 46 10 41 75 39 d0 24 50 a7 80 09 56 e0 7a
 [ reached getOption("max.print") -- omitted 5426 entries ]

$`/__substg1.0_65E20102`
 [1] e4 c6 f1 28 51 cb b9 43 ba 4a 7c 6f 4b 1d 75 43 00 00 40 02

$`/__substg1.0_65E30102`
 [1] 14 e4 c6 f1 28 51 cb b9 43 ba 4a 7c 6f 4b 1d 75 43 00 00 40 02

$`/__substg1.0_800A0102`
  [1] 3c 3f 78 6d 6c 20 76 65 72 73 69 6f 6e 3d 22 31 2e 30 22 20 65 6e 63 6f 64 69 6e 67 3d 22 55 54 46 2d 38
 [36] 22 20 73 74 61 6e 64 61 6c 6f 6e 65 3d 22 79 65 73 22 3f 3e 0d 0a 3c 61 3a 63 6c 72 4d 61 70 20 78 6d 6c
 [71] 6e 73 3a 61 3d 22 68 74 74 70 3a 2f 2f 73 63 68 65 6d 61 73 2e 6f 70 65 6e 78 6d 6c 66 6f 72 6d 61 74 73
[106] 2e 6f 72 67 2f 64 72 61 77 69 6e 67 6d 6c 2f 32 30 30 36 2f 6d 61 69 6e 22 20 62 67 31 3d 22 6c 74 31 22
[141] 20 74 78 31 3d 22 64 6b 31 22 20 62 67 32 3d 22 6c 74 32 22 20 74 78 32 3d 22 64 6b 32 22 20 61 63 63 65
[176] 6e 74 31 3d 22 61 63 63 65 6e 74 31 22 20 61 63 63 65 6e 74 32 3d 22 61 63 63 65 6e 74 32 22 20 61 63 63
[211] 65 6e 74 33 3d 22 61 63 63 65 6e 74 33 22 20 61 63 63 65 6e 74 34 3d 22 61 63 63 65 6e 74 34 22 20 61 63
[246] 63 65 6e 74 35 3d 22 61 63 63 65 6e 74 35 22 20 61 63 63 65 6e 74 36 3d 22 61 63 63 65 6e 74 36 22 20 68
[281] 6c 69 6e 6b 3d 22 68 6c 69 6e 6b 22 20 66 6f 6c 48 6c 69 6e 6b 3d 22 66 6f 6c 48 6c 69 6e 6b 22 2f 3e

$`/__substg1.0_80090102`
   [1] 50 4b 03 04 14 00 06 00 08 00 00 00 21 00 e9 de 0f bf ff 00 00 00 1c 02 00 00 13 00 00 00 5b 43 6f 6e
  [35] 74 65 6e 74 5f 54 79 70 65 73 5d 2e 78 6d 6c ac 91 cb 4e c3 30 10 45 f7 48 fc 83 e5 2d 4a 9c b2 40 08
  [69] 25 e9 82 c7 8e c7 a2 7c c0 c8 99 24 16 c9 d8 b2 a7 55 fb f7 4c d2 54 42 a8 20 16 6c 2c d9 33 f7 9e 3b
 [103] e3 72 bd 1f 07 b5 c3 98 9c a7 4a af f2 42 2b 24 eb 1b 47 5d a5 df 37 4f d9 ad 56 89 81 1a 18 3c 61 a5
 [137] 0f 98 f4 ba be bc 28 37 87 80 49 89 9a 52 a5 7b e6 70 67 4c b2 3d 8e 90 72 1f 90 a4 d2 fa 38 02 cb 35
 [171] 76 26 80 fd 80 0e cd 75 51 dc 18 eb 89 91 38 e3 c9 43 d7 e5 03 b6 b0 1d 58 3d ee e5 f9 98 24 e2 90 b4
 [205] ba 3f 36 4e ac 4a 43 08 83 b3 c0 92 d4 ec a8 f9 46 c9 16 42 2e ca b9 27 f5 2e a4 2b 89 a1 cd 59 c2 54
 [239] f9 19 b0 e8 5e 65 35 d1 35 a8 de 20 f2 0b 8c 12 c3 b0 0c 89 5f cf 67 20 19 2d e6 bf 3b 9e 89 ec db d6
 [273] 59 6c bc dd 8e b2 8e 7c 36 5e cc 4e c1 ff 14 60 f5 3f e8 13 d3 cc 7f 5b 7f 02 00 00 ff ff 03 00 50 4b
 [307] 03 04 14 00 06 00 08 00 00 00 21 00 a5 d6 a7 e7 c0 00 00 00 36 01 00 00 0b 00 00 00 5f 72 65 6c 73 2f
 [341] 2e 72 65 6c 73 84 8f cf 6a c3 30 0c 87 ef 85 bd 83 d1 7d 51 d2 c3 18 25 76 2f a5 90 43 2f a3 7d 00 e1
 [375] 28 7f 68 22 1b db 1b eb db 4f c7 06 0a bb 08 84 a4 ef f7 a9 3d fe ae 8b f9 e1 94 e7 20 16 9a aa 06 c3
 [409] e2 43 3f cb 68 e1 76 3d bf 7f 82 c9 85 a4 a7 25 08 5b 78 70 86 a3 7b db b5 5f bc 50 d1 a3 3c cd 31 1b
 [443] a5 48 b6 30 95 12 0f 88 d9 4f bc 52 ae 42 64 d1 c9 10 d2 4a 45 db 34 62 24 7f a7 91 71 5f d7 1f 98 9e
 [477] 19 e0 36 4c d3 f5 16 52 d7 37 60 ae 8f a8 c9 ff b3 c3 30 cc 9e 4f c1 7f af 2c e5 45 04 6e 37 94 4c 69
 [511] e4 62 a1 a8 2f e3 53 bd 90 a8 65 aa d4 1e d0 b5 b8 f9 d6 fd 01 00 00 ff ff 03 00 50 4b 03 04 14 00 06
 [545] 00 08 00 00 00 21 00 6b 79 96 16 83 00 00 00 8a 00 00 00 1c 00 00 00 74 68 65 6d 65 2f 74 68 65 6d 65
 [579] 2f 74 68 65 6d 65 4d 61 6e 61 67 65 72 2e 78 6d 6c 0c cc 4d 0a c3 20 10 40 e1 7d a1 77 90 d9 37 63 bb
 [613] 28 45 62 b2 cb ae bb f6 00 43 9c 1a 41 c7 a0 d2 9f db d7 e5 e3 83 37 ce df 14 d5 9b 4b 0d 59 2c 9c 07
 [647] 0d 8a 65 cd 2e 88 b7 f0 7c 2c a7 1b a8 da 48 1c c5 2c 6c e1 c7 15 e6 e9 78 18 c9 b4 8d 13 df 49 c8 73
 [681] 51 7d 23 d5 90 85 ad b5 dd 20 d6 b5 2b d5 21 ef 2c dd 5e b9 24 6a 3d 8b 47 57 e8 d3 f7 29 e2 45 eb 2b
 [715] 26 0a 02 38 fd 01 00 00 ff ff 03 00 50 4b 03 04 14 00 06 00 08 00 00 00 21 00 b1 89 54 f3 ac 06 00 00
 [749] a5 1b 00 00 16 00 00 00 74 68 65 6d 65 2f 74 68 65 6d 65 2f 74 68 65 6d 65 31 2e 78 6d 6c ec 59 4f 6f
 [783] 1b 45 14 bf 23 f1 1d 46 7b 6f 63 27 76 1a 47 75 aa d8 b1 1b 68 d3 46 b1 5b d4 e3 78 77 bc 3b cd ec ce
 [817] 6a 66 9c d4 37 d4 1e 91 90 10 05 71 a0 12 37 90 10 50 a9 95 b8 94 13 1f 25 50 04 45 ea 57 e0 cd cc ee
 [851] 7a 27 5e 93 a4 8d a0 82 e6 d0 da b3 bf 79 ff df 6f de ac 2f 5f b9 17 33 74 40 84 a4 3c 69 7b f5 8b 35
 [885] 0f 91 c4 e7 01 4d c2 b6 77 6b d8 bf b0 e6 21 a9 70 12 60 c6 13 d2 f6 a6 44 7a 57 36 de 7d e7 32 5e 57
 [919] 11 89 09 82 fd 89 5c c7 6d 2f 52 2a 5d 5f 5a 92 3e 2c 63 79 91 a7 24 81 67 63 2e 62 ac e0 ab 08 97 02
 [953] 81 0f 41 6e cc 96 96 6b b5 d5 a5 18 d3 c4 43 09 8e 41 ec 30 fa f9 1b 10 76 73 3c a6 3e f1 36 72 e9 3d
 [987] 06 2a 12 25 f5 82 cf c4 40 cb 26 d9 96 12
 [ reached getOption("max.print") -- omitted 2108 entries ]

$`/__recip_version1.0_#00000000/__substg1.0_0FF60102`
[1] 00 01 bf dc

$`/__recip_version1.0_#00000000/__substg1.0_3D010102`
 [1] 30 49 c5 36 8a fb 59 4e 95 0c 77 6e d7 93 41 ed
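
The leading bytes of the three mystery streams in the dump above already hint at what they contain. Below is a rough sketch, not the package's actual code, that sniffs a raw vector by its magic bytes and reads the 16-byte header of a compressed RTF stream. It assumes the header layout described in the compressed RTF specification (MS-OXRTFCP): four little-endian 32-bit fields, namely compressed size, raw size, compression type, and CRC. The function names `hex_to_raw`, `sniff_stream` and `read_rtf_header` are made up for illustration.

```r
# Turn two-digit hex strings (as printed in the dump) into raw bytes.
hex_to_raw <- function(hex) as.raw(strtoi(hex, base = 16L))

# Guess the payload kind from well-known signatures.
sniff_stream <- function(bytes) {
  starts_with <- function(sig, offset = 0L) {
    n <- length(sig)
    length(bytes) >= offset + n &&
      identical(bytes[offset + seq_len(n)], sig)
  }
  if (starts_with(charToRaw("PK\003\004"))) return("zip")
  if (starts_with(charToRaw("<?xml")))       return("xml")
  # In compressed RTF the "LZFu" marker sits after two 4-byte size fields.
  if (starts_with(charToRaw("LZFu"), 8L))    return("compressed-rtf")
  "unknown"
}

# Read the 16-byte compressed RTF header (all fields little-endian).
read_rtf_header <- function(bytes) {
  stopifnot(length(bytes) >= 16)
  u32 <- function(i) sum(as.integer(bytes[i:(i + 3)]) * 256^(0:3))
  list(
    comp_size = u32(1),                    # stream length minus these 4 bytes
    raw_size  = u32(5),                    # size after decompression
    comp_type = rawToChar(bytes[9:12]),    # "LZFu" means compressed
    crc_hex   = paste(rev(as.character(bytes[13:16])), collapse = "")
  )
}

# First 16 bytes of /__substg1.0_10090102, copied from the dump.
rtf_head <- hex_to_raw(c("16", "19", "00", "00", "08", "74", "00", "00",
                         "4c", "5a", "46", "75", "da", "cb", "54", "ef"))
sniff_stream(rtf_head)
str(read_rtf_header(rtf_head))

# First bytes of the 80090102 and 800A0102 streams.
sniff_stream(hex_to_raw(c("50", "4b", "03", "04", "14", "00")))
sniff_stream(charToRaw("<?xml version=\"1.0\""))
```

With these bytes the header gives a compressed size of 6422, which matches the 6426 bytes in the dump (1000 printed plus 5426 omitted) minus the 4-byte size field itself, and a raw size of 29704. The 80090102 stream starts with a ZIP signature and the 800A0102 stream is plain XML, so neither looks like the HTML body; the compressed RTF stream is the likelier place to find it.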

hrbrmstr avatar Jan 03 '18 22:01 hrbrmstr

Good news! I believe I now know the format so I shld be able to get it coded up fairly soon.

hrbrmstr avatar Jan 04 '18 02:01 hrbrmstr

When I've got this figured out is it OK if I add it to the included example files?

Sure!

I tried yesterday to understand how .msg files work. It seems to be quite a mess without much documentation (I didn't find any official docs, but maybe my search terms weren't accurate). Thank you for your quick answer to this issue.

ghost avatar Jan 04 '18 09:01 ghost

Hi,

Did you manage to get the HTML body working for the read_msg function?

Regards, GH

chuagh74 avatar Feb 11 '22 06:02 chuagh74