node-html-to-text

Is there a way to get the tag and its content as it is?

Open gmatarbhog-cci opened this issue 4 years ago • 8 comments

I am using this package to extract the text from HTML email content, and I want to retain anchor tags as they are. For example, a link whose text is "example" should not be converted to plain text; the whole tag, along with its content, should be retained.

Is there a way to achieve the above? I have gone through the different options but could not find anything for my need.

gmatarbhog-cci avatar Jul 14 '21 04:07 gmatarbhog-cci

The input HTML is parsed into a DOM model by htmlparser2 with the help of domhandler. You can explore it there.

html-to-text passes DOM elements into appropriate formatters. You'll have to provide a custom formatter that will reconstruct an HTML tag from a DOM element. Look into how formatters work. Start from this section in the Readme.

(Caveat: it may not work nicely with word wrap. I don't have a solution to mask some output text as taking zero space at this moment. So it's only safe and won't wrap your links if you also disable word wrap.)
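A minimal sketch of such a formatter, assuming the `(elem, walk, builder, formatOptions)` signature used later in this thread and the `name`, `attribs` and `children` properties of a domhandler element. The names `keepTagFormatter`, `keepTag`, `escapeAttr` and `keepTagOptions` are made up for illustration, and word wrap is disabled because of the caveat above:

```javascript
// Escape attribute values so the rebuilt tag stays well-formed.
// (Hypothetical helper, not part of html-to-text.)
function escapeAttr(value) {
  return String(value).replace(/&/g, '&amp;').replace(/"/g, '&quot;');
}

// Re-emit the element as an HTML tag; its contents are still converted to text.
function keepTagFormatter(elem, walk, builder, formatOptions) {
  const attrs = Object.entries(elem.attribs || {})
    .map(([key, value]) => ` ${key}="${escapeAttr(value)}"`)
    .join('');
  builder.addInline(`<${elem.name}${attrs}>`);
  walk(elem.children, builder);
  builder.addInline(`</${elem.name}>`);
}

const keepTagOptions = {
  wordwrap: false, // avoid line breaks inside the rebuilt tag
  formatters: { keepTag: keepTagFormatter },
  selectors: [{ selector: 'a', format: 'keepTag' }],
};

// With the package installed:
// const { convert } = require('html-to-text');
// const text = convert(html, keepTagOptions);
```

Reading the tag name from `elem.name` means the same formatter can be assigned to several selectors without passing the name in through options.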

P.S. This gives me an idea. The task of reconstructing HTML tags seems generic enough and it must be possible to ship it as a standard formatter along with skip. At some point. It might create too much confusion with broken links for unaware users before I solve the masking issue.

KillyMXI avatar Jul 14 '21 10:07 KillyMXI

@KillyMXI, somewhat related to this question - is it possible to get the value of a particular attribute? In my use case (which is nearly perfect out of the box!) I have the following selectors:

baseElements: {
  selectors: [
    '[data-islntxt]'
  ]
}

And this gives me all the content I need. Each line is its own div and everything works out perfectly, with one caveat: there is a value in each div in the data-ln-ref attribute that I want to also print. If I were formatting it, I'd want to print it like this:

`${data-ln-ref value} ${div text}`

I'm looking through the built-in formatters but I'm sure someone with more experience in this library than me would be able to tell me pretty easily if this is something I can do, or if it would be far too complicated.

I would greatly appreciate it if you have the time to look, thank you!

pcopley avatar Oct 05 '21 01:10 pcopley

In case anyone comes across this in the future, it only took 15 minutes of hacking around to come up with a solution. Here is how I am doing it, it seems pretty concise but there may be a way to repurpose an existing formatter to remove a couple lines:

(I retyped this to remove client-specific and unrelated things so there might be a typo or two)

const { convert } = require('html-to-text');

const parsed = convert(html, {
  baseElements: {
    selectors: [
      '[data-islntxt]'
    ]
  },
  formatters: {
    'lineNumberFormatter': function (elem, walk, builder, formatOptions) {
      builder.openBlock({ leadingLineBreaks: formatOptions.leadingLineBreaks });
      builder.addInline(`${elem.attribs['data-ln-ref']} `); // This is the only line I added to the default formatBlock formatter
      walk(elem.children, builder);
      builder.closeBlock({ trailingLineBreaks: formatOptions.trailingLineBreaks });
    }
  },
  selectors: [
    {
      selector: '[data-islntxt]',
      format: 'lineNumberFormatter'
    }
  ],
  preserveNewlines: true
});

I must say I'm extremely impressed with how the code is laid out and how easy it is to see what's going on under the hood.

pcopley avatar Oct 05 '21 01:10 pcopley

@pcopley Glad you figured it out. This is indeed the most concise solution.

I see a connection to ordered list formatting. Maybe I can move a part of that into builder to offer an easier custom numbered lists construction with proper column alignment and word wrapping (Not more concise though, depending on how you count it). Currently it is all in the lists formatter and not reusable without copying it all together.

The question is sufficiently different to the OP question though. Separate issue with a reference would've served better. I wrote my follow-up thoughts in #238 so I don't lose them.

KillyMXI avatar Oct 05 '21 19:10 KillyMXI

For anyone like me who needed to be able to do this, there is an example here ...

var { htmlToText } = require('html-to-text');

var text = htmlToText('<i>Hello <span>World</span></i>', {
  formatters: {
    // Create a formatter.
    'outputHtmlTag': function (elem, walk, builder, formatOptions) {
      builder.addInline('<' + formatOptions.tagName + '>');
      walk(elem.children, builder);
      builder.addInline('</' + formatOptions.tagName + '>');
    }
  },
  tags: {
    // Assign it to tags.
    'i': {
      format: 'outputHtmlTag',
      options: { tagName: 'i' }
    },
    'strong': {
      format: 'outputHtmlTag',
      options: { tagName: 'strong' }
    }
  }
});

console.log(text); // <i>Hello World</i>

justinjenkins avatar Dec 22 '21 00:12 justinjenkins

The tags option is deprecated, btw. Use selectors instead:

...
  selectors: [
    { selector: 'i', format: 'outputHtmlTag', options: { tagName: 'i' } },
    ...
  ]
...
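If several tags share one formatter, the selectors array can be generated rather than written out by hand. A small sketch, reusing the `outputHtmlTag` formatter from the earlier comment; the `keptTags` list and the `migratedOptions` name are only examples:

```javascript
// Tags to keep as markup (example list).
const keptTags = ['i', 'strong'];

// One selector entry per tag, in the shape used by the selectors option.
const selectors = keptTags.map((tagName) => ({
  selector: tagName,
  format: 'outputHtmlTag',
  options: { tagName },
}));

const migratedOptions = {
  formatters: {
    // Same formatter as in the earlier comment.
    outputHtmlTag(elem, walk, builder, formatOptions) {
      builder.addInline('<' + formatOptions.tagName + '>');
      walk(elem.children, builder);
      builder.addInline('</' + formatOptions.tagName + '>');
    },
  },
  selectors,
};

// With the package installed:
// const { convert } = require('html-to-text');
// const text = convert(html, migratedOptions);
```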

Version 9 will offer more built-in options when it's done.

KillyMXI avatar Dec 24 '21 15:12 KillyMXI

So, TL;DR: the "TypeScript"-friendly way to do this is to use the itemPrefix option as the "container" for the tag 🥳

import { compile } from 'html-to-text';

const convert = compile({
    wordwrap: null,
    formatters: {
      // Create a formatter.
      outputHTML: function (elem, walk, builder, formatOptions) {
        builder.addInline('<' + formatOptions.itemPrefix + '>')
        walk(elem.children, builder)
        builder.addInline('</' + formatOptions.itemPrefix + '>')
      },
    },
    selectors: [
      { selector: 'i', format: 'outputHTML', options: { itemPrefix: 'i' } },
      { selector: 'p', format: 'outputHTML', options: { itemPrefix: 'p' } },
      { selector: 'h2', format: 'outputHTML', options: { itemPrefix: 'h2' } },
      { selector: 'h3', format: 'outputHTML', options: { itemPrefix: 'h3' } },
      { selector: 'h4', format: 'outputHTML', options: { itemPrefix: 'h4' } },
      { selector: 'h5', format: 'outputHTML', options: { itemPrefix: 'h5' } },
      { selector: 'h6', format: 'outputHTML', options: { itemPrefix: 'h6' } },
    ],
  })

huntedman avatar Jul 05 '22 16:07 huntedman

@huntedman Hmm. I wouldn't call it "typescript friendly". More like a hack around @types/html-to-text. You are just hijacking a name used for a different purpose and adding a code smell.

DefinitelyTyped's definition for selector options might be somewhat flawed. Selector options objects are intended to be extendable by client code. The type definition should probably include [key:string]: any. Refer to #223 for a related issue.

If possible, a prettier solution would be to extend the interface definition locally. Another solution would be to open an issue in the DefinitelyTyped repo asking for the FormatOptions interface to be opened for extension.

KillyMXI avatar Jul 05 '22 17:07 KillyMXI

Version 9 is now released. It comes with a new set of predefined formatters (see the second table).

  • blockHtml and inlineHtml would output outer HTML of a given node (no recursive walking in html-to-text, fully handled by htmlparser2);
  • blockTag and inlineTag would output the given node as an HTML tag but convert its contents, similar to the examples provided above.
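A rough sketch of how these predefined formatters might be assigned through selectors in version 9. The formatter names are the ones listed above; the choice of tags and the `v9Options` name are only illustrative, and word wrap is disabled to keep the emitted markup on one line:

```javascript
// Options sketch for html-to-text 9 using the predefined formatters.
const v9Options = {
  wordwrap: false,
  selectors: [
    // Output the outer HTML of links as-is.
    { selector: 'a', format: 'inlineHtml' },
    // Keep the <i> tag itself, but still convert what is inside it.
    { selector: 'i', format: 'inlineTag' },
    // Block-level counterpart for paragraphs.
    { selector: 'p', format: 'blockTag' },
  ],
};

// With html-to-text 9 installed:
// const { convert } = require('html-to-text');
// const text = convert(html, v9Options);
```

No custom formatter function is needed in this case, which also sidesteps the question of passing a tag name through selector options.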

KillyMXI avatar Dec 03 '22 12:12 KillyMXI