Is there a way to get the tag and its content as it is?
I am using this package to extract text from HTML email content, and I want to retain anchor tags as they are. For example, `<a href="...">example</a>` should not be converted to text; the whole tag, along with its content, should be retained.
Is there a way to achieve the above? I have gone through the different options but could not find anything for my need.
Input HTML is parsed into a DOM model by htmlparser2 with the help of domhandler. You can explore the element structure there.
html-to-text passes DOM elements into appropriate formatters. You'll have to provide a custom formatter that will reconstruct an HTML tag from a DOM element. Look into how formatters work. Start from this section in the Readme.
(Caveat: it may not work nicely with word wrap. I don't have a solution to mask some output text as taking zero space at this moment. So it's only safe and won't wrap your links if you also disable word wrap.)
P.S. This gives me an idea. Reconstructing HTML tags seems generic enough that it should be possible to ship it as a standard formatter, alongside `skip`, at some point. Until I solve the masking issue, though, it might confuse unaware users with broken links.
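A minimal sketch of such a custom formatter, for anyone who wants a starting point. The names `escapeAttr`, `openingTag`, and `keepTag` are made up for illustration; the sketch assumes the element exposes `name`, `attribs`, and `children` as domhandler elements do, and it disables word wrap because of the caveat above.

```javascript
// Hypothetical helper: escape characters that would break a
// double-quoted attribute value when the tag is written back out.
function escapeAttr(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/"/g, '&quot;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Hypothetical helper: rebuild the opening tag from a
// domhandler-style element ({ name, attribs }).
function openingTag(elem) {
  const attrs = Object.entries(elem.attribs || {})
    .map(([key, value]) => ` ${key}="${escapeAttr(value)}"`)
    .join('');
  return `<${elem.name}${attrs}>`;
}

// Custom formatter: write the tag back out and let html-to-text
// convert the children as usual.
function keepTag(elem, walk, builder, formatOptions) {
  builder.addInline(openingTag(elem));
  walk(elem.children, builder);
  builder.addInline(`</${elem.name}>`);
}

// Options object to pass to html-to-text.
const options = {
  wordwrap: false, // wrapping could otherwise split the emitted tag
  formatters: { keepTag },
  selectors: [{ selector: 'a', format: 'keepTag' }],
};
```

Passing `options` as the second argument to the converter should leave anchors in the output as tags while everything else is converted to text. Void elements and attributes without values are not handled here.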
@KillyMXI, somewhat related to this question - is it possible to get the value of a particular attribute? In my use case (which is nearly perfect out of the box!) I have the following selectors:
```javascript
{
  baseElements: {
    selectors: [
      '[data-islntxt]'
    ]
  }
}
```
And this gives me all the content I need. Each line is its own div and everything works out perfectly, with one caveat: there is a value in each div in the data-ln-ref attribute that I want to also print. If I were formatting it, I'd want to print it like this:
`${data-ln-ref value} ${div text}`
I'm looking through the built-in formatters but I'm sure someone with more experience in this library than me would be able to tell me pretty easily if this is something I can do, or if it would be far too complicated.
I would greatly appreciate it if you have the time to look, thank you!
In case anyone comes across this in the future: it only took 15 minutes of hacking around to come up with a solution. Here is how I am doing it. It seems pretty concise, but there may be a way to repurpose an existing formatter and remove a couple of lines:
(I retyped this to remove client-specific and unrelated things so there might be a typo or two)
```javascript
const { convert } = require('html-to-text');

// `html` is the input HTML string.
const parsed = convert(html, {
  baseElements: {
    selectors: [
      '[data-islntxt]'
    ]
  },
  formatters: {
    'lineNumberFormatter': function (elem, walk, builder, formatOptions) {
      builder.openBlock({ leadingLineBreaks: formatOptions.leadingLineBreaks });
      builder.addInline(`${elem.attribs['data-ln-ref']} `); // This is the only line I added to the default formatBlock formatter
      walk(elem.children, builder);
      builder.closeBlock({ trailingLineBreaks: formatOptions.trailingLineBreaks });
    }
  },
  selectors: [
    {
      selector: '[data-islntxt]',
      format: 'lineNumberFormatter'
    }
  ],
  preserveNewlines: true
});
```
I must say I'm extremely impressed with how the code is laid out and how easy it is to see what's going on under the hood.
@pcopley Glad you figured it out. This is indeed the most concise solution.
I see a connection to ordered list formatting. Maybe I can move part of that into the builder to make it easier to construct custom numbered lists with proper column alignment and word wrapping (not necessarily more concise, depending on how you count). Currently it all lives in the lists formatter and cannot be reused without copying the whole thing.
This question is sufficiently different from the OP's question, though. A separate issue with a reference would have served better. I wrote my follow-up thoughts in #238 so I don't lose them.
For anyone like me who needed to be able to do this, there is an example here:
```javascript
var { htmlToText } = require('html-to-text');

var text = htmlToText('<i>Hello <span>World</span></i>', {
  formatters: {
    // Create a formatter.
    'outputHtmlTag': function (elem, walk, builder, formatOptions) {
      builder.addInline('<' + formatOptions.tagName + '>');
      walk(elem.children, builder);
      builder.addInline('</' + formatOptions.tagName + '>');
    }
  },
  tags: {
    // Assign it to tags.
    'i': {
      format: 'outputHtmlTag',
      options: { tagName: 'i' }
    },
    'strong': {
      format: 'outputHtmlTag',
      options: { tagName: 'strong' }
    }
  }
});

console.log(text); // <i>Hello World</i>
```
The `tags` option is deprecated, by the way. Use `selectors` instead:
```javascript
...
selectors: [
  { selector: 'i', format: 'outputHtmlTag', options: { tagName: 'i' } },
  ...
]
...
```
Version 9 will offer more built-in options when it's done.
So, tl;dr: the "TypeScript-friendly" way to do this is to use the `itemPrefix` option as the "container" for the tag name 🥳
```javascript
import { compile } from 'html-to-text'

const convert = compile({
  wordwrap: null,
  formatters: {
    // Create a formatter.
    outputHTML: function (elem, walk, builder, formatOptions) {
      builder.addInline('<' + formatOptions.itemPrefix + '>')
      walk(elem.children, builder)
      builder.addInline('</' + formatOptions.itemPrefix + '>')
    },
  },
  selectors: [
    { selector: 'i', format: 'outputHTML', options: { itemPrefix: 'i' } },
    { selector: 'p', format: 'outputHTML', options: { itemPrefix: 'p' } },
    { selector: 'h2', format: 'outputHTML', options: { itemPrefix: 'h2' } },
    { selector: 'h3', format: 'outputHTML', options: { itemPrefix: 'h3' } },
    { selector: 'h4', format: 'outputHTML', options: { itemPrefix: 'h4' } },
    { selector: 'h5', format: 'outputHTML', options: { itemPrefix: 'h5' } },
    { selector: 'h6', format: 'outputHTML', options: { itemPrefix: 'h6' } },
  ],
})
```
@huntedman
Hmm.
I wouldn't call it "typescript friendly". More like a hack around @types/html-to-text. You are just hijacking a name used for a different purpose and adding a code smell.
DefinitelyTyped's definition for selector options might be somewhat flawed.
Selector options objects are intended to be extendable by client code.
The type definition should probably include `[key: string]: any`.
Refer to #223 for a related issue.
If possible, a prettier solution would be to extend the interface definition locally. Another solution would be to open an issue in the DefinitelyTyped repo asking to open the `FormatOptions` interface for extension.
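Another way to sidestep the typing problem, sketched below, is to not pass the tag name through options at all and read it from the DOM element instead. This assumes the element exposes `name` the way domhandler elements do; `outputTag` and `keptTags` are names made up for this example, and the tag list is only illustrative.

```javascript
// Formatter that takes the tag name from the element itself,
// so no custom keys are needed in the selector options.
function outputTag(elem, walk, builder, formatOptions) {
  builder.addInline(`<${elem.name}>`);
  walk(elem.children, builder);
  builder.addInline(`</${elem.name}>`);
}

// Hypothetical list of tags to keep in the output.
const keptTags = ['i', 'p', 'h2', 'h3', 'h4', 'h5', 'h6'];

const options = {
  wordwrap: null,
  formatters: { outputTag },
  // One selector entry per tag, all pointing at the same formatter.
  selectors: keptTags.map((tag) => ({ selector: tag, format: 'outputTag' })),
};
```

Because no extra option keys are involved, nothing needs to be repurposed or added to the option types, whichever way the type definitions end up being fixed.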
Version 9 is now released. It comes with a new set of predefined formatters (see the second table).
- `blockHtml` and `inlineHtml` output the outer HTML of a given node (no recursive walking in `html-to-text`, fully handled by `htmlparser2`);
- `blockTag` and `inlineTag` output the given node as an HTML tag but convert its contents, similar to the examples provided above.
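As a rough illustration of how these might be wired up, here is an options sketch that uses only the formatter names listed above. The choice of selectors (`a`, `table`) is an assumption made for the example, and word wrap is disabled as a precaution given the wrapping caveat mentioned earlier in the thread.

```javascript
// Sketch of options for html-to-text v9 using the built-in formatters.
const options = {
  wordwrap: false, // avoid wrapping inside the emitted tags
  selectors: [
    // Keep the <a> tag itself, convert its contents as usual.
    { selector: 'a', format: 'inlineTag' },
    // Emit tables as their original outer HTML, contents untouched.
    { selector: 'table', format: 'blockHtml' },
  ],
};

// Usage (requires html-to-text v9):
//   const { convert } = require('html-to-text');
//   const text = convert(html, options);
```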