
Removing an HTML tag seems to clear the complete DOM

Open marcofranssen opened this issue 12 years ago • 11 comments

I'm trying to remove all empty paragraph tags, but somehow the following code clears my complete DOM.

Is this a bug? Am I doing something wrong?

var dom = CQ.Create(HtmlBody);
var paragraphs = dom.Select("p");
foreach (var paragraph in paragraphs.Where(paragraph => string.IsNullOrEmpty(paragraph.InnerText)))
{
    paragraph.Remove();
}
return dom.Render();

marcofranssen avatar Feb 06 '13 10:02 marcofranssen

Can you post the markup? In a very simple test I can't reproduce it:

        var dom = CQ.CreateDocument("<div><p>test</p><p></p><p></p></div><p>test2<p>test3<p></p>");
        Assert.AreEqual(6, dom["p"].Length);
        var paragraphs = dom.Select("p");
        foreach (var paragraph in paragraphs.Where(paragraph => string.IsNullOrEmpty(paragraph.InnerText)))
        {
            paragraph.Remove();
        }
        Assert.AreEqual(3, dom["p"].Length);

jamietre avatar Feb 06 '13 11:02 jamietre

Ok I figured it out... When I replace

return dom.Render();

with

return dom.Render(DomRenderingOptions.RemoveComments | DomRenderingOptions.QuoteAllAttributes);

it seems to work...

marcofranssen avatar Feb 06 '13 18:02 marcofranssen

That seems strange. Are you changing the default rendering options anywhere? Also, which version are you using?

jamietre avatar Feb 06 '13 18:02 jamietre

We are using <package id="CsQuery.Signed" version="1.3.3-signed" targetFramework="net40" />

Yesterday we also saw some strange behaviour when getting the text...

Not working

var dom = CQ.Create(html);
var result = dom.Remove("a").Text();

We expected this to remove all anchors and then return us the text from the dom... or should we use render here?

Working

var dom = CQ.Create(title);
var temp = dom.Remove("a");
var result = temp.Text();

marcofranssen avatar Feb 06 '13 18:02 marcofranssen

Those appear identical except for title vs. html ... Text should return all the raw text whereas render would include the html elements as well. Removing the anchors would also remove the text of that link.

jamietre avatar Feb 06 '13 18:02 jamietre

Back at my desk.. just to clarify a couple things about usage which might have something to do with this. These are identical:

var result = dom.Remove("a").Text();

dom.Remove("a");
var result = dom.Text();

var temp = dom.Remove("a");
var result = temp.Text();

This is because the Remove method is destructive - it doesn't return a new CQ object; it alters the DOM and returns the same object.

Also - Render and Text operate on two different things. Render renders the entire DOM, whereas Text returns the contents of the text nodes only from the selection. So these are the same:

var html = dom.Render();
var html = dom.Select("p").Render(); // the results of the Select are unused

These are different:

var text = dom.Text();                // assuming "dom" was just created, should return all the text,
                                      // since the initial selection is all the top-level nodes
var text = dom.Select("body").Text(); // ALWAYS returns all the text
var text = dom.Select("p").Text();    // returns only the text inside p tags

It seems like there might be some confusion in your code about the DOM vs. the selection set. The DOM is the same for any CQ object derived from a single source, whereas the selection changes depending on the selectors you run. Most methods return data based on the selection; Render is an exception because it's specifically for rendering the entire DOM. There's also RenderSelection for returning the outer HTML of each selected element.
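
To make the DOM vs. selection distinction concrete, here is a minimal sketch. The markup and the class name SelectionVsDom are made up for illustration, and it assumes the CsQuery package is referenced. It uses only calls already mentioned in this thread (CQ.Create, Select, Text, Render, RenderSelection).

```csharp
using System;
using CsQuery;

public static class SelectionVsDom
{
    public static void Main()
    {
        // Hypothetical markup, not taken from the issue.
        CQ dom = CQ.Create("<div><p>one</p><span>two</span></div>");

        // Select returns a CQ object with a new selection over the same DOM.
        CQ paragraphs = dom.Select("p");

        // Text() is selection-based: only text inside the selected elements.
        Console.WriteLine(paragraphs.Text());

        // Render() ignores the selection and renders the whole DOM,
        // so both calls are expected to produce the same markup.
        Console.WriteLine(dom.Render() == paragraphs.Render());

        // RenderSelection() renders the outer HTML of the selected elements only.
        Console.WriteLine(paragraphs.RenderSelection());
    }
}
```

If Render appears to drop content after a Select, the selection is unlikely to be the cause; the DOM itself has probably been modified.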

jamietre avatar Feb 06 '13 21:02 jamietre

btw I just pushed the signed 1.3.4 package -- forgot yesterday. I don't think anything in this update would have to do with this though.

jamietre avatar Feb 06 '13 21:02 jamietre

Seems like 1.3.4 broke some things... because now I get completely different results. It seems like all paragraphs are deleted now...

foreach (var paragraph in paragraphs)
{
    if (string.IsNullOrEmpty(paragraph.InnerText) || paragraph.InnerText.Trim() == "&nbsp;")
        paragraph.Remove();
    else
    {
        paragraph.RemoveAttribute("class");
    }
}
var cleanedHtml = dom.Render(DomRenderingOptions.RemoveComments | DomRenderingOptions.QuoteAllAttributes);

The HTML is just some default Outlook email.... The remove attribute is to get rid of the class="MsoNormal"... This code results in deleting all paragraphs...

When debugging, I get the following message on all properties of the paragraph object: "Function evaluation disabled because a previous function evaluation timed out. You must continue execution to reenable function evaluation."

I get this on each of the paragraphs, in all the foreach iterations, on all properties...
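
The emptiness check in the loop above can be pulled into a small helper, sketched here under assumptions. The name BlankText.IsBlank is hypothetical, and this thread does not establish whether InnerText returns the literal &nbsp; entity or the decoded U+00A0 character, so the sketch accepts both. It does not work around any library behaviour discussed in this issue.

```csharp
using System;

public static class BlankText
{
    // Returns true when the text holds nothing visible: null, empty,
    // whitespace (including U+00A0), or only literal "&nbsp;" entities.
    public static bool IsBlank(string text)
    {
        if (string.IsNullOrWhiteSpace(text))
        {
            return true;
        }

        // Treat each literal entity as a plain space, then re-check.
        string withoutEntities = text.Replace("&nbsp;", " ");
        return string.IsNullOrWhiteSpace(withoutEntities);
    }
}
```

With it, the condition in the loop would read if (BlankText.IsBlank(paragraph.InnerText)).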

marcofranssen avatar Feb 08 '13 18:02 marcofranssen

Is there any way you can provide me with some markup that you are having trouble with? It's really hard for me to identify a potential problem because I can't reproduce anything. If you want to email me something directly go right ahead: jamietre at gmail.com

jamietre avatar Feb 08 '13 19:02 jamietre

Here is an HTML example... I am also trying to remove the wrapping div.WordSection1. I tried to unwrap it using the following code.

dom.Select("div.WordSection1").Unwrap();
<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head><meta name=Generator content="Microsoft Word 14 (filtered medium)"><!--[if !mso]><style>v\:* {behavior:url(#default#VML);}
o\:* {behavior:url(#default#VML);}
w\:* {behavior:url(#default#VML);}
.shape {behavior:url(#default#VML);}
</style><![endif]--><style><!--
/* Font Definitions */
@font-face
    {font-family:Calibri;
    panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
    {font-family:Tahoma;
    panose-1:2 11 6 4 3 5 4 4 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
    {margin:0cm;
    margin-bottom:.0001pt;
    font-size:11.0pt;
    font-family:"Calibri","sans-serif";
    mso-fareast-language:EN-US;}
a:link, span.MsoHyperlink
    {mso-style-priority:99;
    color:blue;
    text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
    {mso-style-priority:99;
    color:purple;
    text-decoration:underline;}
p.MsoAcetate, li.MsoAcetate, div.MsoAcetate
    {mso-style-priority:99;
    mso-style-link:"Balloon Text Char";
    margin:0cm;
    margin-bottom:.0001pt;
    font-size:8.0pt;
    font-family:"Tahoma","sans-serif";
    mso-fareast-language:EN-US;}
span.EmailStyle17
    {mso-style-type:personal-compose;
    font-family:"Calibri","sans-serif";
    color:windowtext;}
span.BalloonTextChar
    {mso-style-name:"Balloon Text Char";
    mso-style-priority:99;
    mso-style-link:"Balloon Text";
    font-family:"Tahoma","sans-serif";}
.MsoChpDefault
    {mso-style-type:export-only;
    font-family:"Calibri","sans-serif";
    mso-fareast-language:EN-US;}
@page WordSection1
    {size:612.0pt 792.0pt;
    margin:70.85pt 70.85pt 70.85pt 70.85pt;}
div.WordSection1
    {page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]--></head><body lang=NL link=blue vlink=purple><div class=WordSection1><p class=MsoNormal>Dear Marco,<o:p></o:p></p><p class=MsoNormal><o:p>&nbsp;</o:p></p><p class=MsoNormal><span lang=EN-US>This is a testmail to test some stuff with<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><span lang=EN-US>Paragraphs in CsQuery<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><span lang=EN-US>So hopefully this will solve the problems from CsQuery&#8230;.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><span style='mso-fareast-language:NL'>Kind regards,<o:p></o:p></span></p><p class=MsoNormal><span style='mso-fareast-language:NL'><img width=128 height=3 id="Picture_x0020_1" src="cid:[email protected]" alt="blue_strip"><o:p></o:p></span></p><p class=MsoNormal><b><span lang=EN-US style='mso-fareast-language:NL'>Marco Franssen<o:p></o:p></span></b></p></div></body></html>

marcofranssen avatar Feb 08 '13 20:02 marcofranssen

Finally found the problem! There is indeed a bug with InnerText -- so elements that shouldn't have been matched were getting removed. Fixed in next push. Try the updated DLLs.

jamietre avatar Feb 09 '13 01:02 jamietre