Removing an HTML tag seems to clear the complete DOM
I'm trying to remove all empty paragraph tags, but somehow the following code clears my complete DOM.
Is this a bug? Am I doing something wrong?
var dom = CQ.Create(HtmlBody);
var paragraphs = dom.Select("p");
foreach (var paragraph in paragraphs.Where(paragraph => string.IsNullOrEmpty(paragraph.InnerText)))
{
    paragraph.Remove();
}
return dom.Render();
Can you post the markup? In a very simple test I can't reproduce it:
var dom = CQ.CreateDocument("<div><p>test</p><p></p><p></p></div><p>test2<p>test3<p></p>");
Assert.AreEqual(6, dom["p"].Length);
var paragraphs = dom.Select("p");
foreach (var paragraph in paragraphs.Where(paragraph => string.IsNullOrEmpty(paragraph.InnerText)))
{
    paragraph.Remove();
}
Assert.AreEqual(3, dom["p"].Length);
Ok I figured it out... When I replace
return dom.Render();
with
return dom.Render(DomRenderingOptions.RemoveComments | DomRenderingOptions.QuoteAllAttributes);
it seems to work...
That seems strange. Are you changing the default rendering options anywhere? Also which version are you using? On Feb 6, 2013 1:13 PM, "Marco Franssen" [email protected] wrote:
We are using <package id="CsQuery.Signed" version="1.3.3-signed" targetFramework="net40" />
Yesterday we also saw some strange behaviour when getting the text...
Not working:
var dom = CQ.Create(html);
var result = dom.Remove("a").Text();
We expected this to remove all anchors and then return the text from the DOM... or should we use Render here?
Working:
var dom = CQ.Create(title);
var temp = dom.Remove("a");
var result = temp.Text();
Those appear identical except for title vs. html ... Text should return all the raw text, whereas Render would include the HTML elements as well. Removing the anchors would also remove the text of those links.
Back at my desk... just to clarify a couple of things about usage which might have something to do with this. These are identical:
// 1. chained
var result = dom.Remove("a").Text();
// 2. separate statements
dom.Remove("a");
var result = dom.Text();
// 3. via a temporary variable
var temp = dom.Remove("a");
var result = temp.Text();
This is because the Remove method is destructive - it doesn't return a new CQ object; it alters the DOM and returns the same object.
Also - Render and Text operate on two different things. Render renders the entire DOM, whereas Text returns the contents of the text nodes only from the selection. So these are the same:
var html = dom.Render();
var html = dom.Select("p").Render(); // the results of the Select are unused
These are different:
var text = dom.Text(); // assuming "dom" was just created, this should return all the text,
// since the initial selection is all the top-level nodes
var text = dom.Select("body").Text(); // ALWAYS returns all the text
var text = dom.Select("p").Text(); // returns only the text inside p tags
It seems like there might be some confusion in your code about the DOM vs. the selection set. The DOM is the same for any CQ object derived from a single source, whereas the selection changes depending on the selectors you run. Most methods return data based on the selection; Render is an exception because it's specifically for rendering the entire DOM. There's also RenderSelection for returning the outer HTML of each selected element.
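To make the DOM vs. selection distinction concrete, here is a minimal sketch. It assumes the CsQuery package is referenced; the sample markup and the class name are made up for illustration and are not from this thread. No expected output is shown, since the exact rendering depends on the rendering options in effect.

```csharp
using System;
using CsQuery;

class DomVersusSelection
{
    static void Main()
    {
        // Hypothetical sample markup, not taken from the thread.
        CQ dom = CQ.Create("<div><p>one</p><a href='#'>link</a><p>two</p></div>");

        // Select returns a CQ object with a different selection
        // over the same underlying DOM.
        CQ paragraphs = dom.Select("p");

        // Text() works on the selection: only the text inside the <p> elements.
        Console.WriteLine(paragraphs.Text());

        // RenderSelection() also works on the selection: outer HTML of each <p>.
        Console.WriteLine(paragraphs.RenderSelection());

        // Render() ignores the selection and renders the whole DOM,
        // so these two calls should print the same markup.
        Console.WriteLine(paragraphs.Render());
        Console.WriteLine(dom.Render());

        // Removing elements changes the shared DOM, so the change is
        // visible through every CQ object derived from it.
        dom.Select("a").Remove();
        Console.WriteLine(dom.Render());
    }
}
```

The sketch uses Select("a").Remove() rather than Remove("a") so that the set of removed elements does not depend on what the current selection happens to be.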
btw I just pushed the signed 1.3.4 package -- forgot yesterday. I don't think anything in this update would have to do with this though.
Seems like 1.3.4 broke some things... Because now I get completely different results. Seems like all paragraphs are deleted now...
foreach (var paragraph in paragraphs)
{
    if (string.IsNullOrEmpty(paragraph.InnerText) || paragraph.InnerText.Trim() == " ")
        paragraph.Remove();
    else
    {
        paragraph.RemoveAttribute("class");
    }
}
var cleanedHtml = dom.Render(DomRenderingOptions.RemoveComments | DomRenderingOptions.QuoteAllAttributes);
The HTML is just some default Outlook email.... The remove attribute is to get rid of the class="MsoNormal"... This code results in deleting all paragraphs...
When debugging, I get the following message on all properties of the paragraph object:
Function evaluation disabled because a previous function evaluation timed out. You must continue execution to reenable function evaluation.
I get this on each of the paragraphs, in every foreach iteration, on all properties.
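A side note on the loop above, separate from the behaviour being reported: as written, the comparison Trim() == " " can never be true, because Trim strips leading and trailing whitespace, so a whitespace-only paragraph would not be caught by that test. A minimal sketch of a whitespace-aware check follows, using string.IsNullOrWhiteSpace (available from .NET 4.0). IsBlank is a hypothetical helper name, not part of CsQuery.

```csharp
using System;

class BlankCheck
{
    // Hypothetical helper: true for null, empty, or whitespace-only text.
    static bool IsBlank(string text)
    {
        return string.IsNullOrWhiteSpace(text);
    }

    static void Main()
    {
        // Trim() removes the space, so the result is "" and never " ".
        Console.WriteLine(" ".Trim() == " ");      // False

        Console.WriteLine(IsBlank(null));          // True
        Console.WriteLine(IsBlank(" "));           // True
        Console.WriteLine(IsBlank("\u00A0"));      // True (non-breaking space)
        Console.WriteLine(IsBlank("Dear Marco,")); // False
    }
}
```

In the loop, the condition could then be written as IsBlank(paragraph.InnerText), assuming InnerText returns the paragraph's text as a string.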
Is there any way you can provide me with some markup that you are having trouble with? It's really hard for me to identify a potential problem because I can't reproduce anything. If you want to email me something directly go right ahead: jamietre at gmail.com
Here is an HTML example. I am also trying to remove the wrapping div.WordSection1; I tried to unwrap it using the following code.
dom.Select("div.WordSection1").Unwrap();
<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head><meta name=Generator content="Microsoft Word 14 (filtered medium)"><!--[if !mso]><style>v\:* {behavior:url(#default#VML);}
o\:* {behavior:url(#default#VML);}
w\:* {behavior:url(#default#VML);}
.shape {behavior:url(#default#VML);}
</style><![endif]--><style><!--
/* Font Definitions */
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
{font-family:Tahoma;
panose-1:2 11 6 4 3 5 4 4 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0cm;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri","sans-serif";
mso-fareast-language:EN-US;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{mso-style-priority:99;
color:purple;
text-decoration:underline;}
p.MsoAcetate, li.MsoAcetate, div.MsoAcetate
{mso-style-priority:99;
mso-style-link:"Balloon Text Char";
margin:0cm;
margin-bottom:.0001pt;
font-size:8.0pt;
font-family:"Tahoma","sans-serif";
mso-fareast-language:EN-US;}
span.EmailStyle17
{mso-style-type:personal-compose;
font-family:"Calibri","sans-serif";
color:windowtext;}
span.BalloonTextChar
{mso-style-name:"Balloon Text Char";
mso-style-priority:99;
mso-style-link:"Balloon Text";
font-family:"Tahoma","sans-serif";}
.MsoChpDefault
{mso-style-type:export-only;
font-family:"Calibri","sans-serif";
mso-fareast-language:EN-US;}
@page WordSection1
{size:612.0pt 792.0pt;
margin:70.85pt 70.85pt 70.85pt 70.85pt;}
div.WordSection1
{page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]--></head><body lang=NL link=blue vlink=purple><div class=WordSection1><p class=MsoNormal>Dear Marco,<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal><span lang=EN-US>This is a testmail to test some stuff with<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>Paragraphs in CsQuery<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>So hopefully this will solve the problems from CsQuery….<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span style='mso-fareast-language:NL'>Kind regards,<o:p></o:p></span></p><p class=MsoNormal><span style='mso-fareast-language:NL'><img width=128 height=3 id="Picture_x0020_1" src="cid:[email protected]" alt="blue_strip"><o:p></o:p></span></p><p class=MsoNormal><b><span lang=EN-US style='mso-fareast-language:NL'>Marco Franssen<o:p></o:p></span></b></p></div></body></html>
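On the Unwrap call shown before the markup: in jQuery, unwrap removes the parent of each selected element, not the selected element itself. Assuming CsQuery's Unwrap follows the same semantics, selecting the wrapper div and unwrapping it would target the div's parent. A hedged sketch of one way to drop the wrapper while keeping its contents is below; the shortened markup and the class name UnwrapSketch are hypothetical.

```csharp
using System;
using CsQuery;

class UnwrapSketch
{
    static void Main()
    {
        // Hypothetical, shortened stand-in for the Outlook markup above.
        CQ dom = CQ.Create(
            "<div class=\"WordSection1\"><p>Dear Marco,</p><p>Kind regards,</p></div>");

        // Select the wrapper's child elements and unwrap them. Under jQuery
        // semantics (assumed here), this removes their parent, which is the
        // wrapper div, and leaves the paragraphs in its place.
        dom.Select("div.WordSection1").Children().Unwrap();

        Console.WriteLine(dom.Render());
    }
}
```

Children() selects element children only, which is enough for this markup because the wrapper contains nothing but paragraphs.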
Finally found the problem! There is indeed a bug with InnerText -- so elements that shouldn't have been matched were getting removed. Fixed in next push. Try the updated DLLs.