crazy IE-generated HTML is not normalized #22

kirbysayshi · 2011-05-15T00:21:47Z

IE8 (at least) will take the following HTML:

<!DOCTYPE html>
<html>
    <head></head>
    <body></body>
</html>

And convert it to:

<!DOCTYPE HTML>
<HTML>
    <HEAD></HEAD>
    <BODY></BODY>
</HTML>

The neat part: node-htmlparser handles this just fine!

The bad: libraries like soupselect (https://github.com/harryf/node-soupselect) and the DomUtils included with node-htmlparser will fail when trying to select 'body'. The DomUtils will select 'BODY' properly, but it's a pain to have to try to select BOTH 'body' and 'BODY'.

Should this be handled on the parser-side of things or the selector side of things? I'm not sure. Part of me thinks that the parser should normalize the HTML to an extent, such as make all the tags lowercase. At the same time, a selector engine could do this normalization when searching.
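To make the two options concrete, here is a minimal sketch of each. It assumes DOM nodes shaped like `{ type: 'tag', name: ..., children: [...] }`. That shape and the helper names `normalizeTagNames` and `findByTagName` are assumptions for illustration, not the actual node-htmlparser or soupselect API.

```javascript
// Assumed (hypothetical) node shape: { type: 'tag', name: 'BODY', children: [...] }

// Option 1: parser-side. Walk the tree once and lowercase every tag name.
function normalizeTagNames(nodes) {
  nodes.forEach(function (node) {
    if (node.type === 'tag' && typeof node.name === 'string') {
      node.name = node.name.toLowerCase();
    }
    if (Array.isArray(node.children)) {
      normalizeTagNames(node.children);
    }
  });
  return nodes;
}

// Option 2: selector-side. Leave the tree alone and compare case-insensitively.
function findByTagName(nodes, tagName, found) {
  found = found || [];
  var wanted = tagName.toLowerCase();
  nodes.forEach(function (node) {
    if (node.type === 'tag' && String(node.name).toLowerCase() === wanted) {
      found.push(node);
    }
    if (Array.isArray(node.children)) {
      findByTagName(node.children, tagName, found);
    }
  });
  return found;
}

// Hand-built tree standing in for parser output of the IE-generated HTML.
var dom = [
  { type: 'tag', name: 'HTML', children: [
    { type: 'tag', name: 'HEAD', children: [] },
    { type: 'tag', name: 'BODY', children: [] }
  ] }
];

console.log(findByTagName(dom, 'body').length); // 1
normalizeTagNames(dom);
console.log(dom[0].children[1].name); // body
```

Option 1 pays the cost once and lets every selector library keep doing exact matches. Option 2 preserves the original casing but has to be repeated in each selector engine.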

Thoughts?

kirbysayshi · 2011-05-15T00:53:58Z

After doing a bit more research, it appears that the proper behavior in HTML documents, at least according to the browser vendors, is for element.nodeName and element.tagName to be uppercased, while selectors match tag names case-insensitively. For example, if you run:

<!DOCTYPE html>
<html>
    <head></head>
    <BODY>
        <p></p>
        <P></p>
        <P></P>
        <p></P>
        <script type="text/javascript" charset="utf-8">
            console.log('4 p tags: ', document.querySelectorAll( 'p' ).length);
            console.log('4 P tags: ', document.querySelectorAll( 'P' ).length);

            var p = Array.prototype.slice.call(document.querySelectorAll( 'p' ), 0);
            p.forEach(function(e){
                console.log(e.nodeName, e.tagName);
            });

        </script>
    </BODY>
</html>

It should output:

4 p tags: 4
4 P tags: 4
P P
P P
P P
P P

I attempted to implement this behavior in node-htmlparser, but it broke the RssHandler test case. However... RSS is technically not HTML, but rather XML, which is case-sensitive!

So, perhaps this is where htmlparser has a choice: to support XML or not?
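One way to avoid an either/or choice would be to make normalization opt-out. The sketch below is only an illustration: the `xmlMode` option name and the `normalizeTagName` helper are hypothetical and not part of node-htmlparser. It lowercases rather than uppercases, but the same idea applies in either direction.

```javascript
// Hypothetical helper: `xmlMode` is an assumed option name for this sketch.
function normalizeTagName(name, options) {
  var xmlMode = Boolean(options && options.xmlMode);
  // XML (e.g. RSS) is case-sensitive, so names pass through untouched.
  return xmlMode ? name : name.toLowerCase();
}

console.log(normalizeTagName('BODY'));                       // body
console.log(normalizeTagName('pubDate', { xmlMode: true })); // pubDate
```

With a switch like this, an RSS handler could keep case-sensitive names while an HTML handler normalizes them.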

kirbysayshi · 2011-06-01T05:00:08Z

Update: #24

kirbysayshi mentioned this issue Dec 19, 2013

HtmlHandler, for normalizing tag cases #24

Closed
