Unreliable iterator based incremental parsing #123

TLCFEM · 2024-03-23T22:12:57Z

Line 104 in 9f35b4b

data = source.read(0o3000)

If a chunk boundary falls inside an href attribute value, nothing after it gets parsed.

See example:

Wrong:

    from lxml import etree

    parser = etree.HTMLPullParser()
    for data in (b'<root><a href="2011-03-13_',  b'135411/">2011-03-13_135411/</a></root>',):
        parser.feed(data)
        for _, elem in parser.read_events():
            print(elem.tag) # nothing
    parser.close()

Expected:

    from lxml import etree

    parser = etree.HTMLPullParser()
    for data in (b'<root><a href="2011-03-13_135411/">2011-03-13_135411/</a></root>',):
        parser.feed(data)
        for _, elem in parser.read_events():
            print(elem.tag) # a root
    parser.close()

It may be better just to feed all at once.

        parser.feed(source.fp.data)
        for event, element in parser.read_events():
            for child in links(element):
                if child is None:
                    continue
                yield child


rajatomar788 · 2024-03-24T17:01:33Z

@TLCFEM there is another loop outside the while loop for cases like this which iterates any leftover tags. And the chunks being fed are large enough to avoid this problem.

TLCFEM · 2024-03-24T17:43:00Z

> @TLCFEM there is another loop outside the while loop for cases like this which iterates any leftover tags. And the chunks being fed are large enough to avoid this problem.

Please try this link: https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/

The saved html:

<html>
<!--
* PyWebCopy Engine [version 7.0.2]
* Copyright 2020; Raja Tomar
* File mirrored from [https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/]
* At UTC datetime: [2024-03-24 17:40:10.070531]
--><head><title>Index of /seismic-products/strong-motion/volume-products/2011/</title></head>
<body>
<h1>Index of /seismic-products/strong-motion/volume-products/2011/</h1><hr><pre><a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/">../</a>
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/01_Jan/">01_Jan/</a>                                            24-Mar-2024 17:29                   -
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/02_Christchurch_mainshock_extended_pass_band/">02_Christchurch_mainshock_extended_pass_band/</a>      24-Mar-2024 17:29                   -
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/02_Feb/">02_Feb/</a>                                            24-Mar-2024 17:19                   -
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/03_Mar/">03_Mar/</a>                                            24-Mar-2024 17:20                   -
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/04_Apr/">04_Apr/</a>                                            24-Mar-2024 17:26                   -
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/05_May/">05_May/</a>                                            24-Mar-2024 17:05                   -
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/06_Christchurch_13_June_extended%20pass%20band/">06_Christchurch_13_June_extended pass band/</a>        24-Mar-2024 17:29                   -
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/06_Jun/">06_Jun/</a>                                            24-Mar-2024 17:29                   -
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/07_Jul/">07_Jul/</a>                                            24-Mar-2024 17:26                   -
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/08_Aug/">08_Aug/</a>                                            24-Mar-2024 17:29                   -
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/09_Sep/">09_Sep/</a>                                            24-Mar-2024 17:29                   -
<a href="10_Oct/">10_Oct/</a>                                            24-Mar-2024 17:22                   -
<a href="11_Nov/">11_Nov/</a>                                            24-Mar-2024 17:29                   -
<a href="12_Dec/">12_Dec/</a>                                            24-Mar-2024 17:16                   -
</pre><hr></body>
</html>

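The last three links (10_Oct/, 11_Nov/, 12_Dec/) are broken in this example: they were left as relative hrefs while the others were rewritten to absolute URLs. As far as I can tell, many links end up broken this way on sites like this one.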

TLCFEM · 2024-03-24T17:53:32Z

> there is another loop outside the while loop for cases like this which iterates any leftover tags.

And just to be clear, this is not caused by incomplete or partially fed data, so the iterator itself is fine.

If the break falls outside the href value, parsing works fine.

    from lxml import etree

    parser = etree.HTMLPullParser()
    # Note the difference: the split now falls just before the opening quote of the href value.
    for data in (b'<root><a href=', b'"2011-03-13_135411/">2011-03-13_135411/</a></root>',):
        parser.feed(data)
        for _, elem in parser.read_events():
            print(elem.tag)  # a root
    parser.close()
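The snippets in this thread can be folded into one small check that compares a chunked parse against a one-shot parse of the same bytes. This is only a sketch: `end_tags` and `split_is_safe` are hypothetical helper names, not part of lxml or pywebcopy, and the result depends on the libxml2 version lxml is built against.

```python
from lxml import etree

DOC = b'<root><a href="2011-03-13_135411/">2011-03-13_135411/</a></root>'


def end_tags(chunks):
    """Feed chunks to a fresh HTMLPullParser and collect the tags of all 'end' events."""
    parser = etree.HTMLPullParser()  # default events are ('end',)
    tags = []
    for chunk in chunks:
        if not chunk:
            continue  # skip empty chunks rather than feeding them
        parser.feed(chunk)
        tags.extend(elem.tag for _, elem in parser.read_events())
    parser.close()
    # close() may produce events for elements that were still open
    tags.extend(elem.tag for _, elem in parser.read_events())
    return tags


def split_is_safe(doc, position):
    """Return True if splitting doc at position gives the same events as one-shot feeding."""
    return end_tags([doc[:position], doc[position:]]) == end_tags([doc])


if __name__ == "__main__":
    inside_href = DOC.index(b"2011") + 11   # in the middle of the href value
    before_quote = DOC.index(b'"')          # right after 'href='
    print("libxml2 version:", etree.LIBXML_VERSION)
    print("split inside href value is safe:", split_is_safe(DOC, inside_href))
    print("split before the quote is safe:", split_is_safe(DOC, before_quote))
```

On an affected build, the first check would be expected to report `False` while the second reports `True`, matching the two examples above.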

rajatomar788 · 2024-03-24T18:01:48Z

Alright, then I will change it to feeding everything at once.

TLCFEM · 2024-03-28T04:33:44Z

It turns out to be a bug in libxml2; see https://bugs.launchpad.net/lxml/+bug/2058828

Maybe check etree.LIBXML_VERSION and fall back to the one-shot feed for versions < 2.11.
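A minimal sketch of that version check, assuming the 2.11 threshold suggested above is right (it has not been verified here). `iter_end_elements` and `MIN_SAFE_LIBXML` are hypothetical names for illustration, not pywebcopy's actual code; `etree.LIBXML_VERSION` is the version tuple of the libxml2 library lxml is using at runtime.

```python
from lxml import etree

# Assumption taken from this thread: chunked feeding is reliable from libxml2 2.11 on.
MIN_SAFE_LIBXML = (2, 11)


def iter_end_elements(chunks, min_safe=MIN_SAFE_LIBXML):
    """Yield elements from 'end' events, feeding in one shot on older libxml2."""
    if etree.LIBXML_VERSION < min_safe:
        # Fallback: join everything so no chunk boundary can land inside an attribute.
        chunks = [b"".join(chunks)]
    parser = etree.HTMLPullParser()
    for chunk in chunks:
        if not chunk:
            continue
        parser.feed(chunk)
        for _, elem in parser.read_events():
            yield elem
    parser.close()
    # Drain events produced when the parser closes the remaining open elements.
    for _, elem in parser.read_events():
        yield elem


if __name__ == "__main__":
    pieces = (b'<root><a href="2011-03-13_', b'135411/">2011-03-13_135411/</a></root>')
    for elem in iter_end_elements(pieces):
        if elem.tag == "a":
            print(elem.get("href"))
```

The trade-off is that the fallback holds the whole document in memory, which is usually acceptable for HTML pages but gives up the benefit of incremental parsing on older builds.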

rajatomar788 · 2024-03-29T08:02:39Z

Will try to do it.
