Unreliable iterator based incremental parsing #123

Open
TLCFEM opened this issue Mar 23, 2024 · 6 comments

@TLCFEM

TLCFEM commented Mar 23, 2024

    data = source.read(0o3000)

The parser is fed in chunks of 0o3000 (1536) bytes here. If a chunk boundary falls inside an href attribute value, nothing further is parsed.

See example:

Wrong:

    from lxml import etree

    parser = etree.HTMLPullParser()
    for data in (b'<root><a href="2011-03-13_',  b'135411/">2011-03-13_135411/</a></root>',):
        parser.feed(data)
        for _, elem in parser.read_events():
            print(elem.tag) # nothing
    parser.close()

Expected:

    from lxml import etree

    parser = etree.HTMLPullParser()
    for data in (b'<root><a href="2011-03-13_135411/">2011-03-13_135411/</a></root>',):
        parser.feed(data)
        for _, elem in parser.read_events():
            print(elem.tag) # a root
    parser.close()

It may be better just to feed all at once.

        parser.feed(source.fp.data)
        for event, element in parser.read_events():
            for child in links(element):
                if child is None:
                    continue
                yield child

@rajatomar788
Owner

@TLCFEM there is another loop outside the while loop for cases like this which iterates any leftover tags. And the chunks being fed are large enough to avoid this problem.

@TLCFEM
Author

TLCFEM commented Mar 24, 2024

> @TLCFEM there is another loop outside the while loop for cases like this which iterates any leftover tags. And the chunks being fed are large enough to avoid this problem.

Please try this link: https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/

The saved html:

<html>
<!--
* PyWebCopy Engine [version 7.0.2]
* Copyright 2020; Raja Tomar
* File mirrored from [https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/]
* At UTC datetime: [2024-03-24 17:40:10.070531]
--><head><title>Index of /seismic-products/strong-motion/volume-products/2011/</title></head>
<body>
<h1>Index of /seismic-products/strong-motion/volume-products/2011/</h1><hr><pre><a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/">../</a>
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/01_Jan/">01_Jan/</a>                                            24-Mar-2024 17:29                   -
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/02_Christchurch_mainshock_extended_pass_band/">02_Christchurch_mainshock_extended_pass_band/</a>      24-Mar-2024 17:29                   -
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/02_Feb/">02_Feb/</a>                                            24-Mar-2024 17:19                   -
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/03_Mar/">03_Mar/</a>                                            24-Mar-2024 17:20                   -
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/04_Apr/">04_Apr/</a>                                            24-Mar-2024 17:26                   -
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/05_May/">05_May/</a>                                            24-Mar-2024 17:05                   -
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/06_Christchurch_13_June_extended%20pass%20band/">06_Christchurch_13_June_extended pass band/</a>        24-Mar-2024 17:29                   -
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/06_Jun/">06_Jun/</a>                                            24-Mar-2024 17:29                   -
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/07_Jul/">07_Jul/</a>                                            24-Mar-2024 17:26                   -
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/08_Aug/">08_Aug/</a>                                            24-Mar-2024 17:29                   -
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/09_Sep/">09_Sep/</a>                                            24-Mar-2024 17:29                   -
<a href="10_Oct/">10_Oct/</a>                                            24-Mar-2024 17:22                   -
<a href="11_Nov/">11_Nov/</a>                                            24-Mar-2024 17:29                   -
<a href="12_Dec/">12_Dec/</a>                                            24-Mar-2024 17:16                   -
</pre><hr></body>
</html>

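The last three links (10_Oct/, 11_Nov/, 12_Dec/) are broken in this example: their href values were left relative instead of being rewritten like the ones above. As far as I can tell, many links are broken by this issue on sites like this one.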
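One way to spot the affected links in a saved page is to list the anchors whose href has no URL scheme. This is only a rough sketch: the helper name `relative_hrefs` is hypothetical, and it assumes that an unprocessed link is one whose href is still relative, as in the listing above.

```python
from urllib.parse import urlparse

from lxml import html


def relative_hrefs(saved_html):
    """Return href values of <a> tags that have no URL scheme.

    Assumption: a link that was processed ends up absolute, so a
    scheme-less href marks a link that was skipped.
    """
    tree = html.fromstring(saved_html)
    return [href for href in tree.xpath("//a/@href") if not urlparse(href).scheme]


sample = (
    '<html><body>'
    '<a href="https://example.org/2011/01_Jan/">01_Jan/</a>'
    '<a href="10_Oct/">10_Oct/</a>'
    '</body></html>'
)
print(relative_hrefs(sample))
```

Note that a legitimate parent link such as "../" would also be reported if it was left relative, so the result is a list of candidates rather than a definitive list of failures.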

@TLCFEM
Author

TLCFEM commented Mar 24, 2024

> there is another loop outside the while loop for cases like this which iterates any leftover tags.

Just to be clear, the problem is not caused by incomplete data that was never fully fed, so the iterator itself is fine.

If the chunk boundary falls outside the href value, parsing works fine.

    from lxml import etree

    parser = etree.HTMLPullParser()
    # Note the difference: the chunk boundary falls right after href=, outside the quoted value.
    for data in (b'<root><a href=', b'"2011-03-13_135411/">2011-03-13_135411/</a></root>',):
        parser.feed(data)
        for _, elem in parser.read_events():
            print(elem.tag)  # a root
    parser.close()
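To see which split positions matter on a given installation, the two cases above can be generalised into a small scan. This is a sketch only: `tags_seen` and `find_bad_splits` are hypothetical helper names, and the scan simply compares every two-chunk split against feeding the document in one piece.

```python
from lxml import etree


def tags_seen(chunks):
    """Feed the chunks one by one and collect the tags of all reported elements."""
    parser = etree.HTMLPullParser()
    tags = []
    for data in chunks:
        parser.feed(data)
        tags.extend(elem.tag for _, elem in parser.read_events())
    parser.close()
    # Events produced by close() (e.g. for implied wrapper elements) are drained too.
    tags.extend(elem.tag for _, elem in parser.read_events())
    return tags


def find_bad_splits(doc):
    """Return the cut offsets where a two-chunk feed differs from a single feed."""
    whole = tags_seen([doc])
    return [
        cut
        for cut in range(1, len(doc))
        if tags_seen([doc[:cut], doc[cut:]]) != whole
    ]


doc = b'<root><a href="2011-03-13_135411/">2011-03-13_135411/</a></root>'

if __name__ == "__main__":
    for cut in find_bad_splits(doc):
        # Show where the first chunk ended for each problematic split.
        print(cut, doc[:cut][-15:])
```

On an unaffected libxml2 the list of bad splits should come back empty; on an affected one it should point at offsets inside the href value.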

@rajatomar788
Owner

Alright, then I will change it to feed everything at once.

@TLCFEM
Author

TLCFEM commented Mar 28, 2024

It turns out to be a bug in libxml2; see https://bugs.launchpad.net/lxml/+bug/2058828

Maybe check etree.LIBXML_VERSION and provide the one-off (feed everything at once) alternative for libxml2 versions < 2.11.
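A minimal sketch of such a version check is below. It is not the project's actual code: `iter_elements` and `MIN_SAFE_LIBXML` are hypothetical names, the 2.11 threshold is taken from the suggestion above rather than verified independently, and `source` is assumed to be any binary file-like object.

```python
import io

from lxml import etree

# Assumption: chunked feeding is treated as reliable from libxml2 2.11 on.
MIN_SAFE_LIBXML = (2, 11, 0)


def iter_elements(source, chunk_size=0o3000):
    """Yield parsed elements, feeding in chunks only on newer libxml2."""
    parser = etree.HTMLPullParser()
    if etree.LIBXML_VERSION >= MIN_SAFE_LIBXML:
        # Incremental path: feed chunk by chunk and drain events as they arrive.
        while True:
            data = source.read(chunk_size)
            if not data:
                break
            parser.feed(data)
            for _, elem in parser.read_events():
                yield elem
    else:
        # Fallback path: feed the whole document in a single call.
        parser.feed(source.read())
    parser.close()
    # Drain whatever events are still pending after close().
    for _, elem in parser.read_events():
        yield elem


page = io.BytesIO(b'<html><body><a href="2011-03-13_135411/">x</a></body></html>')
hrefs = [elem.get("href") for elem in iter_elements(page) if elem.tag == "a"]
```

The fallback gives up streaming and holds the whole page in memory, which seems an acceptable trade for directory listings like the one above.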

@rajatomar788
Owner

Will try to do it.
