Unreliable iterator based incremental parsing #123
Comments
@TLCFEM There is another loop outside the while loop for cases like this, which iterates over any leftover tags. The chunks being fed are also large enough to avoid this problem.
Please try this link: https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/
The saved HTML:
<html>
<!--
* PyWebCopy Engine [version 7.0.2]
* Copyright 2020; Raja Tomar
* File mirrored from [https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/]
* At UTC datetime: [2024-03-24 17:40:10.070531]
--><head><title>Index of /seismic-products/strong-motion/volume-products/2011/</title></head>
<body>
<h1>Index of /seismic-products/strong-motion/volume-products/2011/</h1><hr><pre><a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/">../</a>
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/01_Jan/">01_Jan/</a> 24-Mar-2024 17:29 -
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/02_Christchurch_mainshock_extended_pass_band/">02_Christchurch_mainshock_extended_pass_band/</a> 24-Mar-2024 17:29 -
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/02_Feb/">02_Feb/</a> 24-Mar-2024 17:19 -
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/03_Mar/">03_Mar/</a> 24-Mar-2024 17:20 -
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/04_Apr/">04_Apr/</a> 24-Mar-2024 17:26 -
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/05_May/">05_May/</a> 24-Mar-2024 17:05 -
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/06_Christchurch_13_June_extended%20pass%20band/">06_Christchurch_13_June_extended pass band/</a> 24-Mar-2024 17:29 -
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/06_Jun/">06_Jun/</a> 24-Mar-2024 17:29 -
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/07_Jul/">07_Jul/</a> 24-Mar-2024 17:26 -
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/08_Aug/">08_Aug/</a> 24-Mar-2024 17:29 -
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/09_Sep/">09_Sep/</a> 24-Mar-2024 17:29 -
<a href="10_Oct/">10_Oct/</a> 24-Mar-2024 17:22 -
<a href="11_Nov/">11_Nov/</a> 24-Mar-2024 17:29 -
<a href="12_Dec/">12_Dec/</a> 24-Mar-2024 17:16 -
</pre><hr></body>
</html>
The last three links are broken in this example. As far as I can tell, many links are broken by this issue on sites like this one.
And just to be clear, this is not caused by incomplete, partially fed data, so the iterator itself is fine. If the break falls outside the href value, parsing works:
from lxml import etree

parser = etree.HTMLPullParser()
# Note the split point: the chunk boundary falls right after "href=",
# outside the attribute value.
for data in (b'<root><a href=', b'"2011-03-13_135411/">2011-03-13_135411/</a></root>',):
    parser.feed(data)
    for _, elem in parser.read_events():
        print(elem.tag)  # a root
parser.close()
Alright, then I will change it to feed everything at once.
It turns out to be a bug in libxml, see this: https://bugs.launchpad.net/lxml/+bug/2058828 Maybe check
Will try to do it.
pywebcopy/pywebcopy/parsers.py
Line 104 in 9f35b4b
Here, if a chunk boundary falls inside an href value, nothing further will be parsed. See example:
Wrong:
Expected:
It may be better just to feed all at once.
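A minimal sketch of what feeding all at once could look like, assuming the source is a binary file-like object. The function name `collect_hrefs` and the `chunk_size` default are hypothetical and not taken from pywebcopy; only `etree.HTMLPullParser`, `feed`, `read_events` and `close` are lxml calls.

```python
import io

from lxml import etree


def collect_hrefs(source, chunk_size=8192):
    """Return the href of every <a> element found in a binary stream.

    Hypothetical helper: the stream is buffered completely and handed to
    the parser in a single feed() call, so no attribute value can be
    split across two chunks.
    """
    # Read the stream chunk by chunk, but join everything before parsing.
    data = b"".join(iter(lambda: source.read(chunk_size), b""))

    parser = etree.HTMLPullParser()  # reports "end" events by default
    parser.feed(data)
    events = list(parser.read_events())
    parser.close()
    # Pick up anything the parser only emitted once it was closed.
    events.extend(parser.read_events())

    return [elem.get("href") for _, elem in events if elem.tag == "a"]


if __name__ == "__main__":
    page = (
        b'<html><body><pre><a href="10_Oct/">10_Oct/</a>\n'
        b'<a href="11_Nov/">11_Nov/</a></pre></body></html>'
    )
    # A small chunk_size only affects reading, not how the parser is fed.
    print(collect_hrefs(io.BytesIO(page), chunk_size=16))
```

The trade-off is memory: the whole document is held in one bytes object, which is usually acceptable for HTML pages but not for arbitrarily large streams.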