I fixed some issues in the main branch, and now if I run python -m justext -s Polish "https://wiadomosci.gazeta.pl/wiadomosci/7,114883,27025667,ziemniaki-na-szostej-surowka-na-dziesiatej-jak-pomoc-zeby.html" I think you get what you expect. The title "Ziemniaki na szóstej, surówka na dziesiątej". Jak pomagać, żeby nie zaszkodzić? [PORADNIK W PIGUŁCE] appears twice in the original HTML too, and jusText has no deduplication logic. jusText is intended for building corpora, IMHO, and some duplication there is not so bad. Deduplication would be nice to have, but I don't have the motivation to implement it because I no longer use jusText in my projects.
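Since jusText has no deduplication logic, one possible workaround is to drop repeated paragraphs after extraction. The following is only a minimal sketch, not part of jusText: the function name dedupe_texts and the normalization (collapsing whitespace, ignoring case) are assumptions made for illustration.

```python
import re


def dedupe_texts(texts):
    """Return the texts in order, keeping only the first copy of each.

    Hypothetical helper, not part of jusText. Two texts count as
    duplicates when they match after collapsing whitespace and
    ignoring case. Texts that are empty after normalization are dropped.
    """
    seen = set()
    unique = []
    for text in texts:
        # Normalize so that trivial whitespace or case differences still match.
        key = re.sub(r"\s+", " ", text).strip().casefold()
        if key and key not in seen:
            seen.add(key)
            unique.append(text)
    return unique
```

To apply it to jusText output, one could pass in the text attribute of every paragraph returned by justext.justext(html, justext.get_stoplist("Polish")) whose is_boilerplate attribute is false. Exact matching like this only catches repeats such as the doubled title; near-duplicates would need a fuzzier comparison.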
jusText outputs the title of this webpage twice:
https://wiadomosci.gazeta.pl/wiadomosci/7,114883,27025667,ziemniaki-na-szostej-surowka-na-dziesiatej-jak-pomoc-zeby.html
(archived as https://web.archive.org/web/20211020174043/https://wiadomosci.gazeta.pl/wiadomosci/7,114883,27025667,ziemniaki-na-szostej-surowka-na-dziesiatej-jak-pomoc-zeby.html)
The rest of the extraction is not completely clean either (e.g. "REKLAMA" elements).