Created By : Raja Tomar
License : MIT
Email: [email protected]
Clone websites and webpages with ease, in Python. Web scraping, or saving complete webpages and websites, with Python.
pywebcopy is a web scraping and archiving tool written in Python. Archive any online website and its assets, css, js and
images for offline reading, storage or whatever reason.
It's easy with pywebcopy.
Why is it great? Because it:
- respects robots.txt
- saves a webpage with its css, js and images in one call
- clones a complete website with assets and links remapped in one call
- has direct APIs for simplicity and ease
- supports subclassing for advanced usage
- supports custom html tag handlers
- offers lots of configuration options for custom needs
- bundles several scraping packages under one class:
  - lxml
  - requests
  - beautifulsoup4
  - pyquery
  - requests_html
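Since pywebcopy respects robots.txt by default, it can be useful to see what that check involves. The following is a standard-library sketch (using urllib.robotparser, not pywebcopy's own internals) against hypothetical robots.txt rules:

```python
from urllib.robotparser import RobotFileParser

# Parse a hypothetical robots.txt from a list of lines; against a live
# site you would call parser.set_url(...) followed by parser.read().
parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# Anything outside /private/ is fetchable, /private/ is not.
print(parser.can_fetch("*", "http://example-site.com/index.html"))   # True
print(parser.can_fetch("*", "http://example-site.com/private/a"))    # False
```

pywebcopy performs an equivalent check for you before downloading, unless you explicitly bypass it.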
Email me at [email protected] for any query :)
pywebcopy is available on PyPI and is easily installable using pip:
$ pip install pywebcopy
You are ready to go. Read the tutorials below to get started.
You should always check if the latest pywebcopy is installed successfully.
>>> import pywebcopy
>>> pywebcopy.__version___
6.0.0
Your version may be different, now you can continue the tutorial.
To save any single page, just type in the python console:
from pywebcopy import save_webpage
kwargs = {'project_name': 'some-fancy-name'}
save_webpage(
url='http://example-site.com/index.html',
project_folder='path/to/downloads',
**kwargs
)
To save a full website (this could overload the target server, so be careful):
from pywebcopy import save_website
kwargs = {'project_name': 'some-fancy-name'}
save_website(
url='http://example-site.com/index.html',
project_folder='path/to/downloads',
**kwargs
)
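Because a full-site clone fires many requests in quick succession, you may want to throttle any crawling you do around it. This is a generic politeness sketch using only the standard library, not a pywebcopy API; `fetch` is a hypothetical stand-in for whatever download function you use:

```python
import time

def polite_fetch(urls, fetch, delay=1.0):
    """Call `fetch` on each url, sleeping `delay` seconds between calls
    so the target server is not hammered."""
    results = []
    for i, url in enumerate(urls):
        if i:
            time.sleep(delay)  # pause between consecutive requests
        results.append(fetch(url))
    return results

# Demo with a stand-in fetch function and no delay.
pages = polite_fetch(["/a", "/b"], fetch=lambda u: "html-of" + u, delay=0.0)
print(pages)  # ['html-of/a', 'html-of/b']
```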
Running tests is simple and doesn't require any external library. Just run this command from the root directory of the pywebcopy package.
$ python -m pywebcopy run-tests
pywebcopy has a very easy-to-use command-line interface which
can help you do tasks without having to worry about the
internals.
- Getting help:
$ python -m pywebcopy -- --help
- Saving a webpage or a whole website:
$ python -m pywebcopy save_webpage http://google.com E://store// --bypass_robots=True
$ python -m pywebcopy save_website http://google.com E://store// --bypass_robots
- Running tests:
$ python -m pywebcopy run_tests
Most of the time authentication is needed to access a certain page.
It's really easy to authenticate with pywebcopy
because it uses a
requests.Session
object for its base http activity, which can be accessed
through the pywebcopy.SESSION
attribute. And as you know, there
are tons of tutorials on setting up authentication with requests.Session.
Here is a basic example of simple http auth:
import pywebcopy
# http basic auth goes on the session itself, not in the headers
pywebcopy.SESSION.auth = ('username', 'password')
# if the site uses a login form instead, post the form data first
pywebcopy.SESSION.post('http://localhost:5000/login',
                       data={'key1': 'value1'})
# Rest of the code is as usual
kwargs = {
'url': 'http://localhost:5000',
'project_folder': 'e://saved_pages//',
'project_name': 'my_site'
}
pywebcopy.config.setup_config(**kwargs)
pywebcopy.save_webpage(**kwargs)
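For simple http basic auth, all requests does with the (username, password) pair on the session is build an Authorization header. A standard-library sketch of the header the session ends up sending:

```python
import base64

def basic_auth_header(username, password):
    # requests derives the same header from session.auth = (username, password)
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return {"Authorization": "Basic " + token}

print(basic_auth_header("username", "password"))
# {'Authorization': 'Basic dXNlcm5hbWU6cGFzc3dvcmQ='}
```

Form-based logins are different: there the server sets a session cookie after a successful POST, and requests.Session carries that cookie on every later request automatically.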