Raja Tomar edited this page Jun 7, 2019 · 1 revision

PyWebCopy © 6


Created By : Raja Tomar License : MIT Email: [email protected]

Clone websites and webpages with ease in Python. Scrape the web or save complete webpages and websites.

A web scraping and archiving tool written in Python. Archive any online website and its assets (CSS, JS and images) for offline reading, storage or any other reason. It's easy with pywebcopy.

Why is it great? Because it:

  • respects robots.txt
  • saves a webpage with css, js and images with one call
  • clones a complete website with assets and links remapped in one call
  • has direct APIs for simplicity and ease
  • supports subclassing for advanced usage
  • supports custom HTML tag handlers
  • offers lots of configuration for many custom needs
  • provides several scraping packages under one class
    • lxml
    • requests
    • beautifulsoup4
    • pyquery
    • requests_html

Email me at [email protected] with any queries :)

1.1 Installation

pywebcopy is available on PyPI and is easily installable using pip:

$ pip install pywebcopy

You are ready to go. Read the tutorials below to get started.

1.1.1 First steps

You should always check that the latest pywebcopy was installed successfully.

>>> import pywebcopy
>>> pywebcopy.__version__
'6.0.0'

Your version may be different. Once the import works, you can continue with the tutorial.
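The same check can also be done without importing the package. The following is a minimal sketch that assumes Python 3.8 or later, where the standard library provides `importlib.metadata`. The helper name `installed_version` is hypothetical and not part of pywebcopy.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is absent.

    Hypothetical helper for illustration; it is not part of pywebcopy.
    """
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


found = installed_version("pywebcopy")
if found is None:
    print("pywebcopy is not installed; run: pip install pywebcopy")
else:
    print("pywebcopy", found, "is installed")
```

This reads the metadata that pip recorded at install time, so it reports a missing package cleanly instead of raising an ImportError.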

1.2 Basic Usages

To save a single page, type the following in a Python console:

from pywebcopy import save_webpage

kwargs = {'project_name': 'some-fancy-name'}

save_webpage(
    url='http://example-site.com/index.html',
    project_folder='path/to/downloads',
    **kwargs
)

To save a full website (this could overload the target server, so be careful):

from pywebcopy import save_website

kwargs = {'project_name': 'some-fancy-name'}

save_website(
    url='http://example-site.com/index.html',
    project_folder='path/to/downloads',
    **kwargs
)
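Before cloning a whole site, it can help to see what its robots.txt permits and whether it asks for a delay between requests. The sketch below uses only the standard library's `urllib.robotparser` and parses a made-up robots.txt offline. It illustrates the general idea of respecting robots.txt, not how pywebcopy implements it internally. The robots.txt content and the `my-crawler` user agent are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed offline so no network access is needed
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_text):
    """Build a RobotFileParser from robots.txt text instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "my-crawler" is a made-up user agent; it falls back to the "*" rules
allowed = parser.can_fetch("my-crawler", "http://example-site.com/index.html")
blocked = parser.can_fetch("my-crawler", "http://example-site.com/private/data.html")
delay = parser.crawl_delay("my-crawler")

print("index allowed:", allowed)
print("private allowed:", blocked)
print("requested delay in seconds:", delay)
```

For a live site, `set_url()` followed by `read()` on the parser would fetch the real robots.txt instead of the inline text.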

1.3 Running Tests

Running tests is simple and doesn't require any external library. Just run this command from the root directory of the pywebcopy package.

$ python -m pywebcopy run_tests

1.4 Command Line Interface

pywebcopy has an easy-to-use command-line interface that lets you run tasks without taking the longer route through the Python API.

  • Getting the list of commands

    $ python -m pywebcopy -- --help
  • Using the APIs

    $ python -m pywebcopy save_webpage http://google.com E://store// --bypass_robots=True
    or
    $ python -m pywebcopy save_website http://google.com E://store// --bypass_robots
  • Running tests

    $ python -m pywebcopy run_tests

1.5 Authentication and Cookies

Authentication is often needed to access a page. It's easy to authenticate with pywebcopy because it uses a requests.Session object for its base HTTP activity, which can be accessed through the pywebcopy.SESSION attribute. There are plenty of tutorials on setting up authentication with requests.Session.

Here is a basic example of simple HTTP auth:

import pywebcopy

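# HTTP basic auth: a requests.Session takes its credentials through the
# `auth` attribute. Header values must be strings, so passing a dict
# inside `headers` would be rejected by requests.
pywebcopy.SESSION.auth = ('username', 'password')

# Add any extra request headers the site needs (placeholder values)
pywebcopy.SESSION.headers.update({'key1': 'value1'})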

# Rest of the code is as usual
kwargs = {
    'url': 'http://localhost:5000',
    'project_folder': 'e://saved_pages//',
    'project_name': 'my_site'
}
pywebcopy.config.setup_config(**kwargs)
pywebcopy.save_webpage(**kwargs)
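To see what session-level basic auth actually sends, the following sketch uses a plain `requests.Session` and prepares a request without touching the network. It assumes that pywebcopy.SESSION behaves like an ordinary requests.Session, as described above. The URL and credentials are placeholders.

```python
import base64

import requests

# A plain requests.Session standing in for pywebcopy.SESSION (assumption)
session = requests.Session()
session.auth = ("username", "password")  # placeholder credentials

# Prepare a request without sending it, so no server is required
request = requests.Request("GET", "http://localhost:5000/")
prepared = session.prepare_request(request)

# The session's auth tuple becomes a standard "Authorization: Basic ..." header
auth_header = prepared.headers["Authorization"]
decoded = base64.b64decode(auth_header.split(" ", 1)[1]).decode("ascii")

print(auth_header)
print(decoded)
```

Because the credentials live on the session, every request made through that session carries the same header. Basic auth is only base64-encoded, not encrypted, so it should be used over HTTPS outside of local testing.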