-
Notifications
You must be signed in to change notification settings - Fork 111
How To
Particular webpage can be saved easily using the following methods.
Note: if you get pywebcopy.exceptions.AccessError
when running any of these code then use the code provided on later sections.
Webpage can easily be saved using an inbuilt funtion called .save_webpage()
which takes several
arguments also.
>>> from pywebcopy import save_webpage
>>> save_webpage(project_url='http://google.com', project_folder='c://Saved_Webpages/',)
This use case is slightly more powerful as it can provide every functionallity of the WebPage class.
>>> from pywebcopy import WebPage, config
>>> url = 'http://some-url.com/some-page.html'
# You should always start with setting up the config or use apis
>>> config.setup_config(url, project_folder, project_name, **kwargs)
# Create a instance of the webpage object
>>> wp = WebPage()
# If you want to use `requests` to fetch the page then
>>> wp.get(url)
# Else if you want to use plain html or urllib then use
>>> wp.set_source(object_which_have_a_read_method, encoding=encoding)
>>> wp.url = url # you need to do this if you are using set_source()
# Then you can access several methods like
>>> wp.save_complete()
>>> wp.save_html()
>>> wp.save_assets()
# This Webpage object contains every methods of the Webpage() class and thus
# can be reused for later usages.
I told you earlier that Webpage object is powerful and can be manipulated in any ways.
One feature is that the raw html is now also accepted.
>>> from pywebcopy import WebPage, config
>>> HTML = open('test.html').read()
>>> base_url = 'http://example.com' # used as a base for downloading imgs, css, js files.
>>> project_folder = '/saved_pages/'
>>> config.setup_config(base_url, project_folder)
>>> wp = WebPage()
>>> wp.set_source(HTML)
>>> wp.url = base_url
>>> wp.save_webpage()
Use caution when copying websites as this can overload or damage the servers of the site and rarely could be illegal, so check everything before you proceed.
Using the inbuilt api .save_website()
which takes several arguments.
>>> from pywebcopy import save_website
>>> save_website(project_url='http://localhost:8000', project_folder='e://tests/')
By creating a Crawler() object which provides several other functions as well.
>>> from pywebcopy import Crawler, config
>>> config.setup_config(project_url='http://localhost:5000/',
project_folder='e://tests/', project_name='LocalHost')
>>> crawler = Crawler()
>>> crawler.crawl()