Python Selenium scraping setup

When you scrape websites, you eventually reach a point where you need to run a full browser, for example to render content driven by JavaScript, to complete a login, or to reach hidden data.

To start simulating a browser, first install a WebDriver for the browser of your choice. It can be either Firefox or Chrome. In this tutorial we will show an example with Firefox.

wget https://github.com/mozilla/geckodriver/releases/download/v0.19.1/geckodriver-v0.19.1-linux64.tar.gz

tar xvfz geckodriver-v0.19.1-linux64.tar.gz

mv geckodriver ~/.local/bin
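
Selenium can only start Firefox if it finds geckodriver on your PATH, and ~/.local/bin is not on PATH on every system. A minimal sketch for checking this from Python before launching a browser follows; the helper name find_driver is our own and is not part of Selenium.

```python
import shutil


def find_driver(name="geckodriver"):
    """Return the absolute path of the driver binary, or None if it is not on PATH."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_driver()
    if path is None:
        # Assumption: the binary was moved to ~/.local/bin as in the commands above
        print("geckodriver not found; make sure ~/.local/bin is on your PATH")
    else:
        print(f"geckodriver found at {path}")
```

If the check fails, adding ~/.local/bin to PATH in your shell profile is usually enough.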

The next step is installing your Python dependencies. You can also use our proxy list to easily rotate proxies for your Selenium agent: https://blog.proxypage.io/how-to-rotate-proxies-for-web-crawler-with-python/

pip install selenium

Next we can write the beginning of our script

from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options

opts = Options()
opts.add_argument('-headless')  # run Firefox without opening a window
browser = Firefox(options=opts)
browser.get('https://duckduckgo.com')  # load a page to confirm the setup works

This snippet runs your browser in headless mode, so you can use it on servers that have no display attached.
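
A headless browser keeps running in the background until it is told to quit, which matters on a server where nobody sees a leftover window. One way to guarantee cleanup is a small context manager, sketched below. The name managed_browser is our own. The only Selenium call it relies on is the driver's quit() method.

```python
from contextlib import contextmanager


@contextmanager
def managed_browser(factory):
    """Create a browser with factory() and always quit it afterwards."""
    browser = factory()
    try:
        yield browser
    finally:
        # quit() ends the session and closes the browser process
        browser.quit()


# Usage with the Firefox options from the snippet above
# (requires selenium and geckodriver to be installed):
#
# with managed_browser(lambda: Firefox(options=opts)) as browser:
#     browser.get('https://duckduckgo.com')
#     print(browser.title)
```

Because quit() sits in a finally block, the browser is closed even when the scraping code raises an exception.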

If you want to use a different browser, you can easily switch between Firefox, Chrome, Opera, Edge and Safari.

from selenium import webdriver

# executable_path is the Selenium 3 keyword; newer Selenium 4 releases take a Service object instead
firefoxdriver = webdriver.Firefox(executable_path="Path to Firefox driver")
chromedriver = webdriver.Chrome(executable_path="Path to Chrome driver")
iedriver = webdriver.Ie(executable_path="Path To IEDriverServer.exe")
edgedriver = webdriver.Edge(executable_path="Path To MicrosoftWebDriver.exe")
operadriver = webdriver.Opera(executable_path="Path To operadriver")


Through options you can easily add the arguments you want for your browser.

--headless

Opens the browser in headless mode

--start-maximized

Starts the browser maximized to your screen size

--incognito

Opens the browser in incognito (private) mode

--disable-notifications

Disables notifications; this argument only works for Chrome

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--incognito")
driver = webdriver.Chrome(options=options, executable_path="Path to driver")
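
When several of these arguments are toggled per run, it can help to build the list in one place. The sketch below does that for the four Chrome arguments described above. The function names chrome_arguments and build_chrome_options are our own, and Selenium is imported lazily so the pure helper can be used and tested without it.

```python
def chrome_arguments(headless=False, maximized=False,
                     incognito=False, disable_notifications=False):
    """Return the list of Chrome command-line arguments for the enabled features."""
    flags = [
        ("--headless", headless),
        ("--start-maximized", maximized),
        ("--incognito", incognito),
        ("--disable-notifications", disable_notifications),
    ]
    return [flag for flag, enabled in flags if enabled]


def build_chrome_options(**features):
    """Create a Chrome Options object carrying the selected arguments."""
    # Imported here so chrome_arguments() works without selenium installed
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for argument in chrome_arguments(**features):
        options.add_argument(argument)
    return options


# Usage (requires selenium):
# options = build_chrome_options(headless=True, incognito=True)
```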

Now that you know how to set up and launch your Selenium driver, the next step is integrating it with our simple proxy API. With our proxy list you can scrape with Selenium while connecting through our proxies, switching them when you need to.

import requests
from selenium import webdriver

def get_us_ssl():
    """Fetch one US HTTPS proxy from the API and return it as 'ip:port'."""
    api_endpoint = 'https://api.proxypage.io/v1/list?='
    headers = {
        'Content-Type': 'application/x-www-form-urlencoded',
        'api_key': 'YOUR_API_KEY'
    }
    payload = 'type=HTTPS&limit=1&latency=500&ssl=True&country=US'
    response = requests.get(api_endpoint, headers=headers, data=payload)
    proxy = response.json()[0]  # first proxy in the returned list
    proxy_string = proxy['ip'] + ':' + str(proxy['port'])
    return proxy_string
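
The function above indexes straight into the response, so an empty result would surface as an IndexError far from its cause. A small helper can make that case explicit. This is a sketch that assumes the API returns a JSON list of objects with ip and port keys, as the function above implies. The name format_proxy is our own.

```python
def format_proxy(entries):
    """Turn the parsed API response into an 'ip:port' string."""
    if not entries:
        raise ValueError("proxy API returned an empty list")
    first = entries[0]
    return f"{first['ip']}:{first['port']}"


# Usage inside get_us_ssl():
# proxy_string = format_proxy(response.json())
```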

PROXY = get_us_ssl()  # call the function to get the 'ip:port' string

webdriver.DesiredCapabilities.FIREFOX['proxy'] = {
    "httpProxy":PROXY,
    "ftpProxy":PROXY,
    "sslProxy":PROXY,
    "noProxy":None,
    "proxyType":"MANUAL",
    "class":"org.openqa.selenium.Proxy",
    "autodetect":False
}


driver = webdriver.Remote("http://localhost:4444/wd/hub", webdriver.DesiredCapabilities.FIREFOX)
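
The proxy dictionary only works if every entry is a plain "host:port" string, and a wrong value tends to fail late, inside the remote session. The sketch below validates the string before building the dictionary. The name manual_proxy_capability is our own, and the keys are a subset of the ones used in the dictionary above.

```python
def manual_proxy_capability(proxy):
    """Build a manual proxy capability dict from a 'host:port' string."""
    if not isinstance(proxy, str):
        # Catches mistakes such as passing a function instead of its result
        raise TypeError(f"proxy must be a string, got {type(proxy).__name__}")
    host, sep, port = proxy.rpartition(":")
    if not sep or not host or not port.isdigit() or not 0 < int(port) < 65536:
        raise ValueError(f"expected 'host:port', got {proxy!r}")
    return {
        "proxyType": "MANUAL",
        "httpProxy": proxy,
        "sslProxy": proxy,
    }


# Usage with the capabilities from the snippet above (requires selenium):
# webdriver.DesiredCapabilities.FIREFOX['proxy'] = manual_proxy_capability(PROXY)
```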

In this example we use a remote Selenium instance with one of the elite proxies from the proxy list. We aim to keep our API simple to use and to make sure the proxy list gives you the most stable proxies with the lowest latencies. We hope this makes your scraping experience even simpler.