Tuesday 1 September 2015

UPDATE! Proxy Scraper Script V1.1(Python) 9/1/2015

Updated 9/6/2015!

Hi guys,

This is an update to my proxy scraper; it should now perform better and gather more proxies.
The script now gathers over 5400 proxies. The dependency on BeautifulSoup4 has been removed and replaced with re, a native Python library, so the script now runs entirely on native Python without any third-party libraries.
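
For illustration, pulling ip:port pairs out of a page with re might look like the sketch below; the sample HTML and the exact pattern are mine, not necessarily what the script uses (shown in Python 3 syntax):

```python
import re

# Match dot-separated ip:port pairs; the dots are escaped so they
# don't act as wildcards
PROXY_RE = re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}:\d{1,5}\b')

sample_html = """
<tr><td>93.184.216.34:8080</td></tr>
<tr><td>10.0.0.1:3128</td></tr>
"""

proxies = PROXY_RE.findall(sample_html)
print(proxies)  # ['93.184.216.34:8080', '10.0.0.1:3128']
```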

Proxy Sites:

-http://www.proxy-list.org
-http://www.us-proxy.org
-http://www.free-proxy-list.net
-http://www.cool-proxy.net
-http://www.samair.ru
-http://www.proxylisty.com
-http://www.nntime.com
-http://www.aliveproxy.com

Update:

-Gathers +5400 proxies (up from +3200)
-Added Multithreading Support
-Removed BeautifulSoup4
-Added http://www.samair.ru
-Added http://www.proxylisty.com

-9/6/2015
-Added http://nntime.com
-Added http://www.aliveproxy.com
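
The multithreading support listed above can be sketched roughly like this: one thread per site, all feeding a shared queue. This is a minimal sketch in Python 3 syntax; the site names and results are made up stand-ins for real scraping:

```python
import threading
import queue

# Shared queue collecting proxies from all grabber threads,
# mirroring the script's worker queue idea
workerQueue = queue.Queue()

def grab(site, found):
    # Stand-in for a per-site scraper: a real one would fetch the
    # page and regex out ip:port pairs before queueing them
    for proxy in found:
        workerQueue.put(proxy)

# Hypothetical scrape results, one list per site
sites = {
    "http://site-a.example": ["1.2.3.4:80"],
    "http://site-b.example": ["5.6.7.8:3128", "9.9.9.9:8080"],
}

threads = [threading.Thread(target=grab, args=(s, p)) for s, p in sites.items()]
for t in threads:
    t.start()
for t in threads:
    t.join()

proxies = []
while not workerQueue.empty():
    proxies.append(workerQueue.get())
print(len(proxies))  # 3
```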

Download:

-proxy-scraper-v.1.1
-proxy-scraper-v.1.2

-VirusTotal 0/55

Requirements:

-Python-2.7

13 comments:

  1. The samair function is outdated; the ports are no longer given in the page text because they are images.
    You are not taking your script seriously, but at least it's threaded :)

    # Fixed-up samair(); bug() and workerQueue come from the main script,
    # which also imports urllib2, StringIO, gzip and re (Python 2.7).
    def samair():
        print "Grabbing: http://www.samair.ru/"
        primary_url = "http://www.samair.ru/proxy/proxy-00.htm"
        urls = []

        # Index pages run proxy-01.htm .. proxy-30.htm, zero-padded
        for i in range(1, 31):
            urls.append(primary_url.replace("00", str(i).zfill(2)))

        for url in urls:
            try:
                bug("grabbing '" + url + "'")
                opener = urllib2.build_opener()
                opener.addheaders = [('Host', 'www.samair.ru'),
                    ('Connection', 'keep-alive'),
                    ('Cache-Control', 'max-age=0'),
                    ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'),
                    ('Upgrade-Insecure-Requests', '1'),
                    ('User-agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'),
                    ('Referer', 'https://www.google.com'),
                    ('Accept-Encoding', 'gzip, deflate, sdch'),
                    ('Accept-Language', 'en-US,en;q=0.8')]

                response = opener.open(url, timeout=10)
                compressedFile = StringIO.StringIO(response.read())
                decompressedFile = gzip.GzipFile(fileobj=compressedFile, mode='rb')
                html = decompressedFile.read()

                # The ip:port list lives on a separate page linked from each index page
                newurl = re.findall(r'/proxy/ip-port/\d{1,10}\.html', html)[0]

                response = opener.open("http://www.samair.ru" + newurl, timeout=10)
                compressedFile = StringIO.StringIO(response.read())
                decompressedFile = gzip.GzipFile(fileobj=compressedFile, mode='rb')
                html = decompressedFile.read()

                templs = re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d{1,5}', html)

                for proxy in templs:
                    workerQueue.put(proxy)
                    bug("samair() " + proxy)

            except Exception, e:
                print str(e)
                bug("Failed to grab '" + url + "'")

  2. Starting Proxy Scraper...

    Grabbing: http://www.us-proxy.org/
    Grabbing: http://free-proxy-list.net/

    Please wait...
    Could not scrape any proxies!

    Those 2 don't work either, and there is no notification of it.
    You can end up scraping with just one working site, unaware of the broken ones, while thinking you're scraping 7 sites or more.

    Update it please.

  3. can't download from http://www.mediafire.com/download/un6agfgzq3fmsqi/proxy-scraper-v1.2%28scrapeomatic.blogspot.com%29.zip

    Replies
    1. maybe blocked by adblock

    2. This comment has been removed by the author.

  4. The outdated samair function is OK now

    http://proxy-list.org
    http://www.us-proxy.org
    http://free-proxy-list.net
    http://www.proxylisty.com

    The script does not work for these 4 sites.

    Could you update it, please?

    TY for your work !

    Replies
    1. This comment has been removed by the author.

    2. Hey,
      I wonder if you have your updated code uploaded somewhere?
      I could not find anything in your website.
      Thanks

    3. This comment has been removed by the author.

  5. This comment has been removed by the author.

  6. I hacked a solution together, it can be found here: https://ghostbin.com/paste/b2vur

    Everything is working

  7. Hey guys! Take a look at Proxicity.io's free rotating proxy APIs and free rotating user-agent API (https://www.proxicity.io). It saves you from having to scrape all of these sites and verify that they are up. At the time of posting, there are over 4300 verified proxies from over 119 countries, all available via our RESTful APIs.

    You can easily use these in your Python scripts using the endpoint:

    https://www.proxicity.io/api/v1/YOUR-FREE-APIKEY/proxy

    The result looks something like this:

    {
      "cookiesSupport": true,
      "country": "US",
      "curl": "http://107.151.136.205:80",
      "getSupport": true,
      "httpsSupport": false,
      "ip": "107.151.136.205",
      "ipPort": "107.151.136.205:80",
      "isAnonymous": true,
      "lastChecked": "Tue, 31 May 2016 12:36:45 GMT",
      "port": "80",
      "postSupport": true,
      "protocol": "http",
      "refererSupport": true,
      "userAgentSupport": true
    }
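
    In practice you would fetch the endpoint (for example with urllib) and decode the JSON body; as a minimal sketch, parsing a pared-down copy of the sample response above with the standard json module looks like this:

    ```python
    import json

    # A pared-down copy of the sample API response shown above
    sample = '''{
        "curl": "http://107.151.136.205:80",
        "ip": "107.151.136.205",
        "ipPort": "107.151.136.205:80",
        "port": "80",
        "protocol": "http"
    }'''

    data = json.loads(sample)
    print(data["ipPort"])    # 107.151.136.205:80
    print(data["protocol"])  # http
    ```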
