Hi guys,
This is an update to my proxy scraper; it should now perform better and gather more proxies.
The script now gathers over 5,400 proxies. The dependency on BeautifulSoup4 has been removed and replaced with re, a native Python module, so the script now runs entirely on the standard library without any third-party dependencies.
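To illustrate the BeautifulSoup-free approach, an ip:port pair can be pulled out of raw HTML with a single regular expression. A minimal sketch (shown in Python 3 for illustration, while the script itself targets Python 2; the sample HTML below is made up, not taken from any of the listed sites):

```python
import re

# Hypothetical snippet of a proxy-list page; real pages differ per site.
html = """
<tr><td>203.0.113.7</td><td>8080</td></tr>
<tr><td>198.51.100.22:3128</td></tr>
"""

# Matches a dotted-quad IP immediately followed by a colon and a port number.
PROXY_RE = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d{1,5}')

proxies = PROXY_RE.findall(html)
print(proxies)  # only the ip:port form matches; table-split pairs need a second pattern
```

Note that pages which put the IP and port in separate table cells (like the first row above) need a different pattern, which is why the script has a per-site grabber function.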
Proxy Sites:
-http://www.proxy-list.org
-http://www.us-proxy.org
-http://www.free-proxy-list.net
-http://www.cool-proxy.net
-http://www.samair.ru
-http://www.proxylisty.com
-http://www.nntime.com
-http://www.aliveproxy.com
Update:
-Gathers over 5,400 proxies
-Added multithreading support
-Removed BeautifulSoup4
-Added http://www.samair.ru
-Added http://www.proxylisty.com
-9/6/2015
-Added http://nntime.com
-Added http://www.aliveproxy.com
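The multithreading support mentioned above boils down to a shared queue of scraped proxies consumed by worker threads. A minimal sketch of that pattern (in Python 3; the thread count and the check_proxy stub are illustrative, not the script's actual code):

```python
import queue
import threading

workerQueue = queue.Queue()
checked = []
lock = threading.Lock()

def check_proxy(proxy):
    # Stub: a real checker would attempt a request through the proxy.
    return True

def worker():
    # Drain the queue until it is empty, then exit.
    while True:
        try:
            proxy = workerQueue.get_nowait()
        except queue.Empty:
            return
        if check_proxy(proxy):
            with lock:  # protect the shared list from concurrent appends
                checked.append(proxy)
        workerQueue.task_done()

for p in ["203.0.113.7:8080", "198.51.100.22:3128"]:
    workerQueue.put(p)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(checked))
```

Each grabber function pushes ip:port strings into the queue, and the checker threads pull from it concurrently, so slow sites no longer block the whole run.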
Download:
-proxy-scraper-v.1.1
-proxy-scraper-v.1.2
-VirusTotal 0/55
The samair function is outdated; the ports are not given because they are rendered as images.
You are not taking your script seriously, but at least it's threaded :)
# Requires urllib2, StringIO, gzip, re (all standard library in Python 2);
# bug() and workerQueue are defined elsewhere in the script.
def samair():
    print "Grabbing: http://www.samair.ru/"
    primary_url = "http://www.samair.ru/proxy/proxy-00.htm"
    # Build the list of paginated URLs: proxy-01.htm ... proxy-30.htm
    urls = []
    for i in range(1, 31):
        urls.append(primary_url.replace("00", str(i).zfill(2)))
    for url in urls:
        try:
            bug("grabbing '" + url + "'")
            opener = urllib2.build_opener()
            opener.addheaders = [('Host', 'www.samair.ru'),
                                 ('Connection', 'keep-alive'),
                                 ('Cache-Control', 'max-age=0'),
                                 ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'),
                                 ('Upgrade-Insecure-Requests', '1'),
                                 ('User-agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'),
                                 ('Referer', 'https://www.google.com'),
                                 ('Accept-Encoding', 'gzip, deflate, sdch'),
                                 ('Accept-Language', 'en-US,en;q=0.8')]
            response = opener.open(url, timeout=10)
            # The response body is gzip-compressed; decompress it in memory.
            compressedFile = StringIO.StringIO(response.read())
            html = gzip.GzipFile(fileobj=compressedFile, mode='rb').read()
            # Follow the page's ip-port listing link (newurl starts with "/").
            newurl = re.findall(r'/proxy/ip-port/\d{1,10}\.html', html)[0]
            response = opener.open("http://www.samair.ru" + newurl, timeout=10)
            compressedFile = StringIO.StringIO(response.read())
            html = gzip.GzipFile(fileobj=compressedFile, mode='rb').read()
            # Extract every ip:port pair and queue it for the worker threads.
            templs = re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d{1,5}', html)
            for proxy in templs:
                workerQueue.put(proxy)
                bug("samair() " + proxy)
        except Exception, e:
            print str(e)
            bug("Failed to grab '" + url + "'")
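The StringIO/gzip dance in the function above exists because the site returns gzip-compressed bodies. The same in-memory decompression in Python 3 looks like this (shown with locally compressed sample data rather than a live response):

```python
import gzip
import io

# Stand-in for response.read(): a gzip-compressed HTML body.
body = gzip.compress(b"<html>127.0.0.1:8080</html>")

# Wrap the raw bytes in a file-like object and decompress in memory,
# mirroring the StringIO + GzipFile pattern from the Python 2 script.
html = gzip.GzipFile(fileobj=io.BytesIO(body), mode='rb').read().decode('utf-8')
print(html)  # <html>127.0.0.1:8080</html>
```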
Starting Proxy Scraper...
Grabbing: http://www.us-proxy.org/
Grabbing: http://free-proxy-list.net/
Please wait...
Could not scrape any proxies!
Those two don't work either, and there is no notification about it.
This lets you run with just one working site without being aware of the broken ones, while thinking you're scraping seven sites or more.
Please update it.
I can't download from http://www.mediafire.com/download/un6agfgzq3fmsqi/proxy-scraper-v1.2%28scrapeomatic.blogspot.com%29.zip
Maybe it is blocked by your ad blocker.
The outdated samair function is OK.
ReplyDeletehttp://proxy-list.org
http://www.us-proxy.org
http://free-proxy-list.net
http://www.proxylisty.com
The script does not work for these four sites.
Could you update it, please?
Thank you for your work!
Hey,
I wonder if you have your updated code uploaded somewhere?
I could not find anything on your website.
Thanks
I hacked a solution together; it can be found here: https://ghostbin.com/paste/b2vur
Everything is working.
Hey guys! Take a look at Proxicity.io's free rotating proxy APIs and free rotating user-agent API (https://www.proxicity.io). It saves you from having to scrape all of these sites and verify that they are up. At the time of posting, there are over 4,300 verified proxies from over 119 countries, all available via our RESTful APIs.
You can easily use these in your Python scripts via the endpoint:
https://www.proxicity.io/api/v1/YOUR-FREE-APIKEY/proxy
The result looks something like this:
{
"cookiesSupport": true,
"country": "US",
"curl": "http://107.151.136.205:80",
"getSupport": true,
"httpsSupport": false,
"ip": "107.151.136.205",
"ipPort": "107.151.136.205:80",
"isAnonymous": true,
"lastChecked": "Tue, 31 May 2016 12:36:45 GMT",
"port": "80",
"postSupport": true,
"protocol": "http",
"refererSupport": true,
"userAgentSupport": true
}
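A response like the one above can be parsed directly into a usable proxy string with the standard json module. A hedged sketch that parses the sample body shown above (trimmed to a few fields) rather than calling the live endpoint, whose current availability I can't vouch for:

```python
import json

# Sample response body from the comment above, trimmed to a few fields.
raw = '''{
    "country": "US",
    "httpsSupport": false,
    "ip": "107.151.136.205",
    "ipPort": "107.151.136.205:80",
    "port": "80",
    "protocol": "http"
}'''

data = json.loads(raw)
# Build a proxy URL of the form scheme://ip:port from the parsed fields.
proxy = "%s://%s" % (data["protocol"], data["ipPort"])
print(proxy)  # http://107.151.136.205:80
```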