Hi guys,
This is an update to my proxy scraper; it should now perform better and gather more proxies.
The script now gathers over 5,400 proxies. The dependency on BeautifulSoup4 has been removed and replaced with re, a native Python module, so the script now runs entirely on the standard library without any third-party dependencies.
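To illustrate the BeautifulSoup-free approach, an ip:port pair can be pulled out of raw HTML with a single regular expression. A minimal sketch (shown in Python 3 for illustration, while the script itself targets Python 2; the sample HTML below is made up, not taken from any of the listed sites):

```python
import re

# Hypothetical snippet of a proxy-list page; real pages differ per site.
html = """
<tr><td>203.0.113.7</td><td>8080</td></tr>
<tr><td>198.51.100.22:3128</td></tr>
"""

# Matches a dotted-quad IP immediately followed by a colon and a port number.
PROXY_RE = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d{1,5}')

proxies = PROXY_RE.findall(html)
print(proxies)  # only the ip:port form matches; table-split pairs need a second pattern
```

Note that pages which put the IP and port in separate table cells (like the first row above) need a different pattern, which is why the script has a per-site grabber function.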
Proxy Sites:
-http://www.proxy-list.org
-http://www.us-proxy.org
-http://www.free-proxy-list.net
-http://www.cool-proxy.net
-http://www.samair.ru
-http://www.proxylisty.com
-http://www.nntime.com
-http://www.aliveproxy.com
Update:
-Gathers over 5,400 proxies
-Added multithreading support
-Removed BeautifulSoup4
-Added http://www.samair.ru
-Added http://www.proxylisty.com
-9/6/2015
-Added http://nntime.com
-Added http://www.aliveproxy.com
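The multithreading support mentioned above boils down to a shared queue of scraped proxies consumed by worker threads. A minimal sketch of that pattern (in Python 3; the thread count and the check_proxy stub are illustrative, not the script's actual code):

```python
import queue
import threading

workerQueue = queue.Queue()
checked = []
lock = threading.Lock()

def check_proxy(proxy):
    # Stub: a real checker would attempt a request through the proxy.
    return True

def worker():
    # Drain the queue until it is empty, then exit.
    while True:
        try:
            proxy = workerQueue.get_nowait()
        except queue.Empty:
            return
        if check_proxy(proxy):
            with lock:  # protect the shared list from concurrent appends
                checked.append(proxy)
        workerQueue.task_done()

for p in ["203.0.113.7:8080", "198.51.100.22:3128"]:
    workerQueue.put(p)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(checked))
```

Each grabber function pushes ip:port strings into the queue, and the checker threads pull from it concurrently, so slow sites no longer block the whole run.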
Download:
-proxy-scraper-v.1.1
-proxy-scraper-v.1.2
-VirusTotal 0/55
The samair function is outdated; the ports are not given because they are rendered as images.
You are not taking your script seriously, but at least it's threaded :)
# Requires urllib2, StringIO, gzip, re (all standard library in Python 2);
# bug() and workerQueue are defined elsewhere in the script.
def samair():
    print "Grabbing: http://www.samair.ru/"
    primary_url = "http://www.samair.ru/proxy/proxy-00.htm"
    # Build the list of paginated URLs: proxy-01.htm ... proxy-30.htm
    urls = []
    for i in range(1, 31):
        urls.append(primary_url.replace("00", str(i).zfill(2)))
    for url in urls:
        try:
            bug("grabbing '" + url + "'")
            opener = urllib2.build_opener()
            opener.addheaders = [('Host', 'www.samair.ru'),
                                 ('Connection', 'keep-alive'),
                                 ('Cache-Control', 'max-age=0'),
                                 ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'),
                                 ('Upgrade-Insecure-Requests', '1'),
                                 ('User-agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'),
                                 ('Referer', 'https://www.google.com'),
                                 ('Accept-Encoding', 'gzip, deflate, sdch'),
                                 ('Accept-Language', 'en-US,en;q=0.8')]
            response = opener.open(url, timeout=10)
            # The response body is gzip-compressed; decompress it in memory.
            compressedFile = StringIO.StringIO(response.read())
            html = gzip.GzipFile(fileobj=compressedFile, mode='rb').read()
            # Follow the page's ip-port listing link (newurl starts with "/").
            newurl = re.findall(r'/proxy/ip-port/\d{1,10}\.html', html)[0]
            response = opener.open("http://www.samair.ru" + newurl, timeout=10)
            compressedFile = StringIO.StringIO(response.read())
            html = gzip.GzipFile(fileobj=compressedFile, mode='rb').read()
            # Extract every ip:port pair and queue it for the worker threads.
            templs = re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d{1,5}', html)
            for proxy in templs:
                workerQueue.put(proxy)
                bug("samair() " + proxy)
        except Exception, e:
            print str(e)
            bug("Failed to grab '" + url + "'")
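The StringIO/gzip dance in the function above exists because the site returns gzip-compressed bodies. The same in-memory decompression in Python 3 looks like this (shown with locally compressed sample data rather than a live response):

```python
import gzip
import io

# Stand-in for response.read(): a gzip-compressed HTML body.
body = gzip.compress(b"<html>127.0.0.1:8080</html>")

# Wrap the raw bytes in a file-like object and decompress in memory,
# mirroring the StringIO + GzipFile pattern from the Python 2 script.
html = gzip.GzipFile(fileobj=io.BytesIO(body), mode='rb').read().decode('utf-8')
print(html)  # <html>127.0.0.1:8080</html>
```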
Starting Proxy Scraper...
Grabbing: http://www.us-proxy.org/
Grabbing: http://free-proxy-list.net/
Please wait...
Could not scrape any proxies!
Those two don't work either, and there is no notification about it.
This lets you run with just one working site without being aware of the broken ones, while thinking you're scraping seven sites or more.
Please update it.
I can't download from http://www.mediafire.com/download/un6agfgzq3fmsqi/proxy-scraper-v1.2%28scrapeomatic.blogspot.com%29.zip
Maybe it is blocked by your ad blocker.
The outdated samair function is OK.
ReplyDeletehttp://proxy-list.org
http://www.us-proxy.org
http://free-proxy-list.net
http://www.proxylisty.com
The script does not work for these four sites.
Could you update it, please?
Thank you for your work!
Hey,
I wonder if you have your updated code uploaded somewhere?
I could not find anything on your website.
Thanks
I hacked a solution together; it can be found here: https://ghostbin.com/paste/b2vur
Everything is working.
Hey guys! Take a look at Proxicity.io's free rotating proxy APIs and free rotating user-agent API (https://www.proxicity.io). It saves you from having to scrape all of these sites and verify that they are up. At the time of posting, there are over 4,300 verified proxies from over 119 countries, all available via our RESTful APIs.
You can easily use these in your Python scripts via the endpoint:
https://www.proxicity.io/api/v1/YOUR-FREE-APIKEY/proxy
The result looks something like this:
{
"cookiesSupport": true,
"country": "US",
"curl": "http://107.151.136.205:80",
"getSupport": true,
"httpsSupport": false,
"ip": "107.151.136.205",
"ipPort": "107.151.136.205:80",
"isAnonymous": true,
"lastChecked": "Tue, 31 May 2016 12:36:45 GMT",
"port": "80",
"postSupport": true,
"protocol": "http",
"refererSupport": true,
"userAgentSupport": true
}
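A response like the one above can be parsed directly into a usable proxy string with the standard json module. A hedged sketch that parses the sample body shown above (trimmed to a few fields) rather than calling the live endpoint, whose current availability I can't vouch for:

```python
import json

# Sample response body from the comment above, trimmed to a few fields.
raw = '''{
    "country": "US",
    "httpsSupport": false,
    "ip": "107.151.136.205",
    "ipPort": "107.151.136.205:80",
    "port": "80",
    "protocol": "http"
}'''

data = json.loads(raw)
# Build a proxy URL of the form scheme://ip:port from the parsed fields.
proxy = "%s://%s" % (data["protocol"], data["ipPort"])
print(proxy)  # http://107.151.136.205:80
```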