Weblog entry #2 for miguel
The last time I posted was a long time ago, and Etch still hasn't been released. :-(
Well, this web site is home to a lot of sysadmins, so I'm writing to let you know about a project of mine that has now gone public: http://www.enterpriseblacklist.org.
Quoting the site:
The EBL (Enterprise Blacklist) offers a blacklist of domains, with free distribution. It is fed by collaborators, web robots and web crawlers. It has the objective to be an efficient list of domains that certainly network administrators want the users to remain distant.
We already have more than 1.5 million domains, and started from scratch.
I want to have a lot of information, and right now I'm working on a robot to collect open proxies. Take a look!
Miguel
Comments on this Entry
What, if any, methods of removal do you have? Are they automated or manual?
You can suggest a domain for removal, and the suggestion goes to a voting queue too.
Every day, 10,000 domains are checked to see whether they resolve to an IP address. If we receive an NXDOMAIN error, the domain is removed. If the name server times out, the domain is disabled but will be tested again in 10 days.
I will try to keep every domain tested at least once every 10 days.
This process is automated. You can see the results of every test on this page: http://www.enterpriseblacklist.org/?q=blog/7
The test robot "blogs" every time with the results of the tests.
On the main page there is a link at the top named "Log" that goes to this page.
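The removal rules above can be sketched as a small decision function. This is only an illustration, not EBL's actual code: the names DnsOutcome, Entry and apply_check are made up for this example, the DNS lookup itself is left out, and re-enabling a domain once it resolves again is an assumption.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum
from typing import Optional


class DnsOutcome(Enum):
    """Possible results of one DNS check (hypothetical names)."""
    RESOLVED = "resolved"
    NXDOMAIN = "nxdomain"
    TIMEOUT = "timeout"


# The post says every domain is retested at least every 10 days.
RETEST_INTERVAL = timedelta(days=10)


@dataclass
class Entry:
    domain: str
    enabled: bool = True
    next_check: Optional[date] = None


def apply_check(entry: Entry, outcome: DnsOutcome, today: date) -> Optional[Entry]:
    """Return the updated entry, or None if the domain should be removed."""
    if outcome is DnsOutcome.NXDOMAIN:
        # The name does not exist any more: drop it from the list.
        return None
    # A timeout disables the domain; a successful lookup keeps it enabled
    # (assumption: a previously disabled domain is re-enabled here).
    enabled = outcome is DnsOutcome.RESOLVED
    return Entry(entry.domain, enabled, today + RETEST_INTERVAL)
```

For example, a timeout on 1 January leaves the entry disabled with its next check scheduled for 11 January, while an NXDOMAIN answer returns None and the caller deletes the domain.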
Here's a fun question for you...
If you found a host on the net which was running port scans looking for open proxies - would you list it?
i.e., what exactly are the criteria for inclusion in your list? On your site I just see a list of categories, but no real detail. (I apologize if I've missed it.)
Actually, my idea for open proxies is not to hammer random IPs on the net. It is just to build a robot that extracts the IPs from sites like http://www.proxy4free.com/page1.html or http://www.samair.ru/proxy/socks.htm. There are a lot of sites like these that publish open proxies daily. It would be much more effective to do it this way, IMO: let them do the hard work and we just collect the results, so you can block them on your proxy or firewall.
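A minimal sketch of that kind of extraction, assuming the proxies appear in the page as plain "ip:port" text (real listing sites may use other layouts or obfuscate the values). The function name extract_proxies is hypothetical, and fetching the page is left out.

```python
import re

# Matches dotted-quad IPv4 addresses followed by ":port".
PROXY_RE = re.compile(r"\b((?:\d{1,3}\.){3}\d{1,3}):(\d{1,5})\b")


def extract_proxies(page_text):
    """Return a sorted, de-duplicated list of (ip, port) pairs found in the text."""
    found = set()
    for ip, port in PROXY_RE.findall(page_text):
        # Reject matches that only look like addresses (e.g. 999.1.1.1).
        octets_ok = all(0 <= int(octet) <= 255 for octet in ip.split("."))
        if octets_ok and 0 < int(port) <= 65535:
            found.add((ip, int(port)))
    return sorted(found)
```

The resulting pairs could then be written out in whatever format your proxy or firewall expects.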
I updated the FAQ with more information about the criteria for the blacklist.
What are the criteria for inclusion on the list?
That depends on the category or the source of the domain.
We have 2 robots that extract domains from Sedo and The Domain Name After Market and mark them in the Parked category. Every domain listed on these sites is for sale; they have no real content, just advertising. We have another robot that extracts, daily, the domains listed on the RSS feed of The Domain Name After Market.
EBL has a crawler whose mission is to find porn sites on the web. This crawler is under development, but it is working quite well. The crawler starts with a seed, provided manually. It extracts all the links from the seed and follows them. If the domain of a followed link contains certain predefined words, like gangbang, adult, teens, etc., the domain is listed on the blacklist; if not, the domain is sent to a queue. The queued domain is then accessed through DansGuardian: if it is blocked, it is listed; if not, it is discarded. Then another cycle starts, but the seed is always a porn site.
You may ask yourself whether this crawler can make mistakes. Yes, it can. But there is a curious thing: porn sites link to porn sites, and the chance that a porn site links to a non-porn site is really, really small. Moreover, the non-porn sites that the crawler finds linked from porn sites are almost the same every time, for all porn sites. The porn crawler never goes more than one hop from the seed, which reduces the chance of leaving the "porn to porn" link context.
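One pass of the crawler described above might look roughly like this. It is a sketch under assumptions, not the real EBL crawler: triage_seed and LinkParser are hypothetical names, the seed page is passed in as already-downloaded HTML, links back to the seed's own domain are skipped, and the DansGuardian check on the queued domains is not shown.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Example words taken from the FAQ text; the real list is presumably longer.
KEYWORDS = ("gangbang", "adult", "teens")


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def triage_seed(seed_url, seed_html):
    """Split the domains linked from one seed page into (listed, queued)."""
    parser = LinkParser()
    parser.feed(seed_html)
    seed_domain = urlparse(seed_url).hostname
    listed, queued = set(), set()
    for href in parser.hrefs:
        host = urlparse(urljoin(seed_url, href)).hostname
        if not host or host == seed_domain:
            continue  # mailto: links, fragments, links back to the seed
        if any(word in host for word in KEYWORDS):
            listed.add(host)   # keyword match: goes straight to the blacklist
        else:
            queued.add(host)   # left for the content-filter check
    return listed, queued
```

Because the function only looks at links found on the seed page itself and never follows the queued domains further, it stays within the one-hop limit mentioned above.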
"Help the fight against users" ?!
"It has the objective to be an efficient list of domains that certainly network administrators want the users to remain distant."
Machine translation in use, by any chance?
What you seem to be producing is a list of domains, registered for the purpose of advertising, with no original content.
Get a proof reader please...
Right now, the biggest share of the content is parked domains. I'm doing this because these domains are hard to filter and make my web crawler waste a lot of time parsing them. There are more than 120,000 working porn sites in there too. Now that I have a good sample of parked domains (over 1.8 million), my porn crawler skips these domains and finds the "useful" junk that I want to list.
Right now, at this very moment, I'm finishing the open proxy robot, and before you ask whether the bot goes wild around the net knocking on IPs: NO, it doesn't.
Quoting my other post:
"Actually, my idea for open proxies is not to hammer random IPs on the net. It is just to build a robot that extracts the IPs from sites like http://www.proxy4free.com/page1.html or http://www.samair.ru/proxy/socks.htm. There are a lot of sites like these that publish open proxies daily. It would be much more effective to do it this way, IMHO: let them do the hard work and we just collect the results, so you can block them on your proxy or firewall."