Thanks to KermM for convincing me to make a topic for this :)

Over the last semester, I started a project with two friends of mine whose purpose was to scan the entire web recursively, following the links found on each webpage we scanned. Starting from a seed website, such as cemetech.net, the program would find every link on that page, then scan each linked page for its links, and so on, until, theoretically, we had scanned every single page on the Internet and established how it connects to every other page. A completed setup would look very similar to a computer's folder hierarchy, with each webpage represented by its own directory or "node" containing links to every other webpage/node it linked to.
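
For a concrete picture, here is a minimal sketch (in Python, purely illustrative and not the project's actual code) of the kind of node/link structure described above:

Code:
# Illustrative only: one way to represent the "folder hierarchy" of pages.
# Each node is a webpage; its value is the set of pages it links out to.
web_graph = {}  # url -> set of urls linked from that page

def add_page(url, outgoing_links):
    """Create a node for url (if needed) and record its outgoing links."""
    node = web_graph.setdefault(url, set())
    for link in outgoing_links:
        node.add(link)
        web_graph.setdefault(link, set())  # every linked page gets a node too

add_page("https://www.cemetech.net", {"https://www.google.com"})
add_page("https://www.google.com", {"https://www.apple.com"})
print(web_graph)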

I first created a proof-of-concept running on a single computer that started at www.google.com and logged the webpages it found until I told it to stop. Within a couple of minutes of running, the program had traversed to www.apple.com and had discovered various things such as the iTunes installer and the eBooks section of the iTunes Store.
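
For readers who want to picture that proof-of-concept, here is a hedged, minimal sketch of a single-machine crawler in Python. It is not the project's code; the seed URL and page limit are arbitrary, and a real crawler would also need rate limiting and robots.txt checks (more on those below).

Code:
# Minimal illustrative crawler (not the project's code): breadth-first,
# follows <a href="..."> links, stops after a fixed number of pages.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=50):
    seen, queue = {seed}, deque([seed])
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except Exception:
            continue  # skip pages that fail to load or decode
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
                print(url, "->", absolute)

crawl("https://www.google.com")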

However, I knew that to have any chance of fully scanning the web, a single computer and its Internet connection would not be sufficient. Thus, the project grew into a crowdsourced program, designed so that tens, hundreds, or thousands of people could scan the web and relay the results back to a central server (or servers). A connection would look like this:

Code:
Client -> Server: Client Hello: Query
Server -> Client: Webpage to scan
Client: scans webpage and finds all links
Client -> Server: Sends links
Server -> Client: Server ACK, end connection
Server: Finds all non-duplicate links and creates nodes for said links
Server: Adds links to the newly created nodes in the scanned webpage's node.
Server: Adds non-duplicate links to central database of webpages to be scanned.
Server: Repeat!
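
To make the exchange above a bit more tangible, here is a rough sketch of what one client-side round trip could look like. The JSON-over-HTTP transport and the endpoint names are assumptions made for the sake of the sketch; the post does not specify the actual wire format.

Code:
# Illustrative client-side round trip (hypothetical endpoints and JSON format,
# assumed for the sake of the sketch; not the project's actual protocol).
import json
from urllib.request import Request, urlopen

SERVER = "http://example.com/api"  # placeholder server address

def request_work():
    """Client Hello / Query: ask the server which webpage to scan next."""
    with urlopen(SERVER + "/next") as response:
        return json.load(response)["url"]

def submit_links(url, links):
    """Send the links found on url back to the server, then end the connection."""
    payload = json.dumps({"url": url, "links": links}).encode("utf-8")
    request = Request(SERVER + "/submit", data=payload,
                      headers={"Content-Type": "application/json"})
    with urlopen(request) as response:
        return response.status == 200  # server ACK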


The server was built with a verification procedure. Essentially, each link had to be scanned by two separate clients to make sure the entries matched. If they didn't, both results were thrown out and the webpage was rescanned. The program also had built-in robots.txt support so as to avoid websites that didn't want to be scanned.
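
As a rough sketch (illustrative Python, not the project's implementation) of how that double-scan check could work on the server side:

Code:
# Illustrative double-scan verification (not the project's code):
# each URL must be scanned by two different clients, and the two
# result sets must match exactly or both are discarded.
pending = {}  # url -> (client_id, frozenset of links) from the first scan

def record_scan(url, client_id, links):
    """Return the verified link set, or None if verification is pending or failed."""
    links = frozenset(links)
    if url not in pending:
        pending[url] = (client_id, links)
        return None                      # wait for a second, independent scan
    first_client, first_links = pending.pop(url)
    if first_client == client_id:
        pending[url] = (first_client, first_links)
        return None                      # same client twice doesn't count
    if first_links == links:
        return links                     # entries match: accept the result
    return None                          # mismatch: throw both out, rescan later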

The hope was to eventually create a visualization of the web, showing all the connections and intersections between the nodes in one enormous web.

The framework is essentially complete on both the server and the client; the final step is to finish debugging both and get the scanning part of the client up and working.

While I have been very busy with school recently, I hope to finish both programs by the end of June (earlier if help comes along) and to have the Internet scanned by the end of next summer.

Thoughts or offers of help appreciated!!
This sounds rather cool and quite ambitious. Do you have any concept pictures of what the end result would be if it only did, say, 50 links?
Does your crawler respect robots.txt at all?
merthsoft wrote:
Does your crawler respect robots.txt at all?

This, and how are you planning on handling redirects?
I'm glad to hear you're exploring this! About eight years ago, I worked on a project I called WordNet, which tried to spider the internet using linked URLs and then associate pages based on a combination of shared keywords and links. I described the very early experiments with the project in a news article, and I showed off one of the rendered graphs here. Each point represents a domain, the lines indicate linked URLs, the distances represent conceptual distance based on keywords, and the colors are the type of TLD (com, net, gov, edu, org, etc). As Merth asked, what does your spider look like? Does it rate-limit itself and obey robots.txt?
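
For anyone curious what a simple rendering along those lines might look like, here is a small sketch using networkx and matplotlib. The domains, edges, and color scheme are invented for illustration and are not the original WordNet data or code.

Code:
# Illustrative graph rendering (invented sample data, not the WordNet results):
# nodes are domains, edges are links between them, node color encodes the TLD.
import matplotlib.pyplot as plt
import networkx as nx

edges = [("cemetech.net", "google.com"), ("google.com", "apple.com"),
         ("cemetech.net", "mit.edu"), ("mit.edu", "nasa.gov")]
tld_colors = {"net": "tab:blue", "com": "tab:orange", "edu": "tab:green", "gov": "tab:red"}

graph = nx.Graph(edges)
colors = [tld_colors[domain.rsplit(".", 1)[-1]] for domain in graph.nodes]
layout = nx.spring_layout(graph, seed=42)  # spring layout stands in for "conceptual distance"
nx.draw_networkx(graph, pos=layout, node_color=colors, with_labels=True)
plt.show()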

I will answer the other questions later when I have more time, but as to whether it supports robots.txt, yes :)
Quote:
The program also had built in robots.txt support so as to avoid websites who didn't want to be scanned.
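
For reference, a minimal example of honoring robots.txt with Python's standard library (illustrative only; the post doesn't say how the project's check is actually implemented):

Code:
# Minimal robots.txt check with the standard library (illustrative; not the
# project's actual implementation).
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

def allowed_to_scan(url, user_agent="*"):
    """Return True if the site's robots.txt permits fetching url."""
    parts = urlsplit(url)
    robots = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    robots.read()
    return robots.can_fetch(user_agent, url)

print(allowed_to_scan("https://www.cemetech.net/forum/"))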
  