I'm currently working on a threaded link finder that crawls the internet looking for links, stores them all in a database (that part is mostly done), and then exports them to different file formats; CSV (link, referrer, etc.) and newline-separated are my current goals. Realizing that I was going to be creating a massive database of URLs, I was wondering whether anyone has a project where that would be useful. I've currently got about 1.6 million links and am adding 100+ per second right now, though that rate is slowing down a bit.
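For anyone following along, here is a minimal sketch of the two export formats mentioned above. It is written in Python purely for illustration and is not the poster's code; the function names and the `(link, referrer)` row shape are assumptions, since the actual database schema is not described in the thread.

```python
import csv
import io


def export_csv(rows, out):
    """Write (link, referrer) pairs as CSV with a header row.

    The csv module quotes fields that contain commas or quotes,
    which matters because URLs can legally contain commas.
    """
    writer = csv.writer(out)
    writer.writerow(["link", "referrer"])  # assumed column names
    writer.writerows(rows)


def export_lines(rows, out):
    """Write one link per line, dropping the referrer."""
    for link, _referrer in rows:
        out.write(link + "\n")


if __name__ == "__main__":
    # Hypothetical sample data standing in for rows read from the database.
    sample = [
        ("http://a.example/?q=1,2", "http://b.example/"),
        ("http://c.example/", "http://a.example/"),
    ]
    buf = io.StringIO()
    export_csv(sample, buf)
    print(buf.getvalue())
```

With millions of rows, it would probably make sense to stream rows from the database cursor into these functions rather than loading them all into a list first.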
Relevantly enough, I'm right now working to upgrade and optimize a parallel/distributed web crawler application that I've been working on as a demo app for my research project. I'm curious how yours works; what language and design? Does it obey robots.txt? Does it do any "niceness" to avoid hitting a single domain too fast?
I second Kerm's questions. My first question as I was reading your post was if it had any sort of wait time between each hit. Are they only URLs? Does it keep track of some text on each page?
_player1537 wrote:
I second Kerm's questions. My first question as I was reading your post was if it had any sort of wait time between each hit. Are they only URLs? Does it keep track of some text on each page?



My system does not wait between hits. It keeps track of URLs and also saves the entire content of each page.

Quote:
Relevantly enough, I'm right now working to upgrade and optimize a parallel/distributed web crawler application that I've been working on as a demo app for my research project. I'm curious how yours works; what language and design? Does it obey robots.txt? Does it do any "niceness" to avoid hitting a single domain too fast?


Mine is in PHP for now, for ease of prototyping and changing. It intentionally does not obey robots.txt. It does not have any niceness yet, but I am looking at adding that at some point.
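As a rough illustration of the "niceness" and robots.txt checks discussed above, here is a minimal sketch in Python (not PHP, and not either poster's actual implementation). The `DomainThrottle` class, the `is_allowed` helper, the `demo-bot` user agent, and the delay value are all hypothetical names and values chosen for the example; only `urllib.robotparser` and `urllib.parse` are real standard-library modules.

```python
import threading
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class DomainThrottle:
    """Enforce a minimum delay between requests to the same host.

    Safe to share between crawler threads: a slot is reserved under
    the lock, and the sleep happens outside it so that other hosts
    are not blocked while one thread waits.
    """

    def __init__(self, min_delay=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay      # seconds between hits on one host
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_allowed = {}         # host -> earliest time of next hit

    def wait(self, url):
        """Block until it is polite to fetch url; return seconds waited."""
        host = urlsplit(url).netloc.lower()
        with self._lock:
            now = self._clock()
            ready_at = max(now, self._next_allowed.get(host, now))
            self._next_allowed[host] = ready_at + self.min_delay
        delay = ready_at - now
        if delay > 0:
            self._sleep(delay)
        return delay


def is_allowed(robots_txt, user_agent, url):
    """Check url against the text of an already-downloaded robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    throttle = DomainThrottle(min_delay=0.5)
    rules = "User-agent: *\nDisallow: /private/\n"
    for url in ["http://example.com/a", "http://example.com/private/b"]:
        if is_allowed(rules, "demo-bot", url):
            waited = throttle.wait(url)
            print(f"fetch {url} after waiting {waited:.2f}s")
        else:
            print(f"skip {url} (disallowed by robots.txt)")
```

Each worker thread would call `throttle.wait(url)` just before fetching. A real crawler would also cache one parsed robots.txt per host instead of re-parsing the text for every URL.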
  