Wikipedia, the free online encyclopedia, contains a wealth of intellectually and monetarily free content (in common terminology, “free as in speech and free as in beer”). The sheer number of users editing the corpus means that the majority of the articles are well-written and largely factual. However, the relationship between related articles, usually inferred by the See Also links at the bottom of each article, are generally incomplete compared to the relationships implied by words linked amidst the text of each article. We propose a PHP framework to spider Wikipedia, collecting both full-text word lists and lists containing only the words from the text of internal links. We propose comparing the relative performance of a system that attempts to find similarity metrics between articles based on the full text of each article and one based on only on the linked words in each article. This implementation uses the TF-IDF algorithm to normalize word frequency and the cosine similarity metric to rank article similarity. Please see http://www.cemetech.net/projects/item.php?id=30 for a full description and documentation, including information on installing and using this program.
- File Size
- 5.3 KB
- Short link
- 12 years, 10 months ago
- No ratings.
ReviewsNobody has reviewed this file yet.
- Content Association In Wikipedia (published 12 years, 10 months ago; 2010-05-05 18:17 UTC)