Content Association In Wikipedia
  1. All files
  2. Computer Programs

Description

Wikipedia, the free online encyclopedia, contains a wealth of intellectually and monetarily free content (in common terminology, “free as in speech and free as in beer”). The sheer number of users editing the corpus means that the majority of the articles are well-written and largely factual. However, the relationship between related articles, usually inferred by the See Also links at the bottom of each article, are generally incomplete compared to the relationships implied by words linked amidst the text of each article. We propose a PHP framework to spider Wikipedia, collecting both full-text word lists and lists containing only the words from the text of internal links. We propose comparing the relative performance of a system that attempts to find similarity metrics between articles based on the full text of each article and one based on only on the linked words in each article. This implementation uses the TF-IDF algorithm to normalize word frequency and the cosine similarity metric to rank article similarity. Please see http://www.cemetech.net/projects/item.php?id=30 for a full description and documentation, including information on installing and using this program.

Screenshots

Screenshot #4515

Archive Contents

Name Size
func_wiki.php 8.1 KB
wiki_nlp.php 6.6 KB
func_misc.php 486 bytes
Download file
File Size
5.3 KB

Metadata

Author
KermMartian
Uploaded
14 years, 5 months ago

Statistics

Rating
No ratings.
Downloads
646
Views
1883

Reviews

Nobody has reviewed this file yet.

Versions

  1. Content Association In Wikipedia (published 14 years, 5 months ago; 2010-05-05 18:17 UTC)

Advertisement