phelps.net

Class RobustHyperlink

public class RobustHyperlink extends Object

Augments a URL with information that can be used to find the URL's content in case the link breaks. See the Robust Home Page.

Strategy: Inverse word frequency: find top n most common words in document that are uncommon in web overall.

  1. count words in page (either from tree or, while testing, HTML text)
  2. look up relative frequency counts from web search engine, caching new ones to disk
  3. pick words that are locally frequent but globally infrequent
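
The three steps above can be sketched as follows. This is a minimal illustration, not the class's actual code: the class name SignatureSketch, the webFreq map (standing in for the search-engine lookup of step 2), and the plain count-to-frequency ratio used as the score are all assumptions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names come from RobustHyperlink itself.
class SignatureSketch {
    /** Step 1: count how often each word occurs in the page. */
    static Map<String, Integer> countWords(List<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) counts.merge(w, 1, Integer::sum);
        return counts;
    }

    /**
     * Step 3: rank words by (count in page) / (web frequency) and keep the top n.
     * webFreq stands in for step 2; a word missing from it is treated as frequency 1.
     * Ties are broken alphabetically so the result is deterministic.
     */
    static List<String> pick(List<String> words, Map<String, Integer> webFreq, int n) {
        Map<String, Integer> counts = countWords(words);
        List<String> candidates = new ArrayList<>(counts.keySet());
        Comparator<String> byScore = Comparator.comparingDouble(
            (String w) -> (double) counts.get(w) / Math.max(1, webFreq.getOrDefault(w, 1)));
        candidates.sort(byScore.reversed().thenComparing(Comparator.naturalOrder()));
        return new ArrayList<>(candidates.subList(0, Math.min(n, candidates.size())));
    }
}
```

With this scoring, a word that appears three times in the page but on a billion web pages ranks below a word that appears once in the page and on a few thousand web pages.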

Version: $Revision: 1.9 $ $Date: 2003/07/04 08:04:35 $

See Also: tool.LexSig, tool.html.Robust

Field Summary
static int ALGORITHM_RANDOM
Picks words randomly.
static int ALGORITHM_RANDOM100K
Picks words randomly from those that appear in fewer than 100,000 web pages.
static int ALGORITHM_RAREST
Rarest picks the words rarest in the web.
static int ALGORITHM_TFIDF
Term frequency-inverse document frequency picks the most frequent words in the document that are the rarest in the web.
static int ALGORITHM_TFIDF2
Refines tfidf by capping page frequency at 3 to bias toward rarity.
static boolean DEBUG
static boolean FoldCase
Ignore case in collecting words?
static int MinWordLength
static String PARAMETER
Canonical definition of parameter used for lexical signatures.
static int SignatureLength
Signature length (in words).
static boolean Verbose
static String VERSION
Method Summary
static String addSignature(URL url, String words)
Add signature words to url.
static String computeSignature(Node root)
Compute signature from document tree.
static String computeSignature(String txt)
Compute signature from parsed txt.
static String computeSignature(List<String> words)
Compute signature from list of words.
static int getFreq(String word)
Determine web page frequency of word.
static String getSignature(String surl)
Return signature as found in string.
static String getSignatureWords(String surl)
Return signature as plain words: no "?lexical-signature=", no meta characters.
static void setAlgorithm(int alg)
Set algorithm to use (N.B.: static).
static void setEngine(String prefix, String hook)
Sets the search engine and key text fragment that signals the start of the web frequency information.
static void setWordCache(File cache)
Client can set the file to use as the user's supplemental word frequency cache.
static String stripSignature(String surl)
Given a URL in String form, return URL with signature, if any, stripped off.
static void writeCache()
Writes user word frequency cache.

Field Detail

ALGORITHM_RANDOM

public static final int ALGORITHM_RANDOM
Picks words randomly.

ALGORITHM_RANDOM100K

public static final int ALGORITHM_RANDOM100K
Picks words randomly from those that appear in fewer than 100,000 web pages.
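
As an illustration of this selection rule, the sketch below filters candidate words to those under the 100,000-page threshold and then picks at random. The class name RandomPickSketch, the webFreq map, and the use of a caller-supplied Random are assumptions made for the example, not details of the class.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch; not RobustHyperlink's implementation.
class RandomPickSketch {
    static final int THRESHOLD = 100_000; // "fewer than 100,000 web pages"

    /** Returns up to n words chosen at random from those with web frequency below THRESHOLD. */
    static List<String> pick(Map<String, Integer> webFreq, int n, Random rng) {
        List<String> rare = new ArrayList<>();
        for (Map.Entry<String, Integer> e : webFreq.entrySet()) {
            if (e.getValue() < THRESHOLD) rare.add(e.getKey());
        }
        Collections.sort(rare);          // fixed starting order, so a seeded Random is reproducible
        Collections.shuffle(rare, rng);
        return new ArrayList<>(rare.subList(0, Math.min(n, rare.size())));
    }
}
```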

ALGORITHM_RAREST

public static final int ALGORITHM_RAREST
Rarest picks the words rarest in the web.

ALGORITHM_TFIDF

public static final int ALGORITHM_TFIDF
Term frequency-inverse document frequency picks the most frequent words in the document that are the rarest in the web.

ALGORITHM_TFIDF2

public static final int ALGORITHM_TFIDF2
Refines tfidf by capping page frequency at 3 to bias toward rarity. Default.
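
The effect of the cap can be shown with a small scoring function. This sketch assumes that "page frequency" means the number of times a word occurs in the page, and it reuses a plain count-to-web-frequency ratio as the score; both are assumptions, and CappedScoreSketch is a hypothetical name.

```java
// Hypothetical sketch of a capped score; the real weighting is not documented here.
class CappedScoreSketch {
    static final int CAP = 3; // the cap stated in the field description

    /** Scores a word from its in-page count (capped at CAP) and its web frequency. */
    static double score(int countInPage, int webFrequency) {
        int capped = Math.min(countInPage, CAP);
        return (double) capped / Math.max(1, webFrequency);
    }
}
```

Under this scoring, a word repeated 50 times in the page scores no higher than one repeated 3 times, so differences in web rarity decide the ranking among frequently repeated words.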

DEBUG

public static boolean DEBUG

FoldCase

public static boolean FoldCase
Ignore case in collecting words?

MinWordLength

public static int MinWordLength

PARAMETER

public static final String PARAMETER
Canonical definition of parameter used for lexical signatures.

SignatureLength

public static int SignatureLength
Signature length (in words).

Verbose

public static boolean Verbose

VERSION

public static final String VERSION

Method Detail

addSignature

public static String addSignature(URL url, String words)
Add signature words to url.
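
A rough sketch of what appending a signature could look like is given below. The "lexical-signature=" marker is taken from the getSignature description; joining words with "+" (via URL encoding), taking the URL as a String, and using "&" when the URL already has a query string are assumptions, and AddSignatureSketch is a hypothetical name.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical sketch; the real addSignature takes a java.net.URL and may differ in detail.
class AddSignatureSketch {
    static final String SIGNATURE_PARAM = "lexical-signature="; // assumed parameter text

    /** Appends space-separated signature words to url as a query parameter. */
    static String addSignature(String url, String words) {
        String encoded;
        try {
            encoded = URLEncoder.encode(words.trim(), "UTF-8"); // spaces become '+'
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
        char sep = url.indexOf('?') < 0 ? '?' : '&'; // assumption: extend an existing query
        return url + sep + SIGNATURE_PARAM + encoded;
    }
}
```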

computeSignature

public static String computeSignature(Node root)
Compute signature from document tree.

computeSignature

public static String computeSignature(String txt)
Compute signature from parsed txt.
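
Before words can be counted, the text has to be split into words. The sketch below shows one way to do that, assuming FoldCase means lower-casing each word and MinWordLength means dropping shorter words; neither behavior is spelled out in this documentation, and TokenizeSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of word collection; not the class's actual tokenizer.
class TokenizeSketch {
    /** Splits txt on runs of non-letters, optionally folds case, and drops short words. */
    static List<String> words(String txt, boolean foldCase, int minWordLength) {
        List<String> out = new ArrayList<>();
        for (String w : txt.split("[^\\p{L}]+")) {
            if (w.isEmpty() || w.length() < minWordLength) continue; // split may yield a leading ""
            out.add(foldCase ? w.toLowerCase(Locale.ROOT) : w);
        }
        return out;
    }
}
```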

computeSignature

public static String computeSignature(List<String> words)
Compute signature from list of words.

getFreq

public static int getFreq(String word)
Determine web page frequency of word. If not in cache, looks up in web search engine.
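
The cache-then-lookup behavior described here can be sketched as below. The in-memory map, the pluggable lookup function standing in for the search-engine query, and the lookups counter are assumptions for illustration; FreqCacheSketch is a hypothetical name.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToIntFunction;

// Hypothetical sketch; the real method is static and consults a disk-backed cache.
class FreqCacheSketch {
    private final Map<String, Integer> cache = new HashMap<>();
    private final ToIntFunction<String> lookup; // stands in for the search-engine query
    int lookups = 0;                            // counts cache misses, for demonstration

    FreqCacheSketch(ToIntFunction<String> lookup) {
        this.lookup = lookup;
    }

    /** Returns the cached frequency, querying the lookup function only on a miss. */
    int getFreq(String word) {
        Integer cached = cache.get(word);
        if (cached != null) return cached;
        lookups++;
        int freq = lookup.applyAsInt(word);
        cache.put(word, freq);
        return freq;
    }
}
```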

getSignature

public static String getSignature(String surl)
Return signature as found in string. Signature is introduced by "lexical-signature=".

getSignatureWords

public static String getSignatureWords(String surl)
Return signature as plain words: no "?lexical-signature=", no meta characters.
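
The difference between getSignature and getSignatureWords can be illustrated with a simplified parser. The sketch assumes the signature runs from "lexical-signature=" to the next "&" or the end of the string and that words are "+"-separated; it does not handle URL fragments. ParseSignatureSketch and its method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical sketch; return values for missing signatures are an assumption.
class ParseSignatureSketch {
    static final String MARKER = "lexical-signature=";

    /** Returns the signature including its marker, or null if surl has none. */
    static String signature(String surl) {
        int start = surl.indexOf(MARKER);
        if (start < 0) return null;
        int end = surl.indexOf('&', start);
        return end < 0 ? surl.substring(start) : surl.substring(start, end);
    }

    /** Returns only the words, decoded and space-separated, or null if surl has no signature. */
    static String signatureWords(String surl) {
        String sig = signature(surl);
        if (sig == null) return null;
        try {
            return URLDecoder.decode(sig.substring(MARKER.length()), "UTF-8").trim();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }
}
```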

setAlgorithm

public static void setAlgorithm(int alg)
Set algorithm to use (N.B.: static).

setEngine

public static void setEngine(String prefix, String hook)
Sets the search engine and key text fragment that signals the start of the web frequency information. Web word frequencies are obtained by screen scraping the results of a search engine. Use this to switch to a search engine other than the default, or to update the URL and text hook when they have changed.

Parameters:
prefix - URL of search submissions, with the query term at the end and left blank
hook - constant words in the HTML results page near the word frequency number
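
To make the roles of prefix and hook concrete, the sketch below builds a query URL and pulls a count out of result text. It assumes the number follows the hook and may contain commas; the example URL, the sample result text, and the name ScrapeSketch are all hypothetical.

```java
// Hypothetical sketch of screen scraping with a prefix and a hook; no network access.
class ScrapeSketch {
    /** Builds the search URL by appending the query word to the prefix. */
    static String queryUrl(String prefix, String word) {
        return prefix + word;
    }

    /** Parses the first number (digits, commas allowed) after hook; returns -1 if none. */
    static int parseCount(String html, String hook) {
        int i = html.indexOf(hook);
        if (i < 0) return -1;
        i += hook.length();
        while (i < html.length() && !Character.isDigit(html.charAt(i))) i++;
        StringBuilder digits = new StringBuilder();
        while (i < html.length()) {
            char c = html.charAt(i++);
            if (Character.isDigit(c)) digits.append(c);
            else if (c != ',') break;
        }
        if (digits.length() == 0) return -1;
        if (digits.length() > 18) return Integer.MAX_VALUE; // too large to parse as a long
        return (int) Math.min(Long.parseLong(digits.toString()), (long) Integer.MAX_VALUE);
    }
}
```

If the search engine changes its result page so that the hook text no longer appears, a parser like this returns -1, which is the situation setEngine is meant to repair.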

setWordCache

public static void setWordCache(File cache)
Client can set the file to use as the user's supplemental word frequency cache. The Multivalent client places this in the user's private cache directory, as public placement can reveal personal information. Defaults to a file named "wordfreq.txt" in the Java temp directory.
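
The default location described above corresponds to the standard java.io.tmpdir system property. The sketch below shows how such a default, and a private alternative, might be computed; the class and method names are hypothetical, and the resulting File would then be passed to setWordCache.

```java
import java.io.File;

// Hypothetical helper; only the file name and the temp-directory default come from the text.
class CacheLocationSketch {
    /** The documented default: "wordfreq.txt" in the Java temp directory. */
    static File defaultCache() {
        return new File(System.getProperty("java.io.tmpdir"), "wordfreq.txt");
    }

    /** A private alternative inside a caller-chosen per-user directory. */
    static File privateCache(File userCacheDir) {
        return new File(userCacheDir, "wordfreq.txt");
    }
}
```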

stripSignature

public static String stripSignature(String surl)
Given a URL in String form, return URL with signature, if any, stripped off.
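
Stripping can be sketched as the inverse of adding. The version below assumes the signature is a query parameter introduced by "?" or "&" and ending at the next "&" or the end of the string; StripSignatureSketch is a hypothetical name and the real method may treat edge cases differently.

```java
// Hypothetical sketch; not the class's actual stripping logic.
class StripSignatureSketch {
    static final String MARKER = "lexical-signature=";

    /** Returns surl without its lexical-signature parameter, or surl unchanged if it has none. */
    static String strip(String surl) {
        int start = surl.indexOf(MARKER);
        if (start <= 0) return surl;
        char before = surl.charAt(start - 1);
        if (before != '?' && before != '&') return surl; // marker is not a query parameter
        String head = surl.substring(0, start - 1);
        int end = surl.indexOf('&', start);
        if (end < 0) return head;                        // signature was the last parameter
        return head + before + surl.substring(end + 1);  // keep the parameters that follow
    }
}
```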

writeCache

public static void writeCache()
Writes user word frequency cache. Should be called before the application quits. The cache may need to be refreshed periodically, since words that become popular are no longer good distinguishers.
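
One way to make sure a cache is written before the application quits is a JVM shutdown hook. The sketch below illustrates the idea on a stand-in cache; the tab-separated file format and the name WriteCacheSketch are assumptions, and in a real client the hook body would simply call writeCache().

```java
import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; the real cache file format is not documented here.
class WriteCacheSketch {
    /** Writes one "word<TAB>count" line per entry, sorted by word. */
    static void write(Map<String, Integer> freqs, File file) throws IOException {
        try (PrintWriter out = new PrintWriter(file, "UTF-8")) {
            for (Map.Entry<String, Integer> e : new TreeMap<>(freqs).entrySet()) {
                out.println(e.getKey() + "\t" + e.getValue());
            }
        }
    }

    /** Registers a shutdown hook that writes the cache when the JVM exits normally. */
    static void writeOnExit(Map<String, Integer> freqs, File file) {
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            try {
                write(freqs, file);
            } catch (IOException e) {
                e.printStackTrace(); // nothing more can be done during shutdown
            }
        }));
    }
}
```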