Multivalent API

phelps.net
Class RobustHyperlink

java.lang.Object
  extended by phelps.net.RobustHyperlink

public class RobustHyperlink
extends java.lang.Object

Augment URL with information that can be used to find content of URL in case link breaks. See the Robust Home Page.

Strategy: inverse word frequency. Find the top n words that are most common in the document but uncommon on the web overall.

  1. count words in page (either from tree or, while testing, HTML text)
  2. look up relative frequency counts from web search engine, caching new ones to disk
  3. pick words that are locally frequent but globally infrequent
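
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the class's actual implementation: the class name LexSigSketch, the pick method, the WEB_PAGES constant, and the count * log(N / webFrequency) score are all assumptions, and web frequencies are passed in as a map instead of being fetched from a search engine.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of "locally frequent, globally infrequent" word selection.
class LexSigSketch {
    // Assumed size of the web, used only to form the inverse-frequency term.
    static final double WEB_PAGES = 1e9;

    static List<String> pick(List<String> words, Map<String, Integer> webFreq, int n) {
        // Step 1: count words in the page (case folded, as the FoldCase option suggests).
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            counts.merge(w.toLowerCase(), 1, Integer::sum);
        }

        // Step 2: combine the local count with the (already looked up) web frequency.
        Map<String, Double> score = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int pages = Math.max(1, webFreq.getOrDefault(e.getKey(), 1));
            score.put(e.getKey(), e.getValue() * Math.log(WEB_PAGES / pages));
        }

        // Step 3: keep the n best-scoring words; ties are broken alphabetically.
        List<String> ranked = new ArrayList<>(score.keySet());
        ranked.sort((a, b) -> {
            int c = Double.compare(score.get(b), score.get(a));
            return c != 0 ? c : a.compareTo(b);
        });
        return new ArrayList<>(ranked.subList(0, Math.min(n, ranked.size())));
    }
}
```

With the words "the" (3 occurrences, 900,000,000 pages), "multivalent" (2 occurrences, 5,000 pages), and "hyperlink" (1 occurrence, 2,000,000 pages), asking for two words yields "multivalent" and "hyperlink": the very common word scores lowest despite appearing most often in the page.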

Version:
$Revision: 1.9 $ $Date: 2003/07/04 08:04:35 $
See Also:
tool.LexSig, tool.html.Robust

Field Summary
static int ALGORITHM_RANDOM
          Picks words randomly.
static int ALGORITHM_RANDOM100K
          Picks words randomly from those that appear in fewer than 100,000 web pages.
static int ALGORITHM_RAREST
          Rarest picks the words rarest in the web.
static int ALGORITHM_TFIDF
          Term frequency-inverse document frequency picks the most frequent words in the document that are the rarest in the web.
static int ALGORITHM_TFIDF2
          Refines tfidf by capping page frequency at 3 to bias toward rarity.
static boolean DEBUG
           
static boolean FoldCase
          Ignore case in collecting words?
static int MinWordLength
           
static java.lang.String PARAMETER
          Canonical definition of parameter used for lexical signatures.
static int SignatureLength
          Signature length (in words).
static boolean Verbose
           
static java.lang.String VERSION
           
 
Method Summary
static java.lang.String addSignature(java.net.URL url, java.lang.String words)
          Add signature words to url.
static java.lang.String computeSignature(java.util.List<java.lang.String> words)
          Compute signature from list of words.
static java.lang.String computeSignature(Node root)
          Compute signature from document tree.
static java.lang.String computeSignature(java.lang.String txt)
          Compute signature from parsed txt.
static int getFreq(java.lang.String word)
          Determine web page frequency of word.
static java.lang.String getSignature(java.lang.String surl)
          Return signature as found in string.
static java.lang.String getSignatureWords(java.lang.String surl)
          Return signature as plain words: no "?lexical-signature=", no meta characters.
static void setAlgorithm(int alg)
          Set algorithm to use (N.B.: static).
static void setEngine(java.lang.String prefix, java.lang.String hook)
          Sets the search engine and key text fragment that signals the start of the web frequency information.
static void setWordCache(java.io.File cache)
          Client can set the file to use as the user's supplemental word frequency cache.
static java.lang.String stripSignature(java.lang.String surl)
          Given a URL in String form, return URL with signature, if any, stripped off.
static void writeCache()
          Writes user word frequency cache.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DEBUG

public static boolean DEBUG

VERSION

public static final java.lang.String VERSION
See Also:
Constant Field Values

PARAMETER

public static final java.lang.String PARAMETER
Canonical definition of parameter used for lexical signatures.

See Also:
Constant Field Values

ALGORITHM_TFIDF

public static final int ALGORITHM_TFIDF
Term frequency-inverse document frequency picks the most frequent words in the document that are the rarest in the web.

See Also:
Constant Field Values

ALGORITHM_TFIDF2

public static final int ALGORITHM_TFIDF2
Refines tfidf by capping page frequency at 3 to bias toward rarity. Default.

See Also:
Constant Field Values
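
To illustrate the difference between ALGORITHM_TFIDF and ALGORITHM_TFIDF2, here is a small sketch of a capped score. It assumes that "page frequency" means the number of times a word occurs in the page and that the score is count * log(N / webFrequency); neither the formula nor the names CappedTfIdfSketch and score come from this class.

```java
// Hypothetical sketch: capping the in-page count so that rarity dominates the score.
class CappedTfIdfSketch {
    /**
     * @param countInPage occurrences of the word in the page
     * @param webPages    number of web pages containing the word
     * @param totalPages  assumed total number of web pages
     * @param cap         maximum in-page count that is credited (3 for the capped variant)
     */
    static double score(int countInPage, int webPages, double totalPages, int cap) {
        int credited = Math.min(countInPage, cap);
        return credited * Math.log(totalPages / Math.max(1, webPages));
    }
}
```

Under these assumptions, a common word that occurs 50 times in the page no longer outranks a rare word that occurs twice: with the cap at 3 the common word is credited with only 3 occurrences, so the inverse-frequency term decides the ranking.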

ALGORITHM_RAREST

public static final int ALGORITHM_RAREST
Rarest picks the words rarest in the web.

See Also:
Constant Field Values

ALGORITHM_RANDOM

public static final int ALGORITHM_RANDOM
Picks words randomly.

See Also:
Constant Field Values

ALGORITHM_RANDOM100K

public static final int ALGORITHM_RANDOM100K
Picks words randomly from those that appear in fewer than 100,000 web pages.

See Also:
Constant Field Values

Verbose

public static boolean Verbose

FoldCase

public static boolean FoldCase
Ignore case in collecting words?


MinWordLength

public static int MinWordLength

SignatureLength

public static int SignatureLength
Signature length (in words).

Method Detail

setWordCache

public static void setWordCache(java.io.File cache)
Client can set the file to use as the user's supplemental word frequency cache. The Multivalent client places this in the user's private cache directory, as public placement can reveal personal information. Defaults to a file named "wordfreq.txt" in the Java temp directory.


setEngine

public static void setEngine(java.lang.String prefix,
                             java.lang.String hook)
Sets the search engine and key text fragment that signals the start of the web frequency information. Web word frequencies are obtained by screen scraping the results of a search engine. Use this to switch to a search engine other than the default, or to update the URL and text hook if either has changed.

Parameters:
prefix - URL of search submissions with the query term at the end and left blank
hook - constant words in the HTML page results near the word frequency number
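
As an illustration of how a text hook can locate a frequency count in scraped HTML, the sketch below finds the hook and parses the first number that follows it. The class and method names are hypothetical, and the assumption that the number comes after the hook (not before it) and may contain comma separators is not confirmed by this documentation.

```java
// Hypothetical sketch of hook-based screen scraping; not the class's actual parser.
class HookScrapeSketch {
    /** Returns the first number after the hook, or -1 if the hook or a number is missing. */
    static int parseCount(String html, String hook) {
        int at = html.indexOf(hook);
        if (at < 0) {
            return -1;
        }
        // Skip any non-digit characters between the hook and the number.
        int i = at + hook.length();
        while (i < html.length() && !Character.isDigit(html.charAt(i))) {
            i++;
        }
        // Collect digits, ignoring thousands separators such as "12,300".
        StringBuilder digits = new StringBuilder();
        while (i < html.length()
                && (Character.isDigit(html.charAt(i)) || html.charAt(i) == ',')) {
            if (html.charAt(i) != ',') {
                digits.append(html.charAt(i));
            }
            i++;
        }
        if (digits.length() == 0) {
            return -1;
        }
        // Clamp to int range, since getFreq reports an int.
        long n = Long.parseLong(digits.toString());
        return (int) Math.min(n, Integer.MAX_VALUE);
    }
}
```

Because the hook is plain text in someone else's result page, it breaks whenever the search engine changes its layout, which is why setEngine lets the client replace it.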

setAlgorithm

public static void setAlgorithm(int alg)
Set algorithm to use (N.B.: static).


addSignature

public static java.lang.String addSignature(java.net.URL url,
                                            java.lang.String words)
Add signature words to url.


stripSignature

public static java.lang.String stripSignature(java.lang.String surl)
Given a URL in String form, return URL with signature, if any, stripped off.


getSignature

public static java.lang.String getSignature(java.lang.String surl)
Return signature as found in string. Signature is introduced by "lexical-signature=".


getSignatureWords

public static java.lang.String getSignatureWords(java.lang.String surl)
Return signature as plain words: no "?lexical-signature=", no meta characters.
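
The relationship between addSignature, getSignature, getSignatureWords, and stripSignature can be sketched with plain string handling. This is an assumption-laden illustration: the class SignatureUrlSketch and its methods are hypothetical, the parameter text "lexical-signature=" is taken from the getSignature description, and joining words with "+" and treating the signature as the last parameter are guesses about the format.

```java
// Hypothetical sketch of signature handling on URL strings.
class SignatureUrlSketch {
    // Assumed to match the PARAMETER constant described above.
    static final String PARAMETER = "lexical-signature=";

    /** Appends the signature words as a query parameter, joined by '+'. */
    static String add(String url, String words) {
        String sep = url.contains("?") ? "&" : "?";
        return url + sep + PARAMETER + words.trim().replaceAll("\\s+", "+");
    }

    /** Returns the signature including the parameter name, or null if absent. */
    static String signature(String surl) {
        int start = surl.indexOf(PARAMETER);
        if (start < 0) {
            return null;
        }
        int end = surl.indexOf('&', start);
        return end < 0 ? surl.substring(start) : surl.substring(start, end);
    }

    /** Returns the signature as space-separated words, or null if absent. */
    static String words(String surl) {
        String sig = signature(surl);
        return sig == null ? null : sig.substring(PARAMETER.length()).replace('+', ' ');
    }

    /** Removes the signature, assuming it is the last parameter in the URL. */
    static String strip(String surl) {
        int start = surl.indexOf(PARAMETER);
        if (start < 0) {
            return surl;
        }
        // Also drop the '?' or '&' that introduces the parameter.
        return surl.substring(0, Math.max(0, start - 1));
    }
}
```

In this sketch, adding the words "multivalent robust hyperlink" to http://example.com/page.html gives http://example.com/page.html?lexical-signature=multivalent+robust+hyperlink, and stripping that result returns the original URL.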


writeCache

public static void writeCache()
Writes user word frequency cache. Should be called before the application quits. The cache may need to be refreshed periodically, since words that become popular are no longer good distinguishers.


getFreq

public static int getFreq(java.lang.String word)
Determine web page frequency of word. If not in cache, looks up in web search engine.
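
The interplay of setWordCache, getFreq, and writeCache can be pictured as a read-through cache backed by a file. The sketch below is only an approximation under stated assumptions: the class WordFreqCacheSketch is hypothetical, the one-entry-per-line "word TAB count" file format is invented for the example, and the search engine lookup is replaced by a caller-supplied function.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.ToIntFunction;

// Hypothetical sketch of a file-backed word frequency cache.
class WordFreqCacheSketch {
    private final Map<String, Integer> freq = new TreeMap<>();
    private final File file;
    private final ToIntFunction<String> lookup; // stands in for the search engine query

    WordFreqCacheSketch(File file, ToIntFunction<String> lookup) {
        this.file = file;
        this.lookup = lookup;
        if (!file.exists()) {
            return;
        }
        try {
            // Assumed format: one "word<TAB>count" entry per line.
            for (String line : Files.readAllLines(file.toPath(), StandardCharsets.UTF_8)) {
                String[] parts = line.split("\t");
                if (parts.length == 2) {
                    freq.put(parts[0], Integer.parseInt(parts[1].trim()));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the cached frequency, consulting the lookup only on a cache miss. */
    int getFreq(String word) {
        Integer cached = freq.get(word);
        if (cached == null) {
            cached = lookup.applyAsInt(word);
            freq.put(word, cached);
        }
        return cached;
    }

    /** Persists all known frequencies; intended to be called before the application quits. */
    void writeCache() {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            lines.add(e.getKey() + "\t" + e.getValue());
        }
        try {
            Files.write(file.toPath(), lines, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A cache like this never expires entries on its own, so refreshing it periodically (for example by deleting the file) would be needed to pick up words whose web frequency has changed.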


computeSignature

public static java.lang.String computeSignature(Node root)
Compute signature from document tree.


computeSignature

public static java.lang.String computeSignature(java.lang.String txt)
Compute signature from parsed txt.


computeSignature

public static java.lang.String computeSignature(java.util.List<java.lang.String> words)
Compute signature from list of words.

