multivalent.std.adaptor
Class HTML
public class HTML extends ML
Media adaptor for HTML (.html => document tree).
Also see HTML Performance Notes.
Does not yet do quite as much as Mozilla's browser component, but is an order of magnitude smaller.
HTML fixups:
- robust parsing (generic SGML error recovery from an HTML DTD is not good enough)
- HEAD tags (TITLE, META, LINK, STYLE, ...) moved under HEAD as necessary; likewise, BODY tags under BODY, NOFRAMES under FRAMESET, ...
- cauterizes runaway A tags
- tags balanced and well nested (except for spans)
- sometimes-heroic tag-specific corrections
- attributes
- attribute names normalized to all lowercase, according to XHTML
- illegal values removed (e.g., rowspan=5 in last row), attributes set to defaults removed (e.g., rowspan=1)
- xxx
- tags and attributes updated to HTML 4.0 where possible (e.g., DIR,MENU=>UL, TBODY added to TABLE, CENTER=>DIV align="center")
- image WIDTH,HEIGHT set to image dimensions (perhaps instantiated or fixed)
- future
- optionally, non-HTML 4.0 attributes removed
- define style sheet in CSS (use existing html.css?), make one static data structure copy, clone and hack for individual page style sheets as modified by CSS
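Two of the attribute fixups listed above (lowercasing names, dropping attributes set to their defaults) can be sketched as follows. This is a minimal illustration, not the adaptor's code: the class AttributeFixupSketch, the method normalize, and the small DEFAULTS table are assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Multivalent: illustrates two attribute fixups.
class AttributeFixupSketch {
    // Assumed defaults, for illustration only; the adaptor's real table is not shown here.
    private static final Map<String, String> DEFAULTS = Map.of("rowspan", "1", "colspan", "1");

    // Returns a copy with lowercase names and without attributes set to their default.
    static Map<String, String> normalize(Map<String, String> attrs) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : attrs.entrySet()) {
            String name = e.getKey().toLowerCase(Locale.ROOT);
            String value = e.getValue();
            if (value != null && value.equals(DEFAULTS.get(name))) {
                continue; // e.g. rowspan=1 carries no information
            }
            out.put(name, value);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> in = new LinkedHashMap<>();
        in.put("ROWSPAN", "1");
        in.put("Align", "center");
        System.out.println(normalize(in)); // prints {align=center}
    }
}
```

Context-dependent fixups, such as removing rowspan=5 in the last row, need the surrounding table and are not covered by this sketch.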
Parts of HTML 4.01 not yet supported:
- bidirectional text (DIR attribute, BDO tag)
- COLGROUP and COL styles
- new FORM controls
To use as a parser, see the instructions in superclass MediaAdaptor.
Tree looks like this:
- structural HTML tags as Node
- spans as Spans, which can cross structure
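The tree shape described above, where structural tags form a tree and spans are anchored at two leaves that may sit under different parents, can be illustrated with a small stand-alone model. The classes TreeNode and RangeSpan below are hypothetical stand-ins, not Multivalent's Node and Span, and the input it models is the badly nested fragment <p>a <b>b</p><p>c</b> d</p>.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model, not Multivalent's classes: structure as a tree, spans as leaf ranges.
class SpanTreeSketch {
    static class TreeNode {
        final String name;
        final TreeNode parent;
        final List<TreeNode> children = new ArrayList<>();

        TreeNode(String name, TreeNode parent) {
            this.name = name;
            this.parent = parent;
            if (parent != null) parent.children.add(this);
        }
    }

    // A span is not owned by one parent; it only records where it starts and ends.
    static class RangeSpan {
        final String name;
        final TreeNode start;
        final TreeNode end;

        RangeSpan(String name, TreeNode start, TreeNode end) {
            this.name = name;
            this.start = start;
            this.end = end;
        }
    }

    // True if the span's endpoints live under different structural parents.
    static boolean crossesStructure(RangeSpan s) {
        return s.start.parent != s.end.parent;
    }

    public static void main(String[] args) {
        TreeNode body = new TreeNode("body", null);
        TreeNode p1 = new TreeNode("p", body);
        TreeNode p2 = new TreeNode("p", body);
        new TreeNode("a", p1);
        TreeNode leafB = new TreeNode("b", p1);
        TreeNode leafC = new TreeNode("c", p2);
        new TreeNode("d", p2);
        RangeSpan bold = new RangeSpan("b", leafB, leafC);
        System.out.println(crossesStructure(bold)); // prints true
    }
}
```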
Version: $Revision: 1.24 $ $Date: 2003/06/01 07:28:50 $
Method Summary |
void | destroy() |
protected void | eatSpace() Overrides because "In HTML, only the following characters are defined as white space characters: ASCII space (&#x0020;), ASCII tab (&#x0009;), ASCII form feed (&#x000C;), Zero-width space (&#x200B;)" ... |
static byte | getAlign(String spec) |
static String | getEntity(char unicode) Return entity corresponding to given Unicode character, if any. |
static int | getParseType(String tag) Less efficient than HTML. |
static int | getParseType(int tagid) Parse type is TAGTYPE_EMPTY, TAGTYPE_SPAN, TAGTYPE_NEST, TAGTYPE_NONEST, or TAGTYPE_UNKNOWN if tag is unknown. |
static char | getUnicode(String entity) Return Unicode character corresponding to given HTML entity reference. |
static byte | getVAlign(String spec) |
static void | go(Node startn, Object replace, Object ouri) TARGET-aware hyperlink. |
Object | parse(INode parent) Normalizes in direction of XHTML: lowercase tag and attribute names, well nested (except for spans), ... |
boolean | semanticEventAfter(SemanticEvent se, String msg) Form processing. |
boolean | semanticEventBefore(SemanticEvent se, String msg) Adds LINKs to Go menu and document popup. |
public static final FileFilter FILTER
public static final String MSG_FORM_POPULATE
Set values of HTML FORM.
"populateForm": arg= java.util.Map attributes, in= java.util.Map name-value pairs.
public static final String MSG_FORM_PROCESS
Give chance for client-side processing by another behavior before sending to server.
"processForm": arg= java.util.Map attributes, in= INode root of tree, out=unused.
public static final String MSG_FORM_RESET
Reset settings of HTML FORM.
"resetForm": arg= Node top-of-form,
public static final String MSG_FORM_SUBMIT
Submit HTML FORM to server.
"submitForm": arg= Node top-of-form,
public static final int TAGTYPE_EMPTY
public static final int TAGTYPE_NEST
public static final int TAGTYPE_NONEST
public static final int TAGTYPE_SPAN
public static final int TAGTYPE_UNKNOWN
public int[] TagUse
Number of times open-tag of given id is used in document.
public HTML()
public void destroy()
protected void eatSpace()
Overrides because "In HTML, only the following characters are defined as white space characters:
- ASCII space (&#x0020;)
- ASCII tab (&#x0009;)
- ASCII form feed (&#x000C;)
- Zero-width space (&#x200B;)"
...
"All line breaks constitute white space."
public static byte getAlign(String spec)
public static String getEntity(char unicode)
Return entity corresponding to given Unicode character, if any. If no such entity, return null.
public static int getParseType(String tag)
Less efficient than HTML.
public static int getParseType(int tagid)
Parse type is TAGTYPE_EMPTY, TAGTYPE_SPAN, TAGTYPE_NEST, TAGTYPE_NONEST, or TAGTYPE_UNKNOWN if tag is unknown.
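One plausible reason the String overload costs more than the int overload is that a name must first be resolved to a tag id before the type can be read from a table. The sketch below illustrates that idea only. The constant names mirror the documented fields, but their numeric values, the four-tag table, and the classification of each tag are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch, not the adaptor's code: name lookup layered over id lookup.
class ParseTypeSketch {
    // Names follow the documented constants; the values are invented for this sketch.
    static final int TAGTYPE_UNKNOWN = -1;
    static final int TAGTYPE_EMPTY = 0;
    static final int TAGTYPE_SPAN = 1;
    static final int TAGTYPE_NEST = 2;
    static final int TAGTYPE_NONEST = 3;

    // Assumed classification of a few tags; the array index serves as the tag id.
    private static final String[] TAGS = {"br", "b", "div", "p"};
    private static final int[] TYPES = {TAGTYPE_EMPTY, TAGTYPE_SPAN, TAGTYPE_NEST, TAGTYPE_NONEST};
    private static final Map<String, Integer> IDS = new HashMap<>();

    static {
        for (int i = 0; i < TAGS.length; i++) IDS.put(TAGS[i], i);
    }

    // Cheap path: a bounds check and an array read.
    static int getParseType(int tagid) {
        return (tagid >= 0 && tagid < TYPES.length) ? TYPES[tagid] : TAGTYPE_UNKNOWN;
    }

    // Costlier path: case folding and a hash lookup before the array read.
    static int getParseType(String tag) {
        Integer id = (tag == null) ? null : IDS.get(tag.toLowerCase(Locale.ROOT));
        return (id == null) ? TAGTYPE_UNKNOWN : getParseType(id.intValue());
    }

    public static void main(String[] args) {
        System.out.println(getParseType("BR") == TAGTYPE_EMPTY);       // prints true
        System.out.println(getParseType("blink") == TAGTYPE_UNKNOWN); // prints true
    }
}
```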
public static char getUnicode(String entity)
Return Unicode character corresponding to given HTML entity reference. If no such character, return '\0'.
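The return conventions of getEntity and getUnicode (null when no entity exists, '\0' when no character exists) can be illustrated with a small two-way table. This is a sketch under assumptions: the class EntityTableSketch and its six entries are invented for the example, and it takes the bare entity name ("amp", not "&amp;"), which the documentation above does not specify.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical two-way entity table, not the adaptor's code.
class EntityTableSketch {
    private static final Map<String, Character> TO_CHAR = new HashMap<>();
    private static final Map<Character, String> TO_NAME = new HashMap<>();

    private static void add(String name, char c) {
        TO_CHAR.put(name, c);
        TO_NAME.put(c, name);
    }

    static {
        // A handful of HTML 4 entities; the full set is much larger.
        add("amp", '&');
        add("lt", '<');
        add("gt", '>');
        add("quot", '"');
        add("nbsp", (char) 0x00A0);
        add("copy", (char) 0x00A9);
    }

    // Entity name for a character, or null if there is none.
    static String getEntity(char unicode) {
        return TO_NAME.get(unicode);
    }

    // Character for an entity name, or '\0' if there is none.
    static char getUnicode(String entity) {
        Character c = TO_CHAR.get(entity);
        return (c == null) ? '\0' : c.charValue();
    }

    public static void main(String[] args) {
        System.out.println(getEntity('&'));              // prints amp
        System.out.println(getEntity('x'));              // prints null
        System.out.println(getUnicode("nosuch") == '\0'); // prints true
    }
}
```

Callers have to check for both sentinels, since null and '\0' are easy to pass along unnoticed.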
public static byte getVAlign(String spec)
public static void go(Node startn, Object replace, Object ouri)
TARGET-aware hyperlink. Shared by A HREF and IMG MAP.
public Object parse(INode parent)
Normalizes in direction of XHTML: lowercase tag and attribute names, well nested (except for spans), ...
Within generated tree, all tags (GIs) are interned.
This fact is exploited while parsing, but not afterward (when gleaning FORM, say), because other behaviors could have modified the tree without taking care to intern tag names (or to always use literal Strings).
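The interning caveat comes down to how Java compares strings. Identity comparison (==) is safe only when both operands are interned, while equals() works either way. The class and method names in this sketch are illustrative and not taken from the adaptor.

```java
// Illustrative only: why == on tag names requires interned strings.
class InternedTagSketch {
    // Fast path: correct only if both strings are interned (or the same literal).
    static boolean sameTagFast(String a, String b) {
        return a == b;
    }

    // Safe path: correct for any strings, at the cost of a character comparison.
    static boolean sameTagSafe(String a, String b) {
        return a != null && a.equals(b);
    }

    public static void main(String[] args) {
        // Built at run time, so it is a distinct object from the literal "table".
        String hacked = new StringBuilder("ta").append("ble").toString();

        System.out.println(sameTagFast(hacked, "table"));          // prints false
        System.out.println(sameTagFast(hacked.intern(), "table")); // prints true
        System.out.println(sameTagSafe(hacked, "table"));          // prints true
    }
}
```

Code that runs after other behaviors may have edited the tree therefore has to use the safe comparison.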
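public boolean semanticEventAfter(SemanticEvent se, String msg)
Form processing.
Later, submittedForm so can intercept for client-side forms processing.
public boolean semanticEventBefore(SemanticEvent se, String msg)
Adds LINKs to Go menu and document popup.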