tool

Class ExtractText

public class ExtractText extends Object

Extract Unicode text from any supported document type: PDF, HTML, man pages, DVI, Plucker, and others. Text can be flowed for fixed-layout, single-page, or multiple-page documents. Use it either as a command-line tool or by calling its methods from another program.

Version: $Revision: 1.11 $ $Date: 2003/08/29 06:07:31 $

See Also: java.text.BreakIterator

Field Summary
static String USAGE
static String VERSION
Constructor Summary
ExtractText()
Method Summary
void defaults()
String extract(URI uri, String mimeType)
Return the text of the document at uri as a String.
void extractFlow(Node top, StringBuffer sb)
Traverse document tree and extract text.
static void extractFlowFixed(Node top, StringBuffer sb)
Extract text in same flow as document tree but track coordinates, which is apt for PDF.
static void extractFlowStruct(Node n, StringBuffer sb)
Extract text by following structure in document tree, which is apt for HTML.
static void extractLayout(Node top, StringBuffer sb)
Preserve layout as much as possible in straight ASCII.
static void main(String[] argv)
void setLayout(boolean b)
void setQuiet(boolean b)
void setRange(String range)
void setVerbose(boolean b)

Field Detail

USAGE

public static final String USAGE

VERSION

public static final String VERSION

Constructor Detail

ExtractText

public ExtractText()

Method Detail

defaults

public void defaults()

extract

public String extract(URI uri, String mimeType)
Return the text of the document at uri as a String.

Parameters:
mimeType - MIME type of document, or null if not known
uri - URI of document, local file or network

Returns: the extracted text, or null if the document is of an unknown type
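
Because extract can return null, callers need a null check. The following is a minimal sketch of one way to do that, not part of this class: the helper textOf and the stand-in extractor are hypothetical names introduced for illustration. With the real class on the classpath, a method reference such as new ExtractText()::extract would presumably fit the same BiFunction shape, assuming the signature documented above.

```java
import java.net.URI;
import java.util.Optional;
import java.util.function.BiFunction;

class ExtractCallSketch {
    /**
     * Wraps an extract-style call (URI, MIME type) -> text so that a null
     * result (unknown document type) surfaces as an empty Optional.
     */
    static Optional<String> textOf(BiFunction<URI, String, String> extract,
                                   URI uri, String mimeType) {
        return Optional.ofNullable(extract.apply(uri, mimeType));
    }

    public static void main(String[] args) {
        // Hypothetical stand-in extractor: pretends only text/html is supported.
        BiFunction<URI, String, String> fake =
            (u, m) -> "text/html".equals(m) ? "stub text for " + u : null;

        URI uri = URI.create("file:///tmp/example.html");
        System.out.println(textOf(fake, uri, "text/html").orElse("(unsupported type)"));
        System.out.println(textOf(fake, uri, null).orElse("(unsupported type)"));
    }
}
```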

extractFlow

public void extractFlow(Node top, StringBuffer sb)
Traverse document tree and extract text.
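
As a rough illustration of the traversal idea, the sketch below walks a tree depth-first and appends leaf text to a StringBuffer. SketchNode is a hypothetical stand-in and not the library's Node class, and separating leaves with a single space is an assumption; the actual method may differ.

```java
import java.util.ArrayList;
import java.util.List;

class FlowSketch {
    /** Hypothetical stand-in for a document tree node; not the library's Node. */
    static class SketchNode {
        final String text;                              // non-null only for leaves
        final List<SketchNode> children = new ArrayList<>();
        SketchNode(String text) { this.text = text; }
        SketchNode add(SketchNode child) { children.add(child); return this; }
    }

    /** Depth-first, left-to-right walk that appends each leaf's text to sb. */
    static void extractFlow(SketchNode top, StringBuffer sb) {
        if (top == null) return;
        if (top.children.isEmpty()) {
            if (top.text != null && !top.text.isEmpty()) {
                if (sb.length() > 0) sb.append(' ');    // assumed separator between leaves
                sb.append(top.text);
            }
            return;
        }
        for (SketchNode child : top.children) extractFlow(child, sb);
    }

    public static void main(String[] args) {
        SketchNode doc = new SketchNode(null)
            .add(new SketchNode(null)
                .add(new SketchNode("Hello"))
                .add(new SketchNode("world")))
            .add(new SketchNode("again"));
        StringBuffer sb = new StringBuffer();
        extractFlow(doc, sb);
        System.out.println(sb);                         // prints "Hello world again"
    }
}
```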

extractFlowFixed

public static void extractFlowFixed(Node top, StringBuffer sb)
Extract text in same flow as document tree but track coordinates, which is apt for PDF.
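
One way to picture coordinate tracking is to keep document order but start a new line whenever the baseline moves. The sketch below assumes that approach; the Word class, the integer baseline, and the TOLERANCE value are all hypothetical and not taken from the library.

```java
import java.util.Arrays;
import java.util.List;

class FixedFlowSketch {
    /** Hypothetical positioned word: text plus the y coordinate of its baseline. */
    static class Word {
        final String text;
        final int baseline;
        Word(String text, int baseline) { this.text = text; this.baseline = baseline; }
    }

    static final int TOLERANCE = 2;   // assumed slack, in the same units as baseline

    /** Keeps document order; emits a newline when the baseline shifts beyond TOLERANCE. */
    static void extractFlowFixed(List<Word> words, StringBuffer sb) {
        Word prev = null;
        for (Word w : words) {
            if (prev != null) {
                boolean sameLine = Math.abs(w.baseline - prev.baseline) <= TOLERANCE;
                sb.append(sameLine ? ' ' : '\n');
            }
            sb.append(w.text);
            prev = w;
        }
    }

    public static void main(String[] args) {
        List<Word> words = Arrays.asList(
            new Word("Fixed", 100), new Word("layout", 100),
            new Word("text", 114), new Word("here", 115));
        StringBuffer sb = new StringBuffer();
        extractFlowFixed(words, sb);
        System.out.println(sb);       // two lines: "Fixed layout" and "text here"
    }
}
```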

extractFlowStruct

public static void extractFlowStruct(Node n, StringBuffer sb)
Extract text by following structure in document tree, which is apt for HTML.
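
Following structure can be sketched as letting element names decide where line breaks go: inline elements run together and block elements end a line. The Elem class and the BLOCK name set below are hypothetical choices for illustration, not the library's Node API or its actual rules.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StructSketch {
    /** Assumed set of block-level element names that end a line. */
    static final Set<String> BLOCK =
        new HashSet<>(Arrays.asList("p", "h1", "h2", "li", "div"));

    /** Hypothetical element: either a text leaf or a named node with children. */
    static class Elem {
        final String name;
        final String text;
        final List<Elem> kids = new ArrayList<>();
        private Elem(String name, String text) { this.name = name; this.text = text; }
        static Elem text(String s) { return new Elem(null, s); }
        static Elem of(String name, Elem... kids) {
            Elem e = new Elem(name, null);
            e.kids.addAll(Arrays.asList(kids));
            return e;
        }
    }

    /** Appends text in tree order; closes each block element with one newline. */
    static void extractFlowStruct(Elem n, StringBuffer sb) {
        if (n.text != null) { sb.append(n.text); return; }
        for (Elem k : n.kids) extractFlowStruct(k, sb);
        boolean endsWithNewline = sb.length() > 0 && sb.charAt(sb.length() - 1) == '\n';
        if (BLOCK.contains(n.name) && sb.length() > 0 && !endsWithNewline) sb.append('\n');
    }

    public static void main(String[] args) {
        Elem body = Elem.of("body",
            Elem.of("h1", Elem.text("Title")),
            Elem.of("p", Elem.text("Some "), Elem.of("b", Elem.text("bold")), Elem.text(" text")));
        StringBuffer sb = new StringBuffer();
        extractFlowStruct(body, sb);
        System.out.print(sb);         // "Title" then "Some bold text", one per line
    }
}
```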

extractLayout

public static void extractLayout(Node top, StringBuffer sb)
Preserve layout as much as possible in straight ASCII. Currently available for PDF and DVI.
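
Layout preservation in plain ASCII can be approximated by placing each text run on a character grid and padding with spaces. The sketch below assumes runs already carry a column and row in character cells; the Run class and the layout helper are hypothetical, and converting real page coordinates to cells is left out.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class LayoutSketch {
    /** Hypothetical positioned text run; col and row are non-negative character cells. */
    static class Run {
        final int col;
        final int row;
        final String text;
        Run(int col, int row, String text) { this.col = col; this.row = row; this.text = text; }
    }

    /** Places runs on a character grid, padding with spaces and keeping blank rows. */
    static String layout(List<Run> runs) {
        TreeMap<Integer, StringBuilder> rows = new TreeMap<>();
        for (Run r : runs) {
            StringBuilder line = rows.computeIfAbsent(r.row, k -> new StringBuilder());
            while (line.length() < r.col) line.append(' ');          // pad out to the run's column
            for (int i = 0; i < r.text.length(); i++) {
                int pos = r.col + i;
                if (pos < line.length()) line.setCharAt(pos, r.text.charAt(i)); // overlapping runs overwrite
                else line.append(r.text.charAt(i));
            }
        }
        if (rows.isEmpty()) return "";
        StringBuilder out = new StringBuilder();
        int next = rows.firstKey();
        for (Map.Entry<Integer, StringBuilder> e : rows.entrySet()) {
            while (next < e.getKey()) { out.append('\n'); next++; }  // emit skipped rows as blank lines
            out.append(e.getValue()).append('\n');
            next = e.getKey() + 1;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Run> runs = Arrays.asList(
            new Run(0, 0, "Name"), new Run(10, 0, "Qty"),
            new Run(0, 2, "Pen"), new Run(10, 2, "3"));
        System.out.print(layout(runs));
    }
}
```

In this sketch the two columns stay aligned at character positions 0 and 10, and the empty row between them is kept as a blank line.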

main

public static void main(String[] argv)

setLayout

public void setLayout(boolean b)

setQuiet

public void setQuiet(boolean b)

setRange

public void setRange(String range)

setVerbose

public void setVerbose(boolean b)