multivalent.std.adaptor.pdf

Class PDFReader

public class PDFReader extends COSSource

Parse Adobe's Portable Document Format (PDF) and construct low-level objects (COS in Adobe terminology: string, number, dictionary) and high-level Java objects (Font, Image). Based on Adobe's PDF 1.5 Reference, available online. This class provides easy access to all parts of a PDF for developers familiar with the PDF Reference. At this time PDF files of up to 4GB in length are supported.

How to use this class

A PDF file is a series of numbered "COS" objects (strings, integers, dictionaries, streams, references to other objects, ...) that can be interpreted as images, streams of page-building commands, annotations, and so on, as described in the PDF Reference. To access these objects, first create an instance on a .pdf file with a constructor. Now all the PDF's objects are available by number via getObject(int) from 1 to getObjCnt. If an object refers to another by an indirect reference (IRef), getObject will follow the reference to the actual object. PDF objects are represented with basic Java data types, e.g., PDF dictionaries as Java Maps, with the complete correspondence given by CLASS_* constants. The data for images and page contents are kept in streams, readable uncompressed and decrypted with getInputStream.

At a higher level, you can ask this API for pages and to transform them into Java images, fonts, colorspaces, and so on. Get a particular page's dictionary by number from 1 (not 0) to getPageCnt, inclusive, with getPage. Pages' content streams that describe page appearance can be parsed into individual commands with readCommand. High-level versions of images, colorspaces, and fonts are available by passing the PDF object to the appropriate method.
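
The 1-based page numbering described above can be sketched as follows. PageSource is a hypothetical stand-in that mirrors only the documented getPageCnt and getPage signatures, with a plain Map in place of Dict, so the loop runs without the library; it is not the class itself.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PageLoopSketch {
    /** Hypothetical stand-in mirroring getPageCnt() and getPage(int). */
    interface PageSource {
        int getPageCnt();
        Map<String, Object> getPage(int pagenum);
    }

    /** Visits pages 1..getPageCnt inclusive and collects each page's Type value. */
    static List<Object> pageTypes(PageSource src) {
        List<Object> types = new ArrayList<>();
        for (int i = 1, n = src.getPageCnt(); i <= n; i++) {  // 1-based, inclusive
            types.add(src.getPage(i).get("Type"));
        }
        return types;
    }

    /** In-memory fake with the given number of pages, for demonstration only. */
    static PageSource fake(final int count) {
        return new PageSource() {
            public int getPageCnt() { return count; }
            public Map<String, Object> getPage(int pagenum) {
                if (pagenum < 1 || pagenum > count) {
                    throw new IndexOutOfBoundsException("page " + pagenum);
                }
                Map<String, Object> page = new HashMap<>();
                page.put("Type", "Page");
                return page;
            }
        };
    }
}
```

Starting the loop at 0 would fail in this fake, which is the mistake the "1 (not 0)" note warns about.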

Version: $Revision: 1.84 $ $Date: 2003/08/29 03:26:55 $

Constructor Summary
PDFReader(File file)
Construct an instance corresponding to the .pdf file.
PDFReader(RandomAccess raf)
Read from a special file like a ByteArrayRAF.
Method Summary
void close()
Close use and free up resources.
void eatSpace(RandomAccess raf)
Eat whitespace between tokens in COS object.
void eatSpace(InputStreamComposite in)
Eat whitespace between tokens in content stream.
void fault()
Faults into cache all objects reachable in document (starting from trailer), and sets unreachable objects (that have not been previously read by the caller) to OBJECT_NULL.
Object findNameTree(Dict root, StringBuffer name)
Find name in name tree rooted at root and return its associated value.
Object findNumberTree(Dict root, int number)
Find number in number tree rooted at root and return its associated value.
Dict getCatalog()
Returns Document catalog.
ColorSpace getColorSpace(Object csref, Dict csres, Dict patres)
ColorSpaces with cacheing.
Encrypt getEncrypt()
Returns document-wide encryption manager.
File getFile()
Returns associated PDF java.io.File.
InputStream getFileInputStream(Object spec)
Given a PDF external file specification, which can be a local file or network URI, returns a stream of data.
URI getFileSpecification(Object spec)
Converts simple or full file specification into a platform-independent URI.
FontPDF getFont(Dict fd, double pointsize, float size, Dict page, AffineTransform Tm, PDF pdf)
Cacheing front end to FontPDF.
BufferedImage getImage(IRef imgdictref, AffineTransform ctm, Color fillcolor)
Images with cacheing (under key PDFReader).
Dict getInfo()
Returns /Info dictionary from trailer.
InputStreamComposite getInputStream(Object o, boolean iscontent)
Given indirect reference to stream dictionary or array of such references, returns stream of uncompressed and decrypted data.
InputStreamComposite getInputStream(Object o)
Same as getInputStream(Object, boolean), assuming not a content stream.
int getLinearized()
If document is linearized, returns integer > 0 that is object number of linearization dictionary.
int getMajorVersion()
String getMetadata(Object o)
Returns metadata associated with object, or a 0-length String if none.
int getMinorVersion()
int getObjCnt()
Returns number of objects, numbered from 0.
Object getObject(Object ref)
Returns referenced object, following any indirect references to concrete objects.
Object getObject(int num)
Returns object from xref table offset at point num, from 0 to getObjCnt, taking from cache if available.
int getObjGen(int objnum)
Returns object's generation number.
long getObjOff(int objnum)
Returns object's byte offset in file.
byte getObjType(int objnum)
Returns object's type, which is one of XREF_FREE, XREF_NORMAL, or XREF_OBJSTMC.
Dict getPage(int pagenum)
Given page number, finds the corresponding page dictionary.
int getPageCnt()
Returns number of pages in document.
int getPageNum(Dict page)
Reverse of getPage.
IRef getPageRef(int pagenum)
Given page number, finds the corresponding page object.
RandomAccess getRAF()
Returns associated RandomAccess.
static Double getReal(double val)
long getStartXRef()
File offset of (last) trailer, which is needed for incremental updates.
byte[] getStreamData(Object ref, boolean fraw, boolean fcache)
Returns entire content of input stream.
Dict getTrailer()
Document trailer.
boolean isAuthorized()
boolean isModified()
Modified, perhaps because repaired or annotated.
boolean isRepaired()
Cmd readCommand(InputStreamComposite in)
Parse next command from content stream, or return null if no more.
Cmd[] readCommandArray(Object contentstream)
Parses content stream into array of commands.
Dict readInlineImage(InputStreamComposite in)
Parse inline image from stream into a dictionary with its attributes and the data in a CLASS_DATA under key STREAM_DATA.
int readInt(RandomAccess raf)
Read positive integer from file.
Object readObject()
Returns next COS object from current file position, which may not be a top-level object starting with m n obj.
Object readObject(InputStreamComposite in)
Reads a complete object from a content stream: int, string, dictionary, map, ..., including all subparts.
Dict readXref(boolean all)
Reads cross-reference table and returns its trailer.
void setExact(boolean b)
As a PDF is read in, COS objects are normalized.
boolean setPassword(String password)
Set password, returning true if document can be read unencrypted.

Constructor Detail

PDFReader

public PDFReader(File file)
Construct an instance corresponding to the .pdf file. Java strings can be converted to a File with new File(string), and URIs and URLs with new File(getPath()).
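
The conversions mentioned here use only standard JDK calls. The sketch below shows them in isolation; the class and method names (FileArgs, fromString, fromUri) are hypothetical helpers, and the resulting File would then be handed to the PDFReader(File) constructor.

```java
import java.io.File;
import java.net.URI;

class FileArgs {
    /** A path held in a Java String becomes a File directly. */
    static File fromString(String path) {
        return new File(path);
    }

    /** URI.getPath() returns the decoded path, so %20 becomes a space. */
    static File fromUri(URI uri) {
        return new File(uri.getPath());
    }

    public static void main(String[] args) {
        System.out.println(fromUri(URI.create("file:///tmp/my%20doc.pdf")).getName());
    }
}
```

For a java.net.URL, going through toURI() first is one way to get the same percent-decoding, since URL.getPath() returns the path undecoded.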

PDFReader

public PDFReader(RandomAccess raf)
Read from a special file like a ByteArrayRAF.

Method Detail

close

public void close()
Close use and free up resources. This class should be closed when no longer used.

eatSpace

public void eatSpace(RandomAccess raf)
Eat whitespace between tokens in COS object.

eatSpace

public void eatSpace(InputStreamComposite in)
Eat whitespace between tokens in content stream.
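
As a rough illustration of what skipping whitespace between tokens involves, the sketch below advances an index over the six byte values the PDF Reference treats as white space. It works on a plain byte array rather than a RandomAccess or InputStreamComposite, does not handle comments, and its names are hypothetical; it is not the library's implementation.

```java
class WhitespaceSketch {
    /** True for the six PDF white-space bytes: NUL, HT, LF, FF, CR, SP. */
    static boolean isPdfSpace(int b) {
        return b == 0 || b == 9 || b == 10 || b == 12 || b == 13 || b == 32;
    }

    /** Returns the index of the first non-white-space byte at or after pos. */
    static int eatSpace(byte[] data, int pos) {
        while (pos < data.length && isPdfSpace(data[pos])) {
            pos++;
        }
        return pos;
    }
}
```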

fault

public void fault()
Faults into cache all objects reachable in document (starting from trailer), and sets unreachable objects (that have not been previously read by the caller) to OBJECT_NULL. Some bad PDFs have cross reference entries to non-existent objects, but these objects aren't referenced by other objects so viewing works fine. This method ensures that a loop over all objects won't encounter an error either.

findNameTree

public Object findNameTree(Dict root, StringBuffer name)
Find name in name tree rooted at root and return its associated value. Used for Dests, AP, JavaScript, Pages, Templates, IDS, URLS. Yes, keys of name tree are (String)'s.

Returns: null if name is not found in tree or if root is null
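
A name tree lookup can be sketched as below, assuming the node layout from the PDF Reference: leaf nodes carry a Names array of alternating key and value, intermediate nodes carry Kids, and non-root nodes carry Limits with their least and greatest keys. Plain Maps and Lists stand in for Dict and arrays, kids are assumed to be already resolved rather than indirect references, and String.compareTo approximates the byte-wise key ordering for ASCII keys. The class name is hypothetical.

```java
import java.util.List;
import java.util.Map;

class NameTreeSketch {
    /** Returns the value stored under name, or null if absent or node is null. */
    @SuppressWarnings("unchecked")
    static Object find(Map<String, Object> node, String name) {
        if (node == null) return null;

        List<Object> limits = (List<Object>) node.get("Limits");
        if (limits != null
                && (name.compareTo((String) limits.get(0)) < 0
                    || name.compareTo((String) limits.get(1)) > 0)) {
            return null;  // name lies outside this subtree's key range
        }

        List<Object> names = (List<Object>) node.get("Names");
        if (names != null) {
            for (int i = 0; i + 1 < names.size(); i += 2) {  // key, value pairs
                if (name.equals(names.get(i))) return names.get(i + 1);
            }
        }

        List<Object> kids = (List<Object>) node.get("Kids");
        if (kids != null) {
            for (Object kid : kids) {
                Object hit = find((Map<String, Object>) kid, name);
                if (hit != null) return hit;
            }
        }
        return null;
    }
}
```

The Limits check is what lets a lookup skip whole subtrees; a number tree has the same shape with Nums in place of Names and integer keys.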

findNumberTree

public Object findNumberTree(Dict root, int number)
Find number in number tree rooted at root and return its associated value. Used for PageLabels, ParentTree in structure tree root.

Returns: null if number is not found in tree.

getCatalog

public Dict getCatalog()
Returns Document catalog.

Required keys: Type (=='Catalog'), Pages (dictionary).

Optional keys: PageLabels (number tree), Names (dictionary), Dests (dictionary), ViewerPreferences (dictionary), PageLayout (name), PageMode (name), Outlines (dictionary), Threads (array), OpenAction (array or dictionary), URI (dictionary), AcroForm (dictionary), StructTreeRoot (dictionary), SpiderInfo (dictionary)

getColorSpace

public ColorSpace getColorSpace(Object csref, Dict csres, Dict patres)
ColorSpaces with cacheing.

getEncrypt

public Encrypt getEncrypt()
Returns document-wide encryption manager. User of class should set the password, if any, through this object. If the password is null/empty, the password is automatically set.

See Also: authUser isAuthorized

getFile

public File getFile()
Returns associated PDF java.io.File.

getFileInputStream

public InputStream getFileInputStream(Object spec)
Given a PDF external file specification, which can be a local file or network URI, returns a stream of data. This may involve fetching the file over the network and writing files to the file system. Files may happen to have their data embedded. Client may want to wrap return value in a java.io.BufferedInputStream. If file is not found, returns null.

getFileSpecification

public URI getFileSpecification(Object spec)
Converts simple or full file specification into a platform-independent URI.

getFont

public FontPDF getFont(Dict fd, double pointsize, float size, Dict page, AffineTransform Tm, PDF pdf)
Cacheing front end to FontPDF. The created font is stored in the font dictionary in a SoftReference under key PDFReader.

getImage

public BufferedImage getImage(IRef imgdictref, AffineTransform ctm, Color fillcolor)
Images with cacheing (under key PDFReader).

getInfo

public Dict getInfo()
Returns /Info dictionary from trailer. Normalizes to remove 0-length values.

Optional keys: Title (string), Author (string), Subject (string), Keywords (string), Creator (string), Producer (string), CreationDate (date), ModDate (date), Trapped (name).

Returns: null if no /Info.

getInputStream

public InputStreamComposite getInputStream(Object o, boolean iscontent)
Given indirect reference to stream dictionary or array of such references, returns stream of uncompressed and decrypted data. (Images are not uncompressed here.)

getInputStream

public InputStreamComposite getInputStream(Object o)
Same as getInputStream(Object, boolean), assuming not a content stream.

getLinearized

public int getLinearized()
If document is linearized, returns integer > 0 that is object number of linearization dictionary.

getMajorVersion

public int getMajorVersion()

getMetadata

public String getMetadata(Object o)
Returns metadata associated with object, or a 0-length String if none. To obtain the metadata for the document as a whole, pass the document catalog.

getMinorVersion

public int getMinorVersion()

getObjCnt

public int getObjCnt()
Returns number of objects, numbered from 0.

getObject

public Object getObject(Object ref)
Returns referenced object, following any indirect references to concrete objects. In contrast to other methods, ref can be a Java null, so one can easily fully resolve an object that may or may not be present in a dictionary with a getObject(dict.get("key")).
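
The null-tolerant resolution described here can be sketched with a small in-memory object table. Ref is a hypothetical stand-in for IRef, and the cycle check is an addition of this sketch, not something the documentation states about the library.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class ResolveSketch {
    /** Hypothetical stand-in for an indirect reference to object number num. */
    static final class Ref {
        final int num;
        Ref(int num) { this.num = num; }
    }

    private final Map<Integer, Object> objects = new HashMap<>();

    void put(int num, Object value) {
        objects.put(num, value);
    }

    /** Follows references until a concrete object (or null) is reached. */
    Object resolve(Object o) {
        Set<Integer> seen = new HashSet<>();
        while (o instanceof Ref) {
            int num = ((Ref) o).num;
            if (!seen.add(num)) {
                throw new IllegalStateException("reference cycle at object " + num);
            }
            o = objects.get(num);
        }
        return o;  // null in, null out; concrete objects pass through unchanged
    }
}
```

Because null passes straight through, a caller can resolve the result of a dictionary lookup without first checking whether the key was present.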

getObject

public Object getObject(int num)
Returns object from xref table offset at point num, from 0 to getObjCnt, taking from cache if available. Object is decrypted if necessary. All objects are cached, with java.lang.ref.SoftReferences so they are automatically garbage collected when memory is tight. If the object is a stream, its contents are not read, but the file position of the data (a java.lang.Long) is stored under a new injected key STREAM_DATA. If the object has been freed ('f' in xref table), OBJECT_DELETED is returned -- the old object is not available. Object number is an int not a long, so it can handle only 2,147,483,647 of the possible 9,999,999,999 objects in a PDF, but even very large PDFs seldom have more than 100,000 objects.
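
The caching behaviour described above, where objects are held through java.lang.ref.SoftReference and reloaded if the collector has reclaimed them, follows a common pattern that can be sketched as below. The class name, the loader callback, and the loads counter are assumptions made for illustration; the library's actual cache layout is not documented here.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

class SoftCacheSketch {
    private final Map<Integer, SoftReference<Object>> cache = new HashMap<>();
    private final IntFunction<Object> loader;  // hypothetical: reads object num from the file
    int loads = 0;                             // counts loader calls, for demonstration

    SoftCacheSketch(IntFunction<Object> loader) {
        this.loader = loader;
    }

    /** Returns the cached object, reloading it if it was never read or was collected. */
    Object get(int num) {
        SoftReference<Object> ref = cache.get(num);
        Object value = (ref != null) ? ref.get() : null;
        if (value == null) {
            value = loader.apply(num);
            loads++;
            cache.put(num, new SoftReference<>(value));
        }
        return value;
    }
}
```

While a caller still holds a strong reference to a returned object, the soft reference cannot be cleared, so repeated lookups return the same instance.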

getObjGen

public int getObjGen(int objnum)
Returns object's generation number. Generations are used in incremental writing and encryption. In PDF 1.5 this is the cross reference table's field 2.

getObjOff

public long getObjOff(int objnum)
Returns object's byte offset in file. In PDF 1.5 this is the cross reference table's field 1. N.B. Points to the object header n g obj, not to start of content.

getObjType

public byte getObjType(int objnum)
Returns object's type, which is one of XREF_FREE, XREF_NORMAL, or XREF_OBJSTMC. In PDF 1.5, this is the cross reference table's field 0.
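
For orientation, the offset, generation, and type values reported by getObjOff, getObjGen, and getObjType also appear in classic (pre-1.5) cross-reference tables, where each entry is a fixed-width line: a 10-digit offset, a 5-digit generation, and the letter n (in use) or f (free). The sketch below parses one such line. The class name and the use of a char for the type are hypothetical and unrelated to the library's XREF_* constants.

```java
class XrefEntrySketch {
    final long offset;     // byte offset for 'n'; next free object number for 'f'
    final int generation;
    final char type;       // 'n' = in use, 'f' = free

    XrefEntrySketch(long offset, int generation, char type) {
        this.offset = offset;
        this.generation = generation;
        this.type = type;
    }

    /** Parses a classic entry such as "0000000017 00000 n", ignoring the line ending. */
    static XrefEntrySketch parse(String line) {
        String s = line.trim();
        if (s.length() != 18 || s.charAt(10) != ' ' || s.charAt(16) != ' ') {
            throw new IllegalArgumentException("bad xref entry: " + line);
        }
        long off = Long.parseLong(s.substring(0, 10));
        int gen = Integer.parseInt(s.substring(11, 16));
        char t = s.charAt(17);
        if (t != 'n' && t != 'f') {
            throw new IllegalArgumentException("bad entry type: " + t);
        }
        return new XrefEntrySketch(off, gen, t);
    }
}
```

The offset is kept as a long because a 10-digit decimal field can exceed the range of an int.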

getPage

public Dict getPage(int pagenum)
Given page number, finds the corresponding page dictionary. Populates inheritable attributes by climbing parents as necessary. To get page dictionary without inheriting attributes, use getObject(getPageRef(pagenum)). Pages are numbered 1..getPageCnt, inclusive. Reverse of getPageNum.
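
Populating inheritable attributes by climbing parents can be sketched as below. The four attribute names are those the PDF Reference lists as inheritable page attributes; Maps stand in for Dict, Parent entries are assumed to be already resolved to dictionaries, and the depth limit is a guard added by this sketch. The class name is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class InheritSketch {
    static final String[] INHERITABLE = {"Resources", "MediaBox", "CropBox", "Rotate"};

    /** Returns a copy of page with missing inheritable keys filled from ancestors. */
    @SuppressWarnings("unchecked")
    static Map<String, Object> withInherited(Map<String, Object> page) {
        Map<String, Object> out = new HashMap<>(page);  // leave the original untouched
        Object p = page.get("Parent");
        int depth = 0;
        while (p instanceof Map && depth++ < 100) {     // guard against parent cycles
            Map<String, Object> parent = (Map<String, Object>) p;
            for (String key : INHERITABLE) {
                // Nearest ancestor wins: only fill keys that are still missing.
                if (!out.containsKey(key) && parent.containsKey(key)) {
                    out.put(key, parent.get(key));
                }
            }
            p = parent.get("Parent");
        }
        return out;
    }
}
```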

getPageCnt

public int getPageCnt()
Returns number of pages in document.

getPageNum

public int getPageNum(Dict page)
Reverse of getPage.

getPageRef

public IRef getPageRef(int pagenum)
Given page number, finds the corresponding page object. Pages are numbered PDF-style: 1..getPageCnt, inclusive. If object in that position is not a /Type /Page, returns null (not OBJECT_NULL).

getRAF

public RandomAccess getRAF()
Returns associated RandomAccess. Clients should not cache the return value since the RandomAccess can change.

getReal

public static Double getReal(double val)

getStartXRef

public long getStartXRef()
File offset of (last) trailer, which is needed for incremental updates.

getStreamData

public byte[] getStreamData(Object ref, boolean fraw, boolean fcache)
Returns entire content of input stream. For a stream, use getInputStream and read the data out with java.io.InputStream#read().

Parameters:
ref - stream dictionary, or indirect ref to stream dictionary. If PDF is encrypted, must be indirect ref (perhaps freshly created for this purpose, with the right generation number).
fraw - raw data, not passed through filters.
fcache - if true save data under STREAM_DATA key, remove PDF /Length key, and strip out non-image filters from Filter value.

Returns: null if dictionary is not a stream

getTrailer

public Dict getTrailer()
Document trailer. Required keys: Size, Root (to catalog), ID. (If no ID exists, one is created.) Optional keys: Encrypt, Info.

isAuthorized

public boolean isAuthorized()

isModified

public boolean isModified()
Modified, perhaps because repaired or annotated.

isRepaired

public boolean isRepaired()

readCommand

public Cmd readCommand(InputStreamComposite in)
Parse next command from content stream, or return null if no more. For inline images (BI..ID..EI), expands abbreviations in dictionary (in ops[0]) and strips non-image filters from data (in ops[1]).
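
Content streams are postfix: operands come first, then the operator that consumes them. The sketch below groups whitespace-separated tokens into commands on that basis. It is deliberately simplified (no strings, arrays, dictionaries, or inline images), and Command is a hypothetical stand-in, not the library's Cmd class.

```java
import java.util.ArrayList;
import java.util.List;

class CommandSketch {
    /** Hypothetical stand-in for a parsed command: operator plus its operands. */
    static final class Command {
        final String op;
        final List<String> operands;
        Command(String op, List<String> operands) {
            this.op = op;
            this.operands = operands;
        }
    }

    /** Treats names (/F1) and numbers as operands; anything else is an operator. */
    static boolean isOperand(String tok) {
        char c = tok.charAt(0);
        return c == '/' || c == '-' || c == '+' || c == '.' || Character.isDigit(c);
    }

    /** Collects operands until an operator token closes the command. */
    static List<Command> parse(String content) {
        List<Command> out = new ArrayList<>();
        List<String> pending = new ArrayList<>();
        for (String tok : content.trim().split("\\s+")) {
            if (tok.isEmpty()) continue;
            if (isOperand(tok)) {
                pending.add(tok);
            } else {
                out.add(new Command(tok, pending));
                pending = new ArrayList<>();
            }
        }
        return out;
    }
}
```

For the input "1 0 0 1 50 700 cm BT /F1 12 Tf ET" this yields four commands: cm with six operands, BT with none, Tf with two, and ET with none.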

readCommandArray

public Cmd[] readCommandArray(Object contentstream)
Parses content stream into array of commands. If PDF is encrypted, contentstream must be IRef.

See Also: PDFWriter Cmd

readInlineImage

public Dict readInlineImage(InputStreamComposite in)
Parse inline image from stream into a dictionary with its attributes and the data in a CLASS_DATA under key STREAM_DATA. Abbreviated keys (but not values) are expanded (e.g., /F => /Filter, but not the color space value G => DeviceGray), and non-image filters on the data (such as LZW) are removed. On entry input stream should be placed after the BI and following whitespace; on exit input stream is immediately after the closing EI.
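
The key expansion step can be sketched as below, using the inline-image key abbreviations listed in the PDF Reference. Values are deliberately left untouched, matching the note above that G is not expanded to DeviceGray. The sketch does not remove filters, and its class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

class InlineImageKeys {
    private static final Map<String, String> FULL = new HashMap<>();
    static {
        FULL.put("BPC", "BitsPerComponent");
        FULL.put("CS", "ColorSpace");
        FULL.put("D", "Decode");
        FULL.put("DP", "DecodeParms");
        FULL.put("F", "Filter");
        FULL.put("H", "Height");
        FULL.put("IM", "ImageMask");
        FULL.put("I", "Interpolate");
        FULL.put("W", "Width");
    }

    /** Returns a copy of dict with abbreviated keys expanded; values are unchanged. */
    static Map<String, Object> expandKeys(Map<String, Object> dict) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : dict.entrySet()) {
            String full = FULL.get(e.getKey());
            out.put(full != null ? full : e.getKey(), e.getValue());
        }
        return out;
    }
}
```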

readInt

public int readInt(RandomAccess raf)
Read positive integer from file.

readObject

public Object readObject()
Returns next COS object from current file position, which may not be a top-level object starting with m n obj. Ordinarily you want PDFReader or getObject instead. Comments, which are rare, are lost; the following object is returned. Keywords that are not boolean or null are returned as java.lang.Strings.

Precondition: file pointer at start of token.
Postcondition: Eat following whitespace to bring file pointer to start of next token.

readObject

public Object readObject(InputStreamComposite in)
Reads a complete object from a content stream: int, string, dictionary, map, ..., including all subparts.

readXref

public Dict readXref(boolean all)
Reads cross-reference table and returns its trailer. If all flag is true, read entire table, chaining from trailer to trailer via /Prev. Precondition: file pointer is at start of xref table, at the start of the xref keyword. Usually the cross-reference table is read automatically at startup.

Returns: PDF trailer dictionary

See Also: PDFReader PDFReader getObjCnt
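
Chaining from trailer to trailer via /Prev can be sketched as below. The map from file offset to trailer dictionary is a hypothetical stand-in for actually seeking and parsing each cross-reference section, and the visited-offset set is a guard added by this sketch against a malformed file whose /Prev entries loop.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class PrevChainSketch {
    /** Returns trailers newest first, starting at startXRef and following /Prev. */
    static List<Map<String, Object>> chain(
            Map<Long, Map<String, Object>> trailersByOffset, long startXRef) {
        List<Map<String, Object>> out = new ArrayList<>();
        Set<Long> seen = new HashSet<>();
        Long off = startXRef;
        while (off != null && seen.add(off)) {          // stop on a repeated offset
            Map<String, Object> trailer = trailersByOffset.get(off);
            if (trailer == null) break;                 // nothing readable at that offset
            out.add(trailer);
            Object prev = trailer.get("Prev");
            off = (prev instanceof Number)
                    ? Long.valueOf(((Number) prev).longValue())
                    : null;                             // no /Prev: oldest section reached
        }
        return out;
    }
}
```

Each incremental update appends a new cross-reference section whose trailer points back at the previous one, so the chain runs from the most recent update to the original file.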

setExact

public void setExact(boolean b)
As a PDF is read in, COS objects are normalized. Set exact to true to prevent this. This should be set immediately after instantiation and setting the encryption password, if any; after this point it cannot be changed, because that would leave some parts updated and others not, and there would be conflicts.

setPassword

public boolean setPassword(String password)
Set password, returning true if document can be read unencrypted. Document may be unencrypted for several reasons: not encrypted, password is null and so automatically unlocked, password is correct, password correctly set earlier. Once the password is correctly set, it may not be unset.