Multivalent API

multivalent.std.adaptor.pdf
Class PDFReader

java.lang.Object
  extended by multivalent.std.adaptor.pdf.COSSource
      extended by multivalent.std.adaptor.pdf.PDFReader

public class PDFReader
extends COSSource

Parse Adobe's Portable Document Format (PDF) and construct low-level objects (COS in Adobe terminology: string, number, dictionary) and high-level Java objects (Font, Image). Also parses Forms Data Format (FDF), which can hold form and annotation data. Based on Adobe's PDF 1.5 Reference, available online. This class provides easy access to all parts of a PDF for developers familiar with the PDF Reference. At this time PDF files of up to 4GB in length are supported.

How to use this class

A PDF file is a series of numbered "COS" objects (strings, integers, dictionaries, streams, references to other objects, ...) that can be interpreted as images, streams of page-building commands, annotations, and so on, as described in the PDF Reference. To access these objects, first create an instance on a .pdf file with a constructor. All of the PDF's objects are then available by number via getObject(int) from 1 to getObjCnt(). If an object refers to another by an indirect reference (IRef), getObject(Object) will follow the reference to the actual object. PDF objects are represented with basic Java data types, e.g., PDF dictionaries as Java Maps, with the complete correspondence given by CLASS_* constants. The data for images and page contents are kept in streams, readable uncompressed and decrypted with getInputStream(Object).
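
The idea of following an indirect reference to the actual object can be sketched without the library. This is only an illustration under stated assumptions: SketchRef and ObjectTableSketch are hypothetical stand-ins invented here, not the library's IRef or PDFReader, and the cycle guard is an addition of this sketch.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for an indirect reference; not the library's IRef. */
final class SketchRef {
    final int objNum;
    SketchRef(int objNum) { this.objNum = objNum; }
}

/** Hypothetical table of numbered objects, standing in for an open PDF. */
final class ObjectTableSketch {
    private final Map<Integer, Object> objects = new HashMap<>();

    void put(int objNum, Object value) { objects.put(objNum, value); }

    /** Follows references until a concrete object, or null, is reached. */
    Object resolve(Object o) {
        int hops = 0;
        while (o instanceof SketchRef) {
            // An acyclic chain needs at most size()+1 lookups, so more means a cycle.
            if (++hops > objects.size() + 1) {
                throw new IllegalStateException("reference cycle");
            }
            o = objects.get(((SketchRef) o).objNum);
        }
        return o;
    }

    public static void main(String[] args) {
        ObjectTableSketch table = new ObjectTableSketch();
        table.put(1, "a concrete object");
        table.put(2, new SketchRef(1));           // object 2 only points at object 1
        System.out.println(table.resolve(new SketchRef(2)));
        System.out.println(table.resolve(null));  // a missing entry simply stays null
    }
}
```

Accepting null in resolve mirrors the convenience described for getObject(Object): a value fetched from a dictionary can be passed straight through whether or not the key was present.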

At a higher level, you can ask this API for pages and to transform them into Java images, fonts, colorspaces, and so on. Get a particular page's dictionary by number from 1 (not 0) to getPageCnt(), inclusive, with getPage(int). Pages' content streams that describe page appearance can be parsed into individual commands with readCommand(InputStreamComposite). High-level versions of images, colorspaces, and fonts are available by passing the PDF object to the appropriate method.

See Also

Version:
$Revision: 1.103 $ $Date: 2005/07/26 19:38:02 $

Constructor Summary
PDFReader(java.io.File file)
          Constructs new instance corresponding to the .pdf file.
PDFReader(com.pt.io.InputUni iu)
          Constructs new instance corresponding to the InputUni.
PDFReader(java.lang.Object[] objs, Dict trailer)
          Constructs new instance given the data structures of a PDF (for experts).
 
Method Summary
 void close()
          Close use and free up resources, including file descriptors.
 int countCached()
          For performance tuning, returns count of different objects that have been cached (but may have been subsequently garbage collected).
 void eatSpace(InputStreamComposite in)
          Eat whitespace between tokens in content stream.
 void eatSpace(com.pt.io.RandomAccess ra)
          Eat whitespace between tokens in COS object.
 void fault()
          Faults into cache all objects reachable in document (starting from trailer), and sets unreachable objects (that have not been previously read by the caller) to COS.OBJECT_NULL.
 java.lang.Object findNameTree(Dict root, java.lang.StringBuffer name)
          Find name in name tree rooted at root and return its associated value.
 java.lang.Object findNumberTree(Dict root, int number)
          Find number in number tree rooted at root and return its associated value.
 Dict getCatalog()
          Returns Document catalog.
 com.pt.awt.font.CMap getCMap(java.lang.Object ref)
          Returns CMap for Encoding or ToUnicode.
 java.awt.color.ColorSpace getColorSpace(java.lang.Object csref, Dict csres, Dict patres)
          ColorSpaces.createColorSpace(Object, PDFReader) with caching.
 Encrypt getEncrypt()
          Returns document-wide encryption manager.
 java.io.InputStream getFileInputStream(java.lang.Object spec)
          Given a PDF external file specification, which can be a local file or network URI, returns a stream of data.
 java.net.URI getFileSpecification(java.lang.Object spec)
          Converts simple or full file specification into a platform-independent URI.
 NFont getFont(Dict fd, float size, java.awt.geom.AffineTransform Tm, PDF pdf)
          Fonts#createFont(Dict,float,AffineTransform,Dict,PDF,PDFReader) with caching and scaling.
 java.awt.image.BufferedImage getImage(IRef imgdictref, java.awt.geom.AffineTransform ctm, java.awt.Color fillcolor)
          Images.createImage(Dict, InputStream, AffineTransform, Color, PDFReader) with caching (under key COS.REALIZED).
 Dict getInfo()
          Returns /Info dictionary from trailer.
 InputStreamComposite getInputStream(java.lang.Object o)
          Same as getInputStream(Object, boolean), assuming not a content stream.
 InputStreamComposite getInputStream(java.lang.Object o, boolean iscontent)
          Given indirect reference to stream dictionary or array of such references, returns stream of uncompressed and decrypted data.
 int getLinearized()
          If document is linearized, returns integer > 0 that is object number of linearization dictionary.
 java.lang.String getMetadata(java.lang.Object o)
          Returns metadata associated with object, or a 0-length String if none.
 int getObjCnt()
          Returns number of objects, numbered from 0.
 java.lang.Object getObject(int num)
          Returns object from xref table offset at point num, from 0 to getObjCnt(), taking from cache if available.
 java.lang.Object getObject(java.lang.Object ref)
          Returns referenced object, following any indirect references to concrete objects.
 int getObjGen(int objnum)
          Returns object's generation number.
 long getObjOff(int objnum)
          Returns object's byte offset in file.
 byte getObjType(int objnum)
          Returns object's type, which is one of COS.XREF_FREE, COS.XREF_NORMAL, or COS.XREF_OBJSTMC.
 Dict getPage(int pagenum)
          Given page number, finds the corresponding page dictionary.
 int getPageCnt()
          Returns number of pages in document.
 int getPageNum(Dict page)
          Reverse of getPage(int).
 IRef getPageRef(int pagenum)
          Given page number, finds the corresponding page object.
 com.pt.io.RandomAccess getRA()
          Returns associated RandomAccess.
static java.lang.Double getReal(double val)
           
 long getStartXRef()
          File offset of (last) trailer, which is needed for incremental updates.
 byte[] getStreamData(java.lang.Object ref, boolean fraw, boolean fcache)
          Returns entire content of input stream.
 Dict getTrailer()
          Document trailer.
 java.net.URI getURI()
          Returns associated URI.
 phelps.util.Version getVersion()
          Returns the major version of PDF used; for example, for PDF 1.4.
 boolean isAuthorized()
           
 boolean isModified()
          Modified, perhaps because repaired or annotated.
 boolean isRepaired()
           
 Cmd readCommand(InputStreamComposite in)
          Parse next command from content stream, or return null if no more.
 Cmd[] readCommandArray(java.lang.Object contentstream)
          Parses content stream into array of commands.
 Dict readInlineImage(InputStreamComposite in)
          Parse inline image from stream into a dictionary with its attributes and the data in a COS.CLASS_DATA under key COS.STREAM_DATA.
 int readInt(com.pt.io.RandomAccess ra)
          Read positive integer from file.
 java.lang.Object readObject()
          Returns next COS object from current file position, which may not be a top-level object starting with m n obj.
 java.lang.Object readObject(InputStreamComposite in)
          Reads a complete object from a content stream: int, string, dictionary, map, ..., including all subparts.
 Dict readXref(boolean all)
          Reads cross-reference table and returns its trailer.
 void reset()
          Clears all cached objects, which may have been mutated.
 void setExact(boolean b)
          As a PDF is read in, COS objects are normalized.
 boolean setPassword(java.lang.String password)
          Set password, returning true if document can be read unencrypted.
 
Methods inherited from class multivalent.std.adaptor.pdf.COSSource
connected, getDecodeParms, getObjInt
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

PDFReader

public PDFReader(java.io.File file)
          throws java.io.IOException,
                 ParseException
Constructs new instance corresponding to the .pdf file.

Throws:
java.io.IOException
ParseException

PDFReader

public PDFReader(com.pt.io.InputUni iu)
          throws java.io.IOException,
                 ParseException
Constructs new instance corresponding to the InputUni.

Throws:
java.io.IOException
ParseException

PDFReader

public PDFReader(java.lang.Object[] objs,
                 Dict trailer)
          throws java.io.IOException
Constructs new instance given the data structures of a PDF (for experts).

Throws:
java.io.IOException

Method Detail

close

public void close()
           throws java.io.IOException
Close use and free up resources, including file descriptors. This class should be closed when no longer used.

Throws:
java.io.IOException

getRA

public com.pt.io.RandomAccess getRA()
Returns associated RandomAccess. Clients should not cache the return value since the RandomAccess can change.


getURI

public java.net.URI getURI()
Returns associated URI.


setExact

public void setExact(boolean b)
As a PDF is read in, COS objects are normalized. Set exact to true to prevent this. This should be set immediately after instantiation and setting the encryption password, if any; after this point it cannot be changed, because that would leave some parts updated and others not, and there would be conflicts.


isModified

public boolean isModified()
Modified, perhaps because repaired or annotated.


isRepaired

public boolean isRepaired()

readXref

public Dict readXref(boolean all)
              throws java.io.IOException,
                     ParseException
Reads cross-reference table and returns its trailer. If all flag is true, read entire table, chaining from trailer to trailer via /Prev. Precondition: file pointer is at start of xref table, at the start of the xref keyword. Usually the cross-reference table is read automatically at startup.

Returns:
PDF trailer dictionary
Throws:
java.io.IOException
ParseException
See Also:
getObjOff(int), getObjGen(int), getObjCnt()
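
For orientation, an entry in a classic (pre-1.5) cross-reference table is a fixed-width line: a 10-digit number, a 5-digit generation, and the letter n (in use) or f (free). The following is a minimal sketch of decoding one such line; XrefEntrySketch is a hypothetical class written for this illustration and is not how readXref is implemented.

```java
/** Hypothetical holder for one classic cross-reference entry. */
final class XrefEntrySketch {
    /** Byte offset for an in-use entry; next free object number for a free entry. */
    final long field1;
    final int generation;
    final boolean inUse;

    XrefEntrySketch(long field1, int generation, boolean inUse) {
        this.field1 = field1;
        this.generation = generation;
        this.inUse = inUse;
    }

    /** Parses a line such as "0000000017 00000 n" (trailing space and EOL allowed). */
    static XrefEntrySketch parse(String line) {
        String s = line.trim();
        if (s.length() != 18 || s.charAt(10) != ' ' || s.charAt(16) != ' ') {
            throw new IllegalArgumentException("not a cross-reference entry: " + line);
        }
        long field1 = Long.parseLong(s.substring(0, 10));      // columns 0..9
        int generation = Integer.parseInt(s.substring(11, 16)); // columns 11..15
        char type = s.charAt(17);
        if (type != 'n' && type != 'f') {
            throw new IllegalArgumentException("unknown entry type: " + type);
        }
        return new XrefEntrySketch(field1, generation, type == 'n');
    }

    public static void main(String[] args) {
        XrefEntrySketch e = parse("0000000017 00000 n \n");
        System.out.println(e.field1 + " " + e.generation + " " + e.inUse);
    }
}
```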

getVersion

public phelps.util.Version getVersion()
Description copied from class: COSSource
Returns the major version of PDF used; for example, for PDF 1.4.

Specified by:
getVersion in class COSSource

getLinearized

public int getLinearized()
If document is linearized, returns integer > 0 that is object number of linearization dictionary.


getTrailer

public Dict getTrailer()
Document trailer. Required keys: Size, Root (to catalog), ID. (If no ID exists, one is created.) Optional keys: Encrypt, Info.

Specified by:
getTrailer in class COSSource

getStartXRef

public long getStartXRef()
File offset of (last) trailer, which is needed for incremental updates.


getCatalog

public Dict getCatalog()
                throws java.io.IOException
Returns Document catalog.

Required keys: Type (=='Catalog'), Pages (dictionary).

Optional keys: PageLabels (number tree), Names (dictionary), Dests (dictionary), ViewerPreferences (dictionary), PageLayout (name), PageMode (name), Outlines (dictionary), Threads (array), OpenAction (array or dictionary), URI (dictionary), AcroForm (dictionary), StructTreeRoot (dictionary), SpiderInfo (dictionary)

Specified by:
getCatalog in class COSSource
Throws:
java.io.IOException

getInfo

public Dict getInfo()
             throws java.io.IOException
Returns /Info dictionary from trailer. Normalizes to remove 0-length values.

Optional keys: Title (string), Author (string), Subject (string), Keywords (string), Creator (string), Producer (string), CreationDate (date), ModDate (date), Trapped (name).

Returns:
null if no /Info.
Throws:
java.io.IOException

getMetadata

public java.lang.String getMetadata(java.lang.Object o)
                             throws java.io.IOException
Returns metadata associated with object, or a 0-length String if none. To obtain the metadata for the document as a whole, pass the document catalog.

Throws:
java.io.IOException

getEncrypt

public Encrypt getEncrypt()
Returns document-wide encryption manager. User of class should set the password, if any, through this object. If the password is null/empty, the password is automatically set.

See Also:
SecurityHandler.authUser(String), SecurityHandler.isAuthorized()

setPassword

public boolean setPassword(java.lang.String password)
                    throws java.io.IOException
Set password, returning true if document can be read unencrypted. Document may be unencrypted for several reasons: it is not encrypted, the password is null and so it was automatically unlocked, the password is correct, or the password was correctly set earlier. If the password is the owner password, then all manipulations of the document are permitted; if it is the user password, then manipulations may be restricted.

Throws:
java.io.IOException

isAuthorized

public boolean isAuthorized()

getPageCnt

public int getPageCnt()
               throws java.io.IOException
Returns number of pages in document.

Throws:
java.io.IOException

getPageRef

public IRef getPageRef(int pagenum)
                throws java.io.IOException
Given page number, finds the corresponding page object. Pages are numbered PDF-style: 1..getPageCnt(), inclusive. If object in that position is not a /Type /Page, returns null (not COS.OBJECT_NULL).

Throws:
java.io.IOException

getPage

public Dict getPage(int pagenum)
             throws java.io.IOException
Given page number, finds the corresponding page dictionary. Populates inheritable attributes by climbing parents as necessary. To get the page dictionary without inherited attributes, use getObject(getPageRef(pagenum)). Pages are numbered 1..getPageCnt(), inclusive. Reverse of getPageNum(Dict).

Throws:
java.io.IOException
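
Populating inheritable attributes by climbing parents can be pictured as follows. This is a rough sketch under stated assumptions, not the library's code: dictionaries are modeled as plain Java Maps, /Parent values are assumed to be already resolved to Maps, and PageInheritSketch with its depth limit of 64 is invented for the illustration. The four keys listed are the page attributes the PDF Reference marks as inheritable.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper illustrating inheritance of page attributes. */
final class PageInheritSketch {
    static final String[] INHERITABLE = {"Resources", "MediaBox", "CropBox", "Rotate"};

    /** Returns a copy of page in which missing inheritable keys come from the nearest ancestor. */
    static Map<String, Object> withInherited(Map<String, Object> page) {
        Map<String, Object> out = new HashMap<>(page);
        for (String key : INHERITABLE) {
            Object node = page.get("Parent");
            int depth = 0;
            // Climb until the key is found, the root is passed, or the (arbitrary) depth limit hits.
            while (!out.containsKey(key) && node instanceof Map && depth++ < 64) {
                Map<?, ?> parent = (Map<?, ?>) node;
                if (parent.containsKey(key)) {
                    out.put(key, parent.get(key));
                }
                node = parent.get("Parent");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> root = new HashMap<>();
        root.put("MediaBox", "[0 0 612 792]");
        Map<String, Object> page = new HashMap<>();
        page.put("Parent", root);
        System.out.println(withInherited(page).get("MediaBox"));
    }
}
```

Working on a copy leaves the original dictionary untouched, which corresponds to the distinction drawn above between getPage(int) and getObject(getPageRef(pagenum)).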

getPageNum

public int getPageNum(Dict page)
               throws java.io.IOException
Reverse of getPage(int).

Throws:
java.io.IOException

getObjOff

public long getObjOff(int objnum)
Returns object's byte offset in file. In PDF 1.5 this is the cross reference table's field 1. N.B. Points to the object header n g obj, not to start of content.


getObjGen

public int getObjGen(int objnum)
Returns object's generation number. Generations are used in incremental writing and encryption. In PDF 1.5 this is the cross reference table's field 2.


getObjType

public byte getObjType(int objnum)
Returns object's type, which is one of COS.XREF_FREE, COS.XREF_NORMAL, or COS.XREF_OBJSTMC. In PDF 1.5, this is the cross reference table's field 0.


getObjCnt

public int getObjCnt()
Returns number of objects, numbered from 0.

Specified by:
getObjCnt in class COSSource

getInputStream

public InputStreamComposite getInputStream(java.lang.Object o,
                                           boolean iscontent)
                                    throws java.io.IOException
Given indirect reference to stream dictionary or array of such references, returns stream of uncompressed and decrypted data. (Images are not uncompressed here.)

Throws:
java.io.IOException
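
As background on what "uncompressed" means for the most common case: the PDF FlateDecode filter carries zlib/deflate data, which the JDK can decode with java.util.zip. The sketch below shows that single filter only, on an in-memory byte array. FlateSketch and its method names are hypothetical, and nothing here is taken from the library's implementation, which also handles decryption, other filters, and arrays of streams.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

/** Hypothetical helper: decoding Flate-compressed stream data with the JDK. */
final class FlateSketch {
    /** Wraps raw (still compressed) bytes in a decompressing stream. */
    static InputStream decoded(byte[] raw) {
        return new InflaterInputStream(new ByteArrayInputStream(raw));
    }

    /** Drains a stream into a byte array. */
    static byte[] readAll(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Compresses bytes; used here only to produce sample input for the demo. */
    static byte[] deflate(byte[] plain) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            DeflaterOutputStream z = new DeflaterOutputStream(out);
            z.write(plain);
            z.close();
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] raw = deflate("BT /F1 12 Tf ET".getBytes(StandardCharsets.US_ASCII));
        byte[] plain = readAll(decoded(raw));
        System.out.println(new String(plain, StandardCharsets.US_ASCII));
    }
}
```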

getInputStream

public InputStreamComposite getInputStream(java.lang.Object o)
                                    throws java.io.IOException
Same as getInputStream(Object, boolean), assuming not a content stream.

Throws:
java.io.IOException

getStreamData

public byte[] getStreamData(java.lang.Object ref,
                            boolean fraw,
                            boolean fcache)
                     throws java.io.IOException
Returns entire content of input stream as a byte array. To read the data incrementally instead, use getInputStream(Object, boolean) and read from the returned stream with InputStream.read().

Parameters:
ref - stream dictionary, or indirect ref to stream dictionary. If PDF is encrypted, must be indirect ref (perhaps freshly created for this purpose, with the right generation number).
fraw - raw data, not passed through filters
fcache - if true save data under COS.STREAM_DATA key, remove PDF /Length key, and strip out non-image filters from Filter value
Returns:
null if dictionary is not a stream
Throws:
java.io.IOException

eatSpace

public void eatSpace(com.pt.io.RandomAccess ra)
              throws java.io.IOException
Eat whitespace between tokens in COS object.

Throws:
java.io.IOException

readInt

public int readInt(com.pt.io.RandomAccess ra)
            throws java.io.IOException
Read positive integer from file.

Throws:
java.io.IOException
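
The two low-level steps described here, skipping whitespace and reading a positive integer, can be sketched over a plain java.io stream. Assumptions to note: TokenSketch is hypothetical, it uses PushbackInputStream rather than the library's RandomAccess, it does not handle PDF comments, and eating trailing whitespace after the integer is a choice of this sketch. The whitespace set (NUL, tab, line feed, form feed, carriage return, space) is the one defined by the PDF Reference.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical tokenizer fragment; not the library's implementation. */
final class TokenSketch {
    /** True for the six PDF whitespace characters. */
    static boolean isPdfWhitespace(int c) {
        return c == 0 || c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
    }

    /** Skips whitespace, leaving the stream at the first non-whitespace byte. */
    static void eatSpace(PushbackInputStream in) throws IOException {
        int c;
        do {
            c = in.read();
        } while (c != -1 && isPdfWhitespace(c));
        if (c != -1) {
            in.unread(c);
        }
    }

    /** Reads an unsigned decimal integer, then eats the whitespace that follows it. */
    static int readInt(PushbackInputStream in) throws IOException {
        int c = in.read();
        if (c < '0' || c > '9') {
            throw new IOException("digit expected");
        }
        long val = 0;
        while (c >= '0' && c <= '9') {
            val = val * 10 + (c - '0');
            if (val > Integer.MAX_VALUE) {
                throw new IOException("integer too large");
            }
            c = in.read();
        }
        if (c != -1) {
            in.unread(c);
        }
        eatSpace(in);
        return (int) val;
    }

    /** Convenience for demos: reads count integers from the start of text. */
    static int[] readInts(String text, int count) {
        try {
            PushbackInputStream in = new PushbackInputStream(
                    new ByteArrayInputStream(text.getBytes(StandardCharsets.US_ASCII)));
            eatSpace(in);
            int[] out = new int[count];
            for (int i = 0; i < count; i++) {
                out[i] = readInt(in);
            }
            return out;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] header = readInts("12 0 obj", 2);  // object number and generation
        System.out.println(header[0] + " " + header[1]);
    }
}
```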

getReal

public static java.lang.Double getReal(double val)

readObject

public java.lang.Object readObject()
                            throws java.io.IOException
Returns next COS object from current file position, which may not be a top-level object starting with m n obj. Ordinarily you want getObject(int) or getObject(Object) instead. Comments, which are rare, are lost; the following object is returned. Keywords that are not boolean or null are returned as Strings.

Precondition: file pointer at start of token.
Postcondition: Eat following whitespace to bring file pointer to start of next token.

Throws:
java.io.IOException

getObject

public java.lang.Object getObject(java.lang.Object ref)
                           throws java.io.IOException
Returns referenced object, following any indirect references to concrete objects. In contrast to other methods, ref can be a Java null, so one can easily fully resolve an object that may or may not be present in a dictionary with a getObject(dict.get("key")).

Specified by:
getObject in class COSSource
Throws:
java.io.IOException

getObject

public java.lang.Object getObject(int num)
                           throws java.io.IOException
Returns object from xref table offset at point num, from 0 to getObjCnt(), taking from cache if available. Object is decrypted if necessary. All objects are cached, with SoftReferences so they are automatically garbage collected when memory is tight. If the object is a stream, its contents are not read, but the file position of the data (a Long) is stored under a new injected key COS.STREAM_DATA. If the object has been freed ('f' in xref table), COS.OBJECT_DELETED is returned -- the old object is not available. Object number is an int not a long, so it can handle only 2,147,483,647 of the possible 9,999,999,999 objects in a PDF, but even very large PDFs seldom have more than 100,000 objects.

Throws:
java.io.IOException
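
The caching behavior described above, with objects held through SoftReferences and reloaded when the collector has reclaimed them, follows a common pattern that can be sketched as below. SoftCacheSketch is hypothetical, and its load method merely fabricates a placeholder value where a real reader would parse the object from the file.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical object cache keyed by object number. */
final class SoftCacheSketch {
    private final Map<Integer, SoftReference<Object>> cache = new HashMap<>();
    private int loads;

    /** Returns the cached object, reloading it if it was never read or has been collected. */
    Object get(int objNum) {
        SoftReference<Object> ref = cache.get(objNum);
        Object o = (ref != null) ? ref.get() : null;
        if (o == null) {
            o = load(objNum);
            cache.put(objNum, new SoftReference<>(o));
        }
        return o;
    }

    /** Placeholder for the expensive step of parsing an object from the file. */
    private Object load(int objNum) {
        loads++;
        return "object " + objNum;
    }

    /** Number of times load has run, for observing cache hits. */
    int loadCount() { return loads; }

    /** Forgets everything, so later reads start from the file again. */
    void reset() { cache.clear(); }

    public static void main(String[] args) {
        SoftCacheSketch c = new SoftCacheSketch();
        Object first = c.get(3);
        Object second = c.get(3);  // served from cache while 'first' keeps it reachable
        System.out.println((first == second) + " loads=" + c.loadCount());
    }
}
```

Because a softly referenced object can vanish at any time once nothing else refers to it, callers of such a cache should hold their own reference for as long as they need the object.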

fault

public void fault()
           throws java.io.IOException
Faults into cache all objects reachable in document (starting from trailer), and sets unreachable objects (that have not been previously read by the caller) to COS.OBJECT_NULL. Some bad PDFs have cross reference entries to non-existent objects, but these objects aren't referenced by other objects so viewing works fine. This method ensures that a loop over all objects won't encounter an error either.

Throws:
java.io.IOException
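
Finding the objects reachable from the trailer is a graph traversal. A minimal sketch follows, assuming the references held by each object have already been extracted into an int array per object number; ReachabilitySketch and that input format are inventions of this example, and the real method additionally loads each object into the cache as it goes.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

/** Hypothetical reachability pass over a table of object references. */
final class ReachabilitySketch {
    /**
     * refs[i] lists the object numbers that object i refers to.
     * Returns a flag per object number: true if reachable from any root.
     */
    static boolean[] reachable(int[][] refs, int[] roots) {
        boolean[] seen = new boolean[refs.length];
        Deque<Integer> todo = new ArrayDeque<>();
        for (int r : roots) {
            todo.push(r);
        }
        while (!todo.isEmpty()) {
            int n = todo.pop();
            // Ignore dangling references and objects already visited.
            if (n < 0 || n >= refs.length || seen[n]) {
                continue;
            }
            seen[n] = true;
            for (int m : refs[n]) {
                todo.push(m);
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        int[][] refs = { {}, {2}, {1, 3}, {}, {3} };  // object 4 is referenced by nobody
        System.out.println(Arrays.toString(reachable(refs, new int[] {1})));
    }
}
```

An explicit work list is used instead of recursion so that deeply nested documents cannot overflow the call stack.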

countCached

public int countCached()
For performance tuning, returns count of different objects that have been cached (but may have been subsequently garbage collected).


reset

public void reset()
Clears all cached objects, which may have been mutated.


readObject

public java.lang.Object readObject(InputStreamComposite in)
                            throws java.io.IOException
Reads a complete object from a content stream: int, string, dictionary, map, ..., including all subparts.

Throws:
java.io.IOException

eatSpace

public void eatSpace(InputStreamComposite in)
              throws java.io.IOException
Eat whitespace between tokens in content stream.

Throws:
java.io.IOException

readCommandArray

public Cmd[] readCommandArray(java.lang.Object contentstream)
                       throws java.io.IOException
Parses content stream into array of commands. If PDF is encrypted, contentstream must be IRef.

Throws:
java.io.IOException
See Also:
PDFWriter.writeCommandArray(Cmd[], boolean), Cmd

readCommand

public Cmd readCommand(InputStreamComposite in)
                throws java.io.IOException
Parse next command from content stream, or return null if no more. For inline images (BI..ID..EI), expands abbreviations in dictionary (in ops[0]) and strips non-image filters from data (in ops[1]).

Throws:
java.io.IOException
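
In a content stream, operands come first and the operator that consumes them comes last (for example, /F1 12 Tf). A deliberately simplified sketch of grouping tokens into commands is shown below. CommandSketch is hypothetical and unrelated to the library's Cmd class; it recognizes only numbers, names, and operators separated by whitespace, and ignores strings, arrays, dictionaries, comments, and inline images.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical command: an operator plus the operands that preceded it. */
final class CommandSketch {
    final String op;
    final List<Object> operands;

    CommandSketch(String op, List<Object> operands) {
        this.op = op;
        this.operands = operands;
    }

    /** Splits on whitespace and groups operands with the operator that follows them. */
    static List<CommandSketch> parse(String content) {
        List<CommandSketch> cmds = new ArrayList<>();
        List<Object> operands = new ArrayList<>();
        for (String tok : content.trim().split("\\s+")) {
            if (tok.isEmpty()) {
                continue;
            }
            char c = tok.charAt(0);
            if (c == '/') {
                operands.add(tok);                  // name operand, kept with its slash
            } else if (Character.isDigit(c) || c == '-' || c == '+' || c == '.') {
                operands.add(Double.valueOf(tok));  // numeric operand
            } else {
                cmds.add(new CommandSketch(tok, operands));
                operands = new ArrayList<>();       // next command starts with no operands
            }
        }
        return cmds;
    }

    public static void main(String[] args) {
        for (CommandSketch cmd : parse("BT /F1 12 Tf 72 720 Td ET")) {
            System.out.println(cmd.op + " " + cmd.operands);
        }
    }
}
```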

readInlineImage

public Dict readInlineImage(InputStreamComposite in)
                     throws java.io.IOException
Parse inline image from stream into a dictionary with its attributes and the data in a COS.CLASS_DATA under key COS.STREAM_DATA. Abbreviated keys (but not values) are expanded (e.g., /F => /Filter, but not the color space value G => DeviceGray), and non-image filters on the data (such as LZW) are removed. On entry input stream should be placed after the BI and following whitespace; on exit input stream is immediately after closing EI.

Throws:
java.io.IOException

getFileSpecification

public java.net.URI getFileSpecification(java.lang.Object spec)
                                  throws java.io.IOException
Converts simple or full file specification into a platform-independent URI.

Throws:
java.io.IOException

getFileInputStream

public java.io.InputStream getFileInputStream(java.lang.Object spec)
                                       throws java.io.IOException
Given a PDF external file specification, which can be a local file or network URI, returns a stream of data. This may involve fetching the file over the network and writing files to the file system. Files may happen to have their data embedded. Client may want to wrap return value in a BufferedInputStream. If file is not found, returns null.

Throws:
java.io.IOException

getColorSpace

public java.awt.color.ColorSpace getColorSpace(java.lang.Object csref,
                                               Dict csres,
                                               Dict patres)
                                        throws java.io.IOException
ColorSpaces.createColorSpace(Object, PDFReader) with caching.

Throws:
java.io.IOException

getImage

public java.awt.image.BufferedImage getImage(IRef imgdictref,
                                             java.awt.geom.AffineTransform ctm,
                                             java.awt.Color fillcolor)
                                      throws java.io.IOException
Images.createImage(Dict, InputStream, AffineTransform, Color, PDFReader) with caching (under key COS.REALIZED).

Throws:
java.io.IOException

getFont

public NFont getFont(Dict fd,
                     float size,
                     java.awt.geom.AffineTransform Tm,
                     PDF pdf)
              throws java.io.IOException
Fonts#createFont(Dict,float,AffineTransform,Dict,PDF,PDFReader) with caching and scaling. The created font is stored in the font dictionary in a SoftReference under key #REALIZED.

Throws:
java.io.IOException

getCMap

public com.pt.awt.font.CMap getCMap(java.lang.Object ref)
                             throws java.io.IOException
Returns CMap for Encoding or ToUnicode.

Throws:
java.io.IOException

findNameTree

public java.lang.Object findNameTree(Dict root,
                                     java.lang.StringBuffer name)
                              throws java.io.IOException
Find name in name tree rooted at root and return its associated value. Used for Dests, AP, JavaScript, Pages, Templates, IDS, URLS. Note that the keys of a name tree are PDF strings, written (String), not names.

Returns:
null if name is not found in tree or if root is null
Throws:
java.io.IOException
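
A name tree, as defined in the PDF Reference, is built from nodes that hold either /Kids (child nodes) or /Names (alternating key and value), with /Limits on non-root nodes giving the smallest and largest key beneath them. The lookup can be sketched as follows. This is an illustration under stated assumptions, not the library's code: NameTreeSketch is hypothetical, nodes are plain Maps and Lists with all references already resolved, keys are Java Strings, and String.compareTo stands in for the byte-wise ordering PDF uses, which matches only for ASCII keys.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical name tree lookup over already-resolved Maps and Lists. */
final class NameTreeSketch {
    /** Returns the value stored under name, or null if absent or if node is null. */
    static Object find(Map<String, Object> node, String name) {
        if (node == null) {
            return null;
        }
        // /Limits lets whole subtrees be skipped without visiting them.
        Object limits = node.get("Limits");
        if (limits instanceof List && ((List<?>) limits).size() == 2) {
            List<?> lim = (List<?>) limits;
            if (name.compareTo((String) lim.get(0)) < 0
                    || name.compareTo((String) lim.get(1)) > 0) {
                return null;
            }
        }
        // Leaf node: /Names is [key1 value1 key2 value2 ...].
        Object names = node.get("Names");
        if (names instanceof List) {
            List<?> pairs = (List<?>) names;
            for (int i = 0; i + 1 < pairs.size(); i += 2) {
                if (name.equals(pairs.get(i))) {
                    return pairs.get(i + 1);
                }
            }
        }
        // Intermediate node: try each kid in turn.
        Object kids = node.get("Kids");
        if (kids instanceof List) {
            for (Object kid : (List<?>) kids) {
                if (kid instanceof Map) {
                    @SuppressWarnings("unchecked")
                    Map<String, Object> child = (Map<String, Object>) kid;
                    Object value = find(child, name);
                    if (value != null) {
                        return value;
                    }
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, Object> leaf = new HashMap<>();
        leaf.put("Limits", Arrays.asList("chap1", "chap2"));
        leaf.put("Names", Arrays.asList("chap1", "dest A", "chap2", "dest B"));
        Map<String, Object> root = new HashMap<>();
        root.put("Kids", Arrays.asList(leaf));
        System.out.println(find(root, "chap2"));
    }
}
```

A linear scan of each leaf keeps the sketch short; since keys are sorted, a binary search within /Names and across /Kids would also work.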

findNumberTree

public java.lang.Object findNumberTree(Dict root,
                                       int number)
                                throws java.io.IOException
Find number in number tree rooted at root and return its associated value. Used for PageLabels, ParentTree in structure tree root.

Returns:
null if number is not found in tree.
Throws:
java.io.IOException
