multivalent.std.adaptor.pdf
Class PDFReader
public class PDFReader extends COSSource
Parse Adobe's Portable Document Format (PDF) and construct low-level objects (COS in Adobe terminology: string, number, dictionary)
and high-level Java objects (Font, Image).
Based on Adobe's PDF 1.5 Reference, available online.
This class provides easy access to all parts of a PDF for developers familiar with the PDF Reference.
At this time PDF files of up to 4GB in length are supported.
- Constructors and file control:
PDFReader, PDFReader,
getFile, getRAF, close
- Low-level structure:
getLinearized,
getTrailer, getStartXRef,
getObjCnt, PDFReader, PDFReader, PDFReader
- Bytes on disk to PDF/Java data structure:
PDFReader, getObject,
PDFReader,
readObject (using eatSpace and readInt),
PDFReader / getInputStream, PDFReader,
countCached
- Structure:
getCatalog, getInfo, getMetadata,
getEncrypt, setPassword, isAuthorized,
getPageCnt, PDFReader, PDFReader, getPageNum
- Page content stream:
readObject (using eatSpace),
readCommand, readInlineImage,
readCommandArray
- Higher level Java objects:
getFileSpecification, getFileInputStream,
PDFReader, PDFReader,
PDFReader, getCMap
- Query:
PDFReader, PDFReader
- Repair:
If an error is found in the xref table, an attempt to repair it is made by sequentially reading the entire PDF
looking for objects as indicated by "m n obj" ... "endobj" at the start of lines.
If needed, this is done automatically and transparently.
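The sequential scan described above can be sketched in a few lines. This is a minimal illustration, not this class's actual repair code: the names XrefRepairSketch and scanObjects are hypothetical, and the sketch assumes the whole file fits in memory and that every object header begins a line. Binary stream data can produce false matches, so a real repair would need further checks.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: rebuild an object-number to byte-offset table by scanning. */
final class XrefRepairSketch {
    // "m n obj" at the start of a line: object number, generation number, keyword.
    private static final Pattern OBJ_HEADER =
        Pattern.compile("(?m)^(\\d+)[ \\t]+(\\d+)[ \\t]+obj\\b");

    /** Returns object number -> offset of its header; a later definition replaces an earlier one. */
    static Map<Integer, Long> scanObjects(byte[] pdf) {
        // ISO-8859-1 maps each byte to exactly one char, so char index equals byte offset.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        Map<Integer, Long> offsets = new LinkedHashMap<>();
        Matcher m = OBJ_HEADER.matcher(text);
        while (m.find()) {
            offsets.put(Integer.parseInt(m.group(1)), (long) m.start());
        }
        return offsets;
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.5\n1 0 obj\n<< /Type /Catalog >>\nendobj\n2 0 obj\n(hi)\nendobj\n"
            .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(scanObjects(pdf));
    }
}
```

Letting later headers replace earlier ones mirrors incremental updates, where a newer copy of an object is appended after the older one.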
isModified, isRepaired
- Modernization:
Older versions of PDF are updated to the current specification (presently 1.5);
for instance, older PDF stored all named destinations in a single dictionary, whereas current PDF uses a name tree.
Turned off with PDFReader.
How to use this class
A PDF file is a series of numbered "COS" objects (strings, integers, dictionaries, streams, references to other objects, ...)
that can be interpreted as images, streams of page-building commands, annotations, and so on, as described in the PDF Reference.
To access these objects, first create an instance on a .pdf file with a constructor.
Now all the PDF's objects are available by number via PDFReader from 1 to getObjCnt.
If an object refers to another by an indirect reference (IRef), getObject will follow the reference to the actual object.
PDF objects are represented with basic Java data types, e.g., PDF dictionaries as Java Maps, with the complete correspondence given by CLASS_* constants.
The data for images and page contents are kept in streams, readable uncompressed and decrypted with getInputStream.
At a higher level, you can ask this API for pages and to transform them into Java images, fonts, colorspaces, and so on.
Get a particular page's dictionary by number from 1 (not 0) to getPageCnt, inclusive, with PDFReader.
Pages' content streams that describe page appearance can be parsed into individual commands with readCommand.
High-level versions of images, colorspaces, and fonts are available by passing the PDF object to the appropriate method.
See Also
- Source code Info and Validate for examples of use.
- PDF to display pages
- PDFWriter to write new PDF data format from Java data structures
Other PDF manipulation libraries:
- Adobe's Acrobat Core API
has various layers of interfaces, including a high level one
that has many functions for specific actions like adding annotations.
An advantage of this is that the programming language helps ensure correctness of the PDF
by enforcing fixed sets of type-checked arguments.
A disadvantage is that you have nearly 3000(!) pages of API to master,
as compared to about 1000 for the PDF Specification itself.
In contrast, the Multivalent PDF API is, relatively, just a handful of methods
and furthermore uses common Java objects (java.util.Map for PDF dictionary) that you likely already know.
On the other hand, it does not aid correctness with type checking
or provide high level interfaces to complex operations.
- Enfocus Browser
- PDFBox
- JPedal
Version: $Revision: 1.91 $ $Date: 2003/12/30 02:19:23 $
Method Summary |
void | close()
Close use and free up resources.
|
int | countCached()
For performance tuning,
returns count of different objects that have been cached (but may have been subsequently garbage collected). |
void | eatSpace(RandomAccess raf) Eat whitespace between tokens in COS object. |
void | eatSpace(InputStreamComposite in) Eat whitespace between tokens in content stream. |
void | fault()
Faults into cache all objects reachable in document (starting from trailer),
and sets unreachable objects (that have not been previously read by the caller) to OBJECT_NULL.
|
Object | findNameTree(Dict root, StringBuffer name)
Find name in name tree rooted at root and return its associated value.
|
Object | findNumberTree(Dict root, int number)
Find number in number tree rooted at root and return its associated value.
|
Dict | getCatalog()
Returns Document catalog.
|
CMap | getCMap(Object ref) |
ColorSpace | getColorSpace(Object csref, Dict csres, Dict patres) |
Encrypt | getEncrypt()
Returns document-wide encryption manager.
|
File | getFile() Returns associated PDF java.io.File, if any. |
InputStream | getFileInputStream(Object spec)
Given a PDF external file specification, which can be a local file or network URI, returns a stream of data.
|
URI | getFileSpecification(Object spec)
Converts simple or full file specification into a platform-independent URI. |
NFont | getFont(Dict fd, float size, AffineTransform Tm, PDF pdf)
Fonts with cacheing and scaling.
|
BufferedImage | getImage(IRef imgdictref, AffineTransform ctm, Color fillcolor) |
Dict | getInfo()
Returns /Info dictionary from trailer. |
InputStreamComposite | getInputStream(Object o, boolean iscontent)
Given indirect reference to stream dictionary or array of such references,
returns stream of uncompressed and decrypted data.
|
InputStreamComposite | getInputStream(Object o) Same as PDFReader, assuming not a content stream. |
int | getLinearized() If document is linearized, returns integer > 0 that is object number of linearization dictionary. |
int | getMajorVersion() |
String | getMetadata(Object o)
Returns metadata associated with object, or return 0-length String if none.
|
int | getMinorVersion() |
int | getObjCnt() Returns number of objects, numbered from 0. |
Object | getObject(Object ref)
Returns referenced object, following any indirect references to concrete objects.
|
Object | getObject(int num)
Returns object from xref table offset at point num, from 0 to getObjCnt, taking from cache if available.
|
int | getObjGen(int objnum)
Returns object's generation number. |
long | getObjOff(int objnum)
Returns object's byte offset in file.
|
byte | getObjType(int objnum) |
Dict | getPage(int pagenum)
Given page number, finds the corresponding page dictionary.
|
int | getPageCnt() Returns number of pages in document. |
int | getPageNum(Dict page) |
IRef | getPageRef(int pagenum)
Given page number, finds the corresponding page object.
|
RandomAccess | getRAF() |
static Double | getReal(double val) |
long | getStartXRef() File offset of (last) trailer, which is needed for incremental updates. |
byte[] | getStreamData(Object ref, boolean fraw, boolean fcache)
Returns entire content of input stream.
|
Dict | getTrailer()
Document trailer.
|
boolean | isAuthorized() |
boolean | isModified() Modified, perhaps because repaired or annotated. |
boolean | isRepaired() |
Cmd | readCommand(InputStreamComposite in)
Parse next command from content stream, or return null if no more.
|
Cmd[] | readCommandArray(Object contentstream)
Parses content stream into array of commands.
|
Dict | readInlineImage(InputStreamComposite in)
Parse inline image from stream into a dictionary with its attributes
and the data in a CLASS_DATA under key STREAM_DATA.
|
int | readInt(RandomAccess raf) Read positive integer from file. |
Object | readObject()
Returns next COS object from current file position, which may not be a top-level object starting with m n obj .
|
Object | readObject(InputStreamComposite in)
Reads a complete object from a content stream: int, string, dictionary, map, ..., including all subparts |
Dict | readXref(boolean all)
Reads cross-reference table and returns its trailer.
|
void | setExact(boolean b) |
boolean | setPassword(String password)
Set password, returning true if document can be read unencrypted.
|
public PDFReader(File file)
Construct an instance corresponding to the .pdf file.
Java strings can be converted to a File with new File(string), URIs and URLs with new File(getPath())
public void close()
Close use and free up resources.
This class should be closed when no longer used.
public int countCached()
For performance tuning,
returns count of different objects that have been cached (but may have been subsequently garbage collected).
public void eatSpace(RandomAccess raf)
Eat whitespace between tokens in COS object.
public void eatSpace(InputStreamComposite in)
Eat whitespace between tokens in content stream.
public void fault()
Faults into cache all objects reachable in document (starting from trailer),
and sets unreachable objects (that have not been previously read by the caller) to OBJECT_NULL.
Some bad PDFs have cross reference entries to non-existent objects,
but these objects aren't referenced by other objects so viewing works fine.
This method ensures that a loop over all objects won't encounter an error either.
public Object findNameTree(Dict root, StringBuffer name)
Find name in name tree rooted at root and return its associated value.
Used for Dests, AP, JavaScript, Pages, Templates, IDS, URLS.
Yes, keys of name tree are (String)'s.
Returns: null if name is not found in tree or if root is null
public Object findNumberTree(Dict root, int number)
Find number in number tree rooted at root and return its associated value.
Used for PageLabels, ParentTree in structure tree root.
Returns: null if number is not found in tree.
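As an illustration of the lookup, here is a minimal sketch of an exact-match search through a number tree, assuming the layout in the PDF Reference: leaf nodes hold a Nums array of alternating keys and values, intermediate nodes hold Kids, and non-root nodes carry Limits. The names NumberTreeSketch, find, and node are hypothetical. Plain Java maps and lists stand in for PDF dictionaries and arrays, and kids are stored directly, not as indirect references.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of an exact-match number tree lookup. */
final class NumberTreeSketch {
    /** Returns the value stored under number, or null if absent or if node is null. */
    static Object find(Map<String, Object> node, int number) {
        if (node == null) return null;
        // Limits = [least greatest]; skip subtrees that cannot contain the key.
        List<?> limits = (List<?>) node.get("Limits");
        if (limits != null
            && (number < (Integer) limits.get(0) || number > (Integer) limits.get(1))) {
            return null;
        }
        // Leaf node: Nums = [key1 value1 key2 value2 ...].
        List<?> nums = (List<?>) node.get("Nums");
        if (nums != null) {
            for (int i = 0; i + 1 < nums.size(); i += 2) {
                if ((Integer) nums.get(i) == number) return nums.get(i + 1);
            }
        }
        // Intermediate node: descend into each kid.
        List<?> kids = (List<?>) node.get("Kids");
        if (kids != null) {
            for (Object kid : kids) {
                @SuppressWarnings("unchecked")
                Object hit = find((Map<String, Object>) kid, number);
                if (hit != null) return hit;
            }
        }
        return null;
    }

    /** Builds a dictionary from alternating key/value arguments. */
    static Map<String, Object> node(Object... kv) {
        Map<String, Object> m = new HashMap<>();
        for (int i = 0; i + 1 < kv.length; i += 2) m.put((String) kv[i], kv[i + 1]);
        return m;
    }

    public static void main(String[] args) {
        Map<String, Object> leafA =
            node("Limits", Arrays.asList(0, 4), "Nums", Arrays.asList(0, "i", 4, "1"));
        Map<String, Object> leafB =
            node("Limits", Arrays.asList(10, 20), "Nums", Arrays.asList(10, "A-1"));
        Map<String, Object> root = node("Kids", Arrays.asList(leafA, leafB));
        System.out.println(find(root, 10));
        System.out.println(find(root, 7));
    }
}
```

A name tree has the same shape with a Names array and string keys, so the same traversal applies with a string comparison.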
public Dict getCatalog()
Returns Document catalog.
Required keys: Type (=='Catalog'), Pages (dictionary).
Optional keys: PageLabels (number tree),
Names (dictionary), Dests (dictionary), ViewerPreferences (dictionary),
PageLayout (name), PageMode (name), Outlines (dictionary), Threads (array), OpenAction (array or dictionary),
URI (dictionary), AcroForm (dictionary), StructTreeRoot (dictionary), SpiderInfo (dictionary)
public CMap getCMap(Object ref)
public ColorSpace getColorSpace(Object csref, Dict csres, Dict patres)
public Encrypt getEncrypt()
Returns document-wide encryption manager.
User of class should set the password, if any, through this object.
If the password is null/empty, the password is automatically set.
See Also: authUser isAuthorized
public File getFile()
Returns associated PDF java.io.File, if any.
public InputStream getFileInputStream(Object spec)
Given a PDF external file specification, which can be a local file or network URI, returns a stream of data.
This may involve fetching the file over the network and writing files to the file system.
Files may happen to have their data embedded.
Client may want to wrap return value in a java.io.BufferedInputStream.
If file is not found, returns null.
public URI getFileSpecification(Object spec)
Converts simple or full file specification into a platform-independent URI.
public NFont getFont(Dict fd, float size, AffineTransform Tm, PDF pdf)
Fonts with caching and scaling.
The created font is stored in the font dictionary, in a SoftReference, under key PDFReader.
public BufferedImage getImage(IRef imgdictref, AffineTransform ctm, Color fillcolor)
public Dict getInfo()
Returns /Info dictionary from trailer. Normalizes to remove 0-length values.
Optional keys: Title (string), Author (string), Subject (string), Keywords (string), Creator (string), Producer (string),
CreationDate (date), ModDate (date), Trapped (name).
Returns: null if no /Info.
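public InputStreamComposite getInputStream(Object o, boolean iscontent)
Given indirect reference to stream dictionary or array of such references,
returns stream of uncompressed and decrypted data.
(Images are not uncompressed here.)
public InputStreamComposite getInputStream(Object o)
Same as PDFReader, assuming not a content stream.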
public int getLinearized()
If document is linearized, returns integer > 0 that is object number of linearization dictionary.
public int getMajorVersion()
public String getMetadata(Object o)
Returns metadata associated with object, or return 0-length String if none.
To obtain the metadata for the document as a whole, pass the document catalog.
public int getMinorVersion()
public int getObjCnt()
Returns number of objects, numbered from 0.
public Object getObject(Object ref)
Returns referenced object, following any indirect references to concrete objects.
In contrast to other methods, ref can be a Java null,
so one can easily fully resolve an object that may or may not be present in a dictionary with a getObject(dict.get("key")).
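The null-tolerant, reference-following behavior described above can be illustrated with a toy object table. This is a sketch under stated assumptions, not the library's code: ResolveSketch, Ref, put, and resolve are hypothetical names, and objects live in an in-memory map instead of being read from a file.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: resolve indirect references against an in-memory object table. */
final class ResolveSketch {
    /** Stand-in for an indirect reference such as "12 0 R". */
    static final class Ref {
        final int num;
        Ref(int num) { this.num = num; }
    }

    private final Map<Integer, Object> objects = new HashMap<>();

    void put(int num, Object value) { objects.put(num, value); }

    /** Follows references until a concrete object (or null) is reached. */
    Object resolve(Object o) {
        int hops = 0;
        while (o instanceof Ref) {
            // A chain longer than the table itself must contain a cycle.
            if (++hops > objects.size()) throw new IllegalStateException("reference cycle");
            o = objects.get(((Ref) o).num);
        }
        return o;
    }

    public static void main(String[] args) {
        ResolveSketch doc = new ResolveSketch();
        Map<String, Object> page = new HashMap<>();
        page.put("Contents", new Ref(2));
        doc.put(1, page);
        doc.put(2, new Ref(3));      // a reference to a reference
        doc.put(3, "stream data");
        System.out.println(doc.resolve(page.get("Contents")));
        System.out.println(doc.resolve(page.get("Missing")));  // null in, null out
    }
}
```

Because a missing dictionary key yields null and null passes straight through, the caller needs no separate presence check before resolving.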
public Object getObject(int num)
Returns object from xref table offset at point num, from 0 to getObjCnt, taking from cache if available.
Object is decrypted if necessary.
All objects are cached, with java.lang.ref.SoftReferences so they are automatically garbage collected when memory is tight.
If the object is a stream, its contents are not read, but the file position of the data (a java.lang.Long) is stored under a new injected key STREAM_DATA.
If the object has been freed ('f' in xref table), OBJECT_DELETED is returned; the old object is not available.
Object number is an int, not a long, so it can handle only 2,147,483,647 of the possible 9,999,999,999 objects in a PDF,
but even very large PDFs seldom have more than 100,000 objects.
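The caching policy described above can be sketched with java.lang.ref.SoftReference. This is an illustration only: SoftCacheSketch, its loader function, and the loads counter are hypothetical and say nothing about how this class organizes its cache internally.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

/** Hypothetical sketch: object cache whose entries the garbage collector may reclaim. */
final class SoftCacheSketch {
    private final Map<Integer, SoftReference<Object>> cache = new HashMap<>();
    private final IntFunction<Object> loader;   // stands in for parsing the object from disk
    int loads = 0;                              // how many times the loader actually ran

    SoftCacheSketch(IntFunction<Object> loader) { this.loader = loader; }

    Object get(int num) {
        SoftReference<Object> ref = cache.get(num);
        // ref.get() returns null once the referent has been collected.
        Object value = (ref != null) ? ref.get() : null;
        if (value == null) {
            value = loader.apply(num);
            loads++;
            cache.put(num, new SoftReference<>(value));
        }
        return value;
    }

    public static void main(String[] args) {
        SoftCacheSketch cache = new SoftCacheSketch(n -> "object " + n);
        Object first = cache.get(5);
        Object second = cache.get(5);
        System.out.println(first == second);
        System.out.println(cache.loads);
    }
}
```

A softly reachable entry survives until memory runs short, so repeated lookups are cheap while large documents do not pin every parsed object in memory.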
public int getObjGen(int objnum)
Returns object's generation number. Generations are used in incremental writing and encryption.
In PDF 1.5 this is the cross reference table's field 1.
public long getObjOff(int objnum)
Returns object's byte offset in file.
In PDF 1.5 this is the cross reference table's field 1.
N.B. Points to the object header n g obj, not to start of content.
public byte getObjType(int objnum)
public Dict getPage(int pagenum)
Given page number, finds the corresponding page dictionary.
Populates inheritable attributes by climbing parents as necessary.
To get the page dictionary without inherited attributes, use getObject(getPageRef(pagenum)).
Pages are numbered 1..getPageCnt, inclusive.
Reverse of getPageNum.
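Climbing parents for inheritable attributes can be sketched as follows. The attribute list (Resources, MediaBox, CropBox, Rotate) follows the PDF Reference. The names InheritSketch and withInherited are hypothetical, and the sketch assumes each /Parent entry holds the parent dictionary directly, whereas real files use indirect references.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: fill in inheritable page attributes from ancestor nodes. */
final class InheritSketch {
    // Page attributes that the PDF Reference marks as inheritable.
    static final String[] INHERITABLE = { "Resources", "MediaBox", "CropBox", "Rotate" };

    /** Returns a copy of page with missing inheritable attributes taken from the nearest ancestor. */
    static Map<String, Object> withInherited(Map<String, Object> page) {
        Map<String, Object> result = new HashMap<>(page);
        for (String key : INHERITABLE) {
            Object node = page;
            int depth = 0;   // guard against malformed, cyclic parent chains
            while (!result.containsKey(key) && node instanceof Map && depth++ < 64) {
                Map<?, ?> dict = (Map<?, ?>) node;
                if (dict.containsKey(key)) result.put(key, dict.get(key));
                node = dict.get("Parent");
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Object> root = new HashMap<>();
        root.put("MediaBox", "[0 0 612 792]");
        root.put("Rotate", 0);
        Map<String, Object> page = new HashMap<>();
        page.put("Parent", root);
        page.put("Rotate", 90);   // the page's own value overrides the inherited one
        System.out.println(withInherited(page).get("MediaBox"));
        System.out.println(withInherited(page).get("Rotate"));
    }
}
```

Working on a copy leaves the original dictionary untouched, which matches the distinction drawn above between the populated page dictionary and the raw one.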
public int getPageCnt()
Returns number of pages in document.
public int getPageNum(Dict page)
public IRef getPageRef(int pagenum)
Given page number, finds the corresponding page object.
Pages are numbered PDF-style: 1..getPageCnt, inclusive.
If object in that position is not a /Type /Page, returns null (not OBJECT_NULL).
public RandomAccess getRAF()
Returns associated RandomAccess.
Clients should not cache the return value since the RandomAccess can change.
public static Double getReal(double val)
public long getStartXRef()
File offset of (last) trailer, which is needed for incremental updates.
public byte[] getStreamData(Object ref, boolean fraw, boolean fcache)
Returns entire content of input stream.
For a stream, use PDFReader and read out the data with java.io.InputStream#read().
Parameters:
ref - stream dictionary, or indirect ref to stream dictionary. If PDF is encrypted, must be indirect ref (perhaps freshly created for this purpose, with the right generation number).
fraw - raw data, not passed through filters
fcache - if true save data under STREAM_DATA key, remove PDF /Length key, and strip out non-image filters from Filter value
Returns: null if dictionary is not a stream
public Dict getTrailer()
Document trailer.
Required keys: Size, Root (to catalog), ID. (If no ID exists, one is created.)
Optional keys: Encrypt, Info.
public boolean isAuthorized()
public boolean isModified()
Modified, perhaps because repaired or annotated.
public boolean isRepaired()
public Cmd readCommand(InputStreamComposite in)
Parse next command from content stream, or return null if no more.
For inline images (BI..ID..EI), expands abbreviations in dictionary (in ops[0]) and strips non-image filters from data (in ops[1]).
public Cmd[] readCommandArray(Object contentstream)
Parses content stream into array of commands.
If PDF is encrypted, contentstream must be IRef.
See Also: PDFWriter Cmd
public Dict readInlineImage(InputStreamComposite in)
Parse inline image from stream into a dictionary with its attributes
and the data in a CLASS_DATA under key STREAM_DATA.
Abbreviated keys (but not values) are expanded (e.g., /F => /Filter, but not the color space value G => DeviceGray),
and non-image filters on the data (such as LZW) are removed.
On entry input stream should be placed after the BI and following whitespace;
on exit input stream is immediately after closing EI.
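The key expansion can be sketched with a lookup table of the inline-image abbreviations defined in the PDF Reference. The names InlineImageKeysSketch, FULL_KEY, and expandKeys are hypothetical, and keys are written without the leading slash. As described above, only keys are expanded and values pass through unchanged.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: expand abbreviated inline-image dictionary keys. */
final class InlineImageKeysSketch {
    // Abbreviated key -> full key, per the PDF Reference's inline image entries.
    static final Map<String, String> FULL_KEY = new HashMap<>();
    static {
        FULL_KEY.put("BPC", "BitsPerComponent");
        FULL_KEY.put("CS", "ColorSpace");
        FULL_KEY.put("D", "Decode");
        FULL_KEY.put("DP", "DecodeParms");
        FULL_KEY.put("F", "Filter");
        FULL_KEY.put("H", "Height");
        FULL_KEY.put("IM", "ImageMask");
        FULL_KEY.put("I", "Interpolate");
        FULL_KEY.put("W", "Width");
    }

    /** Returns a new dictionary with abbreviated keys expanded; values are left as-is. */
    static Map<String, Object> expandKeys(Map<String, Object> abbreviated) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : abbreviated.entrySet()) {
            // Unknown or already-full keys are copied unchanged.
            out.put(FULL_KEY.getOrDefault(e.getKey(), e.getKey()), e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> dict = new LinkedHashMap<>();
        dict.put("W", 8);
        dict.put("H", 8);
        dict.put("CS", "G");   // value abbreviation stays as written
        System.out.println(expandKeys(dict));
    }
}
```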
public int readInt(RandomAccess raf)
Read positive integer from file.
public Object readObject()
Returns next COS object from current file position, which may not be a top-level object starting with m n obj.
Ordinarily you want PDFReader or getObject instead.
Comments, which are rare, are lost; the following object is returned.
Keywords that are not boolean or null are returned as java.lang.Strings.
Precondition: file pointer at start of token.
Postcondition: Eat following whitespace to bring file pointer to start of next token.
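The precondition and postcondition above amount to a simple tokenizer contract: each read starts at a token and leaves the position at the start of the next one. A minimal sketch over an in-memory byte array, assuming the six whitespace characters defined by the PDF Reference (NUL, TAB, LF, FF, CR, SPACE); TokenSketch is hypothetical, and its eatSpace and readInt only mirror the names of the methods documented here, not their signatures.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: whitespace skipping and integer reading over a byte array. */
final class TokenSketch {
    private final byte[] buf;
    private int pos = 0;

    TokenSketch(byte[] buf) { this.buf = buf; }

    /** PDF whitespace: NUL, TAB, LF, FF, CR, SPACE. */
    static boolean isPdfWhitespace(int c) {
        return c == 0 || c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
    }

    /** Advances past whitespace so the position is at the start of the next token. */
    void eatSpace() {
        while (pos < buf.length && isPdfWhitespace(buf[pos] & 0xff)) pos++;
    }

    /** Reads a non-negative decimal integer, then eats the following whitespace. */
    int readInt() {
        int start = pos, value = 0;
        while (pos < buf.length && buf[pos] >= '0' && buf[pos] <= '9') {
            value = value * 10 + (buf[pos++] - '0');
        }
        if (pos == start) throw new IllegalStateException("expected a digit at offset " + pos);
        eatSpace();   // postcondition: positioned at the start of the next token
        return value;
    }

    int position() { return pos; }

    public static void main(String[] args) {
        TokenSketch t = new TokenSketch("12 0 obj".getBytes(StandardCharsets.US_ASCII));
        System.out.println(t.readInt() + " " + t.readInt() + " then offset " + t.position());
    }
}
```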
public Object readObject(InputStreamComposite in)
Reads a complete object from a content stream: int, string, dictionary, map, ..., including all subparts.
public Dict readXref(boolean all)
Reads cross-reference table and returns its trailer.
If all flag is true, read entire table, chaining from trailer to trailer via /Prev.
Precondition: file pointer is at start of xref table, at the start of the xref keyword.
Usually the cross-reference table is read automatically at startup.
Returns: PDF trailer dictionary
See Also: PDFReader PDFReader getObjCnt
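Chaining from trailer to trailer via /Prev can be pictured as walking a linked list of xref sections, newest first. The sketch below is an assumption-laden illustration: PrevChainSketch and chain are hypothetical names, and trailers are supplied as an in-memory map keyed by file offset, not parsed from a file.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: follow /Prev links from the newest xref section to the oldest. */
final class PrevChainSketch {
    /** Returns xref section offsets, newest first, starting at startOffset. */
    static List<Long> chain(Map<Long, Map<String, Object>> trailersByOffset, long startOffset) {
        List<Long> order = new ArrayList<>();
        Set<Long> seen = new HashSet<>();   // guards against a /Prev loop in a damaged file
        Long offset = startOffset;
        while (offset != null && seen.add(offset)) {
            Map<String, Object> trailer = trailersByOffset.get(offset);
            if (trailer == null) break;     // dangling /Prev
            order.add(offset);
            Object prev = trailer.get("Prev");
            offset = (prev instanceof Number) ? Long.valueOf(((Number) prev).longValue()) : null;
        }
        return order;
    }

    public static void main(String[] args) {
        Map<Long, Map<String, Object>> trailers = new HashMap<>();
        Map<String, Object> newest = new HashMap<>();
        newest.put("Prev", 400L);
        Map<String, Object> middle = new HashMap<>();
        middle.put("Prev", 100L);
        trailers.put(900L, newest);
        trailers.put(400L, middle);
        trailers.put(100L, new HashMap<String, Object>());   // original section, no /Prev
        System.out.println(chain(trailers, 900L));
    }
}
```

Processing sections newest first means the first entry seen for an object number is the current one, which is how incremental updates override older definitions.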
public void setExact(boolean b)
As a PDF is read in, COS objects are normalized.
Set exact to true to prevent this.
This should be set immediately after instantiation and setting the encryption password, if any;
after this point it cannot be changed, because that would leave some parts updated and others not, and there would be conflicts.
public boolean setPassword(String password)
Set password, returning true if document can be read unencrypted.
Document may be unencrypted for several reasons: not encrypted, password is null and so automatically unlocked, password is correct, or password correctly set earlier.
Once the password is correctly set, it may not be unset.