Multivalent API
java.lang.Object
  multivalent.std.adaptor.pdf.COSSource
    multivalent.std.adaptor.pdf.PDFReader
Parse Adobe's Portable Document Format (PDF) and construct low-level objects (COS in Adobe terminology: string, number, dictionary) and high-level Java objects (Font, Image). Also parses Forms Data Format (FDF), which can hold form and annotation data. Based on Adobe's PDF 1.5 Reference, available online. This class provides easy access to all parts of a PDF for developers familiar with the PDF Reference. At this time PDF files of up to 4GB in length are supported.
PDFReader(File)
, PDFReader(InputUni)
,
getURI()
, getRA()
, close()
getLinearized()
,
getTrailer()
, getStartXRef()
,
getObjCnt()
, getObjOff(int)
, getObjGen(int)
, getObjType(int)
getObject(int)
, getObject(Object)
,
fault()
,
readObject()
(using eatSpace(RandomAccess)
and readInt(RandomAccess)
),
getInputStream(Object, boolean)
/ getInputStream(Object)
, getStreamData(Object, boolean, boolean)
,
countCached()
getCatalog()
, getInfo()
, getMetadata(Object)
,
getEncrypt()
, setPassword(String)
, isAuthorized()
,
getPageCnt()
, getPageRef(int)
, getPage(int)
, getPageNum(Dict)
readObject(InputStreamComposite)
(using eatSpace(InputStreamComposite)
),
readCommand(InputStreamComposite)
, readInlineImage(InputStreamComposite)
,
readCommandArray(Object)
getFileSpecification(Object)
, getFileInputStream(Object)
,
getColorSpace(Object, Dict, Dict)
, getImage(IRef, AffineTransform, Color)
,
getFont(Dict, float, AffineTransform, PDF)
, getCMap(Object)
,
findNameTree(Dict, StringBuffer)
, findNumberTree(Dict, int)
m n obj
" ... "endobj
" at the start of lines.
If needed, this is done automatically and transparently.
isModified()
, isRepaired()
setExact(boolean)
.
getObject(int)
from 1 to getObjCnt()
.
If an object refers to another by an indirect reference (IRef
), getObject(Object)
will follow the reference to the actual object.
PDF objects are represented with basic Java data types, e.g., PDF dictionaries as Java Maps, with the complete correspondence given by CLASS_* constants.
The data for images and page contents are kept in streams, readable uncompressed and decrypted with getInputStream(Object)
.
At a higher level, you can ask this API for pages and to transform them into Java images, fonts, colorspaces, and so on.
Get a particular page's dictionary by number from 1 (not 0) to getPageCnt()
, inclusive, with getPage(int)
.
Pages' content streams that describe page appearance can be parsed into individual commands with readCommand(InputStreamComposite)
.
High-level versions of images, colorspaces, and fonts are available by passing the PDF object to the appropriate method.
tool.pdf.Info
and tool.pdf.Validate
for examples of use.
PDF
to display pages
PDFWriter
to write new PDF data format from Java data structures
Constructor Summary | |
---|---|
PDFReader(java.io.File file)
Constructs new instance corresponding to the .pdf file. |
|
PDFReader(com.pt.io.InputUni iu)
Constructs new instance corresponding to the InputUni . |
|
PDFReader(java.lang.Object[] objs,
Dict trailer)
Constructs new instance given the data structures of a PDF (for experts). |
Method Summary | |
---|---|
void |
close()
Closes the reader and frees resources, including file descriptors. |
int |
countCached()
For performance tuning, returns the count of distinct objects that have been cached (but may have been subsequently garbage collected). |
void |
eatSpace(InputStreamComposite in)
Eat whitespace between tokens in content stream. |
void |
eatSpace(com.pt.io.RandomAccess ra)
Eat whitespace between tokens in COS object. |
void |
fault()
Faults into cache all objects reachable in document (starting from trailer), and sets unreachable objects (that have not been previously read by the caller) to COS.OBJECT_NULL . |
java.lang.Object |
findNameTree(Dict root,
java.lang.StringBuffer name)
Find name in name tree rooted at root and return its associated value. |
java.lang.Object |
findNumberTree(Dict root,
int number)
Find number in number tree rooted at root and return its associated value. |
Dict |
getCatalog()
Returns Document catalog. |
com.pt.awt.font.CMap |
getCMap(java.lang.Object ref)
Returns CMap for Encoding or ToUnicode. |
java.awt.color.ColorSpace |
getColorSpace(java.lang.Object csref,
Dict csres,
Dict patres)
ColorSpaces.createColorSpace(Object, PDFReader) with caching. |
Encrypt |
getEncrypt()
Returns document-wide encryption manager. |
java.io.InputStream |
getFileInputStream(java.lang.Object spec)
Given a PDF external file specification, which can be a local file or network URI, returns a stream of data. |
java.net.URI |
getFileSpecification(java.lang.Object spec)
Converts simple or full file specification into a platform-independent URI. |
NFont |
getFont(Dict fd,
float size,
java.awt.geom.AffineTransform Tm,
PDF pdf)
Fonts.createFont(Dict, float, AffineTransform, Dict, PDF, PDFReader) with caching and scaling. |
java.awt.image.BufferedImage |
getImage(IRef imgdictref,
java.awt.geom.AffineTransform ctm,
java.awt.Color fillcolor)
Images.createImage(Dict, InputStream, AffineTransform, Color, PDFReader) with caching (under key COS.REALIZED ) |
Dict |
getInfo()
Returns /Info dictionary from trailer. |
InputStreamComposite |
getInputStream(java.lang.Object o)
Same as getInputStream(Object, boolean) , assuming not a content stream. |
InputStreamComposite |
getInputStream(java.lang.Object o,
boolean iscontent)
Given indirect reference to stream dictionary or array of such references, returns stream of uncompressed and decrypted data. |
int |
getLinearized()
If document is linearized, returns integer > 0 that is object number of linearization dictionary. |
java.lang.String |
getMetadata(java.lang.Object o)
Returns metadata associated with object, or a 0-length String if none. |
int |
getObjCnt()
Returns number of objects, numbered from 0. |
java.lang.Object |
getObject(int num)
Returns object from xref table offset at point num, from 0 to getObjCnt() , taking from cache if available. |
java.lang.Object |
getObject(java.lang.Object ref)
Returns referenced object, following any indirect references to concrete objects. |
int |
getObjGen(int objnum)
Returns object's generation number. |
long |
getObjOff(int objnum)
Returns object's byte offset in file. |
byte |
getObjType(int objnum)
Returns object's type, which is one of COS.XREF_FREE , COS.XREF_NORMAL , or COS.XREF_OBJSTMC . |
Dict |
getPage(int pagenum)
Given page number, finds the corresponding page dictionary. |
int |
getPageCnt()
Returns number of pages in document. |
int |
getPageNum(Dict page)
Reverse of getPage(int) . |
IRef |
getPageRef(int pagenum)
Given page number, finds the corresponding page object. |
com.pt.io.RandomAccess |
getRA()
Returns associated RandomAccess . |
static java.lang.Double |
getReal(double val)
|
long |
getStartXRef()
File offset of (last) trailer, which is needed for incremental updates. |
byte[] |
getStreamData(java.lang.Object ref,
boolean fraw,
boolean fcache)
Returns entire content of input stream. |
Dict |
getTrailer()
Document trailer. |
java.net.URI |
getURI()
Returns associated URI. |
phelps.util.Version |
getVersion()
Returns the major version of PDF used; for example, for PDF 1.4. |
boolean |
isAuthorized()
|
boolean |
isModified()
Modified, perhaps because repaired or annotated. |
boolean |
isRepaired()
|
Cmd |
readCommand(InputStreamComposite in)
Parse next command from content stream, or return null if no more. |
Cmd[] |
readCommandArray(java.lang.Object contentstream)
Parses content stream into array of commands. |
Dict |
readInlineImage(InputStreamComposite in)
Parse inline image from stream into a dictionary with its attributes and the data in a COS.CLASS_DATA under key COS.STREAM_DATA . |
int |
readInt(com.pt.io.RandomAccess ra)
Read positive integer from file. |
java.lang.Object |
readObject()
Returns next COS object from current file position, which may not be a top-level object starting with m n obj . |
java.lang.Object |
readObject(InputStreamComposite in)
Reads a complete object from a content stream: int, string, dictionary, map, ..., including all subparts |
Dict |
readXref(boolean all)
Reads cross-reference table and returns its trailer. |
void |
reset()
Clears all cached objects, which may have been mutated. |
void |
setExact(boolean b)
As a PDF is read in, COS objects are normalized . |
boolean |
setPassword(java.lang.String password)
Set password, returning true if document can be read unencrypted. |
Methods inherited from class multivalent.std.adaptor.pdf.COSSource |
---|
connected, getDecodeParms, getObjInt |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public PDFReader(java.io.File file) throws java.io.IOException, ParseException
java.io.IOException
ParseException
public PDFReader(com.pt.io.InputUni iu) throws java.io.IOException, ParseException
InputUni
.
java.io.IOException
ParseException
public PDFReader(java.lang.Object[] objs, Dict trailer) throws java.io.IOException
java.io.IOException
Method Detail |
---|
public void close() throws java.io.IOException
java.io.IOException
public com.pt.io.RandomAccess getRA()
RandomAccess
.
Clients should not cache the return value since the RandomAccess can change.
public java.net.URI getURI()
public void setExact(boolean b)
normalized
.
Set exact to true
to prevent this.
This should be set immediately after instantiation and setting the encryption password if any;
after this point it cannot be changed, because that would leave some parts updated and others not, and there would be conflicts.
public boolean isModified()
public boolean isRepaired()
public Dict readXref(boolean all) throws java.io.IOException, ParseException
true
, read entire table, chaining from trailer to trailer via /Prev
.
Precondition: file pointer is at start of xref table, at the start of the xref
keyword.
Usually the cross-reference table is read automatically at startup.
java.io.IOException
ParseException
getObjOff(int)
,
getObjGen(int)
,
getObjCnt()
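The /Prev chaining described for readXref can be pictured with a small model. The sketch below is not Multivalent code: the class PrevChainSketch, its chain method, and the map from section offset to trailer dictionary are all hypothetical stand-ins, and it only shows the order in which cross-reference sections would be visited, newest first.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of walking trailers via /Prev; not part of Multivalent. */
final class PrevChainSketch {
    /**
     * trailers maps the file offset of each xref section to its trailer dictionary.
     * Returns the offsets visited, starting at startXRef and following "Prev".
     */
    static List<Long> chain(Map<Long, Map<String, Object>> trailers, long startXRef) {
        List<Long> order = new ArrayList<>();
        Set<Long> seen = new HashSet<>(); // guards against a damaged file whose /Prev loops
        Long off = startXRef;
        while (off != null && trailers.containsKey(off) && seen.add(off)) {
            order.add(off);
            Object prev = trailers.get(off).get("Prev");
            if (prev instanceof Number) {
                off = ((Number) prev).longValue();
            } else {
                off = null; // oldest section: no /Prev key
            }
        }
        return order;
    }
}
```

Offsets are kept as long here because the class description mentions files of up to 4GB, which exceeds the range of an int.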
public phelps.util.Version getVersion()
COSSource
getVersion
in class COSSource
public int getLinearized()
public Dict getTrailer()
getTrailer
in class COSSource
public long getStartXRef()
public Dict getCatalog() throws java.io.IOException
Required keys: Type (=='Catalog'), Pages (dictionary),
Optional keys: PageLabels (number tree), Names (dictionary), Dests (dictionary), ViewerPreferences (dictionary), PageLayout (name), PageMode (name), Outlines (dictionary), Threads (array), OpenAction (array or dictionary), URI (dictionary), AcroForm (dictionary), StructTreeRoot (dictionary), SpiderInfo (dictionary)
getCatalog
in class COSSource
java.io.IOException
public Dict getInfo() throws java.io.IOException
Optional keys: Title (string), Author (string), Subject (string), Keywords (string), Creator (string), Producer (string), CreationDate (date), ModDate (date), Trapped (name).
null
if no /Info.
java.io.IOException
public java.lang.String getMetadata(java.lang.Object o) throws java.io.IOException
java.io.IOException
public Encrypt getEncrypt()
SecurityHandler.authUser(String)
,
SecurityHandler.isAuthorized()
public boolean setPassword(java.lang.String password) throws java.io.IOException
java.io.IOException
public boolean isAuthorized()
public int getPageCnt() throws java.io.IOException
java.io.IOException
public IRef getPageRef(int pagenum) throws java.io.IOException
getPageCnt()
, inclusive.
If object in that position is not a /Type /Page, returns null
(not COS.OBJECT_NULL
).
java.io.IOException
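One way to map a 1-based page number onto a PDF page tree is to descend through /Kids while subtracting each subtree's /Count. The sketch below models that on plain Java maps and lists; PageTreeSketch and findPage are hypothetical names, kids are stored inline rather than as indirect references, and nothing here is claimed to be how getPageRef(int) is actually implemented.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical page-tree lookup over plain maps; not Multivalent's implementation. */
final class PageTreeSketch {
    /** Returns the page dictionary for 1-based pagenum, or null if there is no such page. */
    @SuppressWarnings("unchecked")
    static Map<String, Object> findPage(Map<String, Object> node, int pagenum) {
        if (pagenum < 1) {
            return null; // pages are numbered from 1, not 0
        }
        if ("Page".equals(node.get("Type"))) {
            return pagenum == 1 ? node : null; // a leaf holds exactly one page
        }
        List<Object> kids = (List<Object>) node.get("Kids");
        if (kids == null) {
            return null;
        }
        int remaining = pagenum;
        for (Object k : kids) {
            Map<String, Object> kid = (Map<String, Object>) k;
            Object cnt = kid.get("Count");
            int count;
            if ("Page".equals(kid.get("Type"))) {
                count = 1;
            } else {
                count = (cnt instanceof Number) ? ((Number) cnt).intValue() : 0;
            }
            if (remaining <= count) {
                return findPage(kid, remaining); // target page lies in this subtree
            }
            remaining -= count; // skip the whole subtree
        }
        return null;
    }
}
```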
public Dict getPage(int pagenum) throws java.io.IOException
getObject(getPageRef(pagenum))
.
Pages are numbered 1..getPageCnt()
, inclusive.
Reverse of getPageNum(Dict)
.
java.io.IOException
public int getPageNum(Dict page) throws java.io.IOException
getPage(int)
.
java.io.IOException
public long getObjOff(int objnum)
n g obj
, not to start of content.
public int getObjGen(int objnum)
public byte getObjType(int objnum)
COS.XREF_FREE
, COS.XREF_NORMAL
, or COS.XREF_OBJSTMC
.
In PDF 1.5, this is the cross reference table's field 0.
public int getObjCnt()
getObjCnt
in class COSSource
public InputStreamComposite getInputStream(java.lang.Object o, boolean iscontent) throws java.io.IOException
java.io.IOException
public InputStreamComposite getInputStream(java.lang.Object o) throws java.io.IOException
getInputStream(Object, boolean)
, assuming not a content stream.
java.io.IOException
public byte[] getStreamData(java.lang.Object ref, boolean fraw, boolean fcache) throws java.io.IOException
getInputStream(Object, boolean)
and InputStream.read()
out the data.
ref - stream dictionary, or indirect ref to stream dictionary. If PDF is encrypted, must be indirect ref (perhaps freshly created for this purpose, with the right generation number).
fraw - raw data, not passed through filters
fcache - if true, save data under COS.STREAM_DATA key, remove PDF /Length key, and strip out non-image filters from Filter value
- Returns:
- null if dictionary is not a stream
- Throws:
java.io.IOException
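The description above amounts to opening a stream and reading it to the end. A minimal sketch of that draining step, using only java.io, might look as follows; StreamDataSketch and readAll are hypothetical names, and the filter, decryption, and caching behavior controlled by fraw and fcache is not modeled.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper showing how a stream can be drained into a byte array. */
final class StreamDataSketch {
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        // read(byte[]) returns -1 at end of stream
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }
}
```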
public void eatSpace(com.pt.io.RandomAccess ra) throws java.io.IOException
java.io.IOException
public int readInt(com.pt.io.RandomAccess ra) throws java.io.IOException
java.io.IOException
public static java.lang.Double getReal(double val)
public java.lang.Object readObject() throws java.io.IOException
m n obj
.
Ordinarily you want getObject(int)
or getObject(Object)
instead.
Comments, which are rare, are lost; the following object is returned.
Keywords that are not boolean or null are returned as String
s.
Precondition: file pointer at start of token.
Postcondition: Eat following whitespace to bring file pointer to start of next token.
java.io.IOException
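The precondition and postcondition above describe a common tokenizer convention: each read starts at a token and leaves the position at the next one. The sketch below illustrates it on an in-memory byte array. CosCursorSketch and its methods are hypothetical and only loosely mirror eatSpace(RandomAccess) and readInt(RandomAccess); integer overflow and comments are not handled.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical in-memory cursor illustrating the "eat trailing whitespace" convention. */
final class CosCursorSketch {
    private final byte[] data;
    int pos = 0;

    CosCursorSketch(String text) {
        data = text.getBytes(StandardCharsets.ISO_8859_1);
    }

    /** PDF whitespace: NUL, tab, line feed, form feed, carriage return, space. */
    static boolean isWhite(int c) {
        return c == 0 || c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
    }

    void eatSpace() {
        while (pos < data.length && isWhite(data[pos])) {
            pos++;
        }
    }

    /** Reads a non-negative integer, then skips whitespace to the next token. */
    int readInt() {
        int start = pos;
        int val = 0;
        while (pos < data.length && data[pos] >= '0' && data[pos] <= '9') {
            val = val * 10 + (data[pos] - '0');
            pos++;
        }
        if (pos == start) {
            throw new IllegalStateException("digit expected at " + start);
        }
        eatSpace();
        return val;
    }

    /** Reads an "m n obj" header and returns {object number, generation}. */
    int[] readObjHeader() {
        int num = readInt();
        int gen = readInt();
        if (pos + 3 > data.length || data[pos] != 'o' || data[pos + 1] != 'b' || data[pos + 2] != 'j') {
            throw new IllegalStateException("'obj' expected at " + pos);
        }
        pos += 3;
        eatSpace();
        return new int[] {num, gen};
    }
}
```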
public java.lang.Object getObject(java.lang.Object ref) throws java.io.IOException
null
,
so one can easily fully resolve an object that may or may not be present in a dictionary with a getObject(dict.get("key"))
.
getObject
in class COSSource
java.io.IOException
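The null-tolerant behavior described above can be modeled in a few lines. The sketch below is illustrative only: IndirectRefSketch, its nested Ref class, and the resolve method are hypothetical stand-ins for PDFReader, IRef, and getObject(Object), and the object table is a plain map rather than a parsed file.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of indirect-reference resolution; not Multivalent's classes. */
final class IndirectRefSketch {
    /** Stand-in for an indirect reference: just an object number. */
    static final class Ref {
        final int num;
        Ref(int num) { this.num = num; }
    }

    private final Map<Integer, Object> table = new HashMap<>();

    void put(int num, Object value) {
        table.put(num, value);
    }

    /** Follows references until a concrete object (or null) is reached. */
    Object resolve(Object o) {
        int hops = 0;
        while (o instanceof Ref) {
            if (++hops > table.size() + 1) {
                throw new IllegalStateException("reference cycle");
            }
            o = table.get(((Ref) o).num);
        }
        return o; // null stays null, so a missing dictionary key is harmless
    }

    public static void main(String[] args) {
        IndirectRefSketch doc = new IndirectRefSketch();
        doc.put(1, new Ref(2));
        doc.put(2, "Hello");
        Map<String, Object> dict = new HashMap<>();
        dict.put("Title", new Ref(1));
        System.out.println(doc.resolve(dict.get("Title")));  // follows 1, then 2
        System.out.println(doc.resolve(dict.get("Author"))); // key absent
    }
}
```

Because null passes straight through, the caller does not need a separate containsKey check before resolving a dictionary entry.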
public java.lang.Object getObject(int num) throws java.io.IOException
getObjCnt()
, taking from cache if available.
Object is decrypted if necessary.
All objects are cached, with SoftReference
s so they are automatically garbage collected when memory is tight.
If the object is a stream, its contents are not read, but the file position of the data (a Long
) is stored under a new injected key COS.STREAM_DATA
.
If the object has been freed ('f' in xref table), COS.OBJECT_DELETED
is returned -- the old object is not available.
Object number is an int
not a long
, so it can handle only 2,147,483,647 of the possible 9,999,999,999 objects in a PDF,
but even very large PDFs seldom have more than 100,000 objects.
java.io.IOException
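The SoftReference caching mentioned above follows a familiar pattern: look up the reference, and reload if it is missing or has been cleared by the garbage collector. The sketch below shows that pattern with java.lang.ref.SoftReference. SoftCacheSketch, its loader function, and the loads counter are hypothetical and stand in for whatever PDFReader does internally.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

/** Hypothetical soft-reference cache keyed by object number. */
final class SoftCacheSketch {
    private final Map<Integer, SoftReference<Object>> cache = new HashMap<>();
    private final IntFunction<Object> loader; // stands in for parsing the object from the file
    int loads = 0;                            // how many times the loader actually ran

    SoftCacheSketch(IntFunction<Object> loader) {
        this.loader = loader;
    }

    Object get(int num) {
        SoftReference<Object> ref = cache.get(num);
        Object obj = (ref != null) ? ref.get() : null; // get() is null once the GC clears it
        if (obj == null) {
            obj = loader.apply(num);
            loads++;
            cache.put(num, new SoftReference<>(obj));
        }
        return obj;
    }
}
```

A caller that mutates a returned object should keep its own strong reference to it, since a softly referenced object can be collected and later reloaded from the file without the changes.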
public void fault() throws java.io.IOException
COS.OBJECT_NULL
.
Some bad PDFs have cross reference entries to non-existent objects,
but these objects aren't referenced by other objects so viewing works fine.
This method ensures that a loop over all objects won't encounter an error either.
java.io.IOException
public int countCached()
public void reset()
public java.lang.Object readObject(InputStreamComposite in) throws java.io.IOException
java.io.IOException
public void eatSpace(InputStreamComposite in) throws java.io.IOException
java.io.IOException
public Cmd[] readCommandArray(java.lang.Object contentstream) throws java.io.IOException
IRef
.
java.io.IOException
PDFWriter.writeCommandArray(Cmd[], boolean)
,
Cmd
public Cmd readCommand(InputStreamComposite in) throws java.io.IOException
null
if no more.
For inline images (BI..ID..EI), expands abbreviations in dictionary (in ops[0]) and strips non-image filters from data (in ops[1]).
java.io.IOException
public Dict readInlineImage(InputStreamComposite in) throws java.io.IOException
COS.CLASS_DATA
under key COS.STREAM_DATA
.
Abbreviated keys (but not values) are expanded
(e.g., /F
=> /Filter
, but not the color space value G
=> DeviceGray
BI
and following whitespace;
on exit input stream is immediately after closing EI
.
java.io.IOException
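Expanding abbreviated keys while leaving values alone can be sketched as a simple table lookup. The abbreviations below are the inline-image key abbreviations listed in the PDF Reference; the class InlineImageKeysSketch and its expandKeys method are hypothetical and operate on a plain Java map, not on the Dict that readInlineImage returns.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical key expansion for inline-image dictionaries. */
final class InlineImageKeysSketch {
    private static final Map<String, String> FULL = new LinkedHashMap<>();
    static {
        FULL.put("BPC", "BitsPerComponent");
        FULL.put("CS", "ColorSpace");
        FULL.put("D", "Decode");
        FULL.put("DP", "DecodeParms");
        FULL.put("F", "Filter");
        FULL.put("H", "Height");
        FULL.put("IM", "ImageMask");
        FULL.put("I", "Interpolate");
        FULL.put("W", "Width");
    }

    /** Returns a copy with abbreviated keys expanded; values are left untouched. */
    static Map<String, Object> expandKeys(Map<String, Object> dict) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : dict.entrySet()) {
            String full = FULL.get(e.getKey());
            out.put(full != null ? full : e.getKey(), e.getValue());
        }
        return out;
    }
}
```

In this model a color space value such as G stays G, matching the note above that values are not expanded.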
public java.net.URI getFileSpecification(java.lang.Object spec) throws java.io.IOException
java.io.IOException
public java.io.InputStream getFileInputStream(java.lang.Object spec) throws java.io.IOException
BufferedInputStream
.
If file is not found, returns null
.
java.io.IOException
public java.awt.color.ColorSpace getColorSpace(java.lang.Object csref, Dict csres, Dict patres) throws java.io.IOException
ColorSpaces.createColorSpace(Object, PDFReader)
with caching.
java.io.IOException
public java.awt.image.BufferedImage getImage(IRef imgdictref, java.awt.geom.AffineTransform ctm, java.awt.Color fillcolor) throws java.io.IOException
Images.createImage(Dict, InputStream, AffineTransform, Color, PDFReader)
with caching (under key COS.REALIZED
).
java.io.IOException
public NFont getFont(Dict fd, float size, java.awt.geom.AffineTransform Tm, PDF pdf) throws java.io.IOException
Fonts.createFont(Dict, float, AffineTransform, Dict, PDF, PDFReader)
with caching and scaling.
The created font is stored in the font dictionary in a SoftReference under key COS.REALIZED
.
java.io.IOException
public com.pt.awt.font.CMap getCMap(java.lang.Object ref) throws java.io.IOException
java.io.IOException
public java.lang.Object findNameTree(Dict root, java.lang.StringBuffer name) throws java.io.IOException
null
java.io.IOException
public java.lang.Object findNumberTree(Dict root, int number) throws java.io.IOException
java.io.IOException
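A number tree in PDF is a tree of dictionaries in which leaf nodes carry a /Nums array of alternating keys and values, intermediate nodes carry /Kids, and /Limits gives the least and greatest key under a node. The sketch below searches such a tree built from plain Java maps and lists. NumberTreeSketch and find are hypothetical, kids are stored inline rather than as indirect references, and the code is not claimed to match findNumberTree's implementation. A name tree has the same shape with /Names in place of /Nums and string keys.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical number-tree lookup over plain maps and lists. */
final class NumberTreeSketch {
    /** Returns the value associated with number, or null if it is not in the tree. */
    @SuppressWarnings("unchecked")
    static Object find(Map<String, Object> node, int number) {
        List<Object> limits = (List<Object>) node.get("Limits");
        if (limits != null) {
            int lo = ((Number) limits.get(0)).intValue();
            int hi = ((Number) limits.get(1)).intValue();
            if (number < lo || number > hi) {
                return null; // key cannot be under this node
            }
        }
        List<Object> nums = (List<Object>) node.get("Nums");
        if (nums != null) {
            // leaf: entries alternate key, value, key, value, ...
            for (int i = 0; i + 1 < nums.size(); i += 2) {
                if (((Number) nums.get(i)).intValue() == number) {
                    return nums.get(i + 1);
                }
            }
        }
        List<Object> kids = (List<Object>) node.get("Kids");
        if (kids != null) {
            for (Object kid : kids) {
                Object hit = find((Map<String, Object>) kid, number);
                if (hit != null) {
                    return hit;
                }
            }
        }
        return null;
    }
}
```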