Java Tool Kit
Preface
The PdfFile class of the Java Tool Kit was designed to convert HTML to PDF on-the-fly, from well-formed, dynamically-generated HTML in a controlled environment. In other words, it is a powerful and easy way to create PDFs from the HTML generated by a Web application. It was never intended to be a PDF-based Internet Explorer. Regardless, it can perform very accurate external "scrape" conversions of Internet sites if the authors of the pages used well-formed HTML.
What does "well-formed" mean? HTML is a markup language that consists of tagged elements. When a tag is opened, it should be closed in reverse order of nesting, so that the most recently opened element is closed first. Many of today's HTML generators fail to produce well-formed HTML. Fortunately for users, today's browsers have become more forgiving and can ignore errors that would otherwise be detrimental to the page's display.
The PdfFile engine uses the Java "swing" classes to interpret and position the source HTML to best mimic today's leading Web browsers. After laying out the page, the custom PDF engine translates the interpreted HTML into a PDF file. Tables and borders are translated into lines and rectangles, and form elements are converted to a combination of rectangles, graphics, and text. This complex process relies heavily on the "swing" components' ability to parse HTML.
As of Java 1.4.2, these components support only HTML 3.2 and are not as robust as the leading browser engines. There are many shortcomings, most of which can be easily overcome if the following guidelines are followed and the known issues avoided.
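The nesting rule can be checked mechanically. The following is a minimal sketch, not part of the Java Tool Kit: the class name WellFormedCheck, its method, and the list of void tags are assumptions made for illustration. It is deliberately strict and naive (it does not skip comments or scripts, and it treats optional end tags such as </p> as required), so treat it as a rough pre-check only.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the PdfFile API.
class WellFormedCheck {
    // Elements that never take a closing tag (assumed, partial list).
    private static final Set<String> VOID_TAGS = new HashSet<>(Arrays.asList(
            "br", "hr", "img", "input", "meta", "link"));

    // Captures the optional slash and the tag name of each tag.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>");

    // Returns true when every opened tag is closed in reverse order of opening.
    static boolean isWellFormed(String html) {
        Deque<String> open = new ArrayDeque<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            String name = m.group(2).toLowerCase(Locale.ROOT);
            if (VOID_TAGS.contains(name)) {
                continue; // nothing to close
            }
            if (m.group(1).isEmpty()) {
                open.push(name); // opening tag
            } else if (open.isEmpty() || !open.pop().equals(name)) {
                return false; // closing tag does not match the innermost open tag
            }
        }
        return open.isEmpty(); // anything left was never closed
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<div><b>text</b></div>")); // true
        System.out.println(isWellFormed("<div><b>text</div></b>")); // false
    }
}
```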
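To see how the Swing parser reads a fragment before converting it, the standard ParserDelegator class can be run directly. This is a minimal sketch and does not show PdfFile's internal code: the class name SwingParseSketch and its method are assumptions made for illustration, and only the javax.swing.text.html calls are standard JDK API. Note that the parser may also report implied tags (such as html or body) that are not present in the source.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper for illustration; not part of the PdfFile API.
class SwingParseSketch {
    // Returns the names of the start tags the Swing parser reports, in order.
    static List<String> startTags(String html) {
        final List<String> tags = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                tags.add(tag.toString());
            }
        };
        try {
            // true = ignore any charset declared in the document
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tags;
    }

    public static void main(String[] args) {
        System.out.println(startTags("<table><tr><td><b>cell</b></td></tr></table>"));
    }
}
```

Comparing the reported tags with the source is a quick way to spot elements that the parser handles differently from a browser.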
Known Issues
• <SPAN> tags are not recognized by the Java HTMLFactory. Instead, a default empty inline element is displayed so as not to disturb the layout. However, CSS classes, styles, and other attributes are ignored. Hopefully, new releases of Java will fix this. In the meantime, a workaround is in the works. Java does interpret the <FONT> tag, now deprecated in HTML 4.0, and all of its attributes as an inline element. This can be used as a substitute for <SPAN>.
• <STYLE> tags are ignored in the body of an HTML document. This is one of those situations that is forgiven by most browsers. Java will only interpret a <STYLE> tag in the head of the document, where it was intended. In addition, the contents of such a <STYLE> tag (stylesheet classes) will be displayed as body content in the PDF document. This is easy to circumvent by placing your <STYLE> tags in the head of the HTML document.
• Stylesheet borders do not display for tables. Java only renders the default raised table border.
• Multiple CSS class inheritance does not work. As with MS Word and Excel, only the first class listed on an element is recognized. In the following example, only the class "myclass" will be recognized and rendered: <div class="myclass myotherclass">
• Fonts are limited to the 14 embedded Adobe Acrobat fonts to avoid missing client fonts, substitution errors, and issues like the dreaded "Bad BBox" error. The method FontSupport() can be called to allow unsupported fonts to be utilized. However, this is not recommended as the PdfFile class cannot generate dynamic font descriptors for fonts within the HTML source document. A "Missing Font" or "Bad BBox" error will occur upon opening the resulting PDF. The fonts may display properly after ignoring the Acrobat Reader error.
• The preceding issue is the most significant because of its effect on rendering. In this situation, Java swing interprets the actual font sizes and styles within the source HTML. The page rendered within the PdfFile virtual browser is based on the declared source fonts and their sizes. These areas are calculated prior to font substitution. Hence, the substituted font within the PDF document may require more area than the original font, causing overlap or gaps between HTML object containers. Please read the portion on fonts in the Tips & Tricks section below.
• Font sizes declared in points render in pixels and vice versa. It appears as though the developers at Sun have confused pixels and points. If a font-size attribute is declared as 12px, it is rendered at 12pt; if declared as 12pt, it is rendered at 12px. The remedy is simply to omit the unit from your size declaration. When the size is declared as 12 with no unit, Java will render the font or object in pixels. By default, browsers will do the same.
• If image size attributes ( WIDTH and HEIGHT ) for an HTML <IMG> tag are not specified, some images will not properly display at their default size. This only occurs for images created in certain image editors that do not create a proper size header. This is a rare issue, and ONLY occurs when size attributes are omitted. Always use width and height attributes when possible.
• Some styles within class objects may be ignored in <TD> tags. This is puzzling and unfortunate, although easily remedied. So far, it appears that padding and margin styles declared in a custom class object will be ignored, but recognized and rendered properly when placed in a style attribute. Rather than doing <td class="pad">, use <td style="padding: 10;"> instead.
• Only characters 1 through 255 are displayed properly. Double-byte characters will not display as expected.
• Using the RELOADABLE context attribute in Apache Tomcat can often cause ClassLoader lifecycle issues. One of the affected classes is the DefaultUI in Java swing. This will cause form elements to fail to render. Because of the overhead, RELOADABLE is not recommended by Apache in production environments. Dan Oross Consulting urges setting RELOADABLE to false.
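Two of the issues above (font-size units and multiple classes) can be normalized in the generated HTML before it is handed to the converter. The following is a minimal sketch under stated assumptions: the class HtmlPreflight and its methods are hypothetical and not part of the Java Tool Kit, and the regular expressions cover only the simple, double-quoted forms shown in this guide.

```java
import java.util.regex.Pattern;

// Hypothetical pre-processing helper for illustration; not part of the PdfFile API.
class HtmlPreflight {
    // Matches "font-size: 12px" or "font-size:12pt" and captures everything but the unit.
    private static final Pattern FONT_SIZE_UNIT =
            Pattern.compile("(?i)(font-size\\s*:\\s*\\d+(?:\\.\\d+)?)\\s*(?:px|pt)\\b");

    // Matches a double-quoted class attribute and captures its first class name.
    private static final Pattern CLASS_ATTR =
            Pattern.compile("(?i)(\\bclass\\s*=\\s*\")\\s*([^\"\\s]+)[^\"]*(\")");

    // Drops px/pt from font-size declarations so the size is read as unitless.
    static String stripFontSizeUnits(String html) {
        return FONT_SIZE_UNIT.matcher(html).replaceAll("$1");
    }

    // Keeps only the first class in each class attribute, making explicit
    // what the converter is described as recognizing.
    static String firstClassOnly(String html) {
        return CLASS_ATTR.matcher(html).replaceAll("$1$2$3");
    }

    public static void main(String[] args) {
        System.out.println(stripFontSizeUnits("<p style=\"font-size: 12px\">text</p>"));
        System.out.println(firstClassOnly("<div class=\"myclass myotherclass\">"));
    }
}
```

Running firstClassOnly on the source also makes the browser preview match what the PDF will show, since both then apply a single class.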
HTML-to-PDF Conversion Guide   ( Tips & Tricks )
• Allow the PdfFile component to substitute the embedded Acrobat fonts for the custom fonts in your HTML source. To avoid mismatches between calculated and actual rendered area dimensions, use the embedded fonts (Helvetica, Arial, Courier, Times, Symbol) in your source HTML. The result will be a PDF file that exactly matches your Web browser rendered version.
• All form element sizes will be calculated and rendered as if the fonts are the default DIALOG 12 PLAIN. Stylesheet data for form elements is ignored. If the default form element styles are used, the PDF document will match the Web browser display version.
• When CSS classes fail to render in PDF as they do on screen, don't despair. Try setting the styles in a style attribute within the HTML element and, as a last resort, use the HTML tag's own base attributes to set properties.
• Always set image tag width and height attribute values.
• Do not specify units in your stylesheets. PdfFile and most browsers will default to pixels. This will provide the closest rendition of your HTML when rendered as PDF.
• The custom tag <PDF> will signify to the PdfFile component to provide a certain functionality specified by an attribute within the tag. Currently the only available action is "pagebreak". The "pagebreak" attribute will tell the PDF engine to insert a page break when this tag is encountered. An integer can be provided to tell the engine how many pixels before or after the tag to break. If no attribute is provided, a pagebreak at zero (0) pixels will be inserted. For example, the tag <pdf pagebreak="-5"> will tell the PdfFile component to insert a page break 5 pixels above the vertical position in which the tag was encountered. A value of positive five (5) would move 5 pixels below the tag position. Be sure to include this tag WITHIN a content tag like <DIV> or <TD>, not within container tags like <TABLE> or <TR>.
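When the HTML is generated by a Web application, the image and page-break tips above can be wrapped in small helpers so that every page follows them. This is a minimal sketch: the class PdfMarkup and its methods are hypothetical and not part of the Java Tool Kit; only the <pdf pagebreak="..."> tag syntax and the width and height attributes come from this guide.

```java
// Hypothetical markup helper for illustration; not part of the PdfFile API.
class PdfMarkup {
    // Emits the custom page-break tag inside a <div>, since the tag should sit
    // within a content tag. A negative offset breaks above the tag position,
    // a positive offset below it.
    static String pageBreak(int offsetPixels) {
        return "<div><pdf pagebreak=\"" + offsetPixels + "\"></div>";
    }

    // Emits an image tag that always carries explicit width and height.
    static String img(String src, int width, int height) {
        return "<img src=\"" + src + "\" width=\"" + width
                + "\" height=\"" + height + "\">";
    }

    public static void main(String[] args) {
        System.out.println(pageBreak(-5));
        System.out.println(img("logo.gif", 120, 40));
    }
}
```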
Copyright ©2008 Dan Oross Consulting, Inc. All rights reserved.
Products and Web site designed and developed by Dan Oross