About the darker side of PDF

First, let us make clear that we like PDF and welcome it as a very useful tool for spreading the basic idea of electronic paper, an idea developed by the author of this software and this manual long before PDF was available or even publicly known (see the chapter about the history of CDF, which is much older than PDF).

Nevertheless, there are some substantial problems that stem either from the PDF technology itself or from the marketing philosophy behind it (which we consider the more likely cause). Since these problems can substantially complicate a data import from PDF, you need to know their technical background and causes in order to master them, and they can be mastered without too much trouble!

Most major problems derive from the fact that the PDF file format stores text independently of any logical structure in the source data. It does not even maintain word boundaries, not to mention phrases or paragraphs. None of this vital information originally contained in the source data is preserved in the PDF file format (for reasons which remain completely obscure to anybody but its authors).

In practice, this means that characters which clearly formed a unit in the original data (i.e. a word) can be, and often are, torn apart and stored as parts of separate text blocks. It is not even guaranteed that such text blocks are stored in consecutive order, which would help to reconstruct the original data entities (i.e. words).

No, PDF prefers to store text in sometimes rather arbitrary sequences and text blocks, which are in no way linked to one another.

We do not know whether this is a bug or a feature. We only know that we cannot see any sensible reason for it, and that you and we will have to live with this nonsense!

The problems resulting from these obscurities are best illustrated by an example. Suppose you have a simple multi-line block of text stored in a word-processing or DTP program. Three consecutive words in this text block are to be printed in bold, and another word in bold italic. All word-processing and DTP packages known to us keep the text block as a logical unit and store the attributes of the bold and italic words in a third dimension "behind" the actual text, as special attributes of this text block, together with information about the characters to which these font styles apply.

They will definitely not tear the text block apart only because some of the words are printed in a different style.

PDF does exactly this: it splits the text block into various parts depending on the fonts used for specific character sequences. "Words" as a unit do not even exist in PDF! In practice, this means that the one text block which originally formed an entity is split by PDF into several independent units, each of which has its own font definition. Unfortunately, neither the order of these units nor the logic behind this separation of a logical text unit is documented or apparent.
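The contrast between the two storage models can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the names `StyleRun`, `StyledBlock`, `Fragment` and `split_by_font` are hypothetical, and the fragments shown here are a simplified model, not the actual PDF file syntax.

```python
from dataclasses import dataclass, field


@dataclass
class StyleRun:
    start: int  # index of the first styled character
    end: int    # index one past the last styled character
    font: str


@dataclass
class StyledBlock:
    """Word-processor view: one text unit, with styles stored 'behind' it."""
    text: str
    runs: list = field(default_factory=list)
    default_font: str = "Regular"


@dataclass
class Fragment:
    """Simplified PDF-like view: an independent piece of text with its own font."""
    text: str
    font: str


def split_by_font(block):
    """Split one styled block into independent fragments at every font change."""
    fragments = []
    pos = 0
    for run in sorted(block.runs, key=lambda r: r.start):
        if run.start > pos:
            # Unstyled text between two styled runs keeps the default font.
            fragments.append(Fragment(block.text[pos:run.start], block.default_font))
        fragments.append(Fragment(block.text[run.start:run.end], run.font))
        pos = run.end
    if pos < len(block.text):
        fragments.append(Fragment(block.text[pos:], block.default_font))
    return fragments


# One logical text block: three bold words and one bold italic word.
block = StyledBlock(
    text="Please read these three words very carefully.",
    runs=[StyleRun(12, 29, "Bold"), StyleRun(30, 34, "BoldItalic")],
)
fragments = split_by_font(block)
for frag in fragments:
    print(repr(frag.text), frag.font)
```

In this sketch the single block ends up as five unrelated fragments, and even the lone space between the bold and the bold italic words becomes a fragment of its own. Nothing in the fragments records that they once belonged together.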

Even worse: in a fair number of cases, PDF for some unknown reason "decides" to split even single words into two or more parts, although the entire word is printed in one and the same font, i.e. where there is absolutely no reason to tear apart the characters forming this word and store them in separate text blocks. To make things worse, these blocks are often not stored in any particular sequence or with any reference to one another.

These principles used by PDF, if such should exist at all, contradict today's widely accepted concepts of "object-orientation" to the maximum extent possible! PDF does just the exact opposite!

This causes great problems for any program trying to recompose separated text blocks back into what originally had been logical entities - words, phrases or paragraphs.
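To give an impression of what such a recomposition involves, here is a minimal sketch in Python. It is not the algorithm used by the Clickcat software. It assumes that each fragment carries a position and a width, and the names `PositionedFragment`, `recompose_words`, `max_gap` and `line_tolerance` are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class PositionedFragment:
    text: str
    x: float      # left edge of the fragment
    y: float      # vertical position of the text line
    width: float  # horizontal extent of the fragment


def recompose_words(fragments, max_gap=1.0, line_tolerance=2.0):
    """Merge fragments into words using only their positions.

    Fragments on the same line are joined when the horizontal gap
    between them does not exceed max_gap (an assumed tolerance).
    """
    # Group fragments into lines by their vertical position.
    lines = []
    for frag in sorted(fragments, key=lambda f: f.y):
        if lines and abs(frag.y - lines[-1][-1].y) <= line_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])

    words = []
    for line in lines:
        # The storage order is unreliable, so restore left-to-right order.
        line.sort(key=lambda f: f.x)
        current = line[0].text
        right_edge = line[0].x + line[0].width
        for frag in line[1:]:
            if frag.x - right_edge <= max_gap:
                current += frag.text   # small gap: same word
            else:
                words.append(current)  # large gap: a new word starts
                current = frag.text
            right_edge = frag.x + frag.width
        words.append(current)
    return words


# Fragments stored out of order, with one word split in two.
stored = [
    PositionedFragment("port", x=22.0, y=10.0, width=18.0),
    PositionedFragment("Im", x=10.0, y=10.0, width=12.0),
    PositionedFragment("settings", x=45.0, y=10.0, width=35.0),
    PositionedFragment("PDF", x=10.0, y=24.0, width=16.0),
]
words_default = recompose_words(stored)
words_wide = recompose_words(stored, max_gap=6.0)
print(words_default)
print(words_wide)
```

With the default tolerance the sketch rejoins "Im" and "port" into one word and keeps "settings" separate. With a tolerance that is too generous, the two words on the first line are wrongly merged. This is why a tolerance of this kind has to be tuned for each document.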

This peculiarity of PDF is the sole reason why the Clickcat software needs some settings to recompose arbitrarily scattered characters from PDF into text entities (words) that were senselessly torn apart when the PDF file was created.

We are extremely sorry for this inconvenience, which is caused exclusively by the very exceptional way in which PDF stores text blocks with different fonts. We have tried to make the best of this situation, and we strongly recommend that you first look at the explanatory drawings and examples in the section "Optimal use of import settings" and then experiment with different settings when importing your PDF files. You will soon find a set of optimum settings for your particular situation.

So much for the darker side of PDF! Let us again make clear that, despite all these obscurities, we see PDF as a very useful and partially brilliant tool for using documents electronically. And we see it as a door opener for a much more specialised technology like our CDF!

See Also

HTMLGenerator Import form

Overview of the PDFImporter Import form

Settings in the PDFImporter Import form

Object identification in the PDFImporter Import form

Button Import in the PDFImporter Import form