Archive for the ‘PDF’ Category

PDF Embedding and So, So Much More

Friday, June 12th, 2009

The framework’s PDF engine has largely been rewritten. This adds a lot of new abilities which are quite exciting (and, in the meantime, we even changed a couple of unrelated things).

Here’s a list of changes:

  • Page Trees. We used to write one PDF array of pages. Now, we write trees of pages, allowing us to keep only a small number of page nodes in memory at a time.
  • Compression. Content streams in PDF files are now compressed. Why did we implement this now? Well, that’s a funny story…
  • Decompression. It turns out that, in our effort to embed one PDF inside of another, we may have to deal with multiple content streams. It further turns out that they still represent one long stream (as opposed to several individual segments) and, as such, cannot be used separately from each other. So, to put them in one reusable object in the PDF file, you have to decompress and concatenate them. And, since we’re decompressing, why not compress as well?
  • PDF Parsing. After the near-rewrite of the PDF component, it was actually quite easy to allow PDF parsing. It only took us around a week or so. Now, we are able to open a PDF file and rip contents out of it.
  • PDF Embedding. It has, for a while, been possible to embed PDFs inside PPML. However, to embed PDFs inside other PDF files, you need to actually deconstruct the PDF to be embedded. That’s why we implemented parsing. Amazingly, it all appears to work, although, unfortunately, it does not (currently) work for PDFs with compressed cross-reference streams or compressed object streams.
  • Better PDF Embedding. Not only does PDF-in-PDF embedding work, but PDF-in-PPML embedding has been improved nicely. The width and height of any embedded PDF are read in, so it is now possible to do lots of fancy stuff. Now, PDFs are treated almost exactly like images. You can make them fill a frame, you can make them fit in a frame, you can make them centered in a frame… basically, you can do anything you can do with an image with a PDF — in both PPML and PDF outputs.
  • Text Processing Bug Fixed. Imagine the words “Hello World.” What if both cannot fit on the same line? They should, then, naturally be put on separate lines. But what if it was just the space between the words making the difference? Still, they need to be on separate lines. Unfortunately, the framework had a bug here — it was determining that it needed to break, but although it split at the right point, it did not actually go to the new line. This has been fixed!
  • Line Height = 1.2. The framework now supports line heights for runs of text. This is, basically, the amount of spacing there should be between lines. Because we use InDesign quite often here, we made the framework default to line spacing of 1.2 — like InDesign’s — which changes how text flows in any application that has text boxes with more than one line of text.
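
The fit, fill, and center behaviors described above for embedded PDFs all come down to choosing a scale factor from the content's width and height and the frame's width and height. Here is a minimal sketch of that arithmetic, assuming uniform scaling and centering in every mode. The names (Size, Placement, Mode, placeInFrame) are made up for illustration and are not the framework's actual API.

```cpp
#include <algorithm>
#include <cassert>

//hypothetical types for illustration; not taken from the framework.
struct Size
{
	double width;
	double height;
};

struct Placement
{
	double scale;   //uniform scale applied to the content
	double offsetX; //left edge of the scaled content, relative to the frame
	double offsetY; //bottom edge of the scaled content, relative to the frame
};

enum class Mode { Fit, Fill, Center };

Placement placeInFrame(Size content, Size frame, Mode mode)
{
	//we can only scale content that has a real size.
	assert(content.width > 0 && content.height > 0);

	//how much we would scale to match the frame in each direction
	double sx = frame.width / content.width;
	double sy = frame.height / content.height;

	//fit uses the smaller factor (nothing is cropped), fill uses the larger
	//one (nothing is left empty), and center leaves the size alone.
	double scale = 1.0;
	if (mode == Mode::Fit)
		scale = std::min(sx, sy);
	else if (mode == Mode::Fill)
		scale = std::max(sx, sy);

	//we center the scaled content; a negative offset means it overflows.
	Placement placement;
	placement.scale = scale;
	placement.offsetX = (frame.width - content.width * scale) / 2.0;
	placement.offsetY = (frame.height - content.height * scale) / 2.0;
	return placement;
}
```

For a 200 by 100 PDF in a 100 by 100 frame, fitting scales it by 0.5 and leaves 25 units above and below, while filling keeps it at full size and lets 50 units overflow on each side.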

Where Are We Going

So what’s left? Well, there could be some memory usage issues; the engine was using a suspiciously large amount of memory the other day when processing a set of a few thousand records. It may or may not have been a small (or large) leak. We will get around to fixing this problem (if it exists) when it starts to bother us.

We want to implement scripting support for PDF, and by that, I mean that we want to be able to open a PDF using Script, determine things such as the size of a page, and so on. It does not necessarily need to talk directly to PDF (it could talk to PDFLink), but we want a tool we can use to automate things such as imposition.

Of course, scripting of views could allow this, and this is an eventual goal. In the meantime, however, we may see some form of compiled dC, where dC is generated (compiled) live through some script, allowing forms of meta-programming.

A big part of all of this will be the conversion from the Engine to the Shell, which is a JavaScript based platform. The Shell will be able to simply run scripts, or do Create Framework related things. This will make it much easier to maintain the current Shell, which is used for running some current scripts here at TPSi.

Finally, dC namespaces need a rewrite, and that will happen eventually.

We have another project we will be working on for a little while. It is related to the Create Framework, but a bit too experimental to announce quite yet.

My Coding Style

Tuesday, May 19th, 2009

My coding style (for C++, at least) is probably slightly controversial. I use tabs, not spaces. Every brace, almost without exception, is on a new line (the exception is when the entire function definition, including braces, can be on one line).

Now for one that is a bit weird, but that I started doing for my own sanity a while back: before every few lines, if not every line, there is a comment. After the line or set of lines, there is a blank line. It doesn’t matter if the comment says something extremely meaningful (though I prefer it to be meaningful), but it matters that it is there. The syntax coloring of the comment, combined with the line break, somehow makes the code much simpler for me to read.

It is preferable that the comments be written in the first person, but plural; that is, it is preferable that they start with “we,” as in “I and the program,” or “whoever you are reading this, and I the original programmer.” I don’t know why, but that is my preference, and, being the one who works most on the code, what I say goes.

Here is an example from a still very much work-in-progress function (so don’t make too much fun!) involved in the parsing of PDF:

PDFObject PDFFile::parse()
{
	//there are a few main things we could see. We could see an object, or a
	//delimiter.
 
	//what we do is simple. Skip whitespace (we have a function for that), read
	//until either delimiter or whitespace.
 
	//we will either have a simple object (number, boolean, integer, null)
	//or a delimiter. The delimiter tells us what we may want to do next, for
	//instance, use parseString, parseName, etc.
 
	//NOTE:
	//peek is our friend. Since we read one character at a time, it is much more
	//practical than creating our own in-memory buffer to remember, for instance,
	//the last delimiter we saw.
 
	//consume whitespace
	this->consumeWhitespace();
 
	//token buffer. This holds the token if we can process it whole.
	std::string buffer;
 
	//loop until we see either whitespace or a delimiter 
	//(peeking the whole time)
	while (true)
	{
		char c = this->input->peek();
 
		//if it is a delimiter or is whitespace, we are done reading.
		if (isWhitespace(c) || isDelimiter(c))
			break;
 
		this->input->get(c);
		buffer += c;
	}
 
	//now parse our buffer. If it is empty, we stopped at a delimiter (string, name, dict, etc.)
	if (buffer.length() == 0)
	{
		//we must have a delimiter.
		//see what it means.
		char delimiter;
		this->input->get(delimiter);
 
		//see what it is...
		if (delimiter == '(')
		{
			return processString(); //will consume up to the ending )	
		}
		else if (delimiter == '<')
		{
			//peek the next character. if it is also a <, then this is a dict.
			if (this->input->get( //note: this is why I said work-in-progress.
			// it isn't finished.
		}
	}
 
 
}

Is the code a lot longer than it could be due to comments? Yes. But for one reason or another, it makes it much easier for me to understand and debug later.

Update: Something that, unfortunately, I am not consistent enough about is the use of the this-> prefix for variable and function names. I prefer to use it, but sometimes I don’t, as seen in “processString” above.
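
For what it is worth, here is one way the unfinished '<' branch above could go, as a standalone sketch rather than the code that ended up in the framework. In PDF, "<<" opens a dictionary and a single "<" opens a hexadecimal string, so one peek is enough to tell them apart. The sketch also reads the token with peek stored in an int, so that end-of-file can end the loop. Everything here (Opening, classifyAfterAngleBracket, readToken, and these versions of isWhitespace and isDelimiter) is a stand-in written for illustration, working on a plain std::istream.

```cpp
#include <cassert> //only needed for quick checks
#include <istream>
#include <sstream> //std::istringstream is handy for trying this out
#include <string>

//hypothetical result type; the original function returns a PDFObject instead.
enum class Opening { Dictionary, HexString };

//the six PDF whitespace characters: NUL, tab, LF, FF, CR, and space.
bool isWhitespace(int c)
{
	return c == 0 || c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == ' ';
}

//the PDF delimiter characters.
bool isDelimiter(int c)
{
	switch (c)
	{
		case '(': case ')': case '<': case '>': case '[': case ']':
		case '{': case '}': case '/': case '%':
			return true;
		default:
			return false;
	}
}

//we call this after the first '<' has already been consumed.
Opening classifyAfterAngleBracket(std::istream &input)
{
	//peek returns an int so that it can also report end-of-file.
	int next = input.peek();

	//a second '<' means a dictionary; we consume it so the caller starts
	//right at the first key.
	if (next == '<')
	{
		input.get();
		return Opening::Dictionary;
	}

	//anything else belongs to a hex string, so we leave it in the stream.
	return Opening::HexString;
}

//we read until whitespace, a delimiter, or the end of the input.
std::string readToken(std::istream &input)
{
	std::string buffer;

	while (true)
	{
		int c = input.peek();

		//if we ran out of input, we are done reading.
		if (c == std::char_traits<char>::eof())
			break;

		//if it is a delimiter or is whitespace, we are also done.
		if (isWhitespace(c) || isDelimiter(c))
			break;

		buffer += static_cast<char>(input.get());
	}

	return buffer;
}
```

Keeping the peeked character as an int matters: if it is stored in a char, the end-of-file value cannot be told apart from a real byte, and a loop like the one above may never stop at the end of the file.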

PDF Parsing

Tuesday, May 19th, 2009

Just a little status update: yesterday, I finally managed to make the new PDF engine pass all PDF-related tests in the Create Framework. It is revision 100 in the pdf-parsing branch on Launchpad.

Yes, all of the tests have been updated, but they were hand-checked first. As the new engine handles the timing of the writing of objects differently, the tests will work differently. They will change further, in fact, because I just realized that the maximum size for a page tree node is still set to 3 (for debugging purposes) — which is not at all what the final size should be — so the tests that have more than three pages are, unfortunately, wrong. Further, it is possible that some areas of the Create Framework may even have some leaks regarding PDF, so fixes to those could cause problems. This is not a stable branch, is not meant to be a stable branch, and is not guaranteed to even be in a compilable state. I’ve disclaimed, so if your computer blows up, it is not my fault.
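
To make the page tree node limit a bit more concrete, here is a small sketch of how a maximum node size shapes the tree. It groups pages bottom-up into nodes of at most maxKids children until a single root remains. This is only an illustration: the names (PageTreeNode, buildPageTree) are hypothetical, it builds the whole tree in memory, and the engine itself writes nodes out as it goes rather than working this way. In a real file, each node would be a /Pages dictionary with /Kids and /Count entries.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

//hypothetical node type for illustration.
struct PageTreeNode
{
	//indices of the children in the level below (page numbers at level 0)
	std::vector<std::size_t> kids;

	//total number of leaf pages beneath this node
	std::size_t count;
};

//levels[0] holds the nodes that own pages directly; levels.back() holds
//exactly one node, the root.
std::vector<std::vector<PageTreeNode>> buildPageTree(std::size_t pageCount, std::size_t maxKids)
{
	//with fewer than two kids per node, the tree would never narrow.
	assert(maxKids >= 2);

	std::vector<std::vector<PageTreeNode>> levels;

	//leaf counts of the items we are about to group; each page counts as 1.
	std::vector<std::size_t> counts(pageCount, 1);

	do
	{
		std::vector<PageTreeNode> level;

		//we take the items below in chunks of at most maxKids.
		for (std::size_t i = 0; i < counts.size(); i += maxKids)
		{
			PageTreeNode node;
			node.count = 0;

			for (std::size_t k = i; k < counts.size() && k < i + maxKids; ++k)
			{
				node.kids.push_back(k);
				node.count += counts[k];
			}

			level.push_back(node);
		}

		//an empty document still gets a root node.
		if (level.empty())
			level.push_back(PageTreeNode{{}, 0});

		//the nodes we just made become the items for the next level up.
		counts.clear();
		for (const PageTreeNode &node : level)
			counts.push_back(node.count);

		levels.push_back(level);
	} while (levels.back().size() > 1);

	return levels;
}
```

With the debugging limit of 3, a seven-page document gets three bottom-level nodes (holding 3, 3, and 1 pages) under one root. A three-page document fits in a single node, so a limit of 3 only changes the output once a document has more than three pages.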

Now, I’ve finally started PDF parsing. Currently, I’ve got the first XRef table being read in. I’m excited. Hopefully, by the end of the week, I’ll have some form of PDF-in-PDF embedding working. Currently, I’m aiming to support only PDF 1.4 — before all of that cross-reference stream and object stream business came about that would require the ability to handle compression (which, while on the eventual to-do list, is not currently a priority).
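
For anyone curious what reading an XRef table involves: in PDF 1.4, each entry in a classic cross-reference table is exactly 20 bytes, made up of a 10-digit number, a space, a 5-digit generation number, a space, the letter n (in use) or f (free), and a two-character line ending. The sketch below parses one such entry. It is not the framework's code; XRefEntry and parseXRefEntry are names made up for illustration, and it deliberately ignores the line ending and the subsection headers around the entries.

```cpp
#include <cassert>
#include <cctype>
#include <string>

//hypothetical struct for illustration.
struct XRefEntry
{
	//byte offset of the object if in use; next free object number if free
	unsigned long long offset;
	unsigned int generation;
	bool inUse;
};

//we look at the first 18 characters: "oooooooooo ggggg n". Returns false
//if the text does not have that layout.
bool parseXRefEntry(const std::string &line, XRefEntry &out)
{
	//check the fixed positions first.
	if (line.size() < 18)
		return false;
	if (line[10] != ' ' || line[16] != ' ')
		return false;
	if (line[17] != 'n' && line[17] != 'f')
		return false;

	//the first ten characters are the offset, padded with zeros.
	unsigned long long offset = 0;
	for (std::size_t i = 0; i < 10; ++i)
	{
		if (!std::isdigit(static_cast<unsigned char>(line[i])))
			return false;
		offset = offset * 10 + static_cast<unsigned long long>(line[i] - '0');
	}

	//characters 11 through 15 are the generation number.
	unsigned int generation = 0;
	for (std::size_t i = 11; i < 16; ++i)
	{
		if (!std::isdigit(static_cast<unsigned char>(line[i])))
			return false;
		generation = generation * 10 + static_cast<unsigned int>(line[i] - '0');
	}

	out.offset = offset;
	out.generation = generation;
	out.inUse = (line[17] == 'n');
	return true;
}
```

Because every entry has the same length, a reader that knows where a subsection starts can jump straight to the entry for any object number without scanning the ones before it.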