Tuesday, April 20, 2010

PdfBox text extraction & GAE

How to extract text from PDF files using PdfBox on Google App Engine

(Warning: I used an old version of PdfBox: 0.7.3.)

PdfBox is a very popular Java library for creating and managing PDF files. It can also extract text from existing PDF files. PdfBox is distributed as a jar file.
I wanted to use it on Google App Engine (Java version) to extract text from a particular area of a PDF page. PdfBox supports that through the PDFTextStripperByArea class. I tried it, but GAE blocked me: PDFTextStripperByArea uses JRE classes that are not allowed, in particular java.awt.Rectangle and java.awt.geom.Rectangle2D. GAE applies a "white list" approach: only a subset of the standard JRE classes is allowed to run on GAE, and 99% of java.awt.* is blocked. http://code.google.com/appengine/docs/java/jrewhitelist.html
There is another problem. During text extraction PdfBox uses a temp file, which by default is created on the file system. GAE does not allow writing to the file system either.

My solution was:
  • use my own Rectangle instead of java.awt.Rectangle
  • use an "in memory" temp file
The first point required modifying and recompiling PdfBox.

My own Rectangle

I created my own Rectangle and Rectangle2D classes. My implementation is not complete compared to the AWT one: I only created the fields and methods that were required.
Then I created a new version of PDFTextStripperByArea, called PDFTextStripperByAreaGAE. I did not modify the original PDFTextStripperByArea because I didn't want to break compatibility with the PdfBox library.
The new class only uses my Rectangle, with no references to java.awt, so GAE allows it to run.
Apart from that, PDFTextStripperByAreaGAE is the same as the original PDFTextStripperByArea: I copied and pasted 99% of the original code.
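
To give an idea of how small such a replacement can be, here is a minimal sketch. It is not the class from my modified PdfBox: the name GaeRectangle and its members are hypothetical, and it assumes the stripper only needs the four coordinates plus a point-in-rectangle test with the same half-open bounds used by java.awt.geom.Rectangle2D.

```java
// Hypothetical minimal stand-in for java.awt.Rectangle (illustration only).
// It has no dependency on java.awt, so it avoids the classes GAE blocks.
final class GaeRectangle {
    final double x;
    final double y;
    final double width;
    final double height;

    GaeRectangle(double x, double y, double width, double height) {
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
    }

    // True when the point lies inside the rectangle.
    // The left/top edges are inclusive, the right/bottom edges exclusive.
    boolean contains(double px, double py) {
        return px >= x && py >= y && px < x + width && py < y + height;
    }
}
```

With this sketch, new GaeRectangle(26, 86, 62, 10).contains(30, 90) is true, while a point at x = 88 falls just outside the right edge.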

Temp file in memory

PdfBox uses the file system by default, but you can force it to use an "in memory" buffer. PdfBox ships with org.pdfbox.io.RandomAccessBuffer, which is what I use.


byte[] pdfBytes; // contains the bytes of the Pdf file
RandomAccessBuffer tempMemBuffer = new RandomAccessBuffer();
PDDocument doc = PDDocument.load(new ByteArrayInputStream(pdfBytes), tempMemBuffer);
PDFTextStripperByAreaGAE sa = new PDFTextStripperByAreaGAE();
sa.addRegion("Area1", new Rectangle(26, 86, 62, 10));
sa.addRegion("Area2", new Rectangle(99, 86, 94, 14));
...
PDPage p = (PDPage) doc.getDocumentCatalog().getAllPages().get(0); // page 1
sa.extractRegions(p);
String area1 = sa.getTextForRegion("Area1");
String area2 = sa.getTextForRegion("Area2");
...
doc.close();
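
The snippet above assumes pdfBytes is already filled. A minimal sketch of one way to do that without touching the file system, assuming the PDF arrives as an InputStream (for example from an upload). StreamBytes and readAll are hypothetical names, not part of PdfBox.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper (not part of PdfBox): copies a stream into a byte array
// using only in-memory buffers, so nothing is written to disk.
final class StreamBytes {
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        // read() returns -1 at end of stream
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }
}
```

The result can be passed as pdfBytes. Since the whole file is held in memory, this only makes sense for small PDF files.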

Live demo

http://fhtino.appspot.com/PdfBoxGAE/demo.jsp

(please use small PDF files)

Friday, April 02, 2010

Interesting takeaways from Jeff Dean's slides

I was looking at the slides that Jeff Dean (Google) presented at Cornell University:
http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf

Lots of interesting ideas, mixed with my own remarks:

  • things (servers, disks, network cards and devices, etc.) break: you have to learn to deal with it. Don't pretend it can't happen.
  • build "internal" services with few dependencies, all clear and documented
  • use "protocols" (data structures) that can evolve transparently for "intermediate systems" (each application understands its own "tags")
  • know the "basic numbers" of data transfer performance under different conditions (memory, disk, LAN, WAN, etc.)
  • meet users' requests without making the system explode in complexity. Beyond a certain point, satisfying every request makes the system too complex and too expensive.
  • develop infrastructure (and software) based on real needs, and don't speculate on possible future needs that don't exist today. Instead, consider how current needs might evolve.
  • application response time is very important. Watch the latency.
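
The point about "protocols" and "tags" can be sketched in a few lines. This is my own illustration, not code from the slides: TaggedMessage, TAG_ROUTE and relay are hypothetical names, and the message is modeled simply as a map from numeric tags to string values. The intermediate system only understands one tag and copies every other tag through unchanged, so new fields can be added later without touching it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of a tag-based message (names are assumptions).
final class TaggedMessage {
    static final int TAG_ROUTE = 1; // the only tag the relay understands

    private final Map<Integer, String> fields = new LinkedHashMap<Integer, String>();

    TaggedMessage put(int tag, String value) {
        fields.put(tag, value);
        return this;
    }

    String get(int tag) {
        return fields.get(tag);
    }

    // Intermediate system: updates the tag it knows, forwards the rest as-is.
    static TaggedMessage relay(TaggedMessage in) {
        TaggedMessage out = new TaggedMessage();
        for (Map.Entry<Integer, String> e : in.fields.entrySet()) {
            out.put(e.getKey(), e.getValue());
        }
        String route = out.get(TAG_ROUTE);
        if (route != null) {
            out.put(TAG_ROUTE, route + ">relay");
        }
        return out;
    }
}
```

A sender that starts using a new tag 99 keeps working through an old relay: the relay appends itself to the route and passes tag 99 along without knowing what it means.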