程式扎記: [ Java Package ] PDFBox - Extract text from PDF file

Wednesday, April 2, 2014

Preface:
The Apache PDFBox™ library is an open source Java tool for working with PDF documents. This project allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents. Apache PDFBox also includes several command line utilities. Apache PDFBox is published under the Apache License v2.0. This post looks at how to use the library to output the text content of a PDF.

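To prepare, first download the library. This post uses pdfbox-app-1.8.4.jar (the pre-built PDFBox standalone binary); you can also check the official download page for a newer release. Note that the code here targets the 1.8.x API, and several of the classes used below were renamed or removed in PDFBox 2.x.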

Extracting text from a PDF file:
Before looking at the sample code, we manually created a PDF file for testing:
- test.pdf

To extract the text content from a PDF file, use the following methods of the PDFTextStripper class:

* String getText(PDDocument doc): Returns the text of the document as a String.
* void writeText(PDDocument doc, Writer outputStream): Writes the text of the given PDDocument to the supplied Writer.
* String getPageSeparator(): Returns the page separator.
* String getPageStart(): Returns the string that is used at the beginning of each page.

In addition, you can set the range of pages to process:

* public void setStartPage(int startPageValue): Where startPageValue is the starting page. The first page of the PDF is 1, second page is 2 and so on.
* public void setEndPage(int endPageValue): Where endPageValue is the last page that you want to extract. The first page of the PDF is 1 and so on.
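
The page numbering is easy to get wrong, so here is a minimal sketch of the range semantics described above: pages are 1-based and both bounds are inclusive. The class PageRange and its method pagesToProcess are hypothetical helpers written for illustration. They are not part of PDFBox and only mimic how a start/end pair selects pages.

```java
import java.util.ArrayList;
import java.util.List;

public class PageRange {
    // Hypothetical helper (not a PDFBox API): lists the 1-based page numbers
    // selected by a start/end pair when both bounds are inclusive.
    public static List<Integer> pagesToProcess(int startPage, int endPage, int pageCount) {
        List<Integer> pages = new ArrayList<Integer>();
        int first = Math.max(1, startPage);        // pages start at 1, not 0
        int last = Math.min(endPage, pageCount);   // never go past the last page
        for (int p = first; p <= last; p++) {
            pages.add(p);
        }
        return pages;
    }

    public static void main(String[] args) {
        System.out.println(pagesToProcess(3, 5, 10));                // [3, 4, 5]
        System.out.println(pagesToProcess(1, Integer.MAX_VALUE, 4)); // [1, 2, 3, 4]
    }
}
```

With a start page of 3 and an end page of 5 on a 10-page document, this corresponds to pages 3, 4 and 5.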

The sample code follows:

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

PDDocument pd = null;
BufferedWriter wr = null;
try {
    File input = new File("test.pdf");  // The PDF file to extract text from
    File output = new File("test.txt"); // The text file that stores the extracted text
    pd = PDDocument.load(input);
    System.out.println(pd.getNumberOfPages());
    System.out.println(pd.isEncrypted());
    PDFTextStripper stripper = new PDFTextStripper();
    //stripper.setStartPage(3); // Start extracting from page 3
    //stripper.setEndPage(5);   // Extract up to and including page 5
    wr = new BufferedWriter(new OutputStreamWriter(
            new FileOutputStream(output)));
    stripper.writeText(pd, wr);
} catch (Exception e) {
    e.printStackTrace();
} finally {
    // Close both resources even if extraction failed.
    // Closing the writer also flushes it.
    try {
        if (wr != null) wr.close();
        if (pd != null) pd.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
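
The sample relies on close() to flush the writer, and the OutputStreamWriter it creates uses the platform default charset. As a sketch of one alternative, using only the standard library, the helper below writes a string to a file as UTF-8 inside a try-with-resources block, so the writer is flushed and closed even when an exception is thrown. TextOutput and writeUtf8 are hypothetical names chosen for illustration; a writer built the same way could be passed to stripper.writeText(pd, wr).

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

public class TextOutput {
    // Hypothetical helper: writes text to a file as UTF-8.
    // try-with-resources closes (and therefore flushes) the writer automatically.
    public static void writeUtf8(File output, String text) throws IOException {
        try (BufferedWriter wr = new BufferedWriter(new OutputStreamWriter(
                new FileOutputStream(output), StandardCharsets.UTF_8))) {
            wr.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        File out = File.createTempFile("extracted", ".txt");
        writeUtf8(out, "Hello PDFBox\n");
        // Read the file back with the same charset to confirm the round trip
        byte[] bytes = Files.readAllBytes(out.toPath());
        System.out.print(new String(bytes, StandardCharsets.UTF_8));
        out.delete();
    }
}
```

Fixing the charset matters mostly when the PDF contains non-ASCII text, since the default charset differs between machines.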

The extracted text is written to test.txt.

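Because the output is plain text, images and links from the PDF cannot appear in the text file. To extract images from the PDF, refer to the code below (the variable pd is a PDDocument object):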

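import java.io.File;
import java.util.List;
import java.util.Map;

import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage;

System.out.printf("\t[Info] Extract image(s)...\n");
List<?> pages = pd.getDocumentCatalog().getAllPages();
for (Object obj : pages) {
    PDPage page = (PDPage) obj;
    PDResources resources = page.getResources();
    // A page may have no resources, or no XObjects, of its own
    Map<String, PDXObject> pdxMap = (resources != null) ? resources.getXObjects() : null;
    if (pdxMap == null) {
        continue;
    }
    for (Map.Entry<String, PDXObject> e : pdxMap.entrySet()) {
        if (e.getValue() instanceof PDXObjectImage) {
            PDXObjectImage imageObj = (PDXObjectImage) e.getValue();
            // Use the image's own suffix (e.g. jpg, png) instead of assuming JPEG
            String fn = String.format("%s.%s", e.getKey(), imageObj.getSuffix());
            System.out.printf("\t\tOutput %s\n", fn);
            imageObj.write2file(new File(fn));
        }
    }
}

pd.close();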
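
XObject names such as Im0 are scoped to a page's resources, so the same key can appear on several pages, and a later image would then overwrite an earlier file of the same name. Below is a minimal sketch of one way to avoid that, using only the standard library: the hypothetical helper uniqueFile appends a counter until the name is unused. Neither the class nor the method is part of PDFBox.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class UniqueNames {
    // Hypothetical helper: returns dir/base.suffix if that name is unused,
    // otherwise dir/base-1.suffix, dir/base-2.suffix, and so on.
    public static File uniqueFile(File dir, String base, String suffix) {
        File candidate = new File(dir, base + "." + suffix);
        int counter = 1;
        while (candidate.exists()) {
            candidate = new File(dir, base + "-" + counter + "." + suffix);
            counter++;
        }
        return candidate;
    }

    public static void main(String[] args) throws IOException {
        File dir = Files.createTempDirectory("images").toFile();
        File first = uniqueFile(dir, "Im0", "jpg");
        first.createNewFile();                       // simulate writing the image from one page
        File second = uniqueFile(dir, "Im0", "jpg"); // same key on a later page
        System.out.println(first.getName());         // Im0.jpg
        System.out.println(second.getName());        // Im0-1.jpg
    }
}
```

In the extraction loop, the returned File could be passed to write2file in place of a file built directly from the key.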

To extract hyperlinks from the PDF, refer to the code below (page is a PDPage object, and wr is the BufferedWriter from the first example):

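import java.util.List;

import org.apache.pdfbox.pdmodel.interactive.action.type.PDAction;
import org.apache.pdfbox.pdmodel.interactive.action.type.PDActionURI;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationLink;

List<PDAnnotation> annotations = page.getAnnotations();
for (PDAnnotation pdan : annotations) {
    if (pdan instanceof PDAnnotationLink) {
        PDAnnotationLink link = (PDAnnotationLink) pdan;
        PDAction action = link.getAction();
        // Only URI actions carry a URL; a link may also jump within the document or have no action
        if (action instanceof PDActionURI) {
            String uri = ((PDActionURI) action).getURI();
            System.out.println("\t\tPDF Link: " + uri);
            wr.append(String.format("Link: %s\n", uri));
        }
    }
}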

Supplement:
* Basic PDFBox Tutorial
* PDFBox API
* Stackoverflow: extract images from pdf using pdfbox
* PDFBox extract link information
