The Apache PDFBox™ library is an open source Java tool for working with PDF documents. This project allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents. Apache PDFBox also includes several command line utilities. Apache PDFBox is published under the Apache License v2.0. This post looks at how to use the library to write the text content of a PDF out to a file.
To get ready, first download the library. This post uses pdfbox-app-1.8.4.jar (pre-built PDFBox standalone binary); you can also check the official download page to see whether a newer release is available.
Extracting text from a PDF file:
Before looking at the sample code, we manually created a PDF file for testing:
- test.pdf
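To extract the text content from a PDF file, use the methods of the PDFTextStripper class, such as writeText(PDDocument, Writer), which the sample code calls.
In addition, you can restrict which pages are processed with setStartPage(int) and setEndPage(int).
The sample code follows: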
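- // PDFBox 1.8.x imports
- import java.io.BufferedWriter;
- import java.io.File;
- import java.io.FileOutputStream;
- import java.io.IOException;
- import java.io.OutputStreamWriter;
- import org.apache.pdfbox.pdmodel.PDDocument;
- import org.apache.pdfbox.util.PDFTextStripper;
- PDDocument pd = null;
- BufferedWriter wr = null;
- try {
-     // The PDF file to extract text from
-     File input = new File("test.pdf");
-     // The text file that receives the extracted text
-     File output = new File("test.txt");
-     pd = PDDocument.load(input);
-     System.out.println(pd.getNumberOfPages());
-     System.out.println(pd.isEncrypted());
-     PDFTextStripper stripper = new PDFTextStripper();
-     //stripper.setStartPage(3); // Start extracting from page 3
-     //stripper.setEndPage(5);   // Extract through page 5
-     wr = new BufferedWriter(new OutputStreamWriter(
-             new FileOutputStream(output)));
-     stripper.writeText(pd, wr);
- } catch (Exception e) {
-     e.printStackTrace();
- } finally {
-     // Close both resources even if extraction failed; closing the writer also flushes it.
-     try { if (wr != null) wr.close(); } catch (IOException e) { e.printStackTrace(); }
-     try { if (pd != null) pd.close(); } catch (IOException e) { e.printStackTrace(); }
- }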
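The sample above builds its OutputStreamWriter without naming a charset, so the text file ends up in the platform default encoding. Below is a minimal sketch, assuming you want UTF-8 output on every platform. The class Utf8TextOutput and its methods are hypothetical helpers that use only the JDK; they are not part of PDFBox.

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of PDFBox): builds writers with a fixed UTF-8 charset.
class Utf8TextOutput {

    // Wraps any output stream in a buffered writer that always encodes as UTF-8.
    static Writer utf8Writer(OutputStream out) {
        return new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8));
    }

    // Convenience overload for writing to a file such as test.txt.
    static Writer utf8Writer(File file) throws IOException {
        return utf8Writer(new FileOutputStream(file));
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        Writer wr = utf8Writer(buffer);
        try {
            // Non-ASCII sample text, as extracted PDF text often contains
            wr.write("PDF \u6587\u5b57");
        } finally {
            wr.close(); // close() flushes the buffered characters into the stream
        }
        System.out.println(buffer.size() + " bytes written");
    }
}
```

A writer obtained from utf8Writer(new File("test.txt")) could then be passed to stripper.writeText(pd, wr) in place of the one created in the sample.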
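Because the output is plain text, the images and links in the PDF cannot appear in the text file. To extract the images from a PDF, refer to the following code (the pd variable is a PDDocument object):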
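- // PDFBox 1.8.x imports
- import java.io.File;
- import java.util.Iterator;
- import java.util.List;
- import java.util.Map;
- import java.util.Map.Entry;
- import org.apache.pdfbox.pdmodel.PDPage;
- import org.apache.pdfbox.pdmodel.PDResources;
- import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObject;
- import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage;
- System.out.printf("\t[Info] Extract image(s)...\n");
- List pages = pd.getDocumentCatalog().getAllPages();
- Iterator iter = pages.iterator();
- while (iter.hasNext()) {
-     PDPage page = (PDPage) iter.next();
-     PDResources resources = page.getResources();
-     // XObjects of the page, keyed by resource name
-     Map<String, PDXObject> pdxMap = resources.getXObjects();
-     if (pdxMap != null) {
-         Iterator<Entry<String, PDXObject>> pdxMapIter = pdxMap.entrySet().iterator();
-         while (pdxMapIter.hasNext()) {
-             Entry<String, PDXObject> e = pdxMapIter.next();
-             // Only image XObjects are written out; other XObjects are skipped.
-             if (e.getValue() instanceof PDXObjectImage) {
-                 PDXObjectImage imageObj = (PDXObjectImage) e.getValue();
-                 // The file name is the XObject key with a fixed .jpeg extension.
-                 String fn = String.format("%s.jpeg", e.getKey());
-                 System.out.printf("\t\tOutput %s\n", fn);
-                 imageObj.write2file(new File(fn));
-             }
-         }
-     }
- }
- if (pd != null) {
-     pd.close();
- }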
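One thing to watch in the image loop: the output name is built from the XObject key alone. Resource keys belong to each page's own resources, so the same key can turn up on more than one page, and a later image would then overwrite an earlier file. Below is a minimal sketch of one way around that, assuming you keep a page counter while iterating. ImageFileNames and forImage are hypothetical names, not PDFBox API, and the suffix is whatever extension you choose.

```java
import java.util.Locale;

// Hypothetical helper (not part of PDFBox): builds per-page output names for extracted images.
class ImageFileNames {

    // Combines a 1-based page number, the XObject key, and a suffix into a file name.
    // Characters outside [A-Za-z0-9._-] in the key are replaced with '_' to keep the name file-system friendly.
    static String forImage(int pageNumber, String key, String suffix) {
        if (pageNumber < 1) {
            throw new IllegalArgumentException("pageNumber must be >= 1");
        }
        String safeKey = key.replaceAll("[^A-Za-z0-9._-]", "_");
        return String.format(Locale.ROOT, "page%03d_%s.%s", pageNumber, safeKey, suffix);
    }

    public static void main(String[] args) {
        // The same key on two different pages now maps to two different files.
        System.out.println(forImage(1, "Im0", "jpeg")); // page001_Im0.jpeg
        System.out.println(forImage(2, "Im0", "jpeg")); // page002_Im0.jpeg
    }
}
```

In the image loop, a call such as forImage(pageNumber, e.getKey(), "jpeg") would take the place of String.format("%s.jpeg", e.getKey()), with pageNumber incremented once per PDPage.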
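To extract the links instead, walk the annotations of each page (page is a PDPage object, and wr is an open Writer):
- // PDFBox 1.8.x imports
- import java.util.List;
- import org.apache.pdfbox.pdmodel.interactive.action.type.PDAction;
- import org.apache.pdfbox.pdmodel.interactive.action.type.PDActionURI;
- import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
- import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationLink;
- List<PDAnnotation> l = page.getAnnotations();
- for (PDAnnotation pdan : l) {
-     if (pdan instanceof PDAnnotationLink) {
-         PDAnnotationLink link = (PDAnnotationLink) pdan;
-         PDAction action = link.getAction();
-         // Only URI actions carry a web address; a link may have no action or a different action type.
-         if (action instanceof PDActionURI) {
-             PDActionURI pdl = (PDActionURI) action;
-             System.out.println("\t\tPDF Link: " + pdl.getURI());
-             wr.append(String.format("Link: %s\n", pdl.getURI()));
-         }
-     }
- }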
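The link code prints each link as soon as it is found. Below is a minimal sketch, assuming you would rather gather the URIs first and skip missing values and repeats. LinkCollector is a hypothetical JDK-only helper, not PDFBox API; the strings passed to add would be the values returned by pdl.getURI().

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper (not part of PDFBox): keeps unique link URIs in the order they were first seen.
class LinkCollector {
    private final Set<String> uris = new LinkedHashSet<String>();

    // Stores the URI if it is non-null, non-blank, and not already present.
    // Returns true only when the URI was actually added.
    boolean add(String uri) {
        if (uri == null) {
            return false;
        }
        String trimmed = uri.trim();
        if (trimmed.isEmpty()) {
            return false;
        }
        return uris.add(trimmed);
    }

    // Returns a copy of the collected URIs in first-seen order.
    List<String> links() {
        return new ArrayList<String>(uris);
    }

    public static void main(String[] args) {
        LinkCollector collector = new LinkCollector();
        collector.add("http://example.com/a");
        collector.add("http://example.com/a"); // duplicate, ignored
        collector.add(null);                   // missing URI, ignored
        collector.add("http://example.com/b");
        for (String uri : collector.links()) {
            System.out.println("Link: " + uri);
        }
    }
}
```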
Supplement:
* Basic PDFBox Tutorial
* PDFBox API
* Stackoverflow: extract images from pdf using pdfbox
* PDFBox extract link information