Oorexx and the Apache Pdfbox Library – Thesis Nutshell Examples for Managing a PDF File

WIRTSCHAFTSUNIVERSITÄT WIEN Vienna University of Economics and Business Bachelor Thesis English title of the Bachelor ooRexx and the Apache PDFBox Library – Thesis Nutshell Examples for Managing a PDF File German title of the Bachelor ooRexx und die Apache PDFBox-Library – Thesis Nutshell-Beispiele für die Verwaltung von PDF-Dateien Author Dobrea, Cristina Nicoleta last name, first name(s) Student ID number 0652377 Degree program J 033 561 Examiner ao. Univ.Prof. Mag. Dr. Rony G. Flatscher degree, first name(s), last name I hereby declare that 1. I have written this Bachelor thesis independently and without the aid of unfair or unauthorized resources. Whenever content was taken directly or indirectly from other sources, this has been indicated and the source referenced. 2. this Bachelor thesis has neither previously been presented for assessment, nor has it been published. 3. this Bachelor thesis is identical with the assessed thesis and the thesis which has been submit- ted in electronic form. 4. (only applicable if the thesis was written by more than one author): this Bachelor thesis was written together with first name(s), last name(s). The individual contributions of each writer as well as the co-written passages have been indicated. Date Signature Bachelor Thesis ooRexx and the Apache PDFBox Library – Nutshell Examples for Managing a PDF File Summer Term 2012 Author: Cristina Nicoleta Dobrea Enrollment Number: 0652377 Examiner: ao. Univ.Prof. Mag. Dr. Rony G. Flatscher ooRexx and the Apache PDFBox Library Abstract This paper provides short examples for working with the Apache PDFBox library. Basic information about the structure of a PDF file is provided to ease understanding. The nutshell examples are written in ooRexx. The functionality of the Java library is imported using BSF4ooRexx. Keywords ooRexx, BSF4ooRexx, PDFBox, PDF, object oriented programming 1 Cristina Nicoleta Dobrea Content Acknowledgements Two people have helped and comfort me while writing the bachelor thesis and deserve my appreciation. I now take the opportunity to thank them for their involvement. First of all, thank you Prof. Dr. Flatscher for encouraging me to take over this ambitions project. Also, thank you for keeping faith that I will successfully handle this venture. Second, I would like to express my gratitude to a dear friend who reviewed my progress and helped me to overcome difficulties. 2 ooRexx and the Apache PDFBox Library Content Table of Contents Content ................................................................................................................................................... 3 1. Introduction .................................................................................................................................... 5 2. PDF – The Portable Document Format ..................................................................................... 6 2.1. The Acrobat Layers .................................................................................................................. 6 2.2. Document Structure ................................................................................................................ 8 3. Involved Components ................................................................................................................ 10 3.1. The Apache PDFBox Library ................................................................................................... 10 3.2. Involved Languages ............................................................................................................... 11 3.2.1. The Principles of Object Oriented Programming ................................................... 11 3.2.2. Java ............................................................................................................................... 13 3.2.3. ooRexx ......................................................................................................................... 14 3.2.4. BSF4ooRexx ............................................................................................................... 15 4. Installation Guide ........................................................................................................................ 17 4.1. ooRexx ................................................................................................................................... 17 4.2. BSF4ooRexx ........................................................................................................................... 17 4.3. Apache PDFBox ...................................................................................................................... 17 5. Used Environment ...................................................................................................................... 18 6. Managing a PDF File – Nutshell Examples ............................................................................ 19 6.1. Create a New Document and Add Text ................................................................................. 19 6.2. Add Image to PDF .................................................................................................................. 22 6.3. Merge PDF Documents .......................................................................................................... 24 6.4. Extract Text from PDF File ..................................................................................................... 25 6.5. Search for String in PDF File .................................................................................................. 28 6.6. Split Up a PDF File by Handling Individual Pages ................................................................... 31 6.7. Split Up a PDF File by Sections .............................................................................................. 34 6.8. Create New Bookmark........................................................................................................... 36 6.9. Get Bookmarks on a Single Level .......................................................................................... 39 6.10. Get the Complete Bookmark Structure of a PDF File ........................................................ 42 6.11. Encrypt a PDF Document ................................................................................................... 48 7. Conclusion and Outlook ............................................................................................................. 51 8. Bibliography ................................................................................................................................. 52 List of Abbreviations .............................................................................................................................. 56 3 Cristina Nicoleta Dobrea Content Table of Source Code Listings Listing 1: Using the BSF class – option1 ........................................................................................ 16 Listing 2: Using the BSF class – option 2 ....................................................................................... 16 Listing 3: Using the BSF class – option 3 ....................................................................................... 16 Listing 4: PDF file creation and text insertion ................................................................................. 19 Listing 5: Image insertion ................................................................................................................... 22 Listing 6: Merging documents ........................................................................................................... 24 Listing 7: Text extraction - ooRexx stream object .......................................................................... 25 Listing 8: Text extraction - Java stream object ............................................................................... 26 Listing 9: Searching for a string ........................................................................................................ 28 Listing 10: Splitting up a PDF file: handling page objects ............................................................ 31 Listing 11: Splitting up a PDF file: creating a splitting algorithm ................................................. 34 Listing 12: Creating new bookmarks ................................................................................................ 36 Listing 13: Extracting outline information: bookmarks on a single level ..................................... 39 Listing 14: Extracting outline information: the complete outline structure .................................. 45 Listing 15: Encrypting a PDF document .......................................................................................... 48 Table of Figures Figure 1: Structure of a PDF Document [PDF Reference, Page 113] .......................................... 9 Figure 2: Searching for a string - outcome ..................................................................................... 29 Figure 3: Creating new bookmarks - outcome ............................................................................... 38 Figure 4: Extracting outline information: bookmarks on a single level - outcome .................... 41 Figure 5: Graphical display of ordered data tree ........................................................................... 42 Figure 6: Extracting outline information: the complete outline structure – outcome ...............

Load more