XML Processing and Website Scraping in Java: How to Use JSoup and XMLBeam in Practice
Gábor László Hajba

This book is for sale at http://leanpub.com/javaxml

This version was published on 2019-12-29

This is a Leanpub book.

© 2014 - 2019 Gábor László Hajba

Contents

Preface
    What took me the most time?
    Acknowledgement
XML Processing and the Google App Engine
    Why GAE?
    Getting the data
    XML to HTML
    XML to PDF
    XML to RTF
    XML to “.*X”
    Exporting the files in GAE
XML Processing Advanced
XML processing when memory matters
Website scraping with JSoup and XMLBeam
Runtime comparison advanced
Upgrade to Java 8
Custom printing for HTML with JSoup
Printing XMLBeam projections

Preface

This is a book about using XML and HTML processing tools on the Java platform. I do not want to explain every tool you can find on the internet. That would be overwhelming and bad for the timeline of this book, so I only look at the tools I encountered in my daily work.
Sometimes I toyed with those tools outside my daily work to research something new or to try out new features. One of those experiments was the performance tuning of the Website Scraper application mentioned in Chapter 3. I thought about the performance of the application (you’ll see that running it with the whole data set was a pain), changed some of the measurement parts and optimized the tools for a better comparison. In the end I parallelized the whole process with Java 8 streams.

The tools I will mention and explain a bit are XMLBeam and JSoup. The first is an awesome XML processing engine, the second is a website scraping tool. You could use XMLBeam to do the same as JSoup, but its query language is a bit bothersome if you do not know XPath.

What took me the most time?

While writing this book, most of my time went into building a stable base application that I could use for performance measurement and for displaying the results. I had many ideas in mind, and as I developed the pieces I found more and better solutions and things I wanted to show you, so I went on and refactored parts of the code. This meant that I sometimes had to redo runs with different configurations to get better results.

But I do not regret these errands, if I may call them that. I learned new things and I have many ideas about where to go next. Perhaps this will result in another book or more blog articles.

Acknowledgement

I have to thank Sven Ewald, the creator of XMLBeam¹, for reviewing my book’s chapters about XMLBeam. Besides this, he found time to answer my questions and to provide me with samples along with the answers.

Because Leanpub currently cannot display the whole book’s table of contents in the sample, I use this workaround to show you what you’ll get if you buy this book. I know it is a bit awkward (you get some empty pages in your sample PDF), but this way you can see what is in the full version.
¹http://xmlbeam.org

XML Processing and the Google App Engine

In this chapter I’ll introduce you to XML processing and the Google App Engine (GAE).

Why GAE?

This is a good question. Mostly because I’ve worked with GAE and ran into some problems with XML processing there, so I thought I could share my problems and solutions with you. Perhaps you are interested in them, or they even help you to solve some problems of your own.

Writing about development for a GAE environment is always kind of “fun”: you have your solution, and at the end you get a punch in the face from GAE because some classes you want to use are not permitted. Then you start looking for a solution within the permitted area.

This was the case when we (a co-worker and I) had the task of rendering an XML document (provided from somewhere somehow; it is not important in the current context) in various formats: PDF and RTF. As a bonus (because rendering those documents was not the easiest thing), I also implemented a web-based display to see whether we got the right data. Visualizing XML as HTML is always the easiest thing. For me at least.

Getting the data

The data came through a SOAP interface in an XML bundle. I will not go into detail about how to access the SOAP interface because it was not the easiest thing, and I wrote an article about it on my blog some time ago. Besides, SOAP is dead, REST is in, and currently HATEOAS is the new path you should walk when you work with remote structured data. However, XML is a good structured data format which you can use in many ways.

As you’ll see later, we needed an XML parser to extract the data from the transmitted XML. For this I created a quick and easy XML extractor which took the XML and extracted the required data into objects with some XPath expressions. It was not the best solution, but it was the least time-consuming one. And it was good practice for me in working with XML.
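The extractor itself is not shown here, so the following is only a minimal sketch of the idea, assuming the standard javax.xml.xpath API from the JDK. The Product class, its fields and the XPath expressions are hypothetical and stand in for whatever the real SOAP payload contained.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XPathExtractorSketch {

    /** Hypothetical target object; the real data model is not given in the text. */
    static final class Product {
        final String name;
        final double price;

        Product(String name, double price) {
            this.name = name;
            this.price = price;
        }
    }

    /** Parses the XML once into a DOM tree and pulls single values out with XPath. */
    static Product extract(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Without a return type, evaluate() returns the string value of the match.
        String name = xpath.evaluate("/product/name", doc);
        // With XPathConstants.NUMBER the result is a Double.
        double price = (Double) xpath.evaluate("/product/price", doc, XPathConstants.NUMBER);

        return new Product(name, price);
    }

    public static void main(String[] args) throws Exception {
        // Illustrative input only; the real dataset came from a SOAP service.
        String xml = "<product><name>Widget</name><price>9.5</price></product>";
        Product p = extract(xml);
        System.out.println(p.name + " costs " + p.price);
    }
}
```

Parsing into a DOM tree first keeps the sketch short, but it loads the whole document into memory, which is one reason such an extractor is quick to write rather than the best solution.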
XML to HTML

As I mentioned, this was not a requirement, but I wanted to see results as soon as possible, so I added an HTML display of the XML input.

Converting XML to HTML is easy: you only need to do an XSL Transformation (XSLT) and you are done. The result is an HTML file (or XML or text, depending on your configuration). But on GAE this is a no-go because you are not allowed to create files dynamically from your application.

Nevertheless, you can still display your XML data as an HTML page: you only have to add the stylesheet to your data, and most browsers will display it correctly.

How to add the stylesheet? You have to add a processing instruction that references the stylesheet to your XML data. For example:

    <?xml-stylesheet type="text/xsl" href="stylesheets/detailHtml.xsl"?>

This transforms the XML to HTML with XSLT (detailHtml.xsl contains the transformation information).

If you get your data from an interface (for example from a SOAP service), you have to be a bit tricky to get your XSL into your XML, because you get all of the data in one XML dataset. The solution you end up with is replacing the opening tag of the root node with the stylesheet instruction followed by the root node itself. With this workaround you can alter the XML dataset and display it along with XSLT. And this works with GAE too.

    String rootNode = "<rootNode>";
    // Strings are immutable: replaceFirst returns a new string, so keep the result.
    xmlString = xmlString.replaceFirst(rootNode,
        "<?xml-stylesheet type=\"text/xsl\" href=\"stylesheets/detailHtml.xsl\"?>" + rootNode);

The example above is a little hack, but you have to do this to add the stylesheet to the XML data.

XML to PDF

Converting XML to PDF is simple too: with XSLT you create an XSL-FO document (FO stands for Formatting Objects) from your XML. An FO document is an XML document using element names (node names) from the FO namespace.
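The XSLT step described above can run entirely in memory, which matters when the application may not write files. The following is only a sketch under assumptions: it uses the standard javax.xml.transform API, and both the tiny stylesheet and the <detail><title> input structure are hypothetical, made up to keep the example self-contained.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XmlToFoSketch {

    // Hypothetical stylesheet: turns <detail><title>...</title></detail>
    // into a minimal FO document with a single text block.
    static final String XSL =
        "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:fo='http://www.w3.org/1999/XSL/Format'>"
        + "<xsl:template match='/detail'>"
        + "<fo:root>"
        + "<fo:layout-master-set>"
        + "<fo:simple-page-master master-name='page'>"
        + "<fo:region-body/>"
        + "</fo:simple-page-master>"
        + "</fo:layout-master-set>"
        + "<fo:page-sequence master-reference='page'>"
        + "<fo:flow flow-name='xsl-region-body'>"
        + "<fo:block><xsl:value-of select='title'/></fo:block>"
        + "</fo:flow>"
        + "</fo:page-sequence>"
        + "</fo:root>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Applies the stylesheet to the XML string and returns the FO document as a string. */
    static String toFo(String xml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
        // Writing into a StringWriter keeps the result in memory; no file is created.
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toFo("<detail><title>Report</title></detail>"));
    }
}
```

In a real application the stylesheet would more likely be loaded from a bundled resource than held in a string constant; the constant is used here only so the sketch runs on its own.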
After this you can send the resulting FO document to a render engine (for example Apache FOP) and you get your PDF. It sounds simple, but GAE does not allow some of the classes that Apache FOP uses (for example AWT graphics). So another workaround is needed.

iText is a good alternative to FOP, but it does not handle FO documents. Nevertheless, iText has an XmlWorker project which is meant for rendering XML (XHTML) documents. This sounded very good, so I gave it a try. To get XHTML from the XML I again used XSLT.

Unfortunately I had some problems with applying the required CSS to the XHTML output (some of the rules worked, some did not), and as far as I can remember XmlWorker had some problems with displaying the required images too. Besides the images, specific fonts were required for displaying the text, and this is hardly manageable either when it comes to XHTML to PDF conversion (or at least I did not find a good enough solution).

So I ended up creating the PDF manually with iText, adding each element on its own, programmatically. To achieve this I created a custom XML extractor which split the provided XML result document into some classes (grouped by coherence) and added display information to these classes. This was the least time-consuming solution.