Data Munging with Perl

DAVID CROSS

MANNING
Greenwich (74° w. long.)

For electronic information and ordering of this and other Manning books, go to www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact:

Special Sales Department
Manning Publications Co.
32 Lafayette Place
Greenwich, CT 06830
Fax: (203) 661-9018
email: [email protected]

©2001 by Manning Publications Co. All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books they publish printed on acid-free paper, and we exert our best efforts to that end.

Library of Congress Cataloging-in-Publication Data

Cross, David, 1962-
Data munging with Perl / David Cross.
p. cm.
Includes bibliographical references and index.
ISBN 1-930110-00-6 (alk. paper)
1. Perl (Computer program language) 2. Data structures (Computer science) 3. Data transmission systems. I. Title.
QA76.73.P22 C39 20001998
005.7'2—dc21 00-050009 CIP

Manning Publications Co.
32 Lafayette Place
Greenwich, CT 06830

Copyeditor: Elizabeth Martin
Typesetter: Dottie Marsico
Cover designer: Leslie Haimes

Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 – VHG – 04 03 02 01

contents

foreword
preface
about the cover illustration

PART I FOUNDATIONS

1 Data, data munging, and Perl
  1.1 What is data munging?
      Data munging processes ■ Data recognition ■ Data parsing ■ Data filtering ■ Data transformation
  1.2 Why is data munging important?
      Accessing corporate data repositories ■ Transferring data between multiple systems ■ Real-world data munging examples
  1.3 Where does data come from? Where does it go?
      Data files ■ Databases ■ Data pipes ■ Other sources/sinks
  1.4 What forms does data take?
      Unstructured data ■ Record-oriented data ■ Hierarchical data ■ Binary data
  1.5 What is Perl?
      Getting Perl
  1.6 Why is Perl good for data munging?
  1.7 Further information
  1.8 Summary

2 General munging practices
  2.1 Decouple input, munging, and output processes
  2.2 Design data structures carefully
      Example: the CD file revisited
  2.3 Encapsulate business rules
      Reasons to encapsulate business rules ■ Ways to encapsulate business rules ■ Simple module ■ Object class
  2.4 Use UNIX “filter” model
      Overview of the filter model ■ Advantages of the filter model
  2.5 Write audit trails
      What to write to an audit trail ■ Sample audit trail ■ Using the UNIX system logs
  2.6 Further information
  2.7 Summary

3 Useful Perl idioms
  3.1 Sorting
      Simple sorts ■ Complex sorts ■ The Orcish Manoeuvre ■ Schwartzian transform ■ The Guttman-Rosler transform ■ Choosing a sort technique
  3.2 Database Interface (DBI)
      Sample DBI program
  3.3 Data::Dumper
  3.4 Benchmarking
  3.5 Command line scripts
  3.6 Further information
  3.7 Summary

4 Pattern matching
  4.1 String handling functions
      Substrings ■ Finding strings within strings (index and rindex) ■ Case transformations
  4.2 Regular expressions
      What are regular expressions?
      ■ Regular expression syntax ■ Using regular expressions ■ Example: translating from English to American ■ More examples: /etc/passwd ■ Taking it to extremes
  4.3 Further information
  4.4 Summary

PART II DATA MUNGING

5 Unstructured data
  5.1 ASCII text files
      Reading the file ■ Text transformations ■ Text statistics
  5.2 Data conversions
      Converting the character set ■ Converting line endings ■ Converting number formats
  5.3 Further information
  5.4 Summary

6 Record-oriented data
  6.1 Simple record-oriented data
      Reading simple record-oriented data ■ Processing simple record-oriented data ■ Writing simple record-oriented data ■ Caching data
  6.2 Comma-separated files
      Anatomy of CSV data ■ Text::CSV_XS
  6.3 Complex records
      Example: a different CD file ■ Special values for $/
  6.4 Special problems with date fields
      Built-in Perl date functions ■ Date::Calc ■ Date::Manip ■ Choosing between date modules
  6.5 Extended example: web access logs
  6.6 Further information
  6.7 Summary

7 Fixed-width and binary data
  7.1 Fixed-width data
      Reading fixed-width data ■ Writing fixed-width data
  7.2 Binary data
      Reading PNG files ■ Reading and writing MP3 files
  7.3 Further information
  7.4 Summary

PART III SIMPLE DATA PARSING

8 Complex data formats
  8.1 Complex data files
      Example: metadata in the CD file ■ Example: reading the expanded CD file
  8.2 How not to parse HTML
      Removing tags from HTML ■ Limitations of regular expressions
  8.3 Parsers
      An introduction to parsers ■ Parsers in Perl
  8.4 Further information
  8.5 Summary

9 HTML
  9.1 Extracting HTML data from the World Wide Web
  9.2 Parsing HTML
      Example: simple HTML parsing
  9.3 Prebuilt HTML parsers
      HTML::LinkExtor ■ HTML::TokeParser ■ HTML::TreeBuilder and HTML::Element
  9.4 Extended example: getting weather forecasts
  9.5 Further information
  9.6 Summary

10 XML
  10.1 XML overview
      What’s wrong with HTML? ■ What is XML?
  10.2 Parsing XML with XML::Parser
      Example: parsing weather.xml ■ Using XML::Parser ■ Other XML::Parser styles ■ XML::Parser handlers
  10.3 XML::DOM
      Example: parsing XML using XML::DOM
  10.4 Specialized parsers—XML::RSS
      What is RSS?
      ■ A sample RSS file ■ Example: creating an RSS file with XML::RSS ■ Example: parsing an RSS file with XML::RSS
  10.5 Producing different document formats
      Sample XML input file ■ XML document transformation script ■ Using the XML document transformation script
  10.6 Further information
  10.7 Summary

11 Building your own parsers
  11.1 Introduction to Parse::RecDescent
      Example: parsing simple English sentences
  11.2 Returning parsed data
      Example: parsing a Windows INI file ■ Understanding the INI file grammar ■ Parser actions and the @item array ■ Example: displaying the contents of @item ■ Returning a data structure
  11.3 Another example: the CD data file
      Understanding the CD grammar ■ Testing the CD file grammar ■ Adding parser actions
  11.4 Other features of Parse::RecDescent
  11.5 Further information
  11.6 Summary

PART IV THE BIG PICTURE

12 Looking back—and ahead
  12.1 The usefulness of things
      The usefulness of data munging ■ The usefulness of Perl ■ The usefulness of the Perl community
  12.2 Things to know
      Know your data ■ Know your tools ■ Know where to go for more information

appendix A Modules reference
appendix B Essential Perl
index

foreword

Perl is something of a weekend warrior. Outside of business hours you’ll find it indulging in all kinds of extreme sports: writing haiku; driving GUIs; reviving Lisp, Prolog, Forth, Latin, and other dead languages; playing psychologist; shovelling MUDs; inflecting English; controlling neural nets; bringing you the weather; playing with Lego; even running quantum computations.

But that’s not its day job.
Nine-to-five it earns its keep far more prosaically: storing information in databases, extracting it from files, reorganizing rows and columns, converting to and from bizarre formats, summarizing documents, tracking data in real time, creating statistics, doing back-up and recovery, merging and splitting data streams, logging and checkpointing computations. In other words, munging data. It’s a dirty job, but someone has to do it.

If that someone is you, you’re definitely holding the right book. In the following pages, Dave will show you dozens of useful ways to get those everyday data manipulation chores done better, faster, and more reliably. Whether you deal with fixed-format data, or binary, or SQL databases, or CSV, or HTML/XML, or some bizarre proprietary format that was obviously made up on a drunken bet, there’s help right here.

Perl is so good for the extreme stuff, that we sometimes forget how powerful it is for mundane data manipulation as well. As this book so ably demonstrates, in addition to the hundreds of esoteric tools it offers, our favourite Swiss Army Chainsaw also sports a set of simple blades that are ideal for slicing and dicing ordinary data. Now that’s a knife!

DAMIAN CONWAY

preface

Over the last five years there has been an explosion of interest in Perl. This is largely because of the huge boost that Perl received when it was adopted as the de facto language for creating content on the World Wide Web.