Machine Learning Pipelines with Modern Big Data Tools for High Energy Physics

M. Migliorini (1,2), R. Castellotti (1), L. Canali (1,*), and M. Zanetti (2)

(1) European Organization for Nuclear Research (CERN), Geneva, Switzerland
(2) University of Padova, Padua, Italy
(*) [email protected]

arXiv:1909.10389v5 [cs.DC] 16 Jun 2020

Abstract

The effective utilization at scale of complex machine learning (ML) techniques for HEP use cases poses several technological challenges, most importantly on the actual implementation of dedicated end-to-end data pipelines. A solution to these challenges is presented, which allows training neural network classifiers using solutions from the Big Data and data science ecosystems, integrated with tools, software, and platforms common in the HEP environment. In particular, Apache Spark is exploited for data preparation and feature engineering, running the corresponding (Python) code interactively on Jupyter notebooks. Key integrations and libraries that make Spark capable of ingesting data stored in ROOT format and accessed via the XRootD protocol are described and discussed. Training of the neural network models, defined using the Keras API, is performed in a distributed fashion on Spark clusters by using BigDL with Analytics Zoo and also by using TensorFlow, notably for distributed training on CPU and GPU resources. The implementation and the results of the distributed training are described in detail in this work.

Introduction

High energy physics (HEP) experiments like those at the Large Hadron Collider (LHC) are paramount examples of "big-data" endeavors: chasing extremely rare physics processes requires producing, managing and analyzing large amounts of complex data. Data processing throughput of those experiments is expected to exceed 200 TB/s in 2025 with the upgrade of the LHC (the HL-LHC project), which, after tight online filtering, implies saving 100 PB of data per year on permanent storage. Thorough and complex data processing enables major scientific achievements, like the discovery of the Higgs boson in 2012. Data ingestion, feature engineering, data reduction and classification are complex tasks, each requiring advanced techniques to be accomplished. While custom solutions have been employed so far, recent developments of open source tools are making the latter compelling options for HEP-specific use cases. Furthermore, physics data analysis is profiting to a large extent from modern Machine Learning (ML) techniques, which are revolutionizing each processing step, from physics object reconstruction (feature engineering), to parameter estimation (regression) and signal selection (classification). In this scope, Apache Spark [1] represents a very promising tool to extend the traditional HEP approach, by combining in a unique system powerful means for both sophisticated data engineering and Machine Learning. Among the most popular analytics engines for big data processing, Spark allows performing interactive analysis and data exploration through its mature data processing engine and API for distributed data processing, through its integration with cluster systems, and by featuring ML libraries that make it possible to train all common classifiers and regressors in a distributed fashion on large datasets.

It has been proved (e.g. in [2]) that Deep Learning can considerably boost the performance of physics data analysis, yielding remarkable results from larger sets of low-level (i.e. less "engineered") quantities. There is thus a significant interest in integrating Spark with tools, like BigDL [3], that allow distributed training of deep learning models.

The development of an end-to-end machine learning pipeline to analyze HEP data using Apache Spark is described in this paper. After briefly recalling the traditional data processing and analysis workflow in HEP, the specific physics use case addressed in this work is presented; the various steps of the pipeline are then described in detail, from data ingestion to model training, whereas the overall results are reported in the final section.

The primary goal of this work is to reproduce the classification performance results of [4] using tools from the Big Data and data science ecosystems, showing that the proposed pipeline makes more efficient usage of computing resources and provides a more productive interface for physicists, along all the steps of the processing pipeline.
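To make the Spark/BigDL integration concrete, the following minimal sketch shows the pattern this kind of pipeline builds on: a neural network defined with a Keras-style API and trained in a distributed fashion on a Spark cluster via Analytics Zoo and BigDL. The synthetic data, feature dimension, and layer sizes are illustrative assumptions, not the actual dataset or models of this paper.

```python
# Minimal sketch: distributed training of a Keras-style model on Spark with
# Analytics Zoo / BigDL. Data, shapes, and layer sizes are illustrative assumptions.
import numpy as np
from zoo.common.nncontext import init_nncontext
from zoo.pipeline.api.keras.models import Sequential
from zoo.pipeline.api.keras.layers import Dense
from bigdl.util.common import Sample

sc = init_nncontext("distributed-training-sketch")  # SparkContext with BigDL configured

# Synthetic stand-in for a preprocessed dataset: an RDD of BigDL Samples
# (feature vector, one-hot label), partitioned across the cluster.
def synthetic_sample(_):
    x = np.random.rand(14).astype("float32")   # 14 input features (assumed)
    y = np.zeros(3, dtype="float32")
    y[np.random.randint(3)] = 1.0              # one of 3 classes (assumed)
    return Sample.from_ndarray(x, y)

train_rdd = sc.parallelize(range(4096)).map(synthetic_sample)

model = Sequential()
model.add(Dense(32, activation="relu", input_shape=(14,)))
model.add(Dense(3, activation="softmax"))
model.compile(optimizer="adam", loss="categorical_crossentropy")

# Gradients are aggregated across Spark executors; BigDL expects the global
# batch size to be a multiple of the total number of executor cores.
model.fit(train_rdd, batch_size=128, nb_epoch=5)
```

The same Keras-style model definition can thus be moved between a laptop-scale prototype and a cluster-scale training run, with Spark handling data distribution and gradient aggregation.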
Traditional Analysis Workflow and Tools

The backbone of the traditional HEP analysis workflow is ROOT [5], a multipurpose C++ toolkit developed at CERN implementing functionalities for I/O operations, persistent storage, statistical analysis, and data visualization. Data gathered by LHC experiments or produced by their simulation software are provided in ROOT format, with file-based data representation and an event-based class structure with branches. The latter is a feature of paramount importance, as it enables the flexibility required to preserve the complexity of the recorded data, allowing keeping track of intrinsic dependencies among the physics objects of each collision event.

Centralized production systems orchestrate the data processing workflow, converting the raw information into higher-level quantities (data or feature engineering). Computing resources, organized worldwide in hierarchical tiers, are exploited employing GRID protocols [6]. Although the centrally produced datasets may require some additional feature preparation, from this stage on the processing is analysis-dependent and is done by the users using batch systems. Machine Learning algorithms are executed either from within the ROOT framework (using TMVA [7], a toolkit for multivariate data analysis) or using more common open source frameworks (Keras/TensorFlow, PyTorch, etc.).

The Physics Use Case

Physicists primarily aim at distinguishing interesting collision events from uninteresting ones, the former being those associated with specific physics signals whose existence is sought or whose properties are worth being studied. Typically, those signals are extremely rare and correspond to a tiny fraction of the whole dataset. Data analysis thus results in a classification problem where, in addition to the signal category, the background is often also split into several classes.

Out of the 40 million collisions produced by the LHC every second, only a small fraction (about 1000 events) can currently be stored by the online pipelines of the two omni-purpose detectors, CMS and ATLAS. A haphazard selection of those events would dilute the already rare signal processes, so efficient data classification needs to take place already online. Experiments implement complex trigger systems, designed to maximize the true-positive rate and minimize the false-positive rate, thus allowing effective utilization of computing resources both online and offline (e.g. processing units, storage, etc.).

The Machine Learning pipeline described in this paper addresses the same physics use case considered in the work by Nguyen et al. [4], where event topology classification, based on deep learning, is used to improve the purity of data samples selected at trigger level. The dataset is the result of a Monte Carlo event generation, where three different processes (categories) have been simulated: the inclusive production of a leptonically decaying W± boson, the pair production of a top-antitop pair (tt̄), and the hadronic production of multijet events. Variables of both low and high level are included in the dataset.
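As an illustration of such a topology classifier, the sketch below defines a three-class neural network over high-level event features with the Keras API. The input dimension and layer sizes are assumptions chosen for illustration; the architectures actually used are those specified in [4].

```python
# Sketch of a three-class event topology classifier defined with the Keras API.
# The number of high-level features and the layer sizes are illustrative
# assumptions, not the exact architecture from [4].
from tensorflow.keras import layers, models

N_FEATURES = 14  # assumed number of high-level features per event

model = models.Sequential([
    layers.Input(shape=(N_FEATURES,)),
    layers.Dense(50, activation="relu"),
    layers.Dense(20, activation="relu"),
    layers.Dense(3, activation="softmax"),  # W boson, ttbar, multijet categories
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

A softmax output over the three simulated categories directly yields per-event class probabilities, which is what a trigger-level purity selection needs.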
Data Pipeline For Machine Learning

Data pipelines are of paramount importance to make machine learning projects successful, by integrating multiple components and APIs used across the entire data processing chain. A good data pipeline implementation can help to achieve analysis results faster by improving productivity and by reducing the amount of work and toil around core machine learning tasks. In particular, data pipelines are expected to provide solid tools for data processing, a task that ends up being one of the most time-consuming for physicists, and data scientists in general, approaching data analysis problems. Traditionally, HEP has developed custom tools for data processing, which have been successfully used for decades. Recently, a large range of solutions for data processing and machine learning have become available from open source communities. The maturity and adoption of such solutions continue to grow both in industry and academia. Using software from open source communities comes with several advantages, including lowering the cost of development and support, and the possibility of sharing solutions and expertise with a large [...]

Figure 1: Scheme of the data processing pipeline used for training the event topology classifier.

[...] and container-based environments. The four steps of the pipeline we built are (see also Figure 1; a code sketch of the ingestion step is given at the end of this section):

• Data Ingestion, where we read data in ROOT format from the CERN EOS storage system into a Spark DataFrame and save the result as a large table stored in a set of Apache Parquet files.

• Feature Engineering and Event Selection, where the Parquet files containing all the event details processed in Data Ingestion are filtered, and datasets with new features are produced.

• Parameter Tuning, where the hyperparameters for each model architecture are optimized by performing a grid search.

• Training, where the neural network models are trained on the full dataset.

In the next sections, we will describe in detail each step of this pipeline.

Data Source
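As a minimal illustration of how such a data source is consumed in the Data Ingestion step, the following sketch reads ROOT files from EOS over the XRootD protocol into a Spark DataFrame and persists them as Parquet. It assumes the spark-root package (a ROOT data source for Spark) and the CERN Hadoop-XRootD connector are available on the cluster; the paths, column names, and selection are illustrative, not the actual dataset schema.

```python
# Sketch of the Data Ingestion step under stated assumptions: the spark-root
# package and the Hadoop-XRootD connector must be on the Spark classpath;
# paths, column names, and the filter below are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("data-ingestion-sketch")
         .getOrCreate())

# Read ROOT files from EOS over the XRootD protocol into a Spark DataFrame.
events = (spark.read
          .format("org.dianahep.sparkroot")  # data source name may vary by spark-root version
          .load("root://eospublic.cern.ch//eos/project/some/path/events.root"))

# Event selection and a derived feature (hypothetical column names).
selected = (events
            .filter(F.col("nMuon") > 0)
            .withColumn("lep_pt_over_met", F.col("lep_pt") / F.col("MET")))

# Persist as a set of Apache Parquet files for the downstream pipeline steps.
selected.write.mode("overwrite").parquet("hdfs:///user/analysis/events.parquet")
```

Persisting to Parquet at this stage means the later steps (feature engineering, parameter tuning, training) operate on a columnar format that Spark reads efficiently, rather than re-parsing the original ROOT files.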