Fast and Scalable Pattern Mining for Media-Type Focused Crawling

Total Page:16

File Type:pdf, Size:1020Kb

Fast and Scalable Pattern Mining for Media-Type Focused Crawling Provided by the author(s) and NUI Galway in accordance with publisher policies. Please cite the published version when available. Title Fast and Scalable Pattern Mining for Media-Type Focused Crawling Author(s) Umbrich, Jürgen; Karnstedt, Marcel; Harth, Andreas Publication Date 2009 Jürgen Umbrich, Marcel Karnstedt, Andreas Harth "Fast and Publication Scalable Pattern Mining for Media-Type Focused Crawling", Information KDML 2009: Knowledge Discovery, Data Mining, and Machine Learning, in conjunction with LWA 2009, 2009. Item record http://hdl.handle.net/10379/1121 Downloaded 2021-09-27T17:53:57Z Some rights reserved. For more information, please see the item record link above. Fast and Scalable Pattern Mining for Media-Type Focused Crawling∗ [experience paper] Jurgen¨ Umbrich and Marcel Karnstedt and Andreas Harthy Digital Enterprise Research Institute (DERI) National University of Ireland, Galway, Ireland fi[email protected] Abstract 1999]) wants to infer the topic of a target page before de- voting bandwidth to download it. Further, a page’s content Search engines targeting content other than hy- may be hidden in images. pertext documents require a crawler that discov- ers resources identifying files of certain media types. Na¨ıve crawling approaches do not guaran- A crawler for media type targeted search engines is fo- tee a sufficient supply of new URIs (Uniform Re- cused on the document formats (such as audio and video) source Identifiers) to visit; effective and scalable instead of the topic covered by the documents. For a scal- mechanisms for discovering and crawling tar- able media type focused crawler it is absolutely essential geted resources are needed. One promising ap- to discover documents of the requested media type on the proach is to use data mining techniques to iden- Web and to avoid expensive HTTP lookups of irrelevant tify the media type of a resource without the need files. Thus, the crawler needs to identify the media type of for downloading the content of the resource. The a document without establishing a connection and down- idea is to use a learning approach on features loading the content. A common way to identify the format derived from patterns occuring in the resource of a file is to use the file extension of the file name or to identifier. We present a focused crawler as a use detect characteristic byte patterns in the file content itself case for fast and scalable data mining and dis- (magic number approach), which does not scale well. The cuss classification and pattern mining techniques latter approach is not suitable because it requires to retrieve suited for selecting resources satisfying specified the data which is expensive and time consuming task. We media types. We show that we can process an av- can conclude that the file extension is only for some media erage of 17,000 URIs/second and still detect the types suitable as an identifier based on a study of 22M Web media type of resources with a precision of more documents [Umbrich et al., 2008] in 2008. than 80% and a recall of over 65% for all media types. We propose to use classification or pattern mining tech- niques to discover Web documents of requested media types without analysing the content. For this, we utilise 1 Introduction information available for a crawler during the crawling pro- The number of search engines focused on specific top- cess to classify and filter URIs for pre-defined media types. ics has increased significantly over recent years. Besides Note that learning patterns without analysing the content of search engines focused on (hyper)text documents, spe- files can be applied in other scenarios as well, e.g., in genre cialised search engines are available online which collect classification. We present general data mining approaches and integrate information from files of particular media suited for this task, discuss their strengths and weaknesses, types. Seeqpod [URL, j] and Blinkx [URL, g] offer search and, based on first experiences, present a classifier-based over audio and video files, Google Scholar [URL, b] and solution. As this method still comes with several disadvan- CiteSeer [URL, a] are digital libraries of printable docu- tages, we further propose to apply frequent pattern min- ments, Technorati [URL, d] provides real-time access to ing approaches as an alternative. Here we focus on stream news-feeds, and Seekda [URL, c] offers search capabilities mining approaches, as they provide a fast and scalable solu- for web services. A common issue for all vertical search tion that is perfectly suited for the long-running and input- engines is the challenge of discovering and downloading intensive task of a Web crawler. the targeted files on the Web. Specifically, the challenge of detecting documents of a certain media type without in- The remainder of the paper is organised as follows: In specting the content is still not solved [Lausen and Hasel- Section 2 we briefly present the basics of a crawler focus- wanter, 2007]. For this task, a URI-only classifier is a good ing on media-types and discuss general data mining ap- choice, because speed is crucial and content filtering should proaches suited for this. Section 3 presents first experi- be enabled before an (objectionable) web page is down- ences that we gained using implementations available in the loaded. Basically, a focused crawler ([Chakrabarti et al., WEKA toolkit. Based on these experiences, we propose a ∗This material is based upon works jointly supported by the classifier approach in Section 4 and discuss an improve- Science Foundation Ireland under Grant No. SFI/08/CE/I1380 ment based on pattern stream mining. Section 5 contains (Lion-2) and under Grant No. 08/SRC/I1407 (Clique: Graph & an evaluation of the classifier approach. Finally, Section 6 Network Analysis Cluster) briefly presents related work and Section 7 concludes the yCurrent affiliation: Institut AIFB, Universitat¨ Karlsruhe (TH) paper. 2 Data Mining for Focused Crawling the media type video/x-msvideo. The mapped me- The principle of a crawler is to start with a set of seed dia type matches with the real media type of our example. URIs and to recursively follow the discovered links. For The QUERY component is a string of information to be in- this, crawlers follow, for instance, a breadth-first or depth- terpreted by the resource and is omitted here. We use the first approach. This means, the crawler changes domains following notation for the URI components in this paper: through the crawling job and may be returning to domains • E for the file extension of the FILE component already visited before. A focused crawler tries to follow • F for the filename only links to pages and files of a specific type – in our case the media type of files. • the single tokens in the PATH components are labeled Our basic idea is to utilise the information available for with T a crawling during the crawl loop to identify the media-type • the HOST part is labeled with D of Web resources without inspecting the content itself. The accessible information sources for a crawler are [Umbrich 2.2 Classification et al., 2008]: We can map our media-type identification problem to a 1. the URI classification task. Based on a set of predefined media 2. the link position of the URI in the HTML document types, we want to classify a URI to its real media type. Therefore, we have to split the components of a URI into 3. the information sent with the HTTP response header a feature set that serves as the input for the classification However, in this work we exclusively focus only on the algorithm. Possible classification algorithms for this task mining of features contained in the URI. URIs are also could be one of the following: The Bayesian classifier used for topic classification [Baykan et al., 2009]. Media shows convincing results for classifying text. Most spam types are registered with their related RFC description with detection applications use a Bayesian classifier to decide 2 IANA [URL, h]. The RFC description contains a general whether or not a text snippet or an email is spam [Sahami explanation of the purpose of the media type and also rec- et al., 1998]. The support vector machines (SVM) al- ommendation for the file extension(s) to use when publish- gorithm is also used for focused web crawling [Sizov et ing a document. A media type consists of two parts, 1) the al., 2002] and achieved reasonably good results in classi- content-type and 2) the sub-type, separated by “/” fying documents to topics. The C4.5. algorithm gener- (content-type/sub-type). To explain our thoughts ates a pruned or unpruned decision tree [Quinlan, 1993] we introduce an example URI1 referring to a W3C Video [Quinlan, 1996]. A common Bayesian network classifica- on the Web Workshop in 2007. tion algorithm [Pearl, 1985] is also featured. Noteworthy, it is proven that a classification algorithm based on decision 2.1 URI trees and word tokens from anchor text can significantly [ ] Every network-retrievable document has a Uniform Re- improve focused crawling Li et al., 2005 . Hence, we do source Identifier (URI) as defined in RFC2396 [URL, i], not know how it performs for the classification task based which specifies that a URI consists of five components. on a small set of features. Some disadvantages come along with the classification approach. If we use a supervised algorithm we need to gen- [PROTOCOL]://[HOST]:[PORT]/[PATH][FILE]?[QUERY] erate a training set. Given that there exist over 200 different Please note that we focus exclusively on the Hypertext media types on the Web and that 80% of the documents are Transfer Protocol (HTTP) and omit the other protocols, text/html documents, we have the problem of gathering a such as the File Transfer Protocol (FTP, RFC 959 [URL, training set with representative candidates.
Recommended publications
  • Architecture of the World Wide Web, First Edition Editor's Draft 14 October 2004
    Architecture of the World Wide Web, First Edition Editor's Draft 14 October 2004 This version: http://www.w3.org/2001/tag/2004/webarch-20041014/ Latest editor's draft: http://www.w3.org/2001/tag/webarch/ Previous version: http://www.w3.org/2001/tag/2004/webarch-20040928/ Latest TR version: http://www.w3.org/TR/webarch/ Editors: Ian Jacobs, W3C Norman Walsh, Sun Microsystems, Inc. Authors: See acknowledgments (§8, pg. 42). Copyright © 2002-2004 W3C ® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply. Your interactions with this site are in accordance with our public and Member privacy statements. Abstract The World Wide Web is an information space of interrelated resources. This information space is the basis of, and is shared by, a number of information systems. In each of these systems, people and software retrieve, create, display, analyze, relate, and reason about resources. The World Wide Web uses relatively simple technologies with sufficient scalability, efficiency and utility that they have resulted in a remarkable information space of interrelated resources, growing across languages, cultures, and media. In an effort to preserve these properties of the information space as the technologies evolve, this architecture document discusses the core design components of the Web. They are identification of resources, representation of resource state, and the protocols that support the interaction between agents and resources in the space. We relate core design components, constraints, and good practices to the principles and properties they support. Status of this document This section describes the status of this document at the time of its publication.
    [Show full text]
  • Supplement 211: Dicomweb Support for the Application/Zip Payload
    5 Digital Imaging and Communications in Medicine (DICOM) Supplement 211: 10 DICOMweb Support for the application/zip Payload 15 20 Prepared by: Bill Wallace, Brad Genereaux DICOM Standards Committee, Working Group 27 1300 N. 17th Street Rosslyn, Virginia 22209 USA 25 Developed in accordance with work item WI 2018 -09 -C VERSION: 19 January 16, 2020 Table of Contents Scope and Field of Application ........................................................................................................................................ iii 30 Open Questions ....................................................................................................................................................... iii Closed Questions .................................................................................................................................................... iiii 8.6.1.3.1 File Extensions ................................................................................................................................. viv 8.6.1.3.2 BulkData URI ................................................................................................................................... viv 8.6.1.3.3 Logical Format ........................................................................................................................................ viv 35 8.6.1.3.4 Metadata Representations ...................................................................................................................... viv Scope and Field of Application
    [Show full text]
  • Describing Media Content of Binary Data in XML W3C Working Group Note 4 May 2005
    Table of Contents Describing Media Content of Binary Data in XML W3C Working Group Note 4 May 2005 This version: http://www.w3.org/TR/2005/NOTE-xml-media-types-20050504 Latest version: http://www.w3.org/TR/xml-media-types Previous version: http://www.w3.org/TR/2005/NOTE-xml-media-types-20050502 Editors: Anish Karmarkar, Oracle Ümit Yalçınalp, SAP Copyright © 2005 W3C ® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark and document use rules apply. > >Abstract This document addresses the need to indicate the content-type associated with binary element content in an XML document and the need to specify, in XML Schema, the expected content-type(s) associated with binary element content. It is expected that the additional information about the content-type will be used for optimizing the handling of binary data that is part of a Web services message. Status of this Document This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/. This document is a W3C Working Group Note. This document includes the resolution of the comments received on the Last Call Working Draft previously published. The comments on this document and their resolution can be found in the Web Services Description Working Group’s issues list. There is no technical difference between this document and the 2 May 2005 version; the acknowledgement section has been updated to thank external contributors.
    [Show full text]
  • 2016 Technical Guidelines for Digitizing Cultural Heritage Materials
    September 2016 Technical Guidelines for Digitizing Cultural Heritage Materials Creation of Raster Image Files i Document Information Title Editor Technical Guidelines for Digitizing Cultural Heritage Materials: Thomas Rieger Creation of Raster Image Files Document Type Technical Guidelines Publication Date September 2016 Source Documents Title Editors Technical Guidelines for Digitizing Cultural Heritage Materials: Don Williams and Michael Creation of Raster Image Master Files Stelmach http://www.digitizationguidelines.gov/guidelines/FADGI_Still_Image- Tech_Guidelines_2010-08-24.pdf Document Type Technical Guidelines Publication Date August 2010 Title Author s Technical Guidelines for Digitizing Archival Records for Electronic Steven Puglia, Jeffrey Reed, and Access: Creation of Production Master Files – Raster Images Erin Rhodes http://www.archives.gov/preservation/technical/guidelines.pdf U.S. National Archives and Records Administration Document Type Technical Guidelines Publication Date June 2004 This work is available for worldwide use and reuse under CC0 1.0 Universal. ii Table of Contents INTRODUCTION ........................................................................................................................................... 7 SCOPE .......................................................................................................................................................... 7 THE FADGI STAR SYSTEM .......................................................................................................................
    [Show full text]
  • Media Type Application/Vnd.Oracle.Resource+Json
    New Media Type for Oracle REST Services to Support Specialized Resource Types O R A C L E WHITEPAPER | M A R C H 2 0 1 5 Disclaimer The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle. Contents Introduction 3 Conventions and Terminology 3 Core terminology 3 Singular Resource 4 Collection Resource 8 Exception Detail Resource 13 Status Resource 14 Query Description Resource 15 create-form Resource 16 edit-form Resource 17 JSON Schema 18 IANA Considerations 28 References 28 Change Log 28 2 | ORACLE WHITEPAPER: NEW MEDIA TYPE FOR ORACLE REST SERVICES TO SUPPORT SPECIALIZED RESOURCE TYPES Introduction This document defines a new media type, application/vnd.oracle.resource+json, which can be used by REST services to support the specialized resource types defined in the following table. Resource Type Description Singular Single entity resource, such as an employee or a purchase order. For more information, see “Singular Resource.” Collection List of items, such as employees or purchase orders. See “Collection Resource.” Exception Detail Detailed information about a failed request. See “Exception Detail Resource.” Status Status of a long running job. See “Status Resource.” Query description Query syntax description used by client to build the "q" query parameter.
    [Show full text]
  • Social Media Solution Guide
    Social Media Solution Guide Deploy Social Messaging Server with an RSS Channel 9/30/2021 Deploy Social Messaging Server with an RSS Channel Deploy Social Messaging Server with an RSS Channel Contents • 1 Deploy Social Messaging Server with an RSS Channel • 1.1 Prepare the RSS Channel • 1.2 Configure the Options • 1.3 Interaction Attributes • 1.4 Next Steps Social Media Solution Guide 2 Deploy Social Messaging Server with an RSS Channel Warning The APIs and other features of social media sites may change with little warning. The information provided on this page was correct at the time of publication (22 February 2013). For an RSS channel, you need two installation packages: Social Messaging Server and Genesys Driver for Use with RSS. The Driver adds RSS-specific features to Social Messaging Server and does not require its own Application object in the Configuration Server database. You can also create a Custom Media Channel Driver. Important Unlike some other eServices components, Social Messaging Server does not require Java Environment and Libraries for eServices and UCS. Prepare the RSS Channel 1. Deploy Social Messaging Server. 2. Run the installation for Genesys Driver for Use with RSS, selecting the desired Social Messaging Server object: Social Media Solution Guide 3 Deploy Social Messaging Server with an RSS Channel Select your Social Messaging Server Object 3. Locate the driver-for-rss-options.cfg configuration file in the \<Social Messaging Server application>\media-channel-drivers\channel-rss directory. 4. In Configuration Manager, open your Social Messaging Server Application, go to the Options tab, and import driver-for-rss-options.cfg.
    [Show full text]
  • Wrangling Messy CSV Files by Detecting Row and Type Patterns
    Wrangling Messy CSV Files by Detecting Row and Type Patterns Gerrit J.J. van den Burg1, Alfredo Nazábal1, and Charles Sutton1,2,3 1The Alan Turing Institute, London, UK 2Google, Inc. Mountain View, CA, USA 3School of Informatics, The University of Edinburgh, UK November 29, 2018 Abstract It is well known that data scientists spend the majority of their time on preparing data for analysis. One of the first steps in this preparation phase is to load the data from the raw storage format. Comma-separated value (CSV) files are a popular format for tabular data due to their simplicity and ostensible ease of use. However, formatting standards for CSV files are not followed consistently, so each file requires manual inspection and potentially repair before the data can be loaded, an enormous waste of human effort for a task that should be one of the simplest parts of data science. The first and most essential step in retrieving data from CSV files is deciding on the dialect of the file, such as the cell delimiter and quote character. Existing dialect detection approaches are few and non-robust. In this paper, we propose a dialect detection method based on a novel measure of data consistency of parsed data files. Our method achieves 97% overall accuracy on a large corpus of real- world CSV files and improves the accuracy on messy CSV files by almost 22% compared to existing approaches, including those in the Python standard library. Keywords — Data Wrangling, Data Parsing, Comma Separated Values arXiv:1811.11242v1 [cs.DB] 27 Nov 2018 1 CSV is a textbook example of how not to design a textual file format.
    [Show full text]
  • Client-Side Web Technologies
    Client-Side Web Technologies Introduction to HTTP MIME • Multipurpose Internet Mail Extensions • Introduced in 1996 • Created to extend email to http://www.maran.com/dictionary/m/mime/image.gif support: • Text in character sets other than ASCII • Non-text content • Multi-part message Bodies • Header info in non-ASCII character sets MIME Header Fields • MIME-Version • Declares version of message Body format standard in use • Content-Type • DescriBes the data contained in the Body • Content-Disposition • DescriBes how a Body part should Be presented (e.g. inline or attachment) • There are others But we won’t discuss them… MIME Content-Type • DescriBes the data in the Body of a MIME entity • Consists of: • Top level media type • Declares the general type of data • SuBtype • Specifies a specific format for that type of data • Parameters that modify the suBtype (optional) • Due to expanded use, now known as Internet Media Types • IANA maintains the list of registered Media Types: • http://www.iana.org/assignments/media-types Top-Level Media Types • Text • Textual information • Image • Image data • Audio • Audio data • Video • Video data • Application • Some other kind of data (typically Binary, to Be processed By some application) • Multipart • Data consisting of multiple entities of independent data types • SuBtypes such as mixed, alternative, byteranges, and form-data (for HTML forms) MIME Message Examples From: John Doe <[email protected]> Subject: Hello MIME-Version: 1.0 Content-Type: text/plain; This is a message in MIME format. From:
    [Show full text]
  • Supplement 174: Restful Rendering Page 1
    9 September 2015 Supplement 174: Restful Rendering Page 1 5 10 Digital Imaging and Communications in Medicine (DICOM) Supplement 174: RESTful Rendering 15 20 DICOM Standards Committee, Working Group 27: Web Technologies 1300 N. 17th Street Suite 900 Rosslyn, Virginia 22209 USA VERSION: Final Text Developed in accordance with work item 2014-04-A. 25 9 September 2015 Supplement 174: Restful Rendering Page 2 Table of Contents 1 Scope and Field of Application 3 2 Normative References 3 4 Terms and Definitions 3 30 6 Data Communication Requirements 4 6.1.1 Media Types 4 6.1.1.1 Multipart Media Types 5 6.1.1.2 DICOM Resource Categories 5 6.1.1.3 Rendered Media Types 6 35 6.1.1.4 Acceptable Media Types 7 6.1.1.5 Accept Query Parameter 7 6.1.1.6 Accept Header Field 8 6.1.1.7 Selected Media Type 8 6.1.2 Character Sets 9 40 6.1.2.1 Acceptable Character Sets 10 6.1.2.2 Character Set Query Parameter 10 6.1.2.3 Accept-Charset Header Field 10 6.1.2.4 Selected Character Set 10 6.2.2 List of Media Types Supported Acceptable in the Response 11 45 6.2.2.1 Query Parameters 11 6.2.2.1.1 Accept Query Parameter 11 6.2.2.1.2 Character Set Query Parameter 11 6.2.2.2 Header Fields 12 6.2.2.2.1 Accept 12 50 6.2.2.2.2 Accept-Charset 12 6.5 WADO-RS Request/Response 12 6.5.8 RS Retrieve Rendered Transaction 13 6.5.8.1 Request 13 6.5.8.1.1 Target Resources 13 55 6.5.8.1.2 Query Parameters 14 6.5.8.1.3 Header Fields 16 6.5.8.1.4 Payload 16 6.5.8.2 Behavior 16 6.5.8.2.1 Presentation State Instance 17 60 6.5.8.3 Response 17 6.5.8.3.1 Status Codes 18 6.5.8.3.2 Header Fields 18 6.5.8.3.3 Payload 18 6.5.8.4 Media Types 18 65 8.1.5 MIME Acceptable Media Types of the Response 18 8.1.6 Charset of the Response 19 Annex D IANA Character Set Mapping (Informative) 20 9 September 2015 Supplement 174: Restful Rendering Page 3 70 1 Scope and Field of Application This supplement defines rendering functionality for Restful Services (RS) that is largely equivalent to the functionality in the URI and WS services.
    [Show full text]
  • MIME Media Types for GML
    Open Geospatial Consortium Inc. Date: 2010-02-08 Reference number of this OGC® document: OGC 09-144r1 Version: 1.0 Category: OGC® TC Policies and Procedures Editor: Clemens Portele Technical Committee Policies and Procedures: MIME Media Types for GML Copyright © 2010. Open Geospatial Consortium, Inc. To obtain additional rights of use, visit http://www.opengeospatial.org/legal/ Warning This document is a Policies and Procedures Document. It is subject to change based on membership requirements. This document is an official position of the OGC membership. Document type: OpenGIS® Policy Document Document subtype: NA Document stage: Proposed version 1.0 Document language: English Table of Contents 1 Scope ............................................................................................................................. 6 2 The GML MIME Type specification ............................................................................ 6 2.1 General ................................................................................................................... 6 2.2 Encoding considerations ........................................................................................ 6 2.3 Security considerations .......................................................................................... 6 2.4 Interoperability considerations ............................................................................... 7 2.5 Published standard ................................................................................................
    [Show full text]
  • Digital Messaging Server Guide
    Digital Messaging Server Guide Deploy Digital Messaging Server with an RSS Channel 9/25/2021 Contents • 1 Deploy Digital Messaging Server with an RSS Channel • 1.1 Prepare the RSS Channel • 1.2 Configure the Options • 1.3 Interaction Attributes • 1.4 Next Steps Digital Messaging Server Guide 2 Deploy Digital Messaging Server with an RSS Channel Deploy Digital Messaging Server with an RSS Channel Warning The APIs and other features of social media sites may change with little warning. The information provided on this page was correct at the time of publication (Feb. 22, 2013). For an RSS channel, you need two installation packages: • Digital Messaging Server • Genesys Driver for Use with RSS The Driver adds RSS-specific features to Digital Messaging Server and does not require its own Application object in the Configuration Server database. You can also create a Custom Media Channel Driver. Important Unlike some other eServices components, Digital Messaging Server does not require Java Environment and Libraries for eServices and UCS. Prepare the RSS Channel 1. Deploy Digital Messaging Server. 2. Run the installation for Genesys Driver for Use with RSS, selecting the desired Digital Messaging Server object. 3. Locate the driver-for-rss-options.cfg configuration file in the \<Digital Messaging Server application>\media- channel-drivers\channel-rss directory. 4. In Configuration Manager, open your Digital Messaging Server Application, go to the Options tab, and import driver- for-rss-options.cfg. Configure the Options Set the following options:
    [Show full text]
  • Williams College Archives and Special Collections File Format Recommendations
    Williams College Archives and Special Collections File Format Recommendations Based on current digital data curation best practice and recommendations, the Williams College Archives has set the following confidence levels for the longevity of commonly used file formats. Record creators are encouraged to take these recommendations into consideration when determining what file formats to save records as during their active lifecycle. The longer the active lifecycle is anticipated to be, the greater the need to use formats ranked as medium or high confidence. Media type High Confidence Level Medium Confidence Low Confidence Level Level Text - Plain text (encoding: USASCII, - Cascading Style Sheets - PDF (.pdf) (encrypted) UTF-8, UTF-16 with (.css) - Microsoft Word (.doc) BOM) - DTD (.dtd) - WordPerfect (.wpd) - XML (includes XSD/XSL/ - Plain text (ISO 8859-1 - DVI (.dvi) XHTML, etc.; with included or encoding) - All other text forMats accessible schema and - PDF (.pdf) (eMbedded not listed here character encoding explicitly fonts) specified) - Rich Text ForMat 1.x (.rtf) - PDF/A-1 (ISO 19005-1) - HTML (include a DOCTYPE (.pdf) declaration) - SGML (.sgMl) - Open Office (.sxw/.odt) - OOXML (ISO/IEC DIS 29500) (.docx) Raster Images - TIFF (uncoMpressed) - BMP (.bMp) - MrSID (.sid) - JPEG2000 (lossless) (.jp2) - JPEG/JFIF (.jpg) - TIFF (in Planar forMat) -PNG - JPEG2000 (lossy) (.jp2) - FlashPix (.fpx) - TIFF (coMpressed) - PhotoShop (.psd) - GIF (.gif) - RAW - Digital Negative DNG - JPEG 2000 Part 2 (*.jpf, (.dng) .jpx) - PNG (.png) - All other
    [Show full text]