How to Convert HTML Pages to Text in Console / Terminal on GNU / Linux and FreeBSD

Walking in Light with Christ - Faith, Computing, Diary
Articles & tips and tricks on GNU/Linux, FreeBSD, Windows, mobile phone articles, religious related texts
http://www.pc-freak.net/blog

Author : admin

I'm realizing that the more I turn into a full-time GUI user, the less coding or other interesting stuff I do... I remembered the old glorious times when I was a full-time console user, and recalled a nifty trick I was so used to back in the day. Back then I quite often wrote shell scripts that fetched (HTML) web pages and converted the HTML content into plain text (TXT) files.

To fetch a page back in the day I used lynx - (a very simple UNIX text browser, which by the way lacks support for any CSS or JavaScript) in combination with html2text - (an advanced HTML-to-text converter).
Let's say I wanted to fetch my personal home page http://www.pc-freak.net/. I did that via the command:

$ lynx -source http://www.pc-freak.net/ | html2text > pcfreak_page.txt

The content of www.pc-freak.net gets spit out by lynx as HTML source and passed to html2text, which saves it in the plain text file pcfreak_page.txt.

The somewhat more advanced elinks - (a lynx-like alternative character mode WWW browser) provides better support for HTML and even some CSS and JavaScript, so to properly save the content of many pages it is better to use it instead of lynx. The way to produce .txt files using elinks is identical, e.g.:

$ elinks -source http://www.pc-freak.net/blog/ | html2text > pcfreak_blog_page.txt

By the way, back in the day I was more used to links than to the superior elinks. Nowadays I have both text browsers installed, and testing to fetch an HTML page as in the example above and pipe it to html2text produced garbled output.

It is worth saying that it is not even necessary to have a text browser installed in order to fetch a web page and convert it to plain text (TXT)! The wget file downloading tool supports dumping the source as well. For all those who have not (yet) tried it and want to test it:

$ wget -qO- http://www.pc-freak.net | html2text

Of course, for some pages the text inside the HTML tags will not get saved properly with either lynx or elinks, because some text might be embedded in CSS or JavaScript that lynx or elinks does not support. In those cases a GUI browser is useful.
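The fetch commands above differ only in the program that dumps the page source, so they can be wrapped in one small helper. What follows is a minimal sketch, not a tested tool: the function names pick_fetcher, fetch_source and page_to_text are made up for illustration, and it assumes html2text plus at least one of lynx, elinks or wget is installed.

```shell
#!/bin/sh
# Hypothetical helper functions (names are illustrative, not from any package).

# Print the first fetch program found, or fail if none is installed.
pick_fetcher() {
    for tool in lynx elinks wget; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            return 0
        fi
    done
    return 1
}

# Dump the HTML source of URL $1 to stdout with the chosen program,
# using the same options as the one-liners in the article.
fetch_source() {
    case $(pick_fetcher) in
        lynx)   lynx -source "$1" ;;
        elinks) elinks -source "$1" ;;
        wget)   wget -qO- "$1" ;;
        *)      echo "no lynx, elinks or wget found" >&2; return 1 ;;
    esac
}

# Fetch URL $1 and save it as plain text in file $2.
page_to_text() {
    fetch_source "$1" | html2text > "$2"
}
```

After sourcing the file, a call such as page_to_text http://www.pc-freak.net/ pcfreak_page.txt should produce the same result as the lynx one-liner, falling back to elinks or wget when lynx is missing.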
In a GUI browser such as Firefox, Epiphany or Opera you can use the built-in File -> Save As (Text Files) functionality.

[Screenshot: an HTML page about to be saved as a plain text file in Mozilla Firefox]

Besides being handy in conjunction with text browsers, html2text is also handy for converting .html pages already on the computer's hard drive to plain text (.TXT) format. One might wonder why anyone would ever want to do that. Well, I personally prefer reading plain text documents instead of HTML ;) Converting an HTML file already on the hard drive with html2text is done with the command:

$ html2text index.html > index.txt

To convert a whole directory full of .html (documentation) or other files to plain text .TXT, cd to the directory with the HTML files and issue the one-liner bash loop:

$ cd html/
html$ for i in *.html; do html2text "$i" > "${i%.html}.txt"; done

Now lie back and enjoy reading the dox like in the good old hacker days when .TXT files were fashionable ;)
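For repeated use, the directory conversion can also live in a small script. This is only a sketch under a few assumptions: the function names txt_name and convert_dir and the CONVERTER variable are invented here for illustration, and html2text is assumed to be on the PATH. It strips only a trailing .html from each name, copes with spaces in file names, and does nothing when the directory holds no .html files.

```shell
#!/bin/sh
# Hypothetical helper functions (names are illustrative, not from any package).

# Map an .html file name to the matching .txt name.
# Only a trailing ".html" is removed, then ".txt" is appended.
txt_name() {
    printf '%s\n' "${1%.html}.txt"
}

# Convert every *.html file in directory $1 (default: current directory).
# CONVERTER is an assumed override hook; it defaults to html2text.
convert_dir() {
    dir=${1:-.}
    for f in "$dir"/*.html; do
        # With no matches the pattern stays literal, so skip it.
        [ -e "$f" ] || continue
        "${CONVERTER:-html2text}" "$f" > "$(txt_name "$f")"
    done
}
```

After sourcing the file, convert_dir html/ converts the whole html/ directory and leaves each .txt file next to its .html original.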