Linux: Recursively Convert File Content from Windows CP1251 to Unicode UTF-8 with recode and iconv

Walking in Light with Christ - Faith, Computing, Diary Articles & tips and tricks on GNU/Linux, FreeBSD, Windows, mobile phone articles, religious related texts http://www.pc-freak.net/blog

Author : admin

Some time ago I wrote a tiny article explaining how the content of an HTML or text file can be converted with iconv. Just recently, I mirrored a whole website, with its directory structure, using wget. The mirrored website was encoded in Windows-1251 (a charset that is now a bit obsolete and not recommended), whereas the Apache web server I mirrored it to is configured by default to deliver file content (.html, .txt, .js, .css ...) as UTF-8, the newer and more universal standard, which also covers Cyrillic. So when the mirror was opened in a browser, the pages were announced as UTF-8 while the file content itself was still encoded in Windows CP-1251, and I ended up seeing a lot of unreadable garbage characters instead of Slavonic letters. To deal with the inconvenience, I used a one-liner that converts all Windows-1251 files to UTF-8. This prompted me to write this little post, hoping the info might be useful to others in a similar situation.

1. Mass file charset / encoding conversion with recode

On most Linux hosts, recode is probably not installed. If you're on Debian / Ubuntu Linux, install it with apt:

    apt-get install --yes recode

It is also installable from the default repositories on Fedora, RHEL and CentOS with:

    yum -y install recode

Here is the recode description taken from the man page:

    NAME
    recode - converts files between character sets

To convert every .html file below the current directory (recode rewrites each file in place):

    find . -name "*.html" -exec recode WINDOWS-1251..UTF-8 {} \;

If you have a few file extensions whose character encoding needs to be converted, let's say .html, .htm and .php, use:

    find . \( -name "*.html" -o -name '*.htm' -o -name '*.php' \) -exec recode WINDOWS-1251..UTF-8 {} \;

By the way, I just recently learned how to look for several file extensions with a single find one-liner: the argument to pass is -o -name '*.file-extension', and as the example shows, you can look for as many different extensions as you like in one find command. The escaped parentheses are needed to group the -name tests; without them, -exec binds only to the last test (here *.php), because find's implicit "and" takes precedence over -o.

After completing the conversion, I remembered that I had also used iconv on a couple of earlier occasions to convert from Cyrillic CP-1251 to Cyrillic UTF-8. So for those who prefer iconv, here is an alternative, slightly longer method using a for loop, mv and iconv.

2. Mass file conversion with iconv

    for i in $(find . -name "*.html" -print); do iconv -f WINDOWS-1251 -t UTF-8 "$i" > "$i.utf-8" && mv "$i" "$i.bak" && mv "$i.utf-8" "$i"; done

As you can see in the line above, the mv command occurs twice: the first one backs up each original .html file as .html.bak, and the second one puts the file with the converted encoding in its place. The && between the commands ensures the original is only replaced if iconv succeeded. Note that the loop splits find's output on whitespace, so it assumes file names without spaces. For files other than .html, just change -name "*.html" in the find command to whatever file extension you need.
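The iconv loop above breaks on file names that contain spaces, and running it a second time would convert already-converted files again. Below is a minimal sketch of a more defensive variant. The function name convert_tree and the "already decodes as UTF-8" check are my own assumptions, not part of the original one-liner. The check is only a heuristic: pure ASCII files pass it and are left untouched, which is harmless since ASCII is identical in both encodings.

```shell
#!/bin/sh
# convert_tree DIR  (hypothetical helper name, chosen for this sketch)
# Converts every *.html, *.htm and *.php file below DIR from WINDOWS-1251
# to UTF-8, keeping the original next to it with a .bak suffix.
convert_tree() {
    # find hands the file names to the inner shell as arguments,
    # so names containing whitespace survive intact.
    find "$1" -type f \( -name '*.html' -o -name '*.htm' -o -name '*.php' \) \
        -exec sh -c '
            for f do
                # Assumption: a file that already decodes cleanly as UTF-8
                # was converted earlier (or is plain ASCII), so skip it.
                if iconv -f UTF-8 -t UTF-8 "$f" >/dev/null 2>&1; then
                    continue
                fi
                # Replace the original only if iconv succeeded.
                iconv -f WINDOWS-1251 -t UTF-8 "$f" > "$f.utf-8" &&
                    mv "$f" "$f.bak" &&
                    mv "$f.utf-8" "$f"
            done
        ' sh {} +
}
```

To convert the current directory tree, call it as: convert_tree .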
