Unix Beyond the Basics (Pdf)

Total Page:16

File Type:pdf, Size:1020Kb

Unix Beyond the Basics (Pdf) Unix: Beyond the Basics BaRC Hot Topics – Oct 19, 2015 Bioinformatics and Research Computing Whitehead Institute http://barc.wi.mit.edu/hot_topics/ Login • Requesting a tak account http://iona.wi.mit.edu/bio/software/unix/bioinfoaccount.php • Windows PuTTY Need to setup X-windows for graphical display • Macs Access the Terminal: Go Utilities Terminal XQuartz needed for X-windows for newer OS X. 2 Login using Secure Shell ssh –Y user@tak PuTTY on Windows Terminal on Macs Command Prompt 3 user@tak ~$ Hot Topics website: http://jura.wi.mit.edu/bio/education/hot_topics/ • Create a directory for the exercises and use it as your working directory $ cd /nfs/BaRC_training $ mkdir john_doe $ cd john_doe • Copy all files into your working directory $ cp –r /nfs/BaRC_training/UnixII/* . • You should have the files below in your working directory: – foo.txt, sample1.txt, exercise.txt, datasets folder – you can check with ls command 4 Unix Review: Commands command [arg1 arg2 … ] [input1 input2 … ] $ sort –k2,3nr foo.tab -n or -g: -n is recommended, except for scientific notation or start end a leading '+' -r: reverse order $ cut –f1,5 foo.tab $ cut –f1-5 foo.tab -f: select only these fields st th -f1,5: select 1 and 5 fields st nd rd th th -f1-5: select 1 , 2 , 3 , 4 , and 5 fields $ wc –l foo.txt How many lines are in this file? 5 Unix Review: Common Mistakes • Case sensitive cd /nfs/Barc_Public vs cd /nfs/BaRC_Public -bash: cd: /nfs/Barc_Public: No such file or directory • Spaces may matter! rm –f myFiles* vs rm –f myFiles * 6 Unix Review: Pipes • Stream output of one command/program as input for another – Avoid intermediate file(s) $cut –f 1 myFile.txt | sort | uniq –c > uniqCounts.txt pipe 7 What you will learn… • sed • awk • groupBy (bedtools) • loops • join files together • scripting 8 sed: stream editor for filtering and transforming text • Print lines 10 - 15: $ sed -n '10,15p' bigFile > selectedLines.txt • Delete 5 header lines at the beginning of a file: $ sed '1,5d' file > fileNoHeader • Remove all version numbers (eg: '.1') from the end of a list of sequence accessions: eg. NM_000035.2 $ sed 's/\.[0-9]\+//g' accsWithVersion > accsOnly s: substitute g: global modifier d: delete line p: print the current pattern space -n: only print those lines matching our pattern 9 Regular Expressions • Pattern matching and easier to search • Commonly used regular expressions: Regular Matches Expression . All characters * Zero or more; wildcard + One or more ^ Beginning of a line $ End of a line Examples -list all txt files: ls *.txt -replace CHR with Chr at the beginning of each line: sed 's/^CHR/Chr/' myFile.txt Note: regular expression syntax may slightly differ between sed, awk, Unix shell, eg. \+ in sed 10 awk • Name comes from the original authors: Alfred V. Aho, Peter J. Weinberger, Brian W. Kernighan • A simple programing language • Good for short programs, including parsing files 11 awk #: comment, ignored by awk By default, awk splits each line by spaces • Print the 2nd and 1st fields of the file: $ awk ' { print $2"\t"$1 } ' foo.tab • Convert sequences from tab delimited format to fasta format: $ awk ' { print ">"$1"\n"$2 } ' foo.tab > foo.fa $ head -1 foo.tab Seq1 ACTGCATCAC $ head -2 foo.fa >Seq1 ACGCATCAC Character Description \n newline \r carriage return \t horizontal tab 12 awk: field separator • Issues with default separator: white space – one field is gene description with multiple words – consecutive empty cells • To use tab as the separator: $ awk –F "\t" '{ print NF }' foo.txt or $ awk 'BEGIN {FS="\t"} { print NF }' foo.txt BEGIN: action before read input NF: number of fields in the current record FS: input field separator OFS: output field separator END: action after read input 13 awk: arithmetic operations Add average values of 4th and 5th fields to the file: $ awk '{ print $0"\t"($4+$5)/2 }' foo.tab $0: all fields Operator Description + Addition - Subtraction * Multiplication / Division % Modulo ^ Exponentiation ** Exponentiation 14 awk: making comparisons Print out records if values in 4th or 5th field are above 4: $ awk '{ if( $4>4 || $5>4 ) print $0 } ' foo.tab Sequence Description > Greater than < Less than <= Less than or equal to >= Greater than or equal to == Equal to != Not equal to ~ Matches !~ Does not match || Logical OR && Logical AND 15 awk • Conditional statements: Display expression levels for the gene NANOG: if (condition1) action1 $ awk '{ if(/NANOG/) print $0 }' foo.txt else if (condition2) action2 or else action3 $ awk '/NANOG/ { print $0 } ' foo.txt sub function for substitution: or sub( regexp, replacement, target) $ awk '/NANOG/' foo.txt Add line number to the above output: for ( i = 1; i <= NF; i++ ) $ awk '/NANOG/ { print NR"\t"$0 }' foo.txt print $i NR: line number of the current row • Looping: Calculate the average expression (4th, 5th and 6th fields in this case) for each transcript $ awk '{ total= $4 + $5 + $6; avg=total/3; print $0"\t"avg}' foo.txt or $ awk '{ total=0; for (i=4; i<=6; i++) total=total+$i; avg=total/3; print $0"\t"avg }' foo.txt 16 bioawk* • Extension of awk for commonly used file formats in bioinformatics $ bioawk -c help bed: 1:chrom 2:start 3:end 4:name 5:score 6:strand 7:thickstart 8:thickend 9:rgb 10:blockcount 11:blocksizes 12:blockstarts sam: 1:qname 2:flag 3:rname 4:pos 5:mapq 6:cigar 7:rnext 8:pnext 9:tlen 10:seq 11:qual vcf: 1:chrom 2:pos 3:id 4:ref 5:alt 6:qual 7:filter 8:info gff: 1:seqname 2:source 3:feature 4:start 5:end 6:score 7:filter 8:strand 9:group 10:attribute fastx: 1:name 2:seq 3:qual 4:comment *https://github.com/lh3/bioawk 17 bioawk: Examples Print start coordinates in a gff/gtf file, bioawk -c gff '{print $start}' Homo_sapiens.GRCh37.75.canonical.gtf or bioawk -c gff '{print $4}' Homo_sapiens.GRCh37.75.canonical.gtf 11869 12613 13221 Print only the sequences in a fastq file bioawk -c fastx '{print $seq}' myFastQ.txt or bioawk -c fastx '{print $2}' myFastQ.txt GGAGTTACATTGATGGTGGAGAGTTGGCTGGGTATTCCCAATGTCCCAAAGGTCATAAAGGAGGGAAAAGAAAAAAAGAAAAAAATATTCAAAACAAATC NTAACGGATCACGTTGGCCAGGTTACCCCTCCAGAAGGAGAGGAAGCCCTGCTCCTTAGGGATTCTCACCACACAATCAATGATCCCTTTGTACTGCTTC CGAAGGTGGTGATGGTGCCCGTCTTCACCAGGAACTGGTCCACGCCCACGAGGCCCACAATGTTCCCACAAGGCACATCCTCGATGGGCTCCACGTAGCG CTTGAATTTTCTAAATGCAACTTCATCATTCTGCAAATCAGCAAGACTCACTTCAAACACACGACCCTTGAGACCATCAGATGCAATTTTGGTTCCTTGG TGAGCGAGAGGTTGTGGCGGATTGTGTTTTGCCCGGCGGGGGACCTTTCCCCGGAGGAGGGGGAGCGGCCCCCGATGAACTCACAGAAACTCCTCCGCCT ATTATACTCATCGCGAATATCCTTAAGAGGGCGTTCAGCAGCCAGCTTGCGGCAAAACTGCGTAACCGTCTTCTCGTTCTCTAAAAACCATTTTTCGTCC TCCTGCAGCTTGATGGAGATACCTCTTACTGGGCCTCTCTGAATTCGCTTCATCAGATGCGTGACATAACCTGCTATCTTGTTGCGGAGCTTTTTGCTGG 18 Summarize by Columns: groupBy (from bedtools) -g grpCols column(s) for grouping -c -opCols column(s) to be summarized -o Operation(s) applied to opCol: sum, count, min, max, mean, median, stdev, collapse distinct freqdesc (i.e., print desc. list of values:freq) freqasc (i.e., print asc. list of values:freq) Print the maximum value (3rd column) for each gene (2nd field) in the file $ groupBy -g 2 -c 3 -o max foo.tab 19 Join files together • Unix join $ join -1 2 -2 3 FILE1 FILE2 Join files on the 2nd field of FILE1 with the 3rd field of FILE2, only showing the common lines. FILE1 and FILE2 must be sorted on the join fields before running join • BaRC scripts: /nfs/BaRC_Public/BaRC_code/Perl/ sorting not required $ join2filesByFirstColumn.pl file1 file2 $ submatrix_selector.pl matrixFile rowIDfile or 0 [columnIDfile] 20 Shell Flavors • Syntax (for scripting) depends on which shell echo $SHELL • Several shells available, bash is the default on tak and commonly used. • Some of the shells (incomplete listing): Shell Name sh Bourne bash Bourne-Again ksh Korn shell csh C shell 21 Shell Script • Automation: avoid having to retype the same commands many times • Ease of use and more efficient • Outline of a script: #!/bin/bash shebang: interprets how to run the script commands… set of commands used in the script #comments write comments using “#” • Commonly used extension for script is .sh (eg. foo.sh), file must have executable permission 22 Bash Shell: for loop • Process multiple files with one command • Reduce computational time with many cluster nodes for myfile in `ls dataDir` do bsub wc –l $myfile done When referring to a variable, $ is needed before the variable name ($myfile), but $ is not needed when defining it (myfile). Note: this syntax is different from awk 23 Shell Script Example #!/bin/sh # 1. Take two arguments: the first one is a directory with all the datasets, the second one is an output directory # 2. For each data, calculate average gene expression, and save the results in a file in the output directory inDir=$1 # 1st argument outDir=$2 # 2nd argument; outDir must already exist # Define variables: no spaces on either side of the equal sign for i in `/bin/ls $inDir ` # refer to variable with $ do # output file name name="${i}_avg.txt" # {}: $i_avg is not valid; prevent misinterpretation of variable as #characters # calculate average gene expression # NM_001039201 Hdhd2 5.0306 5.3309 5.4998 sort -k2,2 $inDir/$i | groupBy -g 2 -c 3,4,5 -o mean,mean,mean >| $outDir/$name done # You can use graphical editors such as nedit, gedit, xemacs to create shell scripts 24 Advanced Topics Create multiple inputs to a command without creating intermediate files Named Pipes: stream output from two different commands into another program $ mkfifo temp #named pipe $ sort myFile > temp& #data is NOT written to a file! $ mkfifo temp2 #named pipe $ sort myFile2 > temp2& $ ./someProgram –in temp
Recommended publications
  • 1 A) Login to the System B) Use the Appropriate Command to Determine Your Login Shell C) Use the /Etc/Passwd File to Verify the Result of Step B
    CSE ([email protected] II-Sem) EXP-3 1 a) Login to the system b) Use the appropriate command to determine your login shell c) Use the /etc/passwd file to verify the result of step b. d) Use the ‘who’ command and redirect the result to a file called myfile1. Use the more command to see the contents of myfile1. e) Use the date and who commands in sequence (in one line) such that the output of date will display on the screen and the output of who will be redirected to a file called myfile2. Use the more command to check the contents of myfile2. 2 a) Write a “sed” command that deletes the first character in each line in a file. b) Write a “sed” command that deletes the character before the last character in each line in a file. c) Write a “sed” command that swaps the first and second words in each line in a file. a. Log into the system When we return on the system one screen will appear. In this we have to type 100.0.0.9 then we enter into editor. It asks our details such as Login : krishnasai password: Then we get log into the commands. bphanikrishna.wordpress.com FOSS-LAB Page 1 of 10 CSE ([email protected] II-Sem) EXP-3 b. use the appropriate command to determine your login shell Syntax: $ echo $SHELL Output: $ echo $SHELL /bin/bash Description:- What is "the shell"? Shell is a program that takes your commands from the keyboard and gives them to the operating system to perform.
    [Show full text]
  • Getting to Grips with Unix and the Linux Family
    Getting to grips with Unix and the Linux family David Chiappini, Giulio Pasqualetti, Tommaso Redaelli Torino, International Conference of Physics Students August 10, 2017 According to the booklet At this end of this session, you can expect: • To have an overview of the history of computer science • To understand the general functioning and similarities of Unix-like systems • To be able to distinguish the features of different Linux distributions • To be able to use basic Linux commands • To know how to build your own operating system • To hack the NSA • To produce the worst software bug EVER According to the booklet update At this end of this session, you can expect: • To have an overview of the history of computer science • To understand the general functioning and similarities of Unix-like systems • To be able to distinguish the features of different Linux distributions • To be able to use basic Linux commands • To know how to build your own operating system • To hack the NSA • To produce the worst software bug EVER A first data analysis with the shell, sed & awk an interactive workshop 1 at the beginning, there was UNIX... 2 ...then there was GNU 3 getting hands dirty common commands wait till you see piping 4 regular expressions 5 sed 6 awk 7 challenge time What's UNIX • Bell Labs was a really cool place to be in the 60s-70s • UNIX was a OS developed by Bell labs • they used C, which was also developed there • UNIX became the de facto standard on how to make an OS UNIX Philosophy • Write programs that do one thing and do it well.
    [Show full text]
  • A First Course to Openfoam
    Basic Shell Scripting Slides from Wei Feinstein HPC User Services LSU HPC & LON [email protected] September 2018 Outline • Introduction to Linux Shell • Shell Scripting Basics • Variables/Special Characters • Arithmetic Operations • Arrays • Beyond Basic Shell Scripting – Flow Control – Functions • Advanced Text Processing Commands (grep, sed, awk) Basic Shell Scripting 2 Linux System Architecture Basic Shell Scripting 3 Linux Shell What is a Shell ▪ An application running on top of the kernel and provides a command line interface to the system ▪ Process user’s commands, gather input from user and execute programs ▪ Types of shell with varied features o sh o csh o ksh o bash o tcsh Basic Shell Scripting 4 Shell Comparison Software sh csh ksh bash tcsh Programming language y y y y y Shell variables y y y y y Command alias n y y y y Command history n y y y y Filename autocompletion n y* y* y y Command line editing n n y* y y Job control n y y y y *: not by default http://www.cis.rit.edu/class/simg211/unixintro/Shell.html Basic Shell Scripting 5 What can you do with a shell? ▪ Check the current shell ▪ echo $SHELL ▪ List available shells on the system ▪ cat /etc/shells ▪ Change to another shell ▪ csh ▪ Date ▪ date ▪ wget: get online files ▪ wget https://ftp.gnu.org/gnu/gcc/gcc-7.1.0/gcc-7.1.0.tar.gz ▪ Compile and run applications ▪ gcc hello.c –o hello ▪ ./hello ▪ What we need to learn today? o Automation of an entire script of commands! o Use the shell script to run jobs – Write job scripts Basic Shell Scripting 6 Shell Scripting ▪ Script: a program written for a software environment to automate execution of tasks ▪ A series of shell commands put together in a file ▪ When the script is executed, those commands will be executed one line at a time automatically ▪ Shell script is interpreted, not compiled.
    [Show full text]
  • Bash Crash Course + Bc + Sed + Awk∗
    Bash Crash course + bc + sed + awk∗ Andrey Lukyanenko, CSE, Aalto University Fall, 2011 There are many Unix shell programs: bash, sh, csh, tcsh, ksh, etc. The comparison of those can be found on-line 1. We will primary focus on the capabilities of bash v.4 shell2. 1. Each bash script can be considered as a text file which starts with #!/bin/bash. It informs the editor or interpretor which tries to open the file, what to do with the file and how should it be treated. The special character set in the beginning #! is a magic number; check man magic and /usr/share/file/magic on existing magic numbers if interested. 2. Each script (assume you created “scriptname.sh file) can be invoked by command <dir>/scriptname.sh in console, where <dir> is absolute or relative path to the script directory, e.g., ./scriptname.sh for current directory. If it has #! as the first line it will be invoked by this command, otherwise it can be called by command bash <dir>/scriptname.sh. Notice: to call script as ./scriptname.sh it has to be executable, i.e., call command chmod 555 scriptname.sh before- hand. 3. Variable in bash can be treated as integers or strings, depending on their value. There are set of operations and rules available for them. For example: #!/bin/bash var1=123 # Assigns value 123 to var1 echo var1 # Prints ’var1’ to output echo $var1 # Prints ’123’ to output var2 =321 # Error (var2: command not found) var2= 321 # Error (321: command not found) var2=321 # Correct var3=$var2 # Assigns value 321 from var2 to var3 echo $var3 # Prints ’321’ to output
    [Show full text]
  • Unix: Beyond the Basics
    Unix: Beyond the Basics BaRC Hot Topics – September, 2018 Bioinformatics and Research Computing Whitehead Institute http://barc.wi.mit.edu/hot_topics/ Logging in to our Unix server • Our main server is called tak4 • Request a tak4 account: http://iona.wi.mit.edu/bio/software/unix/bioinfoaccount.php • Logging in from Windows Ø PuTTY for ssh Ø Xming for graphical display [optional] • Logging in from Mac ØAccess the Terminal: Go è Utilities è Terminal ØXQuartz needed for X-windows for newer OS X. 2 Log in using secure shell ssh –Y user@tak4 PuTTY on Windows Terminal on Macs Command prompt user@tak4 ~$ 3 Hot Topics website: http://barc.wi.mit.edu/education/hot_topics/ • Create a directory for the exercises and use it as your working directory $ cd /nfs/BaRC_training $ mkdir john_doe $ cd john_doe • Copy all files into your working directory $ cp -r /nfs/BaRC_training/UnixII/* . • You should have the files below in your working directory: – foo.txt, sample1.txt, exercise.txt, datasets folder – You can check they’re there with the ‘ls’ command 4 Unix Review: Commands Ø command [arg1 arg2 … ] [input1 input2 … ] $ sort -k2,3nr foo.tab -n or -g: -n is recommended, except for scientific notation or start end a leading '+' -r: reverse order $ cut -f1,5 foo.tab $ cut -f1-5 foo.tab -f: select only these fields -f1,5: select 1st and 5th fields -f1-5: select 1st, 2nd, 3rd, 4th, and 5th fields $ wc -l foo.txt How many lines are in this file? 5 Unix Review: Common Mistakes • Case sensitive cd /nfs/Barc_Public vs cd /nfs/BaRC_Public -bash: cd: /nfs/Barc_Public:
    [Show full text]
  • Useful Commands in Linux and Other Tools for Quality Control
    Useful commands in Linux and other tools for quality control Ignacio Aguilar INIA Uruguay 05-2018 Unix Basic Commands pwd show working directory ls list files in working directory ll as before but with more information mkdir d make a directory d cd d change to directory d Copy and moving commands To copy file cp /home/user/is . To copy file directory cp –r /home/folder . to move file aa into bb in folder test mv aa ./test/bb To delete rm yy delete the file yy rm –r xx delete the folder xx Redirections & pipe Redirection useful to read/write from file !! aa < bb program aa reads from file bb blupf90 < in aa > bb program aa write in file bb blupf90 < in > log Redirections & pipe “|” similar to redirection but instead to write to a file, passes content as input to other command tee copy standard input to standard output and save in a file echo copy stream to standard output Example: program blupf90 reads name of parameter file and writes output in terminal and in file log echo par.b90 | blupf90 | tee blup.log Other popular commands head file print first 10 lines list file page-by-page tail file print last 10 lines less file list file line-by-line or page-by-page wc –l file count lines grep text file find lines that contains text cat file1 fiel2 concatenate files sort sort file cut cuts specific columns join join lines of two files on specific columns paste paste lines of two file expand replace TAB with spaces uniq retain unique lines on a sorted file head / tail $ head pedigree.txt 1 0 0 2 0 0 3 0 0 4 0 0 5 0 0 6 0 0 7 0 0 8 0 0 9 0 0 10
    [Show full text]
  • Frequently Asked Questions Welcome Aboard! MV Is Excited to Have You
    Frequently Asked Questions Welcome Aboard! MV is excited to have you join our team. As we move forward with the King County Access transition, MV is committed to providing you with up- to-date information to ensure you are informed of employee transition requirements, critical dates, and next steps. Following are answers to frequency asked questions. If at any time you have additional questions, please refer to www.mvtransit.com/KingCounty for the most current contacts. Applying How soon can I apply? MV began accepting applications on June 3rd, 2019. You are welcome to apply online at www.MVTransit.com/Careers. To retain your current pay and seniority, we must receive your application before August 31, 2019. After that date we will begin to process external candidates and we want to ensure we give current team members the first opportunity at open roles. What if I have applied online, but I have not heard back from MV? If you have already applied, please contact the MV HR Manager, Samantha Walsh at 425-231-7751. Where/how can I apply? We will process applications out of our temporary office located at 600 SW 39th Street, Suite 100 A, Renton, WA 98057. You can apply during your lunch break or after work. You can make an appointment if you are not be able to do it during those times and we will do our best to meet your schedule. Please call Samantha Walsh at 425-231-7751. Please bring your driver’s license, DOT card (for drivers) and most current pay stub.
    [Show full text]
  • 1 Serious Emotional Disturbance (SED) Expert Panel
    Serious Emotional Disturbance (SED) Expert Panel Meetings Substance Abuse and Mental Health Services Administration (SAMHSA) Center for Behavioral Health Statistics and Quality (CBHSQ) September 8 and November 12, 2014 Summary of Panel Discussions and Recommendations In September and November of 2014, SAMHSA/CBHSQ convened two expert panels to discuss several issues that are relevant to generating national and State estimates of childhood serious emotional disturbance (SED). Childhood SED is defined as the presence of a diagnosable mental, behavioral, or emotional disorder that resulted in functional impairment which substantially interferes with or limits the child's role or functioning in family, school, or community activities (SAMHSA, 1993). The September and November 2014 panels brought together experts with critical knowledge around the history of this federal SED definition as well as clinical and measurement expertise in childhood mental disorders and their associated functional impairments. The goals for the two expert panel meetings were to operationalize the definition of SED for the production of national and state prevalence estimates (Expert Panel 1, September 8, 2014) and discuss instrumentation and measurement issues for estimating national and state prevalence of SED (Expert Panel 2, November 12, 2014). This document provides an overarching summary of these two expert panel discussions and conclusions. More comprehensive summaries of both individual meetings’ discussions and recommendations are found in the appendices to this summary. Appendix A includes a summary of the September meeting and Appendix B includes a summary of the November meeting). The appendices of this document also contain additional information about child, adolescent, and young adult psychiatric diagnostic interviews, functional impairment measures, and shorter mental health measurement tools that may be necessary to predict SED in statistical models.
    [Show full text]
  • Niagara Networking and Connectivity Guide
    Technical Publications Niagara Networking & Connectivity Guide Tridium, Inc. 3951 Westerre Parkway • Suite 350 Richmond, Virginia 23233 USA http://www.tridium.com Phone 804.747.4771 • Fax 804.747.5204 Copyright Notice: The software described herein is furnished under a license agreement and may be used only in accordance with the terms of the agreement. © 2002 Tridium, Inc. All rights reserved. This document may not, in whole or in part, be copied, photocopied, reproduced, translated, or reduced to any electronic medium or machine-readable form without prior written consent from Tridium, Inc., 3951 Westerre Parkway, Suite 350, Richmond, Virginia 23233. The confidential information contained in this document is provided solely for use by Tridium employees, licensees, and system owners. It is not to be released to, or reproduced for, anyone else; neither is it to be used for reproduction of this control system or any of its components. All rights to revise designs described herein are reserved. While every effort has been made to assure the accuracy of this document, Tridium shall not be held responsible for damages, including consequential damages, arising from the application of the information given herein. The information in this document is subject to change without notice. The release described in this document may be protected by one of more U.S. patents, foreign patents, or pending applications. Trademark Notices: Metasys is a registered trademark, and Companion, Facilitator, and HVAC PRO are trademarks of Johnson Controls Inc. Black Box is a registered trademark of the Black Box Corporation. Microsoft and Windows are registered trademarks, and Windows 95, Windows NT, Windows 2000, and Internet Explorer are trademarks of Microsoft Corporation.
    [Show full text]
  • How to Build a Search-Engine with Common Unix-Tools
    The Tenth International Conference on Advances in Databases, Knowledge, and Data Applications Mai 20 - 24, 2018 - Nice/France How to build a Search-Engine with Common Unix-Tools Andreas Schmidt (1) (2) Department of Informatics and Institute for Automation and Applied Informatics Business Information Systems Karlsruhe Institute of Technologie University of Applied Sciences Karlsruhe Germany Germany Andreas Schmidt DBKDA - 2018 1/66 Resources available http://www.smiffy.de/dbkda-2018/ 1 • Slideset • Exercises • Command refcard 1. all materials copyright, 2018 by andreas schmidt Andreas Schmidt DBKDA - 2018 2/66 Outlook • General Architecture of an IR-System • Naive Search + 2 hands on exercices • Boolean Search • Text analytics • Vector Space Model • Building an Inverted Index & • Inverted Index Query processing • Query Processing • Overview of useful Unix Tools • Implementation Aspects • Summary Andreas Schmidt DBKDA - 2018 3/66 What is Information Retrieval ? Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an informa- tion need (usually a query) from within large collections (usually stored on computers). [Manning et al., 2008] Andreas Schmidt DBKDA - 2018 4/66 What is Information Retrieval ? need for query information representation how to match? document document collection representation Andreas Schmidt DBKDA - 2018 5/66 Keyword Search • Given: • Number of Keywords • Document collection • Result: • All documents in the collection, cotaining the keywords • (ranked by relevance) Andreas Schmidt DBKDA - 2018 6/66 Naive Approach • Iterate over all documents d in document collection • For each document d, iterate all words w and check, if all the given keywords appear in this document • if yes, add document to result set • Output result set • Extensions/Variants • Ranking see examples later ...
    [Show full text]
  • Sort, Uniq, Comm, Join Commands
    The 19th International Conference on Web Engineering (ICWE-2019) June 11 - 14, 2019 - Daejeon, Korea Powerful Unix-Tools - sort & uniq & comm & join Andreas Schmidt Department of Informatics and Institute for Automation and Applied Informatics Business Information Systems Karlsruhe Institute of Technologie University of Applied Sciences Karlsruhe Germany Germany Andreas Schmidt ICWE - 2019 1/10 sort • Sort lines of text files • Write sorted concatenation of all FILE(s) to standard output. • With no FILE, or when FILE is -, read standard input. • sorting alpabetic, numeric, ascending, descending, case (in)sensitive • column(s)/bytes to be sorted can be specified • Random sort option (-R) • Remove of identical lines (-u) • Examples: • sort file city.csv starting with the second column (field delimiter: ,) sort -k2 -t',' city.csv • merge content of file1.txt and file2.txt and sort the result sort file1.txt file2.txt Andreas Schmidt ICWE - 2019 2/10 sort - examples • sort file by country code, and as a second criteria population (numeric, descending) sort -t, -k2,2 -k4,4nr city.csv numeric (-n), descending (-r) field separator: , second sort criteria from column 4 to column 4 first sort criteria from column 2 to column 2 Andreas Schmidt ICWE - 2019 3/10 sort - examples • Sort by the second and third character of the first column sort -t, -k1.2,1.2 city.csv • Generate a line of unique random numbers between 1 and 10 seq 1 10| sort -R | tr '\n' ' ' • Lottery-forecast (6 from 49) - defective from time to time ;-) seq 1 49 | sort -R | head -n6
    [Show full text]
  • Unix Commands January 2003 Unix
    Unix Commands Unix January 2003 This quick reference lists commands, including a syntax diagram 2. Commands and brief description. […] indicates an optional part of the 2.1. Command-line Special Characters command. For more detail, use: Quotes and Escape man command Join Words "…" Use man tcsh for the command language. Suppress Filename, Variable Substitution '…' Escape Character \ 1. Files Separation, Continuation 1.1. Filename Substitution Command Separation ; Wild Cards ? * Command-Line Continuation (at end of line) \ Character Class (c is any single character) [c…] 2.2. I/O Redirection and Pipes Range [c-c] Home Directory ~ Standard Output > Home Directory of Another User ~user (overwrite if exists) >! List Files in Current Directory ls [-l] Appending to Standard Output >> List Hidden Files ls -[l]a Standard Input < Standard Error and Output >& 1.2. File Manipulation Standard Error Separately Display File Contents cat filename ( command > output ) >& errorfile Copy cp source destination Pipes/ Pipelines command | filter [ | filter] Move (Rename) mv oldname newname Filters Remove (Delete) rm filename Word/Line Count wc [-l] Create or Modify file pico filename Last n Lines tail [-n] Sort lines sort [-n] 1.3. File Properties Multicolumn Output pr -t Seeing Permissions filename ls -l List Spelling Errors ispell Changing Permissions chmod nnn filename chmod c=p…[,c=p…] filename 2.3. Searching with grep n, a digit from 0 to 7, sets the access level for the user grep Command grep "pattern" filename (owner), group, and others (public), respectively. c is one of: command | grep "pattern" u–user; g–group, o–others, or a–all. p is one of: r–read Search Patterns access, w–write access, or x–execute access.
    [Show full text]