Monday, 2 January 2017

Linux command-line basics

In order to process the STATS19 data I taught myself how to use some of the command-line tools in Linux - here are some of the basics needed to get started:

Navigating 

  • Access the command line through 'Terminal'.
  • Navigate to the data folder to run commands.
  • To move to a sub-folder the command is 'cd' - for instance, to move to the Desktop from the home folder, enter "cd Desktop".
  • "cd .." goes up one level in the folder tree.
  • "pwd" outputs the current folder location. 
  • "ls" lists files and sub-folders in the current folder.
  • "man insert-command-name-here" gives you information about the command, options, inputs etc.
  • Review your command line history by typing 'history'. You exit these pages by pressing 'q'.
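
A short session using these commands might look like this (the folder names are just placeholders):

cd Desktop        # move into the Desktop sub-folder
pwd               # show the current folder location
ls                # list the files and sub-folders here
cd ..             # go back up one level
man sort          # read the manual page for 'sort' (press 'q' to exit)
history           # review the commands typed so far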


Directing output from commands

  • ">" replaces any current file contents: "cat file1.csv > output.csv"
  • "tail -n +2 file2.csv >> output.csv" - The '>>' appends content to ouput.csv. The 'tail' command avoids bringing in the header line again.
  • If no file exists, both '>' and '>>' create one (see the combined example below).
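
Putting these together, several files that share the same header could be combined into one like this (the file names are just placeholders):

cat accidents_2014.csv > combined.csv            # first file, header kept
tail -n +2 accidents_2015.csv >> combined.csv    # later files, header skipped
tail -n +2 accidents_2016.csv >> combined.csv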

Sorting files with 'sort'

sort -t"," -k1,1 file.csv

Here '-t","' sets the comma as the field separator and '-k1,1' sorts on the first column only.

Sorting with headers (using 'head' and 'tail' to keep the header line in place):
head -n 1 file.csv && tail -n +2 file.csv | sort -t"," -k1,1
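
To send that result to a new file rather than the screen, the two commands can be grouped so a single redirect captures both lines of output (the output name is just a placeholder):

( head -n 1 file.csv && tail -n +2 file.csv | sort -t"," -k1,1 ) > sorted.csv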

Extracting columns with 'cut'

Specific columns can be extracted from a file. For example, if we wanted to extract columns 2, 4, 5, 6 and 8+ from file.csv:

cut -d , -f 2,4-6,8- file.csv

Here, the '-d ,' tells cut that columns are separated by commas, and '-f 2,4-6,8-' tells it to extract column 2, columns 4 to 6 and every column from 8 onwards. The -f argument can take a single column number or a comma-separated list of numbers and ranges. Note that cut always outputs fields in their original file order, whatever order they are listed in, so it can't be used to rearrange columns.
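
For instance, listing the fields in reverse order makes no difference (at least with GNU cut) - they still come out in file order:

cut -d , -f 3,1 file.csv    # still prints column 1 followed by column 3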

AWK


Some useful built-in variables and operators (a short example follows the list):

NR - the current line number
NF - the number of fields (columns) on the line; $NF references the last field
$0 - the entire line
|| - logical OR
&& - logical AND
! - logical NOT
<=, >=, == - less than or equal to / greater than or equal to / equal to
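
As a small illustration (the file name and numbers here are arbitrary), this prints the line number and the last field of every data row with at least 20 columns:

awk -F ',' 'NR != 1 && NF >= 20 {print NR, $NF}' file.csv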

AWK examples:
  • awk '$4 <= 10 && $4 >= 1 { print $1 }' file.txt - prints the first column of every line where the fourth column is between 1 and 10.
  • Rearranging and filtering columns ('NR == 1' keeps the header line): awk -F ',' -v OFS=',' 'NR == 1 || $8 == 0 {print $1,$18,$16,$19,$20}' file.csv > rearranged.csv
  • Combining columns with text - the 'for' loop does this for every .csv file in the folder: for file in *.csv; do awk -F ',' -v OFS=',' 'NR!=1{$2 = $2 " of " $10; $3 = $3 " of " $11}1' "$file" > "$(basename "$file" .csv)_1.csv"; done
  • Replacing values in a particular column with awk, to avoid sed potentially picking up the wrong data / columns: awk -F ',' -v OFS=',' '$17=="empty" {$17="Not known"}1' file.csv - by default the result is printed to the screen; see below for writing it to a file.
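
Because awk prints its result to the screen by default, the output can be captured with a redirect in the same way as before (the output name is just a placeholder):

awk -F ',' -v OFS=',' '$17=="empty" {$17="Not known"}1' file.csv > file_cleaned.csv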

More useful websites: