Monday, 30 January 2017

The danger of 'Agricultural Vehicles' in Central London?

The map below shows where pedestrians and cyclists have been killed or injured in collisions involving 'agricultural vehicles' as recorded on police road collision reports for the years 2005-2015.

The cluster of casualties in central London highlights problems with the current reporting of collisions - vehicles are put into categories which do not adequately describe their purpose. In this instance it's likely most or all of these 'agricultural vehicles' were in fact diggers and other construction vehicles lumped into the same category (details here).

This is in addition to the 'other' vehicle category, which encompasses many of the most dangerous construction vehicles and a plethora of other vehicles, masking the large number of deaths caused by these vehicles. We also know from insurers that the construction industry has a poor record when it comes to vans and small lorries.

It's been clear for a long time that the construction industry has a problem when it comes to road safety; unfortunately, the data collected hides the extent of the problem. In order to understand the real causes of road danger we need better data to be made available, including detailed vehicle types and perhaps also the industry sector.

Pedestrians and cyclists killed or injured in collisions involving agricultural vehicles (2005-2015): fatal - black, serious - yellow, slight - blue. 

  • If you want the full details of a specific incident, or to search by area, try
  • Where possible casualty records have been linked to the vehicle which hit them. For pedestrians, the data records which vehicle (first) hit them. For other road users we inferred a collision where there was just one other vehicle involved. When there were more vehicles it was not possible to infer which vehicles collided with which.
  • Care has been taken to avoid errors or omissions while creating these maps, however if you spot an issue please let us know.

Tuesday, 3 January 2017

Processing STATS19 data (big CSV files) using Linux

The STATS19 road collision data published by the UK government consists of large 'comma separated value' (.csv) files, with hundreds of thousands of lines. In order to create the maps on this blog I used the command-line tools built in to the Linux operating system (files of that size are impractical to work with in ordinary spreadsheet programmes such as Excel).

Here are some examples of the commands I put together using various websites. This may be helpful to others working with STATS19 data or other large data-sets.

The data:

The first line of the file is the header with column titles. Each subsequent line represents a separate collision, a casualty, or a vehicle, depending on which file we're looking at.

To avoid losing the header, 'head' and 'tail' can be used. For instance, sorting looks like "head -n 1 file.csv && tail -n +2 file.csv | sort -t"," -k1,1" - the first line is kept separate and the remainder of the file is sorted. For awk commands 'NR == 1 ||' can be added to the logic to ensure the first line is printed. I have ignored headers in some of the examples to make them easier to follow.
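As a minimal, runnable sketch of the header-preserving sort (the sample file and its contents are invented for illustration):

```shell
# Invented sample data - two rows plus a header.
cat > sample.csv <<'EOF'
Accident_Index,Severity
B002,Slight
A001,Fatal
EOF

# Keep the header as-is, sort the rest on column 1, and capture
# both parts into a new file.
(head -n 1 sample.csv && tail -n +2 sample.csv | sort -t"," -k1,1) > sorted.csv

cat sorted.csv
```

The header stays on the first line while the data rows are reordered.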

Match entries from two data-sets using 'join'

'Join' can be used to join the casualty data with the collision record. Both files must be sorted on the join field otherwise the output is unreliable. Join has the following options:
  • -a1 - keep all records from the first file, even where there is no match 
  • -e "0" - put zeros in any missing fields - only works if you set -o 
  • -t, - sets the delimiter as a comma 
  • -v1 - display the non matching records of file one 
  • -o sets the format of the output line, option "auto" 
  • By default join joins on the first field of each file, but this can be changed with -1 and -2.
  • More info:

I followed these steps:
  1. Check to see if any lines don’t match - "join -t, -v1 <(tail -n +2 file1.csv) <(tail -n +2 file2.csv)"
  2. Join, keeping the headers intact - "(join -t, <(head -n 1 file1.csv) <(head -n 1 file2.csv) && join -t, <(tail -n +2 file1.csv) <(tail -n +2 file2.csv)) > combinedfile.csv"
  3. After joining, carriage returns need to be removed, otherwise lines split - "sed -i 's/\r//' combinedfile.csv" (the -i edits the file in place)
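The header-then-body pattern from steps 1-2 can be sketched on toy files (file names and columns are hypothetical; the '<( )' process substitution needs bash):

```shell
# Two invented files sharing the join field in column 1.
cat > file1.csv <<'EOF'
Accident_Index,Severity
A001,Fatal
A002,Slight
EOF
cat > file2.csv <<'EOF'
Accident_Index,Casualty_Class
A001,Pedestrian
A002,Driver
EOF

# Join the header lines, then join the (already sorted) bodies.
(join -t, <(head -n 1 file1.csv) <(head -n 1 file2.csv) &&
 join -t, <(tail -n +2 file1.csv) <(tail -n +2 file2.csv)) > combinedfile.csv

cat combinedfile.csv
```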
To join the make and model information (which came in a separate file) on to the main vehicle data file a different join command was used. In this case the make and model information was incomplete and we only needed the join field (written '0' in -o notation) followed by the 24th and 25th fields of the second file. This was then joined to the vehicle data in a separate operation: "join -t, -a1 -o 0,2.24,2.25 -e empty <(tail -n +2 vehsort.csv) <(tail -n +2 mmsort.csv) > mmjoinfilt.csv"

There was an issue with files starting with an incorrect first column header; this uses sed to replace the first entry: "(head -n 1 file.csv | sed 's/[^,]*,/Key,/' && tail -n +2 file.csv) > output.csv" - '[^,]*,' matches everything up to and including the first comma and replaces it with "Key,". By default sed only substitutes the first occurrence on each line.

Linking vehicle data on to the casualty record

After the collision data is appended to the casualty data, vehicle data can be appended to this. This produces a file with details of the casualty and the circumstances of the collision and relevant vehicle information combined.

For different situations the vehicle data was joined differently. For pedestrians, the data records which vehicle (first) hit them. All other road users have vehicle records of their own (ridden horses count as vehicles), and I inferred a collision where there was just one other vehicle involved. When there were more vehicles it was not possible to infer which vehicles collided with which. When there was only one vehicle involved it was not necessary to link the occupants' records.

In order to match data relevant to a specific vehicle, I added a column which was a combination of the unique accident reference (column 1) and the vehicle identifier (column 3): "awk -F ',' -v OFS=',' '{print $1$3,$0}' file.csv > filewithkey.csv"
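For example, on an invented vehicle file with the accident reference in column 1 and the vehicle reference in column 3:

```shell
cat > veh.csv <<'EOF'
A001,2016,1,Car
A001,2016,2,Pedal cycle
EOF

# Prepend a key made of accident reference + vehicle reference.
awk -F ',' -v OFS=',' '{print $1$3,$0}' veh.csv > vehkey.csv

cat vehkey.csv
```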

The pedestrian data records the vehicle which first hit them, so a similar key can be added and the files can be joined directly.

For two vehicles, the vehicle reference for the colliding vehicle is three minus the current vehicle reference. Because the vehicles will be numbered '1' and '2' this gives the other one. The command to add the key field is: "awk -F ',' -v OFS=',' '$22 == 2 || NR == 1 { print $1(3-$2),$0 }' file.csv" Join it with the vehicle file and remove returns: "join -t, -a1 -o auto -e empty <(tail -n +2 2veh0515kst.csv) <(tail -n +2 vehmm0515fix.csv) | sed 's/\r//' > 2vehlinked0515.csv"
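A stripped-down sketch of the 'three minus' trick (invented records with the accident reference in column 1 and the vehicle reference in column 2):

```shell
cat > cas.csv <<'EOF'
A001,1
A001,2
EOF

# For vehicle 1 the key points at vehicle 2, and vice versa,
# because 3 - 1 = 2 and 3 - 2 = 1.
awk -F ',' -v OFS=',' '{print $1(3-$2),$0}' cas.csv > caskey.csv

cat caskey.csv
```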

For three or more vehicles involved (recorded in column 22) we can only match to every vehicle involved: "awk -F ',' '$22 >= 3' file.csv | join -t, -1 1 -2 2 - vehicles.csv" It is important to note that if we include this data we describe it as collisions where certain vehicle types 'were involved in the incident' rather than describing them as colliding with the casualty. The alternative is not to join this data, which means these casualties would not show up on the maps.

Look-up values from a list (column 1) and replace with text (column 2)

In order to make the output more human-readable I wanted to replace the numeric codes with text wherever possible. This can be done using a look-up file which has the codes in the first column and the text entries in the second column (the header text should be in both columns to avoid messing up the header).

In this case the code is taken from column 15 and replaced with the corresponding text:

awk -F ',' -v OFS=',' 'NR==FNR{a[$1]=$2} NR>FNR{$15=a[$15];print}' journeypurpose.csv file.csv > file2.csv

(Note that the output must go to a new file - redirecting back to file.csv would truncate it before awk has read it.)

Two files are passed to awk. First the look-up file is loaded into an array, then this array is used to replace the values in column 15. 'NR' is the total number of lines processed and 'FNR' is the number of lines processed in the current file, so the first block triggers for the first file and the second block triggers for the second file.
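The same two-file pattern on a toy look-up file and data file (both invented), replacing codes in column 2:

```shell
cat > lookup.csv <<'EOF'
1,Journey as part of work
2,Commuting to/from work
EOF
cat > data.csv <<'EOF'
A001,2
A002,1
EOF

# Pass 1 fills array 'a' from lookup.csv; pass 2 replaces column 2.
awk -F ',' -v OFS=',' 'NR==FNR{a[$1]=$2} NR>FNR{$2=a[$2];print}' lookup.csv data.csv > data_text.csv

cat data_text.csv
```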

Alternatively the looked up values can be added as an additional column: "awk -F ',' -v OFS=',' 'NR==FNR{a[$1]=$2} NR>FNR{print $0,a[$7]}' castype2.csv val2.csv > val3.csv"

Lookup all the files in a folder: "for file in *.csv; do awk -F ',' -v OFS=',' 'NR==FNR{a[$1]=$2} NR>FNR{print $0,a[$16]}' vehtype.csv $file > "$(basename "$file" .csv)_1.csv"; done"

More information on using awk to do a 'look-up' between two data-sets: (note that this is for data using a different delimiter, not comma separated values)

Google Fusion Tables - marker key

In order to display different markers based on severity/casualty type/vehicle type, we need to add a column to tell Google Fusion Tables which marker to use for each data point. Available markers:

First add a column for the markers: "awk -F , -v OFS=, '{print $0,"Fusion Tables icon"}' file.csv"

Secondly set values depending on what you want to display, for instance to add man and woman icons based on the gender in column 4: "awk -F , -v OFS=, '$4=="Female"{$NF="woman"} $4=="Male"{$NF="man"} 1' file.csv"

Here are some of the other values which could be used:
  • $X=="Cyclist"{$NF="cycling"} 
  • $X=="?"{$NF="motorcycling"} 
  • $X=="Horse rider"{$NF="horsebackriding"} 
  • $X=="?"{$NF="truck"} 
  • $X=="Bus or coach (17 or more pass seats)"{$NF="bus"} 
  • $X=="Fatal"{$NF="large_red"} 
  • $X=="Serious"{$NF="large_yellow"} 
  • $X=="Slight"{$NF="small_blue"} 
  • $X=="?"{$NF="small_green"} 
  • $X=="?"{$NF="small_purple"}
Examples (operate on all the files in the folder):

  • for file in *.csv; do awk -F , -v OFS=, '{print $0,"Fusion Tables icon"}' $file > "$(basename "$file" .csv)_1.csv"; done 
  • for file in *_1.csv; do awk -F , -v OFS=, '$4=="Female"{$NF="woman"} $4=="Male"{$NF="man"} $2=="Cyclist"{$NF="cycling"} $2=="Horse rider"{$NF="horsebackriding"} 1' $file > "$(basename "$file" .csv)_2.csv"; done 
  • for file in *_2.csv; do awk -F , -v OFS=, '$3=="Serious"{$NF="small_yellow"} $3=="Slight"{$NF="small_blue"} 1' $file > "$(basename "$file" .csv)_3.csv"; done
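Put together, the two marker steps might look like this on an invented file (the column layout is hypothetical):

```shell
cat > toy.csv <<'EOF'
Ref,Casualty,Severity,Sex
A001,Cyclist,Slight,Female
A002,Pedestrian,Fatal,Male
EOF

# Append the marker column, then overwrite it based on column 4.
awk -F , -v OFS=, '{print $0,"Fusion Tables icon"}' toy.csv |
  awk -F , -v OFS=, '$4=="Female"{$NF="woman"} $4=="Male"{$NF="man"} 1' > toy_markers.csv

cat toy_markers.csv
```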
Update - Fusion tables vehicle icons added using lookup type list:
awk -F ',' -v OFS=',' 'NR==FNR{a[$1]=$2} NR>FNR{print $0,a[$8]}' vehicles2.csv file.csv > file2.csv 

vehicles2.csv (last line represents the header):
Agricultural vehicle,large_green
Bus or coach (17 or more pass seats),bus
Data missing or out of range,large_blue
Goods 7.5 tonnes mgw and over,truck
Goods over 3.5 tonnes and under 7.5 tonnes,truck
Goods vehicle - unknown weight,truck
Minibus (8 - 16 passenger seats),bus
Mobility scooter,large_purple
Motorcycle 125cc and under,motorcycling
Motorcycle 50cc and under,motorcycling
Motorcycle over 125cc and up to 500cc,motorcycling
Motorcycle over 500cc,motorcycling
Motorcycle - unknown cc,motorcycling
Other vehicle,large_blue
Pedal cycle,cycling
Ridden horse,horsebackriding
Taxi/Private hire car,cabs
Van / Goods 3.5 tonnes mgw or under,truck
Vehicle Type,Fusion Tables icon

Uploading the data to Google Maps

Checking files

It's good to check files as you go along:
  • less - Pipe data to the 'less' command to view it one screen at a time and scroll up and down 
  • View non-printing characters "cat -v" 
  • Compare file headers - For all the .csv files in the directory, put the headers into one file “headers.csv” so you can review differences: "find . -name "*.csv" -exec awk -F ',' -v OFS=',' 'NR==1{print FILENAME,$0}' {} \; >> headers.csv" 
  • Count the fields - Print out how many fields there are in each line; if two or more numbers are output it indicates lines have different lengths (which is probably not good): "awk -F ',' '{print NF}' file.csv | sort -u" 
  • Count the fields for each file in the folder - Prints out how many fields there are in each line for all the files in the folder: "find . -name "*.csv" -exec awk -F ',' -v OFS=',' '{print FILENAME,NF}' {} \; | uniq" 
  • Count the lines for each file in the folder - "find . -name "*.csv" -exec wc -l {} \;" 
  • Check the unique values in a column (column 16 in this case): "cut -d , -f 16 file.csv | sort | uniq"
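For instance, the field-count check on an invented file with one ragged line (using 'sort -u' so that non-adjacent counts are also collapsed):

```shell
cat > check.csv <<'EOF'
a,b,c
1,2,3
4,5
EOF

# Two different numbers in the output means the lines have
# different numbers of fields.
awk -F ',' '{print NF}' check.csv | sort -u > counts.txt

cat counts.txt
```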

Monday, 2 January 2017

Linux command-line basics

In order to process the STATS19 data I taught myself how to use some of the command-line tools in Linux - here are some of the basics needed to get started:


  • Access the command line through 'Terminal'.
  • Navigate to the data folder to run commands.
  • To move to a sub-folder the command is 'cd', for instance to swap to the Desktop from the home folder enter "cd Desktop".
  • "cd .." goes up one level in the folder tree.
  • "pwd" outputs the current folder location. 
  • "ls" lists files and sub-folders in the current folder.
  • "man insert-command-name-here" gives you information about the command, options, inputs etc.
  • Review your command-line history by typing 'history'. You exit 'man' pages and other full-screen views by pressing 'q'.

Directing output from commands

  • ">" replaces any current file contents: "cat file1.csv > output.csv"
  • "tail -n +2 file2.csv >> output.csv" - The '>>' appends content to output.csv. The 'tail' command avoids bringing in the header line again.
  • If no file exists, both '>' and '>>' create one.
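A minimal sketch of combining '>' and '>>' (file names invented):

```shell
printf 'Header\nrow1\n' > a.csv
printf 'Header\nrow2\n' > b.csv

cat a.csv > output.csv           # '>' starts output.csv afresh
tail -n +2 b.csv >> output.csv   # '>>' appends, skipping b's header

cat output.csv
```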

Sorting files with 'sort'

sort -t"," -k1,1 file.csv

Sorting with headers ('head' and 'tail' commands to keep the first line intact):
head -n 1 file.csv && tail -n +2 file.csv | sort -t"," -k1,1

Extracting columns with 'cut'

Specific columns can be extracted from a file. For example, if we wanted to extract columns 2, 4, 5, 6 and 8+ from file.csv:

cut -d , -f 2,4-6,8- file.csv

Here, the '-d ,' tells cut that columns are separated by commas, and -f 2,4-6 tells it to extract column 2 and columns 4-6. The -f argument can take a single column number or a comma-separated list of numbers and ranges. Note that cut always outputs fields in their original file order, so it can't be used to rearrange columns.
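On a one-line invented file, the field list above picks out exactly those columns:

```shell
printf '1,2,3,4,5,6,7,8,9\n' > row.csv

# Columns 2, 4-6 and 8 onwards.
cut -d , -f 2,4-6,8- row.csv
```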


Filtering data with 'awk'

Useful variables and operators:

NR - the current line number
NF - the number of fields (columns) on the line, $NF references the last field
$0 - the entire line
|| - logical OR
&& - logical AND
! - logical NOT
<=, >=, == - less than or equal/more than or equal/equal

AWK examples:
  • Print column 1 where column 4 is between 1 and 10: "awk '$4 <= 10 && $4 >= 1 { print $1 }' file.txt"
  • Rearranging and filtering columns (NR == 1 retains the header): awk -F ',' -v OFS=',' 'NR == 1 ||$8 == 0{print $1,$18,$16,$19,$20}' file.csv > rearranged.csv
  • Combining columns with text, the 'for' does this for every .csv file in the folder: "for file in *.csv; do awk -F ',' -v OFS=',' 'NR!=1{$2 = $2 " of " $10; $3 = $3 " of " $11}1' $file > "$(basename "$file" .csv)_1.csv"; done"
  • Replace values in a particular column using awk to avoid sed potentially picking up the wrong data / columns: "awk -F ',' -v OFS=',' '$17=="empty" {$17="Not known"}1' file.csv"
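The column-targeted replacement in action on an invented file:

```shell
cat > t.csv <<'EOF'
A001,empty
A002,known
EOF

# Only column 2 is tested, so 'empty' appearing in another
# column would be left untouched.
awk -F ',' -v OFS=',' '$2=="empty" {$2="Not known"}1' t.csv > t_fixed.csv

cat t_fixed.csv
```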

More useful websites: