Sprint 2:

Introduction:

The objective was to match enterprise names collected from job portals to the Job Vacancy Statistics sample from the Business Register. The participants were: Nigel Swier, Liz Metcalfe, Adam Pohl and Vidhya Shekar.

Data Sources:

In preparation for the second virtual sprint, the UK collected a sample of data containing the number of job vacancies per company from the following job portals:

- Totaljobs

- Reed

- The Guardian

- Careerjet

- Telegraph

- Universal Jobmatch

For each job portal, we did not collect any information on the individual job advertisements, but rather an aggregated count of the number of job vacancies per enterprise. This is the information that is most readily available and puts the least pressure on the websites. Although it would be possible to collect the complete job advertisement data, this is more complicated and would put far more pressure on each website during collection, and we felt that the aggregated counts would be sufficient for initial experimentation.

Data from the Business Register for the Job Vacancy Statistics sample was also extracted. The Job Vacancy Statistics sample is 6,000 enterprises per month, 1,300 of the larger enterprises of which are consistently within the sample. For simplicity, the largest 1,300 enterprises from the Job Vacancy Statistics sample were used within this sprint.

The approach:

An iterative approach was taken to matching the survey enterprise names to the job portal enterprise names. This included multiple stages of data wrangling; after each stage a matching attempt was made and a subset of the non-matched data was carried forward into the next step. This increased the efficiency of the matching. The following data wrangling steps were used (a sketch of the normalisation and string-similarity steps appears after the list):

Step 1: Match data. An attempt at matching the portal enterprise names and the survey enterprise names was made using a left merge.

Step 2: Lower case. The job portal enterprise names and the survey enterprise names were converted to lower case and matched.

Step 3: Alphabetical order. The words within the job portal enterprise names and the survey enterprise names were sorted into alphabetical order and matched.

Step 4: Unwanted words. Words that reduced matching potential were removed from the datasets one at a time, with a matching attempt between each removal. Removed words: &, and, ‘, plc, ltd, vat, uk, (, ), the, of, for, ., ;, limited, retail, service, group, nhs, trust, your, you’re, cos.

Step 5: White spaces. Multiple white spaces were replaced with single white spaces and the data was matched.

Step 6: Levenshtein. Levenshtein distance matrices were used to compare the survey enterprise names with each of the job portal enterprise names. For each survey enterprise name, the highest-scoring job portal enterprise name above a set threshold was taken as a match (Python module fuzzywuzzy).

Step 7: N-grams. N-gram string similarity was used to search for matches, using the same method as Step 6 (Python module ngram).

Step 8: Jaro. Jaro string similarity was used to search for matches, using the same method as Step 6 (Python module jellyfish).

Step 9: Match Rating Approach. The Match Rating Approach was used to match on pronunciation. This did not produce any sensible results and was therefore removed from the process.

Step 10: Supervised machine learning. Supervised machine learning will be used to match the job portal enterprise names and the survey enterprise names (Python module dedupe). This is still in progress.
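As a rough illustration of Steps 2-6, the sketch below applies the same kind of normalisation (lower-casing, sorting the words in each name alphabetically, stripping unwanted words, collapsing white space) and then takes the best Levenshtein-based match above a threshold using the fuzzywuzzy module. The word list, threshold and helper names are illustrative assumptions rather than the exact code used in the sprint; the same loop can be repeated with ngram.NGram.compare or a Jaro score from jellyfish for Steps 7 and 8.

import re
from fuzzywuzzy import fuzz, process

# Subset of the words removed in Step 4 (illustrative, not the full list).
UNWANTED = {"&", "and", "plc", "ltd", "limited", "uk", "the", "of", "for",
            "retail", "service", "group", "nhs", "trust"}

def normalise(name):
    """Lower-case, strip punctuation and unwanted words, sort the words, collapse spaces."""
    name = name.lower()
    name = re.sub(r"[().;'&]", " ", name)
    words = [w for w in name.split() if w not in UNWANTED]
    return re.sub(r"\s+", " ", " ".join(sorted(words))).strip()

def fuzzy_match(survey_names, portal_names, threshold=90):
    """Best Levenshtein-based match per survey name above the threshold (Step 6)."""
    cleaned_portal = {normalise(p): p for p in portal_names}
    matches = {}
    for s in survey_names:
        best = process.extractOne(normalise(s), list(cleaned_portal),
                                  scorer=fuzz.ratio, score_cutoff=threshold)
        if best is not None:
            matches[s] = cleaned_portal[best[0]]  # map back to the original portal name
    return matches

# Names that only differ by case, word order or stop-words match exactly after wrangling;
# near-misses still clear the score cutoff.
print(fuzzy_match(["Liverpool Victoria PLC"], ["liverpool victoria"]))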

Results: N (%) matched

Step                    Careerjet      Reed           Total jobs     Telegraph     Universal
                        (n=9594)       (n=2877)       (n=4355)       (n=156)       (n=200)
1: Match                3 (0.23%)      0 (0%)         0 (0%)         0 (0%)        0 (0%)
2: Lower case           127 (9.55%)    0 (0%)         34 (2.56%)     0 (0%)        4 (0.30%)
3: Alphabetise          127 (9.55%)    36 (2.56%)     34 (2.56%)     0 (0%)        4 (0.30%)
4: Remove words         262 (19.70%)   119 (8.95%)    92 (6.92%)     5 (0.38%)     15 (1.13%)
5: Remove white spaces  283 (21.28%)   146 (10.98%)   111 (8.35%)    5 (0.38%)     19 (1.43%)
6: Levenshtein          302 (22.71%)   161 (12.18%)   122 (9.17%)    5 (0.38%)     22 (1.65%)
7: N-grams              306 (23.01%)   162 (12.18%)   124 (9.32%)    5 (0.38%)     22 (1.65%)
8: Jaro                 308 (23.16%)   163 (12.26%)   126 (9.47%)    6 (0.45%)     22 (1.65%)

A number of problems were identified in the survey enterprise names which made the automated matching process less successful. These included:

- Enterprises not trading under their legal name; for example, Liverpool Victoria trades under the name LV. This difference cannot be picked up using data wrangling or string comparisons.

- Enterprises in the survey owning, or part-owning, a number of other enterprises. For example, the survey collects the number of job vacancies within United Biscuits. No job vacancies for United Biscuits could be found within the job portals; however, United Biscuits owns McVitie's and Jacob's, both of which advertise job vacancies on job portals. Data wrangling would not discover these matches.

Company attributes:

The Job Vacancy Survey data contains information on the location of each enterprise. This information was not available for the portal enterprises. Location information for the survey enterprises was therefore sought from the Companies House API to increase our ability to match enterprises that do not trade under their legal name.

First, location information was collected by looking up the portal enterprise names in the Companies House API, and matching was then attempted using two attributes: enterprise location and name. This proved difficult because Companies House returned a number of different company details for each enterprise. However, to assess the feasibility of this idea, a subset of the data (enterprise names starting with A or B) and the first twenty results from the Companies House API were used for the analysis; the outcome is shown in the following table (a sketch of the look-up appears after the table):

Number of survey enterprises starting with A or B:                                       162
Number of portal enterprises starting with A or B:                                       1414
Number of Companies House API look-up results for portal data starting with A or B:      18279
Number of postcode matches found between the survey data and the portal-Companies House
results:                                                                                  25
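As a rough illustration of this look-up, the sketch below queries the Companies House company-search endpoint for a portal enterprise name and keeps the name and postcode of the first twenty hits, which can then be compared with the survey data on both attributes. The endpoint URL, authentication scheme and helper name are assumptions based on the current public Companies House API, not the exact code used in the sprint.

import requests

SEARCH_URL = "https://api.company-information.service.gov.uk/search/companies"  # assumed endpoint
API_KEY = "YOUR_COMPANIES_HOUSE_API_KEY"  # placeholder; obtained by registering with Companies House

def lookup_company(name, max_results=20):
    """Return (company name, postcode) pairs for the top search hits for a portal name."""
    resp = requests.get(
        SEARCH_URL,
        params={"q": name, "items_per_page": max_results},
        auth=(API_KEY, ""),  # API key sent as the basic-auth username with a blank password
        timeout=30,
    )
    resp.raise_for_status()
    results = []
    for item in resp.json().get("items", []):
        address = item.get("address") or {}
        results.append((item.get("title", ""), address.get("postal_code", "")))
    return results

# Each portal enterprise name yields up to twenty candidate (name, postcode) pairs,
# which are then compared against the survey enterprise's name and postcode.
print(lookup_company("Liverpool Victoria"))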

Information about enterprise structure was researched on the internet. This was a time-consuming process and uncovered very complex interactions between enterprises.

Conclusion:

We were able to develop an approach that can match survey enterprise names to job portal enterprise names where they were similar but not necessarily identical. The various tools employed to match the survey and portal data were reasonably successful, matching 1522 portal names of which 362 were unique (27.5% of the survey data). Further refinement of this method will increase the number of matches.

However, the problems identified with the differences between trading and legal names, and with enterprise ownership, which in some cases can be very complicated, mean that the matching process cannot currently be entirely automated.

The next step is to use supervised machine learning (Step 10) to match enterprise names. Following this, any remaining matching will take place manually.
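For reference, a minimal sketch of how the planned record linkage might look with the dedupe module is given below, assuming the dedupe 2.x record-linkage API (method names differ between versions); the field definition, sample data and threshold are illustrative assumptions, not the code actually being developed.

import dedupe

# Each dataset is a dict of records keyed by id; the inner dicts hold the fields to compare.
survey = {"s1": {"name": "liverpool victoria"}, "s2": {"name": "united biscuits"}}
portal = {"p1": {"name": "lv"}, "p2": {"name": "mcvities"}}

fields = [{"field": "name", "type": "String"}]  # compare on the cleaned enterprise name only
linker = dedupe.RecordLink(fields)

linker.prepare_training(survey, portal, sample_size=1500)
dedupe.console_label(linker)  # interactively label candidate pairs as match / distinct
linker.train()

# Pairs scoring above the threshold are returned as linked records with a confidence score.
for (survey_id, portal_id), score in linker.join(survey, portal, 0.5):
    print(survey_id, portal_id, round(score, 3))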
