AIT 580 Team 4 Project: Competitive Intelligence Using Data Analytics Project



Team 4 Members:

 Chen, Yu An

 Ghazwani, Aisha

 Gryphon, Natalie E.

 Wright, Wayne A.

Template used: Data Project Plan Outline, Version 4.0, 18 January 2016

Introduction

We are interested in exploring the value of mining public data to gain insight into our industry rivals and other competitors in order to gain a competitive advantage.

Additionally, our company is seeking to unveil the risk factors of our own data leakage to have a better understanding of the steps that we have to take to remediate the exposure of our employees, IPs, clients, technologies, and the work we do.

Our company expects this information to help in strategic planning, in updating our corporate policies on what our employees are allowed to share on public websites, and in identifying the need for any additional data protection policies.

We know that Career websites, like LinkedIn, allow job seekers to post information such as skills, experience, and education. And, they allow employers to post information about job openings and the skills, experience, and education they desire.

We believe that extracting details about competitors’ current openings can help CyberSmart predict our competitors’ plans; what contracts they might be bidding on, what kind of capabilities (knowledge, skills, and abilities) they are missing, but need, in order to deliver on those contracts. Additionally, we believe we can extract information shared by job seekers to glean current and past activities of their employers.

Our success metrics are to improve our competitive intelligence so that we increase our win rate on government contract bids from the current 45% up to 65%. The hypothesis is that with the additional insights we can glean from this “Competitive Intelligence Using Data Analytics Project” we can better identify RFP/contract opportunities where we are in a position to win over our competitors, and to better identify those opportunities where we should team with another competitor, or not bid at all.

Through this project, we will also discover what kind of similar insights our competitors can find out about us. So another area for a success metric is to reduce any competitive disadvantage because of CyberSmart data exposure (e.g. how much intelligence about us is out there, and could be mined in a similar fashion and used to our disadvantage).

Risks include the following:

1) Getting the right resources to work on this project. CyberSmart talent is in high demand and most of our work is time and materials (T&M) based contracts. So, time spent working on internal projects does not generate revenue, and may, in fact, represent lost revenue.

2) Having good measures and statistical methods to show the cause and effect relationship of improved competitive intelligence and a higher contract win percentage. There are many subjective factors that go into a bid/team/no-bid decision that can affect the outcome and many of those other subjective factors could dilute or work counter to the cause and effect we are trying to engineer with this project.

3) Tipping our hand, so to speak. We work in an industry of dynamic coopetition, where one day a company is a partner collaborating with us on a joint win. And, the next day they are a competitor willing to do anything to beat us. Our partners and competitors could develop very negative perceptions about this data mining project we are doing. So, there are risks related to the business partnerships that are required to compete in this marketplace.

4) Violating Federal Acquisition Regulation (FAR) procurement rules without realizing it.

Team Planning

To have a successful project, the goal is to focus on the skills each project member brings and how they will contribute. Commonly, teams on projects like this are made up of professionals and experts such as engineers, analysts, managers, developers, and QA specialists, all of whom have different backgrounds, skills, and strengths to contribute. In this project, we will divide the work according to each member's background to take advantage of the strengths and essential skills the team possesses. This reduces wasted time and enhances the quality of the work, supporting successful delivery of the project.

Computer science and statistics are the two main backgrounds on the team, and both will be helpful for the important parts of the project, since the majority of the skills the project requires draw on those two fields. The tables below show the project roles and stages with the initials of the responsible team members.

Roles and Team Members:

 Project Managers / Project Defenders: Wright, Wayne A.
 Users / Analysts: Ghazwani, Aisha; Wright, Wayne A.
 Architects / Designers: Chen, Yu An
 Developers: Gryphon, Natalie E.

Project stages and responsible team members (skill roles include SME/User, Analyst, Statistician, DBA, GUI Designer, Programmer, Ethicist, Architect, Developer, and User Manager):

 Define the Problem: WW
 Experimental Design: WW
 Data Collection, Storage and Retrieval: YC
 Data Curation and Quality Assurance: YC
 Solution Design: YC
 Analysis: AG
 Visualization and Reporting: AG
 Evaluation Methodology: NG
 Security: NG
 Privacy / Ethics: NG
 Effort Initiation (Project Startup): WW
 Project Implementation: WW

Problem Definition

There are two problems we are trying to solve. They are related and might even be considered the obverse of each other. The first objective (a.k.a. problem) is to improve our competitive position by using data analytics to better understand what our competitors are doing, and to predict our competitors’ plans: what contracts they might be bidding on, and what kind of capabilities (knowledge, skills, and abilities) they are missing, but need, in order to deliver on those contracts.

The second objective is to explore and understand what kind of data and information our competitors could discover about us, our people, our IP, who our clients are, and the contracts or types of work we are involved in and doing for our clients.

We want to find out if there is value in focusing our expertise in mining public data to learn more about our industry rivals and competitors and gain a competitive advantage.

Project Goals: What do we plan to achieve in this project?

The hypothesis is that we can increase our win rate on government contract bids from the current 45% to 65%. We believe that with the additional insights we can glean from this “Competitive Intelligence Using Data Analytics Project” we can better identify contract opportunities where we are in a position to win over our competitors, and better identify those opportunities where we should team with another competitor/partner, or not bid at all.

We also want to determine what kind of data can be mined about us that could be used to put us at a competitive disadvantage.

Success Criteria: How do we measure success? What metrics?

Success criteria include:

1. Increase the CyberSmart win rate on public sector contracts that we pursue, from the current 45% win rate up to a 65% win rate.

2. Reduce the B&P dollars spent on unsuccessful bids, from the current $8.3M to less than $7M.

3. Broader awareness of the publicly available data on CyberSmart that could be used by our competitors to give them a competitive advantage over us. As soon as we get the data collection and data curation processes started and begin to evaluate the quality of data we have and can derive, we will add another activity to turn the analysis on our own company. That is, to pull data from the same sources, but to analyze, derive, or infer what our competitors could know or learn about us and the pursuits we are chasing, using the same techniques. We will use what we learn about our competitors as a benchmark for comparing the quantity and quality of what our competitors could know about us.

External Constraints: Policy? Technical? Structural? Legal? Cultural? Resource?

1. Policy – Policies on data collection, reporting, and standards for exchanging data, which are considered among the biggest constraints on a big data project, will apply to us since we will use social media sites and other job seeker websites. An obvious example is how anonymized data files can be used and merged or combined.

2. Data Use Agreements (DUAs) and Terms and Conditions (T’s and C’s) – Most social media web sites have DUAs and terms and conditions for the use of the data they provide. Social media site DUAs vary.

a. Some web sites, which provide access to their data, restrict the use of any Personal Identifying Information (PII) or HIPAA protected (PII and health related) data.

b. Some web sites provide PII data, and only restrict how it can be used. For example, the LinkedIn API T’s and C’s say “You may use the APIs if your Application is designed to help LinkedIn members be more productive and successful. You may not use the APIs if your Application: exceeds a reasonable amount of API calls; relies fundamentally on the APIs; stores more than a LinkedIn member’s profile data; or is used for hiring, marketing or selling.”

c. Some web sites provide access to any data the user makes “public.” For example, any data that an individual makes public can show up when someone does a search on Facebook or on a public search engine, and is accessible to anyone who uses Facebook APIs (e.g., the Facebook Graph API).

3. Technical – Are there APIs to access the data sources we plan to use? Are they well documented? Are there technical constraints built into their API mechanisms, for example, limits on the number of API calls per minute, hour, or day?

4. Legal – What do the Federal Acquisition Regulations (FAR) say or imply about having information about competitors bidding on government contracts? To quote Shlomo D. Katz, Counsel, Brown Rudnick LLP, “As with most other endeavors in life, there’s a right way and a wrong way to gather competitive intelligence. Indeed, some of the wrong ways can be career-ending.”

a. There are some obvious and clear violations of the FAR related to gathering and using competitive intelligence. For example, hiring (poaching) a competitor’s employee and making them key personnel on the re-compete of the contract they were working might be underhanded, but is probably legal (unless the employee had signed a non-compete agreement).

b. However, if that employee shows up for work on the first day with a trunk full of proprietary manuals describing how his former employer did the work, that would not be OK. The FAR has very specific language about unauthorized disclosure of contractor bid or proposal information. And, a bidder having that kind of information and not declaring it and mitigating against the further dissemination and use of that information can be disqualified from a competition based on the appearance of impropriety, even if no actual impropriety can be shown, so long as the determination of an unfair competitive advantage is based on hard facts and not mere innuendo or suspicion.

5. Cultural – This has the potential to be viewed in a very negative way. Individuals will think we are gathering and collating information about them that might have a negative impact on their job, career, and livelihood. And, in the current industry model of coopetition, we team with some of the companies that we will be collecting data about. When those companies find out what we are doing, they may view this as a breach of trust. They might also wish they had done it first, but they might also determine never to team or partner with us again.

6. Resources – Costs to access social media sources.

Internal Constraints

1. Budget – CyberSmart talent is in high demand and most of our work is time and materials (T&M) based contracts. So, time spent working on internal projects does not generate revenue and may be lost revenue; there are also developer costs and infrastructure costs.

2. CyberSmart legal will need to be involved throughout the project to protect the company from any potential lawsuits from competitors, employees, former employees, and FAR contracting oversight bodies.

3. Human Resources will need to be involved as we discover what data about CyberSmart is available through public sources. We expect that the same mining techniques and correlation algorithms we use to mine data about our competitors, if turned on us, could unlock and expose information about us that we don’t want exposed. And, some of the data being exposed or correlated comes from employees and former employees. The implication is the need to change the current policies about what employees and former employees are allowed to place in public forums.

4. There will be privacy protection concerns about the data we collect and correlate. We need appropriate data management oversight and control over the project, tools, data collected, data use, data disclosure, etc.

Data Set Selection

Potential data sources include LinkedIn, Twylah, Opprtunity, PartnerUp, VisualCV, Meetup, Zerply, AngelList, Biznik, Entrepreneur Connect, Ecademy, Twitter, Facebook, Ryze, JAZEzone, and BranchOut.

For data on job openings we will look at several of the public sites like SimplyHired, Monster, and JobFinder, and we will look on our competitors’ company websites.

For data on government contracts and competitors, we will start with Deltek’s GovWin tool, to search and extract awarded contracts or upcoming Pre-RFPs by customer, search on RFP/contract numbers or search for programs across clients for particular companies. We will also look at FBO, GovTribe, and USASpending.gov (see table below for details on these data sources). We will scan those sites for all federal government contracts over $25,000.

Data will be both structured and unstructured. We expect the data types to be text and numeric values. We will need to parse text data for key words that indicate skills, technologies, employers, agencies, programs/contracts, etc.

We will need to develop some natural language processing (NLP) capability that can identify key words and then point to (make the association to) the object we are looking for. For example, we might look for the verb “worked for” (the keyword) and know that the employer name (the object) would follow.

We also need a strength of association scoring method/model. In the parsing process, the NLP algorithm may find some less direct associations that we want to capture, but not attribute a strong 100% association to a keyword.
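As an illustration of the kind of keyword-to-object association and strength scoring we have in mind, here is a minimal Python sketch. The patterns, object types, and weights are hypothetical placeholders; the real keyword database would be built and refined as described above.

```python
import re

# Hypothetical keyword patterns and association strengths; the real keyword
# database would be built manually and then refined automatically.
KEYWORD_PATTERNS = {
    r"worked for ((?:[A-Z][\w&\.]*\s?)+)": ("employer", 1.0),      # direct statement
    r"contractor to ((?:[A-Z][\w&\.]*\s?)+)": ("employer", 0.7),   # weaker association
}

def extract_associations(text):
    """Return (object_type, value, strength) tuples found in free text."""
    results = []
    for pattern, (obj_type, strength) in KEYWORD_PATTERNS.items():
        for match in re.finditer(pattern, text):
            results.append((obj_type, match.group(1).strip(), strength))
    return results

profile_text = "Jane Doe worked for Acme Federal Systems as a cyber analyst."
print(extract_associations(profile_text))
# [('employer', 'Acme Federal Systems', 1.0)]
```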

There are API’s for accessing data from LinkedIn. There are four different LinkedIn Partnership programs that provide access to different types of data that LinkedIn makes available: Talent Solutions, Marketing Solutions, Sales Solutions, and Consumer Solutions. We believe the first three are of primary interest for our project. Here are LinkedIn’s description of those Partnership Programs:

 Talent Solutions – Leverage the world’s largest professional network to extend recruiting and talent solutions to your customers.

 Marketing Solutions – Strengthen your customer relationships, campaign performance, and company presence using the powerful technology and expertise of our partners

 Sales Solutions – Social Selling powered by LinkedIn member network drives sales results.

 Consumer Solutions – Create personalized and engaging user experiences for your audience by combining LinkedIn data and functionality into customer websites and mobile applications.

As an example of some of the structured data we expect to pull from LinkedIn, here are a few of the look-up tables and reference information they provide:

 Member Profile field descriptions

 Company Profile field descriptions

 Geography Codes

 Language Codes

 Currency Codes

 Industry Codes

 Company Size Codes

 Seniority Codes

 Job Function Codes

We expect data formats to be API specific. That is, we expect each social media site will have its own set of API’s that will each have a distinct format of the data fields and any relationships between those fields. We also expect that most API’s will return data in an XML and/or JSON structure. We expect the data types to be text and numeric values.

Acquisition Cost? What is involved in acquisition? Not just dollar cost, but licensing and use restrictions?

At this point, we do not know what all of the costs and restrictions will be. We expect there will be some restrictions on how we can use the data, the number of API calls per day (or per period), and whether we can store the data.

There are restrictions imposed in the LinkedIn T’s and C’s that specifically prohibit what this project is attempting to do. For example, Excluded Use #4: “Commingle or supplement Content from the APIs with any other LinkedIn data. For example, you cannot supplement the data you have received via a LinkedIn API with data scraped from our Website (whether that scraping was done by you or someone else) or from any other source.”

Collection Cost? What level of effort is involved in physically collecting the data and moving it into a storage environment?

There will be costs for data collection, data processing (XML parsing), and data storage. More detail on each category is provided in the Solution Design section below.

We expect data formats to be API specific. That is, we expect each social media site will have its own set of API’s that will each have a distinct format of the data fields and any relationships between those fields. We also expect that most API’s will return data in an XML and/or JSON structure. We expect the data types to be text and numeric values.

Complexity? How complex is the structure of the data? How much pre-processing may be required?

We expect that the data is well documented. So, while it may be complex, there is documentation to ease the understanding of the complex data relationships.

However, there is another level of complexity in creating the set of keywords for the NLP, and the association algorithm to find/select the right object (as the example above described; the verb “worked for” (the keyword) and knowing that the employer name (the object) would follow). That will take some work, and may require some automated mechanism to create and continually refine the keyword database.

There will be some preprocessing required; for example, XML parsing (see footnote 1 below on common parser options). And, in the early phases of testing the data collection, there may need to be some manual/visual inspection of results and time for investigating why the collection does not generate the expected results.
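To illustrate this preprocessing step, here is a minimal Python sketch using the standard library’s DOM-style parser (xml.etree.ElementTree). The XML fragment and field names are hypothetical; the real structure will follow each API’s documented schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML fragment standing in for an API response; real field
# names will depend on the specific API's documented schema.
sample = """
<jobPostings>
  <posting id="123">
    <company>Acme Federal Systems</company>
    <title>Cybersecurity Analyst</title>
    <skill>ISO 27001</skill>
    <skill>Incident Response</skill>
  </posting>
</jobPostings>
"""

root = ET.fromstring(sample)
for posting in root.findall("posting"):
    # Flatten each posting into a simple record for downstream storage
    record = {
        "id": posting.get("id"),
        "company": posting.findtext("company"),
        "title": posting.findtext("title"),
        "skills": [s.text for s in posting.findall("skill")],
    }
    print(record)
```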

Documentation? Is the data well documented?

Yes, it appears that social media sites provide good documentation along with the APIs, although more research needs to be done.

Pedigree? Does the data come from a reputable source?

At this point, we think the data will come from reputable sources. For now we are only considering social media sites like LinkedIn. And, individuals who put their information out in this very public forum would be wary of making inaccurate or false statements, because they would be quickly seen and questioned (by their colleagues, or connections as LinkedIn calls them). And, companies using LinkedIn for marketing, sales, or recruiting would only publish accurate information. So, we put a reasonable level of trust in the data being true.

Prior Use? Does the data have a history of successful prior use?

Absolutely. For example, LinkedIn has already found ways to make this data available for their Partnership Programs: Talent Solutions, Marketing Solutions, Sales Solutions, and Consumer Solutions. So, it has a history of successful prior use. Not exactly the same way we plan to use it, but there are some similarities in the way those partnership programs use the data in LinkedIn.

Of the 15+ potential data sources we identified, we selected 4 sets that are most relevant to our problem, and we describe them here:

1 For example, there are various types of parsers commonly used to parse XML documents: DOM (document) Parser; SAX (event trigger based) Parser; JDOM Parser; StAX Parser; XPath (XSLT expression based) Parser; and there are JAXB and XSLT APIs available to handle XML parsing in an object-oriented way.

Data Source: LinkedIn

Competitive Intel – Data Type: Job Postings, Text, JSON. Acquisition Cost: Cost per record estimated $0.50. Collection Cost: Cost of retrieving from API. Processing Complexity: Mid-level; JSON is structured, but content may not be. Documentation: Excellent (LinkedIn). Pedigree: Excellent. History of Prior Use: High prior use.

Which Things to Bid On – Data Type: Employee Profiles, JSON. Acquisition Cost: Free. Collection Cost: Cost of retrieving from API. Processing Complexity: Mid-level; JSON is structured, but content may not be. Documentation: Excellent (LinkedIn). Pedigree: Excellent. History of Prior Use: High prior use.

Data Source: Deltek’s GovWin tool, to search and extract awarded contracts or upcoming Pre-RFPs by customer, search on RFP/contract numbers, or search for programs across clients for particular companies.

Competitive Intel – Data Type: Search and extract awarded contracts or upcoming Pre-RFPs, RFPs, and existing agency-contractor relationships (incumbents). Acquisition Cost: $10,000 annual license fee, plus a few custom reports we may ask Deltek to prepare for us annually. Collection Cost: Developer part-time to write the interface for collection and document it, then periodic updates as needed to keep it working correctly. Processing Complexity: Yes, the data is in document format and will need to be parsed, restructured, and metadata tagged. Documentation: Yes, there is good documentation on GovWin available. Pedigree: Solid; there are many government contractors using it today, and it is an important source of Deltek’s revenue (so fidelity and accuracy are important to their continued business). History of Prior Use: Yes.

Which Things to Bid On – The absolutely great news is the data we get from GovWin will help on both aspects of the objectives for this project: figuring out which RFPs to bid on and gathering intel about other competi-mates (competitors and team mates/partners). We did not duplicate the data type, Acquisition Costs, etc. here in this column, but the information in the “Competitive Intel” cell just to the left is applicable here too.

Data Source: FedBizOpps (FBO) is the Federal government's web site (fedbizopps.gov) that posts all Federal procurement opportunities with a value over $25,000. It can be thought of as the government’s proposal central, or better yet, RFP central.

Competitive Intel – Data Type: Multiple: screen scraping, JSON, XML, document types, and URLs. Acquisition Cost: N/A. Collection Cost: Writing the interface to search, scrape, and download data will be complex and may require ongoing developer maintenance. Processing Complexity: Yes; the combination of screen capture, URLs, document downloads, and understanding and capturing all of the critical context-sensitive metadata will be important, complex, and may drift over time. Documentation: The user interface is well documented, and there are training classes on how to use the interface, what kinds of data and documents can be posted, how companies can place their own information online, etc. Pedigree: Solid; the government uses this site as its central RFP publishing and dissemination repository, and the data/information is used by many in industry, so there is a kind of crowd-sourcing action going on to make sure the best information is out there on FedBizOpps. History of Prior Use: Yes.

Which Things to Bid On – Again, the good news is the data we get from FedBizOpps will help on both aspects of our project objectives: figuring out which RFPs to bid on and gathering intel about other competi-mates (competitors and team mates/partners). We did not duplicate the data type, Acquisition Costs, etc. here in this column, but the information in the “Competitive Intel” cell just to the left is applicable here too.

Data Source: USASpending.gov contains all prime recipient contract transactions over $3,000; all grant, loan, and other financial assistance transactions over $25,000; and all first-tier sub-recipient contract, grant, and loan transactions over $25,000.

Competitive Intel – Data Type: Prime Contract Awards, Sub-Awards, Type of Spending (Contracts, Grants, Loans, Other Financial Assistance) in CSV, XML. Acquisition Cost: N/A? Collection Cost: Writing the interface to search, scrape, and download data will be complex, but once written it should be relatively stable. Processing Complexity: Some, but not as complex as FedBizOpps; varying data formats like CSV, XML, and screen scraping. Documentation: Yes, there is a USAspending.gov Data Dictionary, a Federal Government Procurement Data Quality Summary, and General Questions and Answers and Reporting Requirements available, in addition to documentation on how to access or download the data in several formats. Pedigree: It is well known that there are data quality issues in the data; the Government openly states that transactions showing a $0 amount may occur, and there are several caveats about the timing and currency of the data. History of Prior Use: Yes, some. There are some neat graphs and charts available, but Congress and other oversight bodies do not rely solely on this data.

Which Things to Bid On – This data will also help on both aspects of our project objectives: figuring out which RFPs to bid on (based on OMB Exhibit 53 and 300 reports and OMB Circular A-130 data) and gathering intel about other competi-mates. We did not duplicate the data type, Acquisition Costs, etc. here in this column, but the information in the “Competitive Intel” cell just to the left is applicable here too.

Experimental Design

What do we know about the system? Stable, unstable? Scalable, non-scalable?

The Competitive Intelligence Using Data Analytics system is generally unstable, but scalable. It is unstable because business changes; business operating models change, business functions and processes change, government contracts change, requirements in those contracts will change as technologies and business models change. So, by definition, the system is unstable.

There will be periods of stability and periods of change depending on the nature of the cybersecurity business. So, collecting, parsing, NLP text processing, indexing, storing and analyzing the data will need to be a continuous effort. We will need to find an automated way (using data analytics techniques) to update and refine the keywords, and the constantly changing set of contracts and competitors that we are looking at.

The system should be scalable. That is, we don’t see any obvious “physical” constraints like the height of a person. We do have some concerns about the logical scalability. For example, there will be a need to create a finite set of contracts and competitors in order to have a chance of staying within some of the logical constraints (terms and conditions, for example, limits on the number of API calls to LinkedIn). More contracts and more competitors mean more complexity, more processing, more relationships to graph, more data to store, more compute capacity required, and so on.

Appropriate Endpoints for Evaluation? How do I measure my success criteria?

The overarching goal of the system is to improve our competitive intelligence so that we increase our win rate on government contract bids from the current 45% up to 65%. The hypothesis is that with the additional insights we can glean from this “Competitive Intelligence Using Data Analytics Project” we can better identify RFP/Contract opportunities where we are in a position to win over our competitors, and to better identify those opportunities where we should team with another competitor, or not bid at all.

We have an endpoint goal of improving our win rate, reducing the B&P dollars spent on losing proposal bids, and developing a better understanding what kind of data our competitors might be collecting about us and how they might then use it to put us at a competitive disadvantage in the government proposal bidding process.

We will measure our success by comparing our proposal win rate for 2015 to our proposal win rate in 2016. Every bid decision in 2016 (bid, team, or no-bid) will include documentation about if and how insights from our Competitive Intelligence Data Analytics Project were used to make the bid decision. We will highlight where the actual bid decision was different than the decision suggested by the Competitive Intelligence Data Analytics system.

We will also measure B&P dollars spent on a failed bid/pursuit.

Tests for Significance / Measurements for Evaluation? How do I validate my results? Statistical test? Independent measurement or test?

Since our data will be a mix of different data types (numerical, geospatial, and text), we will use classical statistics and predictive analysis to check and verify our goals. Table (1) below provides examples of the types of statistical tests and analysis that might be applied.

Statistical Test – Test or Analysis Name
 Classical Statistical Analysis – Descriptive Statistics and Distribution Curves
 Predictive Algorithm – Clustering (Nonlinear Algorithms)
 Significance/Quality Test – A chi-square Test
 Hypothesis Testing – T-Test / F-Test

Table (1): Statistical Test Types

These are some example statistical tests and data mining techniques that we believe will be appropriate.

We will do hypothesis testing for a proportion to assess the primary goal, which is increasing the win rate from 45% to 65%. In addition to the proportion test, a prediction model can be developed to predict the win outcome using statistical techniques such as multiple regression or logistic regression analysis.
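As a sketch of what such a prediction model could look like, here is a hedged example using scikit-learn’s logistic regression. The feature names and values are purely illustrative placeholders, not real pursuit data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features for past pursuits: [incumbent (0/1), number of
# competitors, estimated capability-gap score from the intel analysis].
X = np.array([
    [1, 3, 0.2],
    [0, 6, 0.7],
    [1, 2, 0.1],
    [0, 5, 0.9],
    [1, 4, 0.3],
    [0, 7, 0.8],
])
y = np.array([1, 0, 1, 0, 1, 0])  # 1 = win, 0 = loss

model = LogisticRegression().fit(X, y)

# Predicted win probability for a new opportunity (illustrative values only)
new_opportunity = np.array([[0, 4, 0.4]])
print(model.predict_proba(new_opportunity)[0, 1])
```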

We will do a categorization of win proportions and use a chi-square test to determine if the result is statistically significant.

And, we will use a z-test to compare the proportion of wins in 2016 vs 2015.
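A minimal sketch of that comparison, assuming hypothetical bid counts, could use a two-proportion z-test such as the one provided by statsmodels:

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical bid counts; the real numbers come from the Salesforce records.
wins_2015, bids_2015 = 27, 60    # ~45% win rate
wins_2016, bids_2016 = 39, 60    # ~65% win rate

# Test whether the 2016 win proportion is significantly larger than 2015's.
stat, p_value = proportions_ztest(
    count=[wins_2016, wins_2015],
    nobs=[bids_2016, bids_2015],
    alternative="larger",
)
print(f"z = {stat:.2f}, p = {p_value:.4f}")
```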

A different approach might have us do an exploratory analysis of large-scale keywords from social media sites to see if there is any pattern or interesting information in the publicly available data about our competitors or our company. The best approach, in this case, is clustering and network analysis to identify the characteristics of competitors.

We need a management process around how we are going to measure the desired outcomes.

Here is how our management process will work to collect, manage, and measure the results for this project. In descriptive language, we need the following data:

1. A count of all of the RFP’s in our marketplace that we could bid on (or be a teammate on) plus other relevant data elements.

2. A count of all RFP’s in our marketplace that we decide at the outset to no-bid, along with a reason code for the no-bid. By “at the outset,” I mean before we spend any B&P dollars to put the opportunity into our Sales-Opportunity life cycle and pipeline. (Our Sales-Opportunity life cycle has five stages: 1) Understand Customer; 2) Validate Opportunity; 3) Qualify the Opportunity; 4a) Develop Solution; 4b) Propose Solution; 5) Negotiate & Close.)

3. All RFP’s that we decide to bid on will be documented and tracked in Salesforce.com. All opportunities will follow our approved sales-opportunity life cycle stage process, deal health checklist, time recording, and cost accounting policies. Full accounting of B&P dollars spent will be captured and associated to each Sales-Opportunity.

4. All Sales-Opportunities that we decide to no-bid before final submission will record the reason code (new additional reason codes will be added to our current list to give attribution to the use of data/inputs from this competitive analysis project in the no-bid decision). And the total saved B&P dollars will be calculated and attributed to the ROI of this competitive analysis project. That is, B&P dollars not spent because this competitive analysis project projected a loss or otherwise influenced a management decision to no-bid will be tracked in Salesforce. The value will be calculated as the total estimated B&P spend for the full Sales-Opportunity life cycle minus the total spent up to the point of the no-bid decision, which equals the ROI from this Big Data competitive analysis project.

5. All Sales-Opportunities where we prime and win or team and win, and where this competitive analysis project influenced the bid or teaming decision, will be assigned a reason code that attributes the win to this Big Data competitive analysis project.

We need a mechanism to create and maintain that finite list of contracts and competitors.

For Government contracts: we will start with Deltek’s GovWin tool, to search and extract awarded contracts or upcoming Pre-RFPs by customer, search on RFP/contract numbers or search for programs across clients for particular companies. We will also look at FBO, GovTribe, and USASpending.gov.

We will scan those sites for all federal government contracts over $25,000. Using a configurable script, we will scan every two days for new postings (posted in the past two days). The script will search for contracts by classification codes and NAICS codes as follows:

Classification Code – Classification Description
 A – Research and Development
 D – Information Technology Services
 70 – General Purpose Information Technology Equipment
 R – Professional, Administrative and Management Support Services
 U – Education and Training Services

NAICS Code – NAICS Code Description
 334*** – Computer and Electronic Product Manufacturing
 518*** – Data Processing, Hosting and Related Services
 5415** – Computer System Services
 5416** – Administration Management and General Management Consulting Services
 611*** – Educational Services
 927*** – Space Research and Technology
 928*** – National Security and International Affairs
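A minimal sketch of the configurable scan filter, assuming a hypothetical posting record structure produced by the collection scripts, might look like this:

```python
from datetime import datetime, timedelta

# Classification codes and NAICS prefixes from the table above.
CLASSIFICATION_CODES = {"A", "D", "70", "R", "U"}
NAICS_PREFIXES = ("334", "518", "5415", "5416", "611", "927", "928")

def is_relevant(posting, now=None):
    """Keep postings from the last two days that match our code filters."""
    now = now or datetime.utcnow()
    recent = now - posting["posted"] <= timedelta(days=2)
    code_match = posting["classification"] in CLASSIFICATION_CODES
    naics_match = posting["naics"].startswith(NAICS_PREFIXES)
    return recent and (code_match or naics_match)

# Hypothetical posting record as it might look after parsing a feed.
example = {
    "posted": datetime.utcnow() - timedelta(days=1),
    "classification": "D",
    "naics": "541512",
}
print(is_relevant(example))  # True
```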

To get a list of competitors, we will also use Deltek’s GovWin tool. And, as appropriate, we will scan a competitor’s website and utilize their website search functions to find information on their customers or their capabilities. If/when we need additional inputs, we will use Technology Business Research Inc. (TBR) Professional Services Business publications.

We will curate and store all the contract information in our SQL database.
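As a rough illustration of that storage step, here is a minimal sketch using SQLite as a stand-in for the production SQL database; the table columns and the sample contract record are hypothetical.

```python
import sqlite3

# A minimal local schema sketch; the production SQL database and column
# set would be designed during Solution Design.
conn = sqlite3.connect("contracts.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS contracts (
        contract_id   TEXT PRIMARY KEY,
        agency        TEXT,
        title         TEXT,
        naics_code    TEXT,
        value_usd     REAL,
        source        TEXT,
        retrieved_at  TEXT
    )
""")
# Hypothetical contract record for illustration only
conn.execute(
    "INSERT OR REPLACE INTO contracts VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("EXAMPLE-16-R-0001", "Example Agency", "Cybersecurity support services",
     "541512", 2500000.0, "FedBizOpps", "2016-04-01"),
)
conn.commit()
conn.close()
```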

This data is applicable to, and used by, this big data project in the following ways: 1) It bounds the list of contracts we would bid on and the contracts that our competitors would bid on. And, provides a current list and historical database of every RFP released by the government that might be relevant for our company to prime or team with a partner on.

2) Any contractor who bids on these will be considered a competitor or potential teaming partner. We will use the list of companies as the names of companies we search for in LinkedIn and other sites we use to gather competitive intelligence from.

3) We will apply NLP and categorization techniques to identify changing characteristics and parameters of RFP’s over time. For example, changes from buying products or people (staff augmentation), to buying services. And, changes in technologies, types of services, contract types, desired or required certifications (i.e. ITIL, ISO 27001, ISO 20000, ISO 9000, CMMI, etc.).

4) Changes in the categories will change the keywords that we would need to look for in the social media sites we pull data from on individuals and companies. This will give us an automated discovery and reporting capability that can help detect and understand drift in any of the keywords.

This system is in a very complex and dynamic environment that is constantly changing.

Lifecycle Management? How do I reincorporate what I learn into my program?

We need a mechanism for getting feedback on the changing business environment and the changing IT environment. Business operating models change, business functions and processes change, organizations and roles change. The IT environment changes include the IT systems and data assets used, and the knowledge, skills and abilities required, and the types of contractor services involved, and how they change in response to the changes to the business environment. In all, a very complex and dynamic environment; constantly changing.

Initial thoughts might be to set up governance processes so the owners/designers/engineers of this system would be notified whenever there is a significant shift noted in any of the topical areas we are interested in and using in our system. But, just knowing that a new skill set has started to appear in RFP’s and in applicant CV’s would not necessarily aid in understanding the business change that triggered the role change. We need to understand what motivated the change, so we can interpret what that portends for our business, our competitive situation, and our clients changing needs. But that sets up a manual intensive process that could not keep up with a dynamic business and IT environment change (unstable and not scalable).

We will need to find an automated way (using data analytics techniques) to update and refine those profiles of Government requirements, RFP’s, Talent Management, and competitors.

Solution Design

In this section, we introduce the solution design and describe the technical architecture of the solution. We separate the solution into three distinct parts as follows:

First, we will interview our customer, CyberSmart, to confirm our goals and scope. For example, what data do they really care about? How much budget do we have?

Second, we need to collect data from websites related to competitor company information, so we may collect data from career websites, Facebook, and Twitter. We will also process the data, working with our customer, CyberSmart, to decide the data curation rules that determine which data needs to be stored in our database. Also, we need to evaluate data quality.

Third, we will analyze the data and give our customer our results to help them make decisions.

Design Overview

The diagram below is a high-level view of our solution design. There are four phases. Starting on the left-hand side and working to the right: the Data Input and Collection phase, the Data Processing phase, the Analytic Execution phase, and the Data Visualization (reporting) phase. Within each phase are activities as follows:

1. Our data sources are websites accessed via APIs. The API output format will be XML or JSON, so we need to translate it into RDF. The reason we use RDF is that it describes metadata more clearly than XML or JSON; RDF can explicitly describe the attributes in the metadata.

2. We will store structured data, such as graphics, text, and numeric values, in the database, so we need a SQL database, a graph database, key-value stores, Hadoop, and HBase.

3. We will use Apache Jena Elephas on top of Hadoop as the data curation tool. Apache Jena Elephas provides binary serialization and deserialization for RDF.

4. We will use the R language for analytic execution because it is free to license and easy to use.

5. We will use Tableau because it can reveal, through graphics, esoteric information that might otherwise remain hidden in the data. And, the user interface is friendly.

Data Collection, Storage, and Retrieval

In this section, we discuss our technical approach for collecting, storing, and making the data available.

The following plan describes a method of collecting data from career websites such as LinkedIn.

The data collection plan begins by building a list of keywords based on skills from job seekers’ resumes and job openings. We can use OpenNLP, which offers tokenization, sentence detection, POS tagging, Named Entity Recognition (NER), chunking, parsing, and co-reference resolution. These tools help us find the keywords we need. The initial list will be created manually through research.

Next, a script will scan career websites, compiling a list of job postings and job seekers matching the keywords.

The data will be stored as (URL, Keyword, Source) triples using an RDF tool on top of the Hadoop ecosystem. The data will be queried using SPARQL. After the data is extracted from RDF, it will be analyzed using statistical tools to summarize information by location, skills, current projects, technologies, job openings, and the current software and systems used by companies.
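A minimal sketch of storing and querying those (URL, Keyword, Source) triples, using the Python rdflib library as a stand-in for the Hadoop-based RDF tooling and a hypothetical namespace and posting URI, might look like this:

```python
from rdflib import Graph, Literal, Namespace, URIRef

CI = Namespace("http://example.org/ci/")  # hypothetical project namespace
g = Graph()

# One (URL, Keyword, Source) observation stored as RDF triples.
posting = URIRef("http://example.org/posting/123")
g.add((posting, CI.keyword, Literal("ISO 27001")))
g.add((posting, CI.source, Literal("linkedin.com")))
g.add((posting, CI.url, Literal("https://www.linkedin.com/jobs/view/123")))

# Retrieve every posting that mentions a given skill keyword.
results = g.query("""
    PREFIX ci: <http://example.org/ci/>
    SELECT ?posting ?url WHERE {
        ?posting ci:keyword "ISO 27001" ;
                 ci:url ?url .
    }
""")
for row in results:
    print(row.posting, row.url)

# Serialize the graph in Turtle form for inspection or exchange.
print(g.serialize(format="turtle"))
```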

We expect data formats to be API specific. That is, we expect each social media site will have its own set of API’s that will each have a distinct format of the data fields and any relationships between those fields. We also expect that most API’s will return data in an XML and/or JSON structure. We expect the data types to be text and numeric values.

Our technical requirements for data collection include the following:

1. We need to write scripts to connect to the LinkedIn, GlassDoor, GovWin, and FedBizOpps APIs

2. A virtual machine (VM)

3. Running Java and a JavaEE Application Server (like Apache Tomcat, Glassfish, JBoss Enterprise Application Platform, etc.)

4. Internet network connectivity (with sufficient bandwidth to handle the transaction volume we expect at the desired latency/response times)

5. XML parser

6. JSON parser

7. A java developer to create the application we need to call the social media API’s, note any error codes, parse the returned data, and store it in the correct data store.

8. We will also use Python to collect text data. There are three types of data we need to collect (a minimal record sketch follows this list):

a. URL pointers: some profiles will contain pointers to other web locations (URL).

b. Key-value pairs for keywords such as “Name” and “Experience”, and employer/company information.

c. Data Source: where the data was collected from (like https://www.linkedin.com/)
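Here is the minimal collection-record sketch referenced above; the field names and sample values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class CollectedRecord:
    """One unit of collected text data, covering the three types above."""
    source: str                                          # where the data was collected from
    url_pointers: List[str] = field(default_factory=list)  # pointers to other web locations
    key_values: Dict[str, str] = field(default_factory=dict)  # keyword/value pairs

record = CollectedRecord(
    source="https://www.linkedin.com/",
    url_pointers=["https://example.org/portfolio"],   # hypothetical pointer
    key_values={"Name": "Jane Doe", "Experience": "Acme Federal Systems"},
)
print(record)
```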

We will translate all the text into RDF form; because the information can have different formats, it has to be stored as multiple flows in RDF. In our project, graph analysis data is stored in a graph data structure as nodes, by mapping entities in a computer network; for example, an IP address connected via a TCP session and its relations to other network components. The information can be represented through a tree graph analysis of IP addresses and other components by geographic location. For example, the information collected will include network flow activity, domains, emails, HTTP requests and responses, autonomous system information, ports, and static and scalable data. Information we obtain from sites such as LinkedIn and Indeed will be used to gather intel about software, project information, and other valuable data that can help us understand the vulnerabilities of our systems, information about our company, and our competitors.

Storage Tool?

For data storage, we will need:

1. SQL and NoSQL data stores (e.g., MySQL/Postgres (SQL database), Neo4j (graph database), Riak/Redis (key-value stores), Hadoop (HDFS), and HBase (column-oriented database)), because we will need to store structured data and records provided in JSON and XML.

2. Sufficient storage capacity (disk) with the appropriate performance characteristics to support the transaction volume and desired throughput rates. We have not done the calculation yet, but we anticipate we will need somewhere around 2TB of storage for SQL, 5TB for the graph database, and 30TB for the key-value stores. And, we expect the data storage requirements to increase 40% next year. The information we are collecting can be on the scale of millions and billions of records in a dataset. A flexible way to store the information we are collecting for analysis is using an RDF platform on top of the Hadoop ecosystem. For the transactional data sets, which are collected daily, we require a high processing capacity that allows retrieving information in seconds. For this reason, the hardware used for storage needs to have a scalable I/O system and the flexibility to expand memory capacity over time.

Curation Tool(s)? What tools and/or methods will I need to curate my data (will be adjusted as you get into the curation section)?

Turtle is a textual format of subject, predicate, and object. N-Triples uses more words than Turtle, but it is convenient when handling millions of records.

The RDF tool we have chosen is Apache Jena Elephas on top of Hadoop. Apache Jena Elephas has several APIs available built against Hadoop Map-Reduce 2.x. Its Common API provides binary serialization and deserialization for RDF, using NodeWritable, TripleWritable, QuadWritable, and NodeTupleWritable types in conjunction with RDF Thrift.

Apache Thrift is a tool that is efficient for defining data types in a definition file, like a data dictionary. It takes the file as input and generates code through a compiler to build remote procedure calls so that clients and servers can communicate in multiple programming languages.

The benefit of using Turtle and N-Triples is that they are human readable and fast to parse. Apache Thrift, by contrast, provides format exchange between co-operating processes through disk and networks. It is especially designed for machine processing, but because it uses binaries, it is not human readable.

The use of Intel Graph Builder would allow us to build large scale graphs through the use of MapReduce and Pig language scripts. All the tools we have mentioned are open source No-SQL tools.

Documentation? Is the data well documented? If not, how am I going to document it?

Yes. All variables are defined and well documented.

Sharing/Security? What is my plan for making data sharable within the community? What is my plan for restricting to appropriate access?

The data will normally not be shared. And, when shared, it will be done in an abstract way, and only when a company is a partner collaborating with us on a joint win. We do not want them to develop negative perceptions about this data mining project we are doing. Therefore, we need to separate information into levels 1 through 5: the higher the level, the higher the risk.

Data Set 1: Text data from web sites that provide an API to extract data

Collection Methods/Tools? API
Storage Tools? SQL; Hadoop/HDFS/HBase
Curation Tools? Apache Jena Elephas
Documentation? Yes
Sharing/Security? We do not need to share our data with other communities

In selecting the tools for the above matrix, we need to consider the optimal traits of the tools we are selecting. To help with this, here is a list of criteria appropriate to storage tools, presented as sample pairings for the trade-offs we may have to make. One suggested scoring method is to score each criterion on a scale of 1-10, then calculate a total score for the tool and take the absolute difference between pairs. In this case, a “perfect” solution would score 100 with a difference score of 0. A small sketch of this calculation follows the matrices below.

 Scalability vs. Flexibility

 Openness vs. Security

 Interoperability vs. Support/Robustness

 Applicability to problem vs. Reusability

 Maturity vs. Innovation

 Cost

Using those criteria and that scale, we compare the two candidate data stores, SQL and Hadoop/HDFS/HBase. SQL is a relational database; users have to scale a relational database up on powerful servers that are expensive and difficult to manage. Hadoop/HDFS/HBase is much more scalable because it can elastically distribute work (MapReduce) across different servers’ memory. Also, it does not need a fixed schema, which means it provides more flexibility than SQL.

SQL Matrix

 Scalability 3 vs. Flexibility 3
 Openness 8 vs. Security 9
 Interoperability 10 vs. Support 7
 Applicability 8 vs. Reusability 4
 Maturity 9 vs. Innovation 3

Total Score: 64/100; Trade-Off Score: 21

Hadoop/HDFS/HBase Matrix

 Scalability 10 vs. Flexibility 9
 Openness 5 vs. Security 9
 Interoperability 10 vs. Support 7
 Applicability 10 vs. Reusability 6
 Maturity 9 vs. Innovation 10

Total Score: 85/100; Trade-Off Score: 21
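Here is the small sketch of the scoring calculation referenced above. It shows one reading of the method (sum all ten ratings for the total, and sum the absolute differences within each pair for the trade-off score); the scores below are placeholders, not the matrices above.

```python
# One reading of the scoring method described above; the ratings here are
# hypothetical placeholders, not the SQL or Hadoop matrices.
def score_tool(pairs):
    total = sum(a + b for a, b in pairs)              # out of 100 for five pairs
    trade_off = sum(abs(a - b) for a, b in pairs)     # 0 means perfectly balanced
    return total, trade_off

candidate_pairs = [
    (7, 6),   # Scalability vs. Flexibility
    (8, 8),   # Openness vs. Security
    (9, 7),   # Interoperability vs. Support/Robustness
    (8, 5),   # Applicability vs. Reusability
    (6, 9),   # Maturity vs. Innovation
]
total, trade_off = score_tool(candidate_pairs)
print(f"Total: {total}/100, Trade-off: {trade_off}")
```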

Data Curation and Quality Assurance

In this section, we detail the plan for how we document and QA our data. The objective is to better understand our data, so that we can maximize the quality of input, select appropriate tools/tests, and be able to explain where and why we got the answers we did.

In terms of tool selection, the curation and QA tools will be a mix of our Collection/Storage/Retrieval (CSR) tools and our analytical tools. We will use our CSR tools to query and format the data, and possibly even do a data census, while we will use our analytical tools to do a data census and ask very specific questions of the data.

For the analysis, graph algorithms are applied to computer network data. As we collect data from the World Wide Web, the utilization of graph analysis is essential to understand the geographical source of the data (network addresses correlated to geographical location).

Graph data is stored in a graph data structure as nodes by mapping entities in a computer network; for example, an IP address connected via a TCP session and its relations to other network components. The information can be represented through a tree graph analysis of IP addresses and other components by geographic location. Because the information can have different formats, it has to be stored as multiple flows in RDF.

There are several algorithms that can be used for analysis. The ones we have selected are subgraph isomorphism, also known as pattern matching, and badness propagation. The pattern matching algorithm uses string patterns for text analysis and is commonly used in web search engines and natural language processing; exact matching patterns can be written in Java to extract information from a website. Badness propagation, on the other hand, identifies the error of the output using observed error analysis in a multilayer network, where inputs and outputs are treated as pairs. The algorithm runs the network to calculate the output, then computes the error, updates the weights, computes the error in the hidden layers, updates the weights for the hidden layer errors, and returns the output. This helps with mapping character strings in character recognition analysis.

One of the most important characteristics of graph analytics is the exploitation of visualization factors when utilizing an unstructured construct like RDF.

Once the information is stored in a semantic-graph database such as RDF, the data will be extracted using SPARQL, a query language that is supported by the World Wide Web Consortium. The structure of RDF makes the URLs machine readable by storing data using turtle or n-triplets.

Expected QA Challenges? Qualitative assessments, research pedigree and lineage of the data, understand history of use.

1. How to interpret the data from social media network?

Social network sites contain mostly unstructured data; people post information in their natural language, so natural language processing is one important challenge. Also, if we need to extract data from websites in other languages, such as Chinese, translating the language is a challenge.

2. How to create and maintain content validation rules?

We need to decide which terms count as keywords. The set may change in the future, so we need rules for deciding when keywords should be changed or added.

3. How to make sure data is clean?

It is difficult to verify that data found on the Internet is true. Therefore, we need to think about how to define data cleaning rules.

4. Understand how to use those data.

We need to interview our users to find out which information they really want.

5. Data Aggregation

The data will be good for aggregate analysis; it can be used to classify or define the larger population of a field such as computer science, business, accounting, and so on. It will be possible to organize the data into groups and analyze the subpopulations.

6. Sampling bias

Sampling bias may occur, for example, when we use information from websites to find a competitor’s advantages and disadvantages; some positions may never be posted on a website, introducing collection bias. We can interview our users to judge whether the results and the sample are reasonable.

7. Spelling errors

Spelling errors may happen because of job seekers’ input mistakes. We can use fuzzy grouping to solve this problem; it can group typically misspelled or closely spelled words to recognize them. If we still cannot determine what a word really means, we simply drop those entries.
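As an illustration of the fuzzy grouping idea, here is a minimal Python sketch using the standard library’s difflib; the canonical skill list and the similarity cutoff are hypothetical.

```python
import difflib

# Canonical skill terms we expect; misspelled entries from job-seeker input
# are mapped to the closest known term, or dropped if nothing is close.
KNOWN_SKILLS = ["cybersecurity", "python", "statistics", "cloud computing"]

def normalize_skill(raw, cutoff=0.8):
    matches = difflib.get_close_matches(raw.lower(), KNOWN_SKILLS,
                                        n=1, cutoff=cutoff)
    return matches[0] if matches else None  # None means drop the entry

print(normalize_skill("cybersecutiry"))  # "cybersecurity"
print(normalize_skill("pyton"))          # "python"
print(normalize_skill("zzzz"))           # None
```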

What are the types of data we expect? Review documentation, develop a plan to perform data census based on documentation, report census results and extend/amend documentation as appropriate.

Field – Data Type – Validation Rules / Comments

 Count – Numeric – Must be a number; counts the number of times a word matches
 Source – URL – Needs to be a valid URL
 Date/Timestamp – Timestamp – Must be a date/time
 Profile – Text, Source, Profile picture – Evaluate profile strength; the more information in the profile, the more we can trust that it is a true profile
 Month – Numeric – Derived from Timestamp; between 1 and 12
 Time – Timestamp – HH:MM:SS
 Year – Numeric – Derived from Timestamp
 Picture – File format such as jpg, png – File format should be an image file
 Skills – Text – Need to interview our customer to decide rules
 Technologies – Text – Need to interview our customer to decide rules
 Employee number – Numeric – Less than 5000
 Agencies – Text – Same rules as Company name
 Programs/contracts – Text – Need to interview our customer to decide rules
 Company name – Text – Need to interview our customer to decide rules
 Employees’ name – Text – Text length should be less than 40
 Pre-RFPs – Text – Need to interview our customer to decide rules
 RFPs – Text – Need to interview our customer to decide rules
 Amount – Float – Greater than 0
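To illustrate how a few of the validation rules above could be applied in practice, here is a minimal Python sketch; the field names and the sample record are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical validators mirroring a few rules from the table above.
def valid_source_url(value):
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

def valid_employee_count(value):
    return isinstance(value, (int, float)) and 0 < value < 5000

def valid_amount(value):
    return isinstance(value, (int, float)) and value > 0

def valid_employee_name(value):
    return isinstance(value, str) and 0 < len(value) <= 40

record = {"source": "https://www.linkedin.com/", "employees": 1200,
          "amount": 250000.0, "employee_name": "Jane Doe"}
checks = {
    "source": valid_source_url(record["source"]),
    "employees": valid_employee_count(record["employees"]),
    "amount": valid_amount(record["amount"]),
    "employee_name": valid_employee_name(record["employee_name"]),
}
print(checks)  # any False value flags the record for review
```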

QA Methods? What are the methods we propose to address quality issues in the data? How do we document these?

1. Interpreting data from social media networks: develop natural language processing (NLP) capabilities to interpret data from social network sites.

2. Creating and maintaining content validation rules: interview our users and reference the data documentation to decide on content validation. For example, do we need to extract data created five years ago? If a vacancy has already been open for two years, does that company really want people with that ability?

3. Data aggregation: define how to aggregate the data to make sure we can analyze it easily.

Curation Plan? How do we implement QA while allowing for reversion and multiple approaches? How do we ensure documentation? What methods to prepare data for analysis (flattening, selection, disaggregation or aggregation)?

1. Review data documentation - can be obtained from the sources

2. Develop tests to describe data and verify documentation

3. Modify documentation as necessary

4. Conduct data cleaning/modification

5. Profile the data to improve the ability to search it

6. Ensure reversion, if necessary - never make changes to the raw data (duplicate data before curation, store raw data)

7. Integrate and format data for analysis

Descriptive Statistics: run descriptive statistics over the curated data.

Pattern Analysis: apply the pattern matching and error propagation steps described above (compute the error, update the weights, compute the error in the hidden layers, update the hidden layer weights, and return the output).

Modeling: build the network calculation model.

Reference Comparison: compare each company’s job openings and the skills, experience, and education they desire.

Profile:

Data Set 1 Profile

Expected QA Challenges? How to extract data from social network sites (SNS) and career websites to find job seeker and employee profiles.
 Missing Data: There will be missing data because not everyone completes their profile and not every company has full information on its web site.
 Entry and Collection Bias: We try to use information from websites to find competitors’ advantages and disadvantages; however, some positions may not be released on a website, which may introduce collection bias.
 Data Aggregation: Data is well labeled and organized, although it may present a challenge for statistical analysis.
 Typing Errors: Our data may have some typing errors because of job seekers’ input mistakes.
 Contextually Acceptable Values: Some values need to follow rules; for example, an age over 70 years old may not be plausible.
 Error Patterns: These may appear because of input mistakes.
 Duplication: Duplicates may appear because a job seeker or competitor may post the same information on different websites.
 Legitimacy: Some private data cannot be collected because of the law.

Expected Data Types? Text, picture, date, numeric, source, skills, technologies, company name, employees’ names

QA Methods?
1. Develop some natural language processing (NLP).
2. Documentation Review: use each website’s data documentation to figure out which data is reasonable.
3. Descriptive Statistics: number of sources, minimum, maximum, mean, median, and standard deviation of the data.
4. Interview users to assess the data quality and find drawbacks to correct.

Curation Plan? We can use each person’s past employers to curate the data. When we need to search for a company, we can easily find the people who used to work at that company and in which positions.

Contract:

Data Set 2: Contract

Expected QA Challenges? How to extract contract files from the websites.
Missing Data: This data set does not have a missing-data problem.
Entry and Collection Bias: This data set does not have an entry or collection bias problem.
Data Aggregation: Data is well labeled and organized, although it may present a challenge for statistical analysis.
Typing Errors: The data may contain typing errors from incorrect input.
Contextually Acceptable Values: Some values need to follow rules, such as agency names and RFP content.
Error Patterns: These may appear because of input mistakes.
Duplication: The same contract information may be published on different websites.
Legitimacy: Some private data cannot be collected because of the law.

Expected Data Types? Agencies, Pre-RFPs, RFPs, Amount.

QA Methods?
1. Develop natural language processing (NLP) to interpret the text.
2. Documentation Review: use each website's data documentation to figure out which data is reasonable.
3. Descriptive Statistics: number of sources, minimum, maximum, mean, median, and standard deviation of the data.

Curation Plan? We can curate the data by who offered the contract, which agency issued it, and the contract budget. This helps us find the contracts we want to bid on.

Statistical and Analytical Plan/Approach What analytical methods are appropriate to the data? Make sure you have methods that match your data types and structures.

Since our data will mix different data types (numerical, geospatial, and text), we will use classical statistics and predictive analysis to check and verify our goals. Table (1) below shows which types of statistical tests and analyses might be applied.

Statistical Test | Test or Analysis Name
Classical Statistical Analysis | Descriptive Statistics and Distribution Curves
Predictive Algorithm | Clustering (Nonlinear Algorithms) and Latent Semantic Indexing
Significance/Quality Test | Chi-square Test
Hypothesis Testing | T-Test / F-Test
Table (1): Statistical Test Types

To make sense of the data, we need ways to summarize and visualize it. The first step is to describe the basic features of the data by running descriptive statistics for each data set or variable we have. For example, function codes are categorical data, so one approach is to find the most common value in the data, which is the mode. Here is an example of how numerical and non-numerical data could be described:

Variable or Data Set | Descriptive Statistics
Industry Codes | Frequencies / mode / bar chart or pie chart
Seniority Codes | Frequencies / normal curve
Table (2): Selected descriptive statistics measurements
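
A short R sketch of the descriptive statistics in Table (2), using a hypothetical vector of industry codes; the mode, frequencies, and bar chart are each one line of base R:

# Hedged sketch: descriptive statistics for a categorical variable (hypothetical codes)
industry_codes <- c("IT", "Defense", "IT", "Health", "IT", "Defense")
freq <- table(industry_codes)              # frequencies
names(freq)[which.max(freq)]               # mode (most common value)
prop.table(freq)                           # relative frequencies
barplot(freq, main = "Industry codes")     # bar chart; pie(freq) would give a pie chart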

A good starting point, before the Latent Semantic Analysis, is to create a word cloud of the job postings. A word cloud will help us visualize the most common terms across all job postings and get a general feel for the data set. We can also make a time-series graph of the number of job postings over time (the last few months) to see whether there is any pattern in the number of postings for all competitors or for any one of them.

Which of these methods are appropriate to the defined problem? Produce a useful output relative to the question you are asking.

Which methods are relevant to understanding the user? Users can be analysts, administrators, or field personnel. They can be quantitatively or qualitatively oriented. Produce user-relevant technology.

What analyses can be measured? Can they be mapped to your success endpoints? Your significance and evaluation tests? Make sure that the analyses you choose can be tested/measured.

A second step before LSA is selecting which competitors we will collect information about.

Data set / Competitive Intelligence / What Do I Bid On:

Deltek's GovWin tool: used to search for and extract awarded contracts or upcoming Pre-RFPs by customer. FedBizOpps (FBO) is the Federal government's web site (fedbizopps.gov) that posts all Federal procurement opportunities with a value over $25,000.
Competitive Intelligence: For all three data sets, we will use them as a filter to select our target competitors, those with a high chance of winning a contract, based on historical data from the three websites. Analysts: descriptive statistics and a summary table (percentages and frequencies). Executive audience: a summary table or pie chart.
What Do I Bid On: From the graphs and the table, it will be obvious which competitors we should collect data about.

USASpending.gov: contains all prime recipient contract transactions over $3,000; all grant, loan, and other financial assistance transactions over $25,000; and all first-tier sub-recipient contract, grant, and loan transactions over $25,000.
What Do I Bid On: How many contracts has each company won (by company name)? How many contracts from the collected data might they bid on? What is the value of all contracts they won in the past and of the ones they might bid on? Build a classification table with all of the information above, broken out by company size.

LinkedIn Job Postings:
Analysts: Latent Semantic Indexing, with a clustering plot to describe trends. Executive audience: a simple view showing the results.
Proposal Managers (What Do I Bid On): What is the final plan to increase our chance to win? Including: Which competitor should we team up with? Which competitor should we keep collecting data on and analyzing further?

As a starting point for the analysis, we will use Latent Semantic Analysis (LSA), a method that analyzes keywords to find the underlying meanings in the data set. In our case, we are trying to extract details about competitors' current openings that can help CyberSmart predict our competitors' plans. The LSI approach will help us build clusters by generating a vector for each CV and job posting, applying LSA in two steps. First, build a matrix by transforming the data set of competitors' job postings and job seekers' CVs into an M x N matrix, where M is the number of CVs and N is the number of job postings. Second, apply the Singular Value Decomposition (SVD) algorithm to cluster on a statistically significant subset, where the CVs and job postings are divided into clusters and each cluster contains similar CVs and job postings. Diagram: clustering of CVs and job postings.
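
A minimal R sketch of the LSA idea, with the caveat that it builds the usual term-by-document matrix over a few hypothetical CV and job-posting texts (rather than the CV-by-posting matrix described above), truncates the SVD, and clusters the documents in the reduced space:

# Hedged sketch: LSA via SVD on a term-document matrix, then k-means clustering
library(tm)

docs <- c("cloud security engineer with CISSP",
          "java developer with agile experience",
          "seeking cloud security analyst, CISSP preferred",
          "agile java team lead")                    # hypothetical CVs and postings

corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("en"))

tdm <- as.matrix(TermDocumentMatrix(corpus))          # terms x documents
k   <- 2
s   <- svd(tdm, nu = k, nv = k)                       # truncated SVD
doc_coords <- s$v %*% diag(s$d[1:k])                  # documents in the latent space

kmeans(doc_coords, centers = 2)$cluster               # group similar CVs/postings together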

CyberSmart wants to use the inferences drawn from the Big Data to strengthen its position against the competitors and use that for planning. Table (3) is a hypothetical data set that we would collect and process to serve this goal.

Categorical Data

Record | Skills | Experience (years) | Education | Competitor | Date
1 | Strong analytical | 11 | Master | A | 1/12/2015
3 | Social Engineering | 9 | Bachelor | C | 1/14/2015
4 | Programming | 7 | Bachelor | D | 1/15/2015
Table (3): Hypothetical data set

The third type of statistical test is a significance or quality test, which in our case is a chi-square test. We will use the chi-square test to compare the percentage of fit to each cluster for each competitor and for CyberSmart, to see which of them has a statistically significant percentage. The statistical analysis that will be used is clustering analysis, a method of partitioning a set of data into a set of meaningful sub-classes. Clustering with nonlinear algorithms for non-numerical data will be used to analyze the job postings from our competitors: K-means is used for mixed data, and hierarchical clustering is used for categorical data. The goal is to find what kind of professionals they are looking for and to see whether these postings indicate something. For example:

Goodness-of-fit Test | Competitor A | Competitor B | CyberSmart
Cluster (1) | 66 | 70 | 400
Cluster (2) | 250 | 150 | 300
Total | 316 | 220 | 700
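
The chi-square comparison can be run directly on a table of cluster counts like the one above; a hedged R sketch using those same hypothetical counts:

# Hedged sketch: chi-square test on the hypothetical cluster counts above
counts <- matrix(c( 66,  70, 400,
                   250, 150, 300),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("Cluster 1", "Cluster 2"),
                                 c("Competitor A", "Competitor B", "CyberSmart")))
chisq.test(counts)   # tests whether cluster membership is independent of company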

The fourth type of statistical test is hypothesis testing. In our case, an F-test (the ANOVA analysis) compares a month's worth of an individual competitor's job-posting activities to determine whether that month's distribution of activities is the same as, or different from, the competitor's "normal" profile. To keep tracking our competitors' activities, we can use this monthly data set for many analyses, since we have detailed information for each one. First, we can do a t-test to compare the average number of monthly postings and hires for each competitor, and estimate the confidence interval for the average number of postings and hires. A monthly data set of postings and hires can also be compared with future data sets as an indicator of which specific skills and certificates they are still looking for.
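
A hedged R sketch of both tests on hypothetical monthly posting counts (the numbers are placeholders, not collected data):

# Hedged sketch: F-test (one-way ANOVA) and t-test on monthly posting counts
postings <- data.frame(
  competitor = rep(c("A", "B", "CyberSmart"), each = 6),
  monthly    = c(12, 15, 11, 14, 13, 16,    # Competitor A (hypothetical)
                 20, 22, 19, 25, 21, 23,    # Competitor B (hypothetical)
                 17, 18, 16, 19, 18, 17))   # CyberSmart (hypothetical)

summary(aov(monthly ~ competitor, data = postings))   # F-test: same posting rate across companies?

# t-test with a confidence interval for the A vs. B difference in average monthly postings
t.test(monthly ~ competitor, data = subset(postings, competitor %in% c("A", "B")))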

Tools for Text analysis and Statistical analysis

Tools for Text analysis

NLP tools such as Stanford CoreNLP, NLTK with Python, and Apache OpenNLP in R are a necessary technology today. Stanford CoreNLP is one of the most effective packages, providing reliable, fast, high-quality, open-source text parsing tools, and it offers a continuous development environment built on a widely used programming language (Agre, et al., 2014)2. Stanford CoreNLP therefore looks like a strong fit for our data, especially since the main data set is text-heavy; however, R is still a good tool for us. It is open source like Stanford CoreNLP, but R also has a vast number of libraries implementing models and methods that we can use for all of our other tests, not only text analysis. R has a good NLP library, Apache OpenNLP, which processes text using tokenization, part-of-speech tagging, sentence segmentation, and parsing. The toolkit is simple and easy for the end user to use.
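
A hedged R sketch of the Apache OpenNLP workflow mentioned above, following the openNLP package's documented annotator pattern (it assumes the openNLP and NLP packages, plus the openNLPdata models, are installed; the posting text is hypothetical):

# Hedged sketch: tokenization and part-of-speech tagging with openNLP in R
library(NLP)
library(openNLP)

text <- as.String("CyberSmart is hiring a cloud security engineer in Reston.")

sent_ann <- Maxent_Sent_Token_Annotator()
word_ann <- Maxent_Word_Token_Annotator()
pos_ann  <- Maxent_POS_Tag_Annotator()

ann <- annotate(text, list(sent_ann, word_ann))   # sentence and word segmentation
ann <- annotate(text, pos_ann, ann)               # part-of-speech tagging

words <- subset(ann, type == "word")
text[words]                                       # the tokens
sapply(words$features, `[[`, "POS")               # their POS tags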

Tools for statistical analysis

Clustering can be done using various statistical tools, including R, SAS, and SPSS. The SAS procedures that perform cluster analysis are varied; for example, hierarchical clustering of multivariate or distance data, and K-means and hybrid clustering for large multivariate data. If we compare a particular method, such as K-means clustering, it is arguably better supported in SAS than in R. However, R is one of the most popular tools for clustering, and it is sufficient for our purpose, because all we want is to map categorical data into groups to detect patterns about our competitors. Another reason to choose R over SAS is that R's growth will continue because it is free and easy to use, unlike SAS, and R offers many packages for any single test.
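
A hedged R sketch of the two approaches named here: k-means on a numeric feature, and hierarchical clustering of categorical features via a Gower distance from the CRAN cluster package (all data below is hypothetical):

# Hedged sketch: k-means and hierarchical clustering of job-posting attributes
library(cluster)

postings <- data.frame(
  years_required = c(2, 3, 10, 12, 1, 11),
  clearance      = factor(c("none", "none", "TS", "TS", "none", "TS")),
  category       = factor(c("dev", "dev", "security", "security", "dev", "security")))

kmeans(postings["years_required"], centers = 2)$cluster        # k-means on the numeric column

d  <- daisy(postings[, c("clearance", "category")], metric = "gower")   # distance for categorical data
hc <- hclust(d, method = "average")                             # hierarchical clustering
cutree(hc, k = 2)                                               # cluster assignments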

Most of these tools used for big data analysis promise to save us time and money and help us discover first-hand business insights. However true that promise may be, it is of paramount importance that we understand which tools are right for our skill set and for the project we are trying to do. We are looking for statistical tools that cover significance tests and, most importantly, help answer the problem we are trying to solve. R and Tableau are the statistical and visualization tools that we will consider for the analysis at CyberSmart.

Data Visualization Data visualization helps to make sense of the huge amounts of data that can otherwise be mind boggling. Visualization provides a way to see how RFPs, competitors, and labor skills are all connected, and to see the trends and patterns from different angles.

2 Agre, G., et al. (2014). Artificial Intelligence: Methodology, Systems, and Applications: 16th International Conference, AIMSA 2014, Varna, Bulgaria, September 11-13, 2014, Proceedings. Springer.

One of the most difficult jobs for a data scientist is to convey the insights from big data to the stakeholders. It is not practical to use databases like MySQL or spreadsheets like Excel every time one needs to create a complex report. Data visualization is so important in the current age of big data because it reduces complexity through simple, easy-to-understand visualizations.

Many visualization tools do not require specialized coding skills. And, there are several useful and inexpensive proprietary and open source data visualization tools available today.

We selected two basic statistical and visualization tools and described them here:

1. The first tool for data visualization is Tableau. Tableau focuses primarily on business intelligence, but it is useful for creating maps of geographical data, bar charts, and scatter plots without any programming skills. CyberSmart already owns licenses for Tableau and has recently added Tableau's web connector product, which enables real-time connectivity to a SQL database or API calls to other data sources, meaning it is possible to see live data in our visualizations. Since we will use R for the clustering analysis, Tableau's ability to connect to an R environment lets the user easily build interactive visualizations. Tableau is relatively inexpensive, has an easy-to-learn, intuitive interface, and has the concept of a "story" that uses a progression of worksheets showing different aggregations, drill-downs, or views to tell a story that executives can easily understand.

2. R Project – R is an essential tool for this project for the following reasons:

a. There are over 2M users of R today who use it for statistical analysis, data visualization, and predictive modeling (which implies that skilled resources are available, and such a large user base attests to its capabilities and the quality of its results).

b. R is a programming language designed by statisticians, for statisticians. The language provides objects, operators and functions that make the process of exploring, modeling, and visualizing data easy.

c. R is an environment for statistical analysis: The R language has statistical analysis functions for virtually every data manipulation, statistical model, or chart that the data analyst could ever need.

d. R is an open-source software project. Like other successful open-source projects such as Linux and MySQL, R has benefited for over 15 years from the "many-eyes" approach to code improvement, and as a result has an extremely high standard of quality and numerical accuracy. The project leadership has grown to include more than 20 leading statisticians and computer scientists from around the world. As a result, there is a strong and vibrant community of R users on-line, with a rich set of community-maintained resources for the beginning to the expert R user. Also, as with all open-source systems, R has open interfaces, meaning that it readily integrates with other applications and systems.

We identified four basic data visualizations that are appropriate to our problem and explain why below. Examples of visualizations for our data (a short R sketch of the first two follows the list):

1. Wordle, a word cloud creator, is a useful tool to visualize the categories of job postings. A word cloud, in this case, will help us visualize the most common terms in the job postings and get a general feel for the data set.

Graph (1): a word cloud of the job postings3

2. A line graph will be useful in visualizing the distribution of job-posting activities over time by a competitor. This technique will also be good for visualizing which job categories follow different distribution patterns over time.

Graph (2): the number of job postings over time

3. Clustering partitions data into a set of meaningful sub-classes. We apply clustering with nonlinear algorithms for non-numerical data to analyze the job postings from our competitors: K-means for mixed data and hierarchical clustering for categorical data. The visualization below shows an example of clustering job postings by competitor.

3 http://www.kdnuggets.com/2015/08/data-science-data-mining-online-certificate-degrees.html

4. A timeline could be used to show a progression of change in a user's normal activities over the past 10, 30, or 60 days; for example, to show how one user has increased their file reads from "normal" to 5 times normal, then 20 times normal, then 50 times the normal volume of data files being read. The progression might indicate the user is working on a special research or discovery project, or it could indicate they are downloading company documents for nefarious purposes. Obviously, we would need more data points and other types of evidence before singling out a user.
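
As referenced above, here is a hedged R sketch of visualizations 1 and 2; the posting text and monthly counts are made-up placeholders, not project data:

# Hedged sketch of Graph (1) and Graph (2); all data below is hypothetical
library(tm)
library(wordcloud)

postings <- c("cloud security engineer CISSP clearance",
              "java developer agile secret clearance",
              "data analyst cloud security")

tdm  <- TermDocumentMatrix(VCorpus(VectorSource(postings)))
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
wordcloud(names(freq), freq, min.freq = 1)                    # Graph (1): word cloud

months <- seq(as.Date("2016-05-01"), by = "month", length.out = 6)
counts <- c(12, 15, 11, 18, 22, 19)                           # hypothetical posting counts
plot(months, counts, type = "l", xlab = "Month", ylab = "Job postings",
     main = "Job postings over time")                         # Graph (2): line graph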

Evaluation Methodology and Feedback Process For the evaluation criteria, we will implement an approach that measures progress over time against the main objectives of the work. Using this approach, we can set a range of objectives from the beginning of the project. This will help us decrease spending and focus on specific contracts with similar requirements. Working on projects with similar requirements will allow us to become more proficient and narrow our expertise. Client satisfaction will bring more business and help improve our win rate on public sector contracts.

For the collection of data, we will use quality metrics that indicate possible problems with the data, concentrating on dynamic constraints that are operational. We want a directional metric that improves the way we collect the data and sharpens the objective of gathering it. We will establish a methodology to clean the data as follows (a small R sketch of these checks appears after the list):

 We will use descriptive statistics, where a minimum or a prevalence of zero indicates that there must be an error

 Frequency and logistic checks to detect errors

 Association between variables

 Create a data dictionary to use as a reference

 Correlation index to identify missing information, errors, and improper information
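
A small R sketch of the checks above, run on a hypothetical data frame of collected postings (the column names are placeholders):

# Hedged sketch: descriptive, frequency, and association checks for data cleaning
jobs <- data.frame(
  competitor = c("A", "B", "C", NA, "B"),
  seniority  = c("Senior", "Junior", "Senior", "Senior", NA),
  years_exp  = c(11, 3, 9, 7, 250))                      # 250 is an obvious entry error

summary(jobs)                                            # descriptive statistics (minimums, etc.)
colSums(is.na(jobs))                                     # missing values per variable
table(jobs$competitor, useNA = "ifany")                  # frequency check for odd or missing categories
cor(jobs$years_exp, as.numeric(is.na(jobs$seniority)))   # crude correlation/association check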

Some of the factors that we will consider in the evaluation are conformance, accuracy, accessibility, interpretability, and the success of completion based on cost and schedule.

The process will require the collection of data, ETL, data profiling, data validation and constraints, integration, data quality checks, and establishing guidelines for the interpretation of the data, among other factors.

Additional measures address the requirements for impact and outcome, which can help us establish better success factors. Impact covers the activities and events needed in the short term. Another methodology can be assessments that measure achievable goals and other aspects that need improvement.

Measuring success factors against objectives will require matching the data specification against the data, verifying whether there are gaps in the data collected, and reviewing the granularity of the analysis and the data collection needed to mitigate damage. For example, the data can be truncated, have missing values, be incomplete, or have indistinguishable metadata. This approach will help us measure the success of our projects from beginning to end by analyzing progress against the original objectives, and it allows us to measure how well we are meeting the client's expectations throughout the whole process.

Additionally, we will establish measurements such as the mean, median, and distribution. For example, for arbitrary missing-data patterns we could simulate (impute) the missing information using Markov Chain Monte Carlo analysis.
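
A hedged sketch of that idea using the CRAN mice package for multiple imputation (the data frame is hypothetical, and mice's default chained-equations methods stand in for the MCMC-style simulation described above):

# Hedged sketch: impute missing values, then recompute summary measures
library(mice)

jobs <- data.frame(
  years_exp = c(11, NA, 9, 7, 4, NA),
  postings  = c(12, 15, NA, 18, 22, 19))

imp <- mice(jobs, m = 5, seed = 123, printFlag = FALSE)   # five imputed data sets
completed <- complete(imp, 1)                             # first completed data set
colMeans(completed)                                       # means after imputation
apply(completed, 2, median)                               # medians after imputation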

The evaluation method will consist of quantitative and qualitative methodologies. The qualitative method will use a questionnaire after the final results of the analysis have been submitted. Feedback will allow us to improve our methodologies, research process, analysis, requirements gathering, performance, and how well we meet the client's expectations.

Security The data that we are collecting from social media and other websites is not sensitive data; however, after the data is analyzed and transformed into information, it becomes sensitive for the client. For this reason, it is important to establish security procedures that help us keep the information secure. If the information is at risk, it could be used to exploit vulnerabilities in a client's system, to gain a competitive advantage based on their projects, to target specific employees, etc. In order to protect the client's information and to protect our company, we have established the following procedures to ensure ongoing compliance while minimizing the drag of maintaining our systems.

 Complex passwords and encryption

 Monitoring software

 Bottom-up security approach – Limits the access of employees to specific information and job functions.

 Continuous updates to security patches

 Threat assessments: monitoring for potential risks, collecting and reviewing logs, intrusion and penetration testing, and testing of security protocols.

For data collection and analysis, we will be implementing the following guidelines:

 Data and transaction logs are stored in multiple servers for redundancy

 Input validation and filtering to refine the data we are collecting from multiple sources

 Improvements to our algorithms for data collection in order to create valid inputs

 Encrypted framework

 Provenance metadata

 Embed security in the middleware to integrate security aspects of Hadoop and our RDF system.

Failure to comply would cost our company customer goodwill and trust, as well as future revenue lost to reputation damage and employee downtime. Steps to protect our reputation are the following:

 Response planning includes the analysis and protocols.

 Incident analysis includes the containment of the breach.

 Incident disclosure includes notifications to third parties.

 Loss mitigation includes customer retention procedures.

 Remediation includes public relations.

Privacy and Ethics The project is about protecting data and company privacy. The information will be collected ethically from the World Wide Web without violating individuals' privacy. We will only use information that is freely available through social media and resumes. Companies are hiring us to identify the exposure of information about their company and their competitors.

As it was mentioned earlier, the data that would be collected is not sensitive, but the information that would be captured from the data will be sensitive. We have to take into consideration that there can be

legal risks involved if the information is exposed. A lack of data management safeguards can put a company out of business or cause financial losses due to a data leak. We should consider State laws, Federal laws, and common law, among others, to protect the client.

 Deceptive Trade Practices-Consumer Protection Act helps clients to ensure that their data is guaranteed to be protected by our business.

The privacy and ethics safeguard would be the following:

 Incident response plan

 Security awareness among employees

 Security risk assessments

 Internal and external communications

 Training on security, privacy, and ethics for employees.

Another way to protect the privacy of the data is to educate our employees about the risk of data exposure and the importance of protecting the data and information. It is also important to limit employees from sharing or distributing information.

Are there cost implications to maintaining appropriate privacy and ethics?

From an ethical standpoint, the project will require a contract to collect data, because the data collected will become valuable information that is sensitive for the client. It will require documentation of the funding and how the money will be utilized, approval for any other legal issues affecting the budget spending, and validation and verification requirements for the collection of data.

One possible way to protect information is to give it to the client and delete it from our servers. The information is not reusable because we will be signing a non-disclosure agreement with the client. This way we won't be able to share information between clients, and it gives the client peace of mind that a competitor won't acquire information about their company from us. We will also have a non-disclosure agreement about our client list to protect ourselves and our clients' privacy. Data will be collected for each individual engagement and won't be reused on other projects. This will help us keep a level of confidence with our clients and mitigate losses from security breaches.

Cost and Project Implementation We developed a project management plan to guide and manage this project from startup through a period of 20 months. We believe that will be a sufficient period of time to determine whether the investment made and the results achieved actually deliver the anticipated (and desired) results. The picture of the high-level WBS below only shows the project plan through May 2017 (the months of June 2017 through November 2017 are the same as May 2017). We provide a more detailed view of the project WBS in an embedded Excel file in the appendix.

Project WBS: the project plan spans four phases (Project Preparation, Project Initiation, Solution Build, and Sprint Iterate) across the months May 2016 through May 2017.

WBS # | Project Phases (correlates to Project Plan Phase)
1 Competitive Intelligence Project
1.1 Project Preparation
1.2 Project Initiation
1.3 Architect, Design, and Build
1.3.1 Prepare for Data Collection
1.3.2 Prepare for Analytics
1.3.3 Establish Management Process for Measuring Results
1.3.4 Solution Design and Build
1.3.4.1 Execute the analytic trials and evaluate
1.3.4.2 Design Data Visualization Reports
1.3.4.3 Iterate Sprints to Improve data quality, analytical models, and Visualizations
1.3.4.4 Prepare Results for Board Review

Project Startup (Initiation) For the project startup we developed a business case along with the typical supporting documentation and estimates. We created a list of personnel we would need for this project including job roles, estimated level of effort (using Full Time Equivalent (FTE) counts), and estimated hourly rates. We also created a list of other costs for the project. Those other costs include infrastructure (virtual machines, storage, network capacities), software (for example, Cloudera Hadoop, and Tableau), and access to data sources.

The following series of graphics show the steps and the pieces that went into building a business case for this project.

First, we had to understand and estimate the costs: the personnel costs and other costs for this project (the investment that we would need to make). We provide an extract of several tables below that show the estimated counts and costs of personnel and other items required for the project.

Second, we needed to identify the sources of return on the investment – Reduced B&P spend, and increased profits from higher win rates – and estimate how much those cash flows would be. We provide two tables showing the estimated cash flows from those two sources Reduced B&P spend and increased profits from higher win rates.

We wrap this business case up with a summary of the business value parameters: Total Investment, Total Return, NPV, and IRR. We also show an extract of the NPV cash flow table.

An extract of the Personnel roles and LOE (in Full Time Equivalent – FTE – values) over time is shown below:

Project Phases (correlates to Project Plan Phase): Project Preparation, Project Initiation, Solution Build, Sprint Iterate; months May 2016 through May 2017. Each role's non-zero monthly FTE values are listed below as extracted (the month-by-month alignment is only visible in the embedded spreadsheet):

Project Executive: 0.5, 0.5, 0.1, 0.1, 0.1, 0.1
Project Manager: 0.5, 0.5, 1.0, 0.25, 0.25
Agile Master: 1.0, 1.0, 1.0, 1.0, 1.0, 1.0
Data Analyst II: 0.25, 1.0, 1.0, 1.0, 1.0, 1.0, 0.75
Data Analyst I: 0.25, 0.75, 0.75, 0.75, 0.75, 0.75
Statistician II: 0.25, 1.0, 1.0, 1.0, 1.0, 1.0, 0.75
Statistician I: 0.25, 0.75, 0.75, 0.75, 0.75, 0.75
DBA: 0.25, 1.0, 0.8, 0.5, 0.25, 0.25, 0.1, 0.1, 0.1
NoSQL SME: 0.25, 1.0, 0.8, 0.5, 0.25, 0.25, 0.1, 0.1, 0.1
Data Quality SME: 0.25, 0.5, 0.5, 0.5, 0.5, 0.5, 0.1, 0.1, 0.1
Data Visualization SME: 0.25, 0.1, 0.1, 0.1
GUI Designer: 0.1, 0.1, 0.1
Java developer II: 1.0, 1.0, 1.0, 1.0, 1.0, 0.75
Java developer I: 0.25, 0.75, 0.75, 0.75, 0.75, 0.75
R developer II: 0.5, 0.5, 0.5, 0.5, 0.25, 0.25, 0.25, 0.25
Infrastructure Architect: 1.0, 0.5
Application Architect: 0.5, 0.5
Data Architect: 1.0, 0.5
Security Architect: 1.0, 0.5
Legal Advisor: 0.25, 0.1, 0.1, 0.1, 0.1, 0.1
Privacy and Ethics SME: 0.25, 0.1, 0.1, 0.1, 0.1, 0.1
Total Labor Hours by month (as extracted): 600.0, 1020.0, 684.0, 600.0, 564.0, 510.0, 444.0, 300.0, 354.0, 300.0, 354.0, 300.0
1.0 FTE = 120 hours/month

(The table above is a tab in the Excel spreadsheet that is embedded in the appendix of this Word document). We translated the LOE model into costs using an estimate for the hourly rate of the role. The table below is an extract of the estimated cost for labor (FTE value * Hourly Rate * 120 = LOE cost per month).

Project Phases (correlates to Project Plan Phase): Project Preparation, Project Initiation, Solution Build, Sprint Iterate; hours per month: 120; columns run May 2016 through May 2017. Each row shows the Job Code, its hourly rate, and the monthly labor cost in dollars ('-' means no allocation that month):

Project Executive ($174/hr): 10,440 | 10,440 | 2,088 | - | - | 2,088 | - | - | 2,088 | - | - | 2,088 | -
Project Manager ($97/hr): 5,820 | 5,820 | 11,640 | - | - | - | - | - | 2,910 | - | - | 2,910 | -
Agile Master ($135/hr): - | - | 16,200 | 16,200 | 16,200 | 16,200 | 16,200 | 16,200 | - | - | - | - | -
Data Analyst II ($102/hr): - | 3,060 | 12,240 | 12,240 | 12,240 | 12,240 | 12,240 | 9,180 | - | - | - | - | -
Data Analyst I ($73/hr): - | - | - | - | - | - | - | 2,190 | 6,570 | 6,570 | 6,570 | 6,570 | 6,570
Statistician II ($111/hr): - | 3,330 | 13,320 | 13,320 | 13,320 | 13,320 | 13,320 | 9,990 | - | - | - | - | -
Statistician I ($102/hr): - | - | - | - | - | - | - | 3,060 | 9,180 | 9,180 | 9,180 | 9,180 | 9,180
DBA ($82/hr): - | 2,460 | 9,840 | 7,380 | 4,920 | 2,460 | 2,460 | 984 | - | 984 | - | 984 | -
NoSQL SME ($92/hr): - | 2,760 | 11,040 | 8,280 | 5,520 | 2,760 | 2,760 | 1,104 | - | 1,104 | - | 1,104 | -
Data Quality SME ($89/hr): - | 2,670 | 5,340 | 5,340 | 5,340 | 5,340 | 5,340 | 1,068 | - | 1,068 | - | 1,068 | -
Data Visualization SME ($73/hr): - | 2,190 | - | - | - | - | - | 876 | - | 876 | - | 876 | -
GUI Designer ($820/hr): - | - | - | - | - | - | - | 9,840 | - | 9,840 | - | 9,840 | -
Java developer II ($82/hr): - | - | 9,840 | 9,840 | 9,840 | 9,840 | 9,840 | 7,380 | - | - | - | - | -
Java developer I ($73/hr): - | - | - | - | - | - | - | 2,190 | 6,570 | 6,570 | 6,570 | 6,570 | 6,570
R developer II ($95/hr): - | - | 5,700 | 5,700 | 5,700 | 5,700 | 2,850 | - | 2,850 | - | 2,850 | - | 2,850
Infrastructure Architect ($102/hr): - | 12,240 | 6,120 | - | - | - | - | - | - | - | - | - | -
Application Architect ($121/hr): - | 7,260 | 7,260 | - | - | - | - | - | - | - | - | - | -
Data Architect ($102/hr): - | 12,240 | 6,120 | - | - | - | - | - | - | - | - | - | -
Security Architect ($121/hr): - | 14,520 | 7,260 | - | - | - | - | - | - | - | - | - | -
Legal Advisor ($174/hr): - | - | 5,220 | 2,088 | - | 2,088 | - | 2,088 | - | 2,088 | 2,088 | - | 2,088
Privacy and Ethics SME ($155/hr): - | - | 4,650 | 1,860 | - | 1,860 | - | 1,860 | - | 1,860 | 1,860 | - | 1,860
Totals by Month: 16,260 | 78,990 | 133,878 | 82,248 | 73,080 | 73,896 | 65,010 | 68,010 | 30,168 | 40,140 | 29,118 | 41,190 | 29,118
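
As a quick check of the labor-cost formula (FTE * hourly rate * 120 hours per month), the small R sketch below reproduces two Project Executive cells from the extract above:

# LOE cost per month = FTE * hourly rate * 120 hours
fte_cost <- function(fte, rate, hours = 120) fte * rate * hours
fte_cost(0.5, 174)   # 0.5 FTE Project Executive = $10,440
fte_cost(0.1, 174)   # 0.1 FTE Project Executive = $2,088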

(The table above is a tab in the Excel spreadsheet that is embedded in the appendix of this Word document). There are also other costs for the project, like infrastructure capacities, software, and access to data sources. The table below is an extract of the counts and quantities of those elements over time (each row shows the monthly unit cost and the one-time cost, followed by the 12 monthly counts extracted for June 2016 through May 2017):

Virtual Machine Small Elastic (Web Server), $35/month, $250 one-time: 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3
Virtual Machine Medium Elastic (App Server), $85/month, $250 one-time: 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3
Virtual Machine Large Cluster (DB SQL Server), $145/month, $250 one-time: 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2
Storage in GB, $0.018/GB/month, $250 one-time: 1000, 3000, 3000, 5000, 5000, 5500, 6050, 6655, 7321, 8054, 8860, 9746
Network Capacities in Gigabytes/month (includes data in/out), $175/month, $250 one-time: 5, 5, 5, 5, 5, 5, 8, 8, 8, 8, 8, 8
Hadoop SaaS (Cloudera) / environment, $1,200/month, $450 one-time: 1 for each of the 12 months
Tableau SaaS / user, $120/month, $250 one-time: 2 for each of the 12 months
LinkedIn Access Costs, $0.050/unit, $5,000 one-time: 10000, 11000, 12100, 13400, 14800, 16300, 18000, 19800, 21800, 24000, 26400, 29100
Deltek's GovWin tool, $833.33/month, $2,500 one-time: 1 for each of the 12 months

(The table above is a tab in the Excel spreadsheet that is embedded in the appendix of this Word document). We translated those counts/capacities into costs using an estimate of the cost. The table below shows an extract of the other costs for the project.

Other project costs by month, May 2016 through May 2017 (all values in dollars; '-' means no cost that month):

Virtual Machine Small Elastic (Web Server): - | 320 | 70 | 70 | 70 | 70 | 70 | 105 | 105 | 105 | 105 | 105 | 105
Virtual Machine Medium Elastic (App Server): - | 420 | 170 | 170 | 170 | 170 | 170 | 255 | 255 | 255 | 255 | 255 | 255
Virtual Machine Large Cluster (DB SQL Server): - | 540 | 290 | 290 | 290 | 290 | 290 | 290 | 290 | 290 | 290 | 290 | 290
Storage in GB: - | 268 | 54 | 54 | 90 | 90 | 99 | 109 | 120 | 132 | 145 | 159 | 175
Network Capacities in Gigabytes/month (includes data in/out): - | 1,125 | 875 | 875 | 875 | 875 | 875 | 1,400 | 1,400 | 1,400 | 1,400 | 1,400 | 1,400
Hadoop SaaS (Cloudera) / environment: - | 1,650 | 1,200 | 1,200 | 1,200 | 1,200 | 1,200 | 1,200 | 1,200 | 1,200 | 1,200 | 1,200 | 1,200
Tableau SaaS / user: - | 490 | 240 | 240 | 240 | 240 | 240 | 240 | 240 | 240 | 240 | 240 | 240
LinkedIn Access Costs: - | 5,500 | 550 | 605 | 670 | 740 | 815 | 900 | 990 | 1,090 | 1,200 | 1,320 | 1,455
Deltek's GovWin tool: - | 3,333 | 833 | 833 | 833 | 833 | 833 | 833 | 833 | 833 | 833 | 833 | 833
Monthly Totals: - | 13,646 | 4,282 | 4,337 | 4,438 | 4,508 | 4,592 | 5,332 | 5,433 | 5,545 | 5,668 | 5,803 | 5,954

(The table above is a tab in the Excel spreadsheet that is embedded in the appendix of this Word document). For the next step, we needed to identify the sources of return on the investment – Reduced B&P spend, and increased profits from higher win rates – and estimate how much those cash flows would be. The two tables below show the estimated cash flows from those two sources: Reduced B&P spend and increased profits from higher win rates.

Finally, the most important values in the business case are presented here:

 We have a total cost (investment) for the project of $1,034,534

 We have a total return on the project of $3,162,192

 The NPV for the project is $753,468 (using a discount rate of 8%. We assumed that 8% is a reasonable cost of capital or an Expected Rate of Return; a.k.a. minimum or hurdle rate. Increasing the discount rate will reduce the NPV).

 The Internal Rate of Return (IRR) for all 20 months IRR is 35%. The IRR for just the cash flows in the first 12 months is 32%.

The table below is an extract of the Estimate of B&P Savings (one of the positive cash flows behind the $753,468 project NPV).

2015 B&P Spend: $8,300,000
2015 Deals Pursued (passed into Stage 3): 29
2015 P(Win): 45%
2016 Deals Pursued (passed into Stage 3): 23
2016 P(Win) desired: 65%
2016 Estimated B&P Spend: $7,645,342
2016 Estimated B&P Savings: $654,658 (7.89%)

Stage | 2015 number of deals completing the stage | 2015 estimated percentage of B&P spend per stage | 2015 estimate of B&P spend | 2015 average spend per pursuit | 2016 estimated number of deals completing the stage | 2016 estimated percentage of B&P spend per stage | 2016 estimate of B&P spend

1 Understand Customer: 47 | 5% | $415,000 | $8,829.79 | 47 | 5% | $415,000
2 Validate Opportunity (this is the primary point of impact from the Competitive Intelligence Project): 34 | 7% | $581,000 | $17,088.24 | 34 | 7% | $581,000
3 Qualify the Opportunity (pursuits passing this stage are counted as "We Bid"): 29 | 12% | $996,000 | $34,344.83 | 23 | 12% | $789,931
4a Develop Solution: 23 | 34% | $2,822,000 | $122,695.65 | 21 | 34% | $2,576,609
4b Propose Solution: 23 | 37% | $3,071,000 | $133,521.74 | 21 | 37% | $2,803,957
5 Negotiate and Close (deal won): 13 | 5% | $415,000 | $31,923.08 | 15 | 5% | $478,846
Percentage of pursuits bid versus won: 44.8% in 2015, 65.2% in 2016. Totals: 100% of B&P spend, $8,300,000 (2015) and $7,645,342 (2016).

(The table above is a tab in the Excel spreadsheet that is embedded in the appendix of this Word document). The table below is an extract of Estimate of Profits from more Wins

Description | Estimated Value
2015 Average TCV of Deals Won | $63,846,154
2015 Average Period of Performance of Deals Won | 5.0 years
2015 Average Annual Revenue of Deals Won | $12,769,231
2015 First Fiscal Year Revenue (Deals Won) | $6,384,615
2015 Average Profit Margin (Deals Won) | 34.0%

2016 Average TCV of Deals Won | $67,657,897
2016 Average Period of Performance of Deals Won | 5 years
2016 Average Annual Revenue of Deals Won | $13,531,579
2016 First Fiscal Year Revenue (Deals Won) | $6,765,790
2016 Average Profit Margin (Deals Won) | 34%
Additional Average Monthly Profit | $76,235

(The table above is a tab in the Excel spreadsheet that is embedded in the appendix of this Word document).

The table below is an extract of the NPV cash flows.

Discount rate: 8.00% per monthly period; periods 0 through 12 (all values in dollars; parentheses indicate negative cash flows, '-' means no cash flow in that period):

Personnel Cost (periods 1-12): (16,260) | (78,990) | (133,878) | (82,248) | (73,080) | (73,896) | (65,010) | (68,010) | (30,168) | (40,140) | (29,118) | (41,190)
Other Stuff Cost (periods 1-12): - | (13,646) | (4,282) | (4,337) | (4,438) | (4,508) | (4,592) | (5,332) | (5,433) | (5,545) | (5,668) | (5,803)
Reduced B&P, counts as positive cash flow (periods 4-12): 130,932 per period
Increase in Profit (periods 6-12): 76,235 per period

Net Cash Flow (periods 0-12): - | (16,260) | (92,636) | (138,160) | 44,346 | 53,413 | 128,762 | 137,564 | 133,824 | 171,565 | 161,481 | 172,380 | 160,174
PV of Cash Flows (periods 0-12): - | (15,056) | (79,421) | (109,676) | 32,596 | 36,352 | 81,142 | 80,267 | 72,301 | 85,825 | 74,797 | 73,931 | 63,607
NPV: $753,468.67
IRR (all periods): 35%
IRR (first year): 32%

(The table above is a tab in the Excel spreadsheet that is embedded in the appendix of this Word document).
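
The NPV and IRR figures above come straight from discounting the periodic net cash flows; below is a hedged R sketch of the two calculations, using a short hypothetical cash-flow vector rather than the project's full 20-month stream (the 8% rate is applied per period, as in the table):

# Hedged sketch: NPV and IRR from a vector of periodic net cash flows
npv <- function(rate, cashflows) {
  periods <- seq_along(cashflows) - 1          # period 0, 1, 2, ...
  sum(cashflows / (1 + rate)^periods)
}
irr <- function(cashflows) {
  uniroot(function(r) npv(r, cashflows), c(-0.99, 10))$root   # rate where NPV = 0
}

cf <- c(-100000, 20000, 30000, 40000, 50000)   # hypothetical net cash flows, periods 0-4
npv(0.08, cf)                                  # NPV at an 8% per-period discount rate
irr(cf)                                        # internal rate of return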

Conclusion Summary of lessons learned during the progression of the project. Things we might do differently.

Data analytics has given companies opportunities they previously did not have: to explore data about their products, their customers, their employees, their supply chain, and more, and to discover interesting insights they may not have known before. Our project, Competitive Intelligence Using Data Analytics, revealed the vast amount of data that is easily available and that can provide insights about our competitors, our own company, and the marketplace we all compete in.

With this project, we wanted to discover things about us that make us competitive and things about our competitors that might give us an advantage over them when it comes to winning contracts. What we discovered is that there is gold to be found in data mining: gold in the form of valuable information that we still believe can bring a real positive impact to the financial bottom line.

Through this course, we did discover that there is much more to consider and much more work to do for a successful data analytics project than running a few queries against a massive source of publicly available data.

One of the more insightful experiences was working through the privacy and ethics considerations for this project. Perhaps already known to most data scientists, new information can be gleaned by piecing together single data points about an individual, a company, or an agency, that might result in information that could be considered a violation of privacy laws or be in the gray area of ethical data use.

In all, this was a very useful course in terms of providing a methodology and an overview of how to approach a big data project. We would all agree that we gained practical knowledge that we can apply to many other projects in our data analytics future.

Appendix The embedded Excel Spreadsheet includes several worksheets that are the source of all the tables shown in the Cost and Project Implementation section above.

Worksheets in the Excel file are as follows:

 Project WBS – this is the WBS for the project plan

 Personnel Counts – this is Personnel roles and LOE over time for the project

 Personnel Cost – this is Personnel costs over time for the project

 Other Stuff Counts – this is infrastructure capacities, software, and access to data sources; the counts and quantities of those elements over time for the project

 Other Stuff Costs – this is the cost for those infrastructure capacities, software, and access to data sources; over time for the project

 Est of B&P Savings – this is an estimate of B&P Savings (which is used as a positive cash flow generated by the project in the NPV calculation)

 Est of Profit from more Wins – this is an estimate of additional profits generated because of this project helping to increase the win probability (more Wins).

 Project NPV – this is a calculation of the project’s Net Present Value generated from the several positive and negative cash flows.
