Categorical Variable Consolidation Tables
FlossMole Data Name               Old number of codes  New number of codes
Table 1: Intended Audience        19                   5
Table 2: FOSS Licenses            60                   7
Table 3: Operating Systems        59                   8
Table 4: Programming languages    73                   8
Table 5: SF Project topics        243                  19
Table 6: Project user interfaces  48                   9
Table 7: DB Environment           33                   3
Totals                            535                  59

Table 1: Intended Audience: Consolidated from 19 to 5 categories

Rationale for this consolidation: Categories that had similar characteristics were grouped together. For example, Customer Service, Financial and Insurance, Healthcare Industry, Legal Industry, Manufacturing, Telecommunications Industry, Quality Engineers and Aerospace were grouped together under the new category “Business.” End Users/Desktop and Advanced End Users were grouped together under the new category “End Users.” Developers, Information Technology and System Administrators were grouped together under the new category “Computer Professionals.” Government and Non-Profit Organizations were grouped together under the new category “Government/Non-Profit.” Education, Religion, Science/Research and Other Audience were grouped under the new category “Other.” Categories containing large numbers of projects were generally left as individual categories. Perhaps Religion and Education should have been left as separate categories because they contain a relatively large number of projects. However, since Mike recommended we get the number of categories down to 50, I consolidated them into the “Other” category.

What was done: I created a new table in sf_merged called ‘categ_intend_aud_aug06’. This table is a duplicate of the ‘project_intended_audience01_aug_06’ table with the fields ‘new_code’ and ‘new_description’ added. I filled in the new fields with the new codes and descriptions listed in the table below, using a Python script I (Bob English) wrote called add_categ_intend_aud.py. I added the fields ‘ia1’, ‘ia2’, ‘ia3’, ‘ia4’ and ‘ia5’ to the ‘statistical_analysis_0607’ table (the flat file table).
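The recoding step described above could be sketched along the following lines. This is an illustrative sketch only, not the add_categ_intend_aud.py script itself, which is not reproduced in this document. The names recode, OLD_TO_NEW and NEW_DESCRIPTIONS, the (proj_unixname, old code) row layout and the example project names are all assumptions made for illustration. The old-to-new code mapping is taken from the intended audience table in this document.

```python
# Hypothetical sketch of the recoding step; NOT the original
# add_categ_intend_aud.py script.

# Consolidated codes and descriptions (from Table 1).
NEW_DESCRIPTIONS = {
    1: "End Users",
    2: "Computer professionals",
    3: "Business",
    4: "Other",
    5: "Government/Non-Profit",
}

# Old FlossMole intended-audience code -> consolidated new code (from Table 1).
OLD_TO_NEW = {
    2: 1, 536: 1,
    3: 2, 4: 2, 363: 2,
    359: 3, 361: 3, 362: 3, 364: 3, 365: 3, 368: 3, 537: 3, 599: 3,
    5: 4, 360: 4, 366: 4, 367: 4,
    569: 5, 618: 5,
}


def recode(rows):
    """Return the input rows with new_code and new_description appended.

    Each input row is assumed to be a (proj_unixname, old_code) tuple;
    the real table has more columns than this sketch uses.
    """
    recoded = []
    for proj_unixname, old_code in rows:
        # A KeyError here means an old code is missing from the mapping.
        new_code = OLD_TO_NEW[old_code]
        recoded.append(
            (proj_unixname, old_code, new_code, NEW_DESCRIPTIONS[new_code])
        )
    return recoded


# Example with made-up project names; a project may list several codes.
example_rows = [("projalpha", 2), ("projalpha", 363), ("projbeta", 618)]
for row in recode(example_rows):
    print(row)
```

Keeping the mapping in one dictionary makes it easy to check that all 19 old codes are covered before any table is updated.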
The ‘ia1’ through ‘ia5’ fields correspond to the new codes (1-5) in the intended audience table below. I updated the ia fields with the new codes from the ‘categ_intend_aud_aug06’ table using five SQL queries (one per new code) having the general form:

    UPDATE statistical_analysis_0607 s
    INNER JOIN categ_intend_aud_aug06 c
        ON s.proj_unixname = c.proj_unixname
    SET s.ia5 = 1
    WHERE c.new_code = 5

The ‘ia’ fields are of the Boolean data type and have a value of 1 if the project lists the corresponding new code and 0 if it does not. Some projects list more than one code for intended audience.

NOTE: 6,033 projects that exist in the ‘project_intended_audience01_aug_06’ table do not exist in the ‘statistical_analysis_0607’ table. Some projects are missing from the ‘statistical_analysis_0607’ table because they were deleted from Sourceforge.net or because we don’t have accurate data for them for various reasons. This defect in the data may impact the statistical analysis, and adjustments may need to be made to compensate.

Old code  New code  New description         Old description                   num_projects
2         1         End Users               End Users/Desktop                 35483
3         2         Computer professionals  Developers                        41337
359       3         Business                Customer Service                  1586
360       4         Other                   Education                         6108
361       3         Business                Financial and Insurance Industry  954
362       3         Business                Healthcare Industry               733
363       2         Computer professionals  Information Technology            7408
364       3         Business                Legal Industry                    286
365       3         Business                Manufacturing                     717
366       4         Other                   Religion                          329
367       4         Other                   Science/Research                  6292
368       3         Business                Telecommunications Industry       1810
4         2         Computer professionals  System Administrators             15812
5         4         Other                   Other Audience                    6816
536       1         End Users               Advanced End Users                5658
537       3         Business                Quality Engineers                 638
569       5         Government/Non-Profit   Government                        314
599       3         Business                Aerospace                         129
618       5         Government/Non-Profit   Non-Profit Organizations          532

Table 2: FOSS Licenses used by projects: Consolidated from 60 categories to 7 categories using OSI categories (http://www.opensource.org/licenses/category).
Rationale for this consolidation: The OSI model groups most licenses under the category “Widely used/strong communities.” Using this categorization would essentially test whether projects using widely used licenses have a higher project success rate than projects using specialty or other types of licenses. However, this categorization does not test our essential hypotheses about whether “business friendly” or “reciprocal” licenses produce a higher percentage of successful projects.

What was done: Nothing. I’ll pick up the book on licenses in my mailbox at Holdsworth and consider dividing the licenses into “restrictive” and “non-restrictive.” I’ll also plot licenses against our dependent variable in R and see if that sheds any light.

Old code  New code  New description                 Old description                                      num_projects
14        3         Other/Miscellaneous             OSI-Approved Open Source                             430
15        1         Widely used/strong communities  GNU General Public License (GPL)                     54842
16        1         Widely used/strong communities  GNU Library or Lesser General Public License (LGPL)  9429
17        3         Other/Miscellaneous             Artistic License                                     1317
187       1         Widely used/strong communities  BSD License                                          5968
188       1         Widely used/strong communities  MIT License                                          1616
189       5         Non-reusable                    Python Software Foundation License                   184
190       3         Other/Miscellaneous             Qt Public License (QPL)                              235
191       5         Non-reusable                    IBM Public License                                   98
192       7         Retired                         MITRE Collaborative Virtual Workspace License (CVW)  2
193       5         Non-reusable                    Ricoh Source Code Public License                     8
194       5         Non-reusable                    Python License (CNRI Python License)                 138
195       3         Other/Miscellaneous             zlib/libpng License                                  420
196       3         Other/Miscellaneous             Other/Proprietary License                            1259
197       3         Other/Miscellaneous             Public Domain                                        2134
296       6         Superseded                      Apache Software License                              1077
297       5         Non-reusable                    Vovida Software License 1.0                          5
298       7         Retired                         Sun Industry Standards Source License (SISSL)        44
299       7         Retired                         Intel Open Source License                            24
300       7         Retired                         Jabber Open Source License                           45
301       5         Non-reusable                    Nokia Open Source License                            13
302       5         Non-reusable                    Sleepycat License                                    23
303       5         Non-reusable                    Nethack General Public License                       31
304       6         Superseded                      Mozilla Public License 1.0 (MPL)                     246
305       1         Widely used/strong communities  Mozilla Public License 1.1 (MPL 1.1)                 1114
306       5         Non-reusable                    Apple Public Source License                          64
307       1         Widely used/strong communities  Common Public License                                763
316       2         Special purpose                 Open Group Test Suite License                        18
317       4         Redundant                       X.Net License                                        11
318       5         Non-reusable                    Sun Public License                                   97
319       6         Superseded                      Eiffel Forum License                                 7
320       5         Non-reusable                    W3C License                                          45
321       5         Non-reusable                    Motosoto License                                     1
322       5         Non-reusable                    Zope Public License                                  42
323       4         Redundant                       University of Illinois/NCSA Open Source License      44
324       4         Redundant                       Academic Free License (AFL)                          565
325       4         Redundant                       Attribution Assurance License                        29
388       3         Other/Miscellaneous             Open Software License                                557
389       5         Non-reusable                    Sybase Open Watcom Public License                    1
390       5         Non-reusable                    OCLC Research Public License 2.0                     1
391       5         Non-reusable                    WxWindows Library Licence                            88
392       4         Redundant                       Eiffel Forum License V2.0                            20
393       4         Redundant                       Historical Permission Notice and Disclaimer          25
395       5         Non-reusable                    RealNetworks Public Source License V1.0              2
396       5         Non-reusable                    Reciprocal Public License                            34
397       5         Non-reusable                    Entessa Public License                               2
398       6         Superseded                      Lucent Public License (Plan9)                        2
399       5         Non-reusable                    PHP License                                          320
400       5         Non-reusable                    Frameworx Open License                               6
401       1         Widely used/strong communities  Apache License V2.0                                  1119
402       5         Non-reusable                    CUA Office Public License Version 1.0                2
403       5         Non-reusable                    EU DataGrid Software License                         4
404       4         Redundant                       Fair License                                         31
405       4         Redundant                       Lucent Public License Version 1.02                   4
406       1         Widely used/strong communities  Eclipse Public License                               241
407       2         Special purpose                 NASA Open Source Agreement                           12
628       3         Other/Miscellaneous             Adaptive Public License                              26
629       2         Special purpose                 Educational Community License                        69
630       1         Widely used/strong communities  Common Development and Distribution License          114
631       5         Non-reusable                    Computer Associates Trusted Open Source License      3

Table 3: Operating Systems Projects are developed for: Consolidated from 59 to 8 categories

Rationale for this consolidation: Categories that had similar characteristics were grouped together. For example, the categories All 32-bit MS Windows (95/98/NT/2000/XP), 32-bit MS Windows (NT/2000/XP), WinXP, 32-bit MS Windows (95/98), Win2K, Microsoft Windows Server 2003, WinNT, WinME, Win98 and Microsoft Windows 3.x were all consolidated to the new category “Windows.” Categories containing large numbers of projects were generally left as individual categories. Categories containing small numbers of projects were consolidated to the new category “Other.”

What was done: I created a new table in sf_merged named ‘categ_oper_sys_aug06’. This table is a duplicate of the ‘project_operating_system01_aug_06’ table with the fields ‘new_code’ and ‘new_description’ added. I filled in the new fields with the new codes and descriptions listed in the table below, using a Python script I (Bob English) wrote called add_categ_oper_sys.py. I added the fields ‘os1’, ‘os2’, ‘os3’…, ‘os8’ to the ‘statistical_analysis_0607’ table (the flat file table). These fields correspond to the new codes (1-8) in the operating system table below. I updated the os fields with the new codes from the ‘categ_oper_sys_aug06’ table. The os fields are of the Boolean data type and have a value of 1 if the project lists the corresponding new code and 0 if it does not. Some projects list more than one code.

NOTE: 5,881 projects that exist in the ‘project_operating_system01_aug_06’ table do not exist in the ‘statistical_analysis_0607’ table. Some projects are missing from the ‘statistical_analysis_0607’ table because they were deleted from Sourceforge.net