
CATEGORICAL VARIABLE CONSOLIDATION TABLES

FlossMole Data Name | Old number of codes | New number of codes
Table 1: Intended Audience | 19 | 5
Table 2: FOSS Licenses | 60 | 7
Table 3: Operating Systems | 59 | 8
Table 4: Programming languages | 73 | 8
Table 5: SF Project topics | 243 | 19
Table 6: Project user interfaces | 48 | 9
Table 7: DB Environment | 33 | 3
Totals | 535 | 59
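As a quick consistency check, the totals row can be recomputed from the per-table counts. The snippet below is only a sketch; the dictionary simply restates the numbers from the summary table, and the variable names are illustrative.

```python
# Old and new code counts, copied from the summary table above.
counts = {
    "Intended Audience": (19, 5),
    "FOSS Licenses": (60, 7),
    "Operating Systems": (59, 8),
    "Programming languages": (73, 8),
    "SF Project topics": (243, 19),
    "Project user interfaces": (48, 9),
    "DB Environment": (33, 3),
}

# Sum the old and new counts separately.
old_total = sum(old for old, _ in counts.values())
new_total = sum(new for _, new in counts.values())
print(old_total, new_total)
```

The sums agree with the totals row only when Intended Audience is counted as 5 new categories.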

Table 1: Intended Audience: Consolidated from 19 to 5 categories

Rationale for this consolidation: Categories that had similar characteristics were grouped together. For example, Customer Service, Financial and Insurance Industry, Healthcare Industry, Legal Industry, Manufacturing, Telecommunications Industry, Quality Engineers and Aerospace were grouped together under the new category “Business.” End Users/Desktop and Advanced End Users were grouped together under the new category “End Users.” Developers, Information Technology and System Administrators were grouped together under the new category “Computer Professionals.” Education, Religion, Science/Research and Other Audience were grouped under the new category “Other.” Government and Non-Profit Organizations were grouped together under the new category “Government/Non-Profit.” Categories containing large numbers of projects were generally left as individual categories. Perhaps Religion and Education should have been left as separate categories because they contain a relatively large number of projects. Since Mike recommended we get the number of categories down to 50, I consolidated them into the “Other” category.

What was done: I created a new table in sf_merged called ‘categ_intend_aud_aug06’. This table is a duplicate of the ‘project_intended_audience01_aug_06’ table with the fields ‘new_code’ and ‘new_description’ added. I updated the new fields in the new table with the new codes and descriptions listed in the table below using a Python script I (Bob English) wrote called add_categ_intend_aud.py. I added the fields ‘ia1’, ‘ia2’, ‘ia3’, ‘ia4’ and ‘ia5’ to the ‘statistical_analysis_0607’ table (the flat file table). These fields correspond to the new codes (1-5) in the intended audience table below. I updated the ia fields with the new codes from the ‘categ_intend_aud_aug06’ table using five SQL queries having the general form:

UPDATE statistical_analysis_0607 s INNER JOIN categ_intend_aud_aug06 c ON s.proj_unixname = c.proj_unixname SET s.ia5 = 1 WHERE c.new_code = 5

The ‘ia’ fields are the Boolean data type and have a 1 value if the project lists the corresponding new code and a 0 value if it does not. Some projects list more than one code for intended audience.
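The five flag updates can be sketched as a loop. This is an illustration only, not the queries actually run: it uses Python's built-in sqlite3 module with an in-memory database and made-up project names, and because SQLite does not accept the UPDATE ... INNER JOIN form shown above, it substitutes an equivalent IN subquery. The table and column names follow the text; the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Toy stand-ins for the two tables named in the text (rows are made up).
cur.execute(
    "CREATE TABLE statistical_analysis_0607 ("
    "proj_unixname TEXT PRIMARY KEY, "
    "ia1 INTEGER DEFAULT 0, ia2 INTEGER DEFAULT 0, ia3 INTEGER DEFAULT 0, "
    "ia4 INTEGER DEFAULT 0, ia5 INTEGER DEFAULT 0)"
)
cur.execute(
    "CREATE TABLE categ_intend_aud_aug06 (proj_unixname TEXT, new_code INTEGER)"
)
cur.executemany(
    "INSERT INTO statistical_analysis_0607 (proj_unixname) VALUES (?)",
    [("alpha",), ("beta",), ("gamma",)],
)
cur.executemany(
    "INSERT INTO categ_intend_aud_aug06 VALUES (?, ?)",
    [("alpha", 1), ("alpha", 2), ("beta", 5)],
)

# One UPDATE per new code (1-5): set iaN = 1 for projects listing code N.
for code in range(1, 6):
    column = f"ia{code}"  # built from a fixed range, never from user input
    cur.execute(
        f"UPDATE statistical_analysis_0607 SET {column} = 1 "
        "WHERE proj_unixname IN ("
        "SELECT proj_unixname FROM categ_intend_aud_aug06 WHERE new_code = ?)",
        (code,),
    )
conn.commit()

# Collect the resulting flags per project.
flags = {
    row[0]: row[1:]
    for row in cur.execute(
        "SELECT proj_unixname, ia1, ia2, ia3, ia4, ia5 "
        "FROM statistical_analysis_0607 ORDER BY proj_unixname"
    )
}
print(flags)
```

In this toy data, "alpha" lists two codes and so ends up with two flags set, while "gamma" lists none and keeps all zeros.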

NOTE: 6,033 projects that exist in the project_intended_audience01_aug_06 table do not exist in the ‘statistical_analysis_0607’ table. Some projects are missing from the ‘statistical_analysis_0607’ table because they were deleted from Sourceforge.net or because we don’t have accurate data for them for various reasons. This defect in the data may impact the statistical analysis and adjustments may need to be made to compensate.
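One way to arrive at a count like the one in this note is an anti-join between the two tables. The sketch below is an assumption about how such a check could be written, not a record of what was run; it uses sqlite3 with made-up rows, and the ‘code’ column name is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Toy versions of the source table and the flat file table (rows are made up).
cur.execute(
    "CREATE TABLE project_intended_audience01_aug_06 "
    "(proj_unixname TEXT, code INTEGER)"
)
cur.execute(
    "CREATE TABLE statistical_analysis_0607 (proj_unixname TEXT PRIMARY KEY)"
)
cur.executemany(
    "INSERT INTO project_intended_audience01_aug_06 VALUES (?, ?)",
    [("alpha", 2), ("alpha", 3), ("beta", 360), ("delta", 5)],
)
cur.executemany(
    "INSERT INTO statistical_analysis_0607 VALUES (?)",
    [("alpha",), ("gamma",)],
)

# Anti-join: projects present in the source table but absent from the flat file.
# DISTINCT is needed because a project can list more than one code.
missing = [
    row[0]
    for row in cur.execute(
        "SELECT DISTINCT p.proj_unixname "
        "FROM project_intended_audience01_aug_06 p "
        "LEFT JOIN statistical_analysis_0607 s "
        "ON s.proj_unixname = p.proj_unixname "
        "WHERE s.proj_unixname IS NULL "
        "ORDER BY p.proj_unixname"
    )
]
print(len(missing), missing)
```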

Old code | New code | New description | Old description | num_projects
2 | 1 | End Users | End Users/Desktop | 35483
3 | 2 | Computer professionals | Developers | 41337
359 | 3 | Business | Customer Service | 1586
360 | 4 | Other | Education | 6108
361 | 3 | Business | Financial and Insurance Industry | 954
362 | 3 | Business | Healthcare Industry | 733
363 | 2 | Computer professionals | Information Technology | 7408
364 | 3 | Business | Legal Industry | 286
365 | 3 | Business | Manufacturing | 717
366 | 4 | Other | Religion | 329
367 | 4 | Other | Science/Research | 6292
368 | 3 | Business | Telecommunications Industry | 1810
4 | 2 | Computer professionals | System Administrators | 15812
5 | 4 | Other | Other Audience | 6816
536 | 1 | End Users | Advanced End Users | 5658
537 | 3 | Business | Quality Engineers | 638
569 | 5 | Government/Non-Profit | Government | 314
599 | 3 | Business | Aerospace | 129
618 | 5 | Government/Non-Profit | Non-Profit Organizations | 532

Table 2: FOSS Licenses used by projects: Consolidated from 60 categories to 7 categories using OSI categories (http://www.opensource.org/licenses/category).

Rationale for this consolidation: The OSI model groups most licenses under the category “Widely used/strong communities.” Using this categorization would essentially test whether projects using widely used licenses have a higher project success rate than projects using specialty or other types of licenses. However, this categorization does not test our essential hypotheses about whether “business friendly” or “reciprocal” licenses produce a higher percentage of successful projects.

What was done: Nothing. I’ll pick up the book on licenses that is in my mailbox at Holdsworth and consider dividing the licenses into “restrictive” and “non-restrictive.” I’ll also plot licenses against our dependent variable in R and see if that sheds any light.

Old code | New code | New description | Old description | num_projects
14 | 3 | Other/Miscellaneous | OSI-Approved | 430
15 | 1 | Widely used/strong communities | GNU General Public License (GPL) | 54842
16 | 1 | Widely used/strong communities | GNU Library or Lesser General Public License (LGPL) | 9429
17 | 3 | Other/Miscellaneous | Artistic License | 1317
187 | 1 | Widely used/strong communities | BSD License | 5968
188 | 1 | Widely used/strong communities | MIT License | 1616
189 | 5 | Non-reusable | Python Foundation License | 184
190 | 3 | Other/Miscellaneous | Public License (QPL) | 235
191 | 5 | Non-reusable | IBM Public License | 98
192 | 7 | Retired | MITRE Collaborative Virtual Workspace License (CVW) | 2
193 | 5 | Non-reusable | Ricoh Public License | 8
194 | 5 | Non-reusable | Python License (CNRI Python License) | 138
195 | 3 | Other/Miscellaneous | zlib/libpng License | 420
196 | 3 | Other/Miscellaneous | Other/Proprietary License | 1259
197 | 3 | Other/Miscellaneous | Public Domain | 2134
296 | 6 | Superceded | Apache | 1077
297 | 5 | Non-reusable | Vovida Software License 1.0 | 5
298 | 7 | Retired | Sun Industry Standards Source License (SISSL) | 44
299 | 7 | Retired | Intel Open Source License | 24
300 | 7 | Retired | Jabber Open Source License | 45
301 | 5 | Non-reusable | Nokia Open Source License | 13
302 | 5 | Non-reusable | Sleepycat License | 23
303 | 5 | Non-reusable | Nethack General Public License | 31
304 | 6 | Superceded | Mozilla Public License 1.0 (MPL) | 246
305 | 1 | Widely used/strong communities | Mozilla Public License 1.1 (MPL 1.1) | 1114
306 | 5 | Non-reusable | Apple Public Source License | 64
307 | 1 | Widely used/strong communities | Common Public License | 763
316 | 2 | Special purpose | Open Group Test Suite License | 18
317 | 4 | Redundant | X.Net License | 11
318 | 5 | Non-reusable | Sun Public License | 97
319 | 6 | Superceded | Eiffel Forum License | 7
320 | 5 | Non-reusable | W3C License | 45
321 | 5 | Non-reusable | Motosoto License | 1
322 | 5 | Non-reusable | Zope Public License | 42
323 | 4 | Redundant | University of Illinois/NCSA Open Source License | 44
324 | 4 | Redundant | Academic Free License (AFL) | 565
325 | 4 | Redundant | Attribution Assurance License | 29
388 | 3 | Other/Miscellaneous | Open Software License | 557
389 | 5 | Non-reusable | Sybase Open Watcom Public License | 1
390 | 5 | Non-reusable | OCLC Research Public License 2.0 | 1
391 | 5 | Non-reusable | WxWindows Library Licence | 88
392 | 4 | Redundant | Eiffel Forum License V2.0 | 20
393 | 4 | Redundant | Historical Permission Notice and Disclaimer | 25
395 | 5 | Non-reusable | RealNetworks Public Source License V1.0 | 2
396 | 5 | Non-reusable | Reciprocal Public License | 34
397 | 5 | Non-reusable | Entessa Public License | 2
398 | 6 | Superceded | Lucent Public License (Plan9) | 2
399 | 5 | Non-reusable | PHP License | 320
400 | 5 | Non-reusable | Frameworx Open License | 6
401 | 1 | Widely used/strong communities | Apache License V2.0 | 1119
402 | 5 | Non-reusable | CUA Office Public License Version 1.0 | 2
403 | 5 | Non-reusable | EU DataGrid Software License | 4
404 | 4 | Redundant | Fair License | 31
405 | 4 | Redundant | Lucent Public License Version 1.02 | 4
406 | 1 | Widely used/strong communities | Eclipse Public License | 241
407 | 2 | Special purpose | NASA Open Source Agreement | 12
628 | 3 | Other/Miscellaneous | Adaptive Public License | 26
629 | 2 | Special purpose | Educational Community License | 69
630 | 1 | Widely used/strong communities | Common Development and Distribution License | 114
631 | 5 | Non-reusable | Computer Associates Trusted Open Source License | 3

Table 3: Operating Systems Projects are developed for: Consolidated 59 categories to 8 categories

Rationale for this consolidation: Categories that had similar characteristics were grouped together. For example, the categories All 32-bit MS Windows (95/98/NT/2000/XP), 32-bit MS Windows (NT/2000/XP), WinXP, 32-bit MS Windows (95/98), Win2K, Windows Server 2003, WinNT, WinME, Win98 and 3.x were all consolidated to the new category “Windows.” Categories containing large numbers of projects were generally left as individual categories. Categories containing small numbers of projects were consolidated to the new category “Other.”

What was done: I created a new table in sf_merged named ‘categ_oper_sys_aug06’. This table is a duplicate of the ‘project_operating_system01_aug_06’ table with the fields ‘new_code’ and ‘new_description’ added. I updated the new fields in the new table with the new codes and descriptions listed in the table below using a Python script I (Bob English) wrote called add_categ_oper_sys.py. I added the fields ‘os1’, ‘os2’, ‘os3’, …, ‘os8’ to the ‘statistical_analysis_0607’ table (the flat file table). These fields correspond to the new codes (1-8) in the table below. I updated the os fields with the new codes from the ‘categ_oper_sys_aug06’ table. The new fields are the Boolean data type and have a 1 value if the project lists the corresponding new code and a 0 value if it does not. Some projects list more than one code.
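The step that fills ‘new_code’ and ‘new_description’ can be sketched as a lookup from old codes to new ones. The actual add_categ_oper_sys.py script is not reproduced in this document, so the code below is only a guess at its general shape: it uses sqlite3 with an in-memory table, a handful of mappings copied from the operating system table, and made-up project rows. The ‘proj_unixname’ and ‘code’ column names are assumptions.

```python
import sqlite3

# Old code -> (new code, new description); a few rows copied from the
# operating system table in this section.
NEW_CODES = {
    200: (1, "POSIX"),
    201: (3, "Linux"),
    419: (4, "Windows"),
    420: (4, "Windows"),
    203: (6, "BSD"),
    215: (8, "Other"),
}

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical schema: only 'new_code' and 'new_description' are named in the text.
cur.execute(
    "CREATE TABLE categ_oper_sys_aug06 ("
    "proj_unixname TEXT, code INTEGER, new_code INTEGER, new_description TEXT)"
)
cur.executemany(
    "INSERT INTO categ_oper_sys_aug06 (proj_unixname, code) VALUES (?, ?)",
    [("alpha", 201), ("alpha", 419), ("beta", 420), ("gamma", 215)],
)

# Apply the mapping: one parameterized UPDATE per old code.
cur.executemany(
    "UPDATE categ_oper_sys_aug06 "
    "SET new_code = ?, new_description = ? WHERE code = ?",
    [(new_code, desc, old) for old, (new_code, desc) in NEW_CODES.items()],
)
conn.commit()

rows = cur.execute(
    "SELECT proj_unixname, code, new_code, new_description "
    "FROM categ_oper_sys_aug06 ORDER BY proj_unixname, code"
).fetchall()
for row in rows:
    print(row)
```

Keeping the mapping in one dictionary makes it straightforward to compare against the consolidation table by eye before the flat file flags are derived from it.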

NOTE: 5,881 projects that exist in the project_operating_system01_aug_06 table do not exist in the ‘statistical_analysis_0607’ table. Some projects are missing from the ‘statistical_analysis_0607’ table because they were deleted from Sourceforge.net or because we don’t have accurate data for them for various reasons. This defect in the data may impact the statistical analysis and adjustments may need to be made to compensate.

Old code | New code | New description | Old description | num_projects
200 | 1 | POSIX | All POSIX (/BSD/-like OSes) | 30244
235 | 2 | Independent | OS Independent (Written in an interpreted language) | 28388
201 | 3 | Linux | Linux | 24049
435 | 4 | Windows | All 32-bit MS Windows (95/98/NT/2000/XP) | 21654
219 | 4 | Windows | 32-bit MS Windows (NT/2000/XP) | 7986
419 | 4 | Windows | WinXP | 6493
218 | 4 | Windows | 32-bit MS Windows (95/98) | 5985
420 | 4 | Windows | Win2K | 5432
309 | 5 | Mac | OS X | 3918
436 | 2 | Independent | OS Portable (Source code to work with many OS platforms) | 3490
202 | 6 | BSD | All BSD Platforms (FreeBSD/NetBSD/OpenBSD/Apple Mac OS X) | 3451
207 | 7 | Unix-like | Solaris | 2196
203 | 6 | BSD | FreeBSD | 1645
236 | 8 | Other | Other Operating Systems | 950
212 | 8 | Other | Other | 857
215 | 8 | Other | MS-DOS | 619
223 | 8 | Other | PalmOS | 554
224 | 8 | Other | BeOS | 407
222 | 8 | Other | WinCE | 403
427 | 8 | Other (Emulator) | Cygwin (MS Windows) | 384
311 | 5 | Mac | Apple Mac OS Classic | 377
448 | 4 | Windows | Microsoft Windows Server 2003 | 377
205 | 6 | BSD | OpenBSD | 371
209 | 7 | Unix-like | HP-UX | 294
315 | 8 | Other | Handheld/Embedded Operating Systems | 291
204 | 6 | BSD | NetBSD | 259
210 | 7 | Unix-like | IBM AIX | 248
211 | 7 | Unix-like | SGI IRIX | 248
423 | 4 | Windows | WinNT | 199
445 | 8 | Other | MinGW/MSYS (MS Windows) | 176
220 | 8 | Other | IBM OS/2 | 161
438 | 8 | Other | Project is an Operating System Distribution | 150
424 | 4 | Windows | WinME | 133
437 | 8 | Other | Project is an Operating System Kernel | 132
422 | 4 | Windows | Win98 | 120
444 | 8 | Other | SymbianOS | 98
425 | 4 | Windows | Win98 OSR2 | 92
206 | 6 | BSD | BSD/OS | 83
439 | 8 | Other | Project is OS Distribution-Specific | 83
240 | 8 | Other | GNU Hurd | 80
430 | 8 | Other (Emulator) | WINE | 74
434 | 8 | Other | AmigaOS | 72
208 | 8 | Other | SCO | 57
421 | 4 | Windows | Win95 | 53
634 | 8 | Other | Console-based Platforms | 47
217 | 4 | Windows | Microsoft Windows 3.x | 47
429 | 8 | Other (Emulator) | Fink (Mac OS X) | 46
440 | 8 | Other (fork of linux kernel for embedded systems) | uClinux | 44
447 | 8 | Other | MorphOS | 36
635 | 8 | Other | Microsoft Xbox | 23
442 | 8 | Other (unix-like os for embedded systems) | QNX | 12
428 | 8 | Other (Emulator) | DOSEMU | 11
431 | 8 | Other (Emulator) | EMX (OS/2 and MS-DOS) | 10
446 | 8 | Other | OpenVMS | 10
637 | 8 | Other | Sega Dreamcast | 9
441 | 8 | Other | eCos | 7
443 | 7 | Unix-like | VxWorks | 7
636 | 8 | Other | Sony Playstation 2 | 6
418 | 8 | Other | Modern (Vendor-Supported) Desktop Operating Systems | 1

Table 4: Programming languages for projects: Consolidated 73 codes to 8 codes

Rationale for this consolidation: Categories with similar characteristics were grouped together. For example, C#, Visual Basic, Visual Basic .NET, ASP, ASP.NET, VBScript and Visual FoxPro were all grouped together under the new category "MS," since they are all Microsoft products. Categories containing large numbers of projects were generally left as individual categories. Many obscure programming languages with few projects were placed in the new category “Other.”

What was done: I created a new table in sf_merged named ‘categ_prog_lang_aug06’. This table is a duplicate of the ‘project_programming_language01_aug_06’ table with the fields ‘new_code’ and ‘new_description’ added. I updated the new fields in the new table with the new codes and descriptions listed in the table below using a Python script I (Bob English) wrote called add_categ_oper_sys.py. I added the fields ‘pl1’, ‘pl2’, ‘pl3’, …, ‘pl8’ to the ‘statistical_analysis_0607’ table (the flat file table). These fields correspond to the new codes (1-8) in the programming language table below. I updated the pl fields with the new codes from the ‘categ_prog_lang_aug06’ table. The new fields are the Boolean data type and have a 1 value if the project lists the corresponding new code and a 0 value if it does not. Some projects list more than one code.

NOTE: 6,149 projects that exist in the project_programming_language01_aug_06 table do not exist in the ‘statistical_analysis_0607’ table. Some projects are missing from the ‘statistical_analysis_0607’ table because they were deleted from Sourceforge.net or because we don’t have accurate data for them for various reasons. This defect in the data may impact the statistical analysis and adjustments may need to be made to compensate.

Old code | New code | New description | Old description | num_projects
198 | 1 | Java | Java | 19838
165 | 2 | C | C++ | 18500
164 | 2 | C | C | 17039
183 | 3 | PHP | PHP | 14107
176 | 4 | Perl | Perl | 6498
178 | 5 | Python | Python | 5362
271 | 6 | MS | C# | 4051
280 | 1 | Java | JavaScript | 3694
186 | 6 | MS | Visual Basic | 2272
265 | 7 | Other | Delphi/Kylix | 2103
185 | 7 | Other | Unix Shell | 2078
162 | 8 | Assembly | Assembly | 1653
254 | 7 | Other | PL/SQL | 1187
182 | 7 | Tcl | Tcl | 944
174 | 2 | C | Objective C | 859
293 | 7 | Ruby | Ruby | 584
453 | 6 | MS | Visual Basic .NET | 578
184 | 6 | MS | ASP | 539
572 | 1 | Java | JSP | 439
175 | 7 | Other | Pascal | 401
560 | 7 | Other | XSL (XSLT/XPath/XSL-FO) | 393
170 | 7 | Other | Lisp | 326
258 | 7 | Other | Object Pascal | 322
589 | 6 | MS | ASP.NET | 304
242 | 7 | Other | Scheme | 212
169 | 7 | Other | Fortran | 193
450 | 7 | Other | Lua | 183
584 | 7 | Other | ActionScript | 176
267 | 7 | Other | Zope | 142
262 | 7 | Other | Cold Fusion | 136
539 | 7 | Other | BASIC | 125
172 | 7 | Other | Standard ML | 124
177 | 7 | Other | Prolog | 116
163 | 7 | Other | Ada | 108
548 | 6 | MS | VBScript | 87
454 | 7 | Other | OCaml (Objective Caml) | 84
166 | 7 | Other | Eiffel | 72
626 | 7 | Other | MATLAB | 67
181 | 7 | Other | Smalltalk | 61
168 | 7 | Other | Forth | 60
547 | 7 | Other | AppleScript | 53
451 | 7 | Other | Haskell | 53
544 | 7 | Other | Yacc | 47
179 | 7 | Other | Rexx | 41
538 | 7 | Other | AWK | 37
552 | 7 | Other | D | 37
553 | 7 | Other | REALbasic | 37
540 | 7 | Other | Common Lisp | 37
264 | 7 | Other | Erlang | 32
261 | 7 | Other | XBasic | 32
255 | 7 | Other | PROGRESS | 30
551 | 7 | Other | VHDL/Verilog | 27
542 | 7 | Other | Emacs-Lisp | 27
624 | 7 | Other | IDL | 26
598 | 7 | Other | AspectJ | 24
573 | 7 | Other | S/R | 24
263 | 7 | Other | Euphoria | 18
543 | 7 | Other | Groovy | 18
273 | 7 | Other | Pike | 16
171 | 7 | Other | Logo | 16
608 | 7 | Other | MUMPS | 13
161 | 7 | Other | APL | 13
281 | 7 | Other | REBOL | 13
545 | 7 | Other | LabVIEW | 12
452 | 6 | MS | Visual FoxPro | 11
549 | 7 | Other | LPC | 9
173 | 7 | Other | Modula | 8
632 | 7 | Other | COBOL | 8
550 | 7 | Other | Oberon | 5
625 | 7 | Other | Simulink | 4
541 | 7 | Other | Dylan | 3
167 | 7 | Other | Euler | 1
180 | 7 | Other | Simula | 1

Table 5: SF Project topics: Consolidated 243 categories into 23 categories used by the FSF http://directory.fsf.org/ (note: the consolidation into FSF categories is strictly my best guess. I did not try to discern how FSF defines each of their categories)

Rationale for this consolidation: SF categorizes the 243 subcategories listed in the table above into 23 more general categories. We simply used the more general categorization.

What was done: I created a new table in sf_merged named ‘categ_topic_aug06’. This table is a duplicate of the ‘project_topic01_aug_06’ table with the fields ‘new_code’ and ‘new_description’ added. I updated the new fields in the new table with the new codes and descriptions listed in the second table below (New Codes Using Sourceforge.net Higher Level Topic Categories) using a Python script I (Bob English) wrote called add_categ_topic.py. I added the fields ‘t1’, ‘t2’, …, ‘t19’ to the ‘statistical_analysis_0607’ table (the flat file table). These fields correspond to the new codes (1-19) in the table below. I updated the t fields with the new codes from the ‘categ_topic_aug06’ table. The new fields are the Boolean data type and have a 1 value if the project lists the corresponding new code and a 0 value if it does not. Some projects list more than one code.

NOTE: 6,255 projects that exist in the project_topic_aug_06 table do not exist in the ‘statistical_analysis_0607’ table. Some projects are missing from the ‘statistical_analysis_0607’ table because they were deleted from Sourceforge.net or because we don’t have accurate data for them for various reasons. This defect in the data may impact the statistical analysis and adjustments may need to be made to compensate.

(note: The coding system highlighted in red was not used, but remains in this document for informational purposes)

New FSF Codes:
1 Audio
2 Business and productivity
3 Database
4 Education
5 Email
6 Games
7 Graphics
8 Hobbies
9 Interface
10 Internet applications
11 Live communications
12 Localization
13 Mathematics
14 Printing
15 Science
16 Security
17 Software development
18 Software libraries
19 System administration
20 Text creation and manipulation
21 Video
22 Web authoring
23 Other (not an FSF category)

New Codes Using FSF Categories:

Old code | New code | New description | Old description | num_projects
45 | 17 | Software development | Software Development | 6621
92 | 10 | Internet applications | Dynamic Content | 6500
243 | 22 | Web authoring | Site Management | 4114
80 | 6 | Games | Games/Entertainment | 3104
87 | 10 | Internet applications | Internet | 3043
253 | 19 | System administration | Systems Administration | 2627
66 | 3 | Database | Database | 2487
20 | 11 | Live communications | Communications | 2483
90 | 10 | Internet applications | WWW/HTTP | 2296
68 | 9 | Interface | Front-Ends | 2264
84 | 8 | Hobbies | Role-Playing | 2261
71 | 4 | Education | Education | 2165
234 | 23 | Other | Other/Nonlisted Topic | 2085
150 | 19 | System administration | Networking | 1944
259 | 17 | Software development | Code Generators | 1885
129 | 2 | Business and productivity | Office/Business | 1850
46 | 17 | Software development | Build Tools | 1842
43 | 16 | Security | Security | 1801
22 | 11 | Live communications | Chat | 1767
606 | 17 | Software development | Frameworks | 1684
251 | 10 | Internet applications | File Sharing | 1545
152 | 19 | System administration | Monitoring | 1458
133 | 15 | Science | Artificial Intelligence | 1426
98 | 13 | Mathematics | Mathematics | 1401
55 | 9 | Interface | Desktop Environment | 1384
97 | 15 | Science | Scientific/Engineering | 1313
96 | 17 | Software development | CGI Tools/Libraries | 1310
135 | 4 | Education | Visualization | 1308
24 | 11 | Live communications | Internet Relay Chat | 1304
85 | 4 | Education | Simulation | 1269
49 | 17 | Software development | Interpreters | 1201
113 | 1 | Audio | Sound/Audio | 1179
100 | 7 | Graphics | Graphics | 1156
48 | 17 | Software development | Compilers | 1151
86 | 6 | Games | Multi-User Dungeons (MUD) | 1144
93 | 20 | Text creation and manipulation | Indexing/Search | 1117
28 | 5 | Email | Email | 1068
110 | 7 | Graphics | 3D Rendering | 1057
83 | 23 | Other | Turn Based Strategy | 1027
287 | 6 | Games | Board Games | 1016
250 | 10 | Internet applications | HTTP Servers | 921
95 | 11 | Live communications | Message Boards | 888
308 | 10 | Internet applications | Distributed Computing | 872
288 | 6 | Games | Side-Scrolling/Arcade Games | 871
82 | 6 | Games | First Person Shooters | 851
130 | 2 | Business and productivity | Scheduling | 823
44 | 16 | Security | Cryptography | 807
136 | 19 | System administration | System | 783
67 | 3 | Database | Database Engines/Servers | 778
21 | 11 | Live communications | BBS | 768
148 | 19 | System administration | Logging | 755
112 | 9 | Interface | Viewers | 753
385 | 15 | Science | Information Analysis | 749
81 | 2 | Business and productivity | Real Time Strategy | 749
147 | 19 | System administration | Installation/Setup | 744
252 | 15 | Science | Bio-Informatics | 734
561 | 9 | Interface | User Interfaces | 721
285 | 20 | Text creation and manipulation | Text Processing | 717
257 | 17 | Software development | Software Distribution | 706
142 | 19 | System administration | Filesystems | 671
58 | 9 | Interface | Gnome | 660
143 | 17 | Software development | Linux | 655
91 | 10 | Internet applications | Browsers | 651
268 | 6 | Games | Puzzle Games | 649
65 | 17 | Software development | Integrated Development Environments (IDE) | 633
146 | 19 | System administration | Hardware | 623
123 | 1 | Audio | MP3 | 619
247 | 1 | Audio | Telephony | 616
111 | 2 | Business and productivity | Presentation | 611
72 | 4 | Education | Computer Aided Instruction (CAI) | 606
27 | 2 | Business and productivity | Conferencing | 599
144 | 17 | Software development | Operating System Kernels | 596
125 | 21 | Video | Video | 594
292 | 17 | Software development | Hardware Drivers | 593
99 | 21 | Video | Multimedia | 592
63 | 20 | Text creation and manipulation | Text Editors | 590
47 | 17 | Software development | Debuggers | 574
122 | 21 | Video | Players | 560
89 | 10 | Internet applications | File Transfer Protocol (FTP) | 552
151 | 19 | System administration | Firewalls | 536
76 | 2 | Business and productivity | Accounting | 527
562 | 17 | Software development | Object Oriented | 523
74 | 19 | System administration | Emulators | 523
109 | 7 | Graphics | 3D Modeling | 516
137 | 19 | System administration | Backup | 511
31 | 5 | Email | Email Clients (MUA) | 510
19 | 19 | System administration | Archiving | 502
272 | 9 | Interface | Human Machine Interfaces | 493
559 | 10 | Internet applications | XML | 491
57 | 9 | Interface | K Desktop Environment (KDE) | 491
75 | 2 | Business and productivity | Financial | 467
576 | 2 | Business and productivity | Enterprise | 464
245 | 13 | Mathematics | Log Analysis | 462
29 | 18 | Software libraries | Filters | 448
387 | 15 | Science | Physics | 447
600 | 4 | Education | Simulations | 445
154 | 14 | Printing | Printing | 436
141 | 19 | System administration | Clustering | 432
575 | 17 | Software development | Testing | 432
554 | 3 | Database | Data Formats | 395
266 | 15 | Science | Medical Science Apps. | 391
601 | 19 | System administration | File Management | 387
73 | 17 | Software development | Testing | 382
294 | 23 | Other | System Shells | 372
383 | 2 | Business and productivity | GIS | 358
105 | 7 | Graphics | Graphics Conversion | 356
52 | 17 | Software development | Version Control | 352
79 | 2 | Business and productivity | Point-Of-Sale | 346
131 | 2 | Business and productivity | Office Suites | 341
50 | 2 | Business and productivity | Object Brokering | 319
312 | 17 | Software development | Symmetric Multi-processing | 316
246 | 2 | Business and productivity | Electronic Design Automation (EDA) | 314
128 | 9 | Interface | Display | 305
620 | 13 | Mathematics | Algorithms | 292
120 | 20 | Text creation and manipulation | Editors | 272
607 | 2 | Business and productivity | Project Management | 271
39 | 5 | Email | Usenet News | 268
249 | 1 | Audio | Sound Synthesis | 263
149 | 10 | Internet applications | Name Service (DNS) | 263
32 | 5 | Email | Mail Transport Agents | 262
132 | 4 | Education | Religion and Philosophy | 258
565 | 2 | Business and productivity | Quality Assurance | 256
40 | 10 | Internet applications | Internet Phone | 255
127 | 2 | Business and productivity | Conversion | 252
119 | 2 | Business and productivity | Conversion | 251
26 | 11 | Live communications | AOL Instant Messenger | 249
126 | 21 | Video | Video Capture | 244
248 | 1 | Audio | MIDI | 243
114 | 15 | Science | Analysis | 242
134 | 15 | Science | Astronomy | 237
69 | 20 | Text creation and manipulation | Documentation | 236
56 | 9 | Interface | Window Managers | 232
70 | 20 | | Word Processors | 229
582 | 2 | | Design | 226
564 | 20 | | Documentation | 225
34 | 5 | | POP3 | 222
106 | 20 | | Editors | 222
139 | 17 | | Boot | 217
42 | 23 | Other | Compression | 215
282 | 15 | Science | Sociology | 213
563 | 15 | Science | Modeling | 211
41 | 2 | | Packaging | 209
384 | 15 | | Chemistry | 208
23 | 11 | | ICQ | 208
138 | 17 | | Benchmark | 207
579 | 2 | | CRM | 203
115 | 1 | | Capture/Recording | 202
556 | 10 | | HTML/XHTML | 202
124 | 1 | | Speech | 198
386 | 9 | | Interface Engine/Protocol Translator | 194
289 | 19 | | Authentication/Directory | 194
53 | 17 | | CVS | 190
291 | 19 | | LDAP | 190
77 | 2 | | Investment | 190
156 | 19 | | Terminals | 187
107 | 7 | | Vector-Based | 181
588 | 2 | | To-Do Lists | 179
577 | 2 | | ERP | 173
30 | 5 | | Mailing List Servers | 170
581 | 18 | Software libraries | Library | 170
590 | 10 | | Streaming | 165
270 | 10 | | WAP | 162
587 | 2 | | Time Tracking | 159
602 | 15 | | Robotics | 157
38 | 8 | | Ham Radio | 155
597 | 6 | | Card Games | 154
103 | 7 | | Digital Camera | 151
158 | 19 | | Terminal Emulators/X Terminals | 144
585 | 2 | | Calendar | 143
118 | 1 | | CD Ripping | 141
566 | 19 | | Wireless | 140
159 | 10 | | Telnet | 138
610 | 2 | | Virtual Machines | 135
591 | 15 | | Intelligent Agents | 129
244 | 22 | | Link Checking | 125
108 | 7 | | Raster-Based | 124
64 | 17 | | Emacs | 123
35 | 5 | | IMAP | 121
615 | 10 | | RSS | 112
286 | 10 | | Gnutella | 111
121 | 1 | | Mixers | 109
570 | 23 | | CASE | 108
567 | 15 | | Earth Sciences | 105
408 | 12 | | I18N (Internationalization) | 104
627 | 2 | | Search | 103
94 | 10 | | Page Counters | 102
619 | 17 | | Usability | 101
622 | 10 | | BitTorrent | 98
284 | 8 | | Genealogy | 98
116 | 1 | | CD Audio | 98
603 | 6 | | Profiling | 96
51 | 17 | | CORBA | 93
574 | 11 | | MSN Messenger | 89
101 | 9 | | Capture | 88
157 | 17 | | Serial | 88
78 | 2 | | Spreadsheet | 87
409 | 12 | | L10N (Localization) | 86
609 | 15 | | Molecular Science | 84
613 | 10 | | SOAP | 83
33 | 5 | | Post-Office | 81
36 | 2 | | Fax | 80
633 | 6 | | Console-based Games | 76
256 | 20 | | Non-Linear Editor | 74
580 | 3 | | Data Warehousing | 72
117 | 1 | | CD Playing | 70
593 | 17 | | Cross Compilers | 68
616 | 10 | | XML-RPC | 64
140 | 17 | | Init | 64
62 | 9 | | Screen Savers | 64
241 | 10 | | Napster | 60
37 | 10 | | FIDO | 58
586 | 2 | | Resource Booking | 57
104 | 9 | | Screen Capture | 56
145 | 23 | | BSD | 55
102 | 2 | | Scanners | 54
621 | 15 | | Genetic Algorithms | 52
558 | 20 | | TeX/LaTeX | 52
283 | 4 | | History | 49
155 | 19 | | Hardware Watchdog | 45
153 | 19 | | Power (UPS) | 43
568 | 15 | | Ecosystem Sciences | 38
623 | 2 | | Realtime Processing | 37
578 | 2 | | OLAP | 29
25 | 19 | | Unix Talk | 27
54 | 17 | | RCS | 23
88 | 10 | | Finger | 23
59 | 4 | | Enlightenment | 23
61 | 4 | | Themes | 20
595 | 7 | | Special Effects | 18
596 | 1 | | Codec | 18
594 | 21 | | Still Capture | 16
605 | 4 | | MARC and Book/Library Metadata | 15
571 | 23 | | New Age | 14
555 | 20 | | DocBook | 13
614 | 10 | | NNTP | 11
290 | 19 | | NIS | 8
592 | 19 | | Log Rotation | 7
557 | 20 | | SGML | 7
239 | 23 | | GNU Hurd | 6
313 | 19 | | Mainframes | 6
604 | 10 | | OPAC | 6
60 | 4 | | Themes | 6
260 | 17 | | SCCS | 5

The table below shows the coding system used to map new codes to old codes in the statistical_analysis_0607 table.

New Codes Using Sourceforge.net Higher Level Topic Categories

Old code | New code | New description | Old description | num_projects
45 | 16 | Software Development | Software Development | 6621
92 | 7 | Internet | Dynamic Content | 6500
243 | 7 | | Site Management | 4114
80 | 6 | Games/Entertainment | Games/Entertainment | 3104
87 | 7 | | Internet | 3043
253 | 17 | System | Systems Administration | 2627
66 | 2 | Database | Database | 2487
20 | 1 | Communications | Communications | 2483
90 | 7 | | WWW/HTTP | 2296
68 | 2 | | Front-Ends | 2264
84 | 6 | | Role-Playing | 2261
71 | 4 | Education | Education | 2165
234 | 10 | Other/Nonlisted Topic | Other/Nonlisted Topic | 2085
150 | 17 | | Networking | 1944
259 | 16 | | Code Generators | 1885
129 | 9 | Office/Business | Office/Business | 1850
46 | 16 | | Build Tools | 1842
43 | 14 | Security | Security | 1801
22 | 1 | | Chat | 1767
606 | 16 | | Frameworks | 1684
251 | 1 | | File Sharing | 1545
152 | 17 | | Monitoring | 1458
133 | 13 | Scientific/Engineering | Artificial Intelligence | 1426
98 | 13 | | Mathematics | 1401
55 | 3 | Desktop Environment | Desktop Environment | 1384
97 | 13 | | Scientific/Engineering | 1313
96 | 7 | | CGI Tools/Libraries | 1310
135 | 13 | | Visualization | 1308
24 | 1 | | Internet Relay Chat | 1304
85 | 6 | | Simulation | 1269
49 | 16 | | Interpreters | 1201
113 | 8 | Multimedia | Sound/Audio | 1179
100 | 8 | | Graphics | 1156
48 | 16 | | Compilers | 1151
86 | 6 | | Multi-User Dungeons (MUD) | 1144
93 | 7 | | Indexing/Search | 1117
28 | 1 | | Email | 1068
110 | 8 | | 3D Rendering | 1057
83 | 6 | | Turn Based Strategy | 1027
287 | 6 | | Board Games | 1016
250 | 7 | | HTTP Servers | 921
95 | 7 | | Message Boards | 888
308 | 17 | | Distributed Computing | 872
288 | 6 | | Side-Scrolling/Arcade Games | 871
82 | 6 | | First Person Shooters | 851
130 | 9 | | Scheduling | 823
44 | 14 | | Cryptography | 807
136 | 17 | | System | 783
67 | 2 | | Database Engines/Servers | 778
21 | 1 | | BBS | 768
148 | 17 | | Logging | 755
112 | 8 | | Viewers | 753
385 | 13 | | Information Analysis | 749
81 | 6 | | Real Time Strategy | 749
147 | 17 | | Installation/Setup | 744
252 | 13 | | Bio-Informatics | 734
561 | 16 | | User Interfaces | 721
285 | 19 | Text Editors | Text Processing | 717
257 | 17 | | Software Distribution | 706
142 | 17 | | Filesystems | 671
58 | 3 | | Gnome | 660
143 | 17 | | Linux | 655
91 | 7 | | Browsers | 651
268 | 6 | | Puzzle Games | 649
65 | 16 | | Integrated Development Environments (IDE) | 633
146 | 17 | | Hardware | 623
123 | 8 | | MP3 | 619
247 | 1 | | Telephony | 616
111 | 8 | | Presentation | 611
72 | 4 | | Computer Aided Instruction (CAI) | 606
27 | 1 | | Conferencing | 599
144 | 17 | | Operating System Kernels | 596
125 | 8 | | Video | 594
292 | 17 | | Hardware Drivers | 593
99 | 8 | | Multimedia | 592
63 | 19 | | Text Editors | 590
47 | 16 | | Debuggers | 574
122 | 8 | | Players | 560
89 | 7 | | File Transfer Protocol (FTP) | 552
151 | 17 | | Firewalls | 536
76 | 9 | | Accounting | 527
562 | 16 | | Object Oriented | 523
74 | 17 | | Emulators | 523
109 | 8 | | 3D Modeling | 516
137 | 17 | | Backup | 511
31 | 1 | | Email Clients (MUA) | 510
19 | 17 | | Archiving | 502
272 | 13 | | Human Machine Interfaces | 493
559 | 5 | Formats and Protocols | XML | 491
57 | 3 | | K Desktop Environment (KDE) | 491
75 | 9 | | Financial | 467
576 | 9 | | Enterprise | 464
245 | 7 | | Log Analysis | 462
29 | 1 | | Filters | 448
387 | 13 | | Physics | 447
600 | 13 | | Simulations | 445
154 | 11 | Printing | Printing | 436
141 | 17 | | Clustering | 432
575 | 4 | | Testing | 432
554 | 5 | | Data Formats | 395
266 | 13 | | Medical Science Apps. | 391
601 | 17 | | File Management | 387
73 | 4 | | Testing | 382
294 | 17 | | System Shells | 372
383 | 13 | | GIS | 358
105 | 8 | | Graphics Conversion | 356
52 | 16 | | Version Control | 352
79 | 9 | | Point-Of-Sale | 346
131 | 9 | | Office Suites | 341
50 | 16 | | Object Brokering | 319
312 | 17 | | Symmetric Multi-processing | 316
246 | 13 | | Electronic Design Automation (EDA) | 314
128 | 8 | | Display | 305
620 | 16 | | Algorithms | 292
120 | 8 | | Editors | 272
607 | 9 | | Project Management | 271
39 | 1 | | Usenet News | 268
249 | 8 | | Sound Synthesis | 263
149 | 7 | | Name Service (DNS) | 263
32 | 1 | | Mail Transport Agents | 262
132 | 12 | Religion and Philosophy | Religion and Philosophy | 258
565 | 16 | | Quality Assurance | 256
40 | 1 | | Internet Phone | 255
127 | 8 | | Conversion | 252
119 | 8 | | Conversion | 251
26 | 1 | | AOL Instant Messenger | 249
126 | 8 | | Video Capture | 244
248 | 8 | | MIDI | 243
114 | 13 | | Analysis | 242
134 | 13 | | Astronomy | 237
69 | 16 | | Documentation | 236
56 | 3 | | Window Managers | 232
70 | 19 | | Word Processors | 229
582 | 16 | | Design | 226
564 | 16 | | Documentation | 225
34 | 1 | | POP3 | 222
106 | 8 | | Editors | 222
139 | 17 | | Boot | 217
42 | 17 | | Compression | 215
282 | 15 | Sociology | Sociology | 213
563 | 16 | | Modeling | 211
41 | 17 | | Packaging | 209
384 | 13 | | Chemistry | 208
23 | 1 | | ICQ | 208
138 | 17 | | Benchmark | 207
579 | 9 | | CRM | 203
115 | 8 | | Capture/Recording | 202
556 | 5 | | HTML/XHTML | 202
124 | 8 | | Speech | 198
386 | 13 | | Interface Engine/Protocol Translator | 194
289 | 17 | | Authentication/Directory | 194
53 | 16 | | CVS | 190
291 | 17 | | LDAP | 190
77 | 9 | | Investment | 190
156 | 18 | Terminals | Terminals | 187
107 | 8 | | Vector-Based | 181
588 | 9 | | To-Do Lists | 179
577 | 9 | | ERP | 173
30 | 1 | | Mailing List Servers | 170
581 | 4 | | Library | 170
590 | 1 | | Streaming | 165
270 | 7 | | WAP | 162
587 | 9 | | Time Tracking | 159
602 | 13 | | Robotics | 157
38 | 1 | | Ham Radio | 155
597 | 6 | | Card Games | 154
103 | 8 | | Digital Camera | 151
158 | 18 | | Terminal Emulators/X Terminals | 144
585 | 9 | | Calendar | 143
118 | 8 | | CD Ripping | 141
566 | 17 | | Wireless | 140
159 | 18 | | Telnet | 138
610 | 16 | | Virtual Machines | 135
591 | 13 | | Intelligent Agents | 129
244 | 7 | | Link Checking | 125
108 | 8 | | Raster-Based | 124
64 | 19 | | Emacs | 123
35 | 1 | | IMAP | 121
615 | 5 | | RSS | 112
286 | 1 | | Gnutella | 111
121 | 8 | | Mixers | 109
570 | 16 | | CASE | 108
567 | 13 | | Earth Sciences | 105
408 | 16 | | I18N (Internationalization) | 104
627 | 17 | | Search | 103
94 | 7 | | Page Counters | 102
619 | 16 | | Usability | 101
622 | 1 | | BitTorrent | 98
284 | 15 | | Genealogy | 98
116 | 8 | | CD Audio | 98
603 | 16 | | Profiling | 96
51 | 16 | | CORBA | 93
574 | 1 | | MSN Messenger | 89
101 | 8 | | Capture | 88
157 | 18 | | Serial | 88
78 | 9 | | Spreadsheet | 87
409 | 16 | | L10N (Localization) | 86
609 | 13 | | Molecular Science | 84
613 | 5 | | SOAP | 83
33 | 1 | | Post-Office | 81
36 | 1 | | Fax | 80
633 | 6 | | Console-based Games | 76
256 | 8 | | Non-Linear Editor | 74
580 | 9 | | Data Warehousing | 72
117 | 8 | | CD Playing | 70
593 | 16 | | Cross Compilers | 68
616 | 5 | | XML-RPC | 64
140 | 17 | | Init | 64
62 | 3 | | Screen Savers | 64
241 | 1 | | Napster | 60
37 | 1 | | FIDO | 58
586 | 9 | | Resource Booking | 57
104 | 8 | | Screen Capture | 56
145 | 17 | | BSD | 55
102 | 8 | | Scanners | 54
621 | 16 | | Genetic Algorithms | 52
558 | 5 | | TeX/LaTeX | 52
283 | 15 | | History | 49
155 | 17 | | Hardware Watchdog | 45
153 | 17 | | Power (UPS) | 43
568 | 13 | | Ecosystem Sciences | 38
623 | 8 | | Realtime Processing | 37
578 | 9 | | OLAP | 29
25 | 1 | | Unix Talk | 27
54 | 16 | | RCS | 23
88 | 7 | | Finger | 23
59 | 3 | | Enlightenment | 23
61 | 3 | | Themes | 20
595 | 8 | | Special Effects | 18
596 | 8 | | Codec | 18
594 | 8 | | Still Capture | 16
605 | 4 | | MARC and Book/Library Metadata | 15
571 | 12 | | New Age | 14
555 | 5 | | DocBook | 13
614 | 5 | | NNTP | 11
290 | 17 | | NIS | 8
592 | 17 | | Log Rotation | 7
557 | 5 | | SGML | 7
239 | 17 | | GNU Hurd | 6
313 | 17 | | Mainframes | 6
604 | 4 | | OPAC | 6
60 | 3 | | Themes | 6
260 | 16 | | SCCS | 5

Table 6: Project user interfaces: Consolidated 48 categories into 9 categories (I tried to consolidate to widely used interfaces. I put libraries and APIs in the “Other” category)

Rationale for this consolidation: Categories with similar characteristics were grouped together. For example, Java Swing, Java SWT, Java AWT were all grouped together under the new category "Java." Categories containing large numbers of projects were generally left as individual categories. Many obscure interfaces with few projects were placed in the new category “other.”

What was done: I created a new table in sf_merged named ‘categ_user_interface_aug06’. This table is a duplicate of the ‘project_user_interface01_aug_06’ table with the fields ‘new_code’ and ‘new_description’ added. I updated the new fields in the new table with the new codes and descriptions listed in the table below using a Python script I (Bob English) wrote called add_categ_user_interface.py. I added the fields ‘ui1’, ‘ui2’, …, ‘ui9’ to the ‘statistical_analysis_0607’ table (the flat file table). These fields correspond to the new codes (1-9) in the user interface table below. I updated the ui fields with the new codes from the ‘categ_user_interface_aug06’ table. The new fields are of the Boolean data type and have a value of 1 if the project lists the corresponding new code and 0 if it does not. Some projects list more than one code.
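The flag update described above could be sketched roughly as follows. This is a minimal illustration, not the SQL that was actually run: it uses Python's built-in sqlite3 module as a stand-in for the sf_merged database, and the key column ‘proj_id’ and the helper populate_flags are hypothetical names introduced only for the example.

```python
import sqlite3

def populate_flags(conn, flat_table, categ_table, prefix, n_codes):
    """Set <prefix>1..<prefix>N to 1 for every project that lists that new code.

    'proj_id' is an assumed key column; the real tables may use another name.
    """
    for k in range(1, n_codes + 1):
        conn.execute(
            f"UPDATE {flat_table} SET {prefix}{k} = 1 "
            f"WHERE proj_id IN (SELECT proj_id FROM {categ_table} WHERE new_code = ?)",
            (k,),
        )
    conn.commit()

# Toy data: an in-memory database with three projects.
conn = sqlite3.connect(":memory:")
flag_cols = ", ".join(f"ui{k} INTEGER DEFAULT 0" for k in range(1, 10))
conn.execute(f"CREATE TABLE statistical_analysis_0607 (proj_id INTEGER PRIMARY KEY, {flag_cols})")
conn.execute("CREATE TABLE categ_user_interface_aug06 (proj_id INTEGER, code INTEGER, new_code INTEGER)")
conn.executemany("INSERT INTO statistical_analysis_0607 (proj_id) VALUES (?)", [(1,), (2,), (3,)])
conn.executemany(
    "INSERT INTO categ_user_interface_aug06 VALUES (?, ?, ?)",
    [
        (1, 237, 1),  # Web-based
        (1, 471, 6),  # Java Swing -> Java
        (2, 472, 6),  # Java SWT   -> Java
        (2, 470, 6),  # Java AWT   -> Java
    ],
)

populate_flags(conn, "statistical_analysis_0607", "categ_user_interface_aug06", "ui", 9)
rows = conn.execute(
    "SELECT proj_id, ui1, ui6, ui8 FROM statistical_analysis_0607 ORDER BY proj_id"
).fetchall()
```

In this toy data, project 2 lists two Java interfaces but ends up with a single ui6 flag, and project 3 lists nothing and keeps all flags at 0.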

NOTE: 4,830 projects that exist in the project_user_interface_aug_06 table do not exist in the ‘statistical_analysis_0607’ table. Some projects are missing from the ‘statistical_analysis_0607’ table because they were deleted from Sourceforge.net or because we don’t have accurate data for them for various reasons. This defect in the data may impact the statistical analysis and adjustments may need to be made to compensate.
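A count of this kind can be obtained with an anti-join between the category table and the flat file table. The sketch below is only an illustration under assumptions: it uses sqlite3 with toy data, and the key column ‘proj_id’ and the helper count_missing are hypothetical names.

```python
import sqlite3

def count_missing(conn, categ_table, flat_table, key="proj_id"):
    """Count distinct projects in categ_table that have no row in flat_table."""
    sql = (
        f"SELECT COUNT(DISTINCT c.{key}) FROM {categ_table} AS c "
        f"LEFT JOIN {flat_table} AS f ON f.{key} = c.{key} "
        f"WHERE f.{key} IS NULL"
    )
    return conn.execute(sql).fetchone()[0]

# Toy data: projects 2 and 3 are absent from the flat file table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE project_user_interface_aug_06 (proj_id INTEGER, code INTEGER)")
conn.execute("CREATE TABLE statistical_analysis_0607 (proj_id INTEGER PRIMARY KEY)")
conn.executemany(
    "INSERT INTO project_user_interface_aug_06 VALUES (?, ?)",
    [(1, 237), (2, 230), (2, 459), (3, 232)],
)
conn.executemany("INSERT INTO statistical_analysis_0607 VALUES (?)", [(1,)])

missing = count_missing(conn, "project_user_interface_aug_06", "statistical_analysis_0607")
```

COUNT(DISTINCT ...) matters here because a project that lists several codes appears on several rows of the category table but should be counted once.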

code  New code  New description  description                               num_projects
237   1         Web-based        Web-based                                 22223
230   2         Windows          Win32 (MS Windows)                        16894
229   3         X Window         (X11)                                     9664
238   4         Non-interactive  Non-interactive (Daemon)                  5105
459   5         Console          Command-line                              3616
471   6         Java             Java Swing                                2945
231   7         Gnome            Gnome                                     2362
310   8         Other (API)      Cocoa (MacOS X)                           2093
460   5         Console          Console/Terminal                          1943
232   9         KDE              KDE                                       1908
314   8         Other            Handheld/Mobile/PDA                       1282
477   8         Other (library)  GTK+                                      943
227   8         Other            Curses/Ncurses                            893
469   8         Other            .NET/Mono                                 863
475   8         Other (library)  OpenGL                                    734
479   8         Other (library)  Qt                                        726
481   8         Other (library)  wxWidgets                                 621
472   6         Java             Java SWT                                  561
461   8         Other            Plugins                                   540
480   8         Other (library)  SDL                                       468
583   8         Other            Eclipse                                   428
470   6         Java             Java AWT                                  380
463   8         Other            Project is a user interface (UI) system   367
495   8         Other            Other toolkit                             294
466   8         Other            Project is a 3D engine                    249
478   8         Other            Tk                                        225
473   8         Other (API)      Carbon (Mac OS X)                         167
467   8         Other            Project is a graphics toolkit             153
485   8         Other            DirectX                                   131
464   8         Other            Project is a templating system            91
474   8         Other            Framebuffer                               84
468   8         Other            Project is a remote control application   82
492   8         Other (library)  Allegro                                   62
490   8         Other (library)  GLUT                                      61
484   8         Other (library)  FLTK                                      52
465   8         Other            Project is a window manager               34
462   8         Other            Grouping and Descriptive Categories (UI)  32
228   8         Other            Newt                                      30
476   8         Other            TabletPC                                  17
494   6         Mac              Quartz                                    15
491   8         Other            Crystal Space                             14
493   8         Other            Motif/LessTif                             12
487   8         Other            GGI                                       10
483   8         Other (library)  SVGAlib                                   8
488   8         Other            Glide                                     8
489   8         Other (library)  ClanLib                                   5
482   8         Other (library)  AAlib                                     2
486   8         Other (library)  Plib                                      2

Table 7: DB Environment: Consolidated from 33 to 3 categories

Rationale for this consolidation: I consolidated into 3 new categories (Open Source Database, Proprietary Database and Other) to discern whether projects that use open source databases are more successful than projects that use proprietary databases.

What was done: I created a new table in sf_merged named ‘categ_db_env_aug06’. This table is a duplicate of the ‘project_db_environment01_aug_06’ table with the fields ‘new_code’ and ‘new_description’ added. I updated the new fields in the new table with the new codes and descriptions listed in the table below using a Python script I (Bob English) wrote called add_db_env.py. I added the fields ‘de1’, ‘de2’ and ‘de3’ to the ‘statistical_analysis_0607’ table (the flat file table). These fields correspond to the new codes (1-3) in the database environment table below. I updated the de fields with the new codes from the ‘categ_db_env_aug06’ table. The new fields are of the Boolean data type and have a value of 1 if the project lists the corresponding new code and 0 if it does not. Some projects list more than one code.

NOTE: 539 projects that exist in the project_db_env_aug_06 table do not exist in the ‘statistical_analysis_0607’ table. Some projects are missing from the ‘statistical_analysis_0607’ table because they were deleted from Sourceforge.net or because we don’t have accurate data for them for various reasons. This defect in the data may impact the statistical analysis and adjustments may need to be made to compensate.

code  New code  New description  description                                               num_proj
524   1         Open Source DB   MySQL                                                     4598
502   1         Open Source DB   JDBC                                                      1296
508   3         Other            SQL-based                                                 1185
507   3         Other            XML-based                                                 1052
525   1         Open Source DB   PostgreSQL (pgsql)                                        906
521   3         Other            Flat-file                                                 778
530   2         Proprietary DB   Microsoft SQL Server                                      557
531   1         Open Source DB   SQLite                                                    374
526   2         Proprietary DB   Oracle                                                    351
503   2         Proprietary DB   ADOdb                                                     343
505   3         Other            PHP Pear::DB                                              314
509   3         Other            Other API                                                 278
518   2         Proprietary DB   Microsoft Access                                          268
510   3         Other            Project is a database abstraction layer (API)             255
504   3         Other            Perl DBI/DBD                                              225
501   3         Other            ODBC                                                      203
522   2         Proprietary DB   Proprietary file format                                   173
523   3         Other            Other file-based DBMS                                     158
528   1         Open Source DB   Firebird/InterBase                                        148
506   3         Other            Python Database API                                       147
532   1         Open Source DB   HSQL                                                      127
517   1         Open Source DB   Berkeley/Sleepycat/Gdbm (DBM)                             122
516   3         Other            Project is a relational object mapper                     113
533   3         Other            Other network-based DBMS                                  100
511   3         Other            Project is a database management tool                     98
527   2         Proprietary DB   IBM DB2                                                   79
512   3         Other            Project is a file-based DBMS (database system)            56
529   2         Proprietary DB   Sybase                                                    55
514   3         Other            Project is a tool for a proprietary database file format  52
513   3         Other            Project is a network-based DBMS (database system)         42
515   3         Other            Project is a database conversion tool                     38
519   3         Other            xBase                                                     34
520   3         Other            PalmOS PDB                                                24
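
As a rough check on how the 33 original categories roll up into the 3 new ones, the num_proj figures in the table above can be summed per new description. The sketch below is illustrative only: the rows are transcribed from the table, the helper totals_by_category is a hypothetical name, and because some projects list more than one code the totals count listings rather than distinct projects.

```python
from collections import defaultdict

# (new description, original description, num_proj), transcribed from Table 7.
ROWS = [
    ("Open Source DB", "MySQL", 4598),
    ("Open Source DB", "JDBC", 1296),
    ("Other", "SQL-based", 1185),
    ("Other", "XML-based", 1052),
    ("Open Source DB", "PostgreSQL (pgsql)", 906),
    ("Other", "Flat-file", 778),
    ("Proprietary DB", "Microsoft SQL Server", 557),
    ("Open Source DB", "SQLite", 374),
    ("Proprietary DB", "Oracle", 351),
    ("Proprietary DB", "ADOdb", 343),
    ("Other", "PHP Pear::DB", 314),
    ("Other", "Other API", 278),
    ("Proprietary DB", "Microsoft Access", 268),
    ("Other", "Project is a database abstraction layer (API)", 255),
    ("Other", "Perl DBI/DBD", 225),
    ("Other", "ODBC", 203),
    ("Proprietary DB", "Proprietary file format", 173),
    ("Other", "Other file-based DBMS", 158),
    ("Open Source DB", "Firebird/InterBase", 148),
    ("Other", "Python Database API", 147),
    ("Open Source DB", "HSQL", 127),
    ("Open Source DB", "Berkeley/Sleepycat/Gdbm (DBM)", 122),
    ("Other", "Project is a relational object mapper", 113),
    ("Other", "Other network-based DBMS", 100),
    ("Other", "Project is a database management tool", 98),
    ("Proprietary DB", "IBM DB2", 79),
    ("Other", "Project is a file-based DBMS (database system)", 56),
    ("Proprietary DB", "Sybase", 55),
    ("Other", "Project is a tool for a proprietary database file format", 52),
    ("Other", "Project is a network-based DBMS (database system)", 42),
    ("Other", "Project is a database conversion tool", 38),
    ("Other", "xBase", 34),
    ("Other", "PalmOS PDB", 24),
]

def totals_by_category(rows):
    """Sum num_proj for each new description."""
    totals = defaultdict(int)
    for new_description, _description, num_proj in rows:
        totals[new_description] += num_proj
    return dict(totals)

totals = totals_by_category(ROWS)
for name, count in sorted(totals.items(), key=lambda item: -item[1]):
    print(f"{name:<15} {count}")
```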