From: Jim@NITV [email protected]
Subject: RE: Estimate Package
Date: 6 Mar 2014 17:40
To: Jamie Allen [email protected]

Dear Mr. Allen,

We do not sell the CVSA for research purposes - this is specifically prohibited by the CVSA End User License. Also, the CVSA requires a US Government approved Export License for sales outside the USA.

Sorry we cannot be of assistance.

Jim Kane
Executive Director
NITV Federal Services, LLC
561-798-6280

From: Jamie Allen [mailto:[email protected]]
Sent: Thursday, March 06, 2014 3:15 AM
To: [email protected]
Subject: Estimate Package

Dear Sir or Madame,

I would be interested to test the voice stress analysis products you have on offer (for research purposes in anthropological and ethnographic work). Would you be able to provide a copy for testing, and give me a sense of the pricing and costs involved?

I found this email address on the following website: http://www.cvsa1.com/Products.htm

Cheers,
Jamie Allen
head of research - CIID copenhagen institute of interaction design
co-editor - continent. a topology of thought
sciencefriction.dk studio

Zimmerman “Not Guilty” Verdict Validates Computer Voice Stress Analyzer (CVSA) Results

The day after George Zimmerman shot Trayvon Martin, Sanford police detectives requested that Mr. Zimmerman submit to an examination by the Computer Voice Stress Analyzer (CVSA). The CVSA is a newer form of truth verification, which has largely replaced the old polygraph and is now utilized by approximately 2,000 agencies, including many major US law enforcement agencies such as the Atlanta P.D., Nashville P.D., Baltimore P.D., San Francisco P.D., New Orleans P.D., Miami P.D., the California Highway Patrol, and the U.S. Federal Court system.

The results of the CVSA examination clearly indicated that George Zimmerman believed his life was in jeopardy when he shot Trayvon Martin. Based on the evidence and information developed during their investigation, which included the results of the CVSA examination, the Sanford P.D. initially declined to arrest Mr. Zimmerman. However, after public outcry, the Florida State Attorney decided to charge Mr. Zimmerman with second-degree murder, a charge that Jonathan Turley, the Shapiro Professor of Public Interest Law at George Washington University, called "Legally and tactically unwise."

This case is not unique since many innocent people have been charged with crimes they did not commit. To learn more about other unusual criminal cases the CVSA has helped solve, please go to http://cvsa1.com/realcases.htm

CURRENT SCIENTIFIC RESEARCH SUPPORTS THE CVSA: The findings of the Zimmerman CVSA examination are consistent with recent peer-reviewed research published in the 2012 annual edition of the scientific journal “Criminalistics and Court Expertise,” which reports that the accuracy rate of the CVSA is greater than 96%, an assertion long made by the system’s manufacturer and law enforcement users. The research also showed a 100% correspondence between CVSA results and court adjudications.


IN THE CIRCUIT COURT OF THE EIGHTEENTH JUDICIAL CIRCUIT IN AND FOR SEMINOLE COUNTY, FLORIDA

STATE OF FLORIDA

VS.

GEORGE ZIMMERMAN

CASE NO.: 2012-001083-CFA
SANO:1712F04573

STATE'S MOTION IN LIMINE REGARDING CVSA TESTING

The State of Florida, by and through the undersigned Assistant State Attorney, hereby moves this Honorable Court to prohibit any argument, testimony or evidence concerning or related to Defendant George Zimmerman taking a Computerized Voice Stress Analysis (CVSA) test. In support of the instant Motion, the State submits the following;

(1) The State has reason to believe that the Defendant will attempt to introduce evidence, testimony, or make other reference to the fact that the Defendant took a CVSA test after he shot and killed Trayvon Martin.

(2) It is contrary to the rules and laws governing the courts of the State of Florida to permit such evidence or inference. This type of evidence has long been held to be unreliable and therefore inadmissible.

(3) Such evidence, and anything akin to such evidence, is irrelevant, has no probative value and would serve only to prejudice the jury.

AFRL-IF-RS-TM-2001-7 In-House Technical Memorandum November 2001

INVESTIGATION AND EVALUATION OF VOICE STRESS ANALYSIS TECHNOLOGY

Darren Haddad, Sharon Walter, Roy Ratley and Megan Smith

APPROVED FOR PUBLIC RELEASE; DISTRIBUTION UNLIMITED.

AIR FORCE RESEARCH LABORATORY
INFORMATION DIRECTORATE
ROME RESEARCH SITE
ROME, NEW YORK

This report has been reviewed by the Air Force Research Laboratory, Information Directorate, Public Affairs Office (IFOIPA) and is releasable to the National Technical Information Service (NTIS). At NTIS it will be releasable to the general public, including foreign nations.

AFRL-IF-RS-TM-2001-7 has been reviewed and is approved for publication.

APPROVED: GERALD C. NETHERCOTT Chief, Multi-Sensor Exploitation Branch Info and Intel Exploitation Division Information Directorate

FOR THE DIRECTOR: JOSEPH CAMERA Chief, Information & Intelligence Exploitation Division Information Directorate

If your address has changed or if you wish to be removed from the Air Force Research Laboratory Rome Research Site mailing list, or if the addressee is no longer employed by your organization, please notify AFRL/IFEC, 32 Brooks Road, Rome, NY 13441-4114. This will assist us in maintaining a current mailing list. Do not return copies of this report unless contractual obligations or notices on a specific document require that it be returned.

REPORT DOCUMENTATION PAGE
Form Approved OMB No. 0704-0188

1. AGENCY USE ONLY (Leave blank)
2. REPORT DATE: NOVEMBER 2001
3. REPORT TYPE AND DATES COVERED: In-House, August 1998 - July 2001
4. TITLE AND SUBTITLE: INVESTIGATION AND EVALUATION OF VOICE STRESS ANALYSIS TECHNOLOGIES
5. FUNDING NUMBERS: C: N/A, PE: N/A, PR: NUR, TA: SA, WU: 13
6. AUTHOR(S): Darren Haddad and Sharon Walter; Roy Ratley and Meagan Smith
7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES): AFRL/IFEC, 32 Brooks Road, Rome NY 13441; ACS Defense, P.O. Box 1188, Rome NY 13442
8. PERFORMING ORGANIZATION REPORT NUMBER:
9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES): AFRL/IFEC, 32 BROOKS ROAD, ROME NY 13441-4114
10. SPONSORING/MONITORING AGENCY REPORT NUMBER: AFRL-IF-RS-TM-2001-7
11. SUPPLEMENTARY NOTES: AFRL/IFEC Project Engineer Darren M. Haddad, 330-2906
12a. DISTRIBUTION AVAILABILITY STATEMENT: Approved for public release; distribution unlimited
12b. DISTRIBUTION CODE:

13. ABSTRACT (Maximum 200 words)

Numerous police officers and agencies have been approached in recent years by vendors touting computer-based systems capable of measuring stress in a person's voice as an indicator of deception. These systems are advertised as being cheaper, easier to use, less invasive in use, and less constrained in their operation than polygraph technology. They claim that a speaker's medical condition, age, or consumption of drugs does not affect use of their system. Voice stress analysis does not require physical attachment of the system to the speaker's body and does not require that answers be restricted to "yes" and "no". Purportedly, according to some vendors, any spoken word or even a groan, whether recorded, videotaped, or spoken in person, with or without the speaker's knowledge, is an acceptable input to voice stress analysis systems.

The value of voice stress analysis technology for military application could be extensive. During military field interrogations of potential informants, it could be applied in a manner similar to its application for law enforcement. Also, it is not known if stressed speech has any effects on the accuracy of speech technology, such as speaker identification and language identification. If voice stress can be detected, perhaps it can be taken into account in applying voice recognition technology and be used to improve these recognition capabilities. Therefore, this effort is to determine the scientific value and utility of existing, commercial voice stress analysis technology for law enforcement and military applications.

14. SUBJECT TERMS: Voice Stress, Speaker Identification
15. NUMBER OF PAGES: 118
16. PRICE CODE:
17. SECURITY CLASSIFICATION OF REPORT: UNCLAS
18. SECURITY CLASSIFICATION OF THIS PAGE: UNCLAS
19. SECURITY CLASSIFICATION OF ABSTRACT: UNCLAS
20. LIMITATION OF ABSTRACT: UL

Standard Form 298 (Rev. 2-89) (EG). Prescribed by ANSI Std. Z39.18. Designed using Perform Pro, WHS/DIOR, Oct 94.

Table of Contents

1.0 EXECUTIVE SUMMARY
2.0 EFFORT OBJECTIVE
3.0 INTRODUCTION
4.0 HISTORY OF VSA TECHNOLOGY
5.0 AVAILABLE VSA SYSTEMS
  5.1 PSYCHOLOGICAL STRESS EVALUATOR (PSE)
  5.2 LANTERN
  5.3 VERICATOR
  5.4 COMPUTERIZED VOICE STRESS ANALYZER (CVSA)
  5.5 VSA MARK 1000
  5.6 VSA-15
  5.7 XANDI ELECTRONICS
  5.8 TVSA3
6.0 METHODS OF VOICE STRESS ANALYSIS AND CLASSIFICATIONS
7.0 TESTING
  7.1 TEST OBJECTIVE
  7.2 SCOPE/APPROACH
  7.3 TEST AND ANALYSIS PROCEDURES
  7.4 SYSTEMS TESTED
  7.5 TECHNICAL TESTING
    7.5.1 Artificial Signal Test (Test 1)
      7.5.1.1 Objective (Test 1)
      7.5.1.2 Test 1 Set-Up
      7.5.1.3 Vericator
      7.5.1.4 Diogenes Lantern
      7.5.1.5 Summary (Test 1)
    7.5.2 Source Consistency Tests (Test 2)
      7.5.2.1 Objective (Test 2)
      7.5.2.2 Scope
      7.5.2.3 Test
      7.5.2.4 Examination Results
      7.5.2.5 Summary Test 2
    7.5.3 Objective (Test 3)
      7.5.3.1 Data Evaluation (Test 3)
      7.5.3.2 Data Collection and Down Sampling
      7.5.3.3 Segmentation
      7.5.3.4 Testing
      7.5.3.5 Results
  7.6 FIELD TESTING
8.0 CONCLUSION
9.0 SUGGESTED FOLLOW ON
REFERENCES
Appendix A
Appendix B
Appendix C

List of Figures

FIGURE 1: FM RECORDER TEST SIGNALS @ 80 Hz & 160 Hz
FIGURE 2: WAVEFORMS TO THE DIOGENES LANTERN SYSTEM
FIGURE 3: TEST CONFIGURATION FOR SOURCE CONSISTENCY TESTS
FIGURE 4: WAVEFORM CHANGES USING THE CASSETTE RECORDER WITHOUT THE AGC SET

List of Tables

TABLE 1: COST COMPARISONS MADE BY A VSA VENDOR

Preface

Relationship between AF and NIJ:

The unique relationship between the Air Force Rome Research Site and the National Institute of Justice (NIJ) was established over four years ago with a Memorandum Of Understanding (MOU) agreement. This agreement allowed the Air Force the opportunity to test and evaluate its technologies for law enforcement applications. With this agreement, the Air Force was able to collect a variety of law enforcement audio/video data that would eventually be utilized to test the performance of Air Force algorithms developed within the Rome Research Site at Rome, New York. In turn, the agreement enables the NIJ to demonstrate technologies that may apply to law enforcement, encouraging the possible transfer of those technologies to state and local crime fighting units.

1.0 EXECUTIVE SUMMARY

Voice Stress Analysis (VSA) systems are marketed as computer-based systems capable of measuring stress in a person's voice as an indicator of deception. They are advertised as being less expensive, easier to use, less invasive in use, and less constrained in their operation than polygraph technology. Law enforcement officials have inquired about this technology. As a result, the National Institute of Justice (NIJ) has petitioned the Air Force Research Laboratory (AFRL/IFE) for assistance in evaluating voice stress analysis technology. This evaluation is broken down into three phases. In the first phase, Dr. John H.L. Hansen, from the University of Colorado, investigated the feasibility of detecting stress from speech. His report on the methods, analysis, and classification of voice stress is contained in the appendix of this report. The second and third phases of this study investigated the reliability of commercial VSA units, from a theoretical point of view and from an application (i.e. law enforcement) point of view.

2.0 EFFORT OBJECTIVE

The objective of this effort is to determine the effectiveness of commercially available voice stress analyzers (VSA) to detect "stress" in the voice of a talker. The use of "stressed speech" for this effort is defined as speech that exhibits a change in characteristics caused by mental stress such as anxiety and/or fear. Of particular interest is the detection of stressed speech (change) caused by an act of deception under law enforcement interview questioning or military interrogation.

3.0 INTRODUCTION

Police departments everywhere are bombarded with offers of advanced technologies by commercial enterprises that promise to reduce their officers' workload, improve law enforcement effectiveness, and/or save lives. With increasingly limited budgets, police departments must turn a critical eye to every purchase. One area of interest to law enforcement and military organizations is commercial VSA systems, which are advertised to detect deception or to detect when a person under interrogation is lying. If voice stress can be detected, and effectively analyzed, perhaps it can be used as a viable investigative tool as well as an adjunct to speech recognition technology in order to improve speech recognition capabilities.

Numerous police officers and agencies have been approached in recent years by vendors touting computer-based systems capable of measuring stress in a person's voice as an indicator of deception. These systems are advertised as being cheaper, easier to use, less invasive in use, and less constrained in their operation than polygraph technology. Table 1 is a replication of the table of comparisons made by one vendor contrasting their VSA system with a computerized polygraph. Besides costing less to purchase the equipment and train users, the table indicates that a VSA examiner can conduct seven (7) exams per day while a polygraph examiner can conduct only two (2) per day. This vendor claims to always have conclusive results, and the ability to analyze recorded audio as well as live speakers. They claim that a speaker's medical condition, age, or consumption of drugs does not affect use of their system. Voice stress analysis does not require physical attachment of the system to the speaker's body and does not require that answers be restricted to "yes" and "no". Purportedly, according to some vendors, any spoken word or even a groan, whether recorded, videotaped, or spoken in person, with or without the speaker's knowledge, is an acceptable input to voice stress analysis systems.

The value of voice stress analysis technology for military application could be extensive. During military field interrogations of potential informants, it could be applied in a manner similar to its application for law enforcement. Also, it is not known if stressed speech has any effects on the accuracy of speech technology, such as speaker identification and language identification. If voice stress can be detected, perhaps it can be taken into account in applying voice recognition technology and be used to improve these recognition capabilities. Therefore, this effort is to determine the scientific value and utility of existing, commercial voice stress analysis technology for law enforcement and military applications.

Table 1: Cost comparisons made by a VSA vendor

                                                      Computer Voice    Computerized
                                                      Stress Analyzer   Polygraph
Initial cost of system                                $9,250.00         $13,000.00
Tuition for 1 student                                 $1,215.00         $3,000.00
Length of training                                    6 days            8 weeks
Cost of room and board factored at $70 per day        $420.00           $3,920.00
Salary for student while in training (U.S. average)   $769.23           $6,153.84
Number of exams that an examiner can conduct per day  7 exams           2 exams
Average percent of inconclusive results on exams      0%                20%
Can unit analyze audio tapes for truth verification?  Yes               No
Do drugs, medical condition, or age affect testing?   No                Yes
Total cost to purchase 1 unit and train 1 agent       $11,654           $26,073.84
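The totals in the vendor's table can be checked by summing the four cost rows for each system. The sketch below does that arithmetic; the assignment of dollar figures to the two columns is an assumption inferred from the room-and-board row ($70 per day for 6 days versus 8 weeks), and the function names are illustrative only.

```python
from decimal import Decimal

# Dollar figures as listed in Table 1. Which figure belongs to which system
# is an assumption inferred from the room-and-board row ($70/day x days).
COSTS = {
    "CVSA": {
        "system": Decimal("9250.00"),
        "tuition": Decimal("1215.00"),
        "room_and_board": Decimal("420.00"),
        "salary": Decimal("769.23"),
    },
    "polygraph": {
        "system": Decimal("13000.00"),
        "tuition": Decimal("3000.00"),
        "room_and_board": Decimal("3920.00"),
        "salary": Decimal("6153.84"),
    },
}


def room_and_board(days, rate=Decimal("70")):
    """Room and board at a flat daily rate, as the table factors it."""
    return rate * days


def total_cost(system_name):
    """Sum the four cost rows for one system."""
    return sum(COSTS[system_name].values(), Decimal("0"))


if __name__ == "__main__":
    for name in COSTS:
        print(name, total_cost(name))
```

Under this column assignment the sums reproduce the vendor's totals (the CVSA total appears in the table rounded to whole dollars), and 6 and 56 days of room and board give the $420.00 and $3,920.00 entries.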

4.0 HISTORY OF VSA TECHNOLOGY

In 1970, prior to the publication of Lippold's article in 1971, three military officers retired from the U.S. Army and formed a company which they named Dektor Counterintelligence and Security (CIS). The three officers were Allan Bell, Bill Ford and Charles McQuiston. Bell's expertise was in counterintelligence, Ford's was in electronics, and McQuiston's was in polygraphy. Ford had invented an electronic device that utilized the theory of Lippold, Halliday and Redfearn, in which he tape-recorded the human voice, slowed it down by a factor of three to four, and fed it through several lowpass filters, which then fed the signal into an EKG strip chart recorder. The strip chart recorder then made chart tracings on heat sensitive paper. They named their device the Psychological Stress Evaluator (PSE). Although Dektor CIS was intended to be a security company, the PSE immediately became a success and their focus became centered on this system. One of the first individuals hired by Dektor was a polygraph examiner with a local police department which had started utilizing the PSE. This individual, along with McQuiston, wrote a three-day training course based on their polygraph experience and utilizing polygraph formats.

According to Allan Bell Enterprises [1], "All lie-detection examinations or evaluations are predicated upon the fact that telling a significant lie will produce some degree of psychological stress. Psychological stress, in turn, causes a number of physiological changes." The polygraph takes advantage of these physiological changes to measure one's psychological stress. Polygraphs customarily measure changes in blood pressure, hormone levels, stomach and chest breathing patterns, galvanic skin response (perspiration), and the pulse wave and amplitude. VSA literature [9] points to a paper describing the physiological basis for the micro muscle tremor, or microtremor. This paper describes "a slight oscillation at approximately 10 cycles per second" (i.e. physiological tremors) during the normal contraction of voluntary muscle. All muscles in the body, including the vocal cords, vibrate in the 8 to 12 Hz range. It is these microtremors that the VSA vendors claim to be the sole basis for detecting whether an individual is lying.

This human system is a feedback loop, similar to a thermostat/heater that maintains an average temperature. When the temperature rises a little above the setting, the heater switches off, and it does not come back on until the temperature is a little below the setting. Just as the temperature swings up and down over time, so too do the muscles tighten and loosen as they seek to maintain a constant tension. In moments of stress, the body prepares for fight or flight by increasing the readiness of its muscles to spring into action. The muscle vibration increases. This muscle tremor is usually evident in a hand tremor, as when one holds their arm out in an extended position. This indicates that restricting the blood supply to the muscle can reduce the tremor. Physiological tremor is "the ripple that is superimposed on the voluntary contraction of a particular muscle and arises solely from this activity." Most people exhibit a fine, rapid tremor of their hands when their arms are outstretched. According to the Merck Manual [12], "enhanced physiologic tremor may be produced by anxiety, stress, fatigue, or metabolic derangements (i.e. alcohol withdrawal, thyrotoxicosis) or by certain drugs (i.e. caffeine and other phosphodiesterase inhibitors, beta-adrenergic agonists, and adrenal corticosteroids beta-blocker: propranolol)."

The initial VSA development effort, entitled "Application of Voice Analysis Method", was funded by the U.S. Army Land Warfare Laboratory and performed by Decision Control, Incorporated of Bethesda, Maryland. This study was performed to assess the capability of a method of voice analysis to detect stress in the spoken response, "no." The study recorded the voice responses of individuals undergoing polygraph testing and analyzed them for their stress values. The results were then compared to the polygraph interpretations. In this stress response comparison, the waveform results were similar. A prototype voice analyzer was developed, fabricated and tested. The device processed recorded audio and provided three voice measures. The introduction to this report indicates that a previous study [14] had shown that an analysis of the response "no" could provide an accurate assessment of whether the response was truthful or deceitful. Six semi-orthogonal measures and a number of bandpass frequencies were used in the study. The experiment simultaneously used the polygraph to determine the existence of stress. The results concluded that it was highly desirable to reduce the number of measures, and to determine the best set of bandpass frequencies.
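The thermostat analogy used earlier in this section for the muscle feedback loop can be made concrete with a short simulation. The following is a minimal sketch of an on/off controller with a switching band around its setting; every number in it (setpoint, band, heating and cooling rates) is an arbitrary illustration and not a physiological value taken from the report or from the VSA literature.

```python
def simulate_thermostat(setpoint=20.0, band=0.5, start=18.0, steps=200,
                        heat_rate=0.2, cool_rate=0.1):
    """Simulate an on/off heater that switches off a little above the
    setting and back on a little below it. All parameters are illustrative."""
    temperature = start
    heater_on = temperature < setpoint
    history = []
    switches = 0
    for _ in range(steps):
        # Switch off only once the temperature is a little above the setting.
        if heater_on and temperature >= setpoint + band:
            heater_on = False
            switches += 1
        # Switch back on only once it has fallen a little below the setting.
        elif not heater_on and temperature <= setpoint - band:
            heater_on = True
            switches += 1
        temperature += heat_rate if heater_on else -cool_rate
        history.append(temperature)
    return history, switches


if __name__ == "__main__":
    history, switches = simulate_thermostat()
    settled = history[20:]
    print(f"switches: {switches}, range: {min(settled):.2f} to {max(settled):.2f}")
```

The temperature never settles at the setpoint; it keeps swinging within a narrow band around it, which is the behavior the analogy attributes to muscles holding a constant tension.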

U.S. Army Land Warfare Laboratory Technical Report #LWL-CR-03B70 by Joseph F. Kubis of Fordham University, titled "Comparison of Voice Analysis and Polygraph as Lie Detection Procedures" (commonly referred to as "the Kubis Report"), compared the two types of lie detecting systems [8]. Two voice analysis systems were evaluated as lie detection devices in a simulated theft experiment, which utilized 174 subjects. One group of subjects was examined with the polygraph while their voice recordings were taken at the same time. A smaller group was tested only with their voice being recorded. The results failed to demonstrate that either of the voice analysis systems was accurate in identifying the three basic roles of Thief, Lookout, and Innocent Subject in a simulated theft experiment. The polygraph achieved an accuracy score of 76 percent, a value comparable to that obtained in previous studies using the simulated theft paradigm. Independent raters, who knew nothing about the characteristics of the experiment subjects, also obtained 50 to 60 percent accuracy scores in the examination of the polygraph charts.

The Kubis Report found that the results from the voice recordings were not statistically significant. It also showed that voice analysis was less accurate for subjects who were both voice-recorded and polygraph-tested than for those who had their voice recorded without the polygraph. Monitors of the audio recordings who were present during the interrogation sessions based their judgements more on their perceived impressions of the suspect than on the output of the system. They were able to discriminate among Thief, Lookout, and Innocent Suspect. Based on these results, one could hypothesize that the simulated theft procedure induced a sufficient degree of emotional stress in a subject, which indicates that it could be useful for lie detection research. Another study [2] takes issue with the Kubis Report, citing the experimental methodology as a "game model" with possibly insufficient induced stress for measurement, a noisy environment, and deviations from manufacturer-recommended questioning techniques.

Polygraph-licensing laws in some states require lie detection to be accomplished specifically with a polygraph. These laws define lie detection equipment as equipment providing a permanent record of cardiograph (heart) and pneumograph (breathing) data. The earliest law was enacted in Kentucky in 1962. These legislative enactments made the VSA units illegal in some states. State-sponsored hearings in Florida in 1973 and 1974 resulted in an informal acceptance of the PSE for law enforcement use in that state. North Carolina and Arkansas soon followed and formally authorized use of the PSE.

5.0 AVAILABLE VSA SYSTEMS

Currently, many VSA systems are available on the market. The major VSA vendors market their products on a laptop with specific software, while a few are sold as an electronic device with the software embedded on its chips. Examples of VSAs currently available are described below.

5.1 Psychological Stress Evaluator (PSE)

Dektor Counterintelligence and Security, Inc. of Springfield, Virginia.

The Canadian Patent #943230 (March 5, 1974) and United States Patent #3,971,034 (July 20, 1976), submitted by Allan D. Bell, Jr., Wilson H. Ford, and Charles R. McQuiston, describe their "Physiological Response Analysis Method and Apparatus." This unit was the first VSA unit on the market, released in March of 1971. It was designed to be used in the same manner as a polygraph, one-on-one testing for the detection of deception. It was a black box with an output in the form of a waveform via thermograph readout. The PSE senses the difference and records the change in the inaudible FM qualities of the voice on a chart. When an experienced examiner interprets the chart, it reveals the key stress areas of the person being questioned.

5.2 Lantern

The Diogenes Group, Inc. (407) 933-4839 FAX: (407) 935-0911

The Diogenes Group Inc., established in 1995, produces a system called the Lantern. The Lantern instrumentation consists of an analog-type magnetic tape recorder with integral microphone, a Pentium laptop computer serving as a high-speed processor, and an extensive program of copyrighted, proprietary processing software designed specifically for ease of operation. The Windows 3.11™ or Windows 95™ based software is also responsible for control of all processing operations, display format and presentation, and the printing of hard copies of the waveforms representing the behavior of the microtremor. The tape recorder is operated throughout an interview to create the primary record, which includes both questions and answers in the context in which they occurred. The monitor output of the recorder provides the real-time input to the digital processor. The examiner is able to control, with a single finger, high-sample rate digital capture of the sound of each answer. [6]

5.3 Vericator

Trustech Ltd. Integritek Systems, Inc., 111 Bermuda Ave. Tampa, FL 33606 +1 813 250 3922

Trustech Ltd. was founded in 1997, and produces a system called Vericator, formerly known as the Truster Pro. This system allows the user to use their own personal computer with the following requirements: WIN95™ / WIN98™ / NT 4.0™, Pentium™ II or III, 32 MB RAM to 128 MB RAM, a microphone, a CD-ROM drive (double speed), and a 16-bit sound card (full duplex). The package includes a Vericator CD, a stereo T-connector (for connecting your PC and telephone), and the Vericator User Manual. It features an automatic calibration process; analysis of deception in real time; and analysis of pre-recorded online conversations/interviews and TV or radio segments. The summary and technical reports can be viewed, saved and printed. There are graph displays for advanced diagnosis, four built-in psychological lie detection patterns, and a filtering system for reducing background noise. [7]

5.4 Computerized Voice Stress Analyzer (CVSA)

National Institute for Truth Verification (NITV) West Palm Beach, Florida 33414 (561) 798-6280 FAX: (561) 798-1594

In 1986, NITV began to market the Computerized Voice Stress Analyzer (CVSA), currently known as the most popular VSA system available. NITV advertisements claim that the system is in use in more than 500 law enforcement agencies, and offer as evidence letters of endorsement from agencies throughout the United States. NITV claims to market only to law enforcement agencies in order to prevent it from being used by criminals to identify undercover agents. [5]

5.5 VSA Mark 1000 - Marketed by CCS International, Inc.: This system is marketed as a covert electronic lie detection system providing fast analysis, fast results and fast answers. With its built-in tape recorder, the VSA Mark 1000 allows you to analyze audio data at a later time. A clear, precise digital readout is given in both LED and printed format, where the results are instantaneous. For more information go to http://www.spyzone.com/catalog/index.html. [4]

5.6 VSA-15 - Marketed by CCS International, Inc.: This system is similar to the VSA Mark 1000, but is marketed as a miniaturized handheld system. This unit is targeted for the non-professional user. For more information go to http://www.spyzone.com/catalog/index.html. [4]

5.7 Xandi Electronics - Markets a Voice Stress Analyzer Kit (Model # XVA250) for $59. It has 10 LEDs. The system is powered by either a 9 volt battery or a power adapter. As you speak into the analyzer, the LEDs in the normal position (the left) should light up. Under stress conditions, more of the LEDs on the right-hand side will light up in the stress position, and fewer will light up in the normal position. [11]

5.8 TVSA3 - This VSA software is freeware available on the World Wide Web. The TVSA3 is a software program which inputs digital audio files and outputs new audio files mixed with a changing tone in the background. These background tones indicate the changing stress levels of the individual who is speaking. A higher tone indicates a higher stress level. The lone control is a threshold setting, which determines how high the voice stress frequency must be to trigger the background tone. The threshold setting is treated as a percentage of the stress range found in a given recording. [3]

Only the Diogenes Lantern and Vericator were assessed in this study and will be discussed in this report. These are the most popular VSA units available on the commercial market today. Another popular unit is the CVSA, but the company decided not to participate in this study.
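The TVSA3 threshold control described in section 5.8 is stated as a percentage of the stress range found in a recording. A minimal sketch of one way such a rule could work is given below. It assumes the "stress range" runs from the lowest to the highest per-frame stress value in the recording; the function name and the example values are hypothetical, and TVSA3's actual computation is not documented in this report.

```python
def tone_trigger_mask(stress_values, threshold_percent):
    """Return, per frame, whether a background tone would be triggered.

    Assumption: the cutoff is placed threshold_percent of the way between
    the lowest and highest stress value observed in this one recording.
    """
    if not stress_values:
        raise ValueError("stress_values must not be empty")
    if not 0 <= threshold_percent <= 100:
        raise ValueError("threshold_percent must be between 0 and 100")
    lowest, highest = min(stress_values), max(stress_values)
    cutoff = lowest + (highest - lowest) * threshold_percent / 100.0
    return [value >= cutoff for value in stress_values]


if __name__ == "__main__":
    # Hypothetical per-frame stress estimates for one recording.
    frames = [8.0, 9.0, 10.0, 12.0]
    print(tone_trigger_mask(frames, 50))
```

Because the cutoff is relative to each recording's own range, the same absolute stress value can trigger the tone in one recording and not in another, which is one consequence of a percentage-based threshold.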

6.0 METHODS OF VOICE STRESS ANALYSIS AND CLASSIFICATIONS

To better understand the aspects of stressed speech in a human, the Air Force Research Laboratory (AFRL) worked with Dr. John Hansen of the University of Colorado to determine if it is feasible to recognize and classify stress in an individual's voice. Dr. Hansen is a world-renowned expert in the area of voice stress. His report is attached as Appendix C. He states that "it is not inconceivable that under extreme levels of stress, that muscle control throughout the speaker will be affected, including muscles associated with speech production". In this study he used the Speech under Simulated and Actual Stress (SUSAS) database. This database includes stressed speech such as angry, loud, Lombard (speaking under noisy conditions), and fear stress. In his report, he reviewed literature that discussed past studies of speech under stress.

He analyzed stress in speech and concluded that voice stress is caused by factors that introduce variability into the speech production process. These variabilities, or features, include duration, glottal source factors, pitch distribution, spectral structure and intensity. Duration includes four areas: (1) overall word duration, (2) individual speech class (vowel, consonant, semivowel, and diphthong) duration, (3) duration shifts between classes, and (4) speech class duration ratios. Glottal source factors measured the spectral slope of those vowels which were longer than 5 frames or 96 msec. The first and second formant locations are measured to determine the spectral structure. Intensity is calculated from the voice signal. These variabilities could also be speaker dependent.

By using these various linear and nonlinear features, and testing with the Bayesian hypothesis method, it was concluded that different types of emotional stress could be classified. The Bayesian hypothesis method is a stress detection technique to determine if a given piece of audio data is either neutral speech or a certain classification of stressed speech. The results suggest that it is unlikely that a single feature could be used to accurately detect deceptive stressed speech. The more features that are fused together, the better the stress type recognition. The results also show that some features, on their own, can detect a specific type of stress better than other features. For example, the pitch feature could detect loud stress better than angry and Lombard stress, whereas the spectral structure feature could detect angry stress better. Classification of deceptive stress was not tested due to the unavailability of a deceptive database. The collection of a deceptive database is a recommendation for future work (see section 9.0).
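The two-hypothesis decision described above (neutral speech versus one class of stressed speech) and the benefit of fusing features can be sketched as a log-likelihood ratio test. This is not Dr. Hansen's classifier: the Gaussian feature models, the independence assumption used to fuse features, the feature names, and all numeric values are hypothetical and chosen only to illustrate the idea.

```python
import math


def gaussian_log_pdf(x, mean, std):
    """Log density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def log_likelihood_ratio(features, neutral_model, stress_model):
    """Sum per-feature log-likelihood ratios (stress versus neutral).

    Fusing features by summation assumes they are independent, which is a
    simplifying assumption of this sketch.
    """
    total = 0.0
    for name, value in features.items():
        n_mean, n_std = neutral_model[name]
        s_mean, s_std = stress_model[name]
        total += gaussian_log_pdf(value, s_mean, s_std)
        total -= gaussian_log_pdf(value, n_mean, n_std)
    return total


def classify(features, neutral_model, stress_model, prior_stress=0.5):
    """Choose the hypothesis with the larger posterior probability."""
    threshold = math.log((1 - prior_stress) / prior_stress)
    llr = log_likelihood_ratio(features, neutral_model, stress_model)
    return "stress" if llr > threshold else "neutral"


# Hypothetical (mean, standard deviation) per feature and hypothesis.
NEUTRAL = {"pitch_hz": (120.0, 20.0), "intensity_db": (60.0, 5.0)}
STRESS = {"pitch_hz": (160.0, 30.0), "intensity_db": (70.0, 6.0)}

if __name__ == "__main__":
    sample = {"pitch_hz": 170.0, "intensity_db": 72.0}
    print(classify(sample, NEUTRAL, STRESS))
    print(log_likelihood_ratio({"pitch_hz": 170.0}, NEUTRAL, STRESS))
    print(log_likelihood_ratio(sample, NEUTRAL, STRESS))
```

In this toy setup the fused log-likelihood ratio for the sample is larger in magnitude than the ratio from pitch alone, which mirrors the report's observation that combining features improves recognition.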

7.0 TESTING

The goal of these tests is to determine how effectively these VSA units can detect stress. VSA vendors have marketed their technology as scientific, claiming that it takes advantage of the human micromuscle tremor in the vocal tract. These tests attempt to prove or disprove these claims.

7.1 Test Objective

The objective of these tests is to measure the output response of two VSA systems, given several controlled input signals. This will be used to verify the manufacturer's claims of operation for each analyzer. The degree of source consistency of results for each analyzer will then be determined. This will determine the correct process to use when recording audio for evaluation. Finally, the VSA systems will be laboratory tested and field tested, with evaluation by trained laboratory analysts and experienced police investigators.

7.2 Scope/Approach

This effort will test and evaluate two (2) commercially available voice stress analyzers. Tests will be accomplished using a series of test signals that contain information distributed over the frequency spectrum generally covered by normal speech. Analysis of the VSA test results will be conducted to determine

- VSA response characteristics

- Degree of accuracy compared to the manufacturer's theory of operation and technical specifications

- Accuracy of result repeatability

- Performance under real-world conditions

The test approach taken in this plan is to consider each analyzer to be a black box, and to record its output response to known input test signals.

7.3 Test and Analysis Procedures

The procedures are developed for three areas: the development of test tapes containing artificial signals, the source consistency test, and the analysis and evaluation of audio data with stress ground truth.

7.4 Systems Tested

Vericator and Diogenes Lantern

7.5 Technical Testing

Trained analysts in a laboratory setting completed the technical testing. These analysts were each trained through the VSA vendor training programs.

7.5.1 Artificial Signal Test (Test 1)

7.5.1.1 Objective (Test 1)

Test 1 of the VSA evaluation was to determine whether the VSA units detect the frequency modulation of a signal. These signals are similar to the microtremor, which manufacturers state is their theory of operation. For the purposes of this test we utilized the Vericator system and the Diogenes Lantern system. An artificially generated database of FM signals, with different rates and depths of modulation, was processed repeatedly through the systems.

7.5.1.2 Test 1 Set-Up

The test was performed on laptop computers that contained the Vericator and Diogenes Lantern software. The signals were fed to the laptops from a desktop PC, which played the artificial signals using the commercial off-the-shelf (COTS) application Cool Edit. Cool Edit is a digital audio editor for Windows-based systems. It is used to record and play files in a wide variety of audio formats, edit files and mix them together, and convert audio files from one format to another. Cool Edit also gives the ability to create sounds from scratch with generated tones and generated noise signals. The FM test signals that comprise the signal database for Test 1 were generated using Cool Edit. These FM signals were generated at carrier frequencies of 80 Hz and 160 Hz (representing the fundamental frequency of a speech signal), with varying modulation rates and depths of modulation (Figure 1). The modulation rate measures how fast the signal modulates, and the depth represents how far the signal deviates from the carrier frequency.

[Test matrix: modulation rate (Hz) of 1, 2, 4, 6, 10, 15, 20, and 25 against depth of modulation (Hz) of 1, 2, 4, 6, 8, 10, 15, 20, and 25, with an X marking each combination tested.]

Figure 1: FM Recorder Test Signals @ 80 Hz & 160 Hz

The test signals were recorded in 15 second utterances. Each signal was passed through each VSA system. The test results were recorded on data spreadsheets, and the wave analysis was labeled and printed. Once the test waves were all analyzed, the documentation was compared to determine consistency.

7.5.1.3 Vericator

For the purposes of this test we utilized the "Online Mode" of the Vericator application. The "Online Mode" measures five voice parameters, SPT, SPJ, JQ, AVJ, and SPLC-SOS (see Appendix A), in real time (< 2 second delay) to detect stress. The signals were processed through the Vericator in short utterances. A few signals were attempted, with a consistent result of "No indication of voice segments" or "Not enough voice samples." To overcome the inability to analyze the bare FM tones, we added a voice signal to the tone. After recording a female voice, analyzing it, and determining the most consistent signal, the FM frequency was added. The system was then able to identify the signal and process it. The system responded with a spike in the analysis wave every time the FM frequency was introduced to the signal. These results were recorded in the test log for the Vericator analysis. The signals listed in Figure 1 were processed through the Vericator system.

7.5.1.4 Diogenes Lantern

The test waves were re-sampled to an 11,025 Hz sampling rate, 8-bit mono, so that the Diogenes Lantern system would accept the signal. The FM signals were 3-4 seconds long. The signals were processed through the Lantern system. The graphs were labeled according to frequency, depth of modulation, and modulation rate. The signals listed in Figure 1 were processed through the Diogenes Lantern.
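As a rough illustration of the FM test signals described in Section 7.5.1.2, the sketch below synthesizes one tone from a carrier frequency, a modulation rate, and a depth of modulation. It assumes sinusoidal modulation and an 11,025 Hz sample rate; the report does not give the exact Cool Edit settings, and the function name `fm_tone` is hypothetical.

```python
import math

def fm_tone(carrier_hz, rate_hz, depth_hz, duration_s=15.0, sample_rate=11025):
    """Return samples of a sinusoidally frequency-modulated tone.

    The instantaneous frequency is
        carrier_hz + depth_hz * sin(2 * pi * rate_hz * t),
    so depth_hz is the peak deviation from the carrier and rate_hz is how
    often that deviation repeats each second.
    """
    samples = []
    phase = 0.0
    for n in range(int(duration_s * sample_rate)):
        t = n / sample_rate
        inst_freq = carrier_hz + depth_hz * math.sin(2.0 * math.pi * rate_hz * t)
        # Integrate the instantaneous frequency to obtain the phase.
        phase += 2.0 * math.pi * inst_freq / sample_rate
        samples.append(math.sin(phase))
    return samples

# One cell of the test matrix: 80 Hz carrier, 2 Hz rate, 4 Hz depth, 1 second.
example = fm_tone(80.0, 2.0, 4.0, duration_s=1.0)
```

Looping this function over the carrier, rate, and depth values of Figure 1 would give a signal set of the same general shape as the one used in Test 1.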

7.5.1.5 Summary (Test 1)

The tests were performed, the data was documented, and the results were compared. The Vericator and Diogenes Lantern systems were utilized in this evaluation and their technology was tested. The primary goal of this phase of the VSA evaluation was to determine whether the microtremor claim is the VSA's true theory of operation. For the purposes of this test, the nature of the results (stress or no stress indicated) was not taken into account. The results were found to be consistent across the board, with little variation in response to changes in the modulation rate or depth of modulation. For example, the analysis of the 80 Hz FM test wave with a depth of modulation of 1 Hz and a modulation rate of 1 Hz differed very little from that of an 80 Hz FM test wave with a depth of modulation of 4 Hz and a modulation rate of 25 Hz. Since the indicated stress did not vary with the input signal, it can be assumed that the systems tested do not use microtremors as indicated in their claims.

It was determined, late in the testing phase of this project, that the Diogenes Lantern system measures the energy change of the spectrum envelope between 20 Hz and 40 Hz. This change of energy in the speech envelope is what the Diogenes Lantern system claims to be microtremors. If an individual is under stress, their vocal tract muscles are likely to tighten up. When they do, the energy of the voice signal changes abruptly when the individual starts and finishes talking, and while the individual talks there is less variation of energy within this 20 Hz band. When an individual is not stressed, their voice energy rises slowly to a peak when they start to speak, then varies until the individual stops speaking, at which point it slowly tails off. This algorithm was coded in the laboratory and run on the same audio signal. As seen in the waveforms in Figures 2a and 2b, the results were identical to those of the Diogenes Lantern system. The same comparison can be seen in Figures 2c and 2d. These figures prove that the algorithm used in the Diogenes Lantern system is energy based. This discovery makes the artificial signal test obsolete, since its objective was to determine whether these units detect frequency modulation of an audio signal.

The Vericator claims that it analyzes multiple features of speech to determine if an individual is lying. Because this information is proprietary, it was not proven whether this claim is true, nor which speech features are analyzed. However, it is likely that the system does run multiple algorithms simultaneously, given the multiple waveforms it displays. Since the Vericator did not react to this test, it is safe to say that the measurement of micromuscle tremor is not one of the speech feature algorithms used in the system.

Figure 2a: Diogenes Lantern System Output (No Stress Indicated)

Figure 2b: Matlab Output (No Stress Indicated)

Figure 2c: Diogenes Lantern System Output (Stress Indicated)

Figure 2: Waveforms to the Diogenes Lantern system
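The report does not reproduce the laboratory code behind the energy-based algorithm of Section 7.5.1.5. The sketch below is only one possible reading of "energy change of the envelope between 20 Hz and 40 Hz": it computes a short-time energy envelope and then sums the envelope's spectral energy in that band. The frame length, the function names, and the use of a plain DFT are all assumptions.

```python
import math

def energy_envelope(samples, sample_rate, frame_s=0.005):
    """Mean signal energy in consecutive non-overlapping frames."""
    frame = max(1, round(frame_s * sample_rate))
    return [
        sum(x * x for x in samples[i:i + frame]) / frame
        for i in range(0, len(samples) - frame + 1, frame)
    ]

def band_energy(envelope, env_rate, lo_hz=20.0, hi_hz=40.0):
    """Energy of the envelope's DFT bins whose frequency lies in [lo_hz, hi_hz].

    env_rate is the envelope sampling rate in Hz (frames per second).
    """
    n = len(envelope)
    avg = sum(envelope) / n
    total = 0.0
    for k in range(1, n // 2 + 1):
        freq = k * env_rate / n
        if lo_hz <= freq <= hi_hz:
            re = sum((e - avg) * math.cos(2.0 * math.pi * k * i / n)
                     for i, e in enumerate(envelope))
            im = sum((e - avg) * math.sin(2.0 * math.pi * k * i / n)
                     for i, e in enumerate(envelope))
            total += (re * re + im * im) / (n * n)
    return total
```

With a 5 ms frame the envelope is sampled at 200 frames per second, so a steady tone yields almost no energy in the 20 Hz to 40 Hz band, while a tone whose loudness fluctuates 30 times per second yields a clearly larger value.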

7.5.2 Source Consistency Tests (Test 2)

One of the major questions presented to the engineers testing the voice stress systems was, "Is there a difference in the analysis of an audio file when different media are used?" The different media could be a Digital Audio Tape (DAT) system, a cassette recorder, or a telephone input device. Each recording device has its own properties, which could affect the overall analysis by the examiners.

7.5.2.1 Objective (Test 2)

This experiment is designed to compare the analysis of identical signals recorded on the different media.

7.5.2.2 Scope

Identical signals were fed several times into these media, according to Figure 3, to evaluate the consistency of the results from the two VSA systems. The analysis of the output was then compared to the analysis of the output of the same signal from a different type of media. This indicated whether or not the type of media plays an important role in the evaluation and analysis of the voiced responses.

7.5.2.3 Test

AFRL and ACS Defense jointly collected 60 voiced utterances from different males and females. Each utterance was captured simultaneously by the computer (.wav format), an analog cassette recorder, and a 48 kHz Digital Audio Tape (DAT) recorder (see Figure 3). The audio from each of the three media (cassette, computer, and DAT) was analyzed separately. The live feed was connected directly into the VSA computer; the output was analyzed and the results were printed. At the same time, the utterance was recorded on the DAT; this signal was later processed in the VSA unit for reanalysis and the results were printed. Also at the same time, the utterance was recorded on a cassette tape; this signal was later processed in the VSA unit to be re-analyzed, and again the results were printed.

[Block diagram: live data, DAT, and cassette inputs feeding the VSA under test]

Figure 3: Test Configuration for Source Consistency Tests

7.5.2.4 Examination Results

Voice analysis reported consistent results for DAT and live voice. Each utterance was examined, and all the waveforms and analyses were found to be identical. With a cassette recorder, results similar to the live data were obtained. When recording with a cassette recorder, however, care needs to be taken when adjusting the automatic gain control (AGC). If the recording volume is not set accurately, the input voice signal gets clipped, so the output waveform is distorted when it is processed. This could result in an analysis that is completely different from the truth, leading the examiner to an incorrect result. These discrepancies can be seen in Figures 4a - 4d for the Diogenes Lantern system. They are also evident with the Vericator, which reacted differently each time the same clipped data was input into the system. For example, a clipped audio segment processed in the Vericator might cause the system to display truth, while at another time the same clipped data would cause it to display false statement.

The charts in Figure 4 show how much the waveform changes when recording with the cassette recorder without the AGC set. The input file (top waveform) is consistent for Figures 4a, 4b, and 4c. It is clipped for Figure 4d. This corrupts the output signal (bottom waveform), as seen when comparing Figure 4d with the others. Other waveforms can be reviewed in Appendix B.

Figure 4a: Original Signal

Figure 4b: Data recorded on DAT

Figure 4c: Data recorded on cassette recorded with the AGC set

Figure 4d: Data recorded on cassette recorded without the AGC set

Figure 4: Waveform changes using the cassette recorder without the AGC set

7.5.2.5 Summary Test 2

Based on the results of Test 2, it is recommended that all recordings intended for analysis be made using the DAT as the recording medium. This eliminates any media effects on the audio signal and provides consistent results. It is also absolutely necessary to use a Shure microphone model SM58, or one with equivalent specifications. With the cassette recorder, it was shown that human error could change the results of any findings. As shown, this type of recording distorts the input audio signal and therefore gives the VSA units clipped audio data to process.
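The clipping problem noted for cassette recordings can be screened for before a recording is analyzed. The following is a minimal sketch, assuming samples normalized to the range -1.0 to 1.0; the helper names and the tolerance are hypothetical and are not part of either VSA product.

```python
import math

def hard_clip(samples, limit=1.0):
    """Simulate a recorder that cannot represent values beyond +/- limit."""
    return [max(-limit, min(limit, x)) for x in samples]

def clipped_fraction(samples, full_scale=1.0, tolerance=1e-3):
    """Fraction of samples sitting at (or within tolerance of) full scale."""
    limit = full_scale * (1.0 - tolerance)
    return sum(1 for x in samples if abs(x) >= limit) / len(samples)

sample_rate = 8000
# A tone recorded at a safe level, and the same tone driven 4x harder
# into a recorder that clips at full scale.
quiet = [0.5 * math.sin(2.0 * math.pi * 440 * n / sample_rate)
         for n in range(sample_rate)]
overdriven = hard_clip([4.0 * x for x in quiet])

quiet_ratio = clipped_fraction(quiet)
overdriven_ratio = clipped_fraction(overdriven)
```

A recording whose clipped fraction is well above zero is a candidate for re-recording at a lower level rather than for stress analysis.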

7.5.3 Objective (Test 3)

The next stage of the VSA evaluation consisted of assessing the results of known-ground-truth data when processed through the Vericator and Diogenes systems. Segmented audio data from solved law enforcement cases was fed to the systems.

7.5.3.1 Data Evaluation (Test 3)

Data: Audio statements from 2 sets of polygraph tests performed by a certified polygraphist

Evaluators: Analysts who were certified by the Diogenes and Vericator manufacturers

7.5.3.2 Data Collection and Down Sampling

Six videotapes were obtained from DODPI, of two suspects in two separate murder cases. The audio portion of each videotape was extracted and digitized into .wav files. The audio files were then downsampled from 48 kHz to 11.025 kHz to meet the manufacturers' requirements. This step is necessary to make the data compatible with the VSA units, since these units are programmed to accept data at the 11.025 kHz sampling rate. These digital audio files were loaded into the VSA computers.
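The report does not say which tool performed the 48 kHz to 11.025 kHz conversion. As an illustration only, the sketch below resamples by evaluating a Hann-windowed sinc low-pass filter at each output instant; the filter length and the function name `resample` are assumptions, and a production workflow would normally rely on an established audio library instead.

```python
import math

def resample(samples, rate_in=48000, rate_out=11025, half_width=32):
    """Downsample by windowed-sinc interpolation (11025/48000 = 147/640).

    The low-pass cutoff is placed at the output Nyquist frequency so that
    content above rate_out / 2 is attenuated before it can alias.
    """
    cutoff = rate_out / rate_in          # cutoff as a fraction of input Nyquist
    n_out = len(samples) * rate_out // rate_in
    out = []
    for m in range(n_out):
        center = m * rate_in / rate_out  # output instant, in input samples
        first = math.floor(center) - half_width
        acc = 0.0
        for k in range(first, first + 2 * half_width + 1):
            if 0 <= k < len(samples):
                d = center - k
                arg = math.pi * cutoff * d
                sinc = 1.0 if arg == 0.0 else math.sin(arg) / arg
                window = 0.5 * (1.0 + math.cos(math.pi * d / (half_width + 1)))
                acc += samples[k] * cutoff * sinc * window
        out.append(acc)
    return out

# A 0.1 second, 440 Hz tone sampled at 48 kHz, converted to 11.025 kHz.
tone_48k = [math.sin(2.0 * math.pi * 440 * n / 48000) for n in range(4800)]
tone_11k = resample(tone_48k)
```

Away from the ends of the recording, the converted tone should closely track a 440 Hz sine sampled directly at 11.025 kHz.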

7.5.3.3 Segmentation

Once the audio data was entered and stored in the two computer systems, we proceeded to segment it. For the Lantern system we created an individual .wav file for each utterance that the defendant made, usually a yes or no answer in these cases. This was done so that the Lantern could process short utterances, as suggested by the manufacturer. There were a total of 45 questions, ranging from relevant to non-relevant. The Vericator performs its own unique segmentation, so we segmented the audio using its process. This was done through the off-line mode.

7.5.3.4 Testing

Each audio segment was processed through the Lantern system, and each wave pattern was analyzed separately. Each waveform was compared to the others to identify any distinct changes due to stress. Each file that gave an indication of stress was marked and compared to the baseline. Each audio file was also processed through the off-line mode of the Vericator. Results were automatically recorded by the system.

7.5.3.5 Results

The stress ground truth was obtained from the polygraph examiner and from court proceedings, via the outcomes of each of the interviews. Both suspects confessed and were subsequently convicted of murder. All of the relevant stress sentences were verified. Each of the 48 utterances was analyzed and compared to the ground truth. Each system gave indications of high levels of stress where stress indicators were verified. The Vericator system scored 100% in its indication of some form of stress, which it displayed as deceitful, high stress, or probably lying. The Lantern system also scored 100% in its indication of stress through the waveform analysis. Both systems gave the examiner a conclusive indication of relevant stress.

7.6 FIELD TESTING

In the field testing portion of this study, two local police investigators each obtained a VSA system: Mike Adsit of the Canastota, New York Police Department and James F. Masucci of the Rome, New York Police Department. Mike Adsit used the Vericator and James Masucci used the Diogenes Lantern. The goal of this phase of testing was to determine the feasibility of these systems in the law enforcement environment. It also provided the unbiased opinions of experienced investigators. The following are their reports:

I have been in Law enforcement for the past twenty-one years, and during this time I have had the opportunity to see all facets of crime and investigations. I have been involved in crimes dealing with the least punishable to the severest of them all. I have had the opportunity to attend schools that taught me how to detect when a suspect is being deceptive during questioning. In some cases it was difficult to determine if a suspect was deceptive, and that made my job harder until the summer of 1997 when I came to the Law Enforcement Analysis Facility (LEAF) for help.

My first contact was with Sharon Walters who advised me that the U.S. Government (Military) and a group of Research Technicians (Private Contractors) at the Rome Research and Technology Facility were about to take on the task of evaluating some technology dealing with truth verification. I was also told that this evaluation might be effective in Law Enforcement. At that time I was pleased for many reasons.

I was asked to join this task force to assist the government in this evaluation, but first I was to learn what truth verification was. This required me to learn and study what a microtremor was, and how algorithms mathematically calculated the stress in a human voice. I reviewed the technology and was given a voice stress program called Truster-Pro, now known as Vericator. Using this system I was able to interview a subject who may have been involved in a crime. First an interview is performed to determine the facts as he/she knew them. Then I was able to give the subject one or two tests to determine the truth or deception. Finally, a post interrogation would be conducted in an attempt to get a confession.

Keeping the voice stress technology in mind during the testing of a subject, one should remember that this type of technology in itself is only an investigative tool, and cannot be used to convict the subject. Along with observing a subject's involuntary movements, such as those of the eyes, legs, and hands, I have had great success with the voice stress technology. I have had the opportunity to use this technology on crimes from Petty Larceny to Rapes, and have been able to determine, either from the victim(s) or the suspect(s), the deception or truth. Not all of the tests were positive, but on the majority of them I was able to get true confessions to the crime. Over the past three years I would say that I have achieved a success rate of about 97 percent on tests vs. confessions. I believe in this system's capability of becoming a valuable investigative tool for the law enforcement officer on the streets of our cities, towns and villages across the nation.

Respectfully,

Michael G. Adsit Criminal Investigator

I have been using the Lantern Voice Stress Analyzer from Diogenes since October of 1997. I have had many rewarding experiences with the Lantern. I have successfully used it in homicide, arson, robbery, burglary, assault and sexual abuse cases. I do all of the testing for the Oneida County Child Advocacy Center, formerly known as the Oneida County Sexual Abuse Task Force. I point this out to show that I have tried the Lantern on just about every type of crime. Although I did not keep statistics, I feel that I can safely say that with the aid of the Lantern, I have been able to eliminate about as many suspects as I have found reason to "dig into" a little more.

I am not much of a technical expert, but I have made the following observations. I do believe the theory of the micro-muscle tremor and the need for "jeopardy." I have found that without jeopardy, or a fear of some consequence to lying, you do not get accurate charts. I have seen a tremendous difference in the voice stress patterns when there is jeopardy - vs - no jeopardy. For example, I have told suspects to intentionally lie on certain questions during the test. I have found that when they do lie over something that means nothing, you don't get a clear-cut stress pattern. I have seen a small amount of "stress" in those answers, but nothing comparable to a stress pattern when the suspect lies on a relevant question.

As far as recorded material being analyzed by the Lantern, I personally am not a big proponent. I have had some success in analyzing audiotapes, but I find the charts much more difficult to analyze. I have used both cassette and DAT and I really don't see much of a difference between the two. They are both just as difficult for me to interpret. The patterns seem to appear much different than when a "live" test is administered. I do not feel that I can say that the taped material gives inaccurate readings; it may be just a personal preference on my part.

Another concern that I have about "live tests" is the possibility of interference. I have noticed that if I am conducting a test in a room which contains a computer, there are noticeable differences in the patterns produced. I have shut the computer off and then asked the same question and received the same answer from the suspect, but the pattern is now different. Assuming that there was no other ambient noise during both times the question was asked, the patterns should be the same, except of course, if the interference was coming from the computer. On the same note, I have also noticed possible interference from fluorescent lights. This should be an area of concern and perhaps more testing should be done to determine if the Lantern operates effectively under the above listed conditions.

One final and perhaps most important point I would make regarding the Lantern is the fact that you should not rely solely on the charts to make a determination if someone is "lying." I am not saying that Diogenes professes that this is a "lie detector", actually they profess the opposite. I am just saying that this should never be looked at as a "lie detector." I have truly found that it CANNOT detect lies. As you know, it DOES detect stress. Stress, however, does not always equate to a "lie." I have found in several cases that a person "fails", if you will, on all relevant/crime questions, but has been found to have not committed the crime.

I will close by saying that my experiences with the Lantern have been very positive, however, it cannot be looked upon as a "magic bullet." It is simply an investigative tool. Interrogation and the manner in which questions are formulated are very important. I truly believe that a person that is not strong in the interrogation area will not be as successful with it, as the person that possesses strong interrogation skills. There is much open to interpretation on the charts as far as I can see. It is very situational and again, can NEVER be determined a "lie detector."

James F. Masucci Rome P.D. 05/17/00

These two reports reinforce the results of the technical testing, in that these systems do indicate stress. Caution should be taken when using these systems. They should be used only as an investigative tool, and should not be relied on totally for a case conclusion.

8.0 CONCLUSION

After reviewing the three technical tests performed, it can be stated that these two VSA units do recognize stress. Although these systems state they detect deception, this was not proven. This study does show, from a number of speech under stress studies, that linear and non-linear features are useful for stress classification. Due to the lack of deceptive stress data available, classification of deceptive stress versus emotional stress or physical stress was not tested. This distinction plays a vital role in the detection and classification of stress. Many suspects are under an extreme amount of stress when being interrogated. Do these VSA systems actually differentiate between the different types of stress? This still needs to be proven. It was shown, under Test 1, that the Diogenes Lantern system detects stress via the amount of energy in the speech envelope. Even though this system performed well under the technical and field tests, it seems from an engineering point of view that one feature, such as duration, glottal source factors, pitch distribution, spectral structure, or intensity, is insufficient to detect and classify deceptive stress. In the study under Dr. Hansen, it was shown that fusion of features helps to increase the accuracy of stress classification. It was proven that the systems tested will and do give the same response when the audio is recorded as opposed to live. The only requirement is that, when recording with a cassette recorder, the AGC be set, which prevents audio clipping. To eliminate the possibility of this error, recording with a DAT is the safest approach.

9.0 SUGGESTED FOLLOW ON

As stated, the biggest challenge encountered during this project was the unavailability of sufficient deceptive stress data with ground truth. This data is needed to make an accurate assessment of these systems with respect to detecting deception. To develop this database, three parties need to be involved. Walter Reed Army Institute of Research (WRAIR) will be tasked to collect/analyze a robust stress database, while evaluating deceptive stress vs. physiological and biochemical stress. The Department of Defense Polygraph Institute (DODPI) along with AFRL/Rome Research Site (RRS) and NLECTC/NE (Law Enforcement Analysis Facility (LEAF)) will collect deceptive stress data and test VSA systems simultaneously with polygraph under neutral/"crime subject" conditions. Personnel from the LEAF will be used because of their extensive VSA background. For years, the Department of Justice and Law Enforcement Agencies in the United States have had only the polygraph technology as a "deception" indicator. With this recommended program, another "deception" indicator will be evaluated separately, and in a complementary role with polygraph technology. AFRL/RRS will investigate this complementary role and, in addition, possibly lay the groundwork for the future "fusion" of the two technologies, in an attempt to raise the confidence levels to the more acceptable standards of our justice system.

REFERENCES

[1] Allan Bell Enterprises, "Comparisons of Existing Lie-Detection Equipment," Unpublished.

[2] Allan D. Bell Jr., "The PSE: A Decade of Controversy," Security Management, March 1981, pgs. 63-73.

[3] "TVSA3: Voice Stress Analysis Freeware" http://www.whatreallyhappened.com/RANCHO/POLITICSA^SA/truthvsa.html, August 1999.

[4] "The CCS Group" http://www.spyzone.com/catalog/index.html

[5] "National Institute for Truth Verification" http://www.cvsa1.com

[6] "Diogenes Company" http://www.diogenesgroup.com/

[7] "Trustech LTD." http://www.truster.com/

[8] Joseph F. Kubis, "Comparison of Voice Analysis and Polygraph as Lie Detection Procedures," U.S. Army Land Warfare Laboratory, LWL-CR-03B70, August 1973.

[9] Olof Lippold, "Physiological Tremor," Scientific American, Volume 224, Number 3, March 1971.

[10] D. H. VanDercar, J. Greaner, N.S. Hibler, C.D. Spielberger, and S. Bloch, "A Description and Analysis of the Operation and Validity of the Psychological Stress Evaluator," Journal of Forensic Sciences, January 1980, pgs. 174-188.

[11] "Voice Stress Analyzer Kit" Instructions, XANDI Electronics Model No. XVA250.

[12] Merck Manual John H.L. Hansen, Guojun Zhou and Bryan L. Pellom, "Methods for Voice Stress Analysis and Classification," Robust Speech Processing Laboratory, University of Colorado, July 1999.

[13] John J. Palmatier, "A Field Study to Test the Validity and Comparative Accuracy of Voice Stress Analysis as Measured by the CVSA: In a Psychophysiological Context," Research Proposal, The Michigan Department of State Police, Forensic Science Division, January 1997.

[14] "Validation Program for Lie Detection Techniques Using Voice Analysis," U.S. Army Land Warfare Laboratory Purchase Order #DAAD05-69-M-5025, August 1969.

Appendix A

SPT - A numeric value describing the relatively high frequency range. Vericator associates this value with emotional stress level.

SPJ - A numeric value describing the relatively low frequency range. Vericator associates this value with cognitive stress level.

JQ - A numeric value describing the distribution uniformity of the relatively low frequency range. Vericator associates this value with global stress level.

AVJ - A numeric value describing the average range of the relatively low frequency range. Vericator associates this value with thinking level.

SOS (SFLC) - Say Or Stop, a numeric value describing the changes in the SPT and SPJ values within a single sample sequence. Vericator associates this value with fear and the "breaking point" of the subject.

Appendix B

Data recorded live

Data recorded on a DAT

Data recorded on cassette recorded with the AGC set

Data recorded on cassette recorded without the AGC set

Data recorded live

Data recorded on a DAT

Data recorded on cassette recorded with the AGC set

Data recorded on cassette recorded without the AGC set

Appendix C

ANALYTICAL SYSTEMS ENGINEERING CORPORATION (NOW ACS DEFENSE, INC.) & U.S. AIR FORCE RESEARCH LABORATORY ROME, NY

METHODS FOR VOICE STRESS ANALYSIS AND CLASSIFICATION

John H.L. Hansen Principal Investigator

Guojun Zhou Bryan L. Pellom

Final Technical Report RSPL-99-August Project Period: August 1998 — July 1999

RSPL: ROBUST SPEECH PROCESSING LABORATORY
CENTER FOR SPOKEN LANGUAGE UNDERSTANDING
University of Colorado, Campus Box 594
Boulder, Colorado U.S.A.
Phone: (303) 735-5148
FAX: (303) 735-5072
INTERNET: jhlh@cslu.colorado.edu
http://cslu.colorado.edu/rspl/

1 Final Project Report

1.1 Executive Summary: Current speech processing algorithms for classification and assessment of speaker stress, which are designed to address DOD and law enforcement applications in such areas as automatic speech recognition, speaker identification, or gisting techniques for message sorting and translation, lack the necessary signal processing capability to achieve reliable performance in emotional or task induced stressed environments. Unfortunately, in most applications requiring a man-machine interface, speaker monitoring, or analysis of subject interviews or telephone callers, it is specifically these high stress, emotional, deceitful, or emergency situations where reliable performance is critical. There has been much activity recently in the commercialization of voice stress analyzers for law enforcement applications. This study does not seek to directly prove or disprove these commercial systems, since in most cases the underlying details of their algorithms are typically not revealed. Instead, we focus on features which have been used for stress assessment in both the linear and nonlinear speech processing domains. The linear speech features include: pitch, periodicity, jitter, glottal spectral slope, duration, intensity, formant locations, spectral structure as represented by the Mel-frequency cepstral parameters (MFCC), and the CVSA based measure. A number of nonlinear features were also considered, based on signal processing methods using the Teager Energy Operator (TEO). Our focus in this study was to use the measure based on an auditory critical-band frequency partition with temporal consistency represented by the autocorrelation envelope response across critical bands (i.e., the TEO-CB-Auto-Env feature). These features were implemented and evaluated using speech data from a number of corpora (i.e., SUSAS, SUSC-0, and ASEC-Stress data).

VOICE STRESS ANALYSIS: During the project period, we completed three manuscripts, which summarize research conducted over the past two years on voice stress analysis for NATO IST/TG-1. The studies focused on analysis of fundamental frequency (pitch), duration, intensity, glottal spectral structure, and vocal tract formant characteristics. An analysis of vocal tract articulatory profiles under stress was also considered.

STRESS CLASSIFICATION: In this area, we formulated an optimum detection algorithm for stress classification based on Bayesian hypothesis testing. Five speech production feature areas, originally investigated for analysis of speech under stress, were evaluated for optimum stress detection using the SUSAS (Speech Under Simulated and Actual Stress) database. Results showed that pitch (fundamental frequency) was the best feature for stress classification (equal error rate EER=11.4%), followed by individual phone class intensity and duration (EER=23.1% and EER=30.8%), and to a lesser degree glottal spectral slope (EER=32.2%). Individual formant locations (first and second) were not reliable features for stress classification (EER=45.5% and 46.3%).
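To make the equal error rate figures above concrete, the following is a minimal sketch of a two-class likelihood ratio score and an EER search. It is not the report's detector: the single-Gaussian class models, the function names, and the (mean, std) parameters are illustrative assumptions, and the EER is approximated at the candidate threshold where false alarm and miss rates are closest.

```python
import math

def gaussian_log_pdf(x, mean, std):
    """Log density of a univariate Gaussian (assumed class model)."""
    return -0.5 * math.log(2.0 * math.pi * std * std) - (x - mean) ** 2 / (2.0 * std * std)

def log_likelihood_ratio(x, neutral, stress):
    """Score a feature value; neutral and stress are hypothetical (mean, std) pairs.
    Positive scores favor the stress hypothesis."""
    return gaussian_log_pdf(x, *stress) - gaussian_log_pdf(x, *neutral)

def equal_error_rate(neutral_scores, stress_scores):
    """Return (eer, threshold) for a detector that decides 'stress' when score >= threshold."""
    best = None
    for t in sorted(set(neutral_scores) | set(stress_scores)):
        # False alarm: a neutral token scored at or above the threshold.
        false_alarm = sum(s >= t for s in neutral_scores) / len(neutral_scores)
        # Miss: a stressed token scored below the threshold.
        miss = sum(s < t for s in stress_scores) / len(stress_scores)
        gap = abs(false_alarm - miss)
        if best is None or gap < best[0]:
            best = (gap, (false_alarm + miss) / 2.0, t)
    return best[1], best[2]
```

With more overlap between the neutral and stressed score distributions, the EER rises toward 50%, which is the sense in which the formant location features above are described as unreliable.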

Next, we considered features derived from a signal processing scheme based on the nonlinear Teager Energy Operator (TEO). This operator assumes that a speech resonance can be modeled by an AM/FM component. While several TEO based stress classification measures were proposed, here we present the TEO-CB-Auto-Env. This measure is based on an auditory critical band frequency partition, followed by an autocorrelation envelope estimation. The feature represents the time-domain correlation of AM/FM based structure across a partitioned frequency band. Results presented for stress detection show that the TEO-CB-Auto-Env measure (mean classification rate of 94.2%) outperforms pitch (mean classification rate of 88.5%) and vocal tract spectral characteristics as represented by the mel-frequency cepstral coefficients (MFCCs) (mean classification rate of 89.6%).

STRESS ASSESSMENT: During the project period, we extended the application of stress classification based measures to the problem of stress assessment. To determine the usefulness of the nonlinear TEO-CB-Auto-Env measure for stress assessment, we considered an evaluation of the SUSC-0 speech corpus from NATO IST/TG-01. This evaluation focused on the Mayday! domain, which involved voice communication between an aircraft pilot and controller during an emergency in which an engine fails. The TEO based measure was shown to follow the perceived level of stress in the extracted voice recordings, based on a secondary informal listener evaluation. This result is meaningful, since the anchor neutral/stress models used in the assessment were based on speech data from the SUSAS stress database (i.e., open speakers and open training speech material).
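The Teager Energy Operator behind the TEO-CB-Auto-Env measure discussed above can be illustrated with a short sketch. It uses the standard discrete form of the operator, psi[n] = x[n]^2 - x[n-1] x[n+1], and then the area under a normalized autocorrelation of the TEO profile. The critical band filterbank is omitted, and the function names, frame length, and lag range are assumptions for illustration rather than the report's settings.

```python
import math

def teager_energy(x):
    """Discrete Teager Energy Operator on interior samples:
    psi[n] = x[n]^2 - x[n-1] * x[n+1]."""
    return [x[n] * x[n] - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]

def normalized_autocorrelation(values, max_lag):
    """Autocorrelation for lags 0..max_lag, normalized by the lag-0 energy."""
    energy = sum(v * v for v in values)
    if energy == 0.0:
        return [0.0] * (max_lag + 1)
    count = len(values)
    return [
        sum(values[i] * values[i + lag] for i in range(count - lag)) / energy
        for lag in range(max_lag + 1)
    ]

def autocorrelation_area(values, max_lag):
    """Sum of the normalized autocorrelation values (a simple area estimate)."""
    return sum(normalized_autocorrelation(values, max_lag))

# A pure tone A*cos(W*n) has a constant TEO profile equal to A^2 * sin(W)^2.
tone = [2.0 * math.cos(0.3 * n) for n in range(102)]
profile = teager_energy(tone)
```

For a steady tone the profile is constant, so the normalized autocorrelation decays linearly with lag. Modulation within a band makes the profile fluctuate and changes that area, which is the kind of variation an autocorrelation envelope feature is meant to capture.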

COMMERCIAL/CONVENTIONAL 'VSA' FEATURES: A number of commercial voice stress analyzers have recently appeared on the market. These methods are based on some form of speech signal processing to extract excitation information related to small microtremors which are believed to be associated with the laryngeal muscles during voiced excitation. Physiological tremor is produced through repetitive muscle contraction and relaxation. Slow tremor occurs at rates between 3-5 Hz, while rapid tremor occurs between 6-12 Hz. Benign hereditary tremor is a fine-to-coarse slow tremor that usually affects the hands, head, and voice. Such tremor generally increases with age, and in some cases (some families), ingestion of small amounts of alcohol markedly suppresses the tremor. Other forms of tremor in voice are associated with neurological changes in the speaker, such as the resting tremor seen in Parkinson's speech. There are many causes of tremor, which include medical illness, drugs, stress, and brain disorders such as multiple sclerosis.

The focus here is on how stress impacts the laryngeal muscles during normal production of speech, and whether speech processing algorithms/systems are able to extract and quantify such information if it exists. In our study, we focused on excitation features which include (i) normalized pitch, (ii) periodicity, and (iii) jitter. These features have long been used in the medical field for assessing changes in speech production due to pathology such as vocal fold cancer, vocal fold nodules, or other physically based change in the structure or movement of the vocal folds during phonation. Here, we evaluated these features for the purpose of voice stress classification using a Gaussian mixture model (GMM) classifier. In the evaluations, we considered a range of GMM classifier mixture weights, training iterations, static features with and without first and second order derivative features, and combinations with spectral parameters. The best GMM classifier included all three excitation features with first and second order derivatives, a feature trained variance threshold of 0.001, 64 Gaussian mixtures, and at least some form of overall vocal tract spectral structure if the data is available. It should be noted that these methods are effective only if the speaker conveys changes in his excitation, or in the laryngeal muscles associated with microtremors. A number of useful studies have considered the use of computer voice stress analyzers for the purpose of detection of deception. Some of these include:

D. VanDercar, J. Greaner, N. Hibler, C. Spielberger, S. Bloch, "A Description and Analysis of the Operation and Validity of the Psychological Stress Evaluator," Journal of Forensic Sciences, vol. 25, no. 1, pp. 174-188, Jan. 1980.

F. Horvath, "An Experimental Comparison of the Psychological Stress Evaluator and the Galvanic Skin Response in Detection of Deception," Journal of Applied Psychology, vol. 63, no. 3, pp. 338-344, 1978.

O. Lippold, "Physiological Tremor," Scientific American, vol. 224, no. 3, pp. 65-73, 1971.

In addition to these references, there are a number of commercial voice stress analysis systems (e.g., the Israeli system called TRUSTER, the Psychological Stress Evaluator (PSE) by Verimetrics, the Computerized Voice Stress Analyzer (CVSA), and others). Our findings suggest that if the input speaker does in fact produce microtremors in their laryngeal muscles during voiced speech production, then the simple filtering operation proposed in PSE (Bell, et al., 1976), CVSA, and others can extract the presence or absence of such tremor. It has been suggested that if the natural tremor is absent, then the speaker is experiencing stress, and if the fluctuations are present the person is not experiencing stress. Again, extreme caution should be exercised in using these devices, because it is not necessarily true that the muscle tremor associated with stress or deception will always affect those laryngeal muscles used during phonation in the same manner for all speakers. This is a well known and documented observation in the area of vocal fold cancer detection (Hansen, Gavidia-Ceballos, Kaiser, 1998), since it is possible that a physiological change in the vocal folds (a cancer growth, or muscle change/paralysis) may not always impact the normal mucosal wave, and therefore will not be represented in the sound pressure wave which excites the vocal tract. We discuss a number of these issues in the summary and conclusions section.
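Of the excitation features named above (normalized pitch, periodicity, and jitter), jitter is the simplest to illustrate. The report does not give its formula, so the sketch below assumes the common cycle-to-cycle definition: the mean absolute difference between consecutive pitch periods divided by the mean period. The function name and the input format (a list of pitch period durations) are hypothetical.

```python
def relative_jitter(periods):
    """Mean absolute difference of consecutive pitch periods, divided by the
    mean period. This is an assumed definition, not taken from the report."""
    if len(periods) < 2:
        raise ValueError("need at least two pitch periods")
    diffs = [abs(a - b) for a, b in zip(periods, periods[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_period = sum(periods) / len(periods)
    return mean_diff / mean_period
```

A perfectly periodic voice gives a jitter of zero. Any cycle-to-cycle irregularity in the excitation, whatever its cause, raises the value, which is one reason such features cannot by themselves separate stress from pathology or other sources of variation.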

STRESS ASSESSMENT: Mount Carmel Law Enforcement Evaluation. In the final section of this study, we consider the application of the voice excitation features used by commercial voice stress analyzers for stress assessment of the 911 audio recordings obtained from the law enforcement encounter with an extremist sect in Mount Carmel. These recordings were between the sect leader and a 911 operator during the FBI encounter. For analysis, we considered normalized pitch, periodicity, jitter, and a software implemented version of the commercial CVSA system. The resulting feature profiles were compared with feature profiles obtained from an evaluation of speech data from the SUSAS speech under stress database. The profiles for normalized pitch and periodicity did not appear to be reliable indicators of speaker stress. The CVSA profile did show some of the structure which was expected for the Mt. Carmel data, but was not as consistent for SUSAS actual stressed speech (this could be explained by the fact that this portion of SUSAS was collected during amusement park roller coaster rides, which could have introduced physical vibrations during speech production). An extensive evaluation of the entire series of sentences for stress assessment using pitch, spectral MFCC features, and the TEO-CB-Auto-Env measure showed that pitch and the new TEO-CB-Auto-Env measure produced more consistent assessment scores. A number of issues regarding stress classification and assessment using either traditional excitation features or nonlinear speech features must be addressed to achieve successful voice stress analysis performance. Ultimately, the success of the measure rests on how the speaker imparts, either consciously or subconsciously, stress in their speech production process (through controlled airflow from the lungs, muscle control of the vocal system articulators, or choice of vocabulary). It is suggested that more success could be achieved if the subjective impression of the operator could be reduced for commercial VSA devices. Further training data for model adaptation, and establishing well recognized anchor neutral/stress models for a given speaker in context, should ultimately produce more reliable performance. At best, the available commercial systems should be used with caution if they are to be applied.

1.2 Outline of Report

The outline of this report is as follows. In Section 2, we provide a review of the literature in speech under stress. Next, in Section 3 we present a brief overview of results from analysis of speech production under stress, which include pitch, speech duration, speech intensity, glottal spectral structure, and formant structure. Section 4 discusses methods considered for stress classification. This section focuses on previous approaches, Bayesian stress classification, linear feature classification, and nonlinear based features. Section 5 considers the use of stress classification features for stress assessment using actual stressed speech from a pilot emergency (MAYDAY2 portion of the SUSC-0 corpus from NATO). This work focused on normalized pitch, spectral structure using MFCCs, and the nonlinear TEO-CB-Auto-Env measure. Finally, Section 7 presents a series of probe evaluations of speech data from Mount Carmel. This represents speech data from a high stress law enforcement encounter. Since the available speech was limited, the analysis was restricted to comparisons of feature profiles for linear features, and assessment evaluations using anchor models trained with SUSAS stressed speech data. In the Appendix (Section 9), we identify the software which has been implemented and will be delivered to the sponsor as part of this project.

2 Speech Under Stress: Review of the Literature

Stress is a psychological state that is a response to a perceived threat or task demand and is accompanied by specific emotions (e.g., fear, anxiety, anger). Initial investigations of verbal indicators of stress have focused on identifying speech markers of stress (e.g., stuttering, repetition, tongue-slip). Psychiatrists agree that verbal markers of stress range from highly visible to invisible markers as perceived by the listener (Goldberger and Breznitz, 1982), and that these markers are continuously monitored both consciously and subconsciously by the speaker and thus are prone to correction.

2.1 Acoustic Correlates of Stress and Emotion in Speech

A number of studies have considered analysis of speech under simulated and actual stress conditions (see Table 1), though changes in speech characteristics remain unclear. Thus far, most research has been limited in scope, often using only one or two subjects and analyzing a single parameter (often f0). It is not unusual for researchers to report conflicting results, due to differences in experimental design, level of actual or simulated stress, or interpretation of results. For example, some studies concentrate on analysis of recordings from actual stressful situations (Kuroda, et al., 1976; Simonov and Frolov, 1977; Streeter, et al., 1983; Williams and Stevens, 1972). There is usually little doubt as to the presence of stress in these recordings. However, a quantitative analysis can only be carried out if recordings of the talker speaking the same utterances under stress-free conditions are available. In addition, some researchers argue that speakers in these situations may experience several emotions simultaneously (e.g., the Hindenburg announcer most likely experienced combinations of fear, grief, and anxiety). Another group of studies has been performed using simulated stress or emotions (Hecker, et al., 1968; Hollien and Hicks, 1981a, 1981b; Williams and Stevens, 1972). This offers the advantage of a controlled environment, where a single emotion can be examined with little background noise. In some cases, variable task levels of stress have been used. Other advantages include larger data sets with multiple speakers. The major disadvantage in these studies has been the reduction in task stress levels. In addition, studies using actors may produce exaggerated caricatures of emotions in speech.

In previous work, Williams and Stevens (1972) and Hecker, et al. (1968) found f0 to be the acoustic property most sensitive to the presence of stress. There are several reasons why changes in f0 with time provide information on emotional state. For example, respiration is frequently a sensitive indicator in certain emotional situations. When an individual experiences a stressful situation, his respiration rate increases. This presumably will increase subglottal pressure during speech, which is known to increase f0 during voiced sections (Pickett, 1980). An increased respiration rate also leads to shorter durations of speech between breaths, which would affect the temporal pattern (articulation rate). The dryness of the mouth found during situations of excitement, fear, anger, etc. can also affect speech production (e.g., muscle activity of the larynx and condition of the vocal cords). Muscle activity of the larynx and vibrating vocal cords directly affect the volume velocity through the glottis, which in turn affects f0. Other muscles (for example, those controlling the tongue, lips, jaw, etc.) shape the resonant cavities for sound and therefore do not have a direct influence on f0.

2.2 Analysis Using Simulated Stress or Emotion

Studies using simulated stress or emotion are considered first (see Table 1). We place emphasis on the study by Williams and Stevens (1972), since they considered analysis of recordings from both simulated [footnote 2] and actual [footnote 3] emotional environments. Hicks and Hollien (1981a,b) simulated stress by using mild electrical shock. Hecker et al. (1968) simulated stress by having subjects perform a timed arithmetic task.

Fundamental frequency f0 contours and f0 variability were analyzed for anger, sorrow, and fear by Williams and Stevens (1972). For fear, the f0 contour departed greatly from neutral, while for anger the contour was generally higher throughout, with one or two syllables characterized by large peaks. Hicks and Hollien (1981a,b) found similar increases in f0. However, Hecker et al. (1968) observed conflicting results (some subjects increased, while others decreased f0).

1 Most early studies on speech under stress consider fundamental frequency, but use the term pitch. Since pitch is a perceptual quantity, our research here will focus on fundamental frequency.
2 Simulated recordings consisted of data from actors simulating fear, anger, sorrow, and neutral emotions.
3 Actual recordings consisted of data from the radio announcer during the Hindenburg disaster.

Mean articulation rate in syllables/second was determined for the three emotions considered by Williams and Stevens (1972). Results from fastest to slowest were neutral, anger, fear, and sorrow. Hicks and Hollien (1981a,b) also observed similar results.

Speech intensity, or vocal effort per unit time, during voiced sections was considered by Hicks and Hollien (1981a,b) and Hecker et al. (1968), although inconsistent results occurred across test phrases. Pisoni et al. (1985) and Summers et al. (1988) investigated acoustic-phonetic correlates of speech produced in noise, also called the Lombard effect (Lombard, 1911). With subjects speaking in quiet and in 90 dB SPL white masking noise, results showed an increase in overall amplitude of vocalic sections, increased duration, increased average f0, and reduced spectral tilt. Junqua (1993, 1996) also performed analysis on Lombard effect speech and concluded that female speakers seem to be more intelligible than male speakers. Rostolland (1982a,b) performed acoustic and phonetic studies for shouted speech and observed reduced intelligibility with a raised f0 contour.

In other investigations, Lieberman and Michaels (1962) had subjects simulate eight emotional states. Their approach was to select a parameter as an emotion relayer, extract that parameter, and observe whether the resulting sound could correctly be identified as the simulated emotion by listener groups. While only characteristics of pitch and amplitude were considered, results showed that fear was highly identified using only amplitude information with constant pitch.

2.3 Analysis Using Actual Stress or Emotion Situations

A comparison of results from actual stressful recordings is somewhat difficult, due to the varying parameters measured and levels of stress experienced. However, considering such studies is important, since the analysis may help verify experimental procedures and results from simulated studies.

Kuroda et al. (1976) analyzed tape recordings of pilots with varying mission experience in actual aircraft accidents. Their analysis consisted of finding a parameter related to pitch, termed the vibration space shift rate (VSSR), from speech spectrograms. Their ultimate conclusions showed that as stress increased, so did f0. A more recent study by Ruiz, et al. (1996) considered time and frequency-domain analysis of emergency aircraft cockpit recordings.

Simonov and Frolov (1977) analyzed communications of cosmonauts at various flight stages. Analysis consisted of monitoring heart rate and the spectral centroid of the first vocal tract formant. Though general trends were noted, their summary emphasized the need of further research.

Streeter et al. (1983) carried out a more complete analysis of a telephone conversation between a system operator (SO) and his superior chief (CSO) prior to the 1977 New York City blackout. Analysis consisted of pitch, amplitude, and timing measurements. An attractive feature of the data was the increased situational stress throughout the hour-long conversation. Results were somewhat conflicting, since it appeared SO was passing decision making authority to CSO during the emergency. Results showed that listeners referred to a vocal stress stereotype, which includes elevated pitch and amplitude, and increased variance in these vocal cues.

Finally, Williams and Stevens (1972) performed analysis of the recorded radio announcer during the Hindenburg disaster. In an effort to justify results from their simulated emotions, they had actors recreate the announcer's message. Results were not entirely consistent, though increased average f0 along with tremor were observed for both, with larger variations for the actor. This would indicate that the actor's emotions were to a certain extent overemphasized. This study, as well as other previous studies on speech under stress, is summarized in Table 1.

2.4 Voice Stress Analysis for Law Enforcement

Most of the studies discussed in Sections 2.2 and 2.3, and summarized in Table 1, deal with speech produced in either emotional or task induced stressful environments. There is another area of analysis of speech under stress, which deals with voice tremor and how it applies to detection of deception for law enforcement applications. The study by Cestaro (1995) was an extensive evaluation of the commercial Computer Voice Stress Analyzer (CVSA). In that study, two experiments were designed: first to validate the underlying theory of CVSA, and second to compare the accuracy of CVSA with a traditional polygraph instrument for the problem of stress detection due to deception. He simulated the CVSA signals electronically using signal generators in order to have careful control over what commercial claims have been made for CVSA. His findings show that CVSA was less successful and accurate than a polygraph (38% versus 62%). His results suggest that there may be a systematic and predictable relationship between voice patterns and the stress related to deception.

Another study by VanDercar, et al. (1980) detailed an evaluation of the Psychological Stress Evaluator (PSE) as a commercial system for representing different levels of speaker stress. They measured PSE profiles in addition to heart rate and State Trait Anxiety Inventory (STAI) scores during relaxed and high stress conditions (the latter induced through the threat of electric shock). When the potential for stress was high, PSE, STAI, and heart rate measures all reflected different levels of stress and were significantly correlated with each other. A second study with reduced stress levels failed to show the reliability of PSE. It was suggested that the lower levels of stress were a factor in the difference in performance for the second experiment. In later sections, we consider a computer implementation of the stress classification feature within the CVSA instrument. Evaluations are performed for SUSAS and Mt. Carmel stressed speech recordings (this data will be discussed later).

Table 1: A summary of studies on speech under simulated and actual stress conditions.

Summary of Speech Under Stress Studies

Simulated Stress | Speech Analysis Areas | Stress/Emotion
-----------------+-----------------------+---------------
Lieberman & Michaels (1962) | Pitch & Amplitude (Listener Assessment) | Simulated Stress: Simulated Emotion
Hecker, Stevens, et al. (1968) | Mean Pitch; Speech Level; Spectrogram Comparison | Simulated Stress: Timed Arithmetic Task
Hicks & Hollien (1981) | Mean Pitch; Mean Intensity; Speech Rate | Simulated Stress: Mild Electric Shock
Rostolland (1982) | Acoustic Analysis | Simulated Stress: Shouted Speech
Pisoni, et al. (1985) | Pitch; Duration; General Spectral | Simulated Stress: Lombard Effect
Stanton, et al. (1988) | Pitch; Duration; Frequency Characteristics | Simulated Stress: Loud and Lombard Effect
Junqua (1993) | Pitch; Duration; Frequency Analysis | Simulated Stress: Lombard Effect

Actual Stress | Speech Analysis Areas | Stress/Emotion
--------------+-----------------------+---------------
Kuroda, et al. (1976) | Pitch (VSSR) | Actual Stress: 14 Pilot Emergency Cockpit Recordings (8 Fatal)
Simonov & Frolov (1977) | First Formant Analysis; Heart Rate | Actual Stress: Cosmonaut Flight Recording Analysis
Streeter, et al. (1983) | Pitch; Speech Level; Timing Measurements | Actual Stress: Telephone Call Analysis of Con. Ed. New York City Blackout (1977)

Simulated & Actual | Speech Analysis Areas | Stress/Emotion
-------------------+-----------------------+---------------
Williams & Stevens (1972) | Pitch Contours; Pitch Variability; Spectrogram Comparison; Avg. Spectral Content; Mean Articulation Rate | Simulated Stress: Method Actors (Fear, Anger, Sorrow), Simulated Hindenburg Announcer. Actual Stress: Hindenburg Announcer
Hansen (1988) | Pitch; Glottal Source; Duration; Intensity; Vocal Tract Spectrum (+200 Speech Features) | Simulated Stress: Fast, Slow, Loud, Soft, Clear, Angry, Question, Lombard Effect, Computer Response Task Stress. Actual Stress: Roller-Coaster Ride Speech, Psychiatric Emotional Analysis

3 Acoustic/Phonetic Analysis of Speech under Stress

In order to perform an in depth study, a comprehensive speech under stress database, entitled SUSAS (Speech Under Simulated and Actual Stress), was formulated (Hansen, 1988; Hansen and Bou-Ghazale, 1997). The database is partitioned into five domains, encompassing a wide variety of stresses and emotions. A total of 32 speakers (13 female, 19 male), with ages ranging from 22 to 76, were employed to generate in excess of 16,000 utterances. Table 2 illustrates the various domains present in the database. The vocabulary consists of 35 aircraft communication words containing a number of subsets that are difficult for recognition systems.

Table 2: The SUSAS Speech under Simulated and Actual Stress Database.

SUSAS Database: Speech Under Simulated and Actual Stress

Domain | Type of Stress or Emotion | Speakers | Count | Vocabulary
-------+---------------------------+----------+-------+-----------
Simulated Stress | Slow, Soft, Fast, Loud, Angry, Clear, Question | 9 Speakers (All Male) | 8820 | 35 Aircraft Communication Words
Single Tracking Task | Calibrated Workload Tracking Task: Moderate & High Stress, Lombard Effect | 9 Speakers (All Male) | 1890 | 35 Aircraft Communication Words
Dual Tracking Task | Acquisition & Compensatory Tracking Task: Moderate & High Stress | 8 Speakers (4 Male) (4 Female) | 4320 | 35 Aircraft Communication Words
Actual Speech Under Stress | Amusement Park Roller-Coaster, Helicopter Cockpit Recordings (G-Force, Lombard Effect, Noise, Fear, Anxiety) | 9 Speakers (4 Male, 3 Female) (2 Male) | 500 | 35 Aircraft Communication Words
Psychiatric Analysis | Patient Interviews (Depression, Fear, Anxiety, Angry) | 8 Speakers (6 Female) (2 Male) | 600 | Conversational Speech: Phrases & Sentences

3.1 Analysis Overview of Speech Under Stress

In this section, we summarize the analysis conducted on speech from (i) simulated conditions (simulated stress or emotion, and simulated workload task or Lombard effect), followed by (ii) probe studies using actual speech under stress. In this context, simulated conditions refer to areas where talkers were asked to either speak in a prescribed emotional manner, or perform some computer response task while uttering speech. For these domains, control of experimental conditions and environmental factors was possible (e.g., vocabulary, task difficulty). Actual conditions refer to areas where talkers are in environments similar to real stressful situations. These domains differ from simulated conditions in their lack of control over experimental and environmental factors (e.g., varying noise levels, task difficulty, vocabulary choice).

To our knowledge, there is no single reliable acoustic indicator of psychological stress. There has been a lack of consistent results in past research efforts. After considering experimental design and analysis, it is apparent that past approaches to stress analysis suffer from one or more of five problems, summarized as follows: (i) analysis based on too few speakers, (ii) analysis based on too few utterances, (iii) analysis based on a limited set of parameters with no consideration within speech sound classes, (iv) no statistical analysis to determine if changes are statistically significant, (v) no confirmation of simulated results with those from actual recordings. Here, we address these problems in the context of fundamental frequency, duration, intensity, glottal source, and spectral factors. The analysis included extensive parametric and non-parametric statistical tests (see Hansen, 1998a, 1998b).

3.2 Analysis of Fundamental Frequency

The first area considered for stress evaluation involves characteristics of fundamental frequency f0, including contours, mean, variability, and distribution.

A subjective evaluation of more than 400 f0 contours was conducted across all stress conditions (sample contours are shown in Fig. 1; see footnote 4). The overall shape of the contours for fast and slow speech did not change appreciably. Angry and loud contours had much higher variability than neutral, with angry having the highest mean and variability of all stress conditions considered. The f0 contours of soft speech were almost always smoother than neutral. Speech under Lombard effect had a slightly elevated mean, but the contours appeared similar in shape. Contours for moderate and high workload task conditions were similar to neutral.

4 Here, the f0 contours for the word "histogram" are shown continuous, because the timing of the unvoiced obstruent /s/ varies during production across the stress styles, so endpoints of the contours are joined in this example, and all statistical analysis was performed directly on frame based fundamental frequency values.

[Figure 1 plot panels: "Fundamental Frequency Distribution" for each speaking style; horizontal axis: frequency (Hz).]

Figure 1: Distribution of fundamental frequency based on fundamental frequency estimates from neutral, angry, fast, slow, loud, soft, clear, Lombard effect, moderate (50%), and high (70%) workload condition speaking styles.

Next, we consider differences in mean, variance, and distribution of pitch (f0). Since a number of the statistical tests performed assume sample variables to be Gaussian distributed, a comparison of f0 distribution contours was performed (see Fig. 1). We are primarily interested in seeing whether the distribution shape differs substantially from Gaussian. The f0 distributions for neutral, clear, slow, and fast have similar shapes with a bimodal concentration. Negative kurtosis values (from Table ) confirm distributions which are flatter than Gaussian. Lombard and loud f0 distributions had similar shape, along with similar values of skewness and kurtosis, though the range of loud f0 was wider. The soft f0 distribution was highly concentrated with a very small variance, which was confirmed by a large positive kurtosis value suggesting a peaked distribution. Of the f0 distributions considered, loud and angry styles were judged not to be Gaussian. Angry resulted in a very irregularly shaped distribution, with large concentrations towards higher frequencies.
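The skewness and kurtosis values referred to above can be computed from frame based f0 estimates as in the following sketch. It assumes the moment based definitions, with kurtosis reported as excess kurtosis so that a Gaussian scores zero. The report does not state which estimator was used, and the function name is hypothetical.

```python
def skewness_and_excess_kurtosis(values):
    """Moment based skewness and excess kurtosis (0 for a Gaussian).
    Uses population (divide by n) central moments; this is an assumption."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0.0:
        raise ValueError("variance is zero; shape statistics are undefined")
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / (m2 * m2) - 3.0
```

A symmetric two-cluster sample, such as half the frames at 100 Hz and half at 200 Hz, gives zero skewness and negative excess kurtosis. This matches the text's reading of negative kurtosis for the bimodal neutral, clear, slow, and fast distributions, while a tightly concentrated sample with a few outliers gives a large positive value.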

Finally, below we summarize some of the key findings for analysis of pitch mean and variance, based on speech data from the SUSAS database.

Summary of Key Results: Mean Fundamental Frequency vs. Stress

The position of mean f0 from highest to lowest versus speaking style is shown below.

Condition | Shift in mean f0 | Significance
----------+------------------+-------------
Angry (highest) | +73% | stat. significant
Loud | +48% | stat. significant
Lombard | +10% | stat. significant
Clear | +4% | stat. significant
Fast | +6% | stat. significant
Neutral | |
Slow | -2% |
Task Condition 70% | -2% |
Task Condition 50% | -3% |
Soft (lowest) | -5% | stat. significant

Mean f0 values may be used as significant indicators for speech in soft, fast, clear, Lombard, angry, or loud styles when compared to neutral conditions.

Loud, angry, and Lombard mean f0 are all significantly different from all other styles considered.

Mean f0 was not a significant indicator for moderate versus high task workload conditions.

Speech under Lombard effect gave mean f0 values most closely associated with f0 from fast and clear conditions.

Changes in mean f0, based on Student's t-tests, appear to be a consistent and reliable stress indicator over a wide variety of conditions.

Summary of Key Results: Variance of Fundamental Frequency vs. Stress

The position of f0 variability from highest to lowest versus speaking style is shown below.

Condition | Shift in Standard Deviation | Significance
----------+-----------------------------+-------------
Angry (highest) | +506% | stat. significant
Loud | +213% | stat. significant
Lombard | +55% | stat. significant
Clear | +59% | stat. significant
Slow | +19% | stat. significant
Fast | +2% |
Task Condition 50% | +3% |
Task Condition 70% | +2% |
Neutral | |
Soft (lowest) | -28% | stat. significant

Variance of f0 values may be used as a significant stress indicator for speech in soft, loud, angry, clear, or Lombard styles when compared to neutral conditions.

Soft and loud f0 variance are significantly different from all styles considered. Pitch variance was not significantly different for moderate versus high task workload conditions. Pitch variance was unreliable for slow and fast stress conditions. Pitch variance for clear and Lombard conditions is similar, but different from all other styles considered.
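The percentage shifts and significance tests summarized above can be sketched as follows. The code computes the shift in mean and standard deviation of a speaking style relative to neutral, and a pooled-variance two-sample Student's t statistic. The function names are hypothetical, and the report does not say which t-test variant or significance level was used, so the pooled form is an assumption.

```python
import math

def mean(values):
    return sum(values) / len(values)

def sample_variance(values):
    """Unbiased sample variance (divide by n - 1)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def percent_shift(neutral_value, style_value):
    """Shift of a statistic relative to its neutral value, in percent."""
    return 100.0 * (style_value - neutral_value) / neutral_value

def student_t(neutral, style):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    n_a, n_b = len(neutral), len(style)
    dof = n_a + n_b - 2
    pooled = ((n_a - 1) * sample_variance(neutral)
              + (n_b - 1) * sample_variance(style)) / dof
    standard_error = math.sqrt(pooled * (1.0 / n_a + 1.0 / n_b))
    return (mean(style) - mean(neutral)) / standard_error, dof
```

For example, `percent_shift(mean(neutral_f0), mean(angry_f0))` gives the kind of figure listed in the mean f0 table, and the same call on standard deviations gives the variability shift. The t statistic would then be compared against a critical value for the chosen significance level.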

3.3 Analysis of Duration

In order to address duration in speech under stress adequately, analysis was partitioned into stress relayers across four areas. The first two focused on overall word and individual speech class (vowel, consonant, semivowel, and diphthong) durations. Third, analysis within speech classes provided detailed indicators of duration shifts between classes. Fourth, because overall word duration may supersede the requirements of lengthening consonantal periods (or semivowel, diphthong periods), several duration ratio measures are proposed.

Several comments concerning durational effects caused by prosodic features may help explain the durational variation caused by stress. There have been a multitude of studies investigating durational variations which arise from prosodic conditions (Fry, 1955; Creelman, 1962; House, 1962; Perkell and Klatt, 1986). A number of more recent studies have considered timing and height of pitch contours for female speakers (van Santen and Hirschberg, 1994), phone/syllable duration and timing representations for text-to-speech synthesis (van Santen, 1995), pitch and duration in signaling emotion (Ofuka, Campbell, et al., 1994), and segment duration in hidden Markov model speech recognition (Levinson, 1986; Wang, et al. 1996). Basic data and initial prosodic rules, which govern consonant and vowel duration, can be found in studies by Klatt (1973,76), Fry (1955), House (1962), and Umeda (1975,77). Duration patterns have also been studied in an attempt to arrive at principles of motor organization by Lindblom (1963), Lindblom, Lyberg, and Holmgren (1977), and Köhler (1986). Barnwell (1971) was the first to identify a limit to the temporal compressibility of vowels when they are followed by an unvoiced consonant and/or by an additional syllable. In a later study, Klatt (1973) expressed this incompressibility in a formula, and applied it as a rule to adjust consonant durations for various shortening effects (Klatt, 1976).

Below, we summarize results from the statistical evaluation of mean and variance of word and phoneme class duration.

Summary of Key Results: Mean Duration vs. Stress

Mean duration from highest to lowest versus talking style for all speech classes (* = significant with respect to neutral)

Shift in Mean Duration
Word         Vowel        Consonant    Semivowel    Diphthong
SL*  +73%    SL*  +84%    C*   +84%    SL* +112%    SL*  +94%
C*   +39%    A*   +69%    SL*  +52%    LM*  +63%    A*   +64%
A*   +38%    L*   +58%    SO   +24%    A    +42%    L*   +53%
L*   +36%    C    +26%    C7   +22%    C    +39%    LM   +30%
LM*  +20%    LM   +24%    C5   +12%    L    +27%    SO    +9%
SO    +7%    N            LM    +4%    C5   +20%    C     +4%
+3% SO +19% N C7 +5% SO -8% L +14% C -7% C5 +1% C5 -8% N C7 7 -8% N C7 -8% A -12% N C5
F*   -26%    F*   -28%    F*   -27%    F    -27%    F    -27%

• Mean word duration values may be used as significant indicators for speech in slow, clear, angry, loud, Lombard, or fast styles when compared to neutral conditions
• Slow and fast mean word duration are significantly different from all other styles considered
• Clear mean consonant duration was significantly different from all styles except slow
• Word and phoneme class duration are not significant indicators for moderate vs. high task workload conditions

Summary of Key Results: Duration Variance vs. Stress

Duration variance from highest to lowest versus speaking style (* = significant with respect to neutral)

Shift in Duration Variance
Word         Vowel        Consonant    Semivowel     Diphthong
SL* +173%    A*  +191%    C*  +456%    A*  +1045%    SL* +324%
A*  +128%    SL* +166%    SL* +294%    SL*  +942%    A   +112%
C*  +122%    L*  +141%    L*  +106%    LM*  +531%    L    +70%
L    +56%    C*  +115%    A*   +83%    C*   +370%    LM    +6%
LM   +32%    LM   +65%    C7*  +78%    L*   +326%    C7    +3%
N            N            SO*  +63%    C5   +150%    C     +1%
SO   -11%    C5    -4%    LM   +44%    C7   +106%    N
C5   -11%    C7    -4%    C5   +39%    SO    +91%    C5   -27%
C7   -22%    SO   -22%    N            F     +45%    SO   -66%
F    -33%    F*   -54%    F*   -39%    N             F    -70%

• Duration variance increased for slow speech in all domains (word, vowel, consonant, semivowel, diphthong)
• Duration variance decreased for most domains under fast stress condition
• Duration variance significantly increased for angry speech
• Duration variance generally increased for loud speech, but was mixed for soft speech
• Clear consonant duration variance was significantly different from all styles
• Duration variance is not a significant indicator for moderate versus high task workload conditions

Since overall word duration may supersede requirements for lengthening of consonants (or other speech classes), several duration ratio measures were proposed. The following three⁵ ratios were considered: i) a consonant versus vowel duration ratio (CVDR), ii) a consonant versus semivowel duration ratio (CSVDR), and iii) a vowel versus semivowel duration ratio (VSVDR),

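$$\mathrm{CVDR} = \frac{d_{\mathrm{consonant}}(\mathrm{neutral})}{d_{\mathrm{vowel}}(\mathrm{neutral})} \tag{1}$$

$$\mathrm{CSVDR} = \frac{d_{\mathrm{consonant}}(\mathrm{neutral})}{d_{\mathrm{semivowel}}(\mathrm{neutral})} \tag{2}$$

$$\mathrm{VSVDR} = \frac{d_{\mathrm{vowel}}(\mathrm{neutral})}{d_{\mathrm{semivowel}}(\mathrm{neutral})} \tag{3}$$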

where duration values $d_{\mathrm{class}}(\mathrm{stress})$ are for a particular phoneme class and stress condition. It is suggested that such ratios can be used to determine directions in which speakers vary their duration patterns under stress. By using neutral ratios as baseline values for comparison, one can determine how phone class duration varies for individual stress styles.
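As a minimal sketch of how the three duration ratios could be computed and compared against the neutral baseline: the neutral class durations (165, 59, and 71 msec) are the tabulated values quoted later in this section, while the stressed durations and the helper names are hypothetical and serve only to illustrate the comparison.

```python
def duration_ratios(d_vowel, d_semivowel, d_consonant):
    """Return the three duration ratios for one speaking style."""
    return {
        "CVDR": d_consonant / d_vowel,       # consonant vs. vowel
        "CSVDR": d_consonant / d_semivowel,  # consonant vs. semivowel
        "VSVDR": d_vowel / d_semivowel,      # vowel vs. semivowel
    }


def ratio_shift_percent(stressed, neutral):
    """Percentage change of each ratio relative to the neutral baseline."""
    return {k: 100.0 * (stressed[k] - neutral[k]) / neutral[k] for k in neutral}


# Neutral class durations in msec, as quoted in the text.
neutral_ratios = duration_ratios(d_vowel=165.0, d_semivowel=59.0, d_consonant=71.0)

# Hypothetical stressed durations in msec (illustrative only).
stressed_ratios = duration_ratios(d_vowel=280.0, d_semivowel=84.0, d_consonant=62.0)

for name, shift in ratio_shift_percent(stressed_ratios, neutral_ratios).items():
    print(f"{name}: neutral {neutral_ratios[name]:.3f}, shift {shift:+.1f}%")
```

A negative CVDR shift together with a positive VSVDR shift, as in this made-up example, would indicate vowel lengthening at the expense of consonants and semivowels.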

5 Duration ratios with respect to diphthongs were not considered, due to the limited number of examples available in the Talking Styles stress domain.

CVDR and CSVDR suggest that there is a shift in percentage time spent in vowels and semivowels towards consonants for soft, clear, and to a lesser degree the two task conditions. CVDR and VSVDR also revealed increased vowel duration at the expense of consonant and semivowel portions for angry and loud speech. It is difficult to get a clear picture of the global changes in duration from simply comparing the duration ratios. Therefore, a pictorial representation of global duration shifts is presented in Fig. 2. A bar graph, proportional to average word length from the SUSAS database, is shown for each stress condition. The percentage of vowel, semivowel, and consonant duration with respect to an overall average word duration is also shown within each shaded section. The percentage is simply the ratio of average phoneme class duration to that of an average word duration assuming one vowel, consonant, and semivowel. All calculations are based on tabulated values. As an example, the 24% consonant duration for neutral was obtained by assuming an ideal stressed word with one phoneme from each class as follows,

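$$\mathrm{word} = \mathrm{vowel} + \mathrm{semivowel} + \mathrm{consonant} \tag{4}$$

$$d_{\mathrm{word}}(\mathrm{neutral}) = d_{\mathrm{vowel}}(\mathrm{neutral}) + d_{\mathrm{semivowel}}(\mathrm{neutral}) + d_{\mathrm{consonant}}(\mathrm{neutral}) \tag{5}$$

$$295 = 165 + 59 + 71 \quad \text{(in msec)}. \tag{6}$$

The consonant percentage is simply the ratio of the average consonant duration to the ideal word duration.

$$\mathrm{Percent}_{\mathrm{consonant}}(\mathrm{neutral}) = \frac{d_{\mathrm{consonant}}(\mathrm{neutral})}{d_{\mathrm{word}}(\mathrm{neutral})} \tag{7}$$

$$24\% = \frac{71}{295} \times 100 \tag{8}$$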

The arrows in Fig. 2 indicate significant shifts in duration based on CVDR, CSVDR, and VSVDR. As an example, angry speech results in significant increases in vowel duration at the expense of semivowel and consonant duration. It is apparent from the results presented here that the presence of stress influences overall and individual duration characteristics.
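The ideal-word arithmetic behind the percentages in Fig. 2 can be checked with a short sketch. It uses the neutral class durations given in the text (165, 59, and 71 msec); the function name is a hypothetical helper introduced here.

```python
def class_percentages(d_vowel, d_semivowel, d_consonant):
    """Ideal word duration (one phoneme per class) and each class's share in percent."""
    d_word = d_vowel + d_semivowel + d_consonant
    shares = {
        "vowel": 100.0 * d_vowel / d_word,
        "semivowel": 100.0 * d_semivowel / d_word,
        "consonant": 100.0 * d_consonant / d_word,
    }
    return d_word, shares


# Neutral class durations in msec, as quoted in the text.
d_word, pct = class_percentages(165.0, 59.0, 71.0)
print(f"ideal word duration: {d_word:.0f} msec")
for name, value in pct.items():
    print(f"{name}: {value:.1f}%")
```

With these inputs the ideal neutral word is 295 msec long and the consonant share rounds to 24%, matching the worked example.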

3.4 Analysis of Intensity

The control of vocal intensity is based on adjustments of laryngeal and subglottal variables. In addition, past research on the effects of intensity for speech intelligibility has also served to improve the knowledge of how speakers vary intensity in typical speech production. An analysis of consonant strength and precision was performed by House et al. (1965). In this investigation, consonant-vowel amplitude ratios (CVAR) were measured for two speakers differing in intelligibility as measured by the Modified Rhyme Test. It was found that the more intelligible speaker had CVARs 2-4 dB higher than the less intelligible speaker. Hecker (1974) attempted to increase speaker intelligibility by increasing the CVAR. This was accomplished by splicing out the consonant, increasing its amplitude, and re-splicing it into the word. After processing, intelligibility based on the Modified Rhyme Test showed an increase from 78% to 81% at 4 dB of signal-to-noise ratio, a small but significant increase. A number of studies have also considered changes in vocal effort (Perkell and Klatt, 1986) and presence of the Lombard effect (Hanley and Harvey, 1965; Pearsons, et al., 1977; Junqua, 1993,96).

Below, we summarize the primary findings from analysis of mean and variance in word and phoneme class intensity for various speech styles under stress.

Summary of Key Results: Mean Intensity vs. Stress

Mean RMS intensity from highest to lowest versus speaking style for all speech classes (* = significant with respect to neutral)

Shift of Mean RMS Intensity
Word        Vowel       Consonant   Semivowel   Diphthong
A*  +48%    A*  +32%    SO  +33%    A   +16%    L*  +46%
L*  +38%    L*  +25%    C7  +23%    SO    0%    A*  +45%
LM   +8%    C    +2%    C5  +14%    F     0%    LM   +8%
SL   +4%    LM   +1%    A   +12%    N           C    +3%
F    +2%    SL   +1%    SL   +3%    L    -6%    F    +3%
N           N           F    +2%    SL   -7%    N
SO   -5%    F    -2%    LM   +1%    C5  -15%    SL   -1%
C    -8%    SO   -3%    N           C7  -17%    C5   -3%
C5   -8%    C7   -6%    C    -8%    LM  -17%    C7   -4%
C7* -10%    C5   -8%    L   -17%    C   -18%    SO   -7%

[Figure 2 graphic: GLOBAL SHIFTS IN DURATION. Horizontal bars, one per style, each subdivided into vowel, semivowel, and consonant percentages of word duration. Average word durations: Neutral 478 msec; Slow 827 msec; Fast 353 msec; Soft 509 msec; Loud 650 msec; Angry 662 msec; Clear 666 msec; Cond-50 482 msec; Cond-70 501 msec; Lombard 572 msec.]

Figure 2: A pictorial representation of global duration shifts for speech under stress. The length of each bar graph is proportional to each style's average duration. Speech class percentages shown for each style are based on an ideal word containing one phoneme from each class. Arrows indicate significant shifts in duration based on phoneme ratios as a result of stress.

• Mean RMS word intensity values may be used as significant indicators for speech in angry, loud, and high workload task styles when compared to neutral conditions
• Loud and angry mean RMS word intensity are significantly different from all other styles considered
• Loud and angry mean RMS vowel and diphthong intensities are significantly different from all styles considered
• Mean RMS consonant and semivowel intensity are not significant stress indicators for any style considered
• Mean RMS intensity is not a significant indicator for moderate versus high task workload conditions

Summary of Key Results: Intensity Variance vs. Stress

RMS intensity variance from highest to lowest versus stress style (* = significant with respect to neutral)

Shift in the Variance of RMS Intensity
Word         Vowel        Consonant    Semivowel    Diphthong
A*  +312%    A*  +107%    SO*  +64%    A*  +346%    N
L*  +154%    L    +25%    A    +54%    L    +80%    C7   -20%
LM   +37%    C    +17%    C7   +52%    C7   +46%    F    -25%
SO   +30%    C7    +8%    SL   +47%    C5   +27%    C    -39%
N            F      0%    L    +19%    LM   +19%    C5   -40%
F    -18%    N            C    +14%    F     +6%    SO   -43%
C5   -25%    C5    -1%    C5   +11%    N            A    -50%
SL   -31%    LM   -14%    LM   +10%    SL   -17%    L    -55%
C7   -33%    SL   -18%    N            SO   -20%    SL   -62%
C    -47%    SO   -32%    F     -8%    C    -41%    LM   -78%

• Variance of RMS word intensity may be used as a significant indicator for speech in angry and loud styles when compared to neutral
• Variance of loud and angry RMS word intensity is significantly different from most other styles considered
• Variance of angry RMS vowel and semivowel intensities was significantly different from most styles considered
• Variance of RMS consonant and diphthong intensity was not a significant stress indicator for most styles
• Variance of RMS intensity (for word or phoneme class) was not a significant indicator for moderate versus high workload task conditions

Next, it may be beneficial to reflect on the intensity variation between individual phoneme classes. Consider the case when a talker is speaking under fast or Lombard effect in noisy environmental conditions.
A talker could maintain overall word intensity, yet vary a particular phoneme class with respect to another. Hence, several average RMS ratio measures were formed. Three ratios were considered: i) consonant versus vowel amplitude ratio (CVAR), ii) consonant versus semivowel amplitude ratio (CSVAR), and iii) vowel versus semivowel amplitude ratio (VSVAR). These ratios are used to determine in which directions speakers vary their intensity patterns under stress.

CVARs for fast, slow, and Lombard effect conditions were relatively constant. Increased CVARs resulted for soft and both task conditions, which suggests that talkers emphasize consonant amplitude with respect to vowel amplitude under these stress conditions. Decreases in CVARs for loud, angry, and clear styles signify further importance in vowel rather than consonant amplitudes. CSVARs also demonstrate a talker's emphasis of consonant versus semivowel amplitudes for soft, Lombard, and both task conditions. Decreased CSVAR was noted for only loud and angry styles. Finally, VSVAR generally results in further vowel emphasis. Only the soft speaking style results in decreased VSVAR, with loud having the highest. In order to get an overall perspective of changes in intensity across the various styles, a pictorial representation is shown in Fig. 3. The clear bar graphs are proportional to RMS word intensity for each style. Shaded regions within each bar graph indicate average RMS intensity values for vowel, semivowel, and consonant phoneme classes⁶. Triangles below each bar graph indicate statistically significant shifts in phoneme class intensity. A single arrow indicates a strong shift in average RMS intensity from one class to the other. A double arrow indicates an extreme shift in RMS intensity (from weaker to stronger). Phoneme class shifts indicate opposite movement for soft and loud speech classes. For loud speech, vowel amplitudes are strongly emphasized, while in soft speech consonant amplitudes are emphasized. Fast speech had little or no movement between phoneme classes.

6 Phone class heights have all been scaled by a factor of 2/3 so word RMS intensities are visible in the presentation.

[Figure 3 graphic: GLOBAL SHIFTS IN INTENSITY. Vertical bars, one per style, with a 40.5 dB WORD reference level and vowel (V), semivowel (SV), and consonant (C) levels marked within each bar.]

Figure 3: A pictorial representation of global intensity and shifts in intensity for speech under stress. The height of each bar graph is proportional to each style's average RMS word intensity. Speech class RMS values are also shown. Arrows indicate significant shifts in intensity as a result of stress.

3.5 Glottal Source Spectral Analysis

In this section, we focus on glottal source effects. There are a variety of characteristics relating to speech excitation which are adjusted to convey the stress or emotional state of a speaker. We have seen that the fundamental frequency of vocal fold movement is a statistically reliable indicator of many stressful speaking conditions. In addition to rate, aspects such as duration of each laryngeal pulse (both open and closed glottal periods), the instant of glottal closure, or the shape of each pulse play important roles in a talker's ability to vary source characteristics.

An analysis of glottal source spectral characteristics was performed for SUSAS speech data. Utterances rich in vowel content but lacking adjacent nasal portions were chosen. An algorithm was developed for analysis of the distribution of frame energy for each stress condition. The division between voiced (high frame energy) and unvoiced (low frame energy) speech is quite apparent in all cases. These frame energy distributions can also serve as possible stress indicators. For example, high energy frame concentration increased for angry, loud, and Lombard conditions, whereas low energy frame concentration increased for clear, slow, and soft conditions. A shift was also observed for frames with moderate energy (40 to 60 dB) toward primarily higher regions. This was observed for loud and angry styles, indicating that under these conditions the time spent in transitional periods between voiced and unvoiced portions is reduced. A voiced energy cutoff was selected from the corresponding frame energy distribution, close to its upper peak (normally between 65 and 70 dB). Frames above the threshold were extracted and a gain normalized periodogram spectral estimate was found for each frame. Periodograms from all selected frames were averaged to remove the effects of the varying vocal-tract response, leaving an estimate of the glottal source spectrum. Each selected frame's energy was also averaged to obtain a final gain factor for the glottal spectrum. This was performed for each of the ten stress conditions.
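The frame selection and periodogram averaging described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' algorithm: the frame length, hop, and cutoff are free parameters, the dB scale is relative to the sample units (so the 65 to 70 dB cutoff quoted in the text does not carry over directly), gain normalization is taken to mean scaling each frame to unit energy, and a plain DFT stands in for whatever spectral estimator was actually used.

```python
import cmath
import math


def frame_signal(x, frame_len, hop):
    """Split x into frames of frame_len samples, advancing by hop samples."""
    return [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]


def energy_db(frame):
    """Frame energy in dB relative to unit sample amplitude."""
    energy = sum(s * s for s in frame)
    return 10.0 * math.log10(energy) if energy > 0.0 else float("-inf")


def periodogram(frame):
    """One-sided periodogram |DFT|^2 / N computed with a direct DFT."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(s * cmath.exp(-2j * math.pi * k * t / n) for t, s in enumerate(frame))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


def glottal_spectrum_estimate(x, frame_len, hop, cutoff_db):
    """Average gain-normalized periodograms over frames above the energy cutoff."""
    voiced = [f for f in frame_signal(x, frame_len, hop) if energy_db(f) > cutoff_db]
    if not voiced:
        raise ValueError("no frames above the voiced-energy cutoff")
    spectra, energies = [], []
    for frame in voiced:
        energy = sum(s * s for s in frame)
        scale = math.sqrt(energy)  # normalize each frame to unit energy
        spectra.append(periodogram([s / scale for s in frame]))
        energies.append(energy)
    averaged = [sum(col) / len(spectra) for col in zip(*spectra)]
    return averaged, sum(energies) / len(energies)


# Synthetic check signal: a loud sinusoid followed by a very quiet one.
loud = [math.sin(2.0 * math.pi * 4 * t / 32) for t in range(128)]
quiet = [0.001 * s for s in loud]
spectrum, gain = glottal_spectrum_estimate(loud + quiet, frame_len=32, hop=32, cutoff_db=0.0)
print(spectrum.index(max(spectrum)), round(gain, 3))
```

In practice the cutoff would be read off the upper peak of the frame energy histogram for each stress condition rather than fixed in advance.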

[Figure 4 graphic: ten panels, each titled GLOTTAL SOURCE SPECTRUM, one per speaking style (legible panel labels include NEUTRAL, COND50, CLEAR, and LOMBARD). Horizontal axis: log frequency (Hz), with ticks at 100, 400, 1000, and 4000.]

Figure 4: Glottal source spectral estimates based on non-nasalized utterances for neutral, loud, soft, fast, slow, moderate (Cond50%) and high (Cond70%) computer workload conditions, angry, clear, and Lombard effect stress talking styles. (Note: A +6dB/octave spectral roll-off should be added to remove the effects of the lip radiation component.)

3.6 Analysis of Vocal Tract Articulatory Characteristics

It is reasonable to hypothesize that stress factors will also affect the position and rate of change of the articulators, which shape the vocal tract. It has been suggested that these changes may represent a major contributor to the reduced performance of present-day recognition algorithms in stressful environments (Hansen, 1988, 1996). Therefore, we consider an initial analysis of vocal tract structure based on sample articulatory profiles. Previous articulatory studies have considered methods to estimate vocal tract configuration based on the acoustic signal (Kobayashi, Yagyu, Shirai, 1991; Wakita, 1973). The analysis here is based upon a linear acoustic tube model with speech sampled at 8 kHz. In order to visualize the effects of stress on physical vocal tract shape, the movements throughout the vocal tract can be displayed by superimposing a time sequence of estimated vocal tract shapes for a chosen phoneme. The vocal tract shape analysis algorithm assumes a known normalized area function and acoustic tube length. The algorithm begins by computing the sagittal distance function by assuming a cylindrical vocal tract. Next, a set of rigid points from the glottis to the upper teeth (and rigid upper lip) models the hard palate. With the hard palate model in place, the soft palate and pharynx are approximated by forming a dependence upon the sagittal distance function. Finally, the lower lips are modeled using one of four rigid models dependent upon the acoustic tube length.

An analysis of articulatory changes in vocal tract shape under neutral and various stress conditions was performed for extracted phoneme sequences from SUSAS. Fig. 5 illustrates a set of vocal tract shapes which are superimposed for each frame in the analysis window (the number of extracted frames is summarized for each stress condition). For Neutral, there is some movement of the articulators in the pharynx and oral cavities (as there should be for the production of the /r-iy/ phone sequence in "freeze"). There is also limited movement for the Soft speaking condition. However, for Angry, Loud, and Lombard conditions, there is significant perturbation in the blade and dorsum of the tongue and the lips. This extreme vocal tract variation is also present in the same phone sequence from the Actual stress domain (speech from roller-coaster rides). This suggests that when a speaker is under stress, typical vocal tract movement is affected, resulting in a quantifiable perturbation in articulator position. This characteristic was used as one of several features for a study on stressed speech classification by Womack and Hansen (1996).

[Figure 5 graphic: twelve panels of superimposed vocal tract profiles for /r-iy/, labeled NEUTRAL, LOUD, SOFT, CLEAR; LOMBARD, ANGRY, FAST, SLOW; NEUTRAL (M3), ACTUAL (M3), COND50, COND70. Each panel is annotated with its number of frames.]

Figure 5: Sample vocal tract articulatory profiles for the phoneme sequence /r-iy/ from the word freeze across SUSAS speech under stress conditions. Each speech frame analysis width was 24 ms, with a skip rate of 8 ms. The number of frames indicates the /r-iy/ duration over which the profiles are plotted (e.g., 39 frames × 8 ms/frame = 312 ms).

3.7 Vocal Tract Spectrographic Analysis

Our initial analysis of vocal tract spectral structure focuses on sample spectrographic analysis of speech under stress. Several hundred spectrograms were informally compared across stress conditions in SUSAS. Fig. 6 illustrates example responses for the word help spoken under stress conditions. Final stop releases were in general not present in high stress styles such as Angry, Loud, and most Actual stress examples. Stop release time was normally longer for Clear and at times Lombard effect conditions. For Angry, Loud, and Lombard stress conditions, the high frequency energy generally increases with more irregular formant structure. Formants are also higher in amplitude and more clearly defined. This was partially confirmed in the previous analysis on the glottal source spectrum.

The spectral characteristics illustrated in spectrographic analysis suggest that stress based information can be obtained from a statistical analysis of formant location and bandwidth. A more complete discussion of the statistical analysis of individual phonemes across formant location and bandwidth can be found in Hansen (1998b).

[Figure 6 graphic: spectral response panels labeled NEUTRAL and ACTUAL.]

Figure 6: Sample vocal tract spectral responses from help utterances in the SUSAS speech under stress database.

4 Stress Classification

The field of computer based voice stress detection is an emerging area. There has been some activity in commercial voice stress analysis in applications for forensic science (Cestaro, 1995). These methods are typically based on some aspects of pitch perturbation or micro-tremors (Lippold, 1971). However, these commercial systems are not universally accepted by speech scientists because the 'stress' in these cases is normally associated with deception.

The focus here is exclusively on stress based speech production variations resulting from task workload, Lombard effect, or emotional/psychological changes. Recently, a number of studies have been reported which focus directly on the formulation of computer based stress detection.

Several methods have been proposed based on neural network classifiers. For example, Hansen and Womack (1996) considered a neural network based stress classifier using five different cepstral feature sets. Features that were found to be the most useful were the auto-correlation Mel $AC_i$ and cross-correlation Mel $XC_{i,j}$ (XC-Mel) cepstral parameters. Further classification studies have expanded on these neural network approaches using target driven features (Womack and Hansen, 1996). In that method, a wide selection of features was automatically extracted, including articulatory measures, pitch, phone duration, and a variety of spectral based information. Next, the most effective feature subset for each targeted stress condition was determined during training, and only those targeted features were used during classification. This allows the classifier to use the most discriminating features for classification of each stress style.

Other methods have also been proposed based on the nonlinear Teager Energy Operator (TEO) (Cairns and Hansen, 1994a, 1994b), where the shape of a duration normalized energy profile was used in a hidden Markov model based stress classifier. That study clearly demonstrated the potential a TEO-based feature could have in improving stress classification performance. Motivated by this work, several recent investigations have considered more extensive feature processing methods based on TEO principles (Zhou, Hansen, Kaiser 1998a,b).
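For readers unfamiliar with the operator, the sketch below shows the standard discrete-time form of the Teager Energy Operator, $\Psi[x(n)] = x^2(n) - x(n-1)\,x(n+1)$. Only the operator itself is shown; the duration normalized energy profile and the hidden Markov model classifier from the cited studies are not reproduced, and the function name is a hypothetical helper.

```python
import math


def teager_energy(x):
    """Discrete Teager energy: psi[n] = x[n]^2 - x[n-1] * x[n+1] for interior samples."""
    return [x[n] * x[n] - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]


# For a pure sinusoid A*cos(omega*n), the operator output is the constant A^2 * sin(omega)^2.
amplitude, omega = 2.0, 0.3
tone = [amplitude * math.cos(omega * n) for n in range(50)]
profile = teager_energy(tone)
print(profile[0], (amplitude * math.sin(omega)) ** 2)
```

Because the output depends on both amplitude and frequency, the operator responds to changes in either, which is part of its appeal for tracking excitation changes.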

In the following sections, we first consider stress classification experiments which use linear based speech features and optimum Bayesian detection theory. These experiments were conducted on features such as duration, intensity, pitch, glottal source, and vocal tract spectrum. In a recent study, several TEO-based nonlinear features were found to be very effective for stress classification (Zhou, Hansen, Kaiser 1998). Therefore, in Section 4.3 one nonlinear based feature is described, with results presented for both classification and assessment of speaker stress.

4.1 Bayesian Stress Classification with Linear Speech Features

Having established relationships between speech production under stress and speech feature variation (Hansen 1998a,b), we now turn to the related problem of classification of speech under stress. Our task here is to formulate an algorithm for detection of speech spoken under one particular stress style versus neutral speech. It has been shown that there are observable differences in duration, intensity, pitch, glottal source information, and formant locations between neutral and stressed speech. Therefore, it is worthwhile to evaluate their performance for stress classification, or stress detection. Here the two terms, classification and detection, can be used interchangeably since only pairwise classification is considered. Two processing stages are required for stress detection. In the first stage, acoustical features are extracted from an input speech waveform. The second stage is focused on detection of stressed speech from neutral using one or more available methods. A variety of methods exist for stress detection, including, but not limited to, detection-theory based methods, methods based on distance measures, neural network classifiers, and statistical modeling based techniques. In this section, we employ two methods, one using a Bayesian hypothesis-testing framework, and the other using a distance measure to detect stressed versus neutral speech.

4.1.1 Description of Features

For the five linear features used for stress classification, only vowel sections were extracted from the simulated domain of the SUSAS database for evaluation. The sample length of each vowel in msec is used as the duration feature. The intensity feature is defined as,

$$\mathrm{Intensity} = \sqrt{\frac{1}{K}\sum_{i=1}^{K} s^2(i)} \tag{9}$$

where $s(i)$ $(i = 1, \ldots, K)$ represents the K individual samples in the vowel. Pitch, glottal source information, and formant locations are extracted on a frame basis with frame length being 32 msec and an overlap length between adjacent frames of 16 msec. The modified simple inverse filter tracking (MSIFT) algorithm (Arslan, 1996) is employed to extract pitch frequencies from vowel speech waveforms. Spectral slope was used as the glottal source feature. It is difficult to obtain the glottal spectral slope from the raw vowel speech waveform due to the coupling effect between the sub-glottal structure and forward portion of the vocal tract. To avoid this effect, only data obtained during closed vocal fold periods was used, which unfortunately limits the available data. Also, it is difficult to accurately locate the boundaries between vocal fold closing and opening periods. As an approximation, a frame based log average amplitude FFT was computed versus log frequency for each vowel section and used to determine boundaries.
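The duration and intensity features, together with the 32 msec / 16 msec framing used for the frame-based features, can be sketched as below. This assumes the intensity feature of Eq. 9 is the root-mean-square of the vowel samples; the sample rate is a parameter because the text does not state it for this experiment, and the function names are hypothetical.

```python
import math


def duration_msec(samples, sample_rate_hz):
    """Vowel duration in msec from its length in samples."""
    return 1000.0 * len(samples) / sample_rate_hz


def rms_intensity(samples):
    """Root-mean-square amplitude over the K samples of a vowel (assumed form of Eq. 9)."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def split_frames(samples, sample_rate_hz, frame_msec=32.0, overlap_msec=16.0):
    """Split a vowel into overlapping frames for pitch, slope, and formant analysis."""
    frame_len = int(round(sample_rate_hz * frame_msec / 1000.0))
    hop = frame_len - int(round(sample_rate_hz * overlap_msec / 1000.0))
    return [samples[i:i + frame_len] for i in range(0, len(samples) - frame_len + 1, hop)]


# Synthetic vowel-like segment: 1024 samples at an assumed 8 kHz rate.
segment = [math.sin(2.0 * math.pi * 200.0 * n / 8000.0) for n in range(1024)]
print(duration_msec(segment, 8000), rms_intensity(segment), len(split_frames(segment, 8000)))
```

At 8 kHz this gives 256-sample frames advanced by 128 samples, so a 1024-sample vowel yields seven frames.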

The fourth feature is the slope of the glottal source spectrum. A straight line is used to approximate the excitation spectral envelope, and the line's slope is taken as the glottal spectral slope. Finally, for the last set of features, the first two formant locations are used, since these were shown to change measurably between neutral and stressed speech (Hansen 1998b). Here, the ESPS/xwaves function "formant" was employed to extract formant locations for all vowels in the SUSAS database (Entropic, 1993).
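The straight-line approximation of the excitation spectral envelope can be illustrated with an ordinary least-squares fit of log amplitude against log frequency. The fitting method and the dB-per-octave unit are assumptions made here for illustration (the text does not specify either), and the function name is a hypothetical helper.

```python
import math


def spectral_slope_db_per_octave(freqs_hz, amps_db):
    """Least-squares slope of amplitude (dB) versus log2 frequency, i.e. dB per octave."""
    xs = [math.log2(f) for f in freqs_hz]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(amps_db) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, amps_db))
    return sxy / sxx


# Synthetic envelope falling by exactly 12 dB for every doubling of frequency.
freqs = [100.0, 200.0, 400.0, 800.0, 1600.0]
amps = [60.0, 48.0, 36.0, 24.0, 12.0]
print(spectral_slope_db_per_octave(freqs, amps))
```

A flatter (less negative) slope corresponds to relatively more high-frequency excitation energy, which is the kind of change reported earlier for loud and angry speech.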

4.1.2 Detection-theory Bayesian Hypothesis Testing

A flexible framework for stress detection can be easily established using detection theory. For such a scheme, there are two hypotheses termed $H_0$ and $H_1$. Under $H_0$, the speech is neutral; under $H_1$, the speech is stressed. Given an input speech feature vector $x = (x_1, \ldots, x_M)$, where $M$ is the vector length, the following two conditional probability density functions (PDFs) are estimated: $p(x|H_0)$ and $p(x|H_1)$. With these PDFs, the likelihood ratio, $\Lambda$, is then defined as,

$$\Lambda = \frac{p(x|H_1)}{p(x|H_0)}. \tag{10}$$

The decision of whether the input speech is neutral or stressed is made by comparing the likelihood or log likelihood ratio with a pre-defined threshold, β. If the ratio is larger than β, the input speech is labeled as stressed; otherwise it is classified as neutral. The value of β depends on the criterion used for detection. In a stress classification system, the criterion should keep the two important error probabilities, the false acceptance rate (FAR) and the false rejection rate (FRR), as low as possible. It is not possible to minimize both FAR and FRR simultaneously, and hence a compromise must be made between them. For some systems, the requirement for one probability is more important than the other. For a stress classification system, however, we are only interested in the overall accuracy and have no preference for either FAR or FRR. Therefore, the value of β corresponding to the equal error rate (EER), where FAR = FRR, is selected. In the experiments performed here, the values of FAR and FRR were calculated as the ratio of the number of falsely accepted vowels to the total number of vowels, and the ratio of the number of falsely rejected vowels to the total number of vowels, respectively. By changing the threshold value, the value of β corresponding to the EER can be found.

In order to form the likelihood ratio in Eq. 10, we must first estimate the PDFs ($p(x|H_0)$ and $p(x|H_1)$) of both the neutral and stressed speech features. If we assume that all components $(x_1, x_2, \ldots, x_M)$ of the feature vector $x$ are independent and identically distributed Gaussian random variables with mean $\mu_n$ and variance $\sigma_n^2$ under neutral conditions, but with a different mean $\mu_s$ and variance $\sigma_s^2$ under stressed conditions, the PDFs are formed as,

$$p(x_i|H_0) = \frac{1}{\sqrt{2\pi\sigma_n^2}} \exp\left(-\frac{(x_i-\mu_n)^2}{2\sigma_n^2}\right) \tag{11}$$

$$p(x_i|H_1) = \frac{1}{\sqrt{2\pi\sigma_s^2}} \exp\left(-\frac{(x_i-\mu_s)^2}{2\sigma_s^2}\right) \tag{12}$$

With these PDFs and assuming statistical independence, the overall conditional probabilities $p(x|H_0)$ and $p(x|H_1)$ can be computed as,

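$$p(x|H_0) = (2\pi\sigma_n^2)^{-M/2} \exp\left(-\frac{1}{2\sigma_n^2}\sum_{i=1}^{M}(x_i-\mu_n)^2\right) \tag{13}$$

$$p(x|H_1) = (2\pi\sigma_s^2)^{-M/2} \exp\left(-\frac{1}{2\sigma_s^2}\sum_{i=1}^{M}(x_i-\mu_s)^2\right) \tag{14}$$

Substituting Eq. 13 and 14 into Eq. 10, the likelihood ratio can be computed as,

$$\Lambda = \frac{p(x|H_1)}{p(x|H_0)} = \left(\frac{\sigma_n}{\sigma_s}\right)^{M} \exp\left(\frac{1}{2\sigma_n^2}\sum_{i=1}^{M}(x_i-\mu_n)^2 - \frac{1}{2\sigma_s^2}\sum_{i=1}^{M}(x_i-\mu_s)^2\right) \tag{15}$$

Taking the logarithm of each side, the log likelihood ratio is obtained as follows,

$$\begin{aligned}
\ln\Lambda &= M\ln\frac{\sigma_n}{\sigma_s} + \frac{1}{2\sigma_n^2}\sum_{i=1}^{M}(x_i-\mu_n)^2 - \frac{1}{2\sigma_s^2}\sum_{i=1}^{M}(x_i-\mu_s)^2 \\
&= M\ln\frac{\sigma_n}{\sigma_s} + \frac{1}{2\sigma_n^2}\sum_{i=1}^{M}(x_i-\hat{\mu}+\hat{\mu}-\mu_n)^2 - \frac{1}{2\sigma_s^2}\sum_{i=1}^{M}(x_i-\hat{\mu}+\hat{\mu}-\mu_s)^2 \\
&= M\ln\frac{\sigma_n}{\sigma_s} + \frac{1}{2\sigma_n^2}\sum_{i=1}^{M}(x_i-\hat{\mu})^2 + \frac{M}{2\sigma_n^2}(\hat{\mu}-\mu_n)^2 - \frac{1}{2\sigma_s^2}\sum_{i=1}^{M}(x_i-\hat{\mu})^2 - \frac{M}{2\sigma_s^2}(\hat{\mu}-\mu_s)^2 \\
&= M\ln\frac{\sigma_n}{\sigma_s} + \frac{M}{2}\left[\hat{\sigma}^2\left(\frac{1}{\sigma_n^2}-\frac{1}{\sigma_s^2}\right) + \frac{(\hat{\mu}-\mu_n)^2}{\sigma_n^2} - \frac{(\hat{\mu}-\mu_s)^2}{\sigma_s^2}\right]
\end{aligned} \tag{16}$$

where $\hat{\mu}$ and $\hat{\sigma}^2$ are the sample mean and sample variance of the input vector,

$$\hat{\mu} = \frac{1}{M}\sum_{i=1}^{M} x_i \tag{17}$$

$$\hat{\sigma}^2 = \frac{1}{M}\sum_{i=1}^{M}(x_i-\hat{\mu})^2 \tag{18}$$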
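A minimal sketch of the Gaussian log likelihood ratio follows, written once directly from the per-sample sums and once from the sample mean and sample variance of the input vector. The two forms should agree because the cross terms vanish when the sums are expanded about the sample mean. The model parameters and feature values are hypothetical, chosen only so the sign of the ratio is easy to check.

```python
import math


def gaussian_llr(x, mu_n, var_n, mu_s, var_s):
    """Log likelihood ratio ln[p(x|H1)/p(x|H0)] for i.i.d. Gaussian components."""
    m = len(x)
    neutral_term = sum((xi - mu_n) ** 2 for xi in x) / (2.0 * var_n)
    stressed_term = sum((xi - mu_s) ** 2 for xi in x) / (2.0 * var_s)
    # M * ln(sigma_n / sigma_s) equals (M / 2) * ln(var_n / var_s).
    return 0.5 * m * math.log(var_n / var_s) + neutral_term - stressed_term


def gaussian_llr_from_stats(x, mu_n, var_n, mu_s, var_s):
    """Same ratio expressed through the sample mean and sample variance of x."""
    m = len(x)
    mu_hat = sum(x) / m
    var_hat = sum((xi - mu_hat) ** 2 for xi in x) / m
    bracket = (
        var_hat * (1.0 / var_n - 1.0 / var_s)
        + (mu_hat - mu_n) ** 2 / var_n
        - (mu_hat - mu_s) ** 2 / var_s
    )
    return 0.5 * m * math.log(var_n / var_s) + 0.5 * m * bracket


# Hypothetical pitch models in Hz (illustrative only): neutral N(120, 100), stressed N(160, 400).
params = dict(mu_n=120.0, var_n=100.0, mu_s=160.0, var_s=400.0)
stressed_like = [158.0, 162.0, 160.0]
neutral_like = [119.0, 121.0, 120.0]
print(gaussian_llr(stressed_like, **params), gaussian_llr(neutral_like, **params))
```

The second form is convenient when only summary statistics of a vowel's frames are stored, since it needs just M, the sample mean, and the sample variance.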

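Similarly, if we assume that all feature vector components $(x_1, x_2, \ldots, x_M)$ are independent and identically distributed from a Gamma distribution, $\Gamma(\alpha,\beta)$, with $\alpha = \alpha_n$ and $\beta = \beta_n$ under neutral conditions, but with $\alpha = \alpha_s$ and $\beta = \beta_s$ under stressed conditions, the PDFs are formed as,

$$f(x_i|H_0) = \begin{cases} \dfrac{x_i^{\alpha_n-1} e^{-x_i/\beta_n}}{\beta_n^{\alpha_n}\,\Gamma(\alpha_n)}, & x_i > 0 \\ 0, & x_i \le 0 \end{cases} \tag{19}$$

$$f(x_i|H_1) = \begin{cases} \dfrac{x_i^{\alpha_s-1} e^{-x_i/\beta_s}}{\beta_s^{\alpha_s}\,\Gamma(\alpha_s)}, & x_i > 0 \\ 0, & x_i \le 0 \end{cases} \tag{20}$$

The conditional probabilities are then obtained as,

$$p(x|H_0) = \begin{cases} \left(\dfrac{1}{\beta_n^{\alpha_n}\,\Gamma(\alpha_n)}\right)^{M} \left(\prod_{i=1}^{M} x_i\right)^{\alpha_n-1} \exp\left(-\dfrac{1}{\beta_n}\sum_{i=1}^{M} x_i\right), & x_i > 0,\ i = 1, 2, \ldots, M \\ 0, & \text{otherwise} \end{cases} \tag{21}$$

$$p(x|H_1) = \begin{cases} \left(\dfrac{1}{\beta_s^{\alpha_s}\,\Gamma(\alpha_s)}\right)^{M} \left(\prod_{i=1}^{M} x_i\right)^{\alpha_s-1} \exp\left(-\dfrac{1}{\beta_s}\sum_{i=1}^{M} x_i\right), & x_i > 0,\ i = 1, 2, \ldots, M \\ 0, & \text{otherwise} \end{cases} \tag{22}$$

Substituting Eq. 21 and 22 into Eq. 10, we obtain the likelihood ratio, $\Lambda$, and the log likelihood ratio, $\ln\Lambda$, for the case where sample features are Gamma distributed ($x_i > 0$, $i = 1, 2, \ldots, M$), as follows,

$$\Lambda = \left(\frac{\beta_n^{\alpha_n}\,\Gamma(\alpha_n)}{\beta_s^{\alpha_s}\,\Gamma(\alpha_s)}\right)^{M} \left(\prod_{i=1}^{M} x_i\right)^{(\alpha_s-\alpha_n)} \exp\left(M\hat{\mu}\left(\frac{1}{\beta_n}-\frac{1}{\beta_s}\right)\right), \tag{23}$$

$$\ln\Lambda = M\ln\left(\frac{\beta_n^{\alpha_n}\,\Gamma(\alpha_n)}{\beta_s^{\alpha_s}\,\Gamma(\alpha_s)}\right) + M(\alpha_s-\alpha_n)\,\hat{\mu}_{\ln} + M\hat{\mu}\left(\frac{1}{\beta_n}-\frac{1}{\beta_s}\right), \tag{24}$$

where $\hat{\mu}$ is the estimated mean of the input sample vector, $x$, as defined by Eq. 17, and $\hat{\mu}_{\ln}$ is defined as,

$$\hat{\mu}_{\ln} = \frac{1}{M}\sum_{i=1}^{M} \ln x_i. \tag{25}$$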
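The Gamma log likelihood ratio (Eq. 24) can be sketched in the same way. This is an illustrative sketch only: it assumes β is the scale parameter of the Gamma density, which is the reading under which the exponential term of Eq. 23 depends on 1/β_n − 1/β_s, and the function names are hypothetical.

```python
import math

def gamma_logpdf(x, alpha, beta):
    """Log of the Gamma density with shape alpha and scale beta (assumed), x > 0."""
    return (alpha - 1) * math.log(x) - x / beta - alpha * math.log(beta) - math.lgamma(alpha)

def gamma_llr(x, alpha_n, beta_n, alpha_s, beta_s):
    """ln(p(x|H1)/p(x|H0)) for i.i.d. Gamma components (form of Eq. 24)."""
    if any(xi <= 0 for xi in x):
        raise ValueError("Gamma-distributed features must be strictly positive")
    m = len(x)
    mu_hat = sum(x) / m                        # sample mean (Eq. 17)
    mu_ln = sum(math.log(xi) for xi in x) / m  # mean of the log samples
    log_norm = (alpha_n * math.log(beta_n) + math.lgamma(alpha_n)
                - alpha_s * math.log(beta_s) - math.lgamma(alpha_s))
    return m * (log_norm
                + (alpha_s - alpha_n) * mu_ln
                + mu_hat * (1.0 / beta_n - 1.0 / beta_s))
```

Working with `math.lgamma` rather than the Gamma function itself avoids overflow for large shape parameters.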

The decision of whether the input speech is neutral or stressed is made by comparing the likelihood ratio (Eq. 15 for Gaussian distributed features or Eq. 23 for Gamma distributed features) or the log likelihood ratio (Eq. 16 or 24) with a pre-defined threshold, β. If it is larger than β, the input speech is labeled as stressed; otherwise it is classified as neutral. The value of β depends on the criterion used for detection. In a stress classification system, the criterion should keep two error probabilities, the false acceptance rate (FAR) and the false rejection rate (FRR), as low as possible. The two cannot be minimized simultaneously, since lowering one raises the other, so a compromise must be made between false acceptances and false rejections. For some systems, one of the two probabilities matters more than the other. For a stress classification system, however, we are only interested in the overall accuracy and have no preference for either FAR or FRR. Therefore, the value of β corresponding to the equal error rate (EER), where FAR = FRR, is selected. In the experiments performed here, FAR and FRR were calculated as the ratio of the number of falsely accepted vowels to the total number of vowels, and the ratio of the number of falsely rejected vowels to the total number of vowels, respectively. By varying the threshold, the value of β corresponding to the EER can be found.
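The threshold search described above can be sketched as a sweep over candidate values of β. The sketch makes two assumptions that the text does not settle: a false acceptance is taken to be a neutral token scored above β, and each rate is normalized by the number of tokens in its own class. All names are hypothetical.

```python
def far_frr(neutral_scores, stressed_scores, beta):
    """Error rates at threshold beta (scores above beta are labeled stressed)."""
    # Assumed convention: false acceptance = neutral token labeled stressed.
    far = sum(s > beta for s in neutral_scores) / len(neutral_scores)
    # False rejection = stressed token labeled neutral.
    frr = sum(s <= beta for s in stressed_scores) / len(stressed_scores)
    return far, frr

def eer_threshold(neutral_scores, stressed_scores):
    """Return (beta, eer) for the candidate threshold where FAR and FRR are closest."""
    candidates = sorted(set(neutral_scores) | set(stressed_scores))

    def gap(beta):
        far, frr = far_frr(neutral_scores, stressed_scores, beta)
        return abs(far - frr)

    beta = min(candidates, key=gap)
    far, frr = far_frr(neutral_scores, stressed_scores, beta)
    return beta, (far + frr) / 2.0
```

With finite data FAR and FRR rarely coincide exactly, so the sketch reports the mean of the two rates at the closest point.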

4.1.3 Distance Measure Testing

It is also possible to detect stressed speech from neutral using a distance measure with previously trained feature distributions. Given an input speech feature vector $x = (x_1, x_2, \ldots, x_M)$, where $M$ is the vector length, two values are computed: the distance between $x$ and the neutral speech feature distribution, $d_n$ (Eq. 26), and the distance between $x$ and the stressed speech feature distribution, $d_s$ (Eq. 27).

[Eqs. 26 and 27, defining $d_n$ and $d_s$, are not legible in the source.]

Here $\mu_n$, $\sigma_n$, $\mu_s$, and $\sigma_s$ are the means and standard deviations for the neutral and stressed speech features, which are obtained from training data; $\hat{\mu}$ and $\hat{\sigma}$ are the sample estimated mean and standard deviation of the components of the input vector, $x$, as defined in Eq. 17 and 18 respectively. This distance measure reflects how close the input test speech feature vector is to the feature distributions of neutral and stressed speech data. If $d_n$ is smaller than $d_s$, the input vector $x$ is labeled as neutral; otherwise, it is labeled as stressed. The distance scores can also be used to quantify the degree of stress content in the test data.

4.2 Linear Feature Based Evaluations

A 33-word vocabulary (see footnote 7) under neutral, angry, loud, and Lombard effect speaking styles from the simulated domain of the SUSAS database was employed for evaluations. For each test token, all samples corresponding to vowels were extracted. Other voiced data such as diphthongs, liquids, glides, and nasals were not considered because their spectral structure changes with articulatory movement. It is believed that the muscle control needed for such articulatory movement would also be affected under stress. Vowels were selected to investigate vocal fold changes under stress and static vocal tract adjustments due to stress. From all identified vowels, duration, intensity, pitch, glottal spectral slope, and formant locations were extracted. For each feature, all extracted data was used to estimate the density function of the feature distribution, and then to obtain the ROC (receiver operating characteristic) curve for the Bayesian hypothesis-testing method. In order to achieve open-set performance in the test phase, the entire vowel data set was first divided for each feature into 10 equal-size groups. Each of the 10 groups in turn is set aside, and the remaining data (9 groups) is used to obtain the EER threshold for the Bayesian

7 For these evaluations, the two words "destination" and "histogram" were set aside because of the increased impact of lexical stress on polysyllabic words. The remaining 33-word vocabulary consists of 26 monosyllable words and 7 two-syllable words.

hypothesis-testing method, and the mean and variance for the distance measure approach. The final error rate is obtained by accumulating the error rates from all 10 open-set tests. The next five subsections consider stress detection performance using the optimum Bayesian detection scheme for a feature in each speech production domain (duration, intensity, pitch, glottal source, vocal tract spectrum).
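The 10-group open-set protocol can be sketched as follows. This is a hedged illustration rather than the procedure actually coded for these experiments: how tokens were assigned to groups is not stated, so the interleaved assignment below is an assumption, and `fit` and `predict` are hypothetical callables standing in for either detection method.

```python
def k_fold_splits(n_items, k=10):
    """Yield (train_indices, test_indices) with each group held out once."""
    # Assumed grouping: token i goes to group i mod k.
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train_idx, test_idx

def open_set_error_rate(values, labels, fit, predict, k=10):
    """Accumulate errors over k open-set tests; the model never sees its test group."""
    errors = 0
    for train_idx, test_idx in k_fold_splits(len(values), k):
        model = fit([values[i] for i in train_idx], [labels[i] for i in train_idx])
        errors += sum(predict(model, values[i]) != labels[i] for i in test_idx)
    return errors / len(values)
```

Here `fit` would return, for example, an EER threshold estimated from the 9 training groups, and `predict` would apply it to one held-out token.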

4.2.1 Duration

For the Bayesian hypothesis-testing method, PDFs of vowel duration were first estimated to form the likelihood ratio. Using phone segment label information, a vowel duration histogram was obtained, followed by fitting a Gamma distribution to the data histogram (an example for the loud speaking style is shown in Fig. 7a). Based on this Gamma pdf, the ROC for open-set test performance was obtained for the Bayesian hypothesis-testing method. To find average test results, the data was divided for each feature into 10 equal-size sets. For each of the 10 sets, we test with one set and train with the other 9 to calculate the average EER threshold for the Bayesian hypothesis-testing approach, and the mean and variance of the feature distribution for the distance measure approach. Fig. 8a shows the ROC for detecting speech under the "loud" speaking style from neutral speech using duration. Table 3 lists the open-set test results for the Bayesian hypothesis-testing method as well as for the distance measure approach. Several test feature vector lengths (1, 5, 10) were used to obtain ROC curves and error rates. The ROC curves and the table show that increasing the input vector length does not significantly improve detection accuracy (especially for detection of Lombard effect versus neutral speech using the Bayesian hypothesis-testing method). Table 3 also shows that, compared with the Bayesian hypothesis-testing method, the distance measure approach performs slightly better for Lombard effect and loud speech but slightly worse for angry speech. In general, given the error rate levels for the three stress classes tested, vowel duration is not a strong feature for stress detection.

Table 3: Error Rate (percentage) of Open-set Stress Detection Test Using Duration as the Feature

DURATION: Error Rate by Stress Style of Test Speech

Detection           Vector
Method              Length   Neutral  Angry    Neutral  Loud     Neutral  Lombard
Optimal Detection     1       45.13   45.38     38.21   38.72     40.77    40.26
Optimal Detection     5       36.36   38.96     33.77   35.06     40.26    40.26
Optimal Detection    10       41.03   35.90     38.46   35.90     38.46    46.15
Distance Measure      5       48.05   49.35     29.87   36.36     32.47    42.86
Distance Measure     10       43.59   53.85     28.21   41.03     30.77    46.15

4.2.2 Intensity

For each vowel, we use Eq. 9 to calculate its root mean square (RMS) intensity. From a plot of the histogram, it was determined that the Gaussian PDF would fit the intensity distribution well. From Eq. 9, it is clear that the intensity feature can never be negative, while a Gaussian PDF ranges from $-\infty$ to $\infty$. To resolve this conflict, a conditional PDF, $f(x|x>0)$, is used to fit the intensity distribution. $f(x|x>0)$ is obtained as follows,

$$f(x|x>0) = \frac{1}{P_{>0}\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \quad x > 0, \tag{28}$$

$$P_{>0} = 1 - \int_{-\infty}^{0} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) dx, \tag{29}$$

where $\mu$ and $\sigma^2$ are the mean and variance. Fig. 7b shows how a conditional Gaussian PDF fits the intensity distribution for all vowels spoken under the loud stress condition.

Based on the conditional Gaussian PDF, the Bayesian hypothesis-testing method was used to classify stressed speech from neutral. ROC curves for each stress condition were obtained (a sample ROC is shown in Fig. 8b). In a similar manner to that used for duration, the open-set test results for the Bayesian hypothesis-testing method and distance measure approach were obtained and are summarized in Table 4.
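The conditional Gaussian PDF of Eqs. 28 and 29 can be sketched with the standard error function, since the integral in Eq. 29 is the Gaussian CDF evaluated at zero. This is an illustrative sketch; the function names are hypothetical.

```python
import math

def normal_cdf(z):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def conditional_gaussian_pdf(x, mu, var):
    """Gaussian density with mean mu and variance var, conditioned on x > 0."""
    if x <= 0:
        return 0.0
    sigma = math.sqrt(var)
    p_pos = 1.0 - normal_cdf((0.0 - mu) / sigma)  # probability mass above zero (Eq. 29)
    gauss = math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)
    return gauss / p_pos                           # renormalized density (Eq. 28)
```

When the mean lies many standard deviations above zero, `p_pos` is essentially 1 and the conditional density coincides with the ordinary Gaussian.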

Table 4: Error Rate (percentage) of Open-set Stress Detection Test Using Intensity as the Feature

INTENSITY: Error Rate by Stress Style of Test Speech

Detection           Vector
Method              Length   Neutral  Angry    Neutral  Loud     Neutral  Lombard
Optimal Detection     1       40.26   37.44     34.87   32.82     40.77    39.49
Optimal Detection     5       24.68   22.08     27.27   22.08     38.96    35.06
Optimal Detection    10       23.08   17.95     28.21   17.95     35.90    35.90
Distance Measure      5       41.56   27.27     35.06   22.08     40.26    35.06
Distance Measure     10       30.77   33.33     25.64   23.08     41.03    33.33

The ROC curves in Fig. 8 and the open-set test results in Table 4 show that increasing the input vector length does improve performance for the Bayesian hypothesis-testing method, especially for detecting angry and loud speech. For the distance measure approach, increasing the input vector length does not always improve performance. The open-set test results also show that both methods perform better for detecting angry and loud speech than for detecting Lombard effect speech.

4.2.3 Pitch

Frame-based pitch measurements were extracted for the input neutral and stressed data, and the resulting histogram showed that a conditional Gaussian PDF was suitable for modeling the distribution of this feature. Fig. 7c, Fig. 8c, and Table 5 show the pitch distribution, ROC curves for the Bayesian hypothesis-testing method, and open-set test results for both methods, respectively. Note that all zero pitch values are removed from ROC plots and open-set tests.

Table 5: Error Rate (percentage) of Open-set Stress Detection Test Using Pitch as the Feature

PITCH: Error Rate by Stress Style of Test Speech

Detection           Vector
Method              Length   Neutral  Angry    Neutral  Loud     Neutral  Lombard
Optimal Detection     1       18.95   18.57     11.94   11.63     24.08    24.18
Optimal Detection     5       15.17   14.31     10.34   10.00     21.90    22.07
Optimal Detection    10       12.76   11.72      7.24    8.28     20.69    19.31
Distance Measure      5       15.34   15.00     12.41    7.07     23.10    19.48
Distance Measure     10       14.48   12.76     12.07    4.83     21.38    33.33

Compared to duration and intensity, pitch resulted in much better performance for stress detection. In a similar manner to intensity, pitch performs better for detection of angry and loud speech than for Lombard effect speech when using the Bayesian hypothesis-testing method. For detection of loud versus neutral speech, the Bayesian hypothesis-testing method achieves very

high accuracy. The distance measure method produced a similar level of performance with pitch as the feature.

4.2.4 Glottal Source Spectrum

For estimation of the glottal source spectral slope, only those vowels longer than 5 frames (i.e., 96 msec) are used, in order to get reliable slope estimates. Since glottal spectral slopes for vowel sections are almost all negative, the resulting feature histogram shows an envelope that is close to a Gamma distribution. In order to fit a Gamma distribution (Eq. 19 and Eq. 20) to the feature histogram, only the absolute value of each spectral slope was considered (a sample Gamma distribution for loud speech is shown in Fig. 7d). The ROC curves for the Bayesian hypothesis-testing method are shown in Fig. 8d, and open-set test results for both the Bayesian hypothesis-testing method and the distance measure approach are summarized in Table 6.

Table 6: Error Rate (percentage) of Open-set Stress Detection Test Using Glottal Spectral Slope as the Feature

SPECTRAL SLOPE: Error Rate by Stress Style of Test Speech

Detection           Vector
Method              Length   Neutral  Angry    Neutral  Loud     Neutral  Lombard
Optimal Detection     1       33.33   36.78     41.38   41.72     42.76    42.07
Optimal Detection     5       25.45   21.82     30.91   34.55     30.91    36.36
Optimal Detection    10       25.00   17.86     35.71   35.71     28.57    32.14
Distance Measure      5       34.55   18.18     38.89   35.19     38.89    33.33
Distance Measure     10       35.71   17.86     44.44   25.93     44.44    25.93

The open-set test results from the Bayesian hypothesis-testing method show that spectral slope is more suitable for detecting angry speech than for detecting loud or Lombard effect speech from neutral. Even so, it still does not produce a reasonable level of accuracy for classifying angry versus neutral speech. One possible reason for this result is that a more direct glottal source estimation method might be needed, since the results presented in (Hansen, 1998b) seem to suggest that glottal spectral slope should be more successful. The distance measure approach shows a level of performance similar to that obtained using the Bayesian hypothesis-testing method.

4.2.5 Vocal Tract Spectrum

In the evaluation of vocal tract spectral structure, first and second formant locations were used. Since it was of interest to reduce vowel phoneme dependent traits (i.e., the absolute vowel formant location), formant location measurements were made with respect to the deviation from the expected average value. Therefore, using the expected average formant locations obtained from (Deller, Hansen, and Proakis, 1999; page 125), we subtract off the expected formant location for the particular vowel under test (single and uppercase ARPABET labels are used for phonemes from Deller, Hansen, and Proakis, 1993; page 118). Using this conversion, formant location deviations of all vowels can be collected into a histogram, and these were shown to fit well to a Gaussian PDF (shown in Fig. 7e,f for first and second formants). Fig. 8e,f shows ROC curves for the Bayesian hypothesis-testing method for different vector lengths. The open-set test results are summarized in Table 7. From a comparison of ROCs and distance measure performance, we conclude that individual first and second formant locations are not suitable for stress detection.

Table 7: Error Rate (percentage) of Open-set Stress Detection Test Using First and Second Formant Location as the Features

FORMANT FREQUENCY: Error Rate by Stress Style of Test Speech

Detection                    Vector
Method              Formant  Length   Neutral  Angry    Neutral  Loud     Neutral  Lombard
Optimal Detection   1st        1       42.60   41.80     46.43   45.10     46.84    46.90
Optimal Detection   1st        5       40.60   40.30     46.12   45.82     47.91    46.87
Optimal Detection   1st       10       38.79   40.91     43.03   44.24     47.58    47.88
Optimal Detection   2nd        1       51.48   50.88     58.20   54.51     52.98    49.88
Optimal Detection   2nd        5       53.88   49.85     58.51   56.12     54.78    50.90
Optimal Detection   2nd       10       55.76   47.58     59.39   57.27     53.33    55.15
Distance Measure    1st        5       43.58   37.76     44.63   45.97     45.82    46.72
Distance Measure    1st       10       41.82   39.39     43.64   41.82     45.45    46.06
Distance Measure    2nd        5       53.28   49.85     41.49   74.78     36.87    74.93
Distance Measure    2nd       10       54.55   49.09     40.00   74.85     38.48    76.06

[Figure 7 panels: (a) duration distribution; (b) intensity distribution; (c) pitch distribution; (d) glottal spectral slope distribution; (e) F1 location distribution; (f) F2 location distribution]

Figure 7: Gaussian and Gamma pdfs used to approximate the feature distribution of vowels under the loud speaking style. (a) duration: Γ(α, β) with (α = 4.4402, β = 45.6920); (b) intensity: N(μ, σ² | x > 0) with (μ = 9.99 × 10³, σ² = [not legible]); (c) pitch: N(μ, σ² | x > 0) with (μ = 192 Hz, σ² = 2094); (d) glottal spectral slope: Γ(α, β) with (α = 4.2329, β = 3.6612); (e) first formant location: N(μ, σ²) with (μ = 73.28, σ² = 1.32 × 10³); (f) second formant location: N(μ, σ²) with (μ = −39.90, σ² = 6.89 × 10⁴).

[Figure 8 panels: (a) duration ROC; (b) intensity ROC; (c) pitch ROC; (d) glottal spectral slope ROC; (e) F1 location ROC; (f) F2 location ROC. Horizontal axis: false acceptance probability (FA).]

Figure 8: ROC detection curves for "loud" versus neutral speech (vowels) using input vector lengths of (1, 5, 10), represented as (solid line *, dashed line o, dotted line Δ), for:
(a) duration: EER(*) = 38.32%; EER(o) = 30.77%; EER(Δ) = 33.33%
(b) intensity: EER(*) = 32.74%; EER(o) = 23.08%; EER(Δ) = 20.51%
(c) pitch: EER(*) = 11.47%; EER(o) = 9.88%; EER(Δ) = 8.80%
(d) glottal spectral slope: EER(*) = 40.51%; EER(o) = 32.22%; EER(Δ) = 34.48%
(e) first formant location: EER(*) = 45.67%; EER(o) = 45.51%; EER(Δ) = 43.07%
(f) second formant location: EER(*) = 46.94%; EER(o) = 46.32%; EER(Δ) = 47.49%

4.2.6 Discussion of Linear Speech Features

Based on Tables 3, 4, 5, 6, and 7, the following observations can be made: (1) pitch is the best feature for stress classification among the five features considered, (2) error rates generally decrease as feature vector length increases, (3) performance differs between stress styles, and (4) mean vowel formant locations are not suitable for stress classification. The results in this section have therefore established stress classification performance using linear speech production based features with two types of optimum detection methods.

4.3 Stress Classification Using Nonlinear Speech Features

In this section, recently proposed approaches to stress classification that employ Teager Energy Operator (TEO) based processing are considered. Three were proposed in the study by Zhou, Hansen, and Kaiser (1998a), and a fourth was discussed in Zhou, Hansen, and Kaiser (1998b). Here, we briefly consider the basic principles of the TEO, and one nonlinear feature for stress classification (TEO-CB-Auto-Env). This is followed by classification evaluations using stressed speech data from SUSAS. Finally, we consider a comparison of three features for stress assessment in speech using actual emergency data provided by NATO IST/TG-01.

4.3.1 Teager Energy Operator

According to studies by Teager (1980, 1983), the assumption that airflow propagates as a plane wave in the vocal tract may not hold, since the flow is actually separated and concomitant vortices are distributed throughout the vocal tract. Based on the theory of the oscillation pattern of a simple spring-mass system, Teager developed an energy operator to measure the energy of simple sinusoids, which has been suggested as a useful tool for speech analysis. The simple and elegant form of the operator was introduced by Kaiser (1990, 1993) as,

$$\psi_c[x(t)] = \left(\frac{d}{dt}x(t)\right)^2 - x(t)\,\frac{d^2}{dt^2}x(t) = [\dot{x}(t)]^2 - x(t)\,\ddot{x}(t), \tag{30}$$

where $\psi_c[\cdot]$ is the Teager Energy Operator (TEO) in continuous time, and $x(t)$ is a single-frequency component of the continuous speech signal. Kaiser (1990, 1993) derived the operator for discrete-time signals from its continuous form $\psi_c[x(t)]$ as,

$$\psi[x(n)] = x^2(n) - x(n+1)\,x(n-1), \tag{31}$$

where $x(n)$ is the sampled speech signal.
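The discrete-time operator of Eq. 31 is a three-sample computation. The sketch below applies it to a list of samples; the function name is hypothetical, and the first and last samples are simply dropped because the operator needs both neighbors.

```python
import math

def teager_energy(x):
    """Discrete TEO profile: psi[x(n)] = x(n)^2 - x(n+1)*x(n-1), for n = 1 .. len(x)-2."""
    return [x[n] ** 2 - x[n + 1] * x[n - 1] for n in range(1, len(x) - 1)]

# Example: for a pure sinusoid A*cos(W*n + phi), the TEO output is the
# constant A^2 * sin^2(W), independent of n and of the phase phi.
A, W, phi = 2.0, 0.3, 0.7
tone = [A * math.cos(W * n + phi) for n in range(64)]
profile = teager_energy(tone)
```

The constant output for a single sinusoid follows from the identity cos²θ − cos(θ+W)cos(θ−W) = sin²W, which is why the operator tracks both amplitude and frequency.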

The TEO is typically applied to a bandpass filtered speech signal, since its intent is to reflect the energy of the nonlinear energy flow within the vocal tract for a single resonant frequency. Under this condition, the resulting TEO profile can be used to decompose a speech signal into its AM and FM components within a certain frequency band via,

$$f(n) = \frac{1}{2\pi}\arccos\left(1 - \frac{\psi[y(n)] + \psi[y(n+1)]}{4\,\psi[x(n)]}\right), \tag{32}$$

$$|a(n)| = \sqrt{\frac{\psi[x(n)]}{1 - \left(1 - \dfrac{\psi[y(n)] + \psi[y(n+1)]}{4\,\psi[x(n)]}\right)^{2}}}, \tag{33}$$

where $y(n) = x(n) - x(n-1)$, $\psi[\cdot]$ is the TEO operator as shown in Eq. 31, $f(n)$ is the FM component at sample $n$, and $a(n)$ is the AM component at sample $n$. On the basis of this work, Maragos, Kaiser, and Quatieri (1993a,b) proposed a nonlinear model which represents the speech signal $s(t)$ as,

$$s(t) = \sum_{m=1}^{M} r_m(t), \tag{34}$$

where

$$r_m(t) = a_m(t)\cos\left(2\pi\left[f_{cm}\,t + \int_0^t q_m(\tau)\,d\tau\right] + \theta\right) \tag{35}$$

is a combined AM and FM structure representing a speech resonance at the $m$th formant with a center frequency $F_m = f_{cm}$. In this relation, $a_m(t)$ is the time-varying amplitude, and $q_m(\tau)$ is the frequency modulating signal at the $m$th formant.
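The AM-FM separation of Eqs. 32 and 33 can be checked numerically on a pure tone. The following sketch is illustrative only: it assumes the frequency is expressed in cycles per sample (multiply by the sampling rate for Hz), it has no guard against a zero TEO value, and the function names are hypothetical.

```python
import math

def teo_at(x, n):
    """Discrete TEO (Eq. 31) at a single index n."""
    return x[n] ** 2 - x[n + 1] * x[n - 1]

def am_fm_separation(x):
    """Return (freqs, amps) for n = 2 .. len(x)-3 using Eqs. 32 and 33."""
    # Backward difference y(n) = x(n) - x(n-1); y[0] is padding and never used.
    y = [0.0] + [x[n] - x[n - 1] for n in range(1, len(x))]
    freqs, amps = [], []
    for n in range(2, len(x) - 2):
        psi_x = teo_at(x, n)
        g = 1.0 - (teo_at(y, n) + teo_at(y, n + 1)) / (4.0 * psi_x)
        g = max(-1.0, min(1.0, g))               # keep arccos in its domain
        freqs.append(math.acos(g) / (2.0 * math.pi))
        amps.append(math.sqrt(psi_x / (1.0 - g * g)))
    return freqs, amps
```

For the tone x(n) = A cos(Wn + φ), the ratio inside Eq. 32 reduces to 1 − cos W, so the sketch returns W/(2π) and A at every valid sample.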

Although the TEO is formulated for single-frequency signals or signals with a single resonant frequency, previous studies have shown that the TEO energy of a multi-frequency signal is not only different from that of a single-frequency signal but also reflects interactions between different frequency components (Zhou, Hansen, and Kaiser 1998a,b). This characteristic extends the use of the TEO to speech signals filtered with wide-bandwidth band-pass filters (BPF). In the next section, we consider one TEO-based feature for stress classification.

4.3.2 TEO-CB-Auto-Env: Critical Band Based TEO Autocorrelation Envelope

Empirically, the human auditory system is assumed to be a filtering process which partitions the entire audible frequency range into many critical bands (Yost, 1994). Based on this assumption, a nonlinear feature is proposed that employs a critical band based filterbank to filter the speech signal, followed by TEO processing (see Fig. 9). Each filter in the filterbank is a Gabor bandpass filter, with the effective RMS bandwidth being the corresponding critical band. This feature is an extension of previously proposed TEO based features (Zhou, Hansen, and Kaiser 1998a), and preliminary classification results have also been reported (Zhou, Hansen, and Kaiser 1998b). Here we consider a comparison with other features for classification, and extend the basic ideas to the problem of stress assessment.

[Figure 9 block diagram: voiced speech → critical band based filterbank using Gabor bandpass filters → TEO (one per band) → segmentation (fixed frame) → TEO autocorrelation → compute and normalize area under envelope of TEO autocorrelation → output feature]

Figure 9: TEO-CB-Auto-Env Feature Extraction

To extract the TEO-CB-Auto-Env feature, each TEO profile of a Gabor BPF output is segmented into 200-sample (25 msec) frames with 100-sample (12.5 msec) overlap between adjacent frames. Next, M normalized TEO autocorrelation envelope area parameters are extracted for each time frame (i.e., one for each critical band), where M is the total number of critical bands. This is the TEO-CB-Auto-Env feature vector per frame. Fig. 9 shows the entire feature

[Figure 10 bar chart: stress classification results (%) for the pitch, MFCC, and TEO-CB-Auto-Env features. Simulated domain: neutral, angry, loud, Lombard; actual domain: neutral, actual. Mean: m = 88.5% (pitch), m = 89.5% (MFCC), m = 94.2% (TEO-CB-Auto-Env). Standard deviation: σ = 7.22 (pitch), σ = 5.73 (MFCC), σ = 3.97 (TEO-CB-Auto-Env).]

Figure 10: Pairwise Stress Classification Results (Mean and standard deviation of overall neutral/stress classification rates are shown; Different speaker groups were used for simulated and actual stress conditions)

The evaluation results are shown in Fig. 10. In general, the TEO based feature was effective in classifying stressed speech from neutral for both simulated and actual stress situations. We should expect performance in the neutral versus actual stress domain to be better than in the simulated domain (angry, loud, Lombard effect), since the speakers clearly demonstrated extreme levels of stress for this data. The TEO-CB-Auto-Env feature, with its fine frequency band partitions, provides the most effective and consistent level of stress classification performance compared with MFCC and pitch information.

The evaluations in this section have shown that the proposed nonlinear based TEO-CB-Auto-Env feature is effective in the classification of speech under stress in both simulated and actual stress settings. This assumes that the goal is to detect the presence of stress. In some voice

communication settings, it is also necessary to assess the level of stress in a speaker's voice. The next section considers both linear and nonlinear based features for the task of stress assessment using actual emergency military voice communications between aircraft pilots from the SUSC-0 stress database.

5 Stress Assessment

In many commercial, law enforcement, and military applications, it is necessary to assess whether, and to what degree, a speaker is under stress. To evaluate the techniques discussed and their ability to detect real stress, the SUSC-0 database containing speech of pilots under stress was processed (in a later section, we present an equivalent evaluation of speech data from the Mt. Carmel law enforcement encounter). The SUSC-0 database is from NATO IST-TG01 and consists of actual aircraft pilot communications under emergency situations (see footnote 8). Specifically, the Mayday2 domain in SUSC-0 was used, which contains speech data between a pilot and controller collected from the initial ground aircraft system check, through preliminary discovery of an engine emergency, until safe resolution of the emergency. The different degrees of stress experienced by the pilot are reflected in his speech in Mayday2. Twelve (12) sentences from the SUSC-0 database were extracted to represent different speaking styles for the assessment evaluation. Table 8 shows the 12 sentences from SUSC-0, where No. 1 represents the ground systems check; in sentences 2-7 the pilot understands there is a problem and is working through a series of checks to determine the cause and to attempt to remedy it; in sentences 8-11 the pilot realizes that he is in an extreme emergency and stands a real possibility of not being able to land his aircraft; finally, in No. 12 he has landed his aircraft and expresses relief.

A baseline HMM-based stress assessor with continuous Gaussian mixture distributions was used for the evaluation. Two reference HMM models, one representing neutral speech and the other representing stressed speech, were trained. All voiced segments of the word "help" under neutral conditions in the SUSAS database were used to train the neutral HMM reference model. For the stressed HMM reference model, two different training sets were used: one from a combination of simulated angry, loud, and Lombard stress conditions, and one from the actual stress roller coaster and free fall ride data. If a speech feature can assess the degree of stress regardless of text, the log likelihood ratio of the unknown speech generated by the stressed

8 Sample audio files for stressed speech databases used in this study, SUSAS and SUSC-0, are available from the NATO IST/TG-01 Web page on Speech Under Stress: http://cslu.colorado.edu/rspl/STRESS/info.html

HMM model versus the neutral HMM model should be able to indicate whether the speech is more likely stressed or neutral. Since the TEO-based autocorrelation envelope feature (TEO-CB-Auto-Env), MFCCs, and frame-based pitch information were shown to be very effective for stress classification, they were used to assess the stress for SUSC-0 data. Since both the TEO-based feature and pitch information are only useful for voiced speech, the assessment is based on the extracted voiced portions from each utterance. To consider the variations within each utterance, 4 voiced portions per utterance (shown in Table 8) are extracted for the assessment. Note that the neutral and stress HMM classification models were trained from the /eh/ phoneme in "help", and that almost all tested voiced sections consisted of different phonemes.

Table 8: Sentences from SUSC-0 used for Stress Assessment Evaluation. Note that uppercase characters represent voiced sections which were used for overall stress assessment of that sentence.

Sentences used for Evaluation from Mayday2 Domain of SUSC-0

No.  Sentence
     Extracted Phonemes

 1   Avionics lIGHt hydrAUlic oil pressure lIGHt engine indications ARE ...
     /ay/ /ao/ /ay/ /aa/
 2   AND you're gONNA declare an emERgency or am I
     /ae/ /aa-n-ax/ /er/ /ay/
 3   ... checklist OIl pressure malfunction G one-hundred ... cruise altitude stORe jett... throttle minimize mOvement...
     /oy/ /iy/ /ao/ /uw/
 4   Roger that OIl indicAtor is nOW zERO
     /oy/ /ey/ /aw/ /ih-r-ow/
 5   ... ALRIGHt newt... engine fault lIGHt still lit... hydrAUlics are ... total pOUNds six ..
     /ao-l-r-ay/ /ay/ /ao/ /aw-n/
 6   And I'm going there and I'm there I'm descENding down to ten grANd right I'm nOt picking up a tAcan lock
     /eh/ /ae-n/ /aa/ /ey/
 7   No I'M doing ALRIGHt now and the rAdial is whAt
     /eh-m/ /ao-l-r-ay/ /ey/ /ax/
 8   OkAY give me immEdiate vectors this is an emERgency I'm engine OUt
     /ey/ /iy/ /er/ /aw/
 9   gIve me hEAdings I nEEd headings nOW
     /ih/ /eh/ /iy/ /aw/
10   Put the cAble dOWN pUt the cAble down
     /ey/ /aw-n/ /uh/ /ey/
11   I'm hOt I nEEd the cAbLe
     /aa/ /iy/ /ey/ /ax-l/
12   mAn I thOUGHt I wAs gOne
     /ae/ /ao/ /aa/ /ao/

The assessment results are shown in Fig. 11. Here, a single score is obtained by averaging the output scores across the four extracted voiced sections per sentence. Generally speaking, the recordings begin in a neutral relaxed setting (sentences 1-2), then move into concern while the pilot begins to determine the cause of the problem (sentences 3-7). Finally, the pilot determines that the emergency is serious and that he must land the aircraft without power (sentences 8-11). Sentence number 12 indicates his relief after a safe landing.

Both panels ((a) and (b) in Fig. 11) show that the general assessment score trend is similar regardless of which anchor stress HMM reference model is used (note that a negative likelihood

difference score means that the 'neutral' HMM model is more likely, and that a positive score means the 'stressed' HMM model is more likely). The results do show that the stress HMM reference model trained from actual SUSAS stressed speech has larger fluctuations among assessment scores. This may be because that model represents an extreme case of stress. It is noted that SUSC-0 recordings can at times have high levels of background noise, so it is possible that stress assessment could be affected by this distortion (see footnote 9). The stress level profile versus increasing sentence location showed limited variation for MFCC features. This occurs because, while there are significant changes in spectral structure on a per phoneme basis as demonstrated in (Hansen, 1998b), the phoneme content of the voiced sections analyzed differs from that of both the neutral and the stressed MFCC-trained HMM models (this explains why the differences in log-likelihood scores are close to zero, since both models give similar scores). For pitch (fundamental frequency) versus time, we see that the neutral model is selected for sentence counts 1-7, with a sharp change towards the stressed HMM model for 8-12. We note that for sentence example 9, there were irregular pitch values resulting from the pitch estimation scheme which were not corrected (i.e., we wanted to compare performance of features without user intervention). Finally, the TEO-CB-Auto-Env feature produced more meaningful scores with an actual-stress trained reference HMM model than with a simulated-stress trained reference HMM model. Again, the neutral model received very high scores (large negative likelihood difference scores) for sentence entries 1-7. Sentence entries 8-12 produced scores which were more associated with the stressed model in both test cases. Since the neutral reference HMM model was the same in both test cases, the difference in scores reflects differences in the stressed reference model.
The results here demonstrate that the proposed feature can be used for the purposes of stress assessment, though it is suggested that the stressed speech reference model should be trained on data which reflects the desired type of stress to be assessed. Also, future studies could consider the influence of other distortions for assessment, including channel/microphone differences and acoustic background interference.

9 In this study, we choose not to perform speech enhancement due to the potential of introducing spectral based processing artifacts (Hansen 1999).

extraction procedure. Since each critical band possesses a much narrower bandwidth than the 1 kHz bandwidth used for BPFs in the TEO-Auto-Env feature (discussed in Zhou, Hansen, and Kaiser 1998a), post Gabor bandpass filtering centered at median F0 is not needed in TEO-CB-Auto-Env extraction. This makes the new feature independent of the accuracy of median F0 estimation.

In practice, all TEO profiles are segmented into frames and all autocorrelation functions are normalized. As a result, the autocorrelation function of a constant TEO profile is represented as a decaying straight line from (0, 1) to (N, 0), where N is the frame length. Variations caused by the harmonic distribution, as well as by modulations from stress, are expected to be reflected in changes in the TEO autocorrelation envelopes.
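The framing and normalization just described can be sketched for a single critical band. This is a rough illustration under stated assumptions, not the feature extraction used in this study: the area is taken under the normalized autocorrelation itself rather than under its envelope, the division by the frame length is an assumed normalization, and all function names are hypothetical. The Gabor filterbank and TEO stages are not shown.

```python
def split_frames(profile, frame_len=200, hop=100):
    """Segment a TEO profile into 200-sample frames with 100-sample overlap."""
    return [profile[s:s + frame_len]
            for s in range(0, len(profile) - frame_len + 1, hop)]

def normalized_autocorrelation(frame):
    """Autocorrelation at lags 0 .. N-1, normalized so that lag 0 equals 1."""
    n = len(frame)
    r0 = sum(v * v for v in frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k)) / r0
            for k in range(n)]

def autocorr_area(frame):
    """Area under the normalized autocorrelation, divided by the frame length."""
    r = normalized_autocorrelation(frame)
    return sum(r) / len(r)

def band_feature(profile):
    """One area parameter per frame for a single band's TEO profile."""
    return [autocorr_area(f) for f in split_frames(profile)]
```

For a constant profile the normalized autocorrelation at lag k is (N − k)/N, the straight line from (0, 1) to (N, 0), so the area parameter is close to 0.5; stacking one such value per critical band would give the per-frame feature vector.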

4.4 TEO Based Stress Detection Evaluations

Evaluations were also conducted using the SUSAS, Speech Under Simulated and Actual Stress database (see Hansen, 1998a for a discussion). In experiments discussed here, angry, loud and Lombard effect styles were used from SUSAS for simulated stress (speakers were requested to speak in that style; 85 dB SPL pink noise played through headphones was used to simulate the Lombard effect). Data for SUSAS actual stress was selected from the subject motion-fear domain. In the actual domain, a series of controlled speech data collection experiments were performed with speakers riding an amusement park roller coaster.

Since the TEO is more applicable to voiced sounds than to unvoiced sounds, only voiced sections of all word utterances were used for the evaluation. A baseline 5-state HMM-based stress classifier with continuous Gaussian mixture distributions was employed for the evaluations. For the purposes of comparison, frame-based pitch and MFCC features (Davis and Mermelstein, 1980) were also used.

A second evaluation for stress assessment will be presented in Section 7, which specifically considers the law enforcement voice recordings from the shoot-out at Mount Carmel.

[Figure 11 plots. (a) Neutral HMM model vs. simulated stress HMM model. (b) Neutral HMM model vs. actual stress HMM model. X-axis in both panels: sentence/phrase count (increasing in time), 1-13.]

Figure 11: Assessment results for pilot's speech from Mayday2 domain of SUSC-0 database (Log likelihood ratio is shown along Y-axis while sentence number is shown along X-axis): (a) Neutral vs Simulated stress (Loud, Angry and Lombard) HMM reference models; (b) Neutral vs Actual stress HMM reference models

6 Conventional/Commercial Voice Stress Analyzer Features

In this section, we consider an evaluation of several traditional features which have been used in the development of commercial voice stress analyzers. The results are presented in the form of a series of experiments. The three features considered are: (i) normalized pitch frequency, (ii) periodicity, and (iii) pitch jitter. The evaluations were conducted using three stressed speaking styles extracted from the SUSAS speech database: Angry, Loud, and Lombard effect.

6.1 Features: Normalized Pitch, Periodicity, Jitter

The scaled pitch measure is computed using the autocorrelation method. For the $i$-th frame of windowed speech, $s_i(n)$, the maximum valued autocorrelation lag, $m_{\max}(i)$, is computed using the function,

$$ m_{\max}(i) = \arg\max_{m} \left\{ R_i(m) = \frac{1}{N-m} \sum_{l=0}^{N} s_i(l)\, s_i(l+m) \right\}. \tag{36} $$

The pitch frequency of the signal is obtained by dividing the sample rate, $F_{sample}$, by the maximum valued autocorrelation lag,

$$ F_0(i) = \frac{F_{sample}}{m_{\max}(i)}. \tag{37} $$

Finally, scaled pitch is obtained by first applying the constraint that $F_0(i) \le F_0^{\max}$ and then dividing by a maximum allowable pitch frequency ($F_0^{\max} = 400$ Hz),

$$ \hat{F}_0(i) = \frac{F_0(i)}{F_0^{\max}}. \tag{38} $$

Scaled pitch values range from 0 to 1 with values near 1 typically observed for speech under extreme stress.

Periodicity represents the degree of voicing of the speech waveform. It is simply computed as the ratio of the energy at the $m_{\max}$ autocorrelation lag to the energy at lag zero:

$$ \rho(i) = \frac{R_i(m_{\max})}{R_i(0)}. \tag{39} $$

Jitter is related to the frame-to-frame variation in pitch period and essentially measures small fluctuations in glottal cycle lengths. Let $v(i)$ represent the absolute difference between the pitch period at frame $i$ and frame $i-1$:

$$ v(i) = \left| p(i) - p(i-1) \right|. \tag{40} $$

Jitter, $j(i)$, is computed as follows:

$$ j(i) = \frac{\tfrac{1}{2}\left[ v(i) + v(i+1) \right]}{\tfrac{1}{3}\left[ p(i-1) + p(i) + p(i+1) \right]}. \tag{41} $$
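The three excitation features defined in this subsection can be sketched in a few lines of Python. This is an illustrative reading of Eqs. (36) to (41), not the authors' software. The lag search range is an assumption: the shortest lag is set by the 400 Hz pitch ceiling from the text, while the 60 Hz lower pitch bound (`f0_min_hz`) is hypothetical. All function names are invented for the example.

```python
import math

F0_MAX_HZ = 400.0  # maximum allowable pitch frequency given in the text


def autocorr(frame, lag):
    """R_i(m) of Eq. (36): lag-m autocorrelation of one frame, scaled by 1/(N - m).

    Only products that stay inside the frame are summed, i.e. samples
    outside the frame are treated as zero.
    """
    n = len(frame)
    return sum(frame[l] * frame[l + lag] for l in range(n - lag)) / (n - lag)


def frame_features(frame, fs, f0_min_hz=60.0):
    """Return (scaled pitch, periodicity, pitch period in seconds) for one frame.

    f0_min_hz is an assumed lower pitch bound; it is not taken from the report.
    """
    min_lag = math.ceil(fs / F0_MAX_HZ)               # enforces F0 <= F0_MAX_HZ
    max_lag = min(int(fs / f0_min_hz), len(frame) - 1)
    m_max = max(range(min_lag, max_lag + 1), key=lambda m: autocorr(frame, m))
    scaled_pitch = (fs / m_max) / F0_MAX_HZ           # Eqs. (37) and (38)
    periodicity = autocorr(frame, m_max) / autocorr(frame, 0)   # Eq. (39)
    return scaled_pitch, periodicity, m_max / fs


def jitter(periods, i):
    """Eq. (41): mean of two adjacent period changes over the 3-frame mean period."""
    def v(k):
        return abs(periods[k] - periods[k - 1])       # Eq. (40)
    mean_period = (periods[i - 1] + periods[i] + periods[i + 1]) / 3.0
    return 0.5 * (v(i) + v(i + 1)) / mean_period


# Usage: one 30 ms frame of a 100 Hz sinusoid sampled at 8 kHz.
fs = 8000
frame = [math.sin(2.0 * math.pi * 100.0 * n / fs) for n in range(240)]
scaled_pitch, periodicity, period = frame_features(frame, fs)
```

For this test tone the scaled pitch comes out close to 100/400 = 0.25 and the periodicity close to 1. Because of the 1/(N - m) scaling, the selected lag can be off by a sample or so on short frames, and the periodicity can slightly exceed 1.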

6.2 CVSA: Computer Voice Stress Analyzer

The operation of the computer voice stress analyzer (CVSA) is based on the notion that muscles and limbs of the human body exhibit a natural tremor rate ranging from 8 to 12 Hz. Several underlying assumptions about speech production lead to the formulation of the device. First, since the vocal cords are primarily muscular tissue, it is assumed that the voice fundamental frequencies are modulated by an 8 to 12 Hz "microtremor". Second, increased levels of arousal or stress contribute to additional tension in the vocal cords. This results in a reduction of the natural tremor amplitude. Finally, it is assumed that "microtremors" are not audible to the listener, but are measurable using computer aided algorithms.

Various devices have been constructed to measure microtremors in the human voice. The analog device known as the Psychological Stress Evaluator (PSE) was studied by VanDercar et al. (1980). The general operation of the device consists of four basic modes. Each operation mode (known as Mode 1 to Mode 4) controls the degree to which the signal is filtered. The filtering in all four modes was accomplished using a combination resistor and capacitor circuit to produce varying degrees of low-pass response. In Mode 3, for example, the PSE circuitry consists of a 4.6 µF capacitor in parallel with a 30 kΩ resistor.
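As a rough worked example, the Mode 3 component values imply a low-pass corner frequency. The sketch below assumes that the resistor and capacitor act as a simple first-order RC low-pass with time constant RC, a topology the text does not spell out. The function name is illustrative.

```python
import math


def rc_cutoff_hz(resistance_ohm, capacitance_farad):
    """-3 dB corner of a first-order RC low-pass: f_c = 1 / (2 * pi * R * C)."""
    return 1.0 / (2.0 * math.pi * resistance_ohm * capacitance_farad)


# Mode 3 component values quoted in the text: 30 kOhm and 4.6 uF.
mode3_cutoff_hz = rc_cutoff_hz(30e3, 4.6e-6)

# The report describes Mode 3 as playing the tape at one-eighth speed, which
# scales every frequency in the recording down by a factor of 8.
TAPE_SLOWDOWN = 8
equivalent_cutoff_hz = mode3_cutoff_hz * TAPE_SLOWDOWN
```

Under these assumptions the corner is about 1.15 Hz on the slowed tape, equivalent to roughly 9.2 Hz in the original recording, which falls inside the 8 to 12 Hz band attributed to the microtremor.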

After reviewing the literature for the CVSA as described in (Cestaro, 1995), we implemented a Matlab version of the CVSA. The Matlab software is very simple in that it implements the digital filter described by Mode 3 operation of the PSE device. The software assumes input speech sampled at 8 kHz and outputs a time-domain waveform shape analogous to the pen drawings illustrated in (VanDercar et al., 1980).

During processing, the speech signal is first oversampled by a factor of 8 to simulate the one-eighth tape play speed of Mode 3. After oversampling, the waveform is passed through a low-pass digital filter with frequency response derived from the resistor/capacitor description of the analog device. The Matlab code listing is shown below:

    function z = cvsa(x)
    % CVSA  Simulate Mode 3 of the PSE on speech x sampled at 8 kHz.
    samp = 8000;    % sample rate assumed by the filter design (Hz)
    pass = 12;      % pass-band edge (Hz)
    stop = 15;      % stop-band edge (Hz)

    x = resample(x, 8, 1);    % 8x oversampling (one-eighth tape speed)
    x(x <= 0) = 0;            % zero all non-positive samples
    y = x;
    [n, Wn, beta, typ] = kaiserord([pass stop], [1 0], [0.01 0.1], samp);
    b = fir1(n, Wn, typ, kaiser(n+1, beta), 'noscale');    % Kaiser-window low-pass FIR
    z = filter(b, 1, y);

    plot(z);

The CVSA output is analyzed visually. Four aspects of the output waveform are assumed to reveal the degree of vocal stress: amplitude, leading edge, cyclic rate change, and "blocking". A description of each term and its visual manifestation can be found in (VanDercar et al., 1980). The most important indicator of stress or deception in speech is thought to be "blocking". Blocking occurs when straight parallel lines are seen in the output to form an envelope over the CVSA signal. Evaluation of the implemented CVSA scheme is presented in Section 7.

6.3 Evaluations: Normalized Pitch, Periodicity, Jitter

The speech data consisted of simulated stress from the SUSAS speech database. Specifically, 56 isolated words from each of 9 speakers were used to estimate GMM (Gaussian mixture model) based models for each stressed condition. The remaining 14 words were used for open test evaluation. Due to limited data, a round-robin train/test paradigm was used. During processing, each word token was first processed using an automatic end-point detection algorithm. Next, the three features were extracted every 10 msec from 30 msec windowed portions of data.

The evaluation consisted of a pair-wise stress classification task. Data submitted for test was assumed to be either neutral data or one of three stressed speaking styles. The classifier must therefore decide if the data is either neutral or stressed.

The evaluation consisted of submitting the test set data (different from training data) to each GMM (normal, angry, loud, Lombard). The output scores for each frame were used to compute a frame-based log likelihood ratio. The average of the frame-based measures was computed over a single isolated word and the output compared to a decision threshold. Values greater than the threshold are considered to be from normal speaking conditions, while values less than the threshold are considered to be from a stressed speaking style. The results summarized below are presented in the form of a series of experiments which serve to determine whether the GMM classifier structure, or the input speech data type, influences stress classification performance.
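The decision rule in the preceding paragraph can be illustrated with a small sketch. For brevity it assumes a single scalar feature per frame and toy one-component mixtures, whereas the report uses a three-feature vector and 32 to 128 mixtures. The model parameters, the zero threshold, and all names are hypothetical.

```python
import math


def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of one scalar frame feature under a 1-D Gaussian mixture."""
    terms = [
        math.log(w) - 0.5 * math.log(2.0 * math.pi * var) - (x - mu) ** 2 / (2.0 * var)
        for w, mu, var in zip(weights, means, variances)
    ]
    peak = max(terms)  # log-sum-exp for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def classify_word(frames, neutral_gmm, stressed_gmm, threshold=0.0):
    """Average the frame-based log likelihood ratio over one word, then threshold it."""
    llr = [
        gmm_log_likelihood(x, *neutral_gmm) - gmm_log_likelihood(x, *stressed_gmm)
        for x in frames
    ]
    score = sum(llr) / len(llr)
    return ("neutral" if score > threshold else "stressed"), score


# Toy models as (weights, means, variances) over scaled pitch; values are made up.
neutral_gmm = ([1.0], [0.3], [0.01])
stressed_gmm = ([1.0], [0.8], [0.01])

label_a, score_a = classify_word([0.28, 0.35, 0.31], neutral_gmm, stressed_gmm)
label_b, score_b = classify_word([0.75, 0.82, 0.90], neutral_gmm, stressed_gmm)
```

Here the first word is labeled neutral and the second stressed. In a pairwise test the stressed model would be replaced in turn by the angry, loud, and Lombard models.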

Experiment 1: In order to determine the influence of the number of mixtures in the GMM classifier, we ran an experiment with three different mixture sizes. All three features (i.e., normalized pitch frequency, periodicity, and pitch jitter) were used as a per-frame vector. The results are shown in Table 9. In general, as the number of Gaussian mixtures is increased, the ability of the classifier to more closely represent the changing feature structure should increase. As the results in Table 9 (1a) show, there is only a slight increase in performance as the number of mixtures increases. Since excitation features change more significantly for angry and loud speech, we would expect their performance to be much better than for Lombard speech. While this is true, the difference is not as large as one might expect if we simply considered mean pitch changes.

Table 9: Experiments using (i) normalized pitch frequency, (ii) periodicity, and (iii) pitch jitter as a three feature set for a GMM (Gaussian mixture model) stress classifier. Evaluations using 3 different size sets of mixture weights, and adding first and second order feature derivatives.

Pairwise GMM-Based Stress Classification Results

Exp.   Normal vs.   32 mixtures   64 mixtures   128 mixtures
(1a)   Angry        72.1%         70.9%         71.4%
(1a)   Loud         69.2%         72.2%         75.8%
(1a)   Lombard      62.7%         59.6%         64.4%
(2a)   Angry        75.0%         75.0%         73.9%
(2a)   Loud         77.6%         71.7%         71.2%
(2a)   Lombard      63.7%         63.3%         59.6%
(2b)   Angry        81.3%         78.7%         80.2%
(2b)   Loud         82.3%         78.5%         78.0%
(2b)   Lombard      66.0%         61.0%         61.0%
(3a)   Angry        82.8%         78.8%         80.4%
(3a)   Loud         86.6%         79.5%         82.3%
(3a)   Lombard      67.8%         68.7%         68.7%
(3b)   Angry        83.2%         84.0%         83.2%
(3b)   Loud         86.6%         85.8%         83.6%
(3b)   Lombard      70.8%         76.9%         69.2%

Experiment 2: In this experiment, the conditions are the same as those for Experiment 1, with the exception that only voiced speech sections were used in the 3-feature vector per frame. To determine which frames were voiced, we extracted all frames with a periodicity measure greater than 0.30. The results in Table 9 (2a) are for the case when the pitch mean is removed, and those in Table 9 (2b) are for the case when the pitch mean is not removed. In cases where pitch mean was previously shown to change significantly (i.e., loud and angry), the stress classification results were better. The results are about the same for Lombard speech.

Experiment 3: Several experiments were also performed where we augmented the three excitation features with the first- and second-order derivatives. Results for Table 9 (3a) are for the case of a combined 6-feature vector (3 static, 3 first-order derivatives) in the stress classification. In this scenario, stress classification performance improves for Lombard speech, but little real improvement is observed for angry or loud. If the second-order derivatives are included (now a 9-feature vector per frame; results in Table 9 (3b)), there is a measurable level of improvement. This was especially true for the 64 mixture case, and less so for the 128 mixture case. Again, including static features along with first- and second-order derivatives generally provides better resolving power for the classifier.
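The 6- and 9-dimensional vectors of Experiment 3 are built by appending time derivatives to the static features. The report does not give the derivative formula, so the sketch below assumes a simple central difference with the edge frames repeated. Function names and the example values are illustrative.

```python
def delta(sequence):
    """Central-difference derivative of a per-frame feature track (edges repeated)."""
    padded = [sequence[0]] + list(sequence) + [sequence[-1]]
    return [(padded[t + 2] - padded[t]) / 2.0 for t in range(len(sequence))]


def stack_with_derivatives(frames):
    """Map each static feature vector to [static, first-order, second-order]."""
    tracks = list(zip(*frames))                 # one track per static feature
    d1 = [delta(track) for track in tracks]     # first-order derivatives
    d2 = [delta(track) for track in d1]         # second-order derivatives
    return [
        list(frame) + [d[t] for d in d1] + [d[t] for d in d2]
        for t, frame in enumerate(frames)
    ]


# Four frames of (scaled pitch, periodicity, jitter); the values are made up.
static = [[0.1, 0.9, 0.0], [0.2, 0.9, 0.1], [0.4, 0.8, 0.1], [0.4, 0.8, 0.3]]
augmented = stack_with_derivatives(static)      # four 9-dimensional vectors
```

Dropping the `d2` terms gives the 6-feature configuration of Table 9 (3a).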

Experiment 4: Having established a baseline system using 64 mixtures, we set out to explore several issues involved in the training process. One issue of interest is that when different classes of features are used, quite often their variances will encompass a wide range. To reduce these effects, we set a variance threshold during the training process. Two experiments were performed: one with a variance floor of 0.001 instead of the standard 0.01 (Table 10 (4a)), and one with a variance floor of 0.0001 (Table 10 (4b)). Comparing results from Table 10 (4a) with Table 9 (3b) (64 mixture column), we see that reducing the variance floor increases classification performance, with good gains for loud and Lombard stress styles. However, dropping the variance threshold too low results in a slight loss in performance.
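A variance floor of this kind is usually applied when mixture variances are re-estimated. Where exactly the authors' training software applies it is not stated, so the following is only a sketch of the general idea, with a hypothetical function name and made-up data.

```python
def floored_variance(values, responsibilities, mean, floor=0.001):
    """Responsibility-weighted variance of one mixture component, clamped from below."""
    total = sum(responsibilities)
    variance = sum(r * (x - mean) ** 2 for x, r in zip(values, responsibilities)) / total
    return max(variance, floor)


# A component that has collapsed onto identical values would get zero variance;
# the floor keeps its likelihood finite.
collapsed = floored_variance([0.5, 0.5, 0.5], [1.0, 1.0, 1.0], mean=0.5)
# A well-spread component is left unchanged.
spread = floored_variance([0.0, 1.0], [1.0, 1.0], mean=0.5)
```

A lower floor lets narrow features such as jitter keep their small variances, at the price of sharper and more easily overfitted components.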

Experiment 5: In addition to adjustments in the feature variance floor during training, the number of iterations, given the training corpus, can also affect classification performance. Too many iterations will result in a model that is too specialized for the training set (especially true if the training token size is small). Too few iterations will produce a classifier which is too general. Again, this issue will depend on the amount and speaker set range of the available training data. In this experiment, we kept the same configuration as that for Experiment 4a, but considered increasing the number of iterations of the traditional Baum-Welch hidden Markov model training algorithm from 10 to 20. The results are summarized in Table 10 (5). Again, the additional training iterations, coupled with the adjustment in the feature variance floor, produce another slight increase in classification performance. We also tried an experiment where we used this set-up with frames which had a higher degree of voicing to see if transitional frames between voiced and unvoiced speech had much influence on the classifier performance. The results were almost the same, thus suggesting that transitional frames do not significantly impact performance for these stress conditions.

Table 10: Experiments using (i) normalized pitch frequency, (ii) periodicity, and (iii) pitch jitter as a three feature set for a GMM (Gaussian mixture model) stress classifier. Here, test cases explore differences in the minimum feature variance during training, the number of training iterations, and augmenting excitation based features with vocal tract spectral features (MFCCs). The last experiment considers neutral versus grouped stress conditions.

Pairwise GMM-Based Stress Classification Results

Normal vs.                                   Angry    Loud     Lombard
(4a) variance floor 0.001                    86.4%    90.5%    82.0%
(4b) variance floor 0.0001                   83.9%    88.9%    80.5%
(5)  Training: 20 iterations                 87.4%    92.9%    81.0%
(6a) with MFCCs                              94.6%    95.9%    87.5%
(6b) with MFCC, deltas, delta-deltas         92.6%    95.6%    86.9%
(7)  Neutral vs. grouped Stress              93.1%    96.2%    87.4%

Experiment 6: In the experiments thus far, we have considered different forms of features which represent excitation characteristics. However, it has been shown that stress also affects spectral structure as reflected in the vocal tract structure. In this experiment, we augment the three excitation features with traditional spectral based MFCC parameters, which generally reflect vocal tract structure. To help reduce the effect of glottal source information on the MFCC parameters, we performed a pre-emphasis (coefficient of 0.97). A 20-filter filterbank was used to obtain 8 MFCC spectral features per speech data frame. The results in Table 10 (6a) showed a marked improvement for all three stress conditions. We also considered the case where first and second order derivatives were included. In order to reduce the impact of the fine spectral structure, we reduced the number of static MFCC parameters from 8 to 4, and included 4 delta-MFCC and 4 delta-delta-MFCC parameters (i.e., first and second order derivatives). The delta features reflect the time rate of change of the static spectral structure. While including delta and delta-delta MFCC parameters has been shown to improve recognition of speech under stress, there was either no change or a slight loss in stress classification performance when they were included (Table 10 (6b)). Other experiments were also performed where we increased the variance of the excitation features by a scale constant, so that they would have more influence over spectral features. The results were within 0.1% of the values obtained in Table 10 (6a) and 10 (6b).
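The pre-emphasis step mentioned in Experiment 6 is a one-line filter. The sketch below assumes the usual first-order form y[n] = x[n] - a·x[n-1] with a = 0.97. The report states only the coefficient, and the function name is illustrative.

```python
def pre_emphasis(samples, coefficient=0.97):
    """First-order pre-emphasis: y[n] = x[n] - coefficient * x[n-1].

    The first sample is passed through unchanged because it has no predecessor.
    """
    return [samples[0]] + [
        samples[n] - coefficient * samples[n - 1] for n in range(1, len(samples))
    ]


# A constant (zero-frequency) input is almost removed after the first sample,
# which is how the filter tilts the spectrum towards higher frequencies.
flattened = pre_emphasis([1.0, 1.0, 1.0, 1.0])
```

The filterbank and cepstral steps that follow pre-emphasis are not shown here.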

We point out here that the use of spectral structure assumes that we have some examples of the speaker(s) in both neutral and stressed speaking conditions. Mean normalized excitation features generally are less speaker dependent, and therefore more appropriate for use when training speaker data is obtained from different speakers in similar test conditions.

Experiment 7: In this last experiment, we consider a test condition originally proposed by Womack and Hansen (1996), where instead of a binary stress classification decision, we assume that the speech is either neutral or stressed, and determine an overall detection rate. This essentially groups the three stress conditions into one class (we use all three stressed GMM models during the test, and if any one is selected over the neutral model, the input is classified as stressed). This decision process does not record an error if an incorrect stressed model is selected (i.e., if the input token is under angry stressed condition, and the loud stressed model is selected, then the input was correctly identified as being under stress). This scenario was chosen, because in many situations speakers are not producing speech under a single style, but in fact typically display a mixture of conditions. The results, Table 10 (7), are nearly the same as those for the case when MFCCs are included.
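The grouped decision rule of Experiment 7 can be written down directly. In this sketch each test token carries one average log-likelihood per model; the dictionary keys and the numbers are hypothetical.

```python
def detect_stress(model_scores):
    """Return (is_stressed, best_stressed_style) for one token.

    model_scores maps model names to average log-likelihoods and must
    contain a 'neutral' entry plus one entry per stressed style.
    """
    neutral = model_scores["neutral"]
    stressed = {name: s for name, s in model_scores.items() if name != "neutral"}
    best_style = max(stressed, key=stressed.get)
    return stressed[best_style] > neutral, best_style


def detection_rate(tokens):
    """tokens: list of (truly_stressed, model_scores); any stressed model may win."""
    correct = sum(1 for truth, scores in tokens if detect_stress(scores)[0] == truth)
    return correct / len(tokens)


tokens = [
    # An angry token for which the loud model wins: still counted as correct.
    (True, {"neutral": -5.0, "angry": -4.0, "loud": -3.0, "lombard": -6.0}),
    (False, {"neutral": -2.0, "angry": -4.0, "loud": -3.5, "lombard": -3.0}),
    # A stressed token that the neutral model wins: a miss.
    (True, {"neutral": -1.0, "angry": -2.0, "loud": -2.5, "lombard": -1.5}),
]
rate = detection_rate(tokens)
```

In the first token the wrong stressed style is chosen, yet the token counts as a correct detection, which is the behavior the paragraph describes.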

In summary, the best Gaussian mixture model based classifier for these stress conditions is as follows: excitation features include normalized pitch, periodicity, and jitter with their first and second order derivatives; use 20 iterations of the training algorithm; reduce the feature training variance threshold to 0.001; use 64 mixtures per model; and include at least some form of vocal tract spectral structure (MFCCs) if data is available.

7 Stress Analysis: Mt. Carmel Data

In this section, we consider analysis and evaluation of the actual stressed speech from Mt. Carmel. In Section 7.1, example feature plots are compared between the excitation features discussed in Section 6.1 for sample SUSAS and Carmel speech data. Section 7.2 considers stress assessment using pitch, MFCCs, and the nonlinear TEO based feature for the Mt. Carmel data. The Carmel data represents audio recordings between individuals during a law enforcement encounter with armed extremists.

7.1 Example Excitation Feature Plots

In the previous section, we discussed a number of experiments to determine the usefulness of traditional excitation features for stress classification. Here, we use the same Gaussian mixture model classifier trained using the Maximum Likelihood approach. The evaluation here, however, is focused on a comparison of these features with CVSA for both Mt. Carmel law enforcement data and SUSAS speech under stress data. Several frames of processed speech output for (1) Normalized pitch, (2) Jitter, and (3) CVSA output from Matlab are considered. While it is difficult to make certain judgements from only a few examples of stressed and neutral speech, we use a comparison with examples from SUSAS, which contained much more test data.

The first plot (Fig. 12) shows results for telephone speech collected from a 911 call made during the FBI raid on Mt. Carmel. Here, we see that the high-stress condition results in normalized pitch values near 1 throughout the beginning and end of the audio fragment. There is also an increase in the jitter output near the middle of the segment. For the CVSA output, we see that the variations in the output waveform are reduced for the case of the high-stressed speech, which we would expect for the case when microtremors are absent due to the presence of stress. This would also, however, contradict expectations of "blocking" which should be readily visible for speech under stress.

In order to compare these results with earlier evaluations, we repeated these evaluations with speech data from SUSAS. Data from the neutral word "fix" and the same word produced under actual stress (roller coaster environment) were processed. Fig. 13 presents feature profiles for the three features (normalized pitch, jitter, CVSA). Because the classifiers considered are statistical in nature, it is difficult to visually see significant differences between the normalized pitch and jitter features. The CVSA outputs show significant differences between the neutral and stressed conditions. However, we point out that the stressed speech signal lacks the "blocking" output that is expected from the CVSA in stressed conditions. We might point out that speech data from the actual portion of SUSAS was from roller coaster rides, which potentially could include low frequency physical vibration.

[Figure 12 plots. Mount Carmel stressed speech data: sentence S1 (Neutral) and sentence S7 (High Stress). Recoverable axis labels: NORMALIZED PITCH (0 to 1), FRAME, SAMPLE.]

Figure 12: Feature analysis results for speech from the Mt. Carmel recording. Sentences S1 and S7 were selected. The three features are (i) normalized pitch, (ii) jitter, and (iii) CVSA response.

[Figure 13 plots. SUSAS stressed speech data (actual domain): neutral "fix" and actual stress (roller coaster) "fix". Recoverable axis labels: NORMALIZED PITCH, SAMPLE.]

Figure 13: Feature analysis results for speech from the SUSAS database (actual domain). The word "fix" spoken under neutral and actual stress (roller coaster) conditions was selected. The three features are (i) normalized pitch, (ii) jitter, and (iii) CVSA response.

7.2 Assessment Evaluation for Mt. Carmel Data

In this section, a stress assessment evaluation similar to that presented in Section 5 is considered, using speech data recorded during the Mt. Carmel law enforcement encounter. The audio recordings obtained consisted of telephone conversations between an extremist individual (sect leader) within the compound and the 911 emergency service, which he called from the beginning of the shooting.

The speech we assessed was that of the sect leader during his dialogue with the 911 service. A total of 20 sentences were segmented from the talker's speech and used for an experiment in stress assessment. In that specific situation, almost all 20 sentences were spoken under stress. However, the degree of stress varies from time to time. For example, the sect leader explained the situation in sentences 1, 9, and 10 in relatively neutral conditions, while sentences 7 and 8 were spoken during the actual shooting, with gunshots present as background noise. It is clear from these examples that the speaker was under an extreme level of stress. Similar to the experiment using SUSC-0 data, four voiced portions per utterance were extracted for assessment (text transcriptions and extracted voiced sections are summarized in Table 11).

Table 11: Sentences from Mt. Carmel Recording used for Stress Assessment Evaluation. Note that bold uppercase characters represent voiced sections, which were used for overall stress assessment of that sentence.

Sentences used for Evaluation from Speech Recorded from Mt. Carmel Shooting

No.  Sentence Text
1    There are mAn, seventy five men arOUNd our bUILding,... shooting at us
2    YEAH, there are seventy five mEN around our building, they are shOOting at us at MOUNt Carmel.
3    YEAH, tell them there chILdren and women in hERE and to cALL it off
4    TELL thEM to cALL it
5    TELL thEM to pULL bAck
6    TELL thEM to pULL bAck
7    I Am UNdER fIRE
8    I have the right to defend mysELf. They started firing fIRst
9    'nother chopper with mORe people and mORe guns going on. here they cOMe..
10   That's nOt Us, thAt's thEM
11   We wAnna cEAse-fIRE, we'll tALk
12   we'll tALk when thEY stOp firing
13   ThEY ARE.r thEY) ARE} ]. (two different speakers)
14   They hAven't bEEN} thEY hAven't been (different speaker breaks in)
15   ThAt's thEM, thEY hAven't been.[??].. shooting.[???] (noise breaks in during speech)
16   They're, What do you thINk they doing ALL this firing on us right nOW?
17   lEAst thREE (break into two portions) hits
18   ONE (break into two) dEAd (break into two)
19   I'M tALkING [??] (another speaker breaks in at the end)
20   HOLd their ARE}, to lEAve the property and we'll tALk

The assessment results are shown in Fig. 14. Instead of using the actual score difference for the y-axis as we did in Fig. 11, we used the normalized HMM score difference. In essence, the HMM score differences are normalized for each feature, respectively, based on the corresponding range. This was performed because the range of HMM score differences for pitch was so large that the change in score difference for the TEO-CB-Auto-Env feature and MFCC feature could not be observed clearly when all three were plotted on the same figure. As we can see, the general assessment score trend is independent of which anchor stress HMM reference model is used (i.e., one trained using simulated stress data from SUSAS or actual stressed speech from SUSAS). Sentences assessed in this experiment have different levels and types of background noise, such as gunshots, so the prospect exists that assessment results could be affected by background noise. Upon a careful listener evaluation of all 20 sentences, we found that pitch and the TEO-CB-Auto-Env feature reflected similar information regarding the degree of perceived speaker stress, while the MFCC feature was very inconsistent. We also note that the accuracy of stress assessment could be influenced by the type of recording condition. In some cases here, the speech sounds 'hollow', as if the microphone recording conditions changed (there are examples where the speaker is actually yelling and cases where his mouth could be some distance from the microphone). There are also many examples where the voiced portions are very short. In spite of these observations, it appears that relative to the first sentence, there is some degree of consistency for sentences which are more relaxed and those which are under higher degrees of stress.
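The per-feature normalization described at the start of this paragraph can be sketched as follows. The exact formula is not given in the report; dividing each feature's score differences by that feature's own range (maximum minus minimum) is an assumption, chosen because it preserves the sign of every score. The function name and numbers are illustrative.

```python
def normalize_by_range(score_differences):
    """Scale one feature's HMM score differences by its own range (max - min).

    Assumes the scores are not all identical, so that the range is non-zero.
    """
    span = max(score_differences) - min(score_differences)
    return [d / span for d in score_differences]


# Made-up score differences for two features with very different ranges.
pitch_scores = [-30.0, 10.0, 50.0]
teo_scores = [-0.5, 0.25, 1.5]
pitch_normalized = normalize_by_range(pitch_scores)   # [-0.375, 0.125, 0.625]
teo_normalized = normalize_by_range(teo_scores)       # [-0.25, 0.125, 0.75]
```

After scaling, both features span one unit on the y-axis and can share a plot, while negative values still point to the neutral model and positive values to the stressed model.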

[Figure 14 plots. (a) Neutral HMM model vs. simulated stress HMM model. (b) Neutral HMM model vs. actual stress HMM model. Curves: MFCC, Pitch, TEO-CB-Auto-Env. Y-axis: normalized HMM score difference (stress at top, neutral at bottom). X-axis: sentence number (time), 1-20.]

Figure 14: Assessment results for speech from Mt. Carmel Recording (Log likelihood ratio is shown along Y-axis while sentence number is shown along X-axis): (a) Neutral vs Simulated stress (Loud, Angry and Lombard) HMM reference models; (b) Neutral vs Actual stress HMM reference models

8 CONCLUSIONS: Issues for Stress Assessment and Classification

The issue of stress classification is a problem which is becoming increasingly important for law enforcement and military in the field. Past methods for voice stress analysis have focused on what are believed to be microtremors in the muscles for voice production. While there is evidence which suggests that muscle control within the speech production system could be, and most likely is, influenced by the presence of stress experienced by the speaker, there is still uncertainty as to what degree and how consistently this change in speech muscle control could actually manifest itself in the form of "microtremors" during the speech production process. Clearly, extensive research in the medical field has considered neurologically based factors that affect human speech production (for example, the work done for Parkinson's speech (L. Ramig, in Kent 1992)).

In this report, we have considered previous studies on speech under stress, results from our own evaluations, experiments using features derived from commercial voice stress analyzers, and novel nonlinear based features recently formulated in the literature. All of these findings suggest that when a speaker is under stress, their voice characteristics change. Changes in pitch, glottal source factors, duration, intensity, and spectral structure from the vocal tract are all influenced in different ways by the presence of speaker stress. Our results also suggest that the features upon which commercial voice stress analyzers are based can at times reflect changes in the speech production system which occur when a speaker is under stress. However, as is the case with speaker control of pitch, a variety of factors could influence the presence or absence of the microtremors which are claimed to exist in our muscle control during speech production. It is clearly unlikely that a single measure, such as that based on the CVSA, could be universally successful in assessing stress (such as that which might be experienced during the act of deception). However, it is not inconceivable that under extreme levels of stress, muscle control throughout the speaker will be affected, including muscles associated with speech production. The level and degree to which this change in muscle control imparts less/more fluctuations in the speech signal cannot be conclusively determined, since even if these tremors exist, their influence will most certainly be speaker dependent. A similar argument has been made in the medical community over non-invasive voice analysis for screening of subjects with vocal fold cancer (Hansen, Gavidia-Ceballos, and Kaiser, 1998).

Many commercial voice stress analyzers are presently on the market. Some of these include:

• PSE: Psychological Stress Evaluator, developed by A. Bell, Verimetrics (U.S. Patent by Bell and others, 1976).
• CVSA: Computerized Voice Stress Analyzer, National Inst. Truth Ver., C. Humble.
• Lantern: Diogenes Group.
• Truster: Makh-Shevet, Israeli company.
• Several low cost voice stress analyzer kits.

Although the details by which these methods operate are not clearly described in their literature, the claims of success are well documented in the company literature. Most, if not all, of these methods focus on some aspect of assessing the presence of microtremors which are expected to be present when a speaker is under neutral/calm speaking conditions. These microtremors are expected to be reduced when a speaker is under stress. The results from our study here cannot prove or disprove the commercial claims. However, our evaluations using various linear and nonlinear based excitation features suggest that various types of emotion/stress can be detected in some individuals. The reliability will depend on the available training data for the classifier, and we expect that stress classification performance should be more successful if there is a means of "training" the system for a given speaker in similar conditions. Some of the claims made by these manufacturers have no basis, or are so extreme that they go against basic speech science. The Truster web-page states that their system will be able to determine deception even if the speaker is under different levels/types of emotion. Such a claim has no scientific merit, since it is not possible to cleanly separate the excitation signal into components due to emotion and those due to deception.

More recent algorithms for voice stress analysis have been proposed using digital speech processing techniques, some of which suggest alternative excitation methods which offer the promise of better system integration within speech/speaker recognition or voice equipment for communications scenarios.

While research and progress have been made in the areas of stress classification and assessment, a number of important research areas require further investigation. Here, we briefly consider four points. First, in order to perform stress classification or assessment, two anchor models are needed (one for neutral and one for stress). These models should be trained using speech obtained from the actual stressful environments in which we wish to assess operators (i.e., aircraft pilot recordings if pilots are to be assessed; subject interviews in law enforcement). The type of stress which is displayed in one setting (aircraft cockpit) may not reflect the same conditions experienced in another (law enforcement questioning session). Second, further research is needed to assess the consistency of stress assessment/classification for a given speaker and for unseen speakers (i.e., explore the impact of using other training data to assess new speakers). Commercial systems assume that the same feature will be affected in all speakers. There needs to be a way of determining if a stress classification algorithm/system would prove to be useful, or if the speaker is not a viable candidate for assessment. Third, there is clearly a range of emotions and psychological factors which all contribute to speaker 'stress.' In emergency scenarios a pilot may experience a combination of fear, anxiety, fatigue, etc. at the same time. A suspect under questioning would also display natural stress even if he were not guilty. The ability to classify/assess this mixture of speaker traits is important in determining the stress state of the speaker. Finally, there exists an unknown relationship between how computer based speech systems are able to classify stress and how humans perform stress classification.
This operation is well documented in the field of speech quality assessment, where there exists scientifically recognized subjective tests, which are used to determine a degree of correlation with numerical objective measures. It would make sense to explore the field to determine if standardized tests exist or could be modified to subjectively determine stress state and level in speakers, and then apply either commercial systems or research based stress classification algorithms to determine their 'correlation' to correct stress detection. This issue is important in the collection of future databases so that better stress anchor models can be used with emerging speech technology. From the research conducted here, it is suggested that speakers often vary how they convey stress in their speech, and that several speech features may be needed to

98 capture the subtle differences in how speakers convey their stress state in different voice communications scenarios.
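The two-anchor-model idea in the first point can be sketched in a few lines of Python. This is only an illustration under strong assumptions, not the report's method: each anchor is a single one-dimensional Gaussian fitted to one scalar feature, the training values are invented, and the names `fit_gaussian`, `log_likelihood`, and `stress_score` are hypothetical. A real system would use several features and models trained on speech from the target environment.

```python
import math
from statistics import mean, pvariance


def fit_gaussian(values):
    """Fit a 1-D Gaussian (mean, variance) to a list of feature values."""
    mu = mean(values)
    var = max(pvariance(values, mu), 1e-6)  # floor the variance to avoid division by zero
    return mu, var


def log_likelihood(x, model):
    """Log density of x under a 1-D Gaussian model (mu, var)."""
    mu, var = model
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mu) ** 2 / var)


def stress_score(features, neutral_model, stress_model):
    """Average log-likelihood ratio; positive means closer to the stress anchor."""
    return mean(
        log_likelihood(x, stress_model) - log_likelihood(x, neutral_model)
        for x in features
    )


# Hypothetical per-frame feature values (e.g. mean pitch in Hz), invented for illustration.
neutral_train = [118.0, 121.0, 119.5, 122.0, 120.0]
stress_train = [141.0, 138.5, 144.0, 139.0, 142.5]

# One anchor model per condition, as the text requires.
neutral_model = fit_gaussian(neutral_train)
stress_model = fit_gaussian(stress_train)

# Score an unseen utterance against both anchors.
test_utterance = [137.0, 140.0, 143.0]
score = stress_score(test_utterance, neutral_model, stress_model)
label = "stress" if score > 0 else "neutral"
print(label, round(score, 2))
```

The sign of the score depends entirely on the training data behind the two anchors, which is why the text insists that both be trained on speech from the environment being assessed.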

Software implementations of the features presented in this report will be supplied directly to the sponsor. These will include Matlab and C implementations of the linear features, CVSA, and the TEO-CB-Auto-Env measure.
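The report's Matlab and C code is not reproduced here. As a rough illustration of the operator that the TEO-CB-Auto-Env measure takes its name from, the Python sketch below computes the standard discrete Teager Energy Operator, Ψ[x(n)] = x(n)² − x(n−1)·x(n+1). It covers only that operator: the critical-band filtering and autocorrelation-envelope steps implied by the measure's name are omitted, and the function name `teager_energy` and the test tone are assumptions made for this example.

```python
import math


def teager_energy(signal):
    """Discrete Teager Energy Operator: psi[n] = x[n]^2 - x[n-1] * x[n+1].

    Returns one value per interior sample (the two end samples have no neighbours).
    """
    return [
        signal[n] ** 2 - signal[n - 1] * signal[n + 1]
        for n in range(1, len(signal) - 1)
    ]


# Hypothetical test tone: amplitude 0.5, digital frequency 0.3 rad/sample.
amplitude, omega = 0.5, 0.3
tone = [amplitude * math.cos(omega * n) for n in range(200)]

psi = teager_energy(tone)

# For a pure sinusoid the operator output is constant: (A * sin(omega))^2.
expected = (amplitude * math.sin(omega)) ** 2
max_error = max(abs(p - expected) for p in psi)
print(max_error)
```

For a pure tone the output depends on both amplitude and frequency. This is the property that makes the operator sensitive to changes in excitation that a plain energy measure would miss.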

100 9 References Speech Communication (1996) Special Issue on Speech under "Stress", 20(l-2):3-173, Nov.

Arslan, L. (1996)."Foreign Accent Classification", Ph.D. Thesis, Robust Speech Processing Laboratory, Duke University, July.

Baber, C, Mellor, B., Graham, R., Noyes J.M., Tunley, C. (1996)."Workload and the Use of Automatic speech recognition: The Effects of Time and Resource Demands," Speech Communication, 20(1--2) 37-54.

Bachrach, A.JA (1979)."Speech and its Potential for Stress Monitoring: Monitoring Vital Signs in the Diver," Naval Medical Research Institute TECHNICAL REPORT, Aug., 78-93.

Barnwell, T.P. (1971)."An Algorithm for Segment Durations in a Reading Machine Context," Final Technical Report 479, Research Laboratory of Electronics, Mass. Inst. Of Tech., Cambridge, MA.

Bell, A.D., Ford, W.H., McQuiston, C.R., "Physiological Response Analysis Method and Apparatus," U.S. Patent No. 3,971,034, July 20,1976.

Bond, Z.S. and Moore, T.J. (1990). "A note on Loud and Lombard Speech," ICSLP-90, Kobe, Japan, 969-972.

Cairns, D.A. and Hansen, J.H.L. (1994a). "Nonlinear Analysis and Detection of Speech Under Stressed Conditions," J. Acoust. Soc. Am. 96(6) 3392-3400.

Cairns, D.A. and Hansen, J.H.L. (1994b). "Nonlinear Speech Analysis using the Teager Energy Operator with Application to Speech Classification under Stress," ICSLP-94, n(3):1035-1038, Yokohama, Japan, Sept.

Chen, Y. (1987)."Cepstral Domain Stress Compensation for Robust Speech Recognition," IEEE ICASSP-87, Dallas, TX, 717-720.

Cestaro, V. L., "A Comparison between the Decision Accuracy Rates Obtained Using the Polygraph Instrument and the Computer Voice Stress Analyzer (CVSA) in the Absence of Jeopardy," Tech Report, DoD Polygraph Inst., August 1995.

Creelman, CD. (1962). "Human Discrimination of Auditory Duration," J. Acoust. Soc. Am. 34(5) 582-593.

Darby, J.K. (1981).Speech Evaluation in Psychiatry, Grime & Stratton, New York, New York.

Davis, S., and Mermelstein, P. (1980).Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust., Speech, and Signal Process., ASSP-28(4) Aug., 357-366.

Deller, J.R., Hansen, J.H.L., and Proakis, J.G. (1999)."Discrete-Time Processing of Speech Signals," 2nd Ed., IEEE Press, New York, NY.

Flack, M. (1918)."Flying Stress," London: Medical Research Committee.

Folds, DJ., Gerth, J.M., Engelman, W.R. (1986)."Enhancement of Human Performance in Manual Target Acquisition and Tracking," Final Tech. Report USAFASM-TR-86-18, USAF School of Aerospace Med., Brooks AFB, TX.

Folds, D.J. (1987)."Response Organization and Time-Sharing in Dual-Task Performance," Ph.D. Thesis, School of Psychology, Georgia Inst. of Tech., Atlanta, GA.

101 Fry, D.B. (1955)."Duration and Intensity as Physical Correlates of Linguistic Stress," J. Acoust. Soc. Am. 27(4) 765-768.

Gardner, M.B. (1966)."Effect of Noise System Gain, and Assigned Task on Talking Levels in Loudspeaker Communication," J. Acoust. Soc. Am. 40(5) 955-965.

Goldberger, L., Breznitz, S. (l9S2).Handbook of Stress: Theoretical & Clinical Aspects, Free Press, Macmillan Pub., New York, New York.

Gong, Y. (1995)."Speech recognition in noisy environments: A survey," m Speech Communication, 16:261-291.

Hanley, C.N, Harvey, D.G. (1965)."Quantifying the Lombard Effect," J. of Hearing & Speech Disorders, 30:274- 277.

Hansen, J.H.L. (1988). "Analysis and Compensation of Stressed and Noisy Speech with Application to Robust Automatic Recognition," Ph.D. Thesis, School of Electrical Engineering, Georgia Inst. of Tech., Atlanta, GA.

Hansen, J.H.L. (1989). "Evaluation of Acoustic Correlates of Speech Under Stress for Robust Speech Recognition," IEEE Proc. 15th Northeast Bioengineering Conf., Boston, Mass., 31-32.

Hansen, J.H.L. (1993)."Adaptive Source Generator Compensation and Enhancement for Speech Recognition in Noisy Stressful Environments," ICASSP-93, Minn., MN, 95-98.

Hansen, J.H.L. (1994). "Morphological Constrained Enhancement with Adaptive Cepstral Compensation (MCE- ACC) for Speech Recognition in Noise and Lombard Effect," IEEE Trans, on Speech & Audio Proc 2(4):598-614.

Hansen, J.H.L. (1996). "Analysis and compensation of speech under stress and noise for environmental robustness in speech recognition, Speech Communications: Special Issue on Speech Under Stress, 20(1-2): 151-173.

Hansen, J.H.L. (1998a)."Analysis of Acoustic Correlates of Speech Under Stress.Part I: Fundamental Frequency, Duration, and Intensity Effects," submitted Oct. 20,1998 to./ Acoust. Soc. Am.

Hansen, J.H.L. (1998b)."Analysis of Acoustic Correlates of Speech Under Stress.Part II: Glottal Source and Vocal Tract Spectral Effects," submitted Oct. 20,1998 to J. Acoust. Soc. Am.

Hansen, J.H.L., Bria, O.N. (1990)."Lombard Effect Compensation for Robust Automatic Speech Recognition in Noise," ICSLP-90, Kobe, Japan, 1125-1128.

Hansen, J.H.L., Bou-Ghazale, S.E. (1995). "Duration and Spectral Based Stress Token Generation for Keyword Recognition Using Hidden Markov Models," IEEE Trans, on Speech & Audio Proc, 3(5), 415-421.

Hansen, J.H.L., Bou-Ghazale, S.E. (1997)."Getting Started with SUSAS: A Speech Under Simulated and Actual Stress Database," EUROSPEECH-97,4:1743-1746, Rhodes, Greece.

Hansen, J.H.L;, Cairns, D.A. (1995)."ICARUS: Source generator based real-time recognition of speech in noisy stressful and Lombard effect environments," Speech Communications, 16:391-422.

Hansen, J.H.L., Clements, M.A. (1987)."Evaluation of Speech under Stress and Emotional Conditions," Proc. J. Acoust. Soc. Am., 82(Fall Sup.):S17, Nov.

Hansen, J.H.L. and Clements, M.A. (1989)." Stress and Noise Compensation Algorithms for Robust Automatic Speech Recognition," IEEE ICASSP-89, Glasgow, Scotland, U.K., 266-269.

102 Hansen, J.H.L. and Clements, M.A. (1995)."Stress and Noise Compensation Algorithms for Robust Automatic Speech Recognition," IEEE Trans, on Speech & Audio Proc.,} 3(5):407-415.

Hansen, J.H.L., Gavidia-Ceballos, L., and Kaiser, J.F., (1998)."A nonlinear based speech feature analysis method with application to vocal fold pathology assessment," IEEE Tram. On Biomedical Engineering, 45(3):300-313, March 1998.

Hansen, J.H.L., Mammone, R., Young, S. (1994)."Editorial for the SPECIAL ISSUE: Robust Speech Recognition," IEEE Trans, on Speech & Audio Proc, 2(4):549-550.

Hansen, J.H.L., South, A.J., Swail, C, Moore, R.K., Steeneken, H.J.M., Cupples, E.J., Anderson, T., Vloeberghs, C, Trancoso, I., Verlinde, P. (1999). The Impact of Speech Under "Stress" on Military Speech Technology, Final Technical Report, NATO AC/232/IST/TG-01, March.

Hansen, J.H.L., Womack, B. D. (1996). "Feature Analysis and Neural Network Based Classification of Speech Under Stress," IEEE Trans. Speech Audio Proc, 4(4):307-313.

Hanson, B.A., Applebaum, T.H. (1990)."Robust speaker-independent word recognition using static, dynamic and acceleration features: Experiments with Lombard and noisy speech," IEEE ICASSP-90, pp. 857- 860.

Haward, L. (1975) "Emotional Stress and Flying Efficiency," AGARD Con/. Proceedings No.181, North Atlantic Treaty Organization, C8-1, Oct.

Hecker, M.H.L., Stevens, K.N., von Bismarck, G., Williams, C.E. (1968) "Manifestations of Task-Induced Stress in the Acoustic Speech Signal," J. Acoust. Soc. Am 44(4), 993-1001.

Hecker, M.H.L. (1974) "A Study of the Relationships Between Consonant-Vowel Ratios and Speaker Intelligibility," Ph.D Thesis, Stanford University, Palo Alto, CA.

Hess, W. (1983) Pitch Determination of Speech Signals, Springer Verlag, New York, NY.

Hicks, J.W., Hollien, H., (1981a)."The Reflection of Stress in Voice-1: Understanding the Basic Correlates," The 1981 Carnahan Conf. on Crime Countermeasures, 189-195.

Hollien, H., Hicks, J.W. (1981b)."The Reflection of Stress in Voice-2: The Special Case of Psychological Stress Evaluators," The 1981 Carnahan Conf. on Crime Countermeasures, May, 196-197.

Hollien, H., Majewski, W. (1977)."Speaker Identification by Long-Term Spectra Under Normal and Distorted Speech Conditions," J. Acoust. Soc. Am., 62(4):975-980.

Hollien, H., Majewski, W., Doherty, E.T. (1982)."Perceptual identification of voices under normal, stress and disguise speaking conditions," J. Phonetics, 10:139-148.

House, A.S. (1962)."On Vowel Duration in English," J. Acoust. Soc. Am. 33(9) 1174-1178.

Jelinek, J. (1997). Statistical Methods for Speech Recognition, MIT Press, Cambridge, MA.

Jex, H.R. (1979)."A Proposed Set of Standardized Sub-Critical Tasks For Tracking Workload Calibration," in N. Moray, Mental Workload: Its Theory and Measurement, New York: Plenum Press, 179-188.

Junqua, J.C. (1993)."The Lombard reflex and its role on human listeners and automatic speech recognizers, J. Acoust. Soc. Am., 93(1):510-524.

103 Junqua, J.C. (1996). "The Influence of Acoustics on Speech Production: A Noise-induced Stress Phenomenon Known as Lombard Reflex," Speech Comm., 20(l-2):13-22.

Junqua, J.C, Fincke, S., Field, K. (1999)."The Lombard Effect: A reflex to better communicate with others in noise," IEEE ICASSP-99,4:2083-2086.

Kaiser, J.F. (1990)."On a Simple Algorithm to Calculate the "Energy' of a Signal," IEEE ICASSP-90, pp. 381-384.

Kaiser, J.F. (1993)."Some Useful Properties of Teager's Energy Operator," IEEE ICASSP-93,3:149-152.

Klatt, D. (1973)."Interaction between two factors that influence vowel duration," J. Acoust. Soc. Am. 54(4) 1102-4.

Klatt, D. (1976)."Linguistic uses of segmental duration in English: Acoustic and perceptual evidence," J. Acoust. Soc. Am. 59(5) 1208-21.

Köhler, K.J. (1986)."Invariance and Variability in Speech Timing: From Utterance to Segment in German," Chap. 13, Invariance and Variability in Speech Processes, ed.s J. Perkell& D. Klatt, Lawrence Erlbaum Ass., Hillsdale, NJ.

Kuroda, I., Fujiwara, O., Okamura, N., Utsuki, N. (1976)."Method for Determining Pilot Stress Through Analysis of Voice Communication," Aviation, Space, and Environmental Medicine, May, 528-533.

Levinson, S.E. (1986) "Continuously variable duration hidden Markov models for automatic speech recognition," Computer Speech and Language, 1:29-45.

Lieberman, P., Michaels, S. (1962). "Some Aspects of Fundamental Frequency and Envelope Amplitude as Related to the Emotional Content of Speech," J. Acoust. Soc. Am. 34(7) 922-927.

Lippmann, R.P., Mack, M., Paul, D., (1986)."Multi-Style Training for Robust Speech Recognition Under Stress," Proc. J. Acoust. Soc. Am. 110th Meeting, QQ10.

Lippmann, R.P., Martin, E.A., Paul, D.B. (1987)."Multi-Style Training for Robust Isolated-Word Speech Recognition," IEEE ICASSP-87, Dallas, TX, 705-708.

Lively, S., Pisoni, D., van Summers, W., Bernacki, R. (1993)."Effects of cognitive workload on speech production: Acoustic analyses and perceptual consequences," J. Acoust. Soc. Am., 93(5) 2962-2973.

Lombard, E. (1911)."Le Signe de Elevation de la Voix," Ann. Maladies Oreille, Larynx, Nez, Pharynx, 37,101-119.

Malkin, F.J., Christ, K.A. (1985)."Human Factors Engineering Assessment of Voice Technology for the Light Helicopter Family," U.S. Army Human Engineering Lab. Tech. Report, 1-20, June.

Maragos, P., Kaiser, J.F., and Quatieri, T.F. (1993a)."Amplitude and Frequency Demodulation Using Energy Operators," IEEE Trans, on Signal Proc, 41(4): 1532-1550.

Maragos, P., Kaiser, J.F., and Quatieri, T.F. (1993b)."Energy Separation in Signal Modulations with Application to Speech Analysis," IEEE Trans, on Signal Proc, 41(10):3025-3051, Oct.

Martin, E.A., Lippmann, R.P., Paul, D.B., (1987)."Two-Stage Discriminant Analysis for Improved Isolated-Word Recognition," ICASSP-87, Dallas, TX, April, 713-716.

104 Murray, I.R., Arnott, J.L., Rohwer, E.A. (1996)."Emotional stress in synthetic speech: Progress and future directions," Speech Communication, 20:85-92.

Murray, I.R., Baber C, South, A. (1996)."Towards a Definition and Working Model of Stress and Its Effects on Speech," Speech Comm., 20 (l-2):3-12.

Nicholson, A.N., Hill, L.E., Borland, R.G., Krzanowski, W.J. (1973)."Influence of Workload on the Neurological State of the Pilot During the Approach and Landing," Aerospace Medicine, 44(2) 146-152.

Ofuka, E., Valbret, H., Waterman, M., Campbell, N., Roach, P. (1994) "The role of F0 and duration in signaling affect in Japanese: anger, kindness, and politeness," ICSLP-94,3:1447-1450.

Paul, D.B. (1987)."A Speaker-Stress Resistant HMM Isolated Word Recognizer," IEEE ICASSP-87, pp. 713-716.

Pearsons, K.S., Bennett, R.L., Fidell, S. (1977)."Speech Levels in Various Noise Environments," Office of Health and Ecological Effects, Report No. EPA-600/1-77-025.

Peckham, J.B. (1979)."A Device For Tracking The Fundamental Frequency of Speech and its Application in the Assessment of 'Strain' in Pilots and Air Traffic Controllers," Tech. Report 79056, Royal Aircraft Est., May, 1-55, 1979.

Perkell, J.S., Klatt, D.H. (ed.'s) (1986).Invariance and Variability in Speech Processes, Lawerance Erlbaum Ass., Hillsdale, N.J.

Pickett, J.M. (1980). The Sound of Speech Communication, University Park Press, Baltimore, Maryland.

Pisoni, D.B., Bernacki, R.H., Nusbaum, H.C., Yuchtman, M. (1985)."Some Acoustic-Phonetic Correlates of Speech Produced in Noise," ICASSP-85, Tampa, Fl, 41.10.1-4.

Poock, G.K., Armstrong, J.W. (1981)."Effect of Operator Mental Loading on Voice Recognition Systems Performance," Naval Postgraduate School Tech. Report, Aug.

Poock, G.K., Armstrong, J.W. (1981)."Effect of Task Duration on Voice Recognition System Performance," Naval Postgraduate School Tech. Report, Sept.

Rabiner, L., Juang, B.-H. (1993). Fundamentals of Speech Recognition, Prentice-Hall, 1993.

Ramig, L. (1992)."The role of phonation in speech intelligibility: A review and preliminary data from patients with Parkingson's disease," in Kent, R.D. Intelligibility in Speech Disorders, John Benjamins Pub. Co., Philadephia, PA, 1992.

Rajasekaran, P.K., Doddington, G.R., Picone, J.W. (1986)."Recognition of Speech Under Stress and In Noise," IEEEICASSP-86,} Tokyo, Japan, 733-736.

Reed, L. (1985)."Military Applications of Voice Technology," Speech Technology, Feb., 42-50.

Rostolland, D. (1982)."Acoustic Features of Shouted Voice Part I," Acustica, 50 118-125.

Rostolland, D. (1982)."Phonetic Structure of Shouted Voice Part II," Acustica, 51 80-89.

Ruiz, R., Absil, E., Harmegnies, B., Legros, C, Poch, D. (1996)."Time and spectrum-related variabilities in stressed speech under laboratory and real conditions, Speech Communication, 20:111-130.

105 Shipp, T., Izdebski, K., "Current Evidence for the Existance of Laryngeal Macro and Microtremor," Journal of Forensic Sciences, 26(3):501-505, July 1981.

Simonov, P.V., Frolov, M.V. (1977)."Analysis of the Human Voice as a Method of Controlling Emotional State: Achievements and Goals," Aviation, Space, & Environmental Sciences, Jan., 23-25.

Simpson, C.A. (1985). "Speech Variability Effects on Recognition Accuracy Associated With Concurrent Task Performance by Pilots," Psycho-Linguistic Research Associates, Technical Report, April.

Stanton, B.J., Jamieson, L.H. and Allen, G.D. (1988)."Acoustic-Phonetic Analysis of Loud and Lombard Speech in Simulated Cockpit Conditions," IEEE ICASSP-88: Inter. Conf. on Acoust, Speech, Sig. Proc, pp.331-334.

Stanton, B.J., Jamieson, L.H. and Allen, G.D. (1989)."Robust recognition of loud and lombard speech in the fighter cockpit environment," IEEE ICASSP-89, pp. 675-678.

Steeneken, H.J.M., Hansen, J.H.L. (1999)."Speech under Stress Conditions: Overview of the Effect of Speech Production and on System Performance," IEEE 1CASSP-99,4:2079-2082.

Summers, W.V., Pisoni, D.B., Beraacki, R.H., Pedlow, R.I., Stokes, M.A. (1988)."Effects of Noise on speech production: Acoustic and Phonetic Analysis," J. Acoust. So. Am. 84(3):917-928.

Sweet, W. (1995). "The Glass Cockpit," IEEE Spectrum, September 1995, pp. 30-38.

Streeter, L.A., Macdonald, N.H., Apple, W., Krauss, R.M., Galotti, K.M. (1983)."Acoustic and Perceptual Indicators of Emotional Stress," J. Acoust. So. Am. 73(4): 1354-1360.

Takizawa, Y., Hamada, M. (1990). "Lombard speech recognition by formant-frequency-shifter LPC cepstrum," ICSLP-90: Inter. Conf. on Spoken Lang. Proc, pp. 293-296.

Teager, H. (1980)."Some Observations on Oral Air Flow During Phonation", IEEE Trans. Acoust., Speech, Signal Proc, 28(5):599-601, Oct.

Teager, H., Teager, S. (1983)."A Phenomenological Model for Vowel Production in the Vocal Tract," in Speech Science: Recent Advances, edited by R.G. Daniloff (College-Hill, San Diego), pp.73-109.

Thomson, D.L., Chengalvarayan, R., "Use of Periodicity and Jitter as Speech Recognition Features" Proc. IEEE ICASSP'98, Seattle, Washington, May 1998.

Umeda, N. (1975)."Vowel duration in American English," J. Acoust. Soc. Am. 58(2):434-445.

Umeda, N. (1977)."Consonant duration in American English," J. Acoust. Soc. Am. 61(3) 846-858. van Santen, J.P. (1995)."Computation of timing in text-to-speech synthesis," in Speech Coding and Synthesis, Kleijn and Paliwal, eds., Elsevier N.Y. van Santen, J.P., Hirschberg, J. (1994)."Segmental effects on timing and height of pitch contours," ICSLP-94, 2:719- 722.

VanDercar, D. H„ Greaner, J., Hibler, N. S„ Spielberger, C. D., and Bloch, S., "A Description and Analysis of the Operation and Validity of the Psychological Stress Evaluator," Journal of Forensic Sciences, 25(1): 174-188, Jan. 1980.

106 Wang, X., Pols, L.C., ten Bosch, L.F. (1996)."Analysis of context-dependent segmental duration for automatic speech recognition," ICSLP-96,2:1181-84.

Whitmore, J., Fisher, S. (1996)."Speech during sustained operations," Speech Communication, 20:55-70.

Williams, C. E., and Stevens, K. N. (1969). "On Determining the Emotional State of Pilots During Flight: An Exploratory Study," Aerospace Medicine, 40 1369-1372.

Williams, C.E., and Stevens, K.N. (1972). "Emotions and Speech: Some Acoustic Correlates," J. Acoust. Soc. Am., 52(4) 1238-1250.

Womack B.D., Hansen, J.H.L. (1996)."Classification of Speech under Stress Using Target Driven Features," Speech Comm., 20(1-2): 131-150, Nov.

Yost, W.A. (1994). Fundamentals of Hearing, 3rd Edition, Academic Press, San Diego, CA., pp. 153-167.

Zhou, G., Hansen, J.H.L., Kaiser, J.F. (1998a)."Classification of Speech under Stress Based on Features from the Nonlinear Teager Energy Operator," IEEE ICASSP-98,1:549-552, Seattle, WA.

Zhou, G., Hansen, J.H.L., Kaiser, J.F. (1998b)."Linear and Nonlinear Speech Feature Analysis for Stress Classification," ICSLP-98,3:883-886, Sydney, Australia.

PROPOSED CHARGING LETTER

CERTIFIED MAIL - RETURN RECEIPT REQUESTED

The National Institute for Truth Verification (“NITV”)
a/k/a NITV Federal Services, LLC
a/k/a NITV Government Services, LLC
a/k/a NITV, LLC
11400 Fortune Circle
West Palm Beach, Florida 33414

Attn: Charles Humble
Chief Executive Officer

Dear Mr. Humble:

The Bureau of Industry and Security, U.S. Department of Commerce (“BIS”), has reason to believe that the National Institute for Truth Verification¹ of West Palm Beach, Florida, (“NITV”) has committed eleven violations of the Export Administration Regulations (the “Regulations”),² which are issued under the authority of the Export Administration Act of 1979, as amended (the “Act”).³ Specifically, BIS charges that NITV committed the following violations:

Charge 1: 15 C.F.R. § 764.2(a): Exporting an Item Without the Required License:

As described in greater detail in Schedule A, which is enclosed herewith and incorporated herein by reference, on one occasion on or about April 3, 2003, NITV engaged in conduct prohibited by the Regulations by exporting a computer containing voice stress analyzer software, classified as ECCN⁴ 3A981, from the United States to South Africa, without the Department of Commerce license required by Section 742.7 of the Regulations. In so doing, NITV committed one violation of Section 764.2(a) of the Regulations.

¹ a/k/a NITV Federal Services, LLC; a/k/a NITV Government Services, LLC; a/k/a NITV, LLC.

² The Regulations are currently codified in the Code of Federal Regulations at 15 C.F.R. Parts 730-774 (2006). The charged violations occurred between 2001 and 2006. The Regulations governing the violations at issue are found in the 2001 through 2006 versions of the Code of Federal Regulations (15 C.F.R. Parts 730-774 (2001-2006)). The 2006 Regulations set forth the procedures that apply to this matter.

³ 50 U.S.C. app. §§ 2401-2420 (2000). Since August 21, 2001, the Act has been in lapse and the President, through Executive Order 13222 of August 17, 2001 (3 C.F.R., 2001 Comp. 783 (2002)), which has been extended by successive Presidential Notices, the most recent being that of August 2, 2005 (70 Fed. Reg. 45,273 (August 5, 2005)), has continued the Regulations in effect under the International Emergency Economic Powers Act (50 U.S.C. §§ 1701-1706 (2000)).

Charge 2: 15 C.F.R. § 764.2(a): Exporting Technology Without the Required License:

As described in greater detail in Schedule A, which is enclosed herewith and incorporated herein by reference, on or about September 10, 2001 through on or about September 15, 2001, NITV engaged in conduct prohibited by the Regulations by exporting technology specially designed for the use of voice stress analyzer equipment (ECCN 3E980), from the United States to South Africa, without the Department of Commerce license required by Section 742.7 of the Regulations. Specifically, NITV released information, in the form of technical data and/or technical assistance provided during its Certified Examiners Course (ECCN 3E980), to a national of South Africa who was not lawfully admitted for permanent residence in the United States and was not a protected individual under the Immigration and Naturalized Act (8 U.S.C. Section 1324b(a)(3)). The release of this technology in the United States to a national of South Africa is deemed to be an export of the technology to South Africa under Section 734.2(b)(ii) of the Regulations. In so doing, NITV committed one violation of Section 764.2(a) of the Regulations.

Charge 3: 15 C.F.R. § 764.2(a): Exporting Technology Without the Required License:

As described in greater detail in Schedule A, which is enclosed herewith and incorporated herein by reference, on or about July 22, 2002 through July 27, 2002, NITV engaged in conduct prohibited by the Regulations by exporting technology specially designed for the use of voice stress analyzer equipment (ECCN 3E980), from the United States to South Korea, without the Department of Commerce license required by Section 742.7 of the Regulations. Specifically, NITV released information, in the form of technical data and/or technical assistance provided during its Certified Examiners Course (ECCN 3E980), to a national of South Korea who was not lawfully admitted for permanent residence in the United States and was not a protected individual under the Immigration and Naturalized Act (8 U.S.C. Section 1324b(a)(3)). The release of this technology in the United States to a national of South Korea is deemed to be an export of the technology to South Korea under Section 734.2(b)(ii) of the Regulations. In so doing, NITV committed one violation of Section 764.2(a) of the Regulations.

⁴ The term “ECCN” refers to an Export Control Classification Number. See Section 772.1 of the Regulations.

Charges 4 - 6: 15 C.F.R. § 764.2(a): Exporting Technology Without the Required Licenses:

As described in greater detail in Schedule A, which is enclosed herewith and incorporated herein by reference, on three occasions on or about February 24, 2003 through March 1, 2003, on or about March 1, 2004 through March 6, 2004, and on or about February 6, 2006 through February 11, 2006, NITV engaged in conduct prohibited by the Regulations by exporting technology specially designed for the use of voice stress analyzer equipment (ECCN 3E980), from the United States to Mexico, without the Department of Commerce licenses required by Section 742.7 of the Regulations. Specifically, NITV released information, in the form of technical data and/or technical assistance provided during its Certified Examiners Course (ECCN 3E980), to nationals of Mexico who were not lawfully admitted for permanent residence in the United States and were not protected individuals under the Immigration and Naturalized Act (8 U.S.C. Section 1324b(a)(3)). The release of this technology in the United States to nationals of Mexico is deemed to be an export of the technology to Mexico under Section 734.2(b)(ii) of the Regulations. In so doing, NITV committed three violations of Section 764.2(a) of the Regulations.

Charges 7 - 10: 15 C.F.R. § 764.2(b): Causing the Export of Items Without the Required Licenses:

As described in greater detail in Schedule A, which is enclosed herewith and incorporated herein by reference, on four occasions on or about September 15, 2001, on or about July 27, 2002, on or about March 1, 2003, and on or about March 6, 2004, NITV caused, aided, or abetted the doing of an act prohibited by the Regulations. Specifically, NITV caused the export of computers containing voice stress analyzer software (ECCN 3A981), from the United States to South Africa, South Korea, and Mexico without the Department of Commerce licenses required by Section 742.7 of the Regulations. Specifically, NITV provided foreign nationals attending NITV’s Certified Examiners Course in the United States with computers containing voice stress analyzer software. These foreign nationals then exported these items from the United States to South Africa, South Korea, and Mexico, without the Department of Commerce licenses required by Section 742.7 of the Regulations. In so doing, NITV committed four violations of Section 764.2(b) of the Regulations.

Charge 11: 15 C.F.R. § 764.2(e): Acting With Knowledge That a Violation of the Regulations Was About to Occur:

On or about February 6, 2006, NITV transferred items exported from the United States with knowledge that a violation of the Regulations would occur. Specifically, NITV transferred technology specially designed for the use of voice stress analyzer equipment (ECCN 3E980) to a national of Mexico, when NITV knew or had reason to know that a Department of Commerce license was required to release this technology in the United States to a national of Mexico. NITV had reason to know that a license was required for the release of this technology, as agents of the Office of Export Enforcement, Bureau of Industry and Security had met with employees of NITV on January 27, 2005, and provided NITV with information about the Regulations. In addition, NITV had reason to know that a license was required for the release of this technology, as BIS had issued a proposed charging letter to NITV on December 27, 2005. That proposed charging letter informed NITV that the release of such technology to nationals of Mexico is deemed to be an export of the technology to Mexico under Section 734.2(b)(ii) of the Regulations. In addition, that proposed charging letter informed NITV that the release of such technology required a license pursuant to Section 742.7 of the Regulations. In so doing, NITV committed one violation of Section 764.2(e) of the Regulations.

* * * *

Accordingly, NITV is hereby notified that an administrative proceeding is instituted against it pursuant to Section 13(c) of the Act and Part 766 of the Regulations for the purpose of obtaining an order imposing administrative sanctions, including any or all of the following:

The maximum civil penalty allowed by law of $11,000 per violation;⁵

Denial of export privileges; and/or

Exclusion from practice before BIS.

If NITV fails to answer the charges contained in this letter within 30 days after being served with notice of issuance of this letter, that failure will be treated as a default. See 15 C.F.R. §§ 766.6 and 766.7. If NITV defaults, the Administrative Law Judge may find the charges alleged in this letter are true without a hearing or further notice to NITV. The Under Secretary of Commerce for Industry and Security may then impose up to the maximum penalty for the charges in this letter.

NITV is further notified that it is entitled to an agency hearing on the record if it files a written demand for one with its answer. See 15 C.F.R. § 766.6. NITV is also entitled to be represented by counsel or other authorized representative who has power of attorney to represent it. See 15 C.F.R. §§ 766.3(a) and 766.4.

The Regulations provide for settlement without a hearing. See 15 C.F.R. § 766.18. Should NITV have a proposal to settle this case, NITV or its representative should transmit it to the attorney representing BIS named below.

⁵ 15 C.F.R. § 6.4(a)(2).

The U.S. Coast Guard is providing administrative law judge services in connection with the matters set forth in this letter. Accordingly, NITV’s answer must be filed in accordance with the instructions in Section 766.5(a) of the Regulations with:

U.S. Coast Guard ALJ Docketing Center
40 S. Gay Street
Baltimore, Maryland 21202-4022

In addition, a copy of NITV’s answer must be served on BIS at the following address:

Chief Counsel for Industry and Security
Attention: James C. Pelletier, Esq.
Room H-3839
United States Department of Commerce
14th Street and Constitution Avenue, N.W.
Washington, D.C. 20230

James C. Pelletier is the attorney representing BIS in this case; any communications that NITV may wish to have concerning this matter should occur through him. Mr. Pelletier may be contacted by telephone at (202) 482-5301.

Sincerely,

Michael D. Turner
Director
Office of Export Enforcement

UNITED STATES DEPARTMENT OF COMMERCE
BUREAU OF INDUSTRY AND SECURITY
WASHINGTON, D.C. 20230

In the Matter of:

The National Institute for Truth Verification (“NITV”)
a/k/a NITV Federal Services, LLC
a/k/a NITV Government Services, LLC
a/k/a NITV, LLC
11400 Fortune Circle
West Palm Beach, Florida 33414

Respondent

SETTLEMENT AGREEMENT

This Settlement Agreement (“Agreement”) is made by and between the National Institute for Truth Verification¹ (“NITV”) and the Bureau of Industry and Security, U.S. Department of Commerce (“BIS”) (collectively referred to as “Parties”), pursuant to Section 766.18(a) of the Export Administration Regulations (currently codified at 15 C.F.R. Parts 730-774 (2006)) (“Regulations”),² issued pursuant to the Export Administration Act of 1979, as amended (50 U.S.C. app. §§ 2401-2420 (2000)) (“Act”),³

¹ a/k/a NITV Federal Services, LLC; a/k/a NITV Government Services, LLC; a/k/a NITV, LLC.

* ‘l’hc charged violations occurred between 200 1 aiid 2006. The Regulations governing the violations at issue are found in the 2001 through 2006 \wsions of the Code of’ Federal Regulations (I 5 C.F.K. Parts 730-774 (2001-2006)). The 2006 Iiegulations set forth the procedures that apply to this matter.

3 Since August 2 1, 200 I, the Act has been in lapse and the President, through Executive Order 13222 of August 17, 2001 (3 C.F.R., 2001 Comp. 783 (2002)), as extended most recently by the Notice of August 2, 2005, (70 Fed. Reg. 45,273 (August 5, 2005)). has continued the Regulations in effect under the International Emergency Economic PoLvers Act (50 1J.S.C.$5 1701- 1706 (2 00 0)). WHEREAS, BIS has issued a proposed charging letter to NITV that alleged that NITV committed 1 1 violations of the Regulations, specifically:

1. One Violation of 15 C.F.R. § 764.2(a) - Exporting an Item Without the Required License: On one occasion on or about April 3, 2003, NITV engaged in conduct prohibited by the Regulations by exporting a computer containing voice stress analyzer software, classified as ECCN⁴ 3A981, from the United States to South Africa, without the Department of Commerce license required by Section 742.7 of the Regulations.

2. One Violation of 15 C.F.R. § 764.2(a) - Exporting Technology Without the Required License: On or about September 10, 2001 through on or about September 15, 2001, NITV engaged in conduct prohibited by the Regulations by exporting technology specially designed for the use of voice stress analyzer equipment (ECCN 3E980), from the United States to South Africa, without the Department of Commerce license required by Section 742.7 of the Regulations. Specifically, NITV released information, in the form of technical data and/or technical assistance provided during its Certified Examiners Course (ECCN 3E980), to a national of South Africa who was not lawfully admitted for permanent residence in the United States and was not a protected individual under the Immigration and Naturalized Act (8 U.S.C. Section 1324b(a)(3)). The release of this technology in the United States to a national of South Africa is deemed to be an export of the technology to South Africa under Section 734.2(b)(ii) of the Regulations.

⁴ The term “ECCN” refers to an Export Control Classification Number. See Section 772.1 of the Regulations.

3. One Violation of 15 C.F.R. § 764.2(a) - Exporting Technology Without the Required License: On or about July 22, 2002 through July 27, 2002, NITV engaged in conduct prohibited by the Regulations by exporting technology specially designed for the use of voice stress analyzer equipment (ECCN 3E980), from the United States to South Korea, without the Department of Commerce license required by Section 742.7 of the Regulations. Specifically, NITV released information, in the form of technical data and/or technical assistance provided during its Certified Examiners Course (ECCN 3E980), to a national of South Korea who was not lawfully admitted for permanent residence in the United States and was not a protected individual under the Immigration and Naturalized Act (8 U.S.C. Section 1324b(a)(3)). The release of this technology in the United States to a national of South Korea is deemed to be an export of the technology to South Korea under Section 734.2(b)(ii) of the Regulations.

4. Three Violations of 15 C.F.R. § 764.2(a) - Exporting Technology Without the Required Licenses: On or about February 24, 2003 through March 1, 2003, on or about March 1, 2004 through March 6, 2004, and on or about February 6, 2006 through February 11, 2006, NITV engaged in conduct prohibited by the Regulations by exporting technology specially designed for the use of voice stress analyzer equipment (ECCN 3E980), from the United States to Mexico, without the Department of Commerce licenses required by Section 742.7 of the Regulations. Specifically, NITV released information, in the form of technical data and/or technical assistance provided during its Certified Examiners Course (ECCN 3E980), to nationals of Mexico who were not lawfully admitted for permanent residence in the United States and were not protected individuals under the Immigration and Naturalized Act (8 U.S.C. Section 1324b(a)(3)). The release of this technology in the United States to nationals of Mexico is deemed to be an export of the technology to Mexico under Section 734.2(b)(ii) of the Regulations.

5. Four Violations of 15 C.F.R. § 764.2(b) - Causing the Export of Items Without the Required Licenses: On or about September 15, 2001, on or about July 27, 2002, on or about March 1, 2003, and on or about March 6, 2004, NITV caused, aided, or abetted the doing of an act prohibited by the Regulations. Specifically, NITV caused the export of computers containing voice stress analyzer software (ECCN 3A981), from the United States to South Africa, South Korea, and Mexico without the Department of Commerce licenses required by Section 742.7 of the Regulations. Specifically, NITV provided foreign nationals attending NITV’s Certified Examiners Course in the United States with computers containing voice stress analyzer software. These foreign nationals then exported these items from the United States to South Africa, South Korea, and Mexico, without the Department of Commerce licenses required by Section 742.7 of the Regulations.

[...] transferred items exported from the United States with knowledge that a violation of the Regulations would occur. Specifically, NITV transferred technology specially designed for the use of voice stress analyzer equipment (ECCN 3E980) to a national of Mexico, when NITV knew or had reason to know that a Department of Commerce license was required to release this technology in the United States to a national of Mexico. NITV had reason to know that a license was required for the release of this technology, as agents of the Office of Export Enforcement, Bureau of Industry and Security had met with employees of NITV on January 27, 2005, and provided NITV with information about the Regulations. In addition, NITV had reason to know that a license was required for the release of this technology, as BIS had issued a proposed charging letter to NITV on December 27, 2005. That proposed charging letter informed NITV that the release of such technology to nationals of Mexico is deemed to be an export of the technology to Mexico under Section 734.2(b)(ii) of the Regulations. In addition, that proposed charging letter informed NITV that the release of such technology required a license pursuant to Section 742.7 of the Regulations.

WHEREAS, NITV has reviewed the proposed charging letter and is aware of the allegations made against it and the administrative sanctions which could be imposed against it if the allegations are found to be true;

WHEREAS, NITV fully understands the terms of this Agreement and the Order (“Order”) that the Assistant Secretary of Commerce for Export Enforcement will issue if he approves this Agreement as the final resolution of this matter;

WHEREAS, NITV enters into this Agreement voluntarily and with full knowledge of its rights;

WHEREAS, NITV states that no promises or representations have been made to it other than the agreements and considerations herein expressed;

WHEREAS, NITV neither admits nor denies the allegations contained in the proposed charging letter;

WHEREAS, NITV wishes to settle and dispose of all matters alleged in the proposed charging letter by entering into this Agreement; and

WHEREAS, [...]

NOW, THEREFORE, the Parties hereby agree as follows:

1. BIS has jurisdiction over NITV, under the Regulations, in connection with the matters alleged in the proposed charging letter.

2. The following sanctions shall be imposed against NITV in complete settlement of the violations of the Regulations set forth in the proposed charging letter:

a. NITV shall be assessed a civil penalty in the amount of $77,000, of which $19,250 shall be paid to the U.S. Department of Commerce not later than July 24, 2006; $19,250 shall be paid to the U.S. Department of Commerce not later than October 24, 2006; $19,250 shall be paid to the U.S. Department of Commerce not later than January 24, 2007; and $19,250 shall be paid to the U.S. Department of Commerce not later than April 24, 2007.

b. The timely payment of the civil penalty agreed to in paragraph 2.a is hereby made a condition to the granting, restoration, or continuing validity of any export license, permission, or privilege granted, or to be granted, to NITV. Failure to make timely payment of the civil penalty set forth above shall result in the denial of all of NITV’s export privileges under the [...] from the date of imposition of the penalty.

3. Subject to the approval of this Agreement pursuant to paragraph 8 hereof, NITV hereby waives all rights to further procedural steps in this matter (except with respect to any alleged violations of this Agreement or the Order, if entered), including, without limitation, any right to: (a) an administrative hearing regarding the allegations in any charging letter; (b) request a refund of any civil penalty paid pursuant to this Agreement and the Order, if entered; (c) request any relief from the Order, if entered, including without limitation relief from the terms of a denial order under 15 C.F.R. § 764.3(a)(2); and (d) seek judicial review or otherwise contest the validity of this Agreement or the Order, if entered.

4. Upon entry of the Order and timely payment of the $77,000 civil penalty, BIS will not initiate any further administrative proceeding against NITV in connection with any violation of the Act or the Regulations arising out of the transactions identified in the proposed charging letter.

5. BIS will make the proposed charging letter, this Agreement, and the Order, if entered, available to the public.

6. This Agreement is for settlement purposes only. Therefore, if this Agreement is not accepted and the Order is not issued by the Assistant Secretary of Commerce for Export Enforcement pursuant to Section 766.18(a) of the Regulations, no Party may use this Agreement in any administrative or judicial proceeding and the Parties shall not be bound by the terms contained in this Agreement in any subsequent administrative or judicial proceeding.

7. No agreement, understanding, representation or interpretation not contained in this Agreement may be used to vary or otherwise affect the terms of this Agreement or the Order, if entered, nor shall this Agreement serve to bind, constrain, or otherwise limit any action by any other agency or department of the U.S. Government with respect to the facts and circumstances addressed herein.

8. This Agreement shall become binding on BIS only if the Assistant Secretary of Commerce for Export Enforcement approves it by entering the Order, which will have the same force and effect as a decision and order issued after a full administrative hearing on the record.

9. Each signatory affirms that he has authority to enter into this Settlement Agreement and to bind his respective party to the terms and conditions set forth herein.

Michael D. Turner, Director, Office of Export Enforcement

Chairman and Chief Executive Officer

The International Journal of Speech, Language and the Law
IJSLL (print) issn 1748-8885
IJSLL (online) issn 1748-8893
Article

Charlatanry in forensic speech science: A problem to be taken seriously

Anders Eriksson and Francisco Lacerda

Abstract

A lie detector which can reveal lie and deception in some automatic and perfectly reliable way is an old idea we have often met with in science fiction books and comic strips. This is all very well. It is when machines claimed to be lie detectors appear in the context of criminal investigations or security applications that we need to be concerned. In the present paper we will describe two types of ‘deception’ or ‘stress detectors’ (euphemisms to refer to what quite clearly are known as ‘lie detectors’). Both types of detection are claimed to be based on voice analysis but we found no scientific evidence to support the manufacturers’ claims. Indeed, our review of scientific studies will show that these machines perform at chance level when tested for reliability. Given such results and the absence of scientific support for the underlying principles it is justified to view the use of these machines as charlatanry and we argue that there are serious ethical and security reasons to demand that responsible authorities and institutions should not get involved in such practices.

keywords: lie detector, charlatanry, voice stress analysis, psychological stress evaluator, microtremor, layered voice analysis, airport security

Affiliations

Anders Eriksson: Gothenburg University
Francisco Lacerda: Stockholm University
email: [email protected]

IJSLL vol 14.2 2007 169–193 doi: 10.1558/ijsll.2007.14.2.169

©2007, equinox publishing LONDON

Introduction

Charlatanry may be found in all walks of life, especially in activities where there is a possibility of making money, and forensic speech science is no exception. The old disreputable voiceprint technique is still around and used by many private investigators in the US in particular. In Germany, a physics professor specialized in crystallography appears in courts claiming to have invented an automatic speaker recognition method based on methods borrowed from crystallography but refusing to subject his methods to independent testing or to reveal exactly how the method is supposed to work. These are just two examples. In the present paper, however, we will limit the study of charlatanry to the two most widely used types of so called ‘lie detectors’. We will explain how they work, what principles they are claimed to be based on and how they have performed when tested for reliability.

It should be stated right away that at the present time no method for reliable lie detection is known and it is not even known if it will be possible to develop such methods in the future. There are nevertheless several products on the market claimed to be working lie detectors. The vendors do not always call their products lie detectors but use some euphemism like ‘stress analyzer’ or ‘emotion analyzer’, but by looking at how the vendors present their products there can be no question that lie detectors are what they want us to believe their products are. Here are two quotations to illustrate what we mean:

Diogenes Digital Voice Stress Analysis application, was originally used in determining attempts at deception in law enforcement activities. In the world today you may hear the words ‘lie detector’ in reference to this type of technology, but this type of technology actually detects deception in the human voice. (The Diogenes Company home page 1)

Professionals in the field of lie detection know that there is no ‘true’ lie detector, as lying is not a unified set of feelings that can be measured. … However, LVA is capable of detecting the intention behind the lie, and by doing so can lead you to identifying and revealing the lie itself. (Nemesysco home page 2)

In this paper we will not address the question of whether these products should properly be referred to as lie detectors, emotion detectors or deception detectors, but we will use the term ‘lie detector’ when we refer to these products. The focus of this paper will be on the discrepancies between the claims the producers and vendors make and what their products are capable of delivering.

Validity vs. reliability

The concepts of validity and reliability are much used in psychology, statistics and many other areas and we would like to use a slightly simplified version of that distinction to guide us when judging the lie detectors. There are many different aspects of validity which in psychological research and statistics appear under different labels (test validity, internal validity, content validity etc.). We need not be concerned with these details, but may content ourselves by observing what they have in common. The validity of a test is the degree to which it measures what it is intended to measure. Reliability on the other hand has to do with precision and consistency. How accurately does the method measure what it is intended to measure? How much will the results vary if the measurements are repeated by a given researcher or by other researchers?

Keeping this distinction in mind has methodological implications. It seems reasonable, from a methodological point of view, to determine the validity of a suggested method before studying its reliability. If the method can be shown to lack validity altogether it will as a consequence also be unreliable, and carrying out a reliability test will be meaningless. If the validity is not known it will be a ‘black box’ whose reliability, if any, will remain unexplained. We must keep in mind, however, that validity and reliability are not all or nothing concepts. A method may be valid to a degree and reliability may range from very poor to almost perfect. At the far end of the negative scale we find things like astrology. It would be a complete waste of time to design experiments to determine how precisely horoscopes may predict future events when we know that the validity of the method is non-existent. At the positive end of the scale we find methods like DNA testing whose validity is solidly supported by scientific evidence and whose reliability is extremely high, albeit not perfect.

With respect to lie detectors (as well as stress, emotion or deception detectors) the starting point when judging them ought to be their validity, the most important question being: Have the basic principles upon which they are claimed to be based been verified in scientific studies? If we look back at the history of lie detector testing, we find that this question is seldom asked and even more seldom studied. There are, on the other hand, scores of reliability tests of the ‘black box’ type. We find this rather surprising and quite unsatisfactory. We would like to argue that the preferred order of things should be first to examine the validity of the procedures to make sure that the applied methods are valid and only when that has been established proceed to reliability tests.

The focus here will therefore not be on the reliability tests but on a rather detailed analysis of the scientific underpinnings of the methods, i.e. their validity. Based on these considerations we then explain why reliability tests have shown the studied products to be unreliable. We will be concerned with the two most widely used types of lie detectors, the so called voice stress analyzers (VSA), also referred to as psychological stress evaluators (PSE), and a newer type of analyzer said to be based on a multiple layer analysis of the voice. The latter comes in many different shapes as commercial products but they are all said to be based on what is called layered voice analysis (LVA). We will show that the first type lacks demonstrable validity and that the validity of the latter type is to be found at the astrology end of the validity spectrum.

As we will see later in the paper, producers and vendors of the first type, the voice stress analyzers, claim that their products are based on a neurophysiological theory of microtremor and sometimes cite scientific papers to boost the credibility of their products. We will show that those claims are completely unfounded by consulting a wide range of papers on microtremor and in particular the papers the vendors themselves often make reference to. The vendors of the second type make fantastic claims about how their methods can be used to monitor the brain activity underlying lies and deception by analyzing the voice signal, but never mention any scientific discoveries that lend support to such claims.

Even though we maintain that the examination of the so called lie detectors should have started by asking questions about their validity we will take advantage of the fact that reliability studies exist. We have examined many such studies, but in this paper we will mainly refer to two recently published reports (Hollien and Harnsberger 2006 and Damphousse et al. 2007). They are both of excellent quality and carried out under the assumption that the tested products should be given a fair chance to demonstrate their reliability. That means that the authors cannot be accused of being biased by any prejudice about the validity of the principles behind the lie detectors, as any corresponding study performed by us might have been. Hollien and Harnsberger make their unbiased intentions explicit:

As stated, the primary objectives of this project were to carry out highly controlled research that would at once be 1) impartial to all sides of the prior VSA controversies – i.e., those which led to the need for this research and 2) rigorous enough to address questions concerning the validity and sensitivity of the systems involved. (p. 7)

Damphousse et al. do not state this explicitly, but there is no doubt that their approach has been equally unbiased. These studies are very useful complements to what we will have to say. They show that these lie detectors perform at chance level as far as reliability is concerned and we will explain why this is so. In our view, however, no more reliability studies of these two types are needed. The case should be considered closed from that point of view; the evidence against them is just too overwhelming to motivate any more reliability tests.
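To make the phrase 'perform at chance level' concrete, the following minimal sketch simulates a purely hypothetical detector whose verdicts are drawn independently of the ground truth. The trial count, the 50% rate of deceptive statements and the function name are assumptions chosen for illustration; none of the numbers are taken from the cited studies.

```python
import random

def simulate_chance_detector(n_trials: int, seed: int = 0) -> float:
    """Return the hit rate of a detector whose verdicts ignore the truth."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_trials):
        is_lie = rng.random() < 0.5     # hypothetical ground truth (assumed 50% lies)
        verdict = rng.random() < 0.5    # verdict drawn independently of the truth
        correct += (verdict == is_lie)
    return correct / n_trials

accuracy = simulate_chance_detector(10_000)
print(f"accuracy of a coin-flip detector: {accuracy:.3f}")
```

With equal numbers of truthful and deceptive statements, such a detector is right about half of the time, which is the benchmark against which any claimed detection rate has to be compared.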

The Voice Stress Analyzer (VSA)

VSA or PSE are not brand names but a type of analyzer marketed under different brand names by several different companies. As far as we are aware, however, they all claim to be based on the same underlying principle: detecting and monitoring so called microtremors in the voice production mechanism by analyzing the speech signal. Here is a typical quote from the home page of one of the companies that sells a VSA analyzer:

Micro tremors are tiny frequency modulations in the human voice. When a test subject is lying, the automatic, or involuntary nervous system, causes an inaudible increase in the Micro tremor’s frequency. (National Institute for Truth Verification, home page 3)

The first device claimed to be based on this principle, the Psychological Stress Evaluator (PSE), was produced and sold by a company (Dektor) formed by three former police officers (Bell, Ford and McQuiston) in the early 1970s. This and subsequent analyzers are presented as applications of scientific discoveries made by a group of researchers, primarily Lippold, Redfearn and Halliday, at the University College London. Here are two typical descriptions:

In 1971, British physiologist Olaf Lippold discovered the muscle micro-tremor. Lippold found that voluntary muscles in the arm generate a physiological tremor or micro-vibration at about ten per second when the subject is relaxed. When the subject is aroused or stimulated, the microtremor tends to disappear. Lippold’s theory relates to the voice in that muscles in the throat and larynx show microvibration that diminishes with stress through the vagus nerve. (Ridelhuber and Flood 2002, abstract)

The Diogenes Digital Voice Stress Analysis program … has been produced to detect, process and display changes in voice pattern using the ‘Lippold Microtremor’. …The Microtremor in laryngeal muscles has been shown to reflect the level of stress being experienced by the individual due to deception. (Diogenes home page 4)

Are the VSA/PSE lie detectors really based on the discoveries made by Lippold and his colleagues?

The source most often cited by the vendors is an article published by Lippold in the Scientific American in 1971. The article, Physiological Tremor (the term used in their publications), is a summary of work begun in the early 1950s. Its main focus is a description of research aimed at determining the origin and function of physiological tremor. Lippold describes several experiments he and his colleagues have performed in order to verify or discard various possible theories. At the time of the publication they seem to have settled for the idea that the function of microtremor is as part of a feedback system by which voluntary muscle movement is fine-tuned. Lippold compares it to a mechanical servomechanism or a thermostat. Nowhere is there any suggestion that these discoveries may be used in lie detectors or similar applications. All the experiments described are concerned with muscles that control body movement, primarily leg, arm and finger muscles. Psychological stress is never mentioned. The experiments are concerned with the effects of physical tension, muscle temperature or blood flow. There is thus no obvious link between ideas on which the VSA/PSE detectors are based and the results cited in the Lippold article.

One or two other papers appear as references in the VSA/PSE promotional materials. There was thus a slight possibility that some other paper might have hinted at such applications. In order to be sure that we had not missed anything we have gone over a large number of papers published between 1952 and 1983 by the London based group of researchers of which Lippold was a member but have not found a single mention of anything pointing in that direction. We may therefore say with confidence that there is no obvious connection between their research and the subsequent construction of tremor based lie detectors.

Microtremor and voice production

As we have mentioned above, the work by Lippold and colleagues was exclusively concerned with studies of muscles that control body movement. It is mentioned in passing by them and others that physiological tremor may be found in all voluntary muscles, but that is an assumption and not based on an extensive testing of various types of muscle. It is nevertheless possible that if corresponding experiments had been performed on the muscles that control voice production the same results would indeed have been found.

This possibility has been explored in an experimental study performed by Shipp and Izdebski (1981). In their experiment they used one young male subject. Hooked-wire electrodes were placed in the cricothyroid and the posterior cricoarytenoid muscles and EMG signals were recorded during conversational speech and during sustained phonation. A technical description of the method may be found in Shipp et al. (1970). To verify the system’s capability to detect microtremor, EMG activity was also sampled from the biceps muscle where such tremor is known to occur. The authors had to limit their study of the larynx to the recordings made during sustained vowel phonation since ‘EMG activity during conversational speech changed so rapidly … that at the present sampling rate no Fourier analysis could be made of these signals’ (p. 504). For the recordings made during sustained phonation, the analysis ‘failed to reveal any periodic component in the frequency band from 1 to 20 Hz; the electrical energy was randomly distributed throughout the spectrum’. In contrast, the reference recording taken from the biceps muscle to ensure the appropriateness of the method ‘revealed a prominent energy peak at 9 Hz, indicating periodic contraction within the range of normal physiological tremor rate’. That is, applying identical measurement methods to both the larynx muscles and the biceps muscles failed to reveal any prominent spectral peaks in the 10 Hz region in the larynx muscles while the recordings from the biceps were in perfect agreement with results from other comparable studies.

In view of the fact that there is only one study directly testing the microtremor hypothesis, no matter how convincing the results may seem, there is always the possibility that a differently designed experiment might have produced a different result. It is, for example, possible that the negative result was due to the fact that they were looking for tremor in the wrong frequency region. This is not meant as criticism of their study. The objective of the study was to verify or falsify the claims made by the lie detector proponents that there is microtremor in the 10 Hz region, and from that perspective it was obviously the right thing to do. But if we were to search for microtremor in general we would have to be open to the possibility that it could be found in a completely different frequency range. Ocular microtremor (OMT), for example, seems to occur at much higher frequencies. A recent study by Bolger et al. (1999) found tremor in the 70–103 Hz range with an average of 84 Hz. Several earlier studies cited in their paper have reported tremor rates ranging from 30 Hz to 100 Hz. In a paper by McAuley and Marsden (2000) summarizing a large number of studies we find reported frequencies between 1 Hz and 100 Hz. What frequencies we should expect from muscles involved in speech production, if any at all, is not possible to say based on these studies.

We have found one study that has some bearing on speech, however. In a study by Smith and Denny (1990) EMG measurements were used to study the activity of the diaphragm during speech and breathing. Activity was registered in the 20–110 Hz range in deep breathing. In speech, however, the 60–110 Hz range was significantly reduced. There are a few EMG studies where the larynx muscles have been studied. None of them mention microtremor, and two of them (Hirano and Ohala 1969 and Hirano et al. 1970) cite no frequency measures at all. The study by Faaborg-Andersen (1957) has a section on frequency data for four major larynx muscles during silence and phonation. Firing rates for single motor units were found to range between 5 and 50 Hz for different muscles and different phases of phonation. We may also observe that there were substantial differences in the behaviour of the studied muscles (cricothyroid, arytenoid, posterior cricoarytenoid and vocal muscles). If these signals were to have any effect on the resulting auditory signal (which of course we know nothing about) we should expect frequency components distributed over the whole frequency range and rapidly varying spectra.

Even based on this brief summary we may thus conclude that the 10 Hz peak found for the muscles studied by Lippold and others is by no means universal. Anything in the 1–100 Hz range at least is possible and only specific experimental studies can determine what the frequencies are in any given case. We may summarize these findings by saying that the only scientific study explicitly involving the larynx muscles found no microtremor at all. The two most likely explanations for this finding are that 1) there is no microtremor at all in these particular muscles, or 2) microtremor in them does not occur in the 10 Hz region as in the much larger muscles controlling body movement, but in some other, probably higher, frequency region and that we are unlikely to find any stable frequency peaks since the larynx muscles are typically in rapid motion.
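The contrast reported by Shipp and Izdebski, a prominent spectral peak in one recording and randomly distributed energy in the other, can be illustrated with a small sketch on synthetic data. The sampling rate, duration, noise level and function names below are assumptions made for the illustration; the signals are generated, not measured, and the code does not reproduce the original analysis.

```python
import cmath
import math
import random
import statistics

FS = 200          # sampling rate in Hz (assumed)
DURATION = 2.0    # seconds of signal (assumed)

def band_power(signal, fs, f_lo=1, f_hi=20):
    """Power at each integer frequency in [f_lo, f_hi] Hz via a direct DFT."""
    powers = {}
    for f in range(f_lo, f_hi + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * f * n / fs)
                  for n, x in enumerate(signal))
        powers[f] = abs(acc) ** 2
    return powers

def peak_summary(powers):
    """Return (frequency of the strongest bin, its power / median band power)."""
    peak_f = max(powers, key=powers.get)
    return peak_f, powers[peak_f] / statistics.median(powers.values())

rng = random.Random(1)
n = int(FS * DURATION)
noise = [rng.gauss(0.0, 0.5) for _ in range(n)]
# Synthetic "biceps-like" signal: a 9 Hz periodic component plus noise
with_tremor = [math.sin(2 * math.pi * 9 * i / FS) + e for i, e in enumerate(noise)]
# Synthetic "larynx-like" signal: noise only, no periodic component
without_tremor = noise

tremor_peak, tremor_ratio = peak_summary(band_power(with_tremor, FS))
noise_peak, noise_ratio = peak_summary(band_power(without_tremor, FS))
print(f"with tremor: peak at {tremor_peak} Hz, peak/median = {tremor_ratio:.1f}")
print(f"noise only:  peak at {noise_peak} Hz, peak/median = {noise_ratio:.1f}")
```

In the first signal the strongest bin lies at 9 Hz and stands far above the rest of the band; in the second, the strongest bin is merely the largest of a set of random fluctuations and its location carries no meaning.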

Conclusions concerning the question of microtremor

Based on the literature survey we have made, we feel confident in saying that there is no scientific evidence to support the idea that microtremor in the 10 Hz region occurs in the muscles involved in speech production. After a thorough study of relevant scientific papers published during the past 50 years we have not found a single study whose results lend support to this idea. Claims like ‘The Microtremor in laryngeal muscles has been shown to reflect the level of stress being experienced by the individual due to deception’ (Diogenes Company home page) are thus simply untrue.

But even if we speculate that there may be microtremor also in these muscles, there is a very long way to go before we arrive at a voice stress detector. First we must demonstrate that such tremor is possible to detect in the speech signal, which is of course not necessarily the case and must be demonstrated separately. Secondly it must be shown that the tremor is affected by psychological stress and that tremor fluctuations as a function of stress are rapid enough to be detected within the time frame of single utterances, not to speak of the single words or syllables that are often claimed to contain enough information for vocal microtremor analyses. Studies involving other muscles tend to show that similar effects apply over much longer durations. Fatigue, for example, may have an effect over a day or more. Restricting the blood flow to a muscle damps physiological tremor but it takes between 30 and 60 seconds before the tremor is fully damped. Such time windows are hardly useful to detect stress in single, often short, utterances like ‘yes’ or ‘no’. It is also worth pointing out that while physiological tremor has mostly been found and measured in muscles under static tension, the speech organs responsible for voice production are typically in constant motion.
The period of the assumed 10 Hz microtremor, 100 milliseconds, is in fact longer than most speech segments in connected speech. This means that during one single period of the supposed microtremor frequency, multiple adjustments of the larynx muscles are often needed to produce continuous speech. Thirdly it remains to be demonstrated that the effects of stress caused by lying or deception may be reliably separated from stress caused by other factors. Regardless of method, this last requirement is at present not possible to meet and we do not even know to what extent it ever will be.

Finally we will cite two comments we found on one of the Diogenes home pages containing references to ‘Studies Validating Voice Stress Analysis’. For a paper by Lippold et al. (1957) the comment reads: ‘Lippold, Redfearn and Vuco begin exploring the correlation between muscle activity and stress’. The actual paper reports a study of variation in the grouping of action potentials in the calf muscle as a function of contraction, stretching, fatigue and cooling and the word ‘stress’ is never even mentioned. A study by Lippold (1970) is said to be the study where: ‘Lippold first discovers the physiological tremor in the human voice in the 8–12 Hz range’ whereas the paper itself is exclusively concerned with the study of physiological tremor in the left hand middle finger.

In summary, as our survey has shown, the VSA approach completely lacks demonstrable scientific validity. In fact, all available evidence indicates that its validity is non-existent.

Reliability studies of commercially available VSA-based lie detectors

As we have said above we hold that before it is really meaningful to test the reliability of a method it should be demonstrated that we have reasonable grounds for assuming that the method is valid. There are nevertheless a large number of studies where the assumed reliability has been investigated. It is not our intention in this paper to discuss these studies in any detail. The papers are easy enough to find for those who want to consult them. A comprehensive list of references may be found in Damphousse et al. (2007). It comes as no surprise to us, of course, that the lie detectors almost invariably perform at chance level when tested by qualified researchers. Here are two typical conclusions, one from an early and one from a recent study:

Both trained and untrained analysts were unable to … sort the voice-stress patterns consistently, at a greater than chance level (Lynch and Henry 1979: 91)

The findings generated by this study led to the conclusion that neither the CVSA [Computerized Voice Stress Analyzer] nor the LVA were sensitive to the presence of deception or stress. Several analyses of subsets of the data were undertaken to explore any possibility that either system could perform under even more controlled conditions, but no sensitivity was observed in any of these analyses either (Hollien and Harnsberger 2006: 41)

We will end this part of the paper by reporting some of the results found in the study by Damphousse et al. mentioned above. This study differs from most otherwise similar studies in that the veracity of the tested statements was decided by comparison with irrefutable physical evidence:

… we interviewed arrestees in the Oklahoma County jail about their recent illicit drug use. Answers by the respondents were compared to the results of a urinalysis to determine the extent to which they were being deceptive. Then, their ‘actual deceptiveness’ was compared to the extent to which deception was indicated by the VSA programs (Damphousse et al. 2007: 26)

We have chosen to present this particular study for two reasons. First, it represents the situation today because the equipment tested was the most recent model. That eliminates excuses that the results were obtained on an outdated model and that significant improvements have been made on later models. Second, the test situation comes as close as one can get to a real field work situation. But of course none of this helps if the equipment is an invalid product.

All previously published research conducted in a lab setting has failed to find support for VSA theory or technology … Our research therefore complements previous research by failing to find support for the VSA products in a real world (jail) setting. In addition, the programs do not seem to have very high inter-user reliability even though the programs were relatively easy to learn and implement (Damphousse et al. 2007: 89)

A technical note on the Voice Stress Analyzers

A few studies have looked at the VSAs from a technical point of view and the ‘sophisticated hardware’ some of them advertise seems to be no more than simple low pass filters (VanDercar et al. 1980: 176–179). Today most of that is done by software, but the basic principles are likely to be the same. The low pass filtered signal is then supposedly subjected to a frequency analysis. For at least one of the products this is made explicit. ‘Micro tremors are tiny frequency modulations in the human voice’ (National Institute for Truth Verification, NITV home page).

The CVSA (of the NITV) has not been tested for frequency modulation sensitivity but a similar product, the Diogenes Lantern, has. Haddad et al. (2002) produced synthetic signals with fundamental frequencies of 80 Hz or 160 Hz. These were then frequency modulated with frequencies varying from 1 Hz to 25 Hz and the resulting signals were tested using the Lantern VSA detector. It turned out that the VSA analyzer was almost completely insensitive to variation in frequency modulation. The authors drew the conclusion that: ‘Since there was no variation of indicated stress from different input signals, it can be assumed that the systems tested do not use microtremors as indicated in their claims’ (p. 11). Interestingly, they also found that the sensitivity of the system was not, as one might expect, tuned to the frequency range claimed to be crucial in the microtremor analysis (8–12 Hz) but to a rather different range:

It was determined, late in the testing phase of this project, that the Diogenes Lantern System measures the energy change of the spectrum envelope between 20 Hz and 40 Hz. This is what the Diogenes Lantern System claims to be microtremors. It is the change of energy in the speech envelope. (Haddad et al. 2002: 11)

It thus seems as if at least this particular VSA analyzer does not analyze frequency changes at all and is not even operating in the claimed frequency region. Results like these certainly raise the question as to whether we may trust any of their claims. Not even the technical specifications seem to be correct.
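
The kind of test signal described by Haddad et al. can be sketched in a few lines of Python. This is only an illustration of the idea, not their procedure: the sinusoidal carrier, the modulation depth `depth_hz`, the duration and the sampling rate are all assumptions of ours, since the text above gives only the fundamental frequencies (80 Hz and 160 Hz) and the range of modulation frequencies (1 Hz to 25 Hz).

```python
import math


def fm_tone(f0_hz, fm_hz, depth_hz=5.0, duration_s=1.0, sample_rate_hz=8000):
    """Sinusoid at f0_hz whose frequency is modulated sinusoidally at fm_hz.

    The instantaneous frequency is f0_hz + depth_hz * cos(2*pi*fm_hz*t).
    depth_hz, duration_s and sample_rate_hz are illustrative assumptions.
    """
    beta = depth_hz / fm_hz  # modulation index
    n = int(duration_s * sample_rate_hz)
    samples = []
    for i in range(n):
        t = i / sample_rate_hz
        phase = 2 * math.pi * f0_hz * t + beta * math.sin(2 * math.pi * fm_hz * t)
        samples.append(math.sin(phase))
    return samples


# One signal per combination of fundamental and modulation frequency.
test_set = {
    (f0, fm): fm_tone(f0, fm)
    for f0 in (80, 160)
    for fm in range(1, 26)
}
print(len(test_set), "signals of", len(test_set[(80, 10)]), "samples each")
```

A detector that really tracked 8–12 Hz frequency modulation would be expected to respond differently to the signals modulated in that band than to the others, which is the property the cited test probed.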

The LVA analyzer

LVA is an acronym for Layered Voice Analysis. The company that manufactures the LVA is called Nemesysco and their product line comprises many different products, but they are all basically utilizing the same technology. Here is one example of how the LVA technology is presented:

LVA uses a patented and unique technology to detect ‘brain activity traces’ using the voice as a medium. By utilizing a wide range spectrum analysis to detect minute involuntary changes in the speech waveform itself, LVA can detect anomalies in brain activity and classify them in terms of stress, excitement, deception, and varying emotional states, accordingly. (Nemesysco home page 5)

All this is possible, according to Nemesysco, since ‘every ‘event’ that passes through the brain will leave its traces on the speech flow’. And here is how the extraction of brain events is accomplished:

LVA has two basic formulas comprised of [sic] unique signal-processing algorithms that extract more than 120 emotional parameters from each voice segment. These are further classified into nine major categories of basic emotions. Depending on the goal of the analysis, up to eight ‘final analysis’ formulas can also be applied to the emotional parameter data. These include: Lie stress analysis, Arousal level, Attention level, Deception patterns match, and additional methods for veracity assessments. (Nemesysco home page 6)

One would assume that such extraordinary discoveries must be widely published and discussed, but the fact of the matter seems to be that they are completely unknown to the scientific community. The solution to this apparent mystery will become clear in the following paragraphs where we describe in detail what the LVA technology is all about.

Another aspect of the LVA technology that is highlighted in all their promotional materials is how extremely technically advanced it is.

Layered Voice Analysis (LVA) is the most sophisticated truth detection technology available today … LVA is based on the technology of vocal stress analysis calculated from a series of sophisticated algorithms that detect states of stress. (V-solutions home page 7)

LVA fiction meets reality

Contrary to the claims of sophistication – ‘The LVA software claims to be based on 8,000 mathematical algorithms applied to 129 voice frequencies’ (Damphousse et al. 2007: 15) – the LVA is a very simple program written in Visual Basic. The entire program code, published in the patent documents (Liberman 2003), comprises no more than 500 lines of code. It has to be said, though, that some technical details, like variable declarations, are omitted so that the program cannot be copied and run as is, but the complete program is unlikely to comprise more than 800 or so lines. With respect to its alleged mathematical sophistication, there is really nothing in the program that requires any mathematical insights beyond very basic secondary school mathematics. To be sure, recursive filters and neural networks are also based on elementary mathematical operations but the crucial difference is that these operations are used in theoretically coherent systems, in contrast to the seemingly ad hoc implementation of LVA.

Let us begin with a short technical description. In the verbal description of the program for the patent documents, the author describes the program as ‘detecting emotional status of an individual based on the intonation information’. But whereas intonation in phonetics means variation in pitch encoded by fundamental frequency (albeit almost always accompanied by other prosodic factors) the author of the LVA mistakenly believes that what he calls ‘thorns’ and ‘plateaus’ represent intonation. A ‘thorn’ is defined in the following way (Liberman 2003):

A ‘thorn’ is a notch-shaped feature. For example, the term ‘thorn’ can be defined as:
a. a sequence of 3 adjacent samples in which the first and third samples are both higher than the middle sample, or
b. a sequence of 3 adjacent samples in which the first and third samples are both lower than the middle sample.
And plateaus are defined in the following way (Liberman 2003):

A ‘plateau’ is a local flatness … A sequence may be regarded as flat if the difference in amplitude between consecutive samples is less than a predetermined threshold.

Thorns and plateaus are illustrated graphically in Figure 1.

Figure 1. The diagram above is based on a corresponding diagram in the LVA patent documents. It illustrates a portion of a digitized signal and what is meant by ‘Thorns’ (marked with circles) and a ‘Plateau’ (marked with an ellipse).
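
The two definitions quoted from the patent are simple enough to write down directly. The following Python sketch is our own reading of them and not the patent code: the function names are hypothetical, and we assume that flat runs shorter than 3 or longer than 20 samples are simply ignored, since the text only says that plateau lengths are allowed to vary between 3 and 20 samples.

```python
def count_thorns(samples):
    """Count 3-sample sequences whose middle sample is strictly lower
    (case a) or strictly higher (case b) than both of its neighbours."""
    count = 0
    for first, middle, third in zip(samples, samples[1:], samples[2:]):
        if (first > middle and third > middle) or (first < middle and third < middle):
            count += 1
    return count


def plateau_lengths(samples, threshold, min_len=3, max_len=20):
    """Return the lengths of flat runs, where 'flat' means that consecutive
    samples differ by less than threshold. Runs outside min_len..max_len
    are dropped (an assumption on our part)."""
    lengths = []
    run = 1
    for previous, current in zip(samples, samples[1:]):
        if abs(current - previous) < threshold:
            run += 1
        else:
            if min_len <= run <= max_len:
                lengths.append(run)
            run = 1
    if min_len <= run <= max_len:
        lengths.append(run)
    return lengths


signal = [0, 3, 1, 4, 4, 4, 4, 2, 5, 0]
print(count_thorns(signal))        # 4
print(plateau_lengths(signal, 1))  # [4]
```

Nothing in either function refers to pitch, fundamental frequency or any other property of speech; both operate on the raw sample values alone.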

The speech signal consists of pressure oscillations relative to the ambient atmospheric pressure, so inevitably there will be all sorts of amplitude variation due to the very nature of the signal. These variations capture a mixture of aspects related to both the voice source and the characteristics of the frequency and phase transfer function of the vocal tract (along with sub-glottal cavities). In the transmission process the signal is further influenced by ambient acoustics, and if the signal is recorded, factors like the type of microphone used and linearity of the recording system will also modify the signal. Producers of the LVA and other voice based detectors show no signs of being aware of this complexity.

When an analog signal is digitized the complex continuous variation found in the original signal is replaced by a simplified discrete representation. How closely this representation matches the original depends on the sampling parameters but the match will never be perfect. It is in the digitization process that the ‘thorns’ and ‘plateaus’ are created. There is obviously an indirect relationship between thorns and plateaus and the original waveform, but the number of thorns and plateaus, which is the very basis for all computations in the LVA, depends crucially on sampling rate, amplitude resolution and the threshold values defined in the program. It is therefore correct to say that these computations are basically no more than statistics based on digitization artefacts.

The lie detection is performed in two steps. The first step is called ‘calibration’. In this step, speech samples meant to represent neutral emotion are recorded. Based on these recordings a baseline is calculated.
The baseline is no more than a simple statistical summary of the number of thorns, the number of plateaus, the distribution of plateau lengths (they are allowed to vary between 3 and 20 samples) and the range of variation in these factors. When this step is completed, the actual emotion detection may begin. In this step the statements to be tested are recorded and the statistics for these statements are compared with the baseline. Based on the deviations from the baseline the various emotional states are computed. Here is an example copied from the patent documents:

A crLIE value to 50 may be deemed indicative of untruthfulness, values between 50 and 60 may be deemed indicative of sarcasm or humor, values between 60 and 130 may be deemed indicative truthfulness, values between 130 and 170 may be deemed indicative of inaccuracy or exaggeration, and values exceeding 170 may be deemed indicative of untruthfulness. (Liberman 2003)

And that is all there is. There is nothing special about these computations, except that there is no theoretical basis for them or independent motivation for the proposed ranges. First of all, the creation of digitization artefacts is completely independent of what the recorded sound represents. The program would analyze any sound the same way, be it a man speaking, an idling car engine, a dog barking or a tram passing by. Secondly, the number and distribution of thorns and plateaus depend crucially on a number of factors that have to do with how the digitization is performed. Different sampling frequencies and amplitude resolutions would produce different results. Exactly at which moment in time the sampling begins can also have an effect.
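
The dependence on digitization settings is easy to demonstrate. The Python sketch below digitizes one and the same 100 Hz sine wave in four different ways and counts the thorns in each version. The test tone, the settings and the function names are our own illustrative choices; the thorn definition is the one quoted from the patent above.

```python
import math


def count_thorns(samples):
    """Count 3-sample sequences whose middle sample is a strict local extremum."""
    return sum(
        1
        for a, b, c in zip(samples, samples[1:], samples[2:])
        if (a > b and c > b) or (a < b and c < b)
    )


def digitize(freq_hz, duration_s, sample_rate_hz, bits, start_s=0.0):
    """Sample a unit-amplitude sine wave and round it to signed integers."""
    full_scale = 2 ** (bits - 1) - 1
    n = int(round(duration_s * sample_rate_hz))
    return [
        round(full_scale * math.sin(2 * math.pi * freq_hz * (start_s + i / sample_rate_hz)))
        for i in range(n)
    ]


# The same 50 ms of a 100 Hz sine, digitized four different ways.
cases = {
    "8000 Hz, 16 bit": digitize(100, 0.05, 8000, 16),
    "8000 Hz, 8 bit": digitize(100, 0.05, 8000, 8),
    "1000 Hz, 16 bit": digitize(100, 0.05, 1000, 16),
    "1000 Hz, 16 bit, start shifted 0.5 ms": digitize(100, 0.05, 1000, 16, start_s=0.0005),
}
thorn_counts = {name: count_thorns(samples) for name, samples in cases.items()}
for name, count in thorn_counts.items():
    print(name, "->", count)
```

With these particular settings the thorn count is 10, 0, 0 and 10 respectively, although the underlying waveform is identical in all four cases. Coarser amplitude resolution flattens the peaks into runs of equal samples, and at the lower sampling rate the result depends on whether a sample happens to fall exactly on a peak.
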
We initially intended to use the code published in the patent documents to make a running copy of the program, but the code is rather messy and not particularly well structured and we decided it would not be worth the time and effort to clean up the code in order to convert it into a running program. The Damphousse et al. group report that the program crashed repeatedly during their experiments so it is obviously rather unstable too. It is rather easy, however, to reconstruct what the program is supposed to do by following the code together with the verbal comments and explanations in the patent documents. We therefore decided to simulate the program in Mathematica instead to get better control and be able to monitor the computations more closely.

The presentation of analysis results in the program is modelled on what we, in the beginning, jokingly called ‘the horoscope principle’, a description we have come to regard as more and more accurate in the course of our work. Let us illustrate what we mean by a short example based on our simulations. In one of our tests we used an interview with a well-known Swedish politician. Using the threshold settings suggested in the patent documents we got the following result for the main output labels:

Untruthfulness, Low stress, Thinking less than in the calibration, Normal excitement

Looking at it superficially this is not an implausible profile for a politician. But as we will explain in the following paragraph, the combination of output labels and variables is not motivated and any one of the logically possible permutations of variables and labels would work equally well. Choosing the same input but a different variable/label combination produced the following analysis: ‘Truthfulness, High stress, Thinking less than in calibration, Normal excitement’ which is an equally plausible description as is ‘Truthfulness, Normal stress, Thinking more than in calibration, Low level of excitement’ and so on.
And all these results are, of course, completely detached from anything that has to do with the mental state of the speaker.

The output of an analysis is structured much along the same lines as horoscopes. By presenting the result as a combination of several statements it is possible to achieve an overall description that seems reasonably balanced and plausible – a little bit of negative information and a little bit of positive or neutral information, and general enough not to seem implausible. Rather large intervals for each emotional degree make extreme combinations less likely, further enhancing the apparent reasonableness of the output. For this insight into human psychology at least we must give the author of the program some credit.

In our simulations we have used the output label thresholds suggested in the patent documents, but as we will explain in the following paragraph, the correspondence between the labels and what they represent (e.g. emotional stress level vs. average number of thorns), is perfectly arbitrary. We have also noticed that they are given slightly different wordings in different applications, and radically different wordings in special applications like the so-called ‘love detector’.8 From the producers’ point of view this makes a lot of sense. Why waste time and energy inventing a new program when all that is needed to build a love detector for example is to rename ‘cognitive stress’ as ‘sexual attraction’?

To sum up by saying that there is absolutely no scientific basis for the claims made by the LVA proponents is an understatement. The ideas on which the products are based are simply complete nonsense.

Definitions of some fundamental variables in the LVA program

Haddad et al. (2002: 23) present a table which summarises variable descriptions for Diogenes Lantern (VSA) and the LVA based Vericator by Nemesysco. We have adapted their list and added in column three the definitions of the variables in the program (Table 1). A comparison between the two sets of definitions is a telling illustration of the discrepancy between LVA fiction and reality. It should be obvious even to the technically less advanced reader that the assumption of a correlation between for example ‘emotional stress level’ and ‘the average number of thorns’ is completely arbitrary.

To further illustrate the randomness of the approach we would like to make the reader aware that there is nowhere in the program or the text in the patent document any motivation why, for example, the average number of thorns should represent ‘emotional stress level’ and not ‘global stress level’. The assignment of interpretations to variables is also completely arbitrary and the program would work perfectly well producing the same type of analysis if we changed the combination of variables and interpretations around in any of the possible permutations. So why was the particular combination listed above chosen? You will not find the answer in any of the documents and our personal guess is that there is no particular reason.

Table 1: Variable descriptions (modified from Haddad et al. 2002)

Variable | As described to the user | As defined in the program
SPT | A numeric value describing the relatively high frequency range. Vericator associates this value with emotional stress level | The average number of thorns
SPJ | A numeric value describing the relatively low frequency range. Vericator associates this value with cognitive stress level | The average number of plateaus per sample
JQ | A numeric value describing the distribution uniformity of the relatively low frequency range. Vericator associates this value with global stress level | The standard error of plateau length
AVJ | A numeric value describing the average range of the relatively low frequency range. Vericator associates this value with thinking level | The average plateau length
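
To make the point about permutations concrete, the short Python sketch below enumerates every way of pairing the four variables in Table 1 with the four interpretations given to the user. The variable names and labels are taken from the table; the code itself is only our illustration and is not part of the LVA program.

```python
from itertools import permutations

variables = ["SPT", "SPJ", "JQ", "AVJ"]
labels = [
    "emotional stress level",
    "cognitive stress level",
    "global stress level",
    "thinking level",
]

# Every possible pairing of the four variables with the four labels.
assignments = [dict(zip(variables, p)) for p in permutations(labels)]

print(len(assignments))  # 24
print(assignments[0])    # the pairing listed in Table 1
```

Nothing in the patent text, as far as we can see, singles out the first of these 24 pairings over the other 23.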

Reliability studies of the LVA

The LVA is much newer than the VSA-based machines so there are only a few reliability studies. We will again quote the ones by Hollien and Harnsberger (2006) and Damphousse et al. (2007):

The performance of LVA on the VSA database … was similar to that observed with CVSA. That is, neither device showed significant sensitivity to the presence of stress or deception in the speech samples tested. The true positive and false positive rates were parallel to a great extent. (Hollien and Harnsberger 2006: 40)

Although the LVA instrument tended to perform better than the CVSA instrument, both programs failed consistently to correctly identify respondents who were being deceptive. (Damphousse et al. 2007: 88)

As was the case with the VSA, the vendors of the LVA equipment complained that the negative results in the above cited studies were due to the fact that the research teams had not properly followed the required procedures. But in both studies, the research teams had been trained by following the in-house training provided by the vendors. Also the test procedures had been thoroughly discussed with the vendors.

We would like to counter a possible objection to our description of the LVA program right away. The vendors might object that we have based our verdict only on the patent documents and not on the ‘real thing’ and that the present version is vastly improved. We would not be particularly worried by any objections along those lines. First of all the Nemesysco Company makes reference to the patent in all its documentation. There is no mention anywhere that what is described in the patent documents is not what the current machines are based on. Quite the contrary, the company is seeking patents in more countries all the time using the same description. Secondly, we have read the correspondence between the Hollien group and the LVA vendors regarding their complaints about the methodology. In these documents there is a rather detailed technical discussion mentioning functions used in the program, suitable thresholds and so on. All this information is in perfect agreement with what we have found in the patent documents and so is the brief function description in Haddad et al. There is thus no indication that the code has been changed in any substantial way. It will surely have been updated with respect to graphical interface and other details but the basic principles are most certainly the same.

Who is Mr Liberman?

We might as well have asked: Who is Nemesysco, the company behind the LVA products, because Mr Liberman and Nemesysco seem to be one and the same. Damphousse et al. (2007: 14) report as follows: ‘The LVA was developed in Israel by Amir Lieberman [sic] who applied mathematic algorithm science to voice frequencies’, giving the impression that the program is based on some advanced mathematical theory. As we have pointed out, this is far from the truth.

When we first became aware of the LVA, in connection with an attempt in 2004 to introduce the LVA on the Scandinavian market, we too were given the impression that Mr Liberman was indeed a high ranking Israeli mathematician. We do not know the origin of these rumours. It has been said that the information once appeared on the Nemesysco home pages but we have not been able to confirm this. Screening the Nemesysco home pages we became highly suspicious of these claims, however. To acquire more information about the person behind the products we consulted an Israeli colleague who is an active speech science researcher and asked him if he knew of a mathematician by that name. He did not.

A controversy arose between us and the Scandinavian representatives of the LVA whom we, after a careful study of the LVA claims, accused of trying to peddle a bogus product. This controversy, partly fought in a newspaper, caught the interest of a journalist, Arne Lapidus, who was working in Israel for the Swedish daily Expressen. After some research he managed to locate Mr Liberman, a 32 year old (in 2004) businessman in a small office in the town of Natania. The business appeared to be a one-man operation. Mr Lapidus interviewed Mr Liberman (note 9) about his academic background and was told that he basically had none. He has no degree (never had time to get one, he explains) but has taken some courses in marketing at an Israeli open university.
As we have explained above, the LVA is a simple program written in rather amateurishly used Visual Basic. Given what we now know about Mr Liberman, that is about what one would expect rather than ‘8,000 mathematical algorithms applied to 129 voice frequencies’ (Damphousse et al. 2007: 15). What still remains for us to understand is how insurance companies, security agencies and police departments can be willing to invest hundreds of thousands of dollars, pounds, and euros in equipment without ever asking who is behind the products, what their qualifications are, and what scientific principles the products are based on. The program code is part of the patent documents and may be downloaded from patents on-line. Any qualified speech scientist with some computer background can see at a glance, by consulting the documents, that the methods on which the program is based have no scientific validity. Why did those who so willingly invested huge amounts of money not even bother to look? For us this is the real puzzle.

Sales figures for some of the cited products

While, as we have seen, the voice stress detectors are not of any real use as the lie or stress detectors they are claimed to be, they have certainly not been without success in other areas. One such area is making money for the vendors. The programs are sold pre-installed on (usually) laptop computers. The National Institute for Truth Verification, for example, sells their program (CVSA) pre-installed on a Dell laptop computer for $US9,995. And this is the least expensive option. A quick price search on the Internet shows that the computer itself can be found for around $US2,000. To be a ‘certified examiner’ one is required to go through a training programme organized by the vendor at the cost of $US1,440 per student. This company alone claims to have sold their products to over 1,400 agencies throughout the US. Even if each agency has bought only one laptop of the cheapest kind and sent only one person to the training we are talking about a gross income of more than $US16,000,000. The LVA is even more expensive. They charge around $US25,000 for a comparable laptop/training package.
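
The lower bound quoted above follows directly from the figures in the text, as the following short Python calculation shows. It assumes, as the text does, exactly one cheapest laptop and one trained examiner per agency; the variable names are ours.

```python
agencies = 1_400          # agencies the vendor claims to have sold to
laptop_price = 9_995      # $US, CVSA pre-installed on a laptop (cheapest option)
training_price = 1_440    # $US, 'certified examiner' course per student
computer_price = 2_000    # $US, approximate price of the laptop alone

gross_income = agencies * (laptop_price + training_price)
software_markup = laptop_price - computer_price

print(gross_income)     # 16009000
print(software_markup)  # 7995
```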

Success stories and moral issues

If we consult the home pages of the lie detector vendors we will find a completely different picture of the reliability of their products. Here are two examples concerning police investigations.

The subject was shown the deceptive charts and, following several hours of interrogation confessed

The subject was then confronted with the results of the CVSA as well as other information connecting him to the crime and he gave a full confession (National Institute for Truth Verification home page 10)

But lie detectors have also been used with great success by insurance companies:

A car insurer which introduced phone lie detectors says a quarter of all vehicle theft claims have been withdrawn since the initiative began. Admiral started using Digilog voice stress analysis technology in May, in an attempt to stamp out fraudulent claims. When policy holders call, they are told they are being recorded and their voices are being analysed. (BBC, 30 October 2003 11)

and banks:

Fraud detection savings have increased six fold since the introduction of voice recognition analysis software in 2003, Halifax Bank of Scotland General Insurance reported this week. In 2005, of the total claims referred for investigation, 39% were assessed using the DigiLog VRA technology, which identified 44% as High Risk prompting further assessment. Claimants withdrew their claims voluntarily in half of these cases. (Post online, 28 September 2006 12)

And there is more to come:

Lie detectors will be used to help root out benefit cheats, Work and Pensions Secretary John Hutton has said. So-called ‘voice-risk analysis software’ will be used by council staff to help identify suspect claims. It can detect minute changes in a caller’s voice which give clues as to when they may be lying. The technology is already used by the insurance industry to combat fraud and will be trialled by Harrow Council, in north London, from May. (BBC, 5 April 2007 13)

Insurance companies seem to favour an application of the LVA called ‘voice-risk analysis’ which judging from the description on the Nemesysco home page is basically the lie detector with a different name. The insurance companies do not define success in terms of confessions like the police, of course, but in terms of increased benefits and reduced costs.

How are we to explain these success stories in the face of what we have said above about the complete lack of both validity and reliability of these products? The explanation lies in what in the scientific literature is referred to as ‘the Bogus Pipeline Effect’; ‘The expectation is that subjects will answer more honestly if they believe that the truth can be tested for accuracy’ (Damphousse et al. 2007: 82) or ‘no one wants to be second-guessed by a machine’ to put it in the words of the originators (Jones and Sigall 1971: 349). This hypothesis has been confirmed in many studies. A short but clear description and useful references may be found in Damphousse et al. (2007: 82). Their investigation includes an attempt to quantify the Bogus Pipeline Effect in a lie detector study. We are not aware of any other comparable study where that has been done.

Their own experimental investigation did not include a study of the Bogus Pipeline Effect but material from a previous experiment carried out at the same prison made such a study possible. Conditions in the earlier study were basically the same except for the absence of a lie detector. In both studies subjects were informed about the use of urinalysis and in the Damphousse et al. study they were also informed that their answers were to be analysed by a lie detector. By comparing the two studies it was possible to isolate the influence of the Bogus Pipeline Effect caused by the lie detector information. It turned out that the effect was substantial. In the Damphousse et al.
study only 14% lied about recent drug use compared to 40% in the study where no lie detector was used or mentioned. The authors conclude that ‘Arrestees who thought their interviewers were using ‘lie detectors’ were much less likely to be deceptive when reporting recent drug use’.

It is important to point out that the remarkably strong effect is the effect of informing the subjects about the use of a lie detector only. Whether the lie detector actually does anything or is even physically present is irrelevant. Telling the subjects that a lie detector will be used, but without actually using one, will have the same effect as long as the subjects believe that a lie detector is used. This is the important message to keep in mind when judging reports by the police about how the use of a lie detector helped them get a confession or when insurance companies inform us about how much money they have saved by a decrease in insurance fraud. Bringing down false claims from 40% to 14% is likely to correspond to millions of pounds or dollars. So if U.K. insurance companies claim they cut their costs by millions of pounds by using ‘lie detectors’, or US police officers say they can make suspects confess by showing them the results of the ‘Voice Stress Analysis’, or social security administrators say they may bring down benefit fraud, we have no reason to question this.

The question in these cases is not about reliability but about moral principles. We know from the reliability tests reported above and from our own study of the scientific validity of these gadgets that they have no ability to reveal lies and deception as such. To inform customers or suspects that you have a lie detector capable of distinguishing between deception and the truth is simply untrue, a lie if you wish, and that raises a number of moral questions to consider.
Should we accept that insurance companies increase their profits by lying to their customers? Is the use of lies acceptable if it makes a suspect confess? Do we want councils to bring down social benefit costs by lying to their clients? We find no reason to answer these questions by providing our own personal views, but we think that anyone who has an insurance policy, who applies for social benefits or is concerned with the methods used in criminal investigations, should ask these questions and pose them to those responsible for deciding the policies in the respective area.

We would like to make one more observation related to confirming evidence of the type: ‘and then they confessed’. Haddad et al. (2002) fall into the same trap of counting confessions as proof of test reliability: ‘Both suspects confessed and were subsequently convicted of murder’ (p. 16). The study also contains two reports by police officers using the Lantern VSA in their work. One of them reports: ‘I have found in several cases that a person ‘fails’, if you will, on all relevant/crime questions, but has been found to have not committed the crime’. Statements like these are seldom heard, but when we look at the reliability tests we learn that false positives are about as common as true positives (Hollien and Harnsberger 2006: 37) which is precisely what one would expect from a machine that operates at chance level. Conveniently forgetting about the false positives will of course boost the reliability figures in all cases where confessions are used as the criterion. The Bogus Pipeline effect in combination with ‘forgetting the false positives effect’ goes a long way in explaining reports like ‘Over the past three years I would say that I have achieved a success rate of about 97 percent on tests vs. confessions’ (Criminal Investigator, Michael G. Adsit, cited in Haddad et al. 2002: 17).
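The way confession counting inflates a success rate can be made concrete with a small calculation. The sketch below is not taken from any of the cited studies. It assumes a device that flags exactly half of every group as ‘deceptive’ (chance level), assumes that only flagged subjects are confronted with the result, and uses invented figures for the number of subjects, the number of guilty subjects and the confession rate. The function name `confession_bias` and all of its parameters are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    actual_accuracy: float   # share of all subjects classified correctly
    apparent_success: float  # share of confession cases the device had flagged
    confessions: int         # number of confessions obtained


def confession_bias(n_subjects: int, n_guilty: int, confess_percent: int) -> Outcome:
    """Compare real accuracy with 'tests vs. confessions' accuracy.

    Assumptions (all hypothetical): the device flags half of the guilty
    and half of the innocent subjects; only flagged subjects are
    interrogated; only guilty subjects confess.
    """
    n_innocent = n_subjects - n_guilty
    true_pos = n_guilty // 2             # guilty and flagged
    false_pos = n_innocent // 2          # innocent but flagged
    true_neg = n_innocent - false_pos    # innocent and not flagged

    flagged_confessions = true_pos * confess_percent // 100
    unflagged_confessions = 0            # never interrogated, so never confess

    actual = (true_pos + true_neg) / n_subjects
    total_confessions = flagged_confessions + unflagged_confessions
    # Forgotten here, as in the reports: the false_pos subjects who were
    # flagged but had nothing to confess.
    apparent = flagged_confessions / total_confessions if total_confessions else 0.0
    return Outcome(actual, apparent, total_confessions)


result = confession_bias(n_subjects=1000, n_guilty=300, confess_percent=40)
print(result)
```

Under these assumptions the device is right in half of all cases, yet every confession comes from a subject it had flagged, so a tally restricted to confession cases shows a perfect record by construction. The 350 innocent subjects who were also flagged never enter the tally.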

Security issues

There have been rumours in the press for some time that lie detectors will be used in security surveillance, for example at airports. The Nemesysco Company is marketing a product called the Gate Keeper (GK1 is the current model) using the LVA technique that is meant for such applications. They claim that it is already in use at Moscow International Airport (Domodedovo) but we have not been able to obtain independent confirmation. If this is true and more airports will follow, we are not only looking at morally dubious business but at a very real threat against airport security. Since the LVA technique is totally unreliable it would mean that part of airport security will be based on decisions no more valid than throwing a pair of dice. Obviously the GK1 will not replace other systems and procedures, but it is serious enough if it is allowed to divert attention from real security work by letting security personnel waste their time and effort on a completely meaningless task.

Is there anything we can do to prevent charlatanry in forensic speech science?

Charlatanry, fraud, prejudice and superstition have always been with us. If we look back in history and compare with what we see today there is little that gives us hope that progress in science will diminish the amount of superstitious nonsense we see around us. Astrology, for example, seems to be more popular than ever and totally unaffected by how many times astronomers explain that it is complete nonsense. We are therefore somewhat pessimistic about the possibility of efficiently removing charlatanry from forensic speech science. But we hope that responsible authorities like the police and security services will listen to scientifically trained experts in the field rather than to smooth talking and wishful thinking from vendors of bogus lie detectors and similar gadgets. That is probably where we should invest our efforts. We must also take great care when we present our results so that the issue does not appear as a scientific controversy, which it is not. No qualified speech scientist believes in this nonsense so there is absolutely no controversy there, and it is very important that this becomes clear. We have included sufficient detail in this paper to provide the reader with useful arguments in the struggle against charlatanry. We hope that the effort will not turn out to be totally without effect.

Notes

1 http://www.diogenescompany.com/ This web reference, and all others listed in this article, were checked on 28 October 2007.
2 http://www.nemesysco.com/technology-lvavoiceanalysis.html
3 http://www.cvsa1.com/CVSA.htm
4 http://www.diogenescompany.com/vsaprogram.html
5 http://www.nemesysco.com/technology-lvavoiceanalysis.html
6 http://www.nemesysco.com/technology-lvavoiceanalysis.html
7 http://www.vsolutions.org/
8 http://www.love-detector.com/
9 The interview with Mr Liberman (in Swedish) appeared in Västerbottens-Kuriren on 17 December 2004. It is not available online.
10 http://www.nitv1.com/realcases.htm
11 http://news.bbc.co.uk/1/hi/uk/3227849.stm
12 http://www.postmagazine.co.uk/public/showPage html?validate=0&page=post_login2&url=%2Fpublic%2FshowPage html%3Fpage%3D346755 (Login required)
13 http://news.bbc.co.uk/1/hi/uk/6528425.stm

References

Bolger, C., Bojanic, S., Sheahan, N. F., Coakley, D. and Malone, J. F. (1999) Dominant frequency content of ocular microtremor from normal subjects. Vision Research 39: 1911–1915.
Damphousse, K. R., Pointon, L., Upchurch, D. and Moore, R. K. (2007) Assessing the Validity of Voice Stress Analysis Tools in a Jail Setting. Report submitted to the U.S. Department of Justice. http://www.ncjrs.gov/pdffiles1/nij/grants/219031.pdf
Faaborg-Andersen, K. (1957) Electromyographic Investigation of intrinsic laryngeal muscles in humans. Acta Physiologica Scandinavica 41, Supplement 140. (147 p.)
Haddad, D., Walter, S., Ratley, R. and Smith, M. (2002) Investigation and Evaluation of Voice Stress Analysis Technology, Final Report. National Institute of Justice, NCJRS 193832. http://www.ncjrs.gov/pdffiles1/nij/193832.pdf
Hirano, M. and Ohala, J. (1969) Use of hooked-wire electrodes for electromyography of the intrinsic laryngeal muscles. Journal of Speech and Hearing Research 12: 362–373.
Hirano, M., Vennard, W. and Ohala, J. (1970) Regulation of register, pitch and intensity of voice. An electromyographic investigation of intrinsic laryngeal muscles. Folia Phoniatrica 22: 1–20.
Hollien, H. and Harnsberger, J. D. (2006) Voice Stress Analyzer Instrumentation Evaluation. Final Report, CIFA Contract – FA 4814–04–0011. http://www.clas.ufl.edu/users/jharns/Research%20Projects/UF_Report_03_17_2006.pdf
Jones, E. E. and Sigall, H. (1971) The bogus pipeline: a new paradigm for measuring affect and attitude. Psychological Bulletin 76(5): 349–364.
Liberman, A. (2003) Apparatus and Methods for Detecting Emotions. US Patent 6638217 B1. http://www.freepatentsonline.com/6638217.html
Lippold, O. C., Redfearn, J. W. T. and Vučo, J. (1957) The rhythmical activity of groups of motor units in the voluntary contraction of muscle. The Journal of Physiology 137: 473–487.
Lippold, O. C. (1970) Oscillation in the stretch reflex arc and the origin of the rhythmical, 8–12 c/s component of physiological tremor. The Journal of Physiology 206: 359–382.
Lippold, O. C. (1971) Physiological tremor. Scientific American 224: 65–73.
Lynch, B. E. and Henry, D. R. (1979) A validity study of the psychological stress evaluator. Canadian Journal of Behavioural Science 11: 89–94.
McAuley, J. H. and Marsden, C. D. (2000) Physiological and pathological tremors and rhythmic central motor control. Brain 123: 1545–1567.
Ridelhuber, H. W. and Flood, P. (2002) Policy review: CVSA Is a valid law enforcement tool. Law Enforcement Executive Forum August 2002: 95–100. http://www.leeforum.com/
Shipp, T., Fishman, B. V., Morrissey, P. and McGlone, R. E. (1970) Method and control of laryngeal EMG electrode placement in man. Journal of the Acoustical Society of America 48: 429–430.
Shipp, T. and Izdebski, K. (1981) Current evidence for the existence of laryngeal macrotremor and microtremor. Journal of Forensic Sciences 26: 501–505.
Smith, A. and Denny, M. (1990) High-frequency oscillations as indicators of neural control mechanisms in human respiration, mastication, and speech. Journal of Neurophysiology 63: 745–758.
VanDercar, D. H., Greaner, J., Hibler, N. S., Spielberger, C. D. and Block, S. (1980) A description and analysis of the operation and validity of the psychological stress evaluator. Journal of Forensic Sciences 25: 174–188.

US007321855B2

(12) United States Patent
Humble

(10) Patent No.: US 7,321,855 B2
(45) Date of Patent: Jan. 22, 2008

(54) METHOD FOR QUANTIFYING PSYCHOLOGICAL STRESS LEVELS USING VOICE PATTERN SAMPLES

(76) Inventor: Charles Humble, 11400 Fortune Cir., West Palm Beach, FL (US) 33414

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 833 days.

(21) Appl. No.: 10/737,530

(22) Filed: Dec. 15, 2003

(65) Prior Publication Data: US 2005/0131692 A1, Jun. 16, 2005

(51) Int. Cl.: G10L 21/00 (2006.01); G10L 11/00 (2006.01)

(52) U.S. Cl.: 704/270; 704/273

(58) Field of Classification Search: 704/273. See application file for complete search history.

(56) References Cited

U.S. PATENT DOCUMENTS

3,971,034 A * 7/1976 Bell et al. ...... 346/33 R
6,427,137 B2 * 7/2002 Petrushin ...... 704/273
7,165,033 B1 * 1/2007 Liberman ...... 704/270
7,191,134 B2 * 3/2007 Nunally ...... 704/270

* cited by examiner

Primary Examiner: Samuel G Neway
(74) Attorney, Agent, or Firm: McHale & Slavin, P.A.

(57) ABSTRACT

A computer-implemented method of assigning a numeric score to a voice pattern sample of a human subject wherein the score is indicative of the psychological stress level of the human subject. A verbal utterance of a human subject is converted into electrical signals to provide a subject wave pattern. The pattern is quantified and compared with known voice pattern characteristics which exhibit a sequential progression in the degree of blocking in the pattern, wherein each of the known voice patterns is assigned a numerical value range. A numerical value obtained from iterative calculations is assigned to the subject wave pattern based on the comparison. The numerical value represents the degree of blocking present in the subject wave pattern which correlates to the amount of psychological stress exhibited by the human subject.

19 Claims, 11 Drawing Sheets

[Flowchart from the patent front page. Box labels recoverable from the scan, reading order uncertain: PUT ARRAY OF INTEGER DATA; CALCULATE THE RELEVANT PATTERN LOCATION; REMOVE BEGINNING AND ENDING ZEROS; REFINE THE RELEVANT PATTERN LOCATION; SUBJECT PATTERN TO NUANCE TESTING; CALCULATE PRELIMINARY PATTERN FREQUENCY, AMPLITUDE AND NUMBER OF CYCLES; CALCULATE DATA FROM REFINED RELEVANT PATTERN.]

TVSA3 : Voice Stress Analysis Freeware. 06/03/14 09:22

TVSA3 : Voice Stress Analysis Freeware.

NEW! Working on a freeware Windows version of this for all you truth seekers out there.

Check back soon.

This page is originally from the website of Paul B. Dennis, who maintains one of the best sites on the web. You may contact him [ Paul B. Dennis. ], or go to his site, the Anti-New World Order page.

TVSA3: Voice Stress Analysis Freeware.

This text is complete with links to referenced documents and web sites on the internet. Because it is intended for dissemination off of the internet as well as on, the convention of surrounding links with [ square brackets ] to denote them is used, with web addresses listed at the end. For convenience, both text and internet style "htm" copies of this document are included in TVSA3.zip. Load the htm into your browser for best viewing.

1. About Voice Stress Analysis
2. The Final Revolution
3. Download TVSA3.zip
4. TVSA3 Operating Instructions.
5. About The Bug in Version One.
6. Version Differences.

About Voice Stress Analysis

Voice Stress Analysis is a type of lie detector which measures stress in a person's voice. That makes it recordable, and means it can be used on tape recordings older than VSA technology itself is.

The use of Voice Stress Analysis (VSA) as a lie detector became popular in the late 1970s and 80s. Today, American employees are protected from VSA in the [ Employee Polygraph Protection Act ], and police recruits must submit to VSA verification of statements made on the employment application according to the [ California Highway Patrol ] and [ Palm Beach Police Department ] web sites.

On the web site of the [ Missouri State Fire Marshal ] this quote is found:

Several Division investigators are also trained in the operation of Computerized Voice Stress Analysis equipment. This equipment has proven to be invaluable during follow-up investigations.

These official web sites, and many more not listed here, argue that important people in law enforcement believe VSA works well. On the other side of the argument is the [ Department of Defense's Official 1996 Position Statement on VSA ], which states that it doesn't work at all, according to their studies. Hmmm...

http://whatreallyhappened.com/RANCHO/POLITICS/VSA/truthvsa.html#download

The California Penal Code 637.3 includes this:

(a) No person or entity in this State shall use any system which examines or records in any manner voice prints or voice stress patterns of another person to determine the truth or falsity of statements made by such person without his or her express written consent given in advance of the examination or recordation.

There are two other, smaller sections: one gives police the right to do this without anyone's permission, and the other sets a fine of $1,000 or actual damages.

This law makes it illegal for anyone in California, except for law enforcement officials and foreign diplomats (immunity), to use VSA devices on radio or TV programs. Try getting President Clinton's written consent!

Really! Try to get it!

Phone his telephone staff and ask for a blanket VSA waiver for all his past and future "public" statements. Ask why you can't have one. Ask for one from your local reps too.

In both principle and execution VSA is a simple technology. Researchers found that frequencies in the human voice in the 8 to 12 Hz range are sensitive to honesty. When a person is being honest the average sound in that range is generally below 10 Hz, but is usually above 10 Hz in dishonest situations.

All muscles in the body, including the vocal cords, vibrate in the 8 to 12 Hz range. This is considered a feedback loop, similar to a thermostat/heater that will maintain an average temperature by raising the temperature a little above the setting, switching off, and not coming back on until the temp is a little below it. Just as the temperature swings up and down over time, so too do the muscles tighten and loosen as they seek to maintain a constant tension. This is known to be caused by the production and release of a chemical, as explained in the Scientific American Article "Physiological Tremor" Vol. 224, No. 3, 1971. In moments of stress, like when you tell a lie that you dare not get caught at, the body prepares for fight or flight by increasing the readiness of its muscles to spring into action. Their vibration increases from the relaxed 8 to 9 Hz, to the stressful 11 to 12 Hz range.

Such vibrations are a measure of stress, levels of which change from moment to moment. Everybody knows that stress, also known as anxiety or being up-tight, can be brought on by a simple thought or memory; of a loved one's passing on, for example, or suddenly remembering some dangerous obstacle in the future. Some people have high average stress levels, and some have low, and averages change from day to day along with mood. What all people have in common is that their stress levels are constantly changing within their current range, changes which indicate the "perceived jeopardy" or "danger" of statements being made. A lie is often dangerous, humiliating, or injurious to get caught at, so lies tend to stand out on stress measurements.

TVSA3 is a simple program which takes digital audio files as input, and outputs new ones with a changing tone in the background indicating the changing stress levels. Higher tones mean higher stress. It has one control: a threshold setting which determines how high the voice stress frequency must be to trigger the background tone. The threshold setting is treated as a percentage of the stress range found in a given recording. If set to 90%, for example, the device will only make sounds when the stress level is in the top 10% of the recording overall. This automatically calibrates the device to a person's present stress range.
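The percentage threshold can be pictured with a few lines of Python. This is only a sketch of the idea as described here, not the TVSA3 source: it assumes the "stress range" means the span between the lowest and highest reading in one recording and that the setting is mapped onto that span linearly. The function names and the sample readings are invented for illustration.

```python
def threshold_from_percent(stress_hz, percent):
    """Map a percentage setting onto the stress range of one recording.

    Assumption: the range runs from the lowest to the highest reading,
    and the percentage is applied linearly across it.
    """
    lowest, highest = min(stress_hz), max(stress_hz)
    return lowest + (highest - lowest) * percent / 100.0


def tone_flags(stress_hz, percent):
    """Return True for every reading that would trigger the background tone."""
    cutoff = threshold_from_percent(stress_hz, percent)
    return [reading >= cutoff for reading in stress_hz]


readings = [8.0, 9.0, 10.0, 12.0]  # made-up per-segment readings in Hz
print(threshold_from_percent(readings, 75))
print(tone_flags(readings, 75))
```

Because the cutoff is computed from each recording's own lowest and highest reading, the same setting gives a different cutoff in Hz for every recording, which is the auto-calibration the text refers to.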

The software treats all people and recordings as equal, because of its auto-calibration feature. This creates a problem when two people are talking together. The person with the highest average stress level might appear to be making a lot of lies, while the low stress person is the only one really lying. Such recordings will need to be edited into separate recordings for each person, and then recombined after VSA processing. I'm planning a utility for this, because it is in interviews that most revelations will come about.

Knowing that a freeware VSA lie detector is being used by large numbers of voters will increase the average stress levels of politicians and officials who routinely lie through their teeth, but the tell-tale fluctuations will still be there. Likewise, even though a sociopath might have a low and narrow stress range (I'm guessing; it might not be narrow), the fluctuations themselves will indicate fear, and will read as clearly as anyone else's.

Potentially, knowing that a freeware VSA lie detector is being used by large numbers of voters will stop the bad guys from wanting the limelight, and encourage the good ones to come forward to fill the void. It may be the total, non-violent, libertarian revolution that much of the world is hoping for. It depends on how many people agree with it and spread the word that an alternative exists to perpetual tyranny.

Programmers interested in developing more complex VSA applications will find complete [ source code ] included in the zip file.

TVSA3 is based on the VSA device published in the April 1980 issue of Popular Electronics. The circuit was provided by a company in the business of selling them complete, and also selling parts kits to the PE readers. The parts list runs about $20 today.

If the "Vocal Truth Analyser" works, then TVSA3 works better. It has, in effect, a higher order filter that reduces frequency leakage from outside the 8 to 12 Hz range. In the PE circuit, a high resolution signal proportional to the average target frequency is developed, and then is spoiled with an inadequate three flickering lights for output: True, Neutral, and False. It also has no threshold adjustment for sensitivity.

The TVSA3 system comes with a simple method that anyone can use to prove that the software does what is advertised.

The Popular Electronics and Scientific American articles both confirm the 8 to 12 Hz target range, and can be found at the library.

Because of the jeopardy requirement, lie detectors can only be tested in two possible ways: By threat of physical discomfort for failing the test, electric shocks for example, or by offering a significant cash jackpot for fooling the detector. I don't have a willing subject (like a prison inmate) for the first option, or cash resources for the second. I'm told there is a third way, but I haven't researched it yet.

I did find more circumstantial evidence though. Compare the stress levels in Nixon's famous "I'm not a crook" speech to your own stress levels. You'll see a dramatic difference. Several recordings of different lengths exist on the web and can be found by searching for "I'm not a crook" and ".mp3" in AltaVista. You'd better get his permission first though...

Another interesting resource turned up in my surfing. [ This detective claims ] that the best selling VSA software [ (according to the DoD) ] cannot be purchased by private citizens, but only by "Law Enforcement." A few weeks ago I had the web address of that company but don't recall seeing such a prohibition. Shortly after that I lost my bookmarks and email archive in a catastrophic computer crash. Now I can't find the site again. It's the company which produces the software that the DoD tested: CVSA. I'd like to post that link here because it has links to many police force web sites that list C-omputerized- VSA as equipment. If you run into it let me know.

This [ commercial VSA website ] has a lengthy reading list about VSA and lie detection, including this intriguing title:

Detecting Deception: The Promise and the Reality of Voice Stress Analysis. Frank Horvath, PhD. Journal of Forensic Sciences, JFSCA, Vol 27, No. 2, 1982. pp. 340-351.

It's only at our largest university and hospital, and I haven't been out to see it yet.

Obviously, VSA can be a valuable tool to aid in forming "opinions" about public speakers and the things they say. When do they feel stress? Used properly, it has the potential to clean up politics overnight. Contrary to the cynical modern views of many, there are still a few honest politicians to choose from; it's just hard to know which ones they are. Organizations which exist to oppose government and institutional corruption can use it to root out agent provocateurs, stool pigeons, and dishonest "leaders." For these purposes, and as a general attack on serious crime, I created TVSA3 as freeware, and encourage software developers to create more powerful shareware versions.

TVSA3 does the basic job well, and may be improved more by this author if any more bugs show up. I don't want to spend my time as a programmer, though. Since I don't want to spend a thousand or more hours putting the desirable bells and whistles into it, I'd rather leave the free version crude, and thereby allow dedicated programmers to compete for a more viable market. It can be speeded up several or many times over because I designed a solid amount of overkill into the number crunching, and because no machine language optimization was included. A shareware version is made all the more attractive to software developers by offering this program in its most basic form: It's a simple DOS program.

If you have a web site in favor of promoting a political solution based on verifiable honesty, please offer a copy of TVSA3 for download there. If you do, you should visit [ my web site ] to get linked on a prestigious list of honesty promoting web sites. You can use this page if you want. Anybody interested in creating web literature on the uses of VSA can make their documents easily locatable to seekers by using the term "TVSA" or "TruthVSA" in them, and indexing them in AltaVista.

Sadly, I have strong reasons to believe that most of the organizations in the world that claim to be helping us find the truth are not at all what they seem to be. Most will firmly refuse to get on the bandwagon until they see they have no choice. Therefore, do not believe that the software is worthless if told so by "wiser" heads. Check it out yourself.

The public had this same chance in the 1970s, when VSA made front page news. That chance was taken away by establishment science a few days later when they proved that VSA was just a hoax, again on the front page. This time they can't take away the chance, because VSA has since proved itself beyond doubt. It's now up to the people to choose, and to do so by helping other people become aware, and especially by making the program available on their web sites.

In historical parallel, Cold Fusion received the same treatment. In historical parallel, Cold Fusion is now commercially available, and it puts out not just tiny amounts of energy, but huge amounts. If you check out these two links, you will find that it's very true, patented, distributed, and researched at universities. This link will take you to the [ Clean Energy Technologies Inc. ] web site, and [ this one ] will take you to over 30 CETI resources including reviews, diagrams, micrographs and pictures. Other hard-science energy stuff too.

VSA: The Final Revolution


A "final" revolution can be defined as one which establishes a commonly acceptable government, and unbreakable safeguards to keep it that way. The American Constitution was an attempt at a final revolution, but those safeguards are now being broken with ever increasing frequency. Tyranny is entrenched and on the rise.

Today, governments that are commonly acceptable are rare. Instead of acting in the public's interest, they routinely act in their own interest. As time goes on, we see our governments sinking ever deeper into bureaucracy and corruption, and feel helpless to correct the trend.

Voice Stress Analysis is the core of THE Final Revolution. The principle is simple, and the method is simpler.

In the first place, we must recognize that use of lie detector devices is illegal without permission from the persons to be tested. Those laws are good ones, because they provide serious punishment for abuse of such devices. Such abuse could be to publicly humiliate a love-struck and shy teenager, for example, and could result in tragic suicides in some cases.

Forget about illegal use of lie detection, and focus on what's legal and justifiable. It's legal to use them with permission, and we can ask for permission from public officials. Most will refuse, and provide us with many reasons why. These reasons will inevitably boil down to "I don't think politics should be a competition of stress levels." Others, knowing that they can win a stress contest, because they are uncorrupted, will grant us permission, and will advance their political careers by doing so.

You can see that we have just divided the righteous from the wolves with this simple test. It is not even who has the most acceptable stress patterns at that point, but simply who has, and who hasn't, the confidence for competition in that game. The righteous will be proud to state their loyalty to the people, knowing that many will be testing those statements, but the corrupt will flinch at the mere suggestion that they follow suit.

Pride, honesty, and loyalty are the most powerful political platform items. The first candidate who gives blanket permission to test his or her public statements as part of an election platform will immediately shame the opposition into following suit, or will probably win office if they refuse. That's a simple political reality, and quite a gimmick to make people remember the candidates who use it. They can offer the free VSA software on their websites, or even give it away door to door on disks.

This is all legal. It's simply a matter of our own choice to prefer honest leaders. The Final Revolution is already underway, but it needs help to happen quickly. All the mainstream news, and most of the "alternative" news services, will completely ignore the VSA revolution. They are almost all in league with organized corruption. The revolution will grow only slowly without the aid of supporters to help bypass the enemy owned mass media.

The Final Revolution is a do-it-yourself project: There is no revolutionary organization. The only job that needs to be done is to distribute the software and ideology widely on the internet, and word of mouth will then spread it to the uncomputerized part of the public. The best way to aid the revolution is to add a catchy tag line about it to your email signature file, and then participate in online discussions about any subject that strikes your interest. Hundreds or thousands of people will notice the little ad.

I use this as my sig file:

Download a free Voice Stress Analysis lie detector. The enemy hates it when you do that! >>>>>> http://mypage.direct.ca/p/pbdennis/ <<<<<<


Of high importance is that private webmasters become aware, so that some fraction will help distribute the software. Email to those millions of people would be very helpful. Also, a single email to an underdog political candidate has the potential to inform a whole community or district about the revolution.

That's the whole revolution. In theory, the corrupt political leaders will be dethroned as soon as the public sees a political division over the subject of VSA waivers, and the newly elected leaders will be more loyal to the people. What's more, it has the chance to happen quickly if even 1% of the people recognize its importance enough to notify 100 more people, and so on.

[ Download TVSA3.ZIP now: 70K ]

Created in August of 1997 by [ Paul B. Dennis. ]

TVSA3 Operating Instructions

TVSA3 can be run from the DOS command line, from a batch file, or from a Windows icon. It takes three command-line parameters: input files, output path, and a stress threshold value. It does not take any runtime input. It's like pkzip: you load it and wait till it's done. Later you load the output sound file into a sound player to "judge" the new recording.

Here's an example of a command line:

TVSA3 c:\tvsa\input\*.mp3 c:\tvsa\output\ 75%

This will only trigger a tone when the stress reading is in the top 25% of the recording's range. Valid values are from 1% to 100%.

If your computer is less than a 486 and has no FPU math chip, you can use an optional exe file, tvsa2fpu.exe, that will operate a few times faster than TVSA3.exe in that case:

TVSA3fpu c:\tvsa\input\*.wav c:\tvsa\output\ 75%

You may also omit the % sign and use an absolute value between 8 and 12 Hz for the threshold. Absolute values are accepted in 1/10ths of a Hz, but the decimal point is omitted: 8.5 Hz would be entered as 85, for example. Valid values therefore range from 80 to 120. This absolute style of input would be used by tinkerers, perhaps, who would like to do absolute comparisons between speakers.

TVSA3 c:\tvsa\input\*.wav c:\tvsa\output\ 85
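The two threshold styles can be summarized in a few lines of Python. This is only a sketch of the rules stated above (1% to 100%, or 80 to 120 meaning 8.0 to 12.0 Hz); the function name `parse_threshold` and its return format are hypothetical and say nothing about how TVSA3 itself parses its command line.

```python
def parse_threshold(arg):
    """Interpret a threshold argument using the rules given in the text.

    '75%' -> ('percent', 75.0)   valid range 1% to 100%
    '85'  -> ('hz', 8.5)         tenths of a Hz, valid range 80 to 120
    """
    text = arg.strip()
    if text.endswith("%"):
        value = int(text[:-1])
        if not 1 <= value <= 100:
            raise ValueError("percentage must be between 1 and 100")
        return ("percent", float(value))
    value = int(text)
    if not 80 <= value <= 120:
        raise ValueError("absolute value must be between 80 and 120")
    return ("hz", value / 10.0)
```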

Besides a significant increase in accuracy and inclusion of a general VSA testing system, the major change from version 1 and 2 to version 3 is that the arbitrary scale of 1 to 1000 has been replaced with the direct Hz scale.

A log file, tvsa-log.txt, is created and updated in the same directory as the executable file. It contains a basic breakdown of the VSA data (now in Hz) in files processed, and will give you a good idea about what absolute values would be useful in tinkering. This info also appears on the screen after each input file is processed.

You can find Nixon's famous quote on the web in several places to compare to other recordings. Frankly, though, I can add and subtract frequencies from audio files quite easily, and so can anybody else. In July of 1997, nobody expected the public to suddenly acquire VSA technology, so there's no chance that any files on the web were manipulated at that time. What for? But this will not be certain in the future. Someday, altered sound files may become common.

Is a highly stressed public statement a lie? You won't know for sure, but with experience you will begin to trust that the device is telling you something important, and to partially base your "opinions" on it. You may notice it making the noise at unexpected places, and use that as a starting point for research into the indicated subject. Instead of a lie detector, the device is better thought of as an opinion generating aid and research tool when used in this fashion. It can also be used as a standard lie detector, but you'll have to read a book or two about that first: Such tests are not as simple as they look.

To use TVSA3 you need a sound card to make input files, and to listen to output files. Although a 16 bit card is preferred, 8 bit cards can be used with less accuracy. Actually, I see no obvious difference between the results of 8 bit and 16 bit, but haven't checked it out enough to be sure. You'll also want a fast computer with a math coprocessor (FPU). Everything from 486s up has one as standard, but if you have more time than money for an FPU upgrade you can run the TVSA2FPU.EXE version on any IBM computer. Actually, TVSA3 will run on any IBM computer too. The difference is that, although both programs are otherwise identical, TVSA3.EXE uses a very slow FPU emulator if a real FPU isn't found, whereas TVSA2FPU.EXE uses regular software math routines at all times, which are at least 3 times faster than the emulator.

Input files must be mono 16 bit and sampled at 11025 Hz, the most standard available rate. 8 bit files must first be converted to 16 bits before they can be analysed. That can be accomplished with a windows shareware copy of [ Cool Edit ], which can record as well, or with [ Convert2.zip ] which contains a powerful DOS freeware audio translation program that will convert almost any format to any other. There's another convert2.zip on the web that is about archiving bbs files or something, so don't get the wrong one.
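For readers with a modern scripting language at hand, a quick way to check whether a wave file meets these requirements is sketched below in Python, using its standard `wave` module. The function name `check_input_format` and the wording of the messages are made up for this example; the three conditions are the ones stated above.

```python
import wave


def check_input_format(path):
    """Return a list of reasons why a WAV file does not match the
    stated input requirements (mono, 16 bit, 11025 Hz).
    An empty list means the file looks suitable."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1:
            problems.append("not mono")
        if wav.getsampwidth() != 2:      # sample width is given in bytes
            problems.append("not 16 bit")
        if wav.getframerate() != 11025:
            problems.append("sample rate is not 11025 Hz")
    return problems
```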

Although TVSA3 is designed to expect signed, raw, Intel mode audio files, it will produce identical results with single block Windows ".wav" files; mono only. Few users even know how to record multiblock wav files so that probably isn't a problem. All the wave files I tested worked, but there is no guarantee that every one will if you didn't make it yourself. If results come out strange, use the raw method.

The output files have the same name and type as the input files, but with the extension ".vsa" instead. Depending on your sound player, you may need to change the name to *.wav if wave files are used. If you use raw data files, then you'll have to tell your sound software what kind of file it is before you can load it: 16 bit, mono, 11025 Hz, signed, Intel mode, pcm or raw. If you use wav files you don't have that bother.
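If your sound player cannot be told the format of a raw file, one workaround is to wrap the raw data in a wave header. The Python sketch below does that with the standard `wave` module, using the raw format listed above (16 bit, mono, 11025 Hz, signed, Intel byte order). The function name `raw_to_wav` is hypothetical, and the sketch assumes it runs on an ordinary little-endian PC.

```python
import wave


def raw_to_wav(raw_path, wav_path, rate=11025):
    """Wrap headerless 16 bit signed mono PCM data in a WAV header."""
    with open(raw_path, "rb") as src:
        pcm = src.read()
    if len(pcm) % 2:
        raise ValueError("odd byte count: not 16 bit samples")
    with wave.open(wav_path, "wb") as out:
        out.setnchannels(1)     # mono
        out.setsampwidth(2)     # 2 bytes = 16 bit
        out.setframerate(rate)  # 11025 Hz unless told otherwise
        # Assumption: little-endian host, so the Intel mode bytes
        # are written unchanged.
        out.writeframes(pcm)
```

For example, `raw_to_wav("speech.vsa", "speech.wav")` would produce a file that ordinary players can open directly (the file names here are placeholders).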

Using the software mentioned above, 16 bit output files can be reduced to 8 bit for storage or trading. The extra resolution will no longer be important.
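For those who prefer to script the reduction, the arithmetic is simple: keep the top 8 bits of each 16 bit sample and shift the result into the unsigned range that 8 bit wave files use. The Python sketch below shows only that step on raw sample data; the function name `pcm16_to_pcm8` is made up for this example and it is not a substitute for the conversion programs mentioned above.

```python
import struct


def pcm16_to_pcm8(pcm16):
    """Reduce signed 16 bit little-endian samples to unsigned 8 bit samples."""
    count = len(pcm16) // 2
    samples = struct.unpack("<%dh" % count, pcm16[:count * 2])
    # s >> 8 keeps the top 8 bits (-128..127); adding 128 gives 0..255,
    # which is how 8 bit WAV data is stored.
    return bytes((s >> 8) + 128 for s in samples)
```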

Don't alter sampling rates on files. If it's not 11025, or that number multiplied by a whole factor (like 11025*2 = 22050), it probably won't work even though it sounds ok.

Sound files are huge: a 16 bit 11025 Hz sound file grows at about 1.3 megabytes per minute. A large, fast hard drive is recommended. Input files are not deleted, so be sure there is as much free space on the disk as the files you process.
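The 1.3 megabyte figure follows directly from the format: 11025 samples per second, 2 bytes per sample, 60 seconds per minute. The Python sketch below works that out and estimates the free space to set aside, assuming each output file is about as large as its input file (which is how the advice above reads). The function names are hypothetical.

```python
def bytes_per_minute(sample_rate_hz=11025, bytes_per_sample=2, channels=1):
    """Size of one minute of uncompressed PCM audio, in bytes."""
    return sample_rate_hz * bytes_per_sample * channels * 60


def free_space_needed(input_sizes_bytes):
    """Estimate the disk space needed for the output files.

    Assumption: each output file is about as large as its input file.
    """
    return sum(input_sizes_bytes)
```

One minute of 16 bit mono audio at 11025 Hz comes to 1,323,000 bytes, so a ten minute recording needs roughly 13 megabytes for the input and about the same again for the output.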

Telephones may or may not carry the VSA frequencies, depending on the telephone equipment in your area, but radio and TV always do unless they're filtered out at the source. Same goes for Public Address systems.

Newscasters usually believe all the news regardless of reality, so they are not worthy candidates for analysis. Witnesses and opinion givers, on the other hand, are prime targets.

This document is under slow improvement. It doesn't get much priority because time is always pressing, but it's as complete as it needs to be, if not very concise or prosaic. If you have any problems, questions, or suggestions: [email protected]

About The Bug in Version One

Users who attempted to use version one, TruthVSA.zip, may have been stumped by the bug. It seems that it refused to run when no files existed in the output directory. I found that out when I cleared the directory to test version two. A couple of people reported the problem but others reported running it ok too. I was surprised that there turned out to be a real problem, since it worked 100% for myself and others. Many people would have pointed it at an existing junk directory and have had no problems, but others would have created a new empty directory, failed to run it a few times, and concluded it was a hoax. Oops. As it turns out, there's no simple way to test for a directory in Turbo Pascal. I had to copy a subroutine from the swag files to fix it.

There was also a math bug that caused a crash on some files. It was a low probability event and slipped by the version 1 test files.

Version Differences

TruthVSA changed a bit as better ideas for input and accuracy were worked out. Each version uses a different command line format, making them incompatible and therefore requiring new version numbers even though the changes were slight.

Version 1 used an arbitrary scale of 1 to 1000 for threshold input and log data output, and only allowed an absolute threshold setting. The scale was not very linear but worked ok.

Version 2 used the same scale but added a more powerful percentage option to the threshold setting. Some technical buffing improved its output a bit.

Version 3 uses a linear scale in Hz (cycles per second) for input and log file output, rather than the earlier arbitrary one. It's a bit more accurate now, as near perfect as it's likely to get, but about 15% slower than the first two versions. Version 3 added testvsa.exe as technical proof of TruthVSA's real and dependable operation.

Version 4: As far as I know, there won't be a version 4. Version 3 does everything I want a VSA lie detector to do, and has passed a much more rigorous series of tests than version 1. Hopefully, a professional programmer will produce a Windows VSA device with lots of colorful buttons to push and graphs to look at. I'll link to it from the [ front page ] of this web site if or when that happens.

Employee Polygraph Protection Act.

http://www.c2corp.com/0051g.htm

California Highway Patrol web site.

http://www.chp.ca.gov/recruiting/selection.html

Palm Beach Police Department.

http://legal.firn.edu/muni/palmbch/employme.html

The Missouri State Fire Marshal's web site.

http://www.dps.state.mo.us/dps/msfs/dfsinvst.htm

The DoD's Official 1996 Position Statement on VSA.

http://www.polygraph.org:83/voice.htm

TVSA3 Source Code.

http://mypage.direct.ca/p/pbdennis/truthvsasource.txt

This detective claims...

http://users.ctinet.net/reside/

A commercial VSA web site's reading list about VSA and lie detection.

http://www.diogenesgroup.com/

Download TVSA3.zip

http://mypage.direct.ca/p/pbdennis/truthvsa.html

Cool Edit

http://www2.nitco.com/shareware/audio/win3x.html

Convert2.zip

This is on a military web site. I wonder how long they'll keep the file available... http://www.chips.navy.mil/oasys/graphics/files.html

Paul B. Dennis

Paul B. Dennis Home Page : http://mypage.direct.ca/p/pbdennis/ Paul B. Dennis email address : [email protected]

