Monday, October 28, 2019

Flawed Assumptions Pave a Path to Disaster

When MCAS (Maneuvering Characteristics Augmentation System) was implicated after Lion Air JT610 plunged into the sea, tragically taking 189 lives, the spotlights converged on the malfunction of a single Angle of Attack (AoA) vane. My first thoughts were that Boeing had somehow overlooked this scenario or viewed it as inconsequential, based on blind faith that, no matter what, the pilot would remain vigilant, taking correct and timely action as the safety backstop. I could not wrap my head around how repeated applications of MCAS did not amount to unlimited authority in malfunction, which would create a HAZARDOUS hazard.

Boeing had declared MCAS malfunction a MAJOR hazard.

JT610 Final Report
Why wasn't an MCAS malfunction treated as HAZARDOUS, which would have mandated a dual-channel, fail-safe design?

The answer lies in a number of buckets, which overflow beyond just MCAS:
  • a desire to justify design rather than direct safety
  • over-use of a convenient test condition restriction
  • blind reliance on unproven pilot response
  • misunderstanding the ramifications from removing an under-appreciated safety interlock
  • ignoring escalation from the combination of persistent hazards
  • incorrectly applying a convenient probabilistic factor to dodge the obvious conclusion
  • overlooking the ramifications from extending Speed Trim to provide Stall Identification
This discussion is based on information released in the JATR Report and the JT610 Final Report. I have spliced together a commentary that includes in-line references to these reports. I know this makes for a long read, but the references give much credence to the points and a foundation to draw from. I may have made mistakes unwittingly or through misunderstanding. I will correct them when I can, and I will keep a change log here. I hope you will find this discussion interesting.
31 Oct - added figures, DP Davies info to Ineligible Discount (loss of function Vs malfunction)

Boeing considered all the hazards that were encountered on Lion Air JT610. There is nothing improper listed that is specific to the MCAS system safety assessment: no hiding of information, no lack of close coordination.

Issues were dispositioned using faulty, but openly-declared, stove-pipe reasoning.

Ultimately a number of named individuals, in this case Engineering Unit-Members (EU-M), have to sign off attesting to the safety of the system. Their analysis is clearly documented with method and assumption, there is no hiding. An EU-M signs off knowing this is their sworn duty to be complete, accurate, honest and forthcoming.

Every design has weaknesses. The majority of the safety analysis is focused on the corner conditions. Judgement comes into play, which is why we have to rely on experts to be the EU-M. Nowhere is the need for expertise more plainly evident than with assumptions and methods: someone has to attest that these are reasonable and acceptable. The preferred strategy is to rely on data to conclude pass/fail, rather than judgement. Simulations are used to test judgements. But simulation procedures can be stacked towards justifying the assumptions.

I have seen countless design changes go down in flames once we handed them over to the pilots. Sometimes they would break them before we even got airborne. These were changes we had tested and were confident would work. It is easy to rely on a couple of pilots to take the onus for airplane-wide thinking in a haphazard manner.

If you go into a test saying that if we follow these steps it will pass, you will follow those steps and it will pass. Then you can use that data to justify your assumptions. This is where group-think is evident, and where EU-M must act as individuals, to insist on evidence where there is doubt. There should have been a lot more doubt.

The process should have recognized that removing the aft column cutout switch created the HAZARDOUS mandate. 

Repetitive MCAS malfunction should have been declared HAZARDOUS.

The combination of air data, stall warning and MCAS persistent malfunction should have been declared CATASTROPHIC.

HAZARDOUS would have gotten the project to exercise the dual processor (command and monitor) FCC (Flight Control Computer) that was already built in.

Boeing was fully aware of the potential for an AoA sensor to malfunction. Boeing was fully aware that a malfunction of a single AoA sensor could cause MCAS to fire repeatedly. These were accounted for, but discounted ultimately, by flawed assumptions and precedent.

The first fatal assumption relates to the level of mistrim encountered if MCAS malfunctions. Boeing assumed a maximum travel of 0.6 units in all cases. Boeing assumed that this level of mistrim was manageable by column response alone. 

Boeing assumed that, after responding to an MCAS runaway malfunction, the pilot re-trims the airplane to neutralize column forces. Thus, a subsequent MCAS malfunction would be no greater hazard than the first.

From the JATR Report:
Observation O6.9-H: Boeing concluded that multiple erroneous MCAS activations were not worse than a single erroneous activation, based on the assumption that the crew would return the aircraft to a trimmed state (consistent with AC 25-7C guidance) following each activation.
The level of mistrim depends on how fast the stabilizer is moving and how long it moves before it is stopped. Boeing assumed that a runaway would always be stopped with plenty of elevator authority remaining. This assumption has prevailed with great success, starting with the 707: 60 years of service across 707, 727, 737, 747, 757, 767.

The removal of the aft column cutout switch undermined Boeing's assumptions that limit the mistrim to 0.6 units, and that sufficient elevator authority would always be preserved.

Stark evidence reveals that the pilot will not rapidly stop the stab runaway from MCAS malfunction, even in the normal flight envelope, because of the override of the aft column cutout.

From the JATR Report:
Finding F6.12-B: Basic assumptions about trained and qualified flight crew response to malfunctions used in the design and certification of the B737-8 MAX did not appear to hold in the two accident cases, based on preliminary information.
Boeing assumed that the 737 MAX pilot would not only pull the column back but also, within three seconds, that the pilot would use manual electric trim to oppose MCAS malfunction. The manual electric trim command will cause MCAS to stop its command. However, MCAS will "wake up" again five seconds later and start the process over again.

For every Boeing airplane until the 737 MAX, uncommanded stabilizer runaway is stopped by column motion only. But for the MAX, with MCAS malfunction, the pilot has to also apply manual electric trim within the same three second window.

The MCAS malfunction hazard should have been elevated to HAZARDOUS because the aft column cutout was removed, and worsened by tripling the stab trim rate, which made severe mistrim likely, especially with persistent malfunction.
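To see how much weight the re-trim assumption carries, here is a minimal sketch in Python. It is not Boeing's analysis. The function name, the retrim_fraction parameter and the four-second pilot delay are my own illustrative assumptions; stabilizer travel limits and the five-second wait between activations are ignored; only the 0.27 deg/sec rate and the 2.5 unit authority are figures quoted in this discussion.

```python
MCAS_RATE_DEG_PER_S = 0.27  # MCAS trim rate quoted in this discussion
MCAS_AUTHORITY_DEG = 2.5    # low-Mach MCAS authority quoted in this discussion


def peak_mistrim_per_cycle(cycles, pilot_delay_s, retrim_fraction):
    """Peak nose-down mistrim reached in each repeated MCAS activation.

    retrim_fraction is a hypothetical knob: 1.0 means the pilot trims out
    everything MCAS just added (Boeing's assumption), 0.5 means only half.
    """
    mistrim = 0.0
    peaks = []
    for _ in range(cycles):
        # MCAS runs until the pilot interrupts it or it reaches its authority.
        increment = min(MCAS_RATE_DEG_PER_S * pilot_delay_s, MCAS_AUTHORITY_DEG)
        mistrim += increment
        peaks.append(round(mistrim, 3))
        # The pilot trims back some fraction of what MCAS just added.
        mistrim -= retrim_fraction * increment
    return peaks


full_retrim = peak_mistrim_per_cycle(4, pilot_delay_s=4.0, retrim_fraction=1.0)
half_retrim = peak_mistrim_per_cycle(4, pilot_delay_s=4.0, retrim_fraction=0.5)
print("full re-trim peaks:", full_retrim)
print("half re-trim peaks:", half_retrim)
```

With a complete re-trim every cycle peaks at the same value, which is the case Boeing analyzed. With anything less, each activation starts from where the last one left off and the peak grows with every cycle.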

The second fatal assumption reveals a process breakdown in how failure rate is accounted for, specifically by discounting it in proportion to how often the airplane operates outside of the normal flight envelope.

Boeing applies a factor based on the likelihood that the airplane will leave the normal flight envelope and enter the operational flight envelope as 1 in 1,000 (1E-3). Boeing uses this probability to discount the required MCAS failure rates that are unique to the operational flight envelope. 

If MCAS malfunction were HAZARDOUS, it would have to rely on two sensors and two hardware paths. 

Boeing dodged the dual mandate (fail-safe or fail-passive) by discounting the required malfunction rate applied to the HAZARDOUS hazard of 1E-7 by 1E-3 (the probability of being in the operational flight envelope). With this logic, they calculated the maximum MCAS malfunction rate in the operational flight envelope to be 1E-4 (1 in 10,000) malfunctions per hour. Where else has Boeing applied this discount?

Boeing concluded that it is acceptable for an MCAS malfunction in the operational flight envelope, which creates a HAZARDOUS hazard, to be ten times more likely than an MCAS malfunction in the normal flight envelope, which creates a MAJOR hazard. Boeing should not have discounted the failure rate assigned to the operational flight envelope; without the discount, a HAZARDOUS hazard mandate would have been levied against MCAS.
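The arithmetic behind the discount is short enough to write out. The sketch below is only my illustration in Python: the 1E-7 and 1E-3 figures come from the discussion above, while the 1E-5 MAJOR threshold is an assumption implied by the "ten times" comparison, and the names are mine.

```python
HAZARDOUS_MAX_RATE = 1e-7      # required rate for a HAZARDOUS condition, per hour
MAJOR_MAX_RATE = 1e-5          # assumed MAJOR threshold, per hour (implied by the text)
P_OPERATIONAL_ENVELOPE = 1e-3  # likelihood of being in the operational flight envelope


def discounted_allowable_rate(required_rate, exposure_probability):
    """Allowable system failure rate once the exposure probability is credited."""
    return required_rate / exposure_probability


allowed_in_ofe = discounted_allowable_rate(HAZARDOUS_MAX_RATE, P_OPERATIONAL_ENVELOPE)
ratio_to_major = allowed_in_ofe / MAJOR_MAX_RATE

print(f"Allowed MCAS malfunction rate in the OFE: {allowed_in_ofe:.0e} per hour")
print(f"Relative to the MAJOR threshold: {ratio_to_major:.0f}x")
```

Crediting the 1E-3 exposure relaxes the system requirement by three orders of magnitude, which is how a condition with HAZARDOUS consequences ends up with a looser hardware budget than a MAJOR one.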

From the JATR Report:
Observation O3.2-A: The JATR team observed that the SSA takes credit for the probability that the aircraft will be flying in certain portions of the flight envelope, as documented in AC 25-7C. 
A probability of 1E-3 for the aircraft in the operational flight envelope (OFE) was used in combination with the probability of the system failure to achieve the 1E-7 minimum probability required for the “hazardous” MCAS failure condition. 
Use of AC 25-7C is not a standard industry approach for 25.1309 compliance. 
The JATR team’s view of the intent of the probability of 1E-3 for the OFE in the HQRM is to select flight test cases for handling qualities evaluation, not to support the quantitative aspects of 25.1309 or 25.672(c) compliance.
Speed Trim, with the aft column cutout intact, is still a MAJOR hazard. 

The combination of a MAJOR hazard from airspeed and altitude disagree with a MAJOR hazard from false stall warning (stick shaker, minspeed/PLI anomaly, feel force increase) is probably a HAZARDOUS hazard all by itself (without MCAS malfunction on top). Adding MCAS malfunction to the mix, in the form it was delivered, has proven to be CATASTROPHIC.

From the JATR Report:
Finding F6.9-A: Evaluating worst-case scenario for the AOA failures was not adequate to identify the hazardous effects (including complete flight deck effects) of the single AOA failures.
A single AoA vane malfunction (actual rate is around 1E-7, declared rate is 1E-5 malfunctions per hour) should not create HAZARDOUS or CATASTROPHIC hazards. The answer is probably a third AoA source, possibly synthetic, to be used in conjunction with enhanced built-in test, to isolate a vane malfunction. This issue is generic across many Boeing airplanes, and should be treated appropriately to the risk it presents, with particular focus on new or revised designs. There have been 30 occurrences of AoA malfunction (erroneous high angle reading) over an 18-year period, across many Boeing models, without any significant incident, until the MAX MCAS malfunction was added.

Finally, starting with the 737NG, Speed Trim System was modified to become a stall identification function. There are safety features associated with stall identification that do not appear to have been satisfied, notably the need to not trigger based on a single sensor (in this case, it is airspeed). Implementations of Speed Trim System on 757 and 747-400 are both fail-safe. It seems Speed Trim System should have been Fail-Safe from the 737-700.

Following are three detailed discussions:
  • Tail Wagging the Dog
  • Level of Mistrim
  • Ineligible Discount

Tail Wagging the Dog

The safety process is supposed to work from setting requirements top-down and verifying design bottom-up. There seemingly were no enhancements to safety from malfunction, only functional necessities and flawed assumptions.

From the JATR Report:
Finding F6.6-A: There is a perception that the FHA reports are not used to drive the design; rather, they are used to document the design as already defined. 
The STS and flight control computer (FCC) FHAs were updated reports from the B737 NG, and in the JATR team’s assessment, they did not appear to be used as tools to identify new hazards related to MCAS and drive design mitigations. 
As an example, in the hierarchy of safety solutions, mitigation by design should be prioritized over warnings and training/procedures. 
By documenting the as-is configuration, Boeing concluded that pilot training and procedures were sufficient to ensure safety.
Opportunity for Fail-Safe design was squandered because of a failure to use equipment already available to its full capability. The Fail-Safe mandate should have been obvious from the beginning, once the aft column cutout override was added, if for no other reason than to follow the other Boeing model implementations (which are all Fail-Safe).

From the JATR Report:
Recommendation R6.1: The FAA should ensure applicants improve adherence to fail-safe design concept principles when designing or modifying systems. 
The FAA should encourage applicants not to design only for compliance, but also to follow basic principles to design for safety when developing or changing system functions. 
This should include elimination of hazards and use of design features, warnings, and procedures.
  • Observation O6.1-A: Proper flight crew action was considered an adequate mitigation to risks such as erroneous activation of MCAS.
  • Finding F6.1-A: The JATR team identified that the design process was not sufficient to identify all the potential MCAS hazards. As part of the single-channel speed trim system, the MCAS function did not include fault tolerant features, such as sensors voting or limits of authority, to limit failure effects consistent with the hazard classification.
  • Finding F6.1-B: The use of pilot action as a primary mitigation means for MCAS hazards, before considering eliminating such hazards or providing design features or warnings to mitigate them, is not in accordance with Boeing’s process instructions for safe design in the conception of MCAS for the B737 MAX.
  • Finding F6.1-C: The JATR team found that there was a missed opportunity to further improve the system design through the use of available fail-safe design principles and techniques presented in AC 25.1309-1A and in EASA AMC 25.1309 in the MCAS design.
The safety analysis was thorough and complete, but fragmented. In general, the analysis was limited to areas that changed, reusing the prior certification basis to complete the process. The failure came from bad assumptions, driven by a desire to justify a design rather than to establish a safety benchmark that challenges the design.

From the JATR Report:
Finding F6.5-A: An integrated SSA to investigate the MCAS as a complete function was not performed. 
The safety analyses were fragmented among several documents, and parts of the SSA from the B737 NG were reused in the B737 MAX without sufficient evaluation. 
The FCC is capable of high integrity fail-safe design features. Each FCC is a dual processor (command and monitor processor) that can be used together to overcome any single point failure, including AoA sensor disagreement. 
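As a rough picture of what using the command and monitor processors together could look like, here is a small Python sketch. It is my own illustration, not the FCC software: the function names, the activation angle and the disagreement tolerance are all hypothetical values picked for the example.

```python
def lane_calls_for_mcas(aoa_deg, activation_aoa_deg):
    """One processor's independent decision from its own AoA source."""
    return aoa_deg >= activation_aoa_deg


def mcas_output_enabled(aoa_left_deg, aoa_right_deg,
                        activation_aoa_deg=10.0,  # hypothetical trigger angle
                        max_disagree_deg=5.0):    # hypothetical disagree tolerance
    """Enable the command only when both lanes agree and the sensors agree."""
    command_lane = lane_calls_for_mcas(aoa_left_deg, activation_aoa_deg)
    monitor_lane = lane_calls_for_mcas(aoa_right_deg, activation_aoa_deg)
    sensors_agree = abs(aoa_left_deg - aoa_right_deg) <= max_disagree_deg
    return command_lane and monitor_lane and sensors_agree


print(mcas_output_enabled(15.0, 14.0))  # both vanes high and in agreement
print(mcas_output_enabled(74.5, 5.0))   # one vane failed high
print(mcas_output_enabled(3.0, 4.0))    # normal flight, no activation
```

The point of the arrangement is that a single vane failing high cannot drive the stabilizer on its own, because the monitor lane never agrees with it.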

The 737 Speed Trim System had been in service without significant issue for over thirty years. 

It appears that the motivation was to shoe-horn MCAS into Speed Trim System. That may have been sensible except for the removal of the aft column cutout feature, and the single-sensor malfunction leading to persistent and overwhelming MCAS malfunction combined with other MAJOR hazards. 

The new risks were "written off" by clever analysis that
  • failed to properly account for expected pilot action
  • ignored workload increase from simultaneous failures in multiple systems
  • gave short shrift to repetitive malfunction without workload increase
  • discounted an obvious mandate for fail-safe design
  • did not reveal the hazards from excessive mistrim 
  • without any basis, assumed pilot would use manual electric trim as quickly as before when pilots were only expected to pull on the column in response to runaway
Boeing is ignoring the fact that speed trim is a fail-safe, dual-input, dual-processor design on 757 and 747-400 (both with the aft column cutout switch active).

The 737 Speed Trim System's single-threaded design is a singularity among augmentation systems that have full control of the stabilizer. It survives without greater hazard because the aft and forward column cutout switches are able to constrain its malfunction. The removal of the aft column cutout broke that protection, which should by itself have driven a fail-safe design mandate, both in compensation and to bring the design into conformance with other models and with augmentation systems in general.

Level of Mistrim

Mistrim reflects the displacement of the column under static condition. In general, trim is applied to neutralize column forces. Mistrim has greater import when stall warning is active, as the Elevator Feel Shift (EFS) feature is activated, making column pull forces much higher than normal. An MCAS malfunction trims the stabilizer to drop the nose; the pilot has to pull against the elevated EFS feel forces to counter it.

Another aspect of mistrim relates to binding the manual stab trim wheel, in this case making it very difficult to trim Airplane Nose Up, as appears to have been manifested in ET302.

The level of mistrim scales the degree of hazard. A small mistrim affords easy response; a large mistrim can create difficulty in control.

Boeing accounted for the level of mistrim in all scenarios using the same amount, a value of 0.6 units. This seems to have been the standard value for mistrim based on a presumption of the time to respond to a "runaway". 

The System Safety Analysis starts very early in the design process, notably with the Preliminary System Safety Assessment and the Functional Hazard Assessment.

From the JT610 Final Report:
As part of the MCAS development phase, in late 2012, Boeing performed a preliminary functional hazard assessment of MCAS using piloted simulations in their full motion Engineering Flight Simulator (E-Cab). 
Several hazards were assessed at that time, however, this section of the report will focus only on the following two hazards: uncommanded MCAS operation up to its maximum authority (0.6 degrees of aircraft nose down stabilizer) and uncommanded MCAS operation equivalent to a 3 second stabilizer trim runaway. 
Boeing used two scenarios to assess this hazard: a runaway at MCAS activation during a wind-up turn maneuver (operational envelope) and a wings-level recovery from a stabilizer runaway during level flight (normal flight envelope). 
To perform these simulator tests, Boeing induced a stabilizer trim input that would simulate the stabilizer moving at a rate and duration consistent with the MCAS function. 
Using this method to induce the hazard resulted in the following: motion of the stabilizer trim wheel, increased column forces, and indication that the aircraft was moving nose down. 
Boeing indicated that this evaluation was focused on the flight crew response to uncommanded MCAS operation, regardless of underlying cause.

MCAS command authority started out applicable to high altitude (high Mach) high-AoA conditions. Boeing originally proposed that MCAS would have 0.6 units of command. A malfunction of MCAS would result in 0.6 units of stabilizer mistrim. 

Later Boeing expanded the MCAS command to 2.5 units at low altitude (low Mach) high-AoA conditions. A malfunction of MCAS would result in 2.5 units of stabilizer mistrim. 

In this context, a malfunction is a false trigger of MCAS. 

In another context, MCAS output is falsely set by malfunction of hardware or software, against intended function. This has different characteristics, in that the failure may be persistent, commanding continuous stab trim changes without regard.

From the JT610 Final Report:
The FHA evaluations were conducted by Boeing in their Engineering Cab using FAA guidance regarding flight crew response to flight control failures requiring trim input that is contained in FAA Advisory Circular AC25.7C. In particular, Boeing uses the following assumptions in its flight controls FHAs:
  • Uncommanded system inputs are readily recognizable and can be counteracted by overriding the failure by movement of the flight controls in the normal sense by the flight crew and do not require specific procedures.
  • Action to counter the failure shall not require exceptional piloting skill or strength.
  • The flight crew will take immediate action to reduce or eliminate increased control forces by re-trimming or changing configuration or flight conditions.
  • Trained flight crew memory procedures shall be followed to address and eliminate or mitigate the failure.
Boeing advised that these assumptions are used across all Boeing models when performing functional hazard assessments of flight control systems and that these assumptions are consistent with the requirements contained in 14 CFR 25.671 & 25.672 and within the guidance contained in FAA Advisory Circular (AC) 25-7C for compliance evaluation of 14 CFR 25.143.
The requirements document also indicated that the preliminary functional hazard assessments of MCAS were re-evaluated by flight crew assessments in the motion simulator and by engineering analysis and determined to have not changed in hazard classification as a result of the increase in MCAS authority to 2.5 degrees.
The FCC provides the software functions that calculate the MCAS command. A command processor provides those logical software outputs to the FCC hardware discrete outputs that direct stab trim to move, and that can also override the aft column cutout and select high trim speed.

A false trigger of MCAS would be manifested by the stab moving until one of the following occurs:
  • manual electric trim is used,
  • the appropriate trim cutout is applied, or
  • the trim wheel is grabbed;
upon which the stab would stop moving.

Boeing assumed that enough elevator would always be available to counteract the stab movement from one MCAS step.

Stabilizer trim is susceptible to runaway without MCAS as a factor. Boeing has provided measures to limit the stabilizer mistrim in a runaway to ensure that the elevator remains effective. 

The aft column cutout and the forward column cutout are switches that automatically stop the stabilizer from trimming in opposition to the column. 

If the stabilizer is running away by trimming the airplane nose down, the pilot flying will respond with aft column travel to command sufficient airplane nose up elevator to exactly offset the stabilizer mistrim. 

At the point the aft column cutout is reached, the stabilizer trimming is stopped, and the level of mistrim is then set by the rigging of the aft column cutout switch. If the switch is rigged at 50% travel, then the pilot still has the remaining 50% of travel available. The type of runaway does not matter: when the pilot responds with column travel, the trim will stop with the same level of mistrim.
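The column cutout behaviour described here can be sketched as a simple gate on the trim command. This Python fragment is only my illustration: the function name and sign conventions are invented, the 50% threshold is the hypothetical rigging used in the paragraph above, and the override flag stands in for what MCAS does to the aft column cutout.

```python
def trim_permitted(trim_direction, column_position,
                   cutout_threshold=0.5, aft_cutout_overridden=False):
    """Decide whether a stabilizer trim command is allowed to move the stab.

    trim_direction: -1 for airplane nose down, +1 for airplane nose up.
    column_position: -1.0 (full forward) through 0.0 to +1.0 (full aft).
    """
    # Aft column cutout: pulling aft past the switch blocks nose-down trim,
    # unless the cutout has been overridden.
    if trim_direction < 0 and column_position >= cutout_threshold:
        return aft_cutout_overridden
    # Forward column cutout: pushing forward past the switch blocks nose-up trim.
    if trim_direction > 0 and column_position <= -cutout_threshold:
        return False
    return True


print(trim_permitted(-1, 0.6))                              # runaway stopped by aft column
print(trim_permitted(-1, 0.6, aft_cutout_overridden=True))  # cutout overridden, stab keeps moving
```

With the override set, pulling the column no longer stops nose-down trim, so the level of mistrim is no longer bounded by the switch rigging.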

Boeing assumed that a pilot will take one second to recognize the runaway malfunction (by noting the nose starting to drop), and then three seconds to respond to the malfunction, which in this case is by pulling back on the column sufficiently to trip the aft column cutout. 

Where does this three-second mistrim criterion originate?

From 25.255 (a) (1):
25.255 Out-of-trim characteristics.

(a) From an initial condition with the airplane trimmed at cruise speeds up to VMO/MMO, the airplane must have satisfactory maneuvering stability and controllability with the degree of out-of-trim in both the airplane nose-up and nose-down directions, which results from the greater of - 
(1) A three-second movement of the longitudinal trim system at its normal rate for the particular flight condition with no aerodynamic load (or an equivalent degree of trim for airplanes that do not have a power-operated trim system), except as limited by stops in the trim system, including those required by § 25.655(b) for adjustable stabilizers; or
(f) In the out-of-trim condition specified in paragraph (a) of this section, it must be possible from an overspeed condition at VDF/MDF to produce at least 1.5 g for recovery by applying not more than 125 pounds of longitudinal control force using either the primary longitudinal control alone or the primary longitudinal control and the longitudinal trim system.
25.255 (f) enforces the requirement that at the level of mistrim (from three seconds), there shall be sufficient load factor to pull out of a dive at 1.5 g.

By this interpretation of 25.255, the most mistrim you must demonstrate is three seconds. How can pilot response to stopping the runaway stabilizer be assured within three seconds? The answer lies in the aft and forward column cutout switches.

The aft column cutout switch performs two functions.
  1. Stop the stabilizer from running away.
  2. Limit the level of mistrim to preserve elevator effectiveness.
Boeing has used the aft and forward column cutout switches starting with the 707 and then including 727, 737, 747, 757, and 767. In all those airplanes, over all those years, the column cutout switches performed exactly as designed. 

Boeing consistently assumed that the most mistrim to account for was 0.6 units. This was the level of mistrim resulting from three seconds of runaway. This value is debatable once one second of recognition and the relevant stab trim rates are accounted for.

The level of mistrim seemingly was independent of other factors, such as MCAS command authority in the range of 0.65 units to 2.5 units.  

From the JT610 Final Report:
In March 2016, Boeing determined that MCAS should be revised to improve wings-level, flaps up, low Mach stall characteristics and identification. The MCAS was revised such that depending on AOA, it would be capable of commanding incremental stabilizer to a maximum of 2.5 degrees at low Mach decreasing to a maximum of 0.65 degrees at high Mach.
If MCAS were to achieve its command, the mistrim range should be 0.65 units to 2.5 units.

JATR also questioned how 0.6 units was selected. The normal trim rate for autopilot, flaps up, is 0.09 deg/sec. Assuming four seconds (one to recognize, three to respond), the stab should move about 0.4 units. Perhaps a fifty percent addition to make it worst case gets that to 0.6 units. Under MCAS control, the rate is 0.27 deg/sec. A four-second runaway would be about 1.1 units.
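The numbers in the paragraph above are just rate multiplied by time. The Python sketch below reproduces them, treating one unit as one degree the way this discussion does; the function name and the fifty percent pad are my own guesses at the reasoning, not anything documented by Boeing.

```python
def mistrim_units(trim_rate_deg_per_s, runaway_seconds):
    """Stabilizer travel accumulated during an unopposed runaway."""
    return trim_rate_deg_per_s * runaway_seconds


AUTOPILOT_RATE = 0.09  # deg/sec, autopilot trim rate, flaps up
MCAS_RATE = 0.27       # deg/sec, trim rate under MCAS control
RECOGNIZE_S = 1.0      # assumed recognition time
RESPOND_S = 3.0        # assumed response time

runaway_s = RECOGNIZE_S + RESPOND_S
autopilot_case = mistrim_units(AUTOPILOT_RATE, runaway_s)
padded_case = autopilot_case * 1.5  # speculative fifty percent worst-case pad
mcas_case = mistrim_units(MCAS_RATE, runaway_s)

print(f"autopilot rate, 4 s: {autopilot_case:.2f} units")
print(f"with 50% pad:        {padded_case:.2f} units")
print(f"MCAS rate, 4 s:      {mcas_case:.2f} units")
```

At the MCAS rate the same four seconds produces three times the travel of the autopilot case, well beyond the 0.6 units used in the analysis.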

From the JATR Report:
Observation O3.8-A: Out-of-trim characteristics, per the requirements of 25.255, were found acceptable for a 0.6 unit nose-down out-of-trim condition. 

This out-of-trim value was determined by 3 seconds of trim input at the flaps-up main electric stabilizer trim rate of 0.2 degrees per second, which is greater than the autopilot trim rate. 
Observation O3.8-B: The higher MCAS trim rate of 0.27 degrees per second was not selected for the demonstration of compliance with § 25.255, even though failures could result in un-commanded stabilizer trim movement at this rate.
From the JATR Report:
Observation O2.8-C: Although the above guidance is aimed at test pilots conducting test flights, applicants seem to use this guidance as a design assumption that the pilot will be able to respond correctly within 4 seconds of the occurrence of a malfunction. 
For example, in the case of the B737 MAX, it was assumed that, since MCAS activation rate is 0.27 degrees of horizontal stabilizer movement per second, during the 4 seconds that it would take a pilot to respond to an erroneous activation, the stabilizer will only move a little over 1 degree, which should not create a problem for the pilot to overcome.
Any reference to 0.6 units is more likely a long-held historical value tied to the pilot responding with opposing column and tripping the column cutout switch.

From the JATR Report:
Finding F2.8-D: The 3-second reaction time assumption dates back decades, to where the performance of the autopilot was constantly monitored by the crew in flight (e.g., guidance given in AC 25.1329-1A, Automatic Pilot Systems Approval, dated July 8, 1968). 
However, with increasing reliability and advances in flight deck alerting and displays, it may no longer be appropriate to assume that the pilot flying will be monitoring the automation as closely as in the past.
For the 737 MAX, with MCAS, the pilot response needed to stop the mistrim from growing extends to applying opposing manual electric trim.

Boeing recognized that removing the aft column cutout switch would change the pilot response to mandate or strongly encourage rapid manual electric trim to stop MCAS and to quickly neutralize column forces. 

While not knowing the detail for JT610, the crew from JT043 could have instantly confirmed that the pilot response was clearly inconsistent with expectation.

There was neither special emphasis nor imperative for quickly using manual electric trim to oppose MCAS in the multi-operator message sent by Boeing on 10 Nov. 2018, shortly after the JT610 disaster, nor in any other communication revealed so far.

A description of MCAS was provided, revealing the level of stab trim authority (2.5 deg) and the high trim rate (0.27 deg/sec). The Boeing message states that you can use the column trim switches or aisle stand cutout switches to oppose MCAS.

The Boeing message does not specifically mention that the aft column cutout switch function is disabled. That is apparent only from its omission from the list of countermeasures.

The Boeing message does not emphasize the critical criteria they base proper pilot response on:
  • The flight crew will take immediate action to reduce or eliminate increased control forces by re-trimming or changing configuration or flight conditions.
The background discussion in the ops bulletin Boeing released on 6 Nov 2018 does state that manual electric trim can be used to stop MCAS. It further states that the cutout switches can deactivate MCAS. It also states that MCAS has unlimited authority. This is really a remarkable admission.

It was apparent, once the flight data was available, that the assumptions Boeing had relied on were not holding up. The level of mistrim was excessive. Pilots did not reliably respond within three seconds. Pilots did not always trim back to neutralize column forces. Pilots did not rely on manual electric trim to counter nose down trim malfunction.  

Once it realized that the assumptions were flawed, Boeing had the option to immediately disable MCAS and to fix the system on the basis of a HAZARDOUS malfunction (as they are doing).

The only action that was going to prevent ET302 was for Boeing to either turn off MCAS or fix MCAS. Clearly fixing MCAS wasn't a quick undertaking. Boeing tip-toed with their fix, starting with detecting disagreement (no single sensor), then by assuring it could never be re-triggered, then limiting command authority to preserve elevator authority, and finally to bring in a monitor processor with tandem hardware control to meet fail-safe in the presence of a single fault.

The issues were plainly obvious. I came to question the HAZARDOUS (fail-safe) assumption myself when I published 737 MCAS - Failure is an Option on 15 Nov 2018 and 737 FCC Pitch Axis Augmentation - Command Integrity Mandate for Dual Channel, Fail-Safe on 20 Nov 2018, both before any flight data was even available, having concluded MCAS was single-thread.

Boeing failed to properly re-assess the situation, doubling down on their assumptions instead of immediately disabling MCAS to remove any chance of further disaster. This stay-the-course, admit-no-fault mentality was plainly evident with the 787 battery fiasco, where the answer was to install a chimney.

The JATR report expressed a need to emphasize the use of main electric trim as a countermeasure to MCAS malfunction and that the level of mistrim needs to be conservative, and that significant mistrim can lead to greater hazards.

From the JATR Report:
Recommendation R3.8: The FAA should review the prescriptive use of 3 seconds under 14 CFR 25.255 (Out-of-Trim Characteristics) for the evaluation of mis-trim conditions, especially for automatic trim systems where pilot recognition is relied upon to detect and arrest runaway failures. 
The rate of trim used by these automatic systems should also be considered in showing compliance to § 25.255. 
Finding F3.8-A: Section 25.255 applies to jet upset events and uses a prescriptive 3 seconds as the amount of out-of-trim that could occur before pilot reaction. 
For automatic trim systems, the 3-second reaction time may not be appropriate, depending on the cockpit alerting philosophy and trim system architecture and controls. 
Recommendation R3.9: The FAA should review the AFM procedure for stabilizer runaway and ensure that adequate emphasis is placed on the importance of using main electric stabilizer trim to return to a trimmed state.  

Crew error should be considered in the event that aisle stand stabilizer cutout switches are used before returning to trim conditions. 
Finding F3.9-A: Certain stabilizer runaway failures may generate significant out-of-trim conditions. 
Main electric stabilizer trim is considered the primary means to stop runaway stabilizer in Boeing’s assumptions and validation tests. 
The degree of stabilizer mis-trim and resulting transient from steady-state flight may result in hazardous or even catastrophic failure conditions. 
Recommendation R3.10: The FAA should review the Boeing assumption of a 4-second pilot reaction time to stabilizer runaway failures to ensure that a conservative value is used, since pilot action is required to counter these failures. 
Finding F3.10-A: Manual stabilizer trim wheel forces increase with increased speed and degree of out-of-trim condition. 
The degree of out-of-trim condition is dependent on pilot recognition and reaction technique and time. 
Manual stabilizer trim wheel forces could become significant when assumed pilot reaction times are reasonably exceeded, especially for high-speed conditions. 
During stabilizer runaway conditions where main electric stabilizer trim is not available, either due to system failures or the erroneous selection of stabilizer cutout switches prior to returning to trim, the crew must use the manual stabilizer trim wheel to return to a trimmed condition.
The FAA issued an AD on 7 Nov 2018 stating the importance of following the stabilizer runaway non-normal checklist if encountering an AoA vane malfunction.

From the JT610 Final Report:

The AD provided the following entry. It states that the pilot should control with column and main electric trim as required. It then expresses the need to set the cutout switches. The comment, if relaxing the column causes the trim to move, is referring to the column cutout switch, which would not be active with MCAS. 

From the JT610 Final Report:

The statement "Electric stabilizer trim can be used to neutralize column pitch forces before moving the STAB TRIM CUTOUT switches to CUTOUT" does not match the imperative stated in their foundational assumptions on pilot behavior. The statement should have been written with emphasis and insistence: "Electric stabilizer trim should be or shall be used to neutralize..."

From the JT610 Final Report:

Boeing assumed that opposing MCAS malfunction with electric trim would be achieved within the three seconds normally allotted to the pilot pulling back on the column. Instead of automatically stopping mistrim by simple aft column cutout, the pilot uses manual electric trim to stop it and to retrim.

From the JT610 Final Report:
The assessment was also based on an assumption that the flight crew was highly reliable to respond correctly and in time. Boeing followed FAA guidance that the flight crew would respond within 3 seconds to any changes in flight condition. The assessment was that each MCAS input could be controlled with control column alone and subsequently re-trimmed to zero column force while maintaining flight path.
The expected pilot response in a runaway (nose down) is to pull the column. With the aft column cutout switch in place, pulling the column alone will stop the stab from trimming further. For MCAS, the pilot has to both pull the column to counter the mistrim while simultaneously using the manual electric trim to oppose the runaway. Boeing placed complete confidence in their assumptions that pilots would promptly use manual electric trim in this case. 

From the JT610 Final Report:
In the event of repetitive MCAS activation with repeated electric trim inputs by flight crew, but without sufficient flight crew response to return the aircraft to a trimmed state, the control column force to maintain level flight could eventually increase to a level where control forces alone may not be adequate to control the aircraft. 
During the accident flight, the DFDR recorded a control force of 103 lbs., after repetitive MCAS activation was responded with the FO had responded with inadequate trim to counter MCAS. At this point, the flight crew was unable to maintain altitude.
It has been Boeing's proposition that an MCAS malfunction to its full authority can be easily countered by elevator. The situation highlighted is high altitude, where the MCAS authority has been shown to be a MAJOR hazard, except outside the normal flight envelope it is HAZARDOUS.

Boeing determined that a mistrim of 0.6 units at low altitude was only a MINOR hazard. Boeing did not conduct any flight tests to confirm it, but they claim with good confidence that a 2.5 unit step of stabilizer at low altitude is controllable by column as well.

The data from JT610 and JT043 shows that the pilot can counter MCAS with manual electric trim and column, but that the excursions can exceed 0.6 units and even 2 units. It also shows that pilots failing to neutralize column forces after each MCAS malfunction can lead to loss of control.

Stab trim - JT610

Stab trim - JT043

A simulator test was conducted **after** the JT610 accident to judge the hazard from successive MCAS commands without trimming in between.  The conclusion was that two successive commands could not be countered by column alone.

From the JT610 report:
The fourth case of the third scenario was for the flight crew to feel the control column forces during single and repetitive MCAS activations while trying to keep the aircraft level. The AOA bias was introduced and the MCAS function activated. The observations were as follows:
  1. Significant aft control column force was necessary to hold the control column after one activation of MCAS.
  2. After two full applications of MCAS and no restoring electric manual trim up, one participant characterized the control column force as “too heavy.”  
Prior to the 737 MAX Certification, Boeing revised the hazard assessment due to the change in MCAS authority at low altitude. There was no change in classification.

From the JT610 Final Report: 

* Major only for high altitude conditions
** Piloted Simulation not Required 
When assessing unintended MCAS activation in the simulator for the FHAs, the function was allowed to perform to its authority and beyond before flight crew action was taken to recover. Failures were able to be countered by using elevator alone. Stabilizer trim was available to offload column forces, and stabilizer cutouts were available but not required to counter failures. This was true both for the preliminary FHAs performed in 2012 and for the reassessment of the FHAs in 2016. 
The uncommanded MCAS command to the maximum nose down authority at low Mach numbers was evaluated in the Boeing 737-8 (MAX) cab and rated as Minor. According to Boeing, Engineering analysis determined no low Mach piloted simulation to be required as this failure is less critical than MCAS function operation to maximum authority. Stabilizer motion for three seconds would not reach maximum authority in low Mach conditions.  
The high Mach uncommanded MCAS command and subsequent recovery is the critical flight phase in establishing the hazard rating for erroneous MCAS commands. The existing high Mach evaluations remain valid as the aerodynamic configuration has not changed significantly since the preflight evaluations, and the 3 second stabilizer motion is the same magnitude. 
According to Boeing, engineering analysis determined that the existing high Mach evaluations remain valid as the aerodynamic configuration had not changed significantly since the pre-flight evaluations, and the MCAS authority limit at high Mach did not change significantly in the flight test update.  
As the ratings for these high Mach evaluations were more severe than for low Mach, the overall flight envelope hazard ratings remain the same as the pre-flight evaluations. 
During the process of developing and validating the Functional Hazard Analysis (FHA), Boeing considered four failure scenarios including uncommanded MCAS function to the maximum authority limit of 2.5° of stabilizer movement. 
However, the uncommanded MCAS function to maximum authority was only flight simulated to high speed maximum limit of 0.6°, but not to low speed maximum limit of 2.5° of stabilizer movement. 
To perform the simulator tests, Boeing induced a stabilizer trim input that would simulate the stabilizer moving at a rate and duration consistent with the MCAS function. 
Boeing also indicated that engineering and test pilots discussed the scenario of repeated uncommanded MCAS activation during development of the Boeing 737-8 (MAX) and deemed it no worse than single uncommanded MCAS activation because it was assumed that the pilots would trim out uncommanded trim inputs to maintain control of the aircraft.  
Repetitive MCAS activations without adequate trim reaction by the flight crew would escalate the workload and hence failure effects should have been reconsidered. 
During FHA, the simulator test had never considered a scenario in which the MCAS activation allowed the stabilizer movement to reach the maximum MCAS command limit of 2.5° of stabilizer movement. Therefore, their combined flight deck effects were not evaluated. 
Boeing indicated the following key conclusions supporting the MAJOR hazard classification.

From the JT610 Final Report:
In a 2019 presentation to the investigation team, Boeing indicated that the MCAS hazard classification of “major” for uncommanded MCAS function (including up to the new authority limits) in the Normal flight envelope were based on the following conclusions:
  • Unintended stabilizer trim inputs are readily recognized by movement of the stabilizer trim wheel, flight path change or increased column forces.
  • Aircraft can be returned to steady level flight using available column (elevator) alone or stabilizer trim.
  • Continuous unintended nose down stabilizer trim inputs would be recognized as a Stab Trim or Stab Runaway failure and procedure for Stab Runaway would be followed.
The conclusions Boeing professed are not evident in the data from either Lion Air flight.

From the JT610 Final Report:
Without prior knowledge of MCAS functions, the flight crew would depend on the visual and motion cues, prior training for runaway stabilizer, and general training on pitch control to be able to analyze the situation and recognize the non-normal condition. 
Review of the DFDR data showed that during both the accident and the previous LNI043 flights, the flight crew responded within 2-3 seconds using control column to control the flight path and subsequently trimmed out column forces using electric trim. 
In the previous LNI043 flight, the flight crew required 3 minutes and 40 seconds rather than seconds to recognize and understand the problem, during which repetitive uncommanded MCAS activations occurred. 
During the accident flight, recognition of the uncommanded stabilizer movement as a runaway stabilizer condition did not occur thereby, the execution of the non-normal procedure did not occur.
Using manual electric trim as a countermeasure to persistent MCAS malfunctions proves generally unsuccessful and greatly distracting.

From the JT610 Final Report:
A combination of repetitive MCAS-commanded coupled with flight crew electric trim input led to a flight condition that considerably increased the flight crew workload of maintaining control. 
The previous LNI043 flight showed that repetitive MCAS-commanded stabilizer movement was able to be countered by the flight crew by repeatedly trimming out erroneous aircraft nose down trim and was only able to be stopped by Stabilizer Trim Cutout switches, enabling the flight crew to safely continue flight and land in Jakarta. 
Boeing admits in their ops bulletin from 6 Nov that the cutout switch is available to stop persistent MCAS malfunction. The safety analysis assumes indefinite manual electric trim is just a MAJOR hazard. The issue is whether an MCAS malfunction is a runaway stabilizer. There has been no end of testimony that a pilot would recognize MCAS malfunction as a runaway, perform the checklist, and use the cutout switches.

The FAA insists on using the runaway stabilizer checklist and using the cutout switches in their AD from 7 Nov 2018. It makes it clear that non-compliance could result in catastrophe.

From AD 2018-23-51:

It took the JT610 tragedy to demand re-evaluation of the assumptions around MCAS malfunction and expected pilot behavior. Overnight, it was recognized that pilots must be vigilant and that the hazard could lead to loss of control.

From the JT610 Final Report:
Boeing reasoning that the stabilizer cutout is available but not required is incorrect. 
During the FHA, Boeing did not adequately assess the effect of repetitive MCAS activation.  
The repetitive MCAS-commands coupled with insufficient flight crew electric trim inputs, may have led to increasing flight crew workload.
Per Boeing, stabilizer trim cutouts switches were available but not required to counter MCAS activations. 
The only procedure that directs selecting the stabilizer cutout switches is the Runaway Stabilizer non-normal checklist (NNC). This NNC is used to stop un-commanded stabilizer trim wheel movement, which would stop MCAS-commanded stabilizer trim movement. 
However, erroneous MCAS activation does not look like a typical stabilizer runaway, which is continuous un-commanded (runaway) movement of the stabilizer. 
During the accident flight, the stabilizer movement was not continuous; the MCAS commands were bounded by the MCAS authority (up to 2.5°); the pilots were able to counter the nose-down movement using opposing manual electric trim inputs; and after the pilots released the manual electric input and MCAS was reset, there was not another MCAS command for 5 seconds. 
Repetitive, false MCAS commands result in higher workload than a single false MCAS command. The existence of repetitive commands is yet another mandate for declaring this malfunction HAZARDOUS.
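
To illustrate why repetition matters, here is a minimal sketch of how net nose-down mistrim could accumulate when each MCAS command is only partially trimmed out. The 2.5 unit command size comes from the report; the function name, the `pilot_recovery_fraction` parameter, and the absence of any stabilizer travel limit are illustrative assumptions, not Boeing's model and not recorded flight data.

```python
def accumulated_mistrim(activations, mcas_step=2.5, pilot_recovery_fraction=0.8):
    """Return the net nose-down mistrim (units) after each MCAS activation.

    Hypothetical model: every activation adds `mcas_step` units nose down,
    then the pilot trims back only `pilot_recovery_fraction` of that step
    before the next activation arrives.
    """
    if not 0.0 <= pilot_recovery_fraction <= 1.0:
        raise ValueError("pilot_recovery_fraction must be between 0 and 1")
    mistrim = 0.0
    history = []
    for _ in range(activations):
        mistrim += mcas_step                            # erroneous MCAS command
        mistrim -= mcas_step * pilot_recovery_fraction  # partial pilot re-trim
        history.append(mistrim)
    return history


# Compare full re-trim against two levels of incomplete re-trim.
for fraction in (1.0, 0.8, 0.5):
    print(fraction, accumulated_mistrim(4, pilot_recovery_fraction=fraction))
```

In this toy model the mistrim stays at zero only when every command is fully trimmed out. Any shortfall grows with each cycle, which is the escalation a single-activation assessment does not capture.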

As is evident, Boeing made no provision to modify the MCAS command authority for application when the airplane is not flying at high AoA. 

As was evident in ET302, MCAS command applied at low AoA (high airspeed) is overly aggressive. 

The removal of the aft column cutout took away any limit to MCAS over-authority. 

It should be noted that in addition to preserving elevator authority, MCAS should stop if the airplane experiences negative normal acceleration. 

Knowing the potential for large mistrim would have also revealed the issues regarding recovery using the manual wheel trim only.

Ineligible Discount

The safety analysis is inherently based on assumptions and combinational methods. The assumptions around combinational methods are the most insidious. Boeing applied a factor, a reduction in system integrity requirement of one thousand, that allowed a HAZARDOUS hazard to be ignored.  Without the discount, MCAS would have been developed as a dual-channel, fail-safe solution. Even with the discount, it is not apparent if the resultant software design assurance was discounted as well, which does not seem reasonable at all.

Hot Potato

Three systems are factored into the Safety Analysis, each done as a project:
  1. ADIRS (source of AOA vane data to MCAS)
  2. EDFCS (MCAS function)
  3. Stab Trim Control System (MCAS actuator)
  • ADIRS assumes that its AoA output may malfunction at a 1E-5 rate. 
  • EDFCS only guards against AoA inputs that are flagged as not valid, and makes no further mandate on integrity. It leaves integrity to ADIRS.
  • Stab Trim is left holding the bag, for it is the system that actually presents a hazard.
ADIRS was largely unchanged.

EDFCS (Enhanced Digital Flight Control System) includes the Flight Control Computer (FCC) which hosts Speed Trim System and MCAS. The FCC is changed by the addition of the MCAS software function and use of output discretes. It is implied that these output discretes were available in the same FCC that was used on 737 NG, the software being the only difference.

From the JT610 Final Report:
CP13471 indicated that certification of the MCAS implementation and function will be addressed in certification plan (CP13474), “737-8 Amended Type Certificate – Flight Controls – Autoflight (EDFCS/FCC).”
The development of EDFCS certification plan (CP13474) began with Boeing's initial submission of CP13484, revision NEW to the FAA for review in March, 2014. 
A review of CP13474 found that changes to the EDFCS for the 737-8, as compared to the baseline 737-800, were limited to the Flight Control Computer (FCC) software only. CP13474 indicated that the FCC Operational Program Software (OPS) will be revised to add the MCAS function. 
The Stabilizer Trim Control system is changed by the addition of relays that can bypass the aft column cutout function and that can allow override of the trim speed control.

From the JT610 Final Report:
The development of the Elevator and Stabilizer Trim Control system certification plan (CP13471), began with Boeing’s initial submission of CP13471, labeled “NEW”, to the FAA for review in March 2014. 
According to CP13471, one of the changes to the Stabilizer Trim Control system from the baseline 737-800 (NG) was the incorporation of the MCAS. Implementation of this new function required two new analog discrete signals, generated by the FCCs, to be sent to components within the stabilizer system. 
One discrete will override the control column cut-out switches located in the First Officer’s Column Switching Module in the “pull” direction when MCAS is operating to prevent the stabilizer command from cutting out during the pilot maneuver. 
The second discrete overrides the flap position input to enable the higher stabilizer trim motor (STM) operating speed with flaps retracted when MCAS is operating. 

Functional Hazard Assessment (FHA)

The Stab trim FHA and the EDFCS FHA offered specific MCAS considerations.

From the JT610 Final Report:
The Functional Hazard Assessment section of Appendix “G” summarized the FHA that was performed as part of the 737 MAX Stabilizer Trim Control System Safety Analysis, and addressed each system function and the result of loss of availability or loss of integrity of that function. 
The analysis considered all phases of flight for both the Normal and Operating flight envelopes, interfacing systems, and established the effect category for each failure condition. 
Hazard assessments were determined in consideration of the impact to crew workload for the maximum flight time and longest diversion time (where a diversion is required). 
An NTSB review of the FHA found that it identified and classified, pursuant to the guidance in AC 25.1309-1A, the following six hazards related to MCAS: 
CP13474 According to this FHA, the EDFCS Functional Hazard Assessment  for the 737-8 will be based on the FHA for the 737NG as documented in the document titled, "Enhanced Digital Flight Control System, Autothrottle, and Yaw Damper Safety Analysis, Model 737-600/700/800/900."
The FHA was to be updated to address any functional hazards associated with the addition of the Maneuvering Characteristics Augmentation System (MCAS), and other system changes. 
A review of the functional hazard assessment found that it addressed each system function and the result of loss of function or erroneous operation.
Because MCAS only operates with the autopilot off, one hazard contained within the assessment was relevant. This hazard is: "Autopilot Malfunction at Low Altitude Which Results in Unsafe Flight Path in an A/P OFF, Single Channel, or Fail-Passive Configuration" (FHA 1).

System Safety Assessment (SSA)

Three System Safety Assessments are applicable: Air Data Inertial Reference System (ADIRS), the Stabilizer and the EDFCS.

From the JT610 Final Report:
The ADIRS SSA is relevant because AoA is processed by ADIRS for use by the FCC. The Angle of Attack Failure section of the SSA includes only AoA resolver circuit failures (open circuit, high impedance, etc.) that can be detected by the associated computer (ADIRU or SMYD). The SSA does not discuss the category of AoA sensor failures not related to the electrical circuitry that could provide misleading (erroneous) data to the ADIRU (e.g. frozen or seized vane with limited or no motion, or a bent or broken vane resulting in angular offset). AoA values are transmitted as "valid" to user systems, because the ADIRU does not detect these faults. 
It was determined that the ADIRU, air data module (ADM), pitot probe head and AoA vane (and heat) have potential undetected failure modes that may result in undetected, and misleading data.  An NTSB review of the functional failure rates table found the following information for AoA sensors. 
The review also found that Boeing considered the effects of a single AoA sensor providing "erroneous data" within the lower branches of a fault tree with the "Top Event" titled "Misleading Air Data from the Left and Right ADIRU - Airspeed/Altitude". The fault tree showed there were two failure conditions that contributed to this top event: 
  • Misleading Air Data from the Left ADIRU, and 
  • Misleading Air Data from the Right ADIRU

    There are four failure conditions that contribute to the "Misleading Air Data from the Left ADIRU" hazard. One of these conditions was titled "Erroneous AoA-L data from the Captain's side". The fault tree showed the following two ways (or failure conditions) that could lead to "Erroneous AoA-L data from the Captain's side".
  • Failure of the AoA-L vane/Annunciation
  • incorrect AoA output from the ADIRU-L output
For the "failure of AoA-L vane / annunciation", the fault tree showed that this event could occur by the combined (ANDed) result of the following two failure conditions:
  • Loss of AoA-L Heat Annunciation
  • Erroneous AoA-L Sensor 
In 2019, Boeing advised NTSB of an error in this fault tree in that the two conditions should not have been combined with an AND gate. In a June 28, 2019, revision to the SSA, "Erroneous AoA-L data from the Captain's side" is revised to show three separate conditions combined with an OR gate, meaning any one by itself could result in erroneous AoA data:
  • Erroneous AoA-L Sensor
  • Incorrect AoA output from ADIRU-L output
  • Loss of Power to AoA-L Heater

Because the ADIRS top-level event was a combination of misleading data from both ADIRUs, there was no further elaboration. A combination of two misleading AoA events, each at 1E-5, equates to no greater than 1E-9, which is sufficient to meet a catastrophic hazard. Effectively, the ADIRS safety assessment protects its own functions and leaves it to downstream users to deal with a 1E-5 likelihood of receiving erroneous AoA data.
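
The gate arithmetic behind this reasoning can be sketched as follows. This is a simplified illustration that treats the per-hour rates as probabilities and assumes the events are independent; the function names are my own, and the 1E-5 figure is the AoA malfunction rate cited above.

```python
from math import prod


def and_gate(probabilities):
    """All independent events must occur together: multiply the probabilities."""
    return prod(probabilities)


def or_gate(probabilities):
    """Any one independent event is enough: 1 minus the chance that none occur."""
    return 1.0 - prod(1.0 - p for p in probabilities)


SINGLE_AOA_ERRONEOUS = 1e-5  # assumed rate for one erroneous AoA source

# Top event that needs BOTH sides misleading (AND gate).
both_sides = and_gate([SINGLE_AOA_ERRONEOUS, SINGLE_AOA_ERRONEOUS])

# A user that consumes ONE side only sees the single-source rate.
one_side = SINGLE_AOA_ERRONEOUS

print(f"both sides misleading: {both_sides:.1e}")
print(f"one side misleading:   {one_side:.1e}")
print(f"either of two causes:  {or_gate([1e-5, 1e-5]):.1e}")
```

The AND gate is what brings the top event under 1E-9. A downstream function that listens to one vane gets no such credit, and changing an AND gate to an OR gate (as in the June 2019 fault tree revision) raises the result rather than lowering it.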

It should be noted that the ADIRS SSA does not account for stall warning; that is the SMYD and it was not discussed (presumably unchanged from the 737 NG). ADIRS does declare a MAJOR hazard if one source of airspeed and/or altitude is misleading. 

The EDFCS SSA is focused on the software and hardware contribution to failures of the EDFCS equipment.

From the JT610 Final Report:
With regards to MCAS, the SSA indicated that the inclusion of the new MCAS function creates new failure modes affecting the probability of runaway stabilizer trim which cannot be arrested by the column cutout switches.
As previously described, the MCAS function normally activates only during manual flight, and operates by trimming the horizontal stabilizer in the nose-down direction while the aircraft is executing a high-AOA maneuver. 
Any erroneous activation of the MCAS ENGAGE output will energize the bypass relay and prevent aft column inputs from interrupting nose-down automatic trim commands.
To prevent a failure condition of an erroneous MCAS activation preventing the column cutout mechanism from interrupting an uncommanded nose-down automatic stabilizer trim command, Boeing modified the fault tree for the failure conditions titled "Erroneous Runaway/oscillatory stab output un-arrested by column cutout". 
The failure was one of eight conditions that contributed to a higher-level failure condition titled "Autopilot Malfunction in the Pitch Axis at Low Altitude." This failure condition is one of four conditions that contributes to the Top-Level event titled "Autoflight malfunction at low altitude which results in unsafe flight path in autopilot OFF, single channel, or fail passive configuration." This Top-Level event was identified as a catastrophic hazard as part of Boeing's EDFCS functional hazard assessment.  
An NTSB review of the modifications incorporated into the fault tree titled "Erroneous Runaway/Oscillatory stab output un-arrested by column cutout" revealed that the following two failure conditions "AND'ed" together resulted in the hazard.
  • Column Trim Cutout Fails to Interrupt Stab Motion
  • Undetected Stabilizer trim runaway 
For the "Column Trim Cutout Fails to Interrupt Stab Motion" hazard, the fault tree identified two potential failure conditions (OR'ed together) that could result in the hazard.  One of the failure conditions "FCC-730 produces undetected erroneous MCAS or FLAPS Up/Dn discrete" is where the fault tree begins to address the erroneous activation of the MCAS Engage outputs.
Tracing the failure conditions that could lead to the hazard identified by the SCD led to the event titled "input failures cause FCC to produce an undetectable erroneous MCAS engage discrete". The probability for this event was <1E-9. 
In this context, the FCC protects against its own hardware or software introducing misleading data. In this context, a valid AoA reception is judged to be accurate, not misleading. The conclusion from the FCC SSA is that the catastrophic hazard relating to MCAS malfunction from erroneous input was solely from the corruption within the FCC itself to the hazard.

While input failures within the FCC may be protected to 1E-9, from what is known, the Stab Trim and MCAS outputs from one FCC are probably 1E-5 if operating on a single processor, but could be 1E-9 if operating on both processors with tandem output control.

The Stabilizer Trim Control SSA includes an Appendix with an FHA that identified the severity of hazards due to the implementation of stabilizer trim changes.

From the JT610 Final Report:
An NTSB review of Appendix "G" found that the introductory section of the SSA had not been updated to reflect the March 2016 MCAS maximum authority changes...There was no mention that MCAS had been revised to improve flaps up, low Mach stall characteristics and identification...the Appendix still reflected a pre-March 2016 MCAS maximum authority limit of 0.6 degrees.  
However, an NTSB review of Boeing internal documents confirmed that the FHAs had in fact been reassessed each time that MCAS requirements were changed, include the change in authority limit from 0.6 to 2.5 degrees. In all cases, the reassessment found that the FHA categories had not changed.
In a July 2016 briefing, Boeing provided the FAA with a presentation on stall characteristics and configuration changes. At this briefing, Boeing discussed some of the physical aerodynamic devices (relocation of stall strip, vortex generators (VG) configurations, etc.) they used to improve the stall characteristics with only limited success. During the briefing Boeing discussed their intent to expand the MCAS function to activate at lower Mach speeds. FAA well understood that greater MCAS authority would likely be necessary...In the Fall of 2016...the maximum MCAS authority of 2.5 deg in the low speed region was specifically covered.
Boeing indicated that fault tree analysis were only performed on the FHA events that were determined to be either Catastrophic or Hazardous, which is consistent with the guidance in SAE ARP 4761. As described above, unintended MCAS activation was shown to be Major in the normal flight envelope and Hazardous in the operational flight envelope.

Flight Envelope

Flight envelope is a term to describe the range of conditions an airplane is exposed to from departure to arrival. FAA Advisory Circular AC 25-7D provides a definition of the normal flight envelope, the operational flight envelope, and the limit flight envelope. Each flight envelope is more extreme than the last, edging toward the precipice of controlled flight itself.

Considering only speed and angle of attack (alpha), flaps-up.

Normal Flight Envelope (NFE) extends between the minimum and maximum operating speeds. The maximum speed would be Vmo or Mmo. The minimum speed is the lowest you can select, affording at least 40 deg. bank angle (or 1.3g maneuver margin).
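
The pairing of a 40 degree bank with a 1.3g margin follows from the standard level, coordinated turn relation, where load factor equals 1/cos(bank angle). A minimal sketch (the function name is my own):

```python
import math


def load_factor(bank_angle_deg):
    """Load factor (g) needed to hold altitude in a coordinated turn."""
    return 1.0 / math.cos(math.radians(bank_angle_deg))


# A 40 degree bank works out to roughly 1.3g.
print(f"{load_factor(40):.2f} g")
```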

The NFE border with the Operational Flight Envelope (OFE) begins at speeds below the 1.3g minspeed.

The OFE border with the Limit Flight Envelope (LFE) begins at the stall warning speed, or 𝛂(sw). This is the point that the stick shaker is activated. 

The LFE extends then to the point of maximum coefficient of lift (Clmax), beyond which the airplane is stalled.

AC 25-7D

For the purpose of the AC 25-7D Handling Qualities Rating Method (HQRM), a method of determining probable versus improbable flight conditions, the probability of encountering the OFE was determined to be 1 in 1000 (1E-3) and of entering the LFE to be 1 in 100,000 (1E-5).

AC 25-7D

Further to the calculation, AC 25-7D teaches that you combine the predicted failure rate (or the likelihood of malfunction) with the probability of being in the flight envelope where the failure is encountered. 
AC 25-7D

AC 25-7D

The stated purpose of determining probable versus improbable failure was solely to establish the required handling qualities. In the following Table E-2, the further delineation by atmospheric disturbance demonstrates the increasing demand on controllability as the airplane gets pushed around.

AC 25-7D

From the JT610 Final Report:
FAA advisory Circular 25-7C Appendix 5 lists the probability of being outside the normal flight envelope as 1E-3. Therefore, a condition that meets the integrity requirements for Major within the normal flight envelope also meets the Hazardous integrity requirements for operational flight envelope.
Therefore, unintended MCAS operation FHA events were not evaluated in the fault tree analysis as there were assess MAJOR in the normal flight envelope; Boeing indicated that is consistent with FAA regulations and the Boeing process. 
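
The arithmetic behind this claim can be sketched as follows. The targets are the per-hour rates used in this discussion (MAJOR 1E-5, HAZARDOUS 1E-7, catastrophic 1E-9) and the 1E-3 envelope probability is the figure quoted above; the names are my own. The sketch only reproduces the multiplication. Whether the envelope probability may be credited at all for a malfunction is a separate question.

```python
# Per-flight-hour targets as used in this discussion.
TARGETS = {"MAJOR": 1e-5, "HAZARDOUS": 1e-7, "CATASTROPHIC": 1e-9}

P_OUTSIDE_NORMAL_ENVELOPE = 1e-3  # quoted probability of being outside the NFE


def combined_rate(failure_rate, envelope_probability):
    """Rate of 'failure AND in that envelope', assuming the two are independent."""
    return failure_rate * envelope_probability


def meets(rate, severity):
    """True when the rate is at or below the target for the given severity."""
    return rate <= TARGETS[severity]


major_design = TARGETS["MAJOR"]  # a system built only to the MAJOR target

with_credit = combined_rate(major_design, P_OUTSIDE_NORMAL_ENVELOPE)
print("with envelope credit:   ", meets(with_credit, "HAZARDOUS"))
print("without envelope credit:", meets(major_design, "HAZARDOUS"))
```

With the credit, a MAJOR-level design appears to satisfy HAZARDOUS; without it, the same design falls short by two orders of magnitude.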

Claiming the Discount

The safety analysis is ultimately managing functional development assurance level. 

Functional development assurance level applies at the airplane level, and down to both hardware and software. 

A system meeting the HAZARDOUS criterion is assigned functional development assurance level B. 

The functional development assurance level prescribes the software and hardware levels. 

From ARP4754A:

From DO-178C:
Software Level Definition
This document recognizes five software levels, Level A to Level E. For the example failure condition categories listed in section 2.3.2, the relationships between these software levels and failure conditions are:
a. Level A: Software whose anomalous behavior, as shown by the system safety assessment process, would cause or contribute to a failure of system function resulting in a catastrophic failure condition for the aircraft.
b. Level B: Software whose anomalous behavior, as shown by the system safety assessment process, would cause or contribute to a failure of system function resulting in a hazardous failure condition for the aircraft.
c. Level C: Software whose anomalous behavior, as shown by the system safety assessment process, would cause or contribute to a failure of system function resulting in a major failure condition for the aircraft.
SAE ARP4754A describes the influence of system architecture on Functional Development Assurance Level (FDAL) with multiple members. Effectively, a level B hazard requires a single member at FDAL B or two members at FDAL C. A system whose software is developed to meet a level C hazard does not necessarily satisfy a level B hazard.
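The simplified rule stated above can be sketched as a small check. This is only an illustration of the one-line summary in the text, not the full ARP4754A assignment process; the function name is hypothetical, and the sketch ignores any independence argument that would be needed between the members.

```python
# Assurance levels ordered from most rigorous (A) to least rigorous (E).
LEVELS = "ABCDE"


def covers_level_b(member_fdals):
    """Simplified rule from the text: one member at FDAL B or better,
    or at least two members at FDAL C or better."""
    ranks = [LEVELS.index(level) for level in member_fdals]
    has_level_b_member = any(r <= LEVELS.index("B") for r in ranks)
    level_c_members = sum(r <= LEVELS.index("C") for r in ranks)
    return has_level_b_member or level_c_members >= 2


for members in (["C"], ["C", "C"], ["B"]):
    print(members, covers_level_b(members))
```

A single level C member fails the check, which is the point of the closing sentence above: level C development alone does not satisfy a level B hazard.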

From ARP4754A:

On the hardware front, the goal in general is to limit the failure rate to an acceptable level.

The malfunction of MCAS is HAZARDOUS in the operational flight envelope. That is to be construed as: if you are in the operational flight envelope and MCAS then malfunctions, it is HAZARDOUS. The concern is serial, not independent. There is no relief for the malfunction case.

The loss of MCAS and being in the operational flight envelope is a different question. For this case, the failure of MCAS is independent of being inside the operational flight envelope.

The failure rate for MCAS malfunction AND being in the operational flight envelope must be no greater than 1E-7 per hour. If the probability of being in the operational flight envelope is 1E-3, then MCAS must limit its failure rate to no more than one time in 10,000 hours (1E-4 per hour).
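The arithmetic behind this budget can be sketched as follows. The sketch assumes the two events are independent, which is exactly the assumption the discount relies on; the function and constant names are illustrative only.

```python
import math

HAZARDOUS_TARGET = 1e-7  # per flight hour, from the text
P_IN_OFE = 1e-3          # probability of being in the operational flight envelope


def allowed_rate(target, condition_probability):
    """Failure rate permitted when the condition is truly independent."""
    return target / condition_probability


rate = allowed_rate(HAZARDOUS_TARGET, P_IN_OFE)
print(f"allowed failure rate: {rate:.0e} per hour")
print(f"roughly one failure in {round(1 / rate):,} hours")
```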

Malfunction rate is distinguishable (in this context) from failure rate, where a malfunction is an output that appears valid but is misleading (active hazard) and a failure is a simple loss of function (passive hazard). 

An MCAS malfunction was treated as MAJOR in the normal flight envelope, a rate of 1 in 100,000 (1E-5) malfunctions per hour. 

An MCAS malfunction was treated as HAZARDOUS in the operational flight envelope, a rate of 1 in 10,000,000 (1E-7) malfunctions per hour.

A MAJOR malfunction rate can be met by a single threaded system. 

A HAZARDOUS malfunction rate cannot be met by a single threaded avionics system reliant on a single sensor.

If two channels are required to achieve a fail-safe architecture, then the function is only available if both channels are available. Thus the loss of function is more likely with a fail-safe system than it would be for a single thread.
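A rough sketch of that trade-off, under stated assumptions: two identical, independent channels, a function that shuts down whenever either channel is lost or the channels disagree, and purely hypothetical per-hour values. Treating per-hour rates as probabilities is itself a simplification.

```python
def single_thread(p_loss, p_malfunction):
    """One channel: its own failures pass straight through."""
    return {"loss": p_loss, "malfunction": p_malfunction}


def dual_fail_safe(p_loss, p_malfunction):
    """Two channels that must agree before any output is allowed."""
    # The function is lost if either channel is lost, or if either channel
    # malfunctions and the disagreement shuts the function down.
    p_channel_out = p_loss + p_malfunction
    loss = 1 - (1 - p_channel_out) ** 2
    # An erroneous output gets through only if both channels malfunction.
    malfunction = p_malfunction ** 2
    return {"loss": loss, "malfunction": malfunction}


P_LOSS = 1e-4         # hypothetical per-hour value for one channel
P_MALFUNCTION = 1e-5  # hypothetical per-hour value for one channel

single = single_thread(P_LOSS, P_MALFUNCTION)
dual = dual_fail_safe(P_LOSS, P_MALFUNCTION)
for name, result in (("single thread", single), ("dual fail-safe", dual)):
    print(f"{name}: loss {result['loss']:.1e}, "
          f"malfunction {result['malfunction']:.1e}")
```

With these numbers the dual arrangement loses the function about twice as often as the single thread, while the chance of an undetected malfunction drops by several orders of magnitude.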

From the JT610 Final Report:  
The FHA for uncommanded MCAS activation was classified as Major therefore, the FMEA and FTA were not required. 
FMEA would have been able to identify single-point and latent failures which have significant effects as in the case of MCAS design. 
It also provides significant insight into means for detecting identified failures, flight crew impact on resolution of failure effect, maintenance impact on isolation of failure and corresponding restitution of system.
FTA would have also been able to show if the system architecture meets the numerical criteria set by the FHA. Again, in general, only failures categorized as Hazardous or Catastrophic are evaluated, even though in some situations, complex single-string Major failures are evaluated. Another benefit of FTA that had been missed was to demonstrate compliance with probabilities for combinations of failures. If a system does not meet minimum allowable probability, FTA can indicate where system is deficient and where mitigating action can be applied.

No Discount

The discount proposition is sensible based on combining independent factors to set loss of function failure rate. This is the very basis of high integrity design.

Exposure window is applied to failure rate to arrive at probability. Exposure window does not relieve the malfunction-based design assurance level. It does provide for required inspection intervals or other factors required for continuing certification.  

The probability of being in the operational flight envelope applies to exposure window.

On one hand, the probability of an MCAS malfunction and the probability that the airplane is outside of the normal flight envelope are combined to achieve a composite failure rate.  This leads to approving a system with one-tenth the integrity to achieve a MAJOR hazard as acceptable for a HAZARDOUS hazard. It does make numerical sense when considering loss of function.

On the other hand, if the airplane is outside of the normal flight envelope and the malfunction of MCAS is still HAZARDOUS, then there is no discount in malfunction  rate.

The malfunction rate, once you are outside the normal flight envelope, should still be 1E-7 malfunctions per hour.

D.P. Davies wrote about this very issue regarding stick pushers in "Handling the Big Jets" first published in 1967, third edition in 1971. Stick pushers were used as a stall identification device specifically to prevent "super stall" on high-tail jets.
All this philosophy is conditional upon the qualification imposed that the device must be sufficiently reliable. Now 'sufficiently reliable' in this case is defined as:
On failing to operate when required to operate: The stall probability rate of 1 in 100,000 x the per flight failure rate of 1 in 100 flights = 1 in 10 million; which is the same rate as that assumed for failure in a blind landing. Expressed more simply this means that, 99 times out of 100 occasions on which the airplane is stalled (which is one in every 100,000 flights), the pusher will push and recover the airplane; on the hundredth occasion the pusher is assumed to have failed and the airplane possibly suffers a catastrophe. But this will occur only once in 10 million flights. However much you might object to this possibility, it is an acceptable level (because it is extremely remote) and one on which are based a lot of other risks which are run in civil airline operation.
On operating when not required to operate (the runaway case): For a modest upset, 1 in 100,000 flights; for a severe upset 1 in 10,000,000 flights. (A modest upset is defined as not worse than zero g. A severe upset is defined as significantly negative g but not beyond proof negative.) These values are closely related to autopilot certification which even the stoutest opponent of stick pushers never seems to question. 
Davies confirms the method summarized below for MCAS, regarding HAZARDOUS in the operational flight envelope:

  • loss of function
    • take the 1E-3 discount off of HAZARDOUS failure rate
    • 1E-4 per hour failure rate in MCAS case
  • malfunction (runaway) case
    • Full HAZARDOUS malfunction rate 1E-7 per hour
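The numbers in the Davies passage and in the summary above can be checked with a short sketch. It assumes independence for the loss-of-function cases and, following the argument made in this section, applies no envelope factor at all to the runaway case. Davies works per flight and the MCAS figures are per hour; the sketch keeps them separate. All names are illustrative.

```python
import math

HAZARDOUS_TARGET = 1e-7


def combined_probability(p_condition, p_failure):
    """Joint probability of two independent events."""
    return p_condition * p_failure


# Davies, pusher fails to operate: a stall on 1 flight in 100,000,
# with the pusher failed on 1 flight in 100.
davies_loss = combined_probability(1 / 100_000, 1 / 100)

# MCAS loss of function: 1E-3 envelope probability, 1E-4 per hour failure rate.
mcas_loss = combined_probability(1e-3, 1e-4)

# Runaway case as argued here: the malfunction is hazardous without any
# independent prior condition, so the envelope factor is taken as 1.
runaway_with_1e4_system = combined_probability(1.0, 1e-4)
shortfall = runaway_with_1e4_system / HAZARDOUS_TARGET

print(f"Davies loss of function: {davies_loss:.0e} per flight")
print(f"MCAS loss of function:   {mcas_loss:.0e} per hour")
print(f"Runaway, 1E-4 system:    {runaway_with_1e4_system:.0e} per hour")
print(f"Shortfall against 1E-7:  about {shortfall:.0f} times")
```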
From the JATR Report:
Recommendation R3.2: The FAA should issue a policy statement on the need for caution and early negotiation with the certification authority when an applicant proposes using additional guidance not originally intended for showing compliance to system safety requirements. 
Observation O3.2-A: The JATR team observed that the SSA takes credit for the probability that the aircraft will be flying in certain portions of the flight envelope, as documented in AC 25-7C. A probability of 1E-3 for the aircraft in the operational flight envelope (OFE) was used in combination with the probability of the system failure to achieve the 1E-7 minimum probability required for the “hazardous” MCAS failure condition. Use of AC 25-7C is not a standard industry approach for 25.1309 compliance. The JATR team’s view of the intent of the probability of 1E-3 for the OFE in the HQRM is to select flight test cases for handling qualities evaluation, not to support the quantitative aspects of 25.1309 or 25.672(c) compliance.
Finding F3.11-A: HQRM guidance from AC 25-7C was applied for the evaluation of control systems malfunctions. The application of the probabilistic aspects of this guidance was appropriate to the determination of the required handling qualities, but may not be suitable for evaluation of the failure condition per AC 25.1309-1A, System Design and Analysis, and AC 25-7C. 
Finding F3.11-B: For § 25.1309 compliance, the criticality of the failure condition should account for intensifying conditions, such as crew workload or multiple cockpit indications, and effects and interrelationship of failures with the flight envelopes. 
Finding F3.11-C: Boeing’s application of HQRM allowed for a reduced envelope in the evaluation of SPEED TRIM FAIL, which may not meet the intent of guidance within AC 25-7C and AC 25-1309-1A.
The FAA will need to offer some commentary, but the fact that JATR rejected the discount claim is very, very troubling. Boeing discounted MCAS to achieve this one condition. But Boeing has stated that this discount is business-as-usual. How many other system malfunctions are discounted in this manner?

The effects of one misleading AoA vane were assessed separately: a MAJOR effect from ADIRS, no consequence to EDFCS, and a MAJOR hazard from Stabilizer Trim. 

The combination of MAJOR effects in Air Data, in Stall Warning, and in MCAS arising from one AoA malfunction was not evaluated. Three MAJOR effects, each demanding non-normal checklists, in the takeoff and climb out flight phase, have been shown to be an overwhelming workload, apparently CATASTROPHIC.

Is MCAS Stall Identification?

Boeing has consistently characterized MCAS as modifying the handling characteristics of the airplane and, in particular, as not being a part of stall warning or stall identification. Boeing has absolutely described Speed Trim as part of Stall Identification. 

Stall warning is just that, a warning. It normally does not take any action on the flight controls. Stick shaker is a clear stall warning. Elevator Feel Shift is an opposing force, not a driving force.

Stall identification (augmentation) is an action - it is doing something to the airplane to push the nose down. Stick pushers and stick nudgers are commonly used for stall identification.  A stick nudger would have performed nicely in place of MCAS, but that would have required redesign of the feel system. Speed Trim is performing a stall identification function.

Stall Identification should not suffer from single point malfunction, it should be inhibited if commanding negative g, and it should have an OFF switch. 

Speed trim is single thread, has no normal acceleration inhibit, and has no OFF switch. It is not at all clear how the situation has been reached, as it extends into the 737NG at least. MCAS would benefit from each of these features. 

Speed Trim was added to the 737-300. The original version of Speed Trim was inhibited in high alpha scenarios. 

The 737-700 certification was held up by the European Joint Aviation Authorities (JAA) after the FAA had certificated it, in part over making a change to the Speed Trim System such that it would operate at any speed.  The issue is plainly discussed as Stall Identification.  
25 FEBRUARY, 1998 
737-700 receives JAA approval after stall warning changes 
Boeing's 737-700 obtained European Joint Aviation Authorities certification on 18 February after changes were made to increase stall warning. The modifications meet the JAA's insistence that the pilot be able to identify clearly the occurrence of a stall, even after the activation of the stick shaker.  The resulting changes to the speed trim system, and related wiring, ...
Boeing Next Generation 737 chief project engineer Pete Rumsey says, "The JAA asked us to fly beyond the stick shaker. We came close to the specific requirements, but did not meet the letter of the law, which was written for older aircraft." The advanced design of the new wing means that, in a stall, "-lift degrades very gradually. The aircraft continues to be controllable", says Rumsey. 
The JAA insisted on the addition of a system to raise pilot awareness. "We are adding a speed trim system that will demonstrate the stall characteristics more. It will push the nose down as the aircraft goes into a stall," he says. Normally, the speed trim system is switched off automatically as the stick shaker is activated. The JAA ruling means that original safety systems have been redesigned to allow the trim system to activate in the event of a stall.
From AC25-7D:
The airplane is considered to be fully stalled when any one or a combination of the characteristics listed below occurs to give the pilot a clear and distinctive indication to cease any further increase in angle-of-attack, at which time recovery should be initiated using normal techniques.
  1. The pitch control reaches the aft stop and is held full aft for two seconds, or until the pitch attitude stops increasing, whichever occurs later.
  2. An uncommanded, distinctive, and easily recognizable nose down pitch that cannot be readily arrested.
  3. The airplane demonstrates an unmistakable, inherent aerodynamic warning of a magnitude and severity that is a strong and effective deterrent to further speed reduction. This deterrent level of aerodynamic warning (i.e., buffet) should be of a much greater magnitude than the initial buffet ordinarily associated with stall warning. 
  4. The activation point of a stall identification device that provides one of the characteristics listed above. 

Speed Trim commands Airplane Nose Down as the airplane slows down. This meets criterion 4, adding to criterion 2 (above).

An example of a Speed Trim schedule is shown below:

The 737 NG FCOM does describe Speed Trim as part of Stall Identification.

From AC25-7D:
Probability of artificial stall warning and stall identification systems operating inadvertently: 
The probability of inadvertent operation of artificial stall warning systems, during critical phases of flight, should not be greater than 1E-5 per flight hour. 
To ensure that inadvertent operation of the stall identification system does not jeopardize safe flight, and to maintain crew confidence in the system, it should be shown that:
  • No single failure will result in inadvertent operation of the stall identification system; and
  • The probability of inadvertent operation from all causes is improbable (not greater than 1E-5 per flight hour).
A single failure of an AoA vane causes both MCAS and Stall Warning to trigger "inadvertently", but it does not cause Speed Trim to trigger (Speed Trim is based on airspeed). Speed Trim is still single-thread with respect to an airspeed malfunction and to its hardware output.
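The two AC 25-7D criteria quoted above can be expressed as a simple check. This is only a sketch: the class, the field names, and the example rates are hypothetical and are not taken from any certification document.

```python
from dataclasses import dataclass


@dataclass
class StallIdDesign:
    name: str
    single_failure_can_trigger: bool   # can one failure cause inadvertent operation?
    inadvertent_rate_per_hour: float   # from all causes


def meets_inadvertent_operation_guidance(design, limit_per_hour=1e-5):
    """Both quoted criteria must hold: no single failure may trigger the
    system, and the all-causes rate must not exceed the limit."""
    return (not design.single_failure_can_trigger
            and design.inadvertent_rate_per_hour <= limit_per_hour)


# Hypothetical examples, chosen only to exercise the two criteria.
single_sensor = StallIdDesign("single sensor", True, 1e-5)
dual_sensor = StallIdDesign("dual sensor with comparison", False, 1e-6)

for design in (single_sensor, dual_sensor):
    print(design.name, meets_inadvertent_operation_guidance(design))
```

The single-sensor example fails even though its rate meets the numerical limit, because the single-failure criterion stands on its own.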

Boeing has previously denied MCAS to be a part of stall identification.

The JATR report was justifiably concerned on the need to protect from a single-point malfunction.

From the JATR Report:
Recommendation R3.7: The FAA should review how compliance was shown for the stall identification system on the B737 MAX with respect to inadvertent operation due to single failures. 
Finding F3.7-A: The JATR team considers that system features on the B737 MAX might constitute a stall identification system. This system is vulnerable to inadvertent actuation due to a single failure, which would not meet the accepted guidance contained within AC 25-7C, Chapter 8, Section 228.
MCAS is used only with flaps up, so it is not expected to be in operation close to the ground.

Speed Trim is described to operate when flaps are down or gear is up or airspeed below a threshold. I am assuming it is flaps down and gear up or flaps up and airspeed below a threshold. This allows the gear down discrete to block Speed Trim activation near the ground.
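The assumed enabling logic can be written out as a boolean expression. This encodes only the assumption stated above, not Boeing's actual implementation; the function name and the threshold value are placeholders.

```python
def speed_trim_enabled(flaps_down, gear_up, airspeed_kt, threshold_kt):
    """Assumed logic: (flaps down AND gear up) OR
    (flaps up AND airspeed below a threshold)."""
    flaps_up = not flaps_down
    return (flaps_down and gear_up) or (flaps_up and airspeed_kt < threshold_kt)


THRESHOLD_KT = 200.0  # placeholder, not a published figure

# Flaps down with gear down: the gear discrete blocks activation near the ground.
print(speed_trim_enabled(True, False, 150.0, THRESHOLD_KT))
# Flaps down with gear up: enabled.
print(speed_trim_enabled(True, True, 150.0, THRESHOLD_KT))
# Flaps up: enabled only below the airspeed threshold.
print(speed_trim_enabled(False, True, 180.0, THRESHOLD_KT))
print(speed_trim_enabled(False, True, 280.0, THRESHOLD_KT))
```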

Ultimately, Speed Trim (and MCAS) are unbounded. They are allowed to drive the stabilizer to its full travel. A stick nudger or pusher, or any dedicated actuator used for augmentation is designed with limited authority by mechanical capability. 

From the JT610 Final Report: 
Inadvertent operation of the stall identification system should not cause catastrophic ground contact. This should be achieved by limiting the effect of the stall identification system to that necessary for stall identification purposes, without undue flight path deviation (e.g., by limiting the stroke of a stick pusher). 
Alternatively, if inadvertent operation could result in catastrophic ground contact according to 25.1309(b)(1), the probability of inadvertent operation must be extremely improbable. 
Inhibition of the system close to the ground (e.g., for a fixed time after liftoff or below a radar altitude) would not normally be an acceptable means of compliance with this requirement.
I am personally quite surprised that a normal acceleration factor was not included in MCAS. While I understood that Boeing had removed the "above 1g" criteria when MCAS was extended to low altitude, wing-level conditions, I would have expected an inhibit term, something like normal acceleration less than 0.7g. 
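Such an inhibit term might look like the sketch below. The 0.7g figure is the illustrative value suggested above, not a certified limit, and the function name is hypothetical.

```python
NZ_INHIBIT_G = 0.7  # illustrative figure from the text, not a certified value


def mcas_command_allowed(mcas_requested, normal_accel_g,
                         inhibit_below_g=NZ_INHIBIT_G):
    """Pass the nose-down request only while normal acceleration
    stays at or above the inhibit threshold."""
    return mcas_requested and normal_accel_g >= inhibit_below_g


print(mcas_command_allowed(True, 1.0))  # level flight: request passes
print(mcas_command_allowed(True, 0.5))  # already unloading: request blocked
```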

From AC25-7D:
Normal operation of the stall identification system should not result in the total normal acceleration of the airplane becoming negative.
AC25-7D Paragraph 42 offers further guidance for stall identification, from which it seems that there should be a procedure to deactivate the augmentation, an OFF switch. 

From AC25-7D:
A means to quickly deactivate the stall identification system should be provided and be available to both pilots. 
It should be effective at all times and should be capable of preventing the system from making any input to the longitudinal control system. 
It should also be capable of canceling any input that has already been applied, from either normal operation or from a failure condition.
One of the flights (from the last 18 years) that dealt with stuck stall warning took action to suppress stall warning. This seems a very reasonable feature.

From the JT610 Final Report:
On the fourth flight, the stick shaker occurred after takeoff. 
The flight crew elected to continue as returning would result in overweight landing and the weather along the route was clear. 
About 40 minutes after takeoff, the flight crew pulled the circuit breaker of the affected control column with intention to eliminate noise and to make the stick shaker warning on the other side functioning normally.
JATR pointed out that the role of Speed Trim System (STS) and MCAS relates specifically to the suitability of the aircraft without any augmentation.

From the JATR Report:
Finding F3.4-A: The acceptability of the natural stalling characteristics of the aircraft should form the basis for the design and certification of augmentation functions such as EFS and STS (including MCAS) that are used in support of meeting 14 CFR part 25, subpart B requirements.
Recommendation R3.5: The FAA should review 14 CFR 25.201 (Stall Demonstration) compliance for the B737 MAX and determine if the flight control augmentation functions provided by STS/MCAS/EFS constitute a stall identification system. 
Finding F3.5-C: The JATR team considers that the STS/MCAS and EFS functions could be considered as stall identification systems or stall protection systems, depending on the natural (unaugmented) stall characteristics of the aircraft. From its data review, the JATR team was unable to completely rule out the possibility that these augmentation systems function as a stall protection system.

What about Speed Trim?

Speed Trim System was introduced in 1984 on the 737-300. Speed Trim has performed well over the course of all these years. Speed Trim System includes a second monitor processor with a SPEED TRIM FAIL alert. The monitor processor cannot prevent the command processor from erroneous output. The discussions above would be relevant to Speed Trim and are likely based largely on how Speed Trim System was originally approved. Speed Trim System (not just the MCAS sub-function) is subject to the same hazards that MCAS prescribes. As long as the MCAS specific output discretes are protected sufficiently, the aft column cutout function is therefore intact, and any malfunction of Speed Trim (not MCAS) is likely at most a MAJOR hazard, as with any runaway.

The subsequent installations of Speed Trim on 757 and on 747-400 were designed with a command and monitor processor, each with an independent source of air data, whereby both processors can stop a malfunction (fail-safe). There was no open mandate to upgrade the 737 Speed Trim System to fail-safe, especially with the addition of MCAS, as part of a company-wide assessment of safety. Each model has unique factors, and those other models have two fully functioning stab trim systems, whereas the 737 has one stab trim system and a wheel.

Misleading AoA Vane is HAZARDOUS

I would also conclude that a false stall warning from one SMYD is a MAJOR effect. 

The fragmented safety analysis process does not reflect the natural combination that will occur from this one single sensor failure - instead it is treated like a bunch of small bits, failing to appreciate the combined workload that can overwhelm pilots. 

From The Boeing Ops Bulletin (6 Nov 2018)

There is no denying that Boeing and the FAA were presented with a cacophony of flight deck effects, with MCAS malfunction added on top. Both the FAA and Boeing continued to view the situation as business-as-usual; no alarm bells were going off that a single AoA malfunction generated HAZARDOUS workload.

If a single AoA vane still causes a MAJOR event (Airspeed and Altitude disagree) and also another MAJOR event of false stall warning (stick shaker, minspeed/PLI anomaly, feel force increase), then the combination of two MAJOR events simultaneously is HAZARDOUS. 

The addition of MCAS malfunction takes the combination to CATASTROPHIC. Two tragedies stand in testimony; there can be no denial.

With a need for continued functionality in the presence of one failed sensor, a third AoA source or an enhanced means is needed to detect a malfunction in either existing AoA vane. A synthetic AoA may be achieved by reference to airspeed and a basis for gross weight sufficiently accurate to resolve AoA differences. This is a generic problem facing many Boeing airplane models and, while not urgent, it demands corrective action, even if only on a go-forward, best-practices basis. 
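One way such a synthetic AoA could work is sketched below. It assumes the textbook lift balance (lift equals load factor times weight) and a linear lift curve; every numeric value, including wing area and lift-curve coefficients, is hypothetical, and a real implementation would need far more (flap configuration, Mach effects, filtering). The estimate is used only to decide which of two disagreeing vanes is the more suspect.

```python
import math

G = 9.81  # m/s^2


def synthetic_aoa_deg(mass_kg, airspeed_ms, air_density, wing_area_m2,
                      cl_zero, cl_alpha_per_rad, load_factor=1.0):
    """Estimate AoA from L = n * W and a linear lift curve CL = CL0 + CLa * alpha."""
    lift_n = load_factor * mass_kg * G
    dynamic_pressure = 0.5 * air_density * airspeed_ms ** 2
    cl = lift_n / (dynamic_pressure * wing_area_m2)
    return math.degrees((cl - cl_zero) / cl_alpha_per_rad)


def suspect_vane(left_deg, right_deg, synthetic_deg, tolerance_deg):
    """Return the vane that disagrees more with the estimate,
    or None when the two vanes agree within tolerance."""
    if abs(left_deg - right_deg) <= tolerance_deg:
        return None
    left_error = abs(left_deg - synthetic_deg)
    right_error = abs(right_deg - synthetic_deg)
    return "left" if left_error > right_error else "right"


# Hypothetical airplane and flight condition.
estimate = synthetic_aoa_deg(mass_kg=60_000, airspeed_ms=120.0, air_density=1.225,
                             wing_area_m2=125.0, cl_zero=0.2, cl_alpha_per_rad=5.5)
print(f"synthetic AoA estimate: {estimate:.1f} deg")
print("suspect vane:", suspect_vane(22.0, 3.0, estimate, tolerance_deg=5.0))
```

The estimate only has to be accurate enough to separate a grossly wrong vane from a plausible one, which is a much weaker requirement than replacing the vanes.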

Peter Lemme

Follow me on twitter: @Satcom_Guru
Copyright 2019 All Rights Reserved

Peter Lemme has been a leader in avionics engineering for 38 years. He offers independent consulting services largely focused on avionics and L, Ku, and Ka band satellite communications to aircraft. Peter chaired the SAE-ITC AEEC Ku/Ka-band satcom subcommittee for more than ten years, developing ARINC 791 and 792 characteristics, and continues as a member. He contributes to the Network Infrastructure and Interfaces (NIS) subcommittee developing Project Paper 848, standard for Media Independent Secure Offboard Network.

Peter was Boeing avionics supervisor for 767 and 747-400 data link recording, data link reporting, and satellite communications. He was an FAA designated engineering representative (DER) for ACARS, satellite communications, DFDAU, DFDR, ACMS and printers. Peter was lead engineer for Thrust Management System (757, 767, 747-400), also supervisor for satellite communications for 777, and was manager of terminal-area projects (GLS, MLS, enhanced vision).

An instrument-rated private pilot, single engine land and sea, Peter has enjoyed perspectives from both operating and designing airplanes. Hundreds of hours of flight test analysis and thousands of hours in simulators have given him an appreciation for the many aspects that drive aviation; whether tandem complexity, policy, human, or technical; and the difficulties and challenges to achieving success. 


  1. Thanks Peter, great piece- albeit hard for me to grasp all the aspects.
    Is this a typo? You write: ... This leads to approving a system with one-tenth the integrity to achieve a MAJOR hazard as acceptable for a HAZARDOUS hazard. It DOES make numerical sense.
    You mean: it DOES NOT ??

  2. Hi Peter, seems vital work. As ex UN career staff who lost colleagues in the Addis crash, I know IMO would be all over this if it were maritime, ITU if telecomms, WHO if health, on and on. So why is ICAO barely being mentioned anywhere including not by you? If FAA/NTSB and the Euro equivalent are stiffing them, that seems another red flag for your list. Peter Quennell

  3. after my third re-read:
    Math error concerning the combination of probabilities committed by Boeing. Combined event probability is multiplied single event prob only if events are independently random! In this case, MCAS has the power to bring the plane out of normal speed envelope, and MCAS is designed to trigger outside of normal envelope. Therefore the single events are by no means independent.

  4. On the other hand, if the airplane is outside of the normal flight envelope and the malfunction of MCAS is still HAZARDOUS, then there is no discount in malfunction rate.

    The malfunction rate, once you are outside the normal flight envelope, should still be 1E-7 malfunctions per hour.

    As a matter of fact, the dependency of A) OFE and B) MCAS runaway is extremely strong. If B) occurs in NFE, the probability for A) becomes 1 = 100%. Thus, the probability for B)alone is equal to the probability for A)&B) combined. The "discount" is plainly wrong.