Thursday, November 15, 2018

737 MCAS - Failure is an Option

The 737MAX introduced a new feature, the Maneuvering Characteristics Augmentation System (MCAS). MCAS commands the stabilizer to trim nose down in steps (only while flaps are up). From reports, the aft column cutout switch is disabled while MCAS is active. MCAS commands nose down trim at 0.27 deg/sec for about 10 seconds at a time. It then pauses briefly (duration not known, but presumed five seconds). The time history of a fabricated scenario, starting at a stabilizer position of 8 units, shows it would take about 55 seconds to reach the nose-down limit. Starting at a more nose-down position takes less time (e.g., from 4 degrees, about 35 seconds).
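The burst-and-pause pattern described above can be sketched as a short calculation. This is only a minimal sketch, not the time history behind the quoted figures: the function name, the assumption that one trim unit equals one degree, and the assumed nose-down limit of 0 are all hypothetical, so its totals should not be expected to reproduce the 55 and 35 second figures.

```python
def seconds_to_limit(start_deg, limit_deg=0.0, rate=0.27, burst_s=10.0, pause_s=5.0):
    """Elapsed seconds for repeated nose-down bursts to reach the limit.

    rate, burst_s and pause_s come from the reported MCAS behavior;
    limit_deg and the unit-to-degree mapping are assumptions.
    """
    remaining = start_deg - limit_deg
    if remaining <= 0:
        return 0.0
    per_burst = rate * burst_s  # degrees moved in one full burst
    elapsed = 0.0
    # Full bursts, each followed by a pause, while more than one burst remains.
    while remaining > per_burst:
        remaining -= per_burst
        elapsed += burst_s + pause_s
    # Final (possibly partial) burst that reaches the limit.
    return elapsed + remaining / rate


if __name__ == "__main__":
    for start in (8.0, 4.0):
        print(start, round(seconds_to_limit(start), 1))
```

Changing pause_s or limit_deg shows how sensitive the total is to the two values that are not known from the reports.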

MCAS Commanded Stabilizer Trim Without Stopping

REVISED 10:18pm 16 Nov 2018
This commentary and analysis is based on unofficial information and may be inaccurate. Certification and System Development are complex and proprietary undertakings!
The concerns raised are valid, but the circumstances may be a bit different than how I characterize them herein.

The combination of other flight deck alerts with the autonomous MCAS stab commands may be overwhelming.

Failing to override trim with aft column motion would be confusing, if caught unaware.

Manual trim command would pause MCAS for five seconds; then it would pick up again.

The Autopilot Trim Cutout switch would turn off the MCAS trim command.

The Stabilizer Trim Cutout switch would remove power from the trim motor.

The flight crew have the option to grab the stabilizer trim wheel as a brute-force method.

Reports indicate that MCAS was added to the 737MAX as a result of issues with accelerated stalls. These are already abrupt. The deterioration in handling characteristics is reported to be attributed to the engine changes unique to the MAX.

The 737NG speed trim function provides a nose down trim incremental command as the airplane slows into a stall, but it would not likely drive into the stops. Check out this post for all the details relating to 737NG Pitch Control.

(Little-known fact: if you slow down manually, speed trim will trim the stab nose down; but if you slow down on autopilot, the autopilot will trim the stab nose up.)

There should be little doubt MCAS provides benefits as intended.

The 737 low speed upset recovery has been criticized repeatedly (for decades) for needing timely nose down trim, to prevent uncontrolled pitch up, if the elevator is unable to offset the thrust-pitching moment.

Unintended MCAS activation is alarming. It is persistent in its nature, and it brings a host of distracting annunciations to the mix.

The failure to stop the trim by pulling back on the column would be completely unexpected and confusing.

Boeing describes MCAS as a result of a handling quality shortcoming encountered in an accelerated stall, which would likely be tested as a wind-up turn. The pilot holds airspeed and progressively increases bank angle and back pressure to result in stalling at high load factors. The flight condition is most likely entered with the aft column travel beyond the column cutout threshold. If the aft column cutout were active, MCAS would be inhibited before it could do anything. For this reason, I think Boeing decided to disable the column cutout feature for MCAS, via an autopilot software change.

Humans are most confused when something that has always been reliable and trustworthy fails.

The combination of uncommanded trim (first surprise) and then failure to stop it with aft column travel (second surprise) is exponential - and in this situation response time is critical.

Failure is an Option

Adding a new function to an airplane involves a safety assessment. How does the loss of the function create a hazard, and how does its malfunction create a hazard?

The loss of MCAS function has little bearing on any given flight, as low speed upsets are never supposed to happen.

If MCAS is reasonably reliable, it can be assumed to be available when it is needed.

Even if MCAS fails to activate in a stall, there is still a good chance for recovery.

The issue of feature availability gets to Minimum Equipment List and Dispatch rules.

The 737 FCC installation is a "dual-dual" configuration. Within each of the two autopilot computers there are two different processors, each programmed by different people. The greatest threat is a common-mode software failure. Having two different groups program from a common set of requirements is a means to diminish a common-mode failure.

A common-mode failure arises because everyone made the same mistake, either by implementing the requirement wrong in the same way, or because the requirement itself was wrong.

Speed Trim and MCAS
Speed trim is an autopilot function that trims the stabilizer with the autopilot turned off (manual control) on a schedule based on airspeed. As airspeed slows, the stabilizer is moved to a more "nose-down" trim position.  The autopilot outputs speed trim command directly to the electric trim actuator.
FCC Selection For Speed Trim
Only one FCC at a time supplies the speed trim signal to the stabilizer trim electric actuator. When the FCCs get electrical power, FCC A supplies the speed trim signals. If power remains on the FCCs, the on ground signal from the proximity switch electronics unit (PSEU) switches the FCC which supplies the speed trim signals. If one FCC fails, the other FCC automatically supplies the speed trim signal.
Only one of the two internal processors of the "active" FCC calculates the speed trim output. The other internal processor drives the speed trim warning. If power is cycled, FCC A commands speed trim, and until power is cycled again, the role will swap between FCC A and FCC B after every landing.
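The selection rules described here can be sketched as a small state machine. This is a minimal sketch with hypothetical class and method names; it assumes a failure persists across landings and power cycles, which the description does not state.

```python
class SpeedTrimSelector:
    """Tracks which FCC supplies the speed trim signal (illustrative only)."""

    def __init__(self):
        self.failed = set()
        self.power_cycle()

    def power_cycle(self):
        # On power-up, FCC A supplies the speed trim signal.
        self.preferred = "A"

    def on_landing(self):
        # The on-ground signal swaps the supplying FCC if power stays on.
        self.preferred = "B" if self.preferred == "A" else "A"

    def mark_failed(self, fcc):
        self.failed.add(fcc)

    @property
    def active(self):
        other = "B" if self.preferred == "A" else "A"
        if self.preferred not in self.failed:
            return self.preferred
        # If the preferred FCC has failed, the other one takes over.
        return other if other not in self.failed else None


if __name__ == "__main__":
    sel = SpeedTrimSelector()
    history = [sel.active]
    for _ in range(3):
        sel.on_landing()
        history.append(sel.active)
    print(history)
```

With no power cycle, each FCC is the active one on every other flight.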

I do not know if both FCC internal processors calculate MCAS outputs, or if it was like speed trim where one side makes the command and the other trips the warning.  I assume it is like speed trim (one FCC), but this may be a mistake.

Electric Stabilizer Trim Command
Electric stabilizer trim commands are actually of two variants, manual trim and autopilot trim.

Manual trim is comprised of switches and relays. The failure modes of hardware-only configurations do not depend on software. These failures are classic broken wire, shorted wire, stuck contact, open coil, open contact failures. By placing the switching signals in a serial tandem manner, any single failure can be overcome by the capability of the switching elsewhere.

The Pilot column trim switches not only switch the command up and down, but they also provide power to activate the trim.

It takes power and a command to move the motor.

Lose power, no motion.

Lose command, no motion.

The column trim switches go through a column cutout switch that trips if the column is moved significantly in opposition to the trim. The pilot pulling back during nose-down trim would stop the nose-down trim command.

The outputs of the column trim switches go through a control stand cutout switch. This pilot-operated switch is the last electrical barrier.

Thus there are measures at three stages (column switch, column cutout, control stand cutout) to stop runaway stabilizer trim.  Each stage has the ability to stop both power and command.
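The series arrangement of the three stages can be sketched as boolean logic. This is a minimal sketch with hypothetical names; it treats each stage as a simple switch that, when blocking, opens both the power path and the command path.

```python
def manual_trim_motor_runs(switch_power, switch_command,
                           column_cutout_tripped, stand_cutout_selected):
    """True only if both power and command survive all three stages."""
    # Stage 1: the column trim switches supply power and command.
    power, command = switch_power, switch_command
    # Stages 2 and 3: each cutout, when blocking, removes power and command.
    for blocked in (column_cutout_tripped, stand_cutout_selected):
        power = power and not blocked
        command = command and not blocked
    # It takes power and a command to move the motor.
    return power and command


if __name__ == "__main__":
    # Normal trim input, no cutout active.
    print(manual_trim_motor_runs(True, True, False, False))
    # Stuck command contact, but the pilot opposes with the column.
    print(manual_trim_motor_runs(True, True, True, False))
```

Any one blocking stage is enough to stop the motor, which is what makes a single stuck contact survivable.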

Autopilot speed trim output is managed completely differently. The command from the "active" FCC, single processor, is sent to the stab trim actuator in parallel to the manual electric commands. The autopilot stabilizer trim command has only one control stand autopilot trim cutout switch to isolate autopilot trim. Nothing else stands in the way of the autopilot command.

The other control stand stab trim cutout switch will remove power from the actuator regardless of the autopilot command. This is the big hammer.

The autopilot receives a signal from the column switch cutout module that the pilot has moved the column in opposition to the trim command.

Unlike manual electric trim, the active FCC single CPU software decides how to respond to the pilot opposition. 

For speed trim, the active FCC will stop the trim.

From what is reported, in MCAS mode, the active FCC ignores the pilot column inputs to stop the trim command.
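The autopilot trim path, as reported, can be sketched the same way for contrast with manual trim. This is a minimal sketch under the assumptions stated in the text; the function name and the mode strings are hypothetical, and the MCAS behavior reflects reports rather than confirmed design data.

```python
def autopilot_trim_motor_runs(mode, fcc_commanding, column_opposes,
                              autopilot_cutout_selected, stab_cutout_selected):
    """True if the actuator moves in response to an FCC trim command."""
    # Either control stand cutout switch stops the motion in hardware.
    if autopilot_cutout_selected or stab_cutout_selected:
        return False
    if not fcc_commanding:
        return False
    # Software decides how to respond to column opposition.
    if mode == "speed_trim" and column_opposes:
        return False  # speed trim stops when the pilot opposes it
    # Reported MCAS behavior: column opposition does not stop the command.
    return True


if __name__ == "__main__":
    for mode in ("speed_trim", "mcas"):
        print(mode, autopilot_trim_motor_runs(mode, True, True, False, False))
```

The only difference between the two modes in this sketch is the single software branch on column opposition.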

I am very surprised that the FAA Airworthiness Directive does not highlight that the column cutout function will not work for MCAS. It even implies that it will work: "...if relaxing the column causes the trim to move".

From what is reported, in MCAS mode, the active FCC will suspend stabilizer trim commands for something like five seconds if the pilot uses the column trim switch, and if still active, resume trimming nose down.

Availability Vs Integrity
Availability is broadly related to reliability.

Redundancy can improve availability by overcoming most single point failures.

Generally the system Design Assurance Level (DAL) is tied to the level of the hazard created.

Where loss of function creates major hazard, DAL C mandates the software and hardware development levels.  A single threaded hardware solution would easily meet DAL D, but would be pressed to meet DAL C without redundancy. DAL B would almost always have redundancy.

The redundancy is fighting against hardware failures. Software development is managed ever more intrusively as development levels increase. Hardware meets numerical goals, whereas software meets categorical goals.

MCAS command integrity is dependent on the FCC software to faithfully calculate the command based on the value of the sensor.  The command integrity includes a requirement that the sensor is valid for the command to be valid. Sensor integrity, or the potential for misleading data, must be accounted for in the integrity of the command. Reliance on a single sensor would be difficult to justify to satisfy anything greater than a major failure mode. A dual sensor would bring a large benefit to command integrity, and even more so if dealing with a hazardous failure.

My comments about DALs are in the context of integrity, not availability, and of the hazards that they are matched to.


Stabilizer Runaway Hazard Assessment
I would characterize uncommanded stabilizer motion as a hazardous failure condition. Timely response is necessary. Increasing workload is not the paramount issue.

There is no indication of a hardware failure driving the autopilot stabilizer in MCAS mode.

The failure creating the stabilizer runaway hazard is the result of the (single?) FCC software calculating the MCAS command, which includes measures to manage input signal validity before using the inputs to create valid output commands.

Regardless of the autopilot commands, the flight crew can select either cutout switch, or grab the stabilizer trim wheel physically to restore control in an MCAS malfunction.

The malfunction of the active FCC command is a "single point failure" to the autopilot stabilizer trim function that can lead to stabilizer runaway.

Mitigation (autopilot trim cutout switch, stab trim cutout switch) ensures that the FCC malfunction is not a single point catastrophic failure.

Safety margins are reduced by the distractions in workload. The loss of the Flight Management System is a major failure. Providing misleading information to the crew is at least a major failure, and depending on the information, a hazardous failure.

A hazardous failure couples the increase in pilot workload to time-critical response. This is where runaway stabilizer crosses into hazardous, as the pilot must stop the runaway or the airplane will become uncontrollable.

A catastrophic failure is one that has no hope of recovery (one should never give up, though!). A wing breaking off would be catastrophic. The stabilizer jack screw is itself subject to catastrophic structural failure modes.

The malfunction of the MCAS function has a significant flight deck effect. This failure, as is becoming increasingly evident, can be catastrophic if timely action is not taken.

The timely action may be related to maintenance practices as well as tactical actions by the crew to use either cutout switch, or to grab the trim wheel manually.

Unlike the manual electric trim that has three successive barriers that depend solely on hardware, the autopilot electric trim is dependent on a single channel of software calculation and has two successive barriers (the cutout switches).

The flight crew wrestling with the stabilizer trim wheel as a last resort would be a difficult feature to depend on in a safety analysis for MCAS.

The reliance on software in the autopilot case is the most difficult, as the complexities are greatly increased.

Input Signal Management
Input signal management is critical to resisting malfunction.

Speed Trim, and presumably MCAS, use only on-side inputs. Boeing and investigators must be reviewing this, and I hope I have characterized this correctly.

Why is MCAS any different, or less safe?

The issue may be related to using a single source of AOA sensor data.

Underlying the intended function are the failures of the inputs.

No matter the design, if you believe a bad input, you will get a bad output.

Using only one active FCC may create a reliance on "on-side" sensors, and remove the opportunity to compare to the off-side. With a cross-compare, a single failed sensor (valid data, but misleading) would be detected, and a decision could be made on how to proceed.

My first job at Boeing was on the Pitch Augmentation Control System (PACS) for 767 and 757. It was a stand-alone dual computer system. There were about ten engineers dedicated to the project, working together for years. I spent ALL of my time testing failure scenarios of inputs and of outputs.

PACS was unique in that it used in-line monitoring. Both PACS computers had their own inputs. They each sent the other side the values they had sensed. Both PACS computers then decided if there was good agreement between them. A fault was triggered if one input differed significantly, and the two PACS computers could decide how to proceed.
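The exchange-and-compare idea can be sketched in a few lines. This is a minimal sketch, not the PACS implementation: the function names and the tolerance value are hypothetical, and what happens after a fault is declared is left open, as it is in the description above.

```python
def inputs_agree(own_value, received_value, tolerance):
    """One computer's check of its own input against the other side's."""
    return abs(own_value - received_value) <= tolerance


def in_line_monitor(left_sensed, right_sensed, tolerance=2.0):
    """Return True when a fault should be triggered.

    Each side sends its sensed value to the other, and each side runs
    the same comparison. The tolerance is an assumed placeholder.
    """
    left_ok = inputs_agree(left_sensed, right_sensed, tolerance)
    right_ok = inputs_agree(right_sensed, left_sensed, tolerance)
    return not (left_ok and right_ok)


if __name__ == "__main__":
    print(in_line_monitor(5.0, 5.5))   # small difference between sides
    print(in_line_monitor(5.0, 25.0))  # one side reads very differently
```

The comparison cannot say which side is wrong, only that the two disagree, which is why the decision on how to proceed is a separate step.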

Many autopilot designs follow a "brick-wall" design. Rather than talk to the other channel, each autopilot commands an output servo to its own liking. A torque tube takes the output of each servo and the arm wrestling begins. Classic triple channel autopilots depend on two good autopilots overpowering one malfunctioning autopilot. Each autopilot does its own calculations by itself, with its own sensor selections.

A dual-dual architecture is quite complex. The internal dual processing channels and the dual autopilot computers interact to compare data and outputs. There is no question that the two autopilots work together if engaged in a dual-channel mode.

Speed trim, and presumably MCAS, is an odd autopilot feature, as it is applied when the autopilot is not engaged.

Speed trim appears to be a single channel, single processor command. I can only assume MCAS is as well.

A defining moment was when the decision was made that the integrity of the MCAS command should meet a major failure standard, for which a single channel solution would be suitable. The accompanying two cutout switches may have been a compelling argument for downgrading the MCAS command failure from hazardous to major, thus justifying a single channel solution.

Design Assurance Level (DAL) comments may be a red herring, so let me explain my terminology:
  1. The FCC software may be developed to DAL A
  2. Single FCC software provides the command based on single input, single output
  3. Two hardware cutout switches can allow the pilot to turn off the command if it malfunctions
The integrity of the command, even if developed to DAL A, when done as a single channel with a single source of inputs, likely cannot meet anything higher than a major hazard for command integrity.

A major hazard can be dealt with by DAL C and a single point failure (hardware failure rate less than 1 in 100,000 flight hours - once in the lifetime of every airplane).

The MCAS hardware design can meet requirements by including the tandem cutout switches (autopilot trim cutout and stab trim cutout) for credit.

This would say that a malfunction of MCAS is resolved by the cutout switches applied in a timely manner.

It is still likely that the FCC and cutout switch hardware arrangement is satisfactory against a hazardous failure (1 failure in 10,000,000 flight hours - once in the lifetime of the fleet of aircraft of type).

(FYI, catastrophic failure is 1 in 1,000,000,000 flight hours - never once in the fleet of aircraft of a given type)
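The three probability targets can be turned into expected event counts with simple arithmetic. This is a minimal sketch; the exposure figures (100,000 hours for one airplane, 10,000,000 hours for a fleet) are assumptions read back from the "once in the lifetime of" remarks above, not published numbers, and the names are hypothetical.

```python
# Failure-rate targets per flight hour, as quoted in the text.
TARGETS = {"major": 1e-5, "hazardous": 1e-7, "catastrophic": 1e-9}

# Assumed exposure, inferred from the "once in the lifetime" remarks.
AIRPLANE_LIFE_HOURS = 1e5
FLEET_LIFE_HOURS = 1e7


def expected_events(rate_per_hour, exposure_hours):
    """Expected number of failures for a constant rate over the exposure."""
    return rate_per_hour * exposure_hours


if __name__ == "__main__":
    for name, rate in TARGETS.items():
        per_airplane = expected_events(rate, AIRPLANE_LIFE_HOURS)
        per_fleet = expected_events(rate, FLEET_LIFE_HOURS)
        print(f"{name}: {per_airplane:g} per airplane life, {per_fleet:g} per fleet life")
```

Under these assumed exposures, each category is two orders of magnitude rarer than the one before it.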

If the decision had been made that the undetected failure of the integrity of the MCAS command was hazardous, then a dual system relying on multiple sensors would have been more appropriate. This is probably where the fatal flaw lies. 20 years of service history would have supported the legacy view. What that history did not include was reliance on the AOA vane and the removal of the aft column cutout feature.

The active FCC includes a second processor that independently is monitoring the MCAS function and is capable of raising alarms. This aspect would resolve the common-mode failure concern, based on verification error. It is possible (likely) both processors will make the same mistake with the same sensor (a validation error, bad requirement).

The use of a second processor to independently compare the command and drive the warning is a verification benefit; it does not improve the command integrity in the presence of common sensor input failings.

Running the monitor in the non-active autopilot channel, and providing means so that the monitor can shut down the command, is the most logical improvement.

PACS was a dual system with far less stab trim authority, but it was abandoned for vortex generators.

The 737 dual-dual architecture is unique. The decision to make speed trim single channel, single processor goes back to the 737 classic. The MCAS function is just another FCC software module that behaves, at a high level, like speed trim, whose architecture would then have been replicated.

The decision to remove the autopilot aft column cutout feature appears to have been underappreciated. What may have been lost is stepping back and deciding MCAS needed higher integrity. Events have overtaken that decision.

Response to Failure
Generally, faults will lead to loss of function.

The hazard is uncommanded stabilizer trim (malfunction), not the hazard from failure to trim stabilizer (loss of function).

The Speed trim circuitry, which has existed since 737 classic, has been long shown to be reliable, at least based on the lack of any significant incident. That function runs off of airspeed, and presumably the logic to detect a false airspeed input has been reliable. There are undoubtedly many other features in speed trim to prevent malfunction. Regardless, speed trim is certainly capable of uncommanded stabilizer trim commands and yet none have drawn any attention, if they have occurred.

Dealing with Failures and Errors
A validation error is a shortcoming in requirement - the requirement is wrong or has gaps. The best implementation cannot overcome a validation error.  A more tragic saying than "live and learn" applies here. For example, even if the FCC software is DAL A, the requirements must be correct.

A verification error is one that results from implementing the requirement incorrectly. Verification errors are discovered in test and by code inspection. The verification test is a logical process that looks at the implementation of every requirement. Verification errors are more frustrating, as the expectations are well-known and documented.

I am personally confident that Boeing engineers have thought carefully about preventing false MCAS commands. There should be no doubt that they have spent considerable efforts in designing and testing the MCAS function and malfunction, and have fully and completely satisfied their certification obligations, without reservation.

As an engineer, you always look for backstops in your safety assessment.  The existence of the cutout switches and the ability to manually grab the trim wheel may have allowed the command integrity to be minimized.

The alternative use of a dual-channel solution, where both autopilots must agree (and where sensor outliers are revealed), would yield significant benefits.

Why did an uncommanded MCAS activation happen? I don't know, but the reports say it was related to a failure of the AOA vane on one side. 

It's possible that the AOA vane failed in a way not known - that the testing of the failure modes was based on AOA behavior that is not accurate (verification error).

It's possible that there are multiple failures that combine in a way never anticipated (validation error).  AOA vane issues may only be one part of the reason for malfunction of the MCAS function.

If an MCAS malfunction is present in only one FCC, does that mean that the exposure would be every other flight (assuming no power cycle)?

How this relates to JT610 remains the purview of the investigators. Hopefully the CVR will be found to complete the situational awareness assessment.

The exposure window for calculating probability from failure rate is bounded by maintenance. A fault indicated and repaired is a process that has to be modeled. Repeated failure to take proper corrective action extends the exposure window. At some point, this clearly can manifest as combinations of failures that were never envisioned.
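The effect of a longer exposure window can be sketched with the standard constant-failure-rate model, P = 1 - exp(-rate x time). This is a minimal sketch; the model choice, the function name, and the example numbers are assumptions for illustration, not values from any safety analysis.

```python
import math


def probability_of_failure(rate_per_hour, exposure_hours):
    """Probability of at least one failure within the exposure window,
    assuming a constant failure rate (exponential model)."""
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-rate_per_hour * exposure_hours)


if __name__ == "__main__":
    rate = 1e-5  # hypothetical failure rate per flight hour
    # A fault repaired promptly versus one left uncorrected ten times longer.
    for window_hours in (10.0, 100.0):
        print(window_hours, probability_of_failure(rate, window_hours))
```

For small probabilities the result is close to rate times window, so a window ten times longer means roughly ten times the exposure.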


Stay tuned!

Peter Lemme

peter @ satcom.guru
Follow me on twitter: @Satcom_Guru
Copyright 2018 satcom.guru All Rights Reserved

Peter Lemme has been a leader in avionics engineering for 37 years. He offers independent consulting services largely focused on avionics and L, Ku, and Ka band satellite communications to aircraft. Peter was chairman of the SAE-ITC AEEC Ku/Ka-band satcom subcommittee, developing ARINC 791 and 792 characteristics and contributes to the Network Infrastructure and Interfaces (NIS) subcommittee developing Project Paper 848, standard for Media Independent Secure Offboard Network.

Peter was Boeing avionics supervisor for 767 and 747-400 data link recording, data link reporting, and satellite communications. He was an FAA designated engineering representative (DER) for ACARS, satellite communications, DFDAU, DFDR, ACMS and printers. Peter was lead engineer for Thrust Management System (757, 767, 747-400), also supervisor for satellite communications for 777, and was manager of terminal-area projects (GLS, MLS, enhanced vision).

An instrument-rated private pilot, single engine land and sea, Peter has enjoyed perspectives from both operating and designing airplanes. Hundreds of hours of flight test analysis and thousands of hours in simulators have given him an appreciation for the many aspects that drive aviation; whether tandem complexity, policy, human, or technical; and the difficulties and challenges to achieving success.
