Design Processes and Operation
1. Introduction
Management is under constant pressure to introduce enhanced processes and technology to deliver improvements in operational efficiency. From such developments arise new and unfamiliar human-machine interfaces, and new roles for the humans operating these systems. This is where the vital role of governance lies: it defines the manner in which design processes and operation are undertaken within an organisation. Governance also covers the understanding of the information contained within the system and the dispersion of that information and knowledge throughout the organisation. Disasters can arise from a variety of sources, technical or environmental, but this essay focuses on the human factors that cause such catastrophic system failures, and on how they might be eliminated. By its very nature, human error is impossible to eliminate entirely.
It is useless to attempt to change human nature; progress is made by designing the system to operate safely with humans as an integral part of the process. I believe it is impossible to ensure that no disasters will occur in the future, but this essay will look at the methods used to keep the possibility of such disasters to a minimum. I will look at the processes and principles applied to design and monitoring within organisations, highlighting relevant case studies where appropriate to illustrate my points.
2. Why Disasters Happen
Disasters become more likely as technology progresses and systems grow more complex. Over the past 30 to 40 years a technological revolution has occurred in the design and control of high-risk systems.
This has brought about radical and still little-understood changes in the tasks that the human operator is expected to perform. There are several factors affecting human performance in such systems, and different types of human error need to be recognised.
2.1 Types of Errors
Each disaster has various contributory factors, but it has been stated that 60-90% of major accidents in complex systems have human error as a primary cause. Making errors is an integral part of human nature, as demonstrated by a survey showing that skilled word-processing operators select inefficient commands 30% of the time.
Defining error types is important in order to identify the cognitive state of the operator at the instant the erroneous action was taken. This classification then allows processes to be formed that try to prevent such error types from recurring, or to mitigate their effects.
2.1.1 Error Classification
A good classification of human error from an engineer's viewpoint was defined by Kletz, who separated errors into four broad classes. These are included here as a basis for the case study discussion later.
Mistakes: these are errors made where the operator does not know what action to take or, worse still, believes he is taking the correct action when it is in fact wrong. They can be caused either by cognitive errors or by the application of a rule that does not fit.
Violations (Non-Compliance): these are where the operator knows what to do but does not do it. Though often called violations, the operator frequently believes a departure from the rules is justified, so a more appropriate name is non-compliance.
Mismatches: here the task was beyond the physical or mental ability of the operator.
Slips: here the intention is right but the action is wrong.
2.1.2 Latent and Active Errors
In considering the human contribution to system disasters it is important to distinguish between two kinds of error: active and latent. The effects of active errors are felt almost immediately, whereas the adverse consequences of latent errors can lie dormant within the system for some time, only becoming evident when they combine with other factors to breach the system's defences. Active errors are generally associated with the "front-line" operators of a system, such as control room operators. Latent errors are likely to originate in activities removed in both time and space from the direct control interface, such as design and maintenance.
Analysis of recent disasters has revealed latent errors to be the greatest threat to safety in a complex system. One example where latent errors were the primary cause is the Bhopal incident, in which a gas leak from a small pesticide plant devastated the central Indian city of Bhopal. It could be argued that the most significant error in the entire incident was locating such a high-risk plant so close to a densely populated area. This was a classic example of a latent error: the problem lay dormant from plant construction until the catastrophic incident in which other contributory errors combined. It highlights a latent error originating in government and management that underpinned the disastrous effects on the local population. This is just one of many examples from Bhopal, and it underlines the requirement for the governance body to spend time and money addressing such issues.
All types of error learnt from incidents should be monitored and classified in order to learn from them and reduce the possibility of their recurring in the future.
2.2 Sources of Error
Types of error have been outlined above, but these errors can originate in a variety of areas. When analysing any serious disaster, it is likely that several errors combined with tragic consequences. It should be acknowledged that these errors lie not only with the operators and the design but also flow right up to managerial and government levels. In order to prevent disasters the causes need to be recognised, so a case study of Chernobyl is undertaken in section 3. Different incidents obviously have different causes, but Chernobyl covers the majority of them.
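To make the monitoring and classification of errors called for above more concrete, the sketch below shows one hypothetical way an incident log could combine Kletz's four error classes with the active/latent distinction. The class names, fields and example record are illustrative assumptions, not an established scheme.

```python
from dataclasses import dataclass
from enum import Enum, auto

class ErrorClass(Enum):
    """Kletz's four broad classes of human error."""
    MISTAKE = auto()         # operator did not know the correct action
    NON_COMPLIANCE = auto()  # operator knew the rule but departed from it
    MISMATCH = auto()        # task beyond physical or mental ability
    SLIP = auto()            # right intention, wrong action

class ErrorTiming(Enum):
    ACTIVE = auto()  # consequences felt almost immediately
    LATENT = auto()  # lies dormant until combined with other factors

@dataclass
class IncidentError:
    description: str
    error_class: ErrorClass
    timing: ErrorTiming
    originating_level: str  # e.g. "operator", "maintenance", "management", "government"

# Hypothetical record for the Bhopal siting decision discussed above.
bhopal_siting = IncidentError(
    description="High-risk plant sited close to a densely populated area",
    error_class=ErrorClass.MISTAKE,
    timing=ErrorTiming.LATENT,
    originating_level="government / management",
)

def summarise(errors):
    """Simple roll-up an organisation might review periodically."""
    counts = {}
    for e in errors:
        key = (e.error_class.name, e.timing.name)
        counts[key] = counts.get(key, 0) + 1
    return counts

print(summarise([bhopal_siting]))
```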
2.3 Increased Automation in Complex Systems
A remarkable development of recent years has been the extent to which human operators have become increasingly remote from the processes they nominally control. Machines of growing complexity now intervene between the human and the physical task. The most profound changes to the operation of complex systems came with the advent of cheap computing power, which effectively introduced a human-computer interface (HCI) between the operator and the system. This interface permits only limited interaction between the human and the now remote process. With the introduction of these systems, driven by the governance body to improve efficiency (and for the expected safety benefits), the system operates more effectively within "normal" system tolerances.
However, in normal operation the human is purely in a supervisory role, loosely defined as "initiating, monitoring, and adjusting processes in systems that are otherwise automatically controlled" (Sheridan and Hennessy, 1984). The problems arise when the human operator is required to take over once the system state drifts outside this safe operational range. This type of supervisory role for the human operator is represented in the following diagram:
Fig. 1 - The Basic Elements of Supervisory Control
This increasingly common type of control system has brought about a radical transformation of the human-system relationship. It can be seen that the computer, rather than the human, performs as the central actor.
The main reason for the human operator's continued presence is to apply the human's unique powers of knowledge-based reasoning to cope with system emergencies. Ironically, this task is particularly ill suited to human cognition. The problem is that emergency scenarios may be so alien and complex that humans perform badly, making errors of judgement. The generic probability of human error rises dramatically under stressful conditions and unfamiliar tasks:
- Simple, frequently performed task, minimal stress: 0.001
- More complex task, time constraints, some care necessary: 0.01
- Complex unfamiliar task, little feedback, some distraction: 0.1
- Highly complex task, considerable stress, little time available: 0.3
- Extreme stress, rarely performed task: 1.0
Worse still, such out-of-tolerance scenarios have sometimes been entered deliberately, as in the Chernobyl disaster, where reactor tests led to the reactor being run at unsafely low power levels by operators who possessed only a limited understanding of the system they were controlling.
This was coupled with a task that made dangerous violations inevitable. Clearly, the governing body should be required to assess the trade-off between the benefits of increased automation and the safety implications of supervisory operators having a low cognitive appreciation of the detailed process being automated. This introduces a whole topic relevant to corporate governance strategies in operation, commonly referred to as the "Ironies of Automation", which is discussed later.
2.4 Systems Becoming More Opaque
With catastrophes becoming increasingly unacceptable, technology is applied to high-risk complex systems (such as chemical and nuclear sites) in the form of Automatic Safety Devices (ASDs) to protect against all known breakdown scenarios.
These types of developments have made these systems more complex, tightly coupled and highly defended. As a consequence the systems are increasingly opaque to the people who manage, maintain and operate them. The opacity has two facets: not knowing what is happening and not understanding what the system can do. Analysis of major accidents reveals that the safety of the system was eroded by latent errors.
Humans operate best in dynamic environments where slips and mistakes are immediately visible, such as driving. Complex systems can require several errors or events to coincide before a visible response is communicated to the operator, permitting this build-up and combination of latent errors. With respect to the governance of such systems, the design and development stages should concentrate on decreasing the duration of such errors rather than on reducing their frequency. I am not suggesting that the complexity of high-risk systems is unnecessary; it clearly has benefits, through ASDs and the like, in improving operational safety and efficiency.
As we have seen, though, disasters still happen despite all of this, hence the need for governance to improve the human's role in, and knowledge of, operating such complex systems.
3. Chernobyl Disaster
On 26 April 1986, at 1:23 a.m. Moscow time, disaster struck the region of Chernobyl in northern Ukraine: the destruction of the No. 4 RBMK-1000 nuclear reactor following an uncontrolled power excursion. The disaster destroyed the reactor and sent a cloud of radioactive dust a mile into the air.
Radioactive dust was carried to surrounding countries, and beyond, raising levels of radioactivity worldwide. A relevant and appropriate quote is "a nuclear disaster anywhere is a nuclear disaster everywhere". The consequences of this disaster were, and are, enormous and too extensive to list here. More relevant is determining the primary causes, so that possible governance strategies for future prevention can be discussed later in the document. The test being run was a set of experiments on the reactor that meant operating at unacceptably low and unstable power levels.
This was driven by initiatives to save money but also by the curiosity of the operators. Authority to proceed with the tests was given to station staff without the formal approval of the Safety Technical Group, and other similar Russian plants had refused such tests on safety grounds. The fact that the tests were authorised at all demonstrates "top-level" institutional and managerial errors, combined with a general lack of safety culture at government level. The following two events demonstrate this further:
- At 14:00 on 25 April, as part of the test plan, the ECCS (Emergency Core Cooling System) was disconnected from the primary circuit, stripping the reactor of one of its major defences. This should not have been tolerated and represents another managerial failure.
- At 14:05 the Kiev controller requested Unit 4 to continue supplying the grid, but the ECCS was not reconnected. Although this did not contribute directly to the subsequent disaster, it was indicative of the lax attitude of operators toward safety procedures.
The subsequent nine hours of operation at around 50 per cent of full power increased xenon poisoning, making the plant more difficult to control at low power. Here lies both a design failure, through instability, and a managerial failure in permitting a low safety culture to develop amongst operators. These two events are indicative of how poor governance of the nuclear plant allowed safety standards and operator knowledge to slide, with devastating consequences. Later sections will discuss governance strategies to combat the system failures demonstrated above.
Automation of such high-risk sites is necessary for safety, but a knock-on effect is a lack of appreciation and knowledge in the operator of the technical operation. Although a lesser issue in the cause of Chernobyl, this is a significant topic in the governance of such plants and is discussed in the following section.
4. Ironies of Automation
4.1 Definition of the Ironies
While on the topic of human-system interaction it is appropriate to consider this subject, expressed well by Lisanne Bainbridge (1987) of University College London.
It concentrates on the difficulties that lie at the heart of the relationship between humans and machines in advanced technical installations. This is an important issue for the governance of organisations, and its implications cover both the design and the operation of facilities. As organisations strive to be as efficient, reliable and safe as possible, system designers come to see the human operators as inefficient and unreliable, and desire to supplant them with automated devices. There are two ironies here:
- The first is that designers' errors make a significant contribution to accidents and incidents.
- The second is that designers seeking to eliminate human error still leave the operator to do the tasks the designer cannot think how to automate.
This means that after automation there are two categories of task to be undertaken by the human operator. He or she is firstly required to monitor the automated system.
If the system state drifts outside of tolerance, the requirement is to take over and stabilise the process using manual control skills, and to diagnose the fault as a basis for shutdown or recovery using cognitive skills. However, it is well established that humans in monitoring roles cannot maintain effective vigilance for more than a short period of time and are consequently ill suited to detecting rare, abnormal events. The second operator task is to take over manual control when automation fails, but little practice of manual control is experienced, leading to de-skilled operators over time, ironically more so in more reliable plants. The irony here is that when something goes wrong, operators need to be more skilled to deal effectively with the incident, not less skilled as transpires in this scenario. The underlying issue is that automated systems were not designed with human interfacing as a priority.
To reduce the probability of serious incidents arising from the points above, organisations can take steps in the design and operation (monitoring) stages. Some of these are outlined in the following section.
4.2 Approaches to Solutions
4.2.1 Monitoring
In situations where low-probability events need to be noticed quickly by operators, designs should incorporate artificial assistance such as audio and visual alarms.
In designs with a large number of loops in the process, some form of alarm analysis is required to allow the operator to focus on the correct part of the plant. Much care needs to be taken in the design of large alarm systems, as a proliferation of flashing lights can hinder rather than aid the operator. The governance of such alerting systems should be a major consideration in the design of complex high-risk systems, such as the control and alerting systems for nuclear and chemical plants. The concept of using Multi-Function Displays (MFDs) on VDUs seems appropriate to me.
This permits the operator to access the data he requires from a single terminal rather than using multiple terminals. The analogy I recall is the Eurofighter cockpit, which uses three MFDs as the primary displays. The information desired by the user is easily accessible, and any warnings outside these displays are flagged to the pilot's attention so as not to go unnoticed. As with cockpit design, directly wired instruments are always present as a backup in case of VDU failure. I believe this principle should be applied to all complex control systems.
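To make the idea of alarm analysis more concrete, the sketch below shows one hypothetical way incoming alarms could be filtered and grouped by plant area, so the operator is pointed at the right part of the process rather than faced with a proliferation of flashing lights. The priority scale, plant areas and display limit are illustrative assumptions only.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Alarm:
    tag: str         # instrument tag raising the alarm
    plant_area: str  # e.g. "feed pumps", "reactor cooling"
    priority: int    # 1 = highest, 3 = lowest (assumed scale)

def analyse_alarms(active_alarms, max_displayed=10):
    """Suppress low-priority noise and highlight the busiest plant area."""
    # Show the most urgent alarms first, limited to what an operator can absorb.
    displayed = sorted(active_alarms, key=lambda a: a.priority)[:max_displayed]

    # Group all active alarms by plant area to suggest where to focus attention.
    by_area = Counter(a.plant_area for a in active_alarms)
    focus_area = by_area.most_common(1)[0][0] if by_area else None

    return displayed, focus_area

# Example usage with invented alarms.
alarms = [
    Alarm("PT-101", "reactor cooling", 1),
    Alarm("TT-204", "reactor cooling", 2),
    Alarm("LT-310", "feed pumps", 3),
]
shown, focus = analyse_alarms(alarms)
print([a.tag for a in shown], "focus on:", focus)
```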
4.2.2 Working / Long-Term Knowledge
The points above underline the requirement for operators to maintain their manual skills. This can be approached in two ways. One possibility is to allow the operators hands-on manual control of the process for a limited time each shift. This may prove unfeasible, however, in high-risk plants. In that case management should provide high-fidelity simulators to develop the fast reactions necessary in emergency scenarios.
4.2.3 Human-Computer Collaboration
Governance should ensure that design processes embed a philosophy in which the human and the machine are designed together, for collaboration. This would be an improvement over past designs in which humans were an afterthought and were consequently isolated from the system. One method of addressing this issue is to enforce task analyses at the design stage. The tasks humans may be required to carry out with the machine can be broken into "swim lanes", where tasks are grouped for the human and for the computer. This allows the necessary human-computer interfaces to be recognised and understood, designing the human in as an integral part of the system:
Fig. 2 - "Swim Lane" Task Analysis
The shaded circles indicate the HCI (Human-Computer Interface) between the tasks (white boxes), and this diagrammatic form also gives designers and operators a simple overview of the system architecture, building the all-important knowledge base.
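A minimal sketch of how such a swim-lane allocation might be captured at the design stage is given below; the task names, lane labels and the rule for spotting interfaces are assumptions for illustration rather than a standard notation.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    lane: str               # "human" or "computer"
    next_task: str | None   # name of the task that follows, if any

# Hypothetical task breakdown for a simple supervisory loop.
tasks = [
    Task("observe displayed plant state", "human", "adjust setpoint"),
    Task("adjust setpoint", "human", "regulate process"),
    Task("regulate process", "computer", "raise alarm on deviation"),
    Task("raise alarm on deviation", "computer", "diagnose fault"),
    Task("diagnose fault", "human", None),
]

def hci_points(task_list):
    """An HCI point exists wherever a task hands over to a task in the other lane."""
    by_name = {t.name: t for t in task_list}
    points = []
    for t in task_list:
        nxt = by_name.get(t.next_task) if t.next_task else None
        if nxt and nxt.lane != t.lane:
            points.append((t.name, nxt.name))
    return points

# Each pair corresponds to a shaded circle in a diagram like Fig. 2.
print(hci_points(tasks))
```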
5. Processes for Safe Operation
This section will concentrate on the processes that can be adopted by organisations for safer operation. As a basis for this discussion, and for personal reference, I will be using a diagram from the course lecture notes with which I am sure you are familiar.
Around this I will attempt to identify the important design and monitoring issues to be recognised by organisations, where appropriate using relevant case studies to reinforce the points made. Being a systems engineer, I will also consider the issues from a systems view of this process, which is of particular relevance to my studies.
Fig. 3 - Processes to design out disasters
5.1 Safe Operations
This diagram gives a complete overview of a generic process applied with a view to eliminating disasters.
The left side covers the company's knowledge management, design process and maintenance philosophy, whilst the right side concentrates on operational management. Several basic principles are contained within this process that constitute the main principles of safe operation:
5.1.1 Hazards Recognised and Understood
If hazards are not recognised and addressed early in the design process they can have catastrophic consequences in operation. An approved approach combines HAZOP (Hazard and Operability Study), seeking relevant user experience, and the use of standards and benchmarks. In 1984 there was a leak of radioactive material into the sea from the British Nuclear Fuels Limited (BNFL) plant at Sellafield, Cumbria. The initial reasoning was operator error, but the underlying fault was design error.
Poor pipework design meant that radioactive solids settled in the sea lines during pumping between tanks. These were later washed out to sea by a flow thought to be "safe". This is a clear example of where a HAZOP analysis would have detected, and subsequently prevented, such environmental damage. A recommended stage after HAZOP is to conduct an FMECA (Failure Mode, Effects and Criticality Analysis) on the identified risks. This identifies the ways in which automated systems can be designed to react to failures so as to optimise the system response. Modern aircraft are a good example: with a power loss the system is designed to provide power to the hydraulics first to enable attitude control, then to a radio to communicate the trouble, and so on, until all failure modes are covered by contingencies (within reason).
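As a minimal illustration of how FMECA results might be recorded and ranked, the sketch below scores each identified failure mode by severity, likelihood and detectability and sorts by the resulting criticality figure. The scales, failure modes and scores are invented for illustration, not taken from any real analysis.

```python
from dataclasses import dataclass

@dataclass
class FailureMode:
    item: str
    mode: str
    severity: int       # 1 (negligible) .. 10 (catastrophic), assumed scale
    occurrence: int     # 1 (rare) .. 10 (frequent)
    detectability: int  # 1 (easily detected) .. 10 (unlikely to be detected)

    @property
    def criticality(self) -> int:
        # A simple risk priority number: higher means address first.
        return self.severity * self.occurrence * self.detectability

# Hypothetical entries loosely based on the aircraft power-loss example above.
modes = [
    FailureMode("main generator", "total power loss", 9, 3, 2),
    FailureMode("hydraulic pump", "loss of pressure", 8, 4, 3),
    FailureMode("radio", "transmitter failure", 5, 2, 4),
]

# Rank failure modes so design effort and contingencies target the worst first.
for fm in sorted(modes, key=lambda m: m.criticality, reverse=True):
    print(f"{fm.item}: {fm.mode} (criticality {fm.criticality})")
```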
One of the eternal goals of management is to improve performance, and this requires experiments to develop the required improvements, with subsequent changes to processes and operation. This increases the chances of unacceptable performance and dangerous incidents, because relatively unknown processes are being applied. This concept was expressed simply in diagrammatic form by Rasmussen, who aptly named it "Safety Migration":
Fig. 4 - Rasmussen's Safety Migration
The acceptable operational environment is the central region bounded by the three restrictive boundaries. To ensure that the lower boundaries are not crossed under management pressure (with the increased possibility of accidents), a shallow gradient should be applied. In order to maintain this shallow gradient, HAZOPs should be made a priority by the governing body. This technique mainly covers new hazards.
In order to maximise current knowledge of hazards that already exist, the relevant experienced parties should be consulted. Standards should be adhered to, as these already define safe operating conditions; an example is the noise action levels defined by the HSE. Management should also compare the organisation to leading benchmark companies in the same sector, to keep an open mind and acknowledge better practices.
Links for such benchmarking can be sought and established through joint ventures and projects, which are increasingly common in modern evolving companies such as BAE SYSTEMS (with its American and European joint ventures).
5.1.2 Ensuring Equipment is Fit for Purpose
Hazards identified by the above processes should be fed back so that safety is inherent in the design. It should be appreciated that the first design will not be flawless, so user reviews should ideally be conducted on prototypes. Aside from safety issues, the cost of implementing changes increases as you progress through the system lifecycle, so the earlier the better. Mock scenarios and simulations allow the equipment to be used in the correct context, hopefully revealing flaws that would otherwise only appear in operational use.
A personal example of this was working on the Nimrod WSIR (Weapon System Integration Rig) at BAE SYSTEMS Warton. A rig with a complete mock-up of the avionic and mission systems exists that is extremely representative of the real aircraft. This allows each individual piece of equipment to be integrated and qualified into the system. Flaws in the design are highlighted and rectified before first flight, and the human factors in the cockpit can be realistically simulated. This process all helps towards achieving a mature product that is understood and predictable.
5.1.3 Systems and Procedures to Maintain Integrity
There are three main categories of process to be considered in governance:
- Operational processes - define the philosophy for the physical day-to-day operation of the plant and delivery to external customers.
- Support processes - deliver products and services to internal customers.
- Design and development processes - how the systems are designed.
The best design and equipment in the world still require safe operating and maintenance procedures reinforced by effective processes. Piper Alpha was a distinctive example of failure in maintenance procedures.
A pressure safety valve had been removed for maintenance, and the next shift, unaware of this, tried to use the backup pump, with tragic effect. One method that is popular now is a "permit to work" system.
Permit to Work System
A permit-to-work system is a formal written system used to control certain types of work that are potentially hazardous. A permit-to-work is a document that specifies the work to be done and the precautions to be taken. Permits-to-work form an essential part of safe systems of work for many maintenance activities. They allow work to start only after safe procedures have been defined, and they provide a clear record that all foreseeable hazards have been considered.
A permit is needed when maintenance work can only be carried out if normal safeguards are dropped, or when new hazards are introduced by the work. Examples are entry into vessels, hot work and pipeline breaking. Operational procedures should be defined for inherent safety. As discussed earlier, an integral part of the procedures should be to keep humans integrated within the automation systems so that a knowledge base of the process is maintained among operational staff. Management should ensure operators experience little time pressure, so that erroneous actions are less likely. Under stressful workloads people tend to drop safety issues first, so a good operational guideline defines employees' working time as 85% working and 15% thinking.
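The sketch below is a minimal, hypothetical model of a permit-to-work record of the kind described above: work may begin only once the specified precautions are confirmed and the permit is signed off, and an incoming shift can query which permits are still open at handover. The fields and checks are assumptions for illustration, not a real plant system.

```python
from dataclasses import dataclass, field

@dataclass
class PermitToWork:
    permit_id: str
    description: str        # the work to be done
    precautions: list[str]  # precautions that must be in place before work starts
    confirmed: set[str] = field(default_factory=set)
    authorised_by: str | None = None
    closed: bool = False

    def confirm(self, precaution: str) -> None:
        if precaution in self.precautions:
            self.confirmed.add(precaution)

    def may_start(self) -> bool:
        """Work may start only when every precaution is confirmed and the permit is signed."""
        return self.authorised_by is not None and set(self.precautions) <= self.confirmed

def open_permits(permits):
    """What an incoming shift should review at handover."""
    return [p for p in permits if not p.closed]

# Example usage with an invented permit.
permit = PermitToWork(
    permit_id="PTW-042",
    description="Remove relief valve from backup condensate pump",
    precautions=["pump electrically isolated", "line depressurised", "blank flange fitted"],
)
permit.confirm("pump electrically isolated")
print(permit.may_start())  # False: not all precautions confirmed, not yet authorised
```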
Process Maturity
Safer operations come from evolving processes that learn from past iterations. This is demonstrated in Fig. 3, where the feedback obtained from reviews and audits flows back into the organisation's accumulated knowledge base. This then allows improvement of subsequent design processes: the process has evolved. It is beneficial to study processes applied to similar cases, as the experience gained there can advance maturity. Mature processes should be targeted, as they are stable, reliable and well understood.
This reduces the chances of latent errors flowing through into operation.
5.1.4 Competent Staff
The intelligence and background of staff should be carefully considered. In the context of monitoring-type roles at high-risk sites, it may not be wise to recruit the most academically brilliant candidates, such as those with PhDs. At first sight this seems the logical choice, but such people are likely to possess inquisitive minds that seek experimentation, which can be detrimental to operational safety, as seen in the Chernobyl disaster. A wiser move may be to employ not less intelligent, but certainly more steadily-minded, operators who are more likely to be content in a monitoring role.
A culture should be encouraged in which staff are trained in the "whys" of actions as well as the "how to". In an emergency they are then more likely to take appropriate action, through a better understanding of the system. Training can be structured to force operators to contemplate the "whys" directly, using questions on mock scenarios and suggested actions. A safety-conscious culture should result in more conscientious operators.
Classes analysing past disasters, including some horrific personal accounts, are likely to reach most people's consciences. To ensure that knowledge and attitude do not relax over time, "refresher" courses should be implemented.
5.1.5 Emergency Procedures
Emergency procedures are an essential safety precaution. Firstly, these procedures should be defined and catered for in the design stages.
Referring to the Piper Alpha emergency evacuation, one of the main flaws was the extensive time required to set up the rescue platform (up to 45 minutes) in difficult conditions of smoke and heat from the raging fire - clearly a design issue. Once effective procedures are defined, regular drills need to be run to ensure that actions become second nature, especially in high-risk scenarios. Generous time for such drills needs to be scheduled by management so that reviews can be conducted and problems resolved. Management also needs to be trained to give strong leadership in crisis situations. This can be achieved by realistic drills and simulations in which managers' effectiveness can be monitored by external bodies.
A combination of analysing theoretical scenarios and participating in full-scale drills would be most beneficial.
5.1.6 Monitoring and Audits of Safety
By this point we hopefully have safely designed systems with effective operating and emergency procedures. However, there is still a need to ensure that the system and its procedures are being operated as intended, in a safety-conscious manner. This is where audits and monitoring are utilised. As discussed under the "Ironies of Automation", if the automation is efficient then operators can become complacent, becoming less stringent and more relaxed about safety as their trust in the system grows. Using simple audits, operator records can be checked to see whether plausible events and values are being recorded in the logbooks.
If not, management needs to address why this has occurred. One possible reason may be time pressure on the job, where these tasks, often seen as menial, are the first to be rushed or bodged. In addition to audits, more technical methods can be applied, such as monitoring trends in the output of the system. For example, if attributes of the system output are drifting gradually, this may be indicative of changes within the system, possibly dangerous ones such as widening tolerances in machined parts.
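As a minimal sketch of the kind of trend monitoring described above, the fragment below compares a rolling mean of recent output measurements against a baseline and flags a gradual drift before individual values breach their alarm limits. The window size, threshold and sample data are illustrative assumptions.

```python
from statistics import mean

def detect_drift(measurements, baseline, window=20, threshold=0.02):
    """Flag a gradual drift: the rolling mean of the last `window` samples
    deviating from the baseline by more than `threshold` (same units)."""
    if len(measurements) < window:
        return False, 0.0
    deviation = mean(measurements[-window:]) - baseline
    return abs(deviation) > threshold, deviation

# Example: machined part diameters (mm) slowly creeping upwards (invented data).
nominal = 25.00
samples = [nominal + 0.002 * i for i in range(40)]
drifting, dev = detect_drift(samples, nominal)
print(f"drift detected: {drifting}, deviation of rolling mean: {dev:.3f} mm")
```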
5.2 Systems View of Processes
A systems approach stands back and looks at the whole system lifecycle. A critical issue for safety is ensuring that there are enough resources, correctly directed, to work the problem. A trend in the past was a misdirection of resources, depicted by the following model. The left pyramid represents the time and effort expended to reduce errors in each sector; the right pyramid represents the suggested importance of each sector.
Fig. 5 - Resource Importance Model
This simple model demonstrates the mismatch between the effort expended to minimise error in each sector and that sector's relative importance. The balance needs to be addressed and corrected at the very top level within an organisation. A common modern practice in systems engineering is concurrent engineering, where the lifecycle stages overlap to reduce time to production. However, I see another advantage: it allows faster feedback of knowledge, for example between prototyping and design.
This links again into developing process maturity and subsequently safer systems with fewer "surprises". As I have experienced, an important aspect of ensuring safe systems is to maintain traceability between requirements and the final product. This way all safety aspects identified in the requirements are sure to be satisfied, not "getting lost" in the process. This is largely an IT issue and is achieved by cross-referencing on computer systems, a strategy I believe would be worth the resources for governance to implement in high-risk systems such as military aircraft.
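A minimal sketch of such cross-referencing is given below: each safety requirement is linked to the design items and verification evidence that claim to satisfy it, and anything left untraced is flagged. The identifiers and data layout are invented for illustration.

```python
# Hypothetical traceability data: requirement IDs mapped to the design items
# and verification evidence that claim to satisfy them.
requirements = {
    "SR-001": "Operator must be alerted to loss of cooling within 2 seconds",
    "SR-002": "Backup instruments must remain readable on VDU failure",
    "SR-003": "Emergency shutdown must be possible from the control room",
}

trace_links = {
    "SR-001": ["alarm subsystem design v3", "rig test report TR-17"],
    "SR-002": ["cockpit panel layout drawing D-204"],
    # SR-003 has no links yet
}

def untraced(requirements, trace_links):
    """Safety requirements with no design item or verification evidence attached."""
    return [rid for rid in requirements if not trace_links.get(rid)]

# Anything printed here risks "getting lost" between requirements and final product.
for rid in untraced(requirements, trace_links):
    print(f"{rid}: {requirements[rid]} (NOT TRACED)")
```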
6. Conclusions
A common theme throughout has been the importance of knowledge: its accumulation, and its dispersion to the people in the system who need it. Accumulating knowledge and feeding it back into the processes constantly reduces the unknowns in a system, improving process maturity.
Mature processes improve understanding, reducing the risk of accidents, but only if processes exist within the system to disseminate knowledge to all parties who require it. Organisations should also be governed to ensure that human factors, and the role of humans, are designed in as an integral part of the system. Ideally this would avoid highly automated scenarios where the human is isolated both physically and mentally from the system. Enforcing this design philosophy should leave humans better equipped to deal with emergency situations. Apparent in many of the case studies reviewed was a lack of safety culture. This culture should be targeted at the top of the management structure so that it flows down and embeds itself in all personnel.
However, management can become isolated and believe its safety influence is minimal, such is its physical isolation from the plant. Such scenarios are difficult to resolve; a possible solution is external audits, funded by government, to flag up lax management. I believe a systems approach is an excellent way to tackle the design of high-risk plants. It ensures all system requirements are traced through and satisfied by the final design, including mock-up human factors rigs where required. However, this process is mainly used for design and manufacture acceptance by a customer, so additional processes would be required for operation, maintenance and so on. As stated in the introduction, I believe the fallible nature of humans makes it impossible to say that no disasters will occur in the future.
The ubiquitous theme in my discussion has been the human's role in systems. To minimise the risks in future, with increasingly sophisticated technology, one principle should be applied above all: integrating humans into the system.
7. Bibliography
In addition to material from the HUC 109 System Ergonomics lecture notes, the following sources were used.
Books
Reason, J. (1990) Human Error. Cambridge: Cambridge University Press.
Kletz, T. (2001) An Engineer's View of Human Error, 3rd ed.