Problem  Management  System

 

Release Date:  November 21, 2001

 

 

Produced by:              Thomas Bronack

 

 

 

 

Section Table of Contents

 

 

9.   Problem Management

9.1.  Introduction to Problem Management

9.1.1.  Definition

9.1.2.  Mission

9.1.3.  Objectives

9.1.4.  Scope

9.1.5.  SMC Discipline Interfaces

9.2.  Process Overview

9.2.1.  Problem Definition

9.2.2.  Problem Recognition:

9.2.3.  Problem Reporting and Logging:

9.2.4.  Problem reporters include:

9.2.5.  Required problem information includes:

9.2.6.  To report a problem, via the Global Systems Help Desk:

9.2.7.  To report a problem, directly through Apriori:

9.2.8.  Problem Tracking:

9.2.9.  Problem Determination:

9.2.10.  Bypass/Circumvention:

9.2.11.  Problem Resolution:

9.2.12.  Service Level Document - Post Mortem:

9.2.13.  Management Reports and Review:

9.2.14.  The desired reporting scenario is that each morning:

9.3.  Process Flow

9.3.1.  Problem Reporting

9.3.2.  Entering problems into APRIORI

9.3.3.  Perform Level 1 Support

9.3.4.  Route Problem to Resolver

9.3.5.  Transfer Problem to Another Resolver

9.3.6.  Escalate Problem to Level 2 Support

9.3.7.  Notify Resolver and Area Manager via Pager

9.3.8.  Review Problems

9.3.9.  Track Problems

9.3.10.  Problem Work-Arounds

9.3.11.  Bypass/Circumvention Procedures

9.3.12.  Recovery/Restart Procedures

9.3.13.  Problem Resolution

9.3.14.  Problem Closure Procedures

9.3.15.  Post Mortem Procedures

9.3.16.  Problem Reporting

9.4.  Process Elements

9.4.1.  Products

9.4.2.  Global Systems Help Desk

9.4.3.  Problem Reporters

9.4.4.  Problem Resolvers

9.4.5.  Change Management

9.5.  Roles and Responsibilities

9.5.1.  Problem Manager

9.5.2.  Global Systems Help Desk Personnel

9.5.3.  Problem Reporters

9.5.4.  Problem Resolvers

9.6.  Process Evaluation

9.6.1.  Present System Weaknesses

9.6.2.  Recommendations for Improvement

9.7.  Appendices

9.7.1.  Appendix ‘A’ Service Level Document - Post Mortem

 

 

 

 

 

Section Table of Figures

 

Figure 1:  Help Desk Functions and Procedures

Figure 2:  Definition of a Problem

Figure 3:  Definition of Problem Management System

Figure 4:  Problem Management Objectives

Figure 5:  Proposed Problem Management Process Flow

Figure 6:  Global Systems Help Desk Functions Overview

Figure 7:  Service Level Document - Post Mortem

 

 


 

 

9.            Problem Management

 

The purpose of this document is to provide a description of the Problem Management System and the procedures used to enter, review, update, assign, escalate, resolve, and close problems.

 

Note:  Apriori is a problem management tool that has a “Bubble-Up” data base that displays past problems matching the entered problem abstract.

 

 

Figure 1: Help Desk Functions and Procedures


 

 

9.1.      Introduction to Problem Management

 

 

9.1.1.   Definition

 

This topic defines a PROBLEM and describes how the PROBLEM MANAGEMENT SYSTEM works.

 

PROBLEM definition:

 

Figure 2:  Definition  of  a  Problem

 

 

The Problem Management System has been established to capture and report on encountered problems.   It employs the APRIORI Problem Management System product from Answer Systems, Inc. as a front-end and problem repository.

 

The Problem Management System is used to record problem events, assign them to a resolver, and track problems until they are successfully resolved.   If problems are of a high priority, or outstanding beyond an acceptable period of time, then the Problem Management System will escalate the problem in priority.   When this occurs, the next level in problem support is activated and additional manpower is applied to resolving the problem.

 

 

The PROBLEM MANAGEMENT SYSTEM is defined as:

 

Figure 3:  Definition  of  Problem  Management  System

 

 

Problem Management is the process of detecting and reporting problems that impact services supplied by Technology Operations.   A problem is any unplanned deviation from standards or an expected service delivery. 

 

Problems include:

 

*   Hardware, from mainframe through data network;

*   Software systems, sub-systems, applications, and utilities;

*   Communications applications, devices, and lines;

*   Data Network applications and devices;

*   Human errors;

*   Procedures;  and

*   Environmental failures (e.g., Heating, Ventilation, Air Conditioning, Power, Water, Raised Floor).

 

Intervention may be required to determine, eliminate, or circumvent problems as they are identified.


 

 

9.1.2.   Mission

 

Figure 4:  Problem  Management  Objectives

 

 

The Mission of the Problem Management System is to employ standard procedures for reporting and resolving problems.  The goal of this process is to reduce the impact of failures on service expectations to an acceptable level. 

 

Problem Reporting is used to inform Management and Technical personnel of problems that affect service expectations.   Problems are assigned to resolvers through the Technology Operations department and/or the Global Systems Help Desk.   Escalations, designed to raise the priority of a problem, are incorporated into the Problem Management Process so that appropriate resolution actions can be taken and the outage’s duration reduced.

 

The Problem Management Process will record, assign, escalate, track, resolve and report on any situation that is a deviation from expected service deliveries or standards.

 

Post Mortem procedures are utilized to review resolved problems that have impacted multiple users, or are of a high priority.   The goal of the Post Mortem process is to implement improved problem recovery and resolution procedures, as deemed necessary.


 

 

9.1.3.   Objectives

 

 

The objectives of Problem Management are: 

 

*   Ensure all encountered problems are reported,

 

*   Prioritize problems as to their criticality and business impact,

 

*   Log problems to a Problem Record in the Problem Repository,

 

*   Resolve a majority of reported problems at Global Systems Help Desk,

 

*   Assign problems that cannot be resolved by the Global Systems Help Desk to resolvers,

 

*   Track and manage problems (from origination through resolution),

 

*   Escalate problem resolution in accordance with criticality and duration,

 

*   Close problems when resolved,

 

*   Conduct Post Mortems on resolved problems,

 

*   Recommend updates to supportive documentation,

 

*   Produce and distribute Problem Reports.

 

 

These objectives are met through:

 

*   The use of the APRIORI Problem Management System,

 

*   Personnel assigned to the Global Systems Help Desk,

 

*   Internal Resolvers assigned to the departments of Technical Operations, and

 

*   External Resolvers assigned to problems under the management of Internal Resolvers.

 

 

9.1.4.   Scope

 

 

The Problem Management Process begins with the recognition of a problem and ends when a problem is closed.   In both cases, the problem reporter must be included; initially as the reporter of a problem and finally as the approver of the problem’s resolution.

 

In some cases, the Systems Manager, or On-Line Manager, must also approve the resolution to ensure that the closed problem does not impact other areas beyond the reporter’s domain (e.g., problems that affect multiple users but are reported by only one user).   This “checks-and-balances” procedure is accomplished via periodic problem review meetings and the distribution of problem reports.

 

The APRIORI Problem Management System is used to record and process problem records (each problem has a unique problem record id).  Problems can be entered into the APRIORI system by personnel from:

 

*   Global Systems Help Desk,

*   Technical Operations,

*   Applications Development, and

*   Business End Users.

 

The Technology Operations Problem Management System is responsible for problems within the following areas:

 

*   Host applications,

*   Hardware,

*   Software,

*   Data Network,

*   Terminals,

*   Procedures, and

*   Facilities.


 

 

9.1.5.   SMC Discipline Interfaces

 

 

The Systems Management and Controls (SMC) disciplines interfacing with the Problem Management System are:

 

*   Batch Management:   Provides problem reports for Batch Jobs and their recovery/restart procedures.

 

*   Capacity Management:   Provides problem reports on abends due to capacity shortages, such as DASD Space and Region Sizes.   May also be used to isolate the need for additional hardware and/or the reconfiguration of existing resources.

 

*   Performance Management:   Provides problem reports for performance related weaknesses, such as transaction response times, Batch Job turnaround times, etc.

 

*   Change Management:   Provides problem reports for failures that occur due to a weakness in the Change Management process, such as version and release information, benchmark, testing, etc.

 

*   Recovery Management:   Provides problem reporting when errors in the recovery process occur, such as recovery planning for critical assets, sizing requirements for recovery facilities, and failed recovery/restart procedures.

 

*   Service Level Management:   Provides problem reports when service level delivery to clients is below expected limits, as determined via Service Level Reporting.

 

*   Inventory Management:   Provides problem reporting for items not included in the Inventory, or wrongly reported in the Inventory System (i.e., Acquisition, Redeployment, Termination, or Surplus).

 

*   Configuration Management:   Provides problem reports on failures related to the Configuration as depicted in Configuration Diagrams showing: hardware, software, communications, power, water, facilities, etc.

 


 

 

9.2.      Process Overview

 

Figure 5:  Proposed  Problem  Management  Process  Flow

 

 

Problem Management System procedures are described below.

 

 

9.2.1.   Problem Definition

 

A problem is categorized as:

 

*           A deviation from “Expected Service Delivery”, or

*           A “Standards and Procedures” violation.

 

When either of these conditions is recognized, it must be reported as a problem.

 

 

9.2.2.   Problem Recognition:

 

Problem recognition is the detection and identification of problems, or potential problems, through monitoring, trend analysis, or observation.  The recognition of a problem can come from any point in the system; when identified, problems must be entered into the Problem Management System and routed to a problem resolver.

 

 

9.2.3.   Problem Reporting and Logging:

 

Problem entry is performed via the Apriori program product from Answer Systems.   Apriori provides a Problem Repository data base and is equipped with front-end displays used to enter, access, research, assign, track, and resolve problems.

 

 

9.2.4.   Problem reporters include:

 

*           Technology Operations personnel,

*           Applications Development personnel,

*           Business Users, and

*           Global Systems Help Desk personnel.

 

 

9.2.5.   Required problem information includes:

 

*           Name, location, and phone number of problem reporter,

*           Date and time of problem occurrence,

*           Description of problem and circumstances that led to the problem,

*           Assessment of the severity and impact of the problem, as one of:

            *           Urgent,

            *           Important,

            *           Minor, or

            *           Information.

*           Supporting information, if available and appropriate.
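
The required information listed above can be pictured as a small record type.  The following is a minimal sketch only, not Apriori's actual record layout; the names `ProblemRecord`, `Severity`, and every field name are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import List, Optional


class Severity(Enum):
    # The four severity levels named in the required problem information.
    URGENT = 1
    IMPORTANT = 2
    MINOR = 3
    INFORMATION = 4


@dataclass
class ProblemRecord:
    # Hypothetical field names; the real Apriori record layout is not described here.
    reporter_name: str
    reporter_location: str
    reporter_phone: str
    occurred_at: datetime
    description: str
    severity: Severity
    supporting_info: Optional[str] = None  # only "if available and appropriate"

    def missing_fields(self) -> List[str]:
        """Return the names of required text fields that were left blank."""
        required = {
            "reporter_name": self.reporter_name,
            "reporter_location": self.reporter_location,
            "reporter_phone": self.reporter_phone,
            "description": self.description,
        }
        return [name for name, value in required.items() if not value.strip()]
```

A check such as `missing_fields()` could be run before a problem is submitted, so that incomplete reports are caught at entry rather than by the resolver.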

 

 

9.2.6.   To report a problem, via the Global Systems Help Desk:

 

*           Call the Global Systems Help Desk at ______________

*           Describe the problem to the Help Desk,

*           Define the Impact of the problem,

*           The Help Desk will record the problem,

*           If possible, the Help Desk will resolve the problem,

*           Otherwise, the Help Desk will route the call to a resolver,

*           Notification is performed via Telephone, Beeper, or in Person,

*           Notification of Urgent and Important problems is immediate,

*           Senior Management and impacted Business Function are notified immediately via page for Urgent problems.

 

 

9.2.7.   To report a problem, directly through Apriori:

 

*           Log onto Apriori,

*           Enter problem information,

*           Use problem Search to review past solution data for problems of this type,

*           If possible, repair / bypass problem with supplied information,

*           Otherwise, route problem to resolver via Submit Problem Report feature.

 

 

9.2.8.   Problem Tracking:

 

Problem tracking and escalation are performed by the Global Systems Help Desk for all reported problems.   Problem reporting will be performed on a periodic basis.  Problem turnaround to resolution is dependent upon severity.  Urgent problems will be addressed immediately (within half an hour), while less urgent problems may take longer to address.

 

The resolver, or Help Desk, must contact the problem reporter to gain their acceptance of the problem solution before problem closure procedures can be performed.   Systems Managers may also be informed of problem solutions and their approval sought for some types of problems.   This guarantees that problem reporters are satisfied with resolutions and that problems are truly repaired.

 

 

9.2.9.   Problem Determination:

 

Problems are identified through searches of past problems contained in the Apriori data base, or through analysis efforts performed by the resolver.   The intent of problem determination is to identify the source of the problem at a level sufficient to enable corrective action(s) to be performed.

 

Once the problem is assigned to the resolver, the resolver is responsible for defining the problem’s “root cause” and for developing a solution to the problem.  Restore/Recovery and Bypass/Circumvention procedures should also be supplied by the resolver whenever possible.  This information is added to the free-form text section of the Apriori problem record and made available to future problem searches by reporters/resolvers.

 

 

9.2.10.  Bypass/Circumvention:

 

To allow processing to continue, it is sometimes possible to work around problems through a bypass or circumvention, while recovery/restart operations are used to re-establish the environment just prior to the problem event.  The combination of the two disciplines allows for the re-submission of failed jobs, using alternate components after recovery procedures have been performed. 

 


 

 

9.2.11.  Problem Resolution:

 

Problems are assigned to resolvers responsible for supporting the area impacted by the problem.  If a problem has to be reassigned, or escalated, it is coordinated through the Global Systems Help Desk.  If external resolvers are needed, an internal resolver must be responsible for monitoring and coordinating the actions of the external resolver.  All problem determination, work-around, and resolution information must be entered into the free-form text section of the problem record.

 

Problem solutions that require change control activity will be processed through Change Management before verification of problem solutions can be accomplished.  Solutions that do not require change control activity will be implemented immediately.   In both cases, the problem reporter must approve the problem solution before the problem can be closed.

 

All problem related information must be added to the free-form text section of the problem record before problem closure is finalized.

 

 

9.2.12.  Service Level Document - Post Mortem:

 

Post Mortem reviews for major problems (those affecting multiple users, or impacting business operations) are conducted within 24 hours of the problem, even when resolutions are not available.  The resulting Service Level Document - Post Mortem is supplied to members of the Technology Coordination Group’s Mainframe Steering Committee (TCG-MFSC).

 

The sections contained within this document are:

 

*           Symptom,

*           Problem Analysis,

*           Details,

*           Impact,

*           Solution, and

*           Future Prevention.

 

If problem resolution information was not available in the initial Post Mortem report, the Post Mortem document is updated until the problem solution is available  (see Appendix ‘A’ for Post Mortem document and process).


 

 

9.2.13.  Management Reports and Review:

 

Problem reports are generated for a variety of business reasons, including:

 

*           Open and/or closed problems,

*           High impact/severity problems,

*           Problems by date/time range,

*           Problems by functional area/department,

*           Problems by component/vendor,

*           Problems by type, frequency, duration, etc.

 

In all cases, problem reports can be used to isolate weaknesses in products, applications, standards, procedures, supportive documentation, and training.  Through problem reports and periodic reviews, management and technical personnel can make adjustments to the functions they perform, thereby reducing problem events and improving performance. 

 

Apriori supports a full range of reporting functions.  Its data base design allows:

 

*           Problem record searches for specific information,

*           Grouping data in problem reports,

*           Formatting output for hardcopy, fax, or e-mail,

*           Routing reports and data to files for use by other products (e.g., problem data can be copied to a file in spreadsheet format and then used to generate graphs and other reports from Excel, Lotus 123, or similar products).

 

Reports can be produced on a periodic basis, or via ad-hoc requests.
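
As an illustration of grouping problem data and routing it to a spreadsheet-format file, the sketch below counts problem records by any combination of fields and writes the counts as comma-separated text.  It is an assumption-laden example, not Apriori's reporting facility; the function names `count_by` and `to_csv` and the record field names are hypothetical.

```python
import csv
import io
from collections import Counter
from typing import Dict, List


def count_by(problems: List[Dict[str, str]], *keys: str) -> Counter:
    """Count problem records by the given field names (for example department, status)."""
    return Counter(tuple(p[k] for k in keys) for p in problems)


def to_csv(counts: Counter, header: List[str]) -> str:
    """Render the counts as CSV text that a spreadsheet product can open."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header + ["count"])
    for key, n in sorted(counts.items()):
        writer.writerow(list(key) + [n])
    return buf.getvalue()
```

The same two steps cover several of the report types listed above: only the field names passed to `count_by` change (status, department, component, vendor, and so on).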

 

 


 

 

 

9.2.14.  The desired reporting scenario is that each morning:

 

 

*           1st Level Managers receive Problem Reports (Open/Closed) for their areas,

*           2nd Level Managers receive Summary Reports for the areas reporting to them.

 

The reports should be generated via Apriori and distributed via cc:Mail (or some other electronic media like Fax) to the individual.   A schedule for report generation and distribution should be created and adhered to.   Responses to reported problems should be accomplished via cc:Mail/Fax, when information or resolutions are available.

 

The problem reports listed above should be for every manager in the Technology Operations organization.  Application problem reports should be generated, as well. 

 

The summary report for the Technology Operations executive should include all open/closed problem data for the areas under the executive’s direct control.   Trending analysis reports should be provided to allow the executive to pinpoint where problems most often occur, or where resolutions take the most time to develop.   This information will allow the executive to direct staff to concentrate efforts on eliminating problem areas, or to provide additional training to specified areas of operation.

 

 


 

 

9.3.      Process Flow

 

Figure 6:  Global  Systems  Help  Desk   Functions  Overview

 

 

9.3.1.   Problem Reporting

 

Whenever a deviation from Standards and Procedures occurs, or an Expected Service Delivery is missed, the event must be reported as a problem.   Problem reporting is accomplished through the Apriori Problem Management System, which is under the control of the Global Systems Help Desk (included in the Client Services Department).

 

 

9.3.2.   Entering problems into APRIORI

 

*           Directly

 

Individuals with authority to access Apriori should enter problem information directly into Apriori.   Once entered, the problem must then be routed to a resolver.   If the person entering the problem does not know who the resolver for the problem type is, they should contact the Global Systems Help Desk for assistance.

 

 

*           Via Global Systems Help Desk

 

Individuals not having access to Apriori can report problems to the Global Systems Help Desk by calling (201) 524-4357 (4357 = HELP).   The Global Systems Help Desk will enter the problem into the Apriori Problem Repository and route the problem to the appropriate resolver.

 

 

9.3.3.   Perform Level 1 Support

 

Global Systems Help Desk personnel will perform Level 1 support on problems reported to them by Problem Reporters.   Level 1 support involves reviewing past solutions to problems of the type being reported.   If a resolution is found, then it is applied.   Should the solution resolve the problem, then the problem is closed as a duplicate of a previous problem type.  

 

A high percentage of problems reported to the Global Systems Help Desk are resolved in this manner, because of Apriori’s Bubble-Up data base technology which relates reported problems to past solutions of the same problem type. 

 

 

9.3.4.   Route Problem to Resolver

 

If the problem cannot be resolved by the Global Systems Help Desk, or through the use of Apriori Bubble-Up data base solution displays, then the problem is routed to the functional area responsible for the problem type (e.g., DASD Management for disk failures, or the Capacity and Performance group for slow responses).

 

A list of departments and the resolvers included in the departments is contained within the Apriori system.   If you do not know the resolver associated with the problem type you are reporting, contact the Global Systems Help Desk.
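
The Level 1 and routing decisions described above can be sketched as a lookup followed by a fallback.  This is a simplified illustration: the exact-match lookup stands in for Apriori's Bubble-Up matching, the routing table holds only the two examples given in the text, and the names `ROUTING` and `handle_problem` are hypothetical.

```python
from typing import Dict, Optional, Tuple

# Illustrative routing table built from the two examples in the text;
# the real list of departments and resolvers is held inside Apriori.
ROUTING: Dict[str, str] = {
    "disk failure": "DASD Management",
    "slow response": "Capacity and Performance",
}


def handle_problem(problem_type: str,
                   past_solutions: Dict[str, str]) -> Tuple[str, Optional[str]]:
    """Try Level 1 support first, then route to the responsible functional area."""
    solution = past_solutions.get(problem_type)
    if solution is not None:
        # A past solution exists for this problem type: apply it at Level 1.
        return ("resolved at Level 1", solution)
    # When the responsible area is not known, fall back to the Help Desk.
    area = ROUTING.get(problem_type, "Global Systems Help Desk")
    return ("routed", area)
```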

 

 

9.3.5.   Transfer Problem to Another Resolver

 

Should a problem have to be reassigned to another resolver, then the transfer procedure must be followed.   This procedure guarantees that the new resolver is aware of the problem and accepts responsibility for its resolution.  The process is:

 

*           Determine new resolver to transfer problem to,

*           Contact new resolver and explain problem to them,

*           New resolver agrees that problem is in their area of responsibility,

*           New resolver accepts responsibility for problem,

*           Apriori problem record is updated to reflect new resolver.

 

Should the alternate resolver be an outside consultant/vendor, then a company resolver is assigned management coordination responsibility for the problem and the actions taken by the outside resolver.
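
The transfer rules above amount to two checks: the new resolver must accept the problem, and an outside resolver must have a company resolver coordinating.  The sketch below expresses those checks under assumed names (`Assignment`, `transfer`, and their parameters are hypothetical); the actual record update is performed in Apriori.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Assignment:
    resolver: str
    coordinator: Optional[str] = None  # company resolver overseeing an outside resolver


def transfer(current: Assignment, new_resolver: str, accepted: bool,
             external: bool = False, coordinator: Optional[str] = None) -> Assignment:
    """Return the updated assignment, or the current one if the transfer was not accepted."""
    if not accepted:
        # The problem stays with the current resolver until someone accepts it.
        return current
    if external and coordinator is None:
        raise ValueError("an outside resolver requires a company resolver as coordinator")
    return Assignment(new_resolver, coordinator if external else None)
```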

 


 

 

9.3.6.   Escalate Problem to Level 2 Support

 

 

Problem escalation is based upon the relative importance of the failing component, its impact on delivering business services, and the duration of the outage.   The escalation process for open problems is:

 

*           30 minutes after resolver has been called and has not arrived,

*           60 minutes after resolver has arrived, but has not formulated problem resolution,

*           Upon management discretion, based on impact, duration of outage, and relative importance of affected component(s).
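
The escalation timings above could be checked mechanically along the following lines.  This is a sketch under stated assumptions: the function and parameter names are hypothetical, management discretion is reduced to a simple flag, and treating the limits as inclusive (escalate at exactly 30 or 60 minutes) is a choice the text does not specify.

```python
from datetime import datetime, timedelta
from typing import Optional

ARRIVAL_LIMIT = timedelta(minutes=30)     # resolver called but not yet arrived
RESOLUTION_LIMIT = timedelta(minutes=60)  # resolver arrived, no resolution formulated


def should_escalate(now: datetime,
                    called_at: datetime,
                    arrived_at: Optional[datetime] = None,
                    resolution_formulated: bool = False,
                    management_override: bool = False) -> bool:
    """Return True when an open problem should be escalated to Level 2 support."""
    if management_override:
        return True
    if arrived_at is None:
        return now - called_at >= ARRIVAL_LIMIT
    if not resolution_formulated:
        return now - arrived_at >= RESOLUTION_LIMIT
    return False
```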

 

 

9.3.7.   Notify Resolver and Area Manager via Pager

 

To accommodate immediate response to Urgent problems, the designated problem Resolver and the affected Area Manager are notified via Pager.   They are then directed to contact the Global Systems Help Desk for further instruction.   When they contact the Global Systems Help Desk, the problem is described to the Area Manager and the Resolver accepts responsibility for repairing the problem.

 

 

9.3.8.   Review Problems

 

Periodically, the Global Systems Help Desk will review active problems to ascertain if they should be raised in priority or to query the resolver and affected area as to the impact of the problem.   Since the impact of a problem can change over time, it may result in executing problem escalation procedures, or transferring the problem to another resolver.

 

The problem review process can also result in problem closures, when it is determined that an active problem is a duplicate of another problem which has been recently resolved.   Through this process, it is possible to inform resolvers of solutions obtained through other individuals or vendors.

 

The problem review process can result in savings in personnel time and problem outage durations.

 

 

9.3.9.   Track Problems

 

All active problems are tracked through Problem Reports distributed to the areas affected by the problem and the resolver’s area.   The problems are then reviewed by personnel in these areas and, if the problem’s status has changed, they will update the problem record (either directly or through the Global Systems Help Desk).

 

This process ensures that all active problems are addressed and resolved in the shortest time possible.  It also ensures that problem information is updated regularly with the most current information available.

 

 

9.3.10.  Problem Work-Arounds

 

Sometimes it is possible to activate an alternate path around a failing component, or to process work in a different, but satisfactory, method.   When possible, the technique used to “Work-Around” a problem is included in the Runbook for the Job and sometimes in the Apriori data base.  

 

Problem reporters can research past problem records of a similar type that have been resolved, or have had Work-Arounds applied to them (free-form text section of problem reports may contain work-around information when used by a previous resolver).   When finding this information, the problem reporter may choose to execute the same Work-Around as another resolver.

 

Another location where Work-Arounds can be found is in the Run Book (SUPPDATA) for the failing job.   All locations should be researched to determine if a Work-Around exists and if the problem is severe enough to take advantage of a Work-Around procedure.

 

In any case, the presence or lack of a Work-Around should be noted when entering the problem into Apriori.

 


 

 

9.3.11.  Bypass/Circumvention Procedures

 

Problem Bypass/Circumvention procedures are often utilized within a data processing organization.   They are especially important for critical components, where a single point of failure can result when a secondary access path does not exist.

 

Bypass/Circumvention actions include:

 

*   Isolating failing components,

*   De-activating the component,

*   Activating secondary components,

*   Re-assigning the failing job to the back-up component,

*   Restarting the failing job.

 

Even if a Bypass/Circumvention should allow the failing job to continue with its processing, it is still necessary to report the failing component within a problem report.   This will allow for assignment of the problem to a resolver and the eventual repair of the failing component.

 

Remember, as long as the failing component remains unrepaired and processing depends on the secondary component, a single point of failure exists.   This condition can lead to a disaster event for critical components that suffer a second failure.   For this reason, the severity of the problem should be equal to the criticality of the operation, and repair work should be escalated as needed to respond to the problem severity.

 


 

 

9.3.12.  Recovery/Restart Procedures

 

 

To recover from an encountered problem, it is necessary to re-establish the operating environment to the status just prior to when the problem occurred (sometimes referred to as the last “Check-Point”).   This process is called Recovery processing and may include:

 

*           Deleting any datasets created after the last check point,

*           Uncataloging datasets,

*           Updating the Tape Management System,

*           Updating the Automated Scheduler, etc.

 

After recovery operations have been completed, the failing job can be restarted.

 

Recovery/Restart procedures should be included with job turnover documentation as SUPPDATA.   All non-zero condition codes should have recovery/restart procedures supplied by the programmer.   Sometimes, recovery/restart operations are included in the PROC as COND CODE steps that are executed when non-zero COND Codes are received.   The use of this process is especially important when recovery/restart operations are extensive in time and labor.

 

 

9.3.13.  Problem Resolution

 

When problems are resolved, their resolution is entered into the Apriori system and closure procedures are initiated.   It is therefore important that problem resolutions are tested to ensure that they really do provide a complete solution to the problem.

 

Sometimes, system managers (MVS, On-Line, etc.) are asked to review problem resolutions to ensure that the solution does indeed resolve the problem in its entirety.   In all cases, the problem reporter is notified of the problem solution and asked to approve its closure.

 

 


 

 

9.3.14.  Problem Closure Procedures

 

After a problem’s resolution has been accepted, closure procedures are initiated.   The purpose of problem closure is to notify all concerned parties of a problem solution and to solicit their acceptance of the solution.  

 

The reporter of the problem must approve the problem solution before problem closure procedures can be successfully completed.   Sometimes, systems managers and area managers are also consulted on a problem solution, before closure procedures can be completed.   This checks-and-balances process is designed to ensure that problems affecting multiple components, or crossing functional areas of responsibility, provide the complete solution to the problem and not just the section of the problem perceived by a single area.

 

When a problem is finally closed, its solution is added to the free-form text area of the problem report.   This information is added to the problem solution information associated with past problems of this type.   From that point on, any new problems of this type will have the solution information displayed as reference data that can be used as an aid in problem resolutions going forward.   Apriori’s bubble-up data base operation is responsible for providing the resolution data via normal procedures.

 

There are several variations of problem closure, including “Closed Pending”.   If a Change Control is required to repair the problem, then the problem is closed as “Closed Pending” until the change has been successfully implemented.   Resolutions that do not require a Change Control are implemented immediately and their status updated to “Closed Verified” through normal close procedures.
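
The choice between the two closure statuses named above can be sketched as a small decision function.  The names `CloseStatus` and `closure_status` are hypothetical, and the step from “Closed Pending” to “Closed Verified” once the change is implemented is an assumption; the text states only that the problem remains “Closed Pending” until then.

```python
from enum import Enum


class CloseStatus(Enum):
    CLOSED_PENDING = "Closed Pending"
    CLOSED_VERIFIED = "Closed Verified"


def closure_status(reporter_approved: bool,
                   needs_change_control: bool,
                   change_implemented: bool = False) -> CloseStatus:
    """Pick the closure status for a resolved problem."""
    if not reporter_approved:
        # Closure cannot be completed without the problem reporter's approval.
        raise ValueError("problem reporter has not approved the solution")
    if needs_change_control and not change_implemented:
        return CloseStatus.CLOSED_PENDING
    # Assumption: an implemented change moves the problem to Closed Verified.
    return CloseStatus.CLOSED_VERIFIED
```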

 

 

9.3.15.  Post Mortem Procedures

 

For severe problems that affect critical components, or multiple areas, a Post Mortem is performed.   The information contained within a Post Mortem is designed to fully define a problem, its impact, and the steps taken to resolve the problem.   Post Mortems are used to provide many people with problem information that will serve as an aid in avoiding problems of this type in the future.

 

Post Mortem information is distributed to members of the Technology Coordination Group - Mainframe Steering Committee (TCG-MFSC) for their review. 

 

 

9.3.16.  Problem Reporting

 

The Apriori Problem Management system is equipped with a powerful problem reporting tool, which is capable of grouping problems into categories and formatting reported output to suit user needs.   This tool is used to generate reports for first and second level managers, as well as for ad-hoc requests and to support company problem resolver activities.

 

Periodically, problem reports are produced by the Global Systems Help Desk and distributed to designated personnel.  

 

Each business morning, a report of all newly Opened problems and those problems that have been Closed within the last 24 hours is produced and distributed to personnel attending the 8:45 Problem Meeting.
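The selection behind the morning report might look like the following sketch. The record layout and function name are hypothetical, and treating "newly Opened" as "opened within the same 24 hour window" is an assumption, not something the procedure states.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


@dataclass
class Problem:
    number: int
    opened_at: datetime
    closed_at: Optional[datetime] = None   # None while the problem is open


def morning_report(
    problems: List[Problem], now: datetime
) -> Tuple[List[Problem], List[Problem]]:
    """Return (newly opened, recently closed) problems for the last 24 hours."""
    cutoff = now - timedelta(hours=24)
    opened = [p for p in problems if p.opened_at >= cutoff]
    closed = [
        p for p in problems
        if p.closed_at is not None and p.closed_at >= cutoff
    ]
    return opened, closed
```

A problem that was both opened and closed inside the window appears in both lists, which seems the safer choice for a meeting that reviews each event.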

 

A full range of first level and second level management reports has been developed by the Global Systems Help Desk.   These reports provide first level managers with a list of problems assigned to their area.   Second level managers are provided with summary reports describing the problems assigned to the first level managers who report to them.

 

The division executive is also supplied with a summary report detailing the problems assigned to the second level managers who report to the division executive.
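The roll-up from first level to second level to the division executive can be sketched as below. The manager identifiers and the reporting-line table are invented for illustration; the actual reports are produced by the Global Systems Help Desk through Apriori.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical reporting lines: first level manager -> second level manager.
REPORTS_TO: Dict[str, str] = {
    "fl_ops": "sl_tech",
    "fl_network": "sl_tech",
    "fl_apps": "sl_dev",
}


def first_level_report(assignments: List[str]) -> Counter:
    """Count problems per first level manager (one list entry per problem)."""
    return Counter(assignments)


def second_level_summary(assignments: List[str]) -> Counter:
    """Roll first level counts up to the second level managers above them."""
    summary: Counter = Counter()
    for manager, count in first_level_report(assignments).items():
        summary[REPORTS_TO[manager]] += count
    return summary


def division_total(assignments: List[str]) -> int:
    """Total across all second level managers, for the division executive."""
    return sum(second_level_summary(assignments).values())
```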

 

Should a specific type of problem report be desired, contact the Global Systems Help Desk with your informational request.   They are responsible for addressing your problem reporting needs.

 

 


 

 

9.4.      Process Elements

 

 

9.4.1.   Products

 

The products utilized in support of Problem Management are:

 

*           Apriori

 

Apriori is the Problem Management System from Answer Systems, Inc.   It is utilized as a Problem Repository and front-end to the Problem Management System.  

 

UNIX based and located on dual servers, the Apriori Problem Management System is capable of communicating with Windows based and Info/Man based terminals (only the Windows based operation is currently being utilized).

 

There are many options available with the Apriori product, some of which have been purchased by the company.   They include telephone paging and fax reception/transmission features.

 

 

*           Telephone

 

When Urgent problems are reported, the Apriori system issues pages to the assigned resolver and the area manager responsible for the failing component.  The phone page is a means of obtaining quicker responses to encountered problems and for informing management of a problem in their area.

 

Automatic Call Distribution (ACD) facilities are also available with Apriori.  This facility routes each incoming problem call to the next Global Systems Help Desk individual who is available to respond.   The ACD feature is designed to speed up call responses through better call routing.
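The "next available individual" behavior of the ACD facility can be illustrated with a simple queue. This is a sketch only: the class and method names are hypothetical, and the choice to serve the agent who has been free the longest is an assumption about how "next available" is decided.

```python
from collections import deque
from typing import Deque, Optional


class CallRouter:
    """Hands each incoming call to the agent who has been free the longest."""

    def __init__(self) -> None:
        self._available: Deque[str] = deque()

    def agent_free(self, agent: str) -> None:
        # An agent joins the back of the queue when they finish a call.
        if agent not in self._available:
            self._available.append(agent)

    def route_call(self) -> Optional[str]:
        # The front of the queue is the next available agent; None means
        # every agent is busy and the call has to wait.
        return self._available.popleft() if self._available else None
```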

 


 

 

 

*           Conference Bridge

 

The company has contracted with a telephone conferencing service provider, so that management and technical personnel can be conferenced when a serious problem occurs.   This facility allows for the planning of actions in response to problem and disaster situations.

 

When a problem arises that warrants activation of the Teleconferencing facility, the Global Systems Help Desk will notify Darome Teleconferencing Services that a teleconference bridge must be established for company personnel.   Separate bridges for Technical and Business Management personnel are available, as well as a common bridge for both parties.

 

The Global Systems Help Desk will page, or otherwise contact, all personnel who must participate in the conference bridge.   During the teleconference, the problem will be described and actions taken explained.   Additional information will be gathered from participants and problem severity classified.

 

Through the use of teleconferencing facilities, more people will be informed of the problem and additional information obtained that will assist in resolving the problem in the shortest time possible.

 

 


 

 

9.4.2.   Global Systems Help Desk

 

Personnel assigned to the Global Systems Help Desk are responsible for providing company personnel with assistance in diagnosing and repairing problems.   They serve as a focal point for problem information and are the front line for problem resolution.

 

Global Systems Help Desk personnel work closely with the Command Center and assist problem reporters and resolvers with problem diagnosis and repair.   All problem reports are monitored by the Global Systems Help Desk, who are also responsible for problem reporting and distribution.

 

 

9.4.3.   Problem Reporters

 

All company personnel can be Problem Reporters, but problems are primarily reported by:

 

*           Technical Operations,

*           Business End Users,

*           Applications Development, and

*           The Global Systems Help Desk.

 

The problem reporter must provide the following information:

 

*           Problem Description and its Impact,

*           Reporter’s name, phone number, and department,

*           Date and Time of problem event, and

*           Supporting problem information, if available.
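The required information listed above could be checked at entry time along the following lines. The field names are hypothetical and do not reflect the layout of an Apriori problem record; only supporting information is treated as optional, matching the "if available" wording.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class ProblemReport:
    description: str
    impact: str
    reporter_name: str
    reporter_phone: str
    reporter_department: str
    occurred_at: str            # date and time of the problem event
    supporting_info: str = ""   # optional, supplied "if available"


def missing_fields(report: ProblemReport) -> List[str]:
    """Names of required fields that were left blank."""
    optional = {"supporting_info"}
    return [
        f.name
        for f in fields(report)
        if f.name not in optional and not getattr(report, f.name).strip()
    ]
```

An empty result means the report carries everything the reporter is required to provide.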

 

Reported problems are entered into the Problem Management System, assigned to a resolver, and tracked until resolved.   When a problem solution is found, the problem reporter is notified and asked if the solution is acceptable.  If so, then the problem can enter close processing.

 

 

9.4.4.   Problem Resolvers

 

Problem resolvers are usually Technical Operations personnel responsible for specific technical products and areas.   When a problem is reported, it is routed and assigned to the appropriate problem resolver.

 

Problem resolvers research problems and define solutions as necessary.   Once a problem solution is determined, the resolver applies the fix and notifies the Global Systems Help Desk to add the problem solution information to the problem record (or they enter the problem resolution information directly into the Apriori system, if authorized).

 

 

9.4.5.   Change Management

 

When a problem solution requires a change to be implemented, the Change Management System is utilized.   In these cases, a PRA form is completed and the problem is placed in a Close Pending state.  

 

The resolver will complete all required Change Control information and perform all functions associated with the type of change being implemented, including:

 

*           Forms completion,

*           Class A testing,

*           Component migrations,

*           Endevor interfacing, etc.

 

Once a change has been implemented successfully, the problem reporter is informed of the resolution and their acceptance sought.   If the solution is acceptable to the problem reporter, then the problem status is changed to Close Verified from Close Pending.

 

 

*           Info/Man

 

Info/Man is the product used to support the Change Management process at the company.

 


 

 

9.5.      Roles and Responsibilities

 

 

9.5.1.   Problem Manager

 

The Client Services Manager is responsible for overall operation of the Problem Management System and the Global Systems Help Desk.

 

 

9.5.2.   Global Systems Help Desk Personnel

 

Global Systems Help Desk personnel are responsible for:

 

*           Responding to problem reports from company personnel,

*           Entering problem information,

*           Performing first level problem resolution,

*           Routing problems to problem resolvers,

*           Tracking problems until they are resolved,

*           Coordinating problem closure with the Problem Reporter and systems managers,

*           Closing Problems,

*           Performing Post Mortems on problems, and

*           Performing Problem Reporting and Distribution.

 

 

9.5.3.   Problem Reporters

 

All company personnel can report problems.  If a deviation from Standards and Procedures, or a disruption to an Expected Service Delivery is experienced, then the event must be reported to the Problem Management System, either directly through Apriori, or via the Global Systems Help Desk.

 

 

9.5.4.   Problem Resolvers

 

Personnel responsible for a specific function or component are usually responsible for the operation and maintenance of the component as well.   When this occurs, these people are referred to as Resolvers by the Problem Management System.

 

Problem Resolvers are responsible for accepting problem reports for components under their control and for responding to each problem report by performing problem resolution activities.   Once a problem has been corrected by the resolver, the correction must be reported to the Problem Management System.

 

Problem escalation is accomplished through communications with Resolvers.   Should a Resolver require assistance to repair a problem, they can request escalation through the Global Systems Help Desk.    Problem resolutions are added to the Apriori data base when Resolvers repair problems.   Since problem repair descriptions will aid in the resolution of future problems of this type, it benefits the resolver to make resolution information as clear and concise as possible.   This increases the likelihood that future problems of this type will be resolved by 1st Level Support and that fewer problems will be routed to the Resolver’s area.

 


 

 

9.6.      Process Evaluation

 

Periodic reviews of the Problem Management System are conducted to evaluate its operation.   Any detected weakness is documented and recommendations for improvement formulated.  The process evaluation procedure is repeated on an annual basis.

 

 

9.6.1.   Present System Weaknesses

 

The Problem Management System is based on Apriori, a product that is not known by many individuals and cannot be accessed by many of the people who need to report and track problems.

 

Apriori is a UNIX based product and program products must be used to convert protocols from UNIX to Windows.   The cost of these products has not been fully defined and factored into the overall cost associated with rolling out the Problem Management System.

 

An interface between the Problem and Change Management Systems does not presently exist.

 

Problem reporting does not presently satisfy all of the management and technical needs faced by the organization.

 

 

9.6.2.   Recommendations for Improvement

 

1.      Develop a Problem Report Generation and Distribution System.

 

2.      Develop an interface between the problem and Change Management Systems.

 

3.      Develop a Roll-Out plan for Apriori.

 

4.      Create training materials on the Apriori product.

 

5.      Print a User’s Guide for Apriori.

 


 

 

9.7.      Appendices

 

Appendix  ‘A’             Service Level Document - Post Mortem.

 

Provides a sample of the SLD-PM document and explains how to complete the fields contained within the document.

 

 

 

 

 


 

 

Figure 7:  Service Level Document - Post Mortem

9.7.1.   Appendix ‘A’ Service Level Document - Post Mortem

 

Service Level Document - Post Mortem

 

Name: ________________________________________________              System:  ________________

Department: ___________________________________________                  Date:  ________________

Phone #: _____________________    Fax: ___________________             Apriori #:  ________________

============================================================================

 

SYMPTOM:  ________________________________________________________________________

____________________________________________________________________________________

____________________________________________________________________________________

____________________________________________________________________________________

 

PROBLEM ANALYSIS:  _____________________________________________________________

____________________________________________________________________________________

____________________________________________________________________________________

____________________________________________________________________________________

 

DETAILS (causes, immediate resolution, etc.): ____________________________________________

____________________________________________________________________________________

____________________________________________________________________________________

____________________________________________________________________________________

____________________________________________________________________________________

____________________________________________________________________________________

____________________________________________________________________________________

 

IMPACT (CPU/CICS outages, batch delays, etc.): _________________________________________

____________________________________________________________________________________

____________________________________________________________________________________

____________________________________________________________________________________

 

SOLUTION (“root cause”, permanent resolution, etc.): ____________________________________

____________________________________________________________________________________

____________________________________________________________________________________

____________________________________________________________________________________

____________________________________________________________________________________

 

FUTURE PREVENTION (actions to be taken, who, what, when): ____________________________

____________________________________________________________________________________

____________________________________________________________________________________

____________________________________________________________________________________

____________________________________________________________________________________

_____________________________________________________________________________________

 


 

 

 

Service Level Document - Post Mortem

Process Description

 

 

Document Overview:

 

The Service Level Document - Post Mortem (SLD-PM) assists problem reporters, resolvers, and management in gaining an understanding of major problems and their impact on the environment.   The SLD-PM will define the problem and the steps taken to permanently resolve the problem.

 

 

Service Level Objective:

 

 

An SLD-PM document for all major problems (e.g., multiple users affected or business impacted) will be completed within 24 hours of the problem’s occurrence, even if a problem solution has not been achieved.   This objective will ensure that problem information is available to management and the area(s) affected by the problem, so appropriate actions can be taken in response to the problem incident.
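The 24 hour objective reduces to a simple deadline calculation, sketched below. The function names are hypothetical, and counting elapsed calendar hours (rather than business hours) is an assumption, since the objective does not say which is meant.

```python
from datetime import datetime, timedelta
from typing import Optional

SLD_PM_WINDOW = timedelta(hours=24)


def sld_pm_due(occurred_at: datetime) -> datetime:
    """Latest time by which the SLD-PM should be completed."""
    return occurred_at + SLD_PM_WINDOW


def objective_met(
    occurred_at: datetime, completed_at: Optional[datetime], now: datetime
) -> bool:
    """True if the SLD-PM was completed in time, or is still within the window."""
    deadline = sld_pm_due(occurred_at)
    if completed_at is not None:
        return completed_at <= deadline
    # Not yet completed: the objective holds only until the deadline passes.
    return now <= deadline
```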

 

 

Field Definitions:

 

 

The top section identifies the person responsible for completing the SLD-PM document and the system affected by the problem (this information is self-evident and not further defined), while the bottom section describes the problem and the actions taken to permanently resolve it.   The fields contained in the bottom section of the SLD-PM document are:

 

 

SYMPTOM:

 

Describe the Symptom(s) and/or indicators of the problem within this area.  Include:

 

*           Initial symptoms perceived by the problem reporter, and

*           Any additional symptoms that the resolver may have provided.

 

 

 

PROBLEM ANALYSIS:

 

 

Define the steps taken in researching the problem to determine its “Root Cause”, including:

 

*           Messages and/or Codes,

*           Abends, and

*           Information sources used to develop:

- Bypass / circumvention,

- Recovery / restart procedures, and

- Permanent resolution to the problem.

 

 

DETAILS (causes, immediate resolution, etc.):

 

 

Describe the details associated with:

 

*           Reporting of the problem,

*           Events leading to its cause,

*           Bypass / circumvention actions taken,

*           Recovery actions taken,

*           Restart actions taken, and

*           Resolution actions.

 

 

IMPACT (CPU/CICS outages, batch delays, etc.):

 

 

Provide a description of the impact associated with this problem, including:

 

*           System(s) and subsystem(s) affected by the problem,

*           Length of time associated with the outage,

*           Locations affected by the outage,

*           Secondary impacts caused by the problem (e.g., missed deadlines, missing reports, delayed systems), and

*           Missing inputs/outputs caused by the problem.

 

 

SOLUTION (“root cause”, permanent resolution, etc.):

 

 

Provide specific information on the formulated problem solution, including:

 

*           Definition of problem’s “Root Cause”, and

*           Description of problem solution.

 

 

FUTURE PREVENTION (actions to be taken, who, what, when):

 

 

Include any additional actions that must be taken to permanently resolve the problem in this section, such as:

 

*           Change Control number required to implement permanent solution,

*           What actions must be taken for permanent solution,

*           Who is responsible for implementing permanent solution,

*           When permanent solution will be implemented,

*           Any needed documentation and procedure creations / updates, and

*           Staff training and orientation requirements.
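The six fields defined above can be treated as a checklist when reviewing a draft SLD-PM. The sketch below is illustrative only: the function name is hypothetical, and allowing SOLUTION and FUTURE PREVENTION to remain blank until a solution exists is an assumption drawn from the objective that the document be completed even if a solution has not been achieved.

```python
from typing import Dict, List

SLD_PM_SECTIONS = (
    "SYMPTOM",
    "PROBLEM ANALYSIS",
    "DETAILS",
    "IMPACT",
    "SOLUTION",
    "FUTURE PREVENTION",
)

# Assumed: these two may be left blank while no solution has been found.
DEFERRABLE = ("SOLUTION", "FUTURE PREVENTION")


def incomplete_sections(document: Dict[str, str], solution_found: bool) -> List[str]:
    """Sections that still need text, in the order they appear on the form."""
    required = [
        s for s in SLD_PM_SECTIONS
        if solution_found or s not in DEFERRABLE
    ]
    return [s for s in required if not document.get(s, "").strip()]
```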

 

 

Document Routing:

 

 

When completed, the form is submitted to Client Services, who forward a Post Mortem report to the Technology Coordination Group - Mainframe Steering Committee (TCG-MFSC) for their review.