Problem Management System
Release Date: November 21, 2001
Produced by: Thomas Bronack
Section Table of Contents
9.1. Introduction to Problem Management
9.1.5. SMC Discipline Interfaces
9.2.3. Problem Reporting and Logging:
9.2.4. Problem reporters include:
9.2.5. Required problem information includes:
9.2.6. To report a problem, via the Global Systems Help Desk:
9.2.7. To report a problem, directly through Apriori:
9.2.12. Service Level Document - Post Mortem:
9.2.13. Management Reports and Review:
9.2.14. The desired reporting scenario is that each morning:
9.3.2. Entering problems into APRIORI
9.3.3. Perform Level 1 Support
9.3.4. Route Problem to Resolver
9.3.5. Transfer Problem to Another Resolver
9.3.6. Escalate Problem to Level 2 Support
9.3.7. Notify Resolver and Area Manager via Pager
9.3.11. Bypass/Circumvention Procedures
9.3.12. Recovery/Restart Procedures
9.3.14. Problem Closure Procedures
9.3.15. Post Mortem Procedures
9.4.2. Global Systems Help Desk
9.5. Roles and Responsibilities
9.5.2. Global Systems Help Desk Personnel
9.6.1. Present System Weaknesses
9.6.2. Recommendations for Improvement
9.7.1. Appendix ‘A’ Service Level Document - Post Mortem
Section Table of Figures
Figure 1: Help Desk Functions and Procedures
Figure 2: Definition of a Problem
Figure 3: Definition of Problem Management System
Figure 4: Problem Management Objectives
Figure 5: Proposed Problem Management Process Flow
Figure 6: Global Systems Help Desk Functions Overview
Figure 7: Service Level Document - Post Mortem
The purpose of this document is to
provide a description of the Problem Management System and the procedures used
to enter, review, update, assign, escalate, resolve, and close problems.
<note> Apriori is a problem management tool that
has a “Bubble-Up” data base that displays past problems that match entered
problem abstracts.
Figure 1: Help Desk Functions and Procedures
This topic will provide the
definition of a PROBLEM and how the PROBLEM MANAGEMENT SYSTEM works.
PROBLEM definition:
Figure 2: Definition of a Problem

The Problem Management System has
been established to capture and report on encountered problems. It employs the APRIORI Problem Management
System product from Answer Systems, Inc. as a front-end and problem repository.
The Problem Management System is
used to record problem events, assign them to a resolver, and track problems
until they are successfully resolved.
If problems are of a high priority, or outstanding beyond an acceptable
period of time, then the Problem Management System will escalate the problem in
priority. When this occurs, the next
level in problem support is activated and additional manpower is applied to resolving the problem.
The PROBLEM MANAGEMENT SYSTEM is defined as:
Figure 3: Definition of Problem Management System

Problem Management is the process of detecting and reporting
problems that impact services supplied by Technology Operations. A problem
is any unplanned deviation from standards or an expected service delivery.
Problems include:
Hardware, from mainframe through data network;
Software systems, sub-systems, applications, and utilities;
Communications applications, devices, and lines;
Data Network applications and devices;
Human errors;
Procedures; and
Environmental failures (e.g., Heating, Ventilation, Air Conditioning, Power, Water, or Raised Floor).
Intervention may be required to
determine, eliminate, or circumvent problems as they are identified.
Figure 4: Problem Management Objectives

The Mission of the Problem Management System is to employ
standard procedures for reporting and resolving problems. The goal of this process is to reduce the impact
of failures on service expectations to an acceptable level.
Problem Reporting is used to inform
Management and Technical personnel of problems that affect service
expectations. Problems are assigned to
resolvers through the Technology Operations department and/or the Global
Systems Help Desk. Escalations,
designed to raise the priority of a problem, are incorporated into the Problem
Management Process so that appropriate resolution actions can be taken and the
outage’s duration reduced.
The Problem Management Process will
record, assign, escalate, track, resolve and report on any situation that is a
deviation from expected service deliveries or standards.
Post Mortem procedures are utilized
to review resolved problems that have impacted multiple users, or are of a high
priority. The goal of the Post Mortem
process is to implement improved problem recovery and resolution procedures, as
deemed necessary.
The objectives of Problem Management are:
Ensure all encountered problems are reported,
Prioritize problems as to their criticality and business impact,
Log problems to a Problem Record in the Problem Repository,
Resolve a majority of reported problems at the Global Systems Help Desk,
Assign problems that cannot be resolved by the Global Systems Help Desk to resolvers,
Track and manage problems (from origination through resolution),
Escalate problem resolution in accordance with criticality and duration,
Close problems when resolved,
Conduct Post Mortems on resolved problems,
Recommend updates to supportive documentation, and
Produce and distribute Problem Reports.
These objectives are met through:
The use of the APRIORI Problem Management System,
Personnel assigned to the Global Systems Help Desk,
Internal Resolvers assigned to the departments of Technical Operations, and
External Resolvers assigned to problems under the management of Internal Resolvers.
The Problem Management Process begins
with the recognition of a problem and ends when a problem is closed. In both cases, the problem reporter must be
included: initially as the reporter of a problem and finally as the approver of
the problem’s resolution.
In some cases, the Systems Manager,
or On-Line Manager, must also approve the resolution to ensure that the closed
problem does not impact other areas beyond the reporter’s domain (e.g., problems
that affect multiple users but are reported by only one user). This “checks-and-balances” procedure is
accomplished via periodic problem review meetings and the distribution of
problem reports.
The APRIORI Problem Management System is used to record and
process problem records (each problem has a unique problem record id). Problems can be entered into the APRIORI
system by personnel from:
Global Systems Help Desk,
Technical Operations,
Applications Development, and
Business End Users.
The Technology Operations Problem
Management System is responsible for problems within the following areas:
Host applications,
Hardware,
Software,
Data Network,
Terminals,
Procedures, and
Facilities.
The Systems Management and Controls
(SMC) disciplines interfacing with the Problem Management System are:
Batch Management: Provides problem reports for Batch Jobs and their recovery/restart procedures.
Capacity Management: Provides problem reports on abends due to capacity shortages, such as DASD Space and Region Sizes. May also be used to isolate the need for additional hardware and/or the reconfiguration of existing resources.
Performance Management: Provides problem reports for performance-related weaknesses, such as transaction response times, Batch Job turnaround times, etc.
Recovery Management: Provides problem reporting when errors in the recovery process occur, such as errors in recovery planning for critical assets, sizing requirements for recovery facilities, and failed recovery/restart procedures.
Service Level Management: Provides problem reports when service level delivery to clients is below expected limits, as determined via Service Level Reporting.
Inventory Management: Provides problem reporting for items not included in the Inventory, or wrongly reported in the Inventory System (i.e., Acquisition, Redeployment, Termination, or Surplus).
Figure 5: Proposed Problem Management Process Flow

Problem Management System
procedures are described below.
A problem is categorized as:
A deviation from “Expected
Service Delivery”, or
A “Standards and Procedures”
violation.
When these conditions are recognized, they must be reported as a problem.
Problems, or potential problems, are detected and identified through monitoring, trend analysis, or observation. A problem can be recognized at any point in the system; when identified, problems must be entered into the Problem Management System and routed to a problem resolver.
Problem entry is performed via the Apriori program product from Answer Systems. Apriori provides a Problem Repository data base and is equipped with front-end displays used to enter, access, research, assign, track, and resolve problems.
Problem reporters include:
Technology Operations personnel,
Applications Development personnel,
Business Users, and
Global Systems Help Desk personnel.
Required problem information includes:
Name, location, and phone number of problem reporter,
Date and time of problem occurrence,
Description of problem and circumstances that led to the problem,
Assessment of the severity and impact of the problem, classified as Urgent, Important, Minor, or Information, and
Supporting information, if available and appropriate.
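As an illustration only, the required problem information could be modeled as a small record type. The class and field names below are hypothetical and are not taken from Apriori; only the list of required items and the four severity levels come from the text above.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class Severity(Enum):
    """Severity levels named in the text, most severe first."""
    URGENT = 1
    IMPORTANT = 2
    MINOR = 3
    INFORMATION = 4


@dataclass
class ProblemReport:
    # Hypothetical field names; the text lists only what must be supplied.
    reporter_name: str
    reporter_location: str
    reporter_phone: str
    occurred_at: datetime
    description: str
    severity: Severity
    supporting_info: str = ""  # optional: "if available and appropriate"

    def missing_fields(self):
        """Return the names of required text fields that were left empty."""
        required = ["reporter_name", "reporter_location",
                    "reporter_phone", "description"]
        return [name for name in required if not getattr(self, name).strip()]


report = ProblemReport(
    reporter_name="J. Smith",
    reporter_location="Building 2",
    reporter_phone="",
    occurred_at=datetime(2001, 11, 21, 9, 30),
    description="Batch job abended with an out-of-space condition",
    severity=Severity.IMPORTANT,
)
print(report.missing_fields())  # ['reporter_phone']
```

A check of this kind lets the person taking the call see which required items are still missing before the problem is routed.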
To report a problem via the Global Systems Help Desk:
Call the Global Systems Help Desk at ______________
Describe the problem to the Help Desk,
Define the Impact of the problem,
The Help Desk will record the problem,
If possible, the Help Desk will resolve the problem,
Otherwise, the Help Desk will route the call to a resolver,
Notification is performed via Telephone, Beeper, or in Person,
Notification of Urgent and Important problems is immediate,
Senior Management and the impacted Business Function are notified immediately via page for Urgent problems.
To report a problem directly through Apriori:
Log onto Apriori,
Enter problem information,
Use problem Search to review past solution data for problems of this type,
If possible, repair / bypass the problem with the supplied information,
Otherwise, route the problem to a resolver via the Submit Problem Report feature.
Problem tracking and escalation are performed by the Global Systems Help Desk for all reported problems. Problem reporting will be performed on a periodic basis. Problem turnaround to resolution is dependent upon severity. Urgent problems will be addressed immediately (within 1/2 hour), while less urgent problems may take longer to address.
The resolver, or Help Desk, must contact the problem reporter to gain their acceptance of the problem solution before problem closure procedures can be performed. Systems Managers may also be informed of problem solutions and their approval sought for some types of problems. This guarantees that problem reporters are satisfied with resolutions and that problems are truly repaired.
Problems are identified through searches of past problems contained in the Apriori data base, or through analysis efforts performed by the resolver. The intent of problem determination is to identify the source of the problem at a level sufficient to enable corrective action(s) to be performed.
Once the problem is assigned to the resolver, they are responsible for defining the problem’s “root cause” and for developing a solution to the problem. Restore/Recovery and Bypass/Circumvention procedures should also be supplied by the resolver whenever possible. This information is added to the free-form text section of the Apriori problem record and made available to future problem searches by reporters/resolvers.
To allow processing to continue, it is sometimes possible to work around problems through a bypass or circumvention, while recovery/restart operations are used to re-establish the environment just prior to the problem event. The combination of the two disciplines allows for the re-submission of failed jobs, using alternate components after recovery procedures have been performed.
Problems are assigned to resolvers responsible for supporting the area impacted by the problem. If a problem has to be reassigned, or escalated, it is coordinated through the Global Systems Help Desk. If external resolvers are needed, an internal resolver must be responsible for monitoring and coordinating the actions of the external resolver. All problem determination, work-around, and resolution information must be entered into the free-form text section of the problem record.
Problem solutions that require change control activity will be processed through Change Management before verification of problem solutions can be accomplished. Solutions that do not require change control activity will be implemented immediately. In both cases, the problem reporter must approve the problem solution before the problem can be closed.
All problem related information must be added to the free-form text section of the problem record before problem closure is finalized.
Post Mortem reviews for major problems (those problems affecting multiple users, or impacting business operations) are conducted within 24 hours of the problem, even when resolutions are not available. The resulting Post Mortem document is supplied to members of the Technology Coordination Group’s Mainframe Steering Committee (TCG-MFSC).
The sections contained within this document are:
Symptom,
Problem Analysis,
Details,
Impact,
Solution, and
Future Prevention.
If problem resolution information was not available in the initial Post Mortem report, the Post Mortem document is updated until the problem solution is available (see Appendix ‘A’ for Post Mortem document and process).
Problem reports are generated for a variety of business reasons, including:
Open and/or closed problems,
High impact/severity problems,
Problems by date/time range,
Problems by functional area/department,
Problems by component/vendor,
Problems by type, frequency, duration, etc.
In all cases, problem reports can be used to isolate weaknesses in products, applications, standards, procedures, supportive documentation, and training. Through problem reports and periodic reviews, management and technical personnel can make adjustments to the functions they perform, thereby reducing problem events and improving performance.
Apriori supports a full range of reporting functions. Its data base design allows:
Problem record searches for specific information,
Grouping data in problem reports,
Formatting output for hardcopy, fax, or e-mail,
Routing reports and data to files for use by other products (e.g., it is possible to copy problem data to a file in spreadsheet format; this information can then be used to generate graphs and other reports from Excel, Lotus 1-2-3, etc.).
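The idea of copying problem data to a file in spreadsheet format can be sketched with comma-separated values, which spreadsheet products can generally import. This is an assumption about what such an export might look like, not a description of Apriori's own export feature; the function name and the record fields are hypothetical.

```python
import csv
import io


def to_spreadsheet_text(problems, fields):
    """Render problem records (dicts) as CSV text with a header row.

    Keys that are not listed in `fields` are ignored.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(problems)
    return buffer.getvalue()


rows = [
    {"id": "PR-0001", "severity": "Urgent", "status": "Open", "note": "paged"},
    {"id": "PR-0002", "severity": "Minor", "status": "Closed", "note": ""},
]
print(to_spreadsheet_text(rows, ["id", "severity", "status"]))
```

Writing the returned text to a file with a .csv extension gives a file that can be opened directly for graphing or further reporting.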
Reports can be produced on a periodic basis, or via ad-hoc requests.
The desired reporting scenario is that each morning:
1st Level Managers receive Problem Reports (Open/Closed) for their areas,
2nd Level Managers receive Summary Reports for the areas reporting to them.
The reports should be generated via Apriori and distributed via cc:Mail (or some other electronic media like Fax) to the individual. A schedule for report generation and distribution should be created and adhered to. Responses to reported problems should be accomplished via cc:Mail/Fax, when information or resolutions are available.
The problem reports listed above should be produced for every manager in the Technology Operations organization. Application problem reports should be generated, as well.
The summary report for the Technology Operations executive should include all open/closed problem data for the areas under the executive’s direct control. Trending analysis reports should be provided to allow the executive to pinpoint where problems most often occur, or where resolutions take the most time to develop. This information will allow the executive to direct staff to concentrate efforts on eliminating problem areas, or to provide additional training to specified areas of operation.
Figure 6: Global Systems Help Desk Functions Overview

Whenever a deviation from Standards
and Procedures occurs, or an Expected Service Delivery is missed, the event
must be reported as a problem. Problem
reporting is accomplished through the Apriori Problem Management System, which
is under the control of the Global Systems Help Desk (included in the Client
Services Department).
Directly
Individuals who are authorized to access Apriori should enter problem information directly into Apriori. Once entered, the problem must then be routed to a resolver. If the person entering the problem does not know who the resolver for the problem type is, they should contact the Global Systems Help Desk for assistance.
Via Global Systems Help Desk
Individuals not having access to Apriori can report problems to the Global Systems Help Desk by calling (201) 524-4357 (4357 = HELP). The Global Systems Help Desk will enter the problem into the Apriori Problem Repository and route the problem to the appropriate resolver.
Global Systems Help Desk personnel will perform Level 1 support on problems reported to them by Problem Reporters. Level 1 support involves reviewing past solutions to problems of the type being reported. If a resolution is found, then it is applied. Should the solution resolve the problem, then the problem is closed as a duplicate of a previous problem type.
A high percentage of problems reported to the Global Systems Help Desk are resolved in this manner, because of Apriori’s Bubble-Up data base technology which relates reported problems to past solutions of the same problem type.
If the problem cannot be resolved by the Global Systems Help Desk, or through the use of Apriori Bubble-Up data base solution displays, then the problem is routed to the functional area responsible for the problem type (e.g., DASD Management for disk failures, Capacity and Performance group for slow responses, etc.).
A list of departments and the resolvers included in the departments is contained within the Apriori system. If you do not know the resolver associated with the problem type you are reporting, contact the Global Systems Help Desk.
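The Level 1 and routing decision described above can be sketched as a lookup. This is a simplified illustration, not Apriori's Bubble-Up matching: the function name, the problem-type strings, and the dictionary of known solutions are hypothetical, and only the two routing examples (DASD Management, Capacity and Performance) come from the text.

```python
# Hypothetical routing table; the text says the real department and
# resolver list is held within the Apriori system.
ROUTING = {
    "disk failure": "DASD Management",
    "slow response": "Capacity and Performance",
}


def handle(problem_type, known_solutions):
    """Try Level 1 support first, then route to the responsible area.

    known_solutions maps a problem type to a past solution, standing in
    for a search of previously resolved problems.
    """
    solution = known_solutions.get(problem_type)
    if solution is not None:
        return ("resolved at Level 1", solution)
    # When the resolver is not known, fall back to the Help Desk.
    department = ROUTING.get(problem_type, "Global Systems Help Desk")
    return ("routed", department)


past = {"slow response": "Restart the affected region"}
print(handle("slow response", past))
print(handle("disk failure", past))
print(handle("printer jam", past))
```

The fallback mirrors the instruction to contact the Global Systems Help Desk when the resolver for a problem type is not known.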
Should a problem have to be reassigned to another resolver, then the transfer procedure must be followed. This procedure guarantees that the new resolver is aware of the problem and accepts responsibility for its resolution. The process is:
Determine new resolver to transfer problem to,
Contact new resolver and explain problem to them,
New resolver agrees that problem is in their area of responsibility,
New resolver accepts responsibility for problem,
Apriori problem record is updated to reflect new resolver.
Should the alternate resolver be an outside consultant/vendor, then a company resolver is assigned management coordination responsibility for the problem and the actions taken by the outside resolver.
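The transfer procedure above amounts to a guarded update: the problem record changes only after the new resolver has accepted responsibility, and an outside resolver always has a company resolver coordinating. A minimal sketch, with hypothetical class, field, and function names:

```python
from dataclasses import dataclass, field


@dataclass
class ProblemRecord:
    record_id: str
    resolver: str
    coordinator: str = ""  # company resolver overseeing an outside resolver
    history: list = field(default_factory=list)


def transfer(record, new_resolver, accepted, external=False, coordinator=""):
    """Reassign a problem only after the new resolver accepts it."""
    if not accepted:
        raise ValueError("new resolver has not accepted responsibility")
    if external and not coordinator:
        raise ValueError("an outside resolver needs a company coordinator")
    record.history.append((record.resolver, new_resolver))
    record.resolver = new_resolver
    record.coordinator = coordinator if external else ""
    return record


record = ProblemRecord("PR-0001", resolver="DASD Management")
transfer(record, "Capacity and Performance", accepted=True)
print(record.resolver)  # Capacity and Performance
```

Keeping the previous resolver in a history list is an added convenience in this sketch; the text requires only that the record reflect the new resolver.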
Problem escalation is based upon the relative importance of the failing component, its impact on delivering business services, and the duration of the outage. An open problem is escalated:
30 minutes after the resolver has been called and has not arrived,
60 minutes after the resolver has arrived, but has not formulated a problem resolution,
Upon management discretion, based on impact, duration of outage, and relative importance of affected component(s).
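The escalation triggers above can be expressed as a single check. The sketch below is one reading of those rules; the function and parameter names are hypothetical, and treating the 30 and 60 minute limits as inclusive is an assumption.

```python
from datetime import datetime, timedelta

ARRIVAL_LIMIT = timedelta(minutes=30)     # resolver called, not yet arrived
RESOLUTION_LIMIT = timedelta(minutes=60)  # resolver arrived, no resolution yet


def should_escalate(now, called_at, arrived_at=None,
                    resolution_found=False, management_override=False):
    """Return True when an open problem meets an escalation trigger."""
    if management_override:
        return True  # management discretion applies at any time
    if resolution_found:
        return False
    if arrived_at is None:
        return now - called_at >= ARRIVAL_LIMIT
    return now - arrived_at >= RESOLUTION_LIMIT


called = datetime(2001, 11, 21, 9, 0)
arrived = datetime(2001, 11, 21, 9, 10)
print(should_escalate(datetime(2001, 11, 21, 9, 31), called))            # True
print(should_escalate(datetime(2001, 11, 21, 9, 45), called, arrived))   # False
print(should_escalate(datetime(2001, 11, 21, 10, 15), called, arrived))  # True
```

A check like this could be run during the periodic review of active problems, with the management override covering escalations that the two timers do not.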
To accommodate immediate response to Urgent problems, the designated problem Resolver and the affected Area Manager are notified via Pager. They are then directed to contact the Global Systems Help Desk for further instruction. When they contact the Global Systems Help Desk, the problem is described to the Area Manager and the Resolver accepts responsibility for repairing the problem.
Periodically, the Global Systems Help Desk will review active problems to ascertain if they should be raised in priority or to query the resolver and affected area as to the impact of the problem. Since the impact of a problem can change over time, it may result in executing problem escalation procedures, or transferring the problem to another resolver.
The problem review process can also result in problem closures, when it is determined that an active problem is a duplicate of another problem which has been recently resolved. Through this process, it is possible to inform resolvers of solutions obtained through other individuals or vendors.
The problem review process can result in savings in personnel time and problem outage durations.
All active problems are tracked through Problem Reports distributed to the areas affected by the problem and the resolver’s area. The problems are then reviewed by personnel in these areas and, if the problem’s status has changed, they will update the problem record (either directly or through the Global Systems Help Desk).
This process ensures that all active problems are addressed and resolved in the shortest time possible. It also ensures that problem information is updated regularly with the most current information available.
Sometimes it is possible to activate an alternate path around a failing component, or to process work in a different, but satisfactory, method. When possible, the technique used to “Work-Around” a problem is included in the Run Book for the Job and sometimes in the Apriori data base.
Problem reporters can research past problem records of a similar type that have been resolved, or have had Work-Arounds applied to them (free-form text section of problem reports may contain work-around information when used by a previous resolver). When finding this information, the problem reporter may choose to execute the same Work-Around as another resolver.
Another location where Work-Arounds can be found is in the Run Book (SUPPDATA) for the failing job. All locations should be researched to determine if a Work-Around exists and if the problem is severe enough to take advantage of a Work-Around procedure.
In any case, the presence or lack of a Work-Around should be noted when entering the problem into Apriori.
Problem Bypass/Circumvention procedures are often utilized within a data processing organization. They are especially important for critical components, where a single point of failure can result when a secondary access path does not exist.
Bypass/Circumvention actions include:
Isolating failing components,
De-activating the component,
Activating secondary components,
Re-assigning the failing job to the back-up component,
Restarting the failing job.
Even if a Bypass/Circumvention should allow the failing job to continue with its processing, it is still necessary to report the failing component within a problem report. This will allow for assignment of the problem to a resolver and the eventual repair of the failing component.
Remember, as long as a problem exists on the secondary component, a single-point-of-failure exists. This condition can lead to a disaster event for critical components that suffer a second failure. For this reason, the severity of the problem should be equal to the criticality of the operation and repair work should be escalated as needed to respond to the problem severity.
To recover from an encountered problem, it is necessary to re-establish the operating environment to the status just prior to when the problem occurred (sometimes referred to as the last “Check-Point”). This process is called Recovery processing and may include:
Deleting any datasets created after the last check point,
Uncataloging datasets,
Updating the Tape Management System,
Updating the Automated Scheduler, etc.
After recovery operations have been completed, the failing job can be restarted.
Recovery/Restart procedures should be included with job turnover documentation as SUPPDATA. All non-zero condition codes should have recovery/restart procedures supplied by the programmer. Sometimes, recovery/restart operations are included in the PROC as COND CODE steps that are executed when non-zero COND Codes are received. The use of this process is especially important when recovery/restart operations are extensive in time and labor.
When problems are resolved, their resolution is entered into the Apriori system and closure procedures are initiated. It is therefore important that problem resolutions are tested to ensure that they really do provide a complete solution to the problem.
Sometimes, system managers (MVS, On-Line, etc.) are asked to review problem resolutions to ensure that the solution does indeed resolve the problem in its entirety, but in all cases, the problem reporter is notified of the problem solution and asked to approve its closure.
After a problem’s resolution has been accepted, closure procedures are initiated. The purpose of problem closure is to notify all concerned parties of a problem solution and to solicit their acceptance of the solution.
The reporter of the problem must approve the problem solution before problem closure procedures can be successfully completed. Sometimes, systems managers and area managers are also consulted on a problem solution, before closure procedures can be completed. This checks-and-balances process is designed to ensure that problems affecting multiple components, or crossing functional areas of responsibility, provide the complete solution to the problem and not just the section of the problem perceived by a single area.
When a problem is finally closed, its solution is added to the free-form text area of the problem report. This information is added to the problem solution information associated with past problems of this type. From that point on, any new problems of this type will have the solution information displayed as reference data that can be used as an aid in problem resolutions going forward. Apriori’s bubble-up data base operation is responsible for providing the resolution data via normal procedures.
There are several variations of problem closure, including Close Pending. If a Change Control is required to repair the problem, then the problem is closed as “Closed Pending” until the change has been successfully implemented. Resolutions that do not require a Change Control are repaired immediately and their status updated to “Closed Verified” through normal close procedures.
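One way to read the closure rules is as a small status decision. The sketch below is an interpretation, not the Apriori implementation: the enum, the function, and the "Open" status are hypothetical, while the two closure status names are quoted from the paragraph above. It also assumes that a problem stays pending until the change is implemented and the reporter has approved the solution.

```python
from enum import Enum


class Status(Enum):
    OPEN = "Open"  # hypothetical name for a problem not yet closed
    CLOSED_PENDING = "Closed Pending"
    CLOSED_VERIFIED = "Closed Verified"


def closure_status(needs_change, change_implemented, reporter_approved):
    """Pick the closure status for a problem whose solution is known."""
    if needs_change and not (change_implemented and reporter_approved):
        # Waiting on Change Control, or on the reporter's acceptance.
        return Status.CLOSED_PENDING
    if reporter_approved:
        return Status.CLOSED_VERIFIED
    return Status.OPEN


print(closure_status(True, False, False).value)   # Closed Pending
print(closure_status(True, True, True).value)     # Closed Verified
print(closure_status(False, False, True).value)   # Closed Verified
print(closure_status(False, False, False).value)  # Open
```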
For severe problems that affect critical components, or multiple areas, a Post Mortem is performed. The information contained within a Post Mortem is designed to fully define a problem, its impact, and the steps taken to resolve the problem. Post Mortems are used to provide many people with problem information that will serve as an aid in avoiding problems of this type in the future.
Post Mortem information is distributed to members of the Technology Coordination Group - Mainframe Steering Committee (TCG-MFSC) for their review.
The Apriori Problem Management system is equipped with a powerful problem reporting tool, which is capable of grouping problems into categories and formatting reported output to suit user needs. This tool is used to generate reports for first and second level managers, as well as for ad-hoc requests and to support company problem resolver activities.
Periodically, problem reports are produced by the Global Systems Help Desk and distributed to designated personnel.
Each business morning, a report of all newly Opened problems and those problems that have been Closed within the last 24 hours is produced and distributed to personnel attending the 8:45 Problem Meeting.
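The selection behind this morning report can be sketched as a filter over a 24-hour window. The record layout and function name are hypothetical, and treating "newly Opened" as "opened within the same 24 hours" is an assumption.

```python
from datetime import datetime, timedelta


def morning_report(problems, now):
    """List problems opened, and problems closed, in the last 24 hours."""
    cutoff = now - timedelta(hours=24)
    opened = [p["id"] for p in problems if cutoff <= p["opened"] <= now]
    closed = [p["id"] for p in problems
              if p.get("closed") is not None and cutoff <= p["closed"] <= now]
    return {"opened": opened, "closed": closed}


now = datetime(2001, 11, 21, 8, 45)
problems = [
    {"id": "PR-0001", "opened": datetime(2001, 11, 20, 14, 0), "closed": None},
    {"id": "PR-0002", "opened": datetime(2001, 11, 15, 9, 0),
     "closed": datetime(2001, 11, 20, 17, 30)},
    {"id": "PR-0003", "opened": datetime(2001, 11, 10, 9, 0), "closed": None},
]
print(morning_report(problems, now))
# {'opened': ['PR-0001'], 'closed': ['PR-0002']}
```

Older problems that are still open, such as PR-0003 here, are left to the periodic management reports.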
A full range of first level and second level management reports has been developed by the Global Systems Help Desk. These reports provide first level managers with a list of problems assigned to their area. Second level managers are provided with summary reports describing the problems assigned to the first level managers who report to them.
The division executive is also supplied with a summary report detailing the problems assigned to the second level managers who report to the division executive.
Should a specific type of problem report be desired, contact the Global Systems Help Desk with your informational request. They are responsible for addressing your problem reporting needs.
The products utilized in support of Problem Management are:
Apriori
Apriori is the Problem Management System from Answer Systems, Inc. It is utilized as a Problem Repository and front-end to the Problem Management System.
UNIX based and located on dual servers, the Apriori Problem Management System is capable of communicating with Windows based and Info/Man based terminals (only Windows based operation is being utilized).
There are many options available with the Apriori product, some of which have been purchased by the company. They include telephone paging and fax reception/transmission features.
Telephone
When Urgent problems are reported, the Apriori system issues pages to the assigned resolver and the area manager responsible for the failing component. The phone page is a means of obtaining quicker responses to encountered problems and for informing management of a problem in their area.
Automated Call Directory (ACD) facilities are also available with Apriori. This facility routes problems to the next Global Systems Help Desk individual who is available to respond to incoming problem calls. The ACD feature is designed to speed up call responses through better call routing facilities.
Conference Bridge
The company has contracted with a telephone conferencing service provider, so that management and technical personnel can be conferenced when a serious problem occurs. This facility allows for the planning of actions in response to problem and disaster situations.
When a problem arises that warrants activation of the Teleconferencing facility, the Global Systems Help Desk will notify Darome Teleconferencing Services that a teleconference bridge must be established for company personnel. Separate bridges for Technical and Business Management personnel are available, as well as a common bridge for both parties.
The Global Systems Help Desk will page, or otherwise contact, all personnel who must participate in the conference bridge. During the teleconference, the problem will be described and actions taken explained. Additional information will be gathered from participants and problem severity classified.
Through the use of teleconferencing facilities, more people will be informed of the problem and additional information obtained that will assist in resolving the problem in the shortest time possible.
Personnel assigned to the Global Systems Help Desk are responsible for providing company personnel with assistance in diagnosing and repairing problems. They serve as a focal point for problem information and are the front line for problem resolution.
Global Systems Help Desk personnel work closely with the Command Center and assist problem reporters and resolvers with problem diagnosis and repair. All problem reports are monitored by the Global Systems Help Desk, who are also responsible for problem reporting and distribution.
All company personnel can be Problem Reporters, but problems are primarily reported by:
Technical Operations,
Business End Users,
Applications Development, and
The Global Systems Help Desk.
The problem reporter must provide the following information:
Problem Description and its Impact,
Reporter’s name, phone number, and department,
Date and Time of problem event, and
Supporting problem information, if available.
Reported problems are entered into the Problem Management System, assigned to a resolver, and tracked until resolved. When a problem solution is found, the problem reporter is notified and asked if the solution is acceptable. If so, then the problem can enter close processing.
Problem resolvers are usually Technical Operations personnel responsible for specific technical products and areas. When problems are reported, the problem is routed to and assigned to the problem resolver.
Problem resolvers research problems and define solutions as necessary. Once a problem solution is determined, the resolver applies the fix and notifies the Global Systems Help Desk to add the problem solution information to the problem record (or they enter the problem resolution information directly into the Apriori system, if authorized).
When a problem solution requires a change to be implemented, the Change Management System is utilized. In these cases, a PRA form is completed and the problem is placed in a Close Pending state.
The resolver will complete all required Change Control information and perform all functions associated with the type of change being implemented, including:
Forms completion,
Class A testing,
Component migrations,
Endevor interfacing, etc.
Once a change has been implemented successfully, the problem reporter is informed of the resolution and their acceptance is sought. If the solution is acceptable to the problem reporter, the problem status is changed from Close Pending to Close Verified.
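The status change described above can be pictured as a small state transition. The following is a minimal sketch under stated assumptions: the `Status` enumeration and the `verify_closure` function are hypothetical names, the `OPEN` state is an assumed placeholder for any earlier status, and leaving the problem in Close Pending when the reporter does not accept is an assumption, since the text does not define that case.

```python
from enum import Enum


class Status(Enum):
    OPEN = "Open"                      # assumed name for any earlier status
    CLOSE_PENDING = "Close Pending"    # named in the text
    CLOSE_VERIFIED = "Close Verified"  # named in the text


def verify_closure(status: Status, change_implemented: bool,
                   reporter_accepts: bool) -> Status:
    """Return the status after the reporter has been asked to accept the fix."""
    if status is not Status.CLOSE_PENDING:
        raise ValueError("only a Close Pending problem can be verified")
    if change_implemented and reporter_accepts:
        return Status.CLOSE_VERIFIED
    # Assumption: without a successful change and acceptance, nothing changes.
    return Status.CLOSE_PENDING
```

Raising an error for any status other than Close Pending keeps a problem from being marked Close Verified before the Change Management steps have been started.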
Info/Man
The product used to support the Change Management process at the company.
The Client Services Manager is responsible for overall operation of the Problem Management System and the Global Systems Help Desk.
Global Systems Help Desk personnel are responsible for:
Responding to problem reports from company personnel,
Entering problem information,
Performing first level problem resolution,
Routing problems to problem resolvers,
Tracking problems until they are resolved,
Coordinating problem closure with the Problem Reporter and systems managers,
Closing Problems,
Performing Post Mortems on problems, and
Performing Problem Reporting and Distribution.
All company personnel can report problems. If a deviation from Standards and Procedures, or a disruption to an Expected Service Delivery is experienced, then the event must be reported to the Problem Management System, either directly through Apriori, or via the Global Systems Help Desk.
Personnel responsible for a specific function or component are usually responsible for the operation and maintenance of that component as well. The Problem Management System refers to these people as Resolvers.
Problem Resolvers are responsible for accepting problem reports for components under their control and for responding to each problem report by performing problem resolution activities. Once a problem has been corrected by the resolver, the resolution must be reported to the Problem Management System.
Problem escalation is accomplished through communications with Resolvers. Should a Resolver require assistance to repair a problem, they can request escalation through the Global Systems Help Desk. Problem resolutions are added to the Apriori data base when Resolvers repair problems. Since problem repair descriptions will aid in the resolution of future problems of this type, it benefits the resolver to make resolution information as clear and concise as possible. This adds to the possibility that future problems of this type will be resolved by 1st Level Support and fewer problems will be routed to the Resolver’s area.
Periodic reviews of the Problem Management System are conducted to evaluate its operation. Any detected weakness is documented and recommendations for improvement formulated. The process evaluation procedure is repeated on an annual basis.
The Problem Management System is based on Apriori, a product that is unfamiliar to many individuals and cannot be accessed by many of the people who need to report and track problems.
Apriori is a UNIX-based product, and program products must be used to convert protocols from UNIX to Windows. The cost of these products has not been fully defined or factored into the overall cost of rolling out the Problem Management System.
An interface between the Problem and Change Management Systems does not presently exist.
Problem reporting does not presently satisfy all of the management and technical needs faced by the organization.
1. Develop a Problem Report Generation and Distribution System.
2. Develop an interface between the problem and Change Management Systems.
3. Develop a Roll-Out plan for Apriori.
4. Create training materials on the Apriori product.
5. Print a User's Guide for Apriori.
Appendix ‘A’ Service Level Document - Post Mortem.
Provides a sample of the SLD-PM document and explains how to complete the fields contained within the document.
Figure 7: Service Level Document - Post Mortem
Service Level Document - Post Mortem
Name: ________________________________________________ System: ________________
Department: ___________________________________________ Date: ________________
Phone #: _____________________ Fax: ___________________ Apriori #: ________________
============================================================================
SYMPTOM:
________________________________________________________________________
____________________________________________________________________________________
____________________________________________________________________________________
____________________________________________________________________________________
PROBLEM ANALYSIS:
_____________________________________________________________
____________________________________________________________________________________
____________________________________________________________________________________
____________________________________________________________________________________
DETAILS (causes, immediate resolution, etc.):
____________________________________________
____________________________________________________________________________________
____________________________________________________________________________________
____________________________________________________________________________________
____________________________________________________________________________________
____________________________________________________________________________________
____________________________________________________________________________________
IMPACT (CPU/CICS outages, batch delays, etc.):
_________________________________________
____________________________________________________________________________________
____________________________________________________________________________________
____________________________________________________________________________________
SOLUTION (“root cause”, permanent resolution, etc.):
____________________________________
____________________________________________________________________________________
____________________________________________________________________________________
____________________________________________________________________________________
____________________________________________________________________________________
FUTURE PREVENTION (actions to be taken, who, what, when):
____________________________
____________________________________________________________________________________
____________________________________________________________________________________
____________________________________________________________________________________
____________________________________________________________________________________
_____________________________________________________________________________________
Service Level Document - Post Mortem
Process Description
Document Overview:
The Service Level Document - Post Mortem (SLD-PM) assists problem reporters, resolvers, and management in gaining an understanding of major problems and their impact on the environment. The SLD-PM defines the problem and the steps taken to permanently resolve it.
Service Level Objective:
An SLD-PM document will be completed for every major problem (e.g., multiple users affected or business impacted) within 24 hours of the problem's occurrence, even if a problem solution has not been achieved. This objective ensures that problem information is available to management and the area(s) affected by the problem, so appropriate actions can be taken in response to the problem incident.
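The 24-hour objective lends itself to a simple deadline check. The sketch below is illustrative only: the function names `sld_pm_due` and `sld_pm_overdue` are hypothetical, and it assumes the 24 hours are measured in elapsed clock time from the recorded occurrence, which the text does not specify.

```python
from datetime import datetime, timedelta
from typing import Optional

# Stated objective: complete the SLD-PM within 24 hours of the occurrence.
SLD_PM_WINDOW = timedelta(hours=24)


def sld_pm_due(occurred_at: datetime) -> datetime:
    """Return the latest time by which the SLD-PM should be completed."""
    return occurred_at + SLD_PM_WINDOW


def sld_pm_overdue(occurred_at: datetime, now: datetime,
                   completed_at: Optional[datetime] = None) -> bool:
    """Report whether the SLD-PM missed, or is currently past, its deadline."""
    deadline = sld_pm_due(occurred_at)
    if completed_at is not None:
        # Completed documents are judged by when they were finished.
        return completed_at > deadline
    # Open documents are judged against the current time.
    return now > deadline
```

A check of this kind could feed a daily management report listing major problems whose SLD-PM has not yet been completed.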
Field Definitions:
The top section identifies the person responsible for completing the SLD-PM document and the system affected by the problem (this information is self-evident and is not further defined), while the bottom section describes the problem and the actions taken to permanently resolve it. The fields contained in the bottom section of the SLD-PM document are:
SYMPTOM:
Describe the Symptom(s) and/or indicators of the problem within this area. Include:
Initial symptoms perceived by the problem reporter, and
Any additional symptoms that the resolver may have provided.
PROBLEM ANALYSIS:
Define the steps taken in researching the problem to determine its “Root Cause”, including:
Messages and/or Codes,
Abends, and
Information sources used to develop:
- Bypass / circumvention,
- Recovery / restart procedures, and
- Permanent resolution to the problem.
DETAILS (causes, immediate resolution, etc.):
Describe the details associated with:
Reporting of the problem,
Events leading to its cause,
Bypass / circumvention actions taken,
Recovery actions taken,
Restart actions taken, and
Resolution actions.
IMPACT (CPU/CICS outages, batch delays, etc.):
Provide a description of the impact associated with this problem, including:
System(s) and subsystem(s) affected by the problem,
Length of time associated with the outage,
Locations affected by the outage,
Secondary impacts caused by the problem (e.g., missed deadlines, missing reports, delayed systems), and
Missing inputs/outputs caused by the problem.
SOLUTION (“root cause”, permanent resolution, etc.):
Provide specific information on the formulated problem solution, including:
Definition of problem’s “Root Cause”, and
Description of problem solution.
FUTURE PREVENTION (actions to be taken, who, what, when):
Include any additional actions that must be taken to permanently resolve the problem in this section, such as:
Change Control number required to implement permanent solution,
What actions must be taken for permanent solution,
Who is responsible for implementing permanent solution,
When permanent solution will be implemented,
Any needed documentation and procedure creations / updates, and
Staff training and orientation requirements.
Document Routing:
When completed, the form is submitted to Client Services, who forward a Post Mortem report to the Technology Coordination Group - Mainframe Steering Committee (TCG-MFSC) for its review.