Emory University School of Medicine
 Information Technology Services



Policies and Documents

Policy Statement on Systems Administration & Maintenance, Downtime and Emergency Response

  1. Definitions
    Within the scope of this policy, the following definitions shall apply:


    1. PRODUCTION SYSTEM - While the term 'production' in industry terms usually refers to a system support status of 24/7, within the scope of this policy the term refers to a particular IT resource being provided to users on ITS equipment, under ITS management, with the understanding that:

      1. the resource will be available consistently and performance will be optimal during normal BUSINESS HOURS,
      2. data and system state for the resource are stored on redundant systems with at least weekly backup and archival,
      3. the resource supports a function of the Dean's Office, OR the academic department, office or other business unit to which the resource is being provided has a written service level agreement with ITS to provide the resource (SLA),
      4. the SLA, including all payment terms if applicable, is current and up-to-date,
      5. in the case of web-based services, the resource is running on ITS-designated PRODUCTION equipment and is not in a constant state of change due to live development.

    2. NON-PRODUCTION SYSTEM - ITS may provide, at its discretion, certain IT resources to individuals or business units within the School with the understanding that these systems are not in a state of production and their availability is not guaranteed. Examples of such resources include development web/application server space, FTP access for temporary storage, experimental or work-in-progress websites, etc.

      A NON-PRODUCTION SYSTEM may experience unexpected downtime, or ITS may elect to revoke or significantly rework the resource without notice. While ITS will make every effort to preserve the business function of the resource and avoid undue inconvenience to customers, NON-PRODUCTION systems should not be integrated into business practice as mission-critical, and ITS cannot be responsible for any disruption that results from the dissolution of a NON-PRODUCTION SYSTEM.

    3. SYSTEMS ADMINISTRATOR - This is the individual (or secondary contact in the case of multiple administrators or unavailability of the primary individual) who is responsible for the day-to-day management of a particular ITS-provided resource. Different systems may have different administrators, though ITS maintains a written record and active organizational knowledge of which staff members are responsible for each resource. Users in the customer base may be, at ITS discretion, provided limited administrative rights to the system. These rights are not guaranteed (unless specified in the SLA document) and may be revoked at any time without notice.

    4. PLANNED MAINTENANCE - A procedure by which production services are made unavailable (usually performed during OFF-HOURS to minimize disruption) due to general and/or preventive maintenance. PLANNED MAINTENANCE must be announced well ahead of the scheduled time period (see the Notification of Downtime policy below). PLANNED MAINTENANCE must also be approved beforehand by any person or persons named in an SLA document as the primary notification contact for maintenance.

    5. EMERGENCY MAINTENANCE - A procedure by which production services are made unavailable for an undetermined amount of time due to an emergency situation. This may occur at any time during the day and is a reactive process, in contrast to PLANNED MAINTENANCE, which is proactive. As such it is impossible to predict when services will be reestablished or to assure that all users will be contacted with notification.

    6. BUSINESS HOURS - 8:00am - 5:00pm, Monday - Friday

    7. OFF-HOURS - 5:00pm - 8:00am Monday - Friday, Weekends and University Holidays

    8. DOWNTIME - Unavailability of any production system for customer use.
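
The BUSINESS HOURS and OFF-HOURS definitions above can be expressed as a simple check. The following is a minimal sketch, not an ITS tool: the function name `is_business_hours` and the `holidays` parameter are hypothetical, and it assumes that exactly 5:00pm counts as OFF-HOURS, which the policy does not specify.

```python
from datetime import date, datetime, time

BUSINESS_START = time(8, 0)   # 8:00am, per the BUSINESS HOURS definition
BUSINESS_END = time(17, 0)    # 5:00pm, treated here as the start of OFF-HOURS (assumption)


def is_business_hours(moment: datetime, holidays: frozenset = frozenset()) -> bool:
    """Return True if `moment` falls within BUSINESS HOURS, False for OFF-HOURS.

    `holidays` is a hypothetical set of University Holiday dates supplied by the caller.
    """
    if moment.date() in holidays:
        return False               # University Holidays are OFF-HOURS
    if moment.weekday() >= 5:      # 5 = Saturday, 6 = Sunday
        return False               # weekends are OFF-HOURS
    return BUSINESS_START <= moment.time() < BUSINESS_END
```

A scheduler could call this before starting any procedure that causes DOWNTIME, passing the list of University Holidays for the current year.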

  2. Policy Statement

    PRODUCTION SYSTEMS will be provided to customers in a satisfactory and consistent manner, in keeping with the following specific policies:

    1. General Maintenance

      ITS maintains a firm commitment to the general maintenance of its systems, including server equipment and configuration, managed desktops and other client equipment used to connect to ITS resources, as well as current versioning of tools and operating software. ITS will perform the following general maintenance functions on all servers and system components to ensure the maximum performance and prevention of failures:

      1. patches to operating system and component files to address vulnerability and bug fixes, increase performance and/or add functionality; a full rollback will be possible if such patches cause unintended problems,
      2. regular monitoring and back-checking of system logs to verify that hardware and software requirements are adjusted to satisfy the demand on the equipment,
      3. security monitoring and auditing of user access to certain resources deemed high security (for more detail, see (7) System Security),
      4. defragmentation, disk cleanup, and other general disk maintenance operations,
      5. upgrades to system physical memory or storage capacity as required; ITS will always maintain a minimum ratio of 1:2 used storage to maximum capacity (additional costs may be required to upgrade storage or disk quotas may be established - refer to SLA),
      6. performance tuning of equipment based on actual usage,
      7. regular backup of system state data for disaster recovery purposes (see (5) Backup and Archival for more detail),
      8. maintenance of parallel servers which can accommodate failover in the event of an extended emergency.
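
The storage commitment in item 5 can be illustrated with a short check. This sketch assumes one reading of the 1:2 ratio, namely that used storage should not exceed half of maximum capacity; the function name and the gigabyte units are hypothetical and do not describe any actual ITS monitoring tool.

```python
def needs_storage_upgrade(used_gb: float, capacity_gb: float) -> bool:
    """Return True when used storage exceeds half of maximum capacity.

    Assumes the policy's 1:2 ratio means used storage <= capacity / 2.
    """
    if capacity_gb <= 0:
        raise ValueError("capacity_gb must be positive")
    return used_gb * 2 > capacity_gb
```

Under this reading, a 100 GB volume with 60 GB in use would be flagged for an upgrade or a quota, while one with 50 GB in use would not.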


      General maintenance procedures typically will not result in DOWNTIME, as they are performed daily and routinely and usually do not necessitate system reinitialization. In the event that downtime is necessary to complete a general maintenance procedure, the procedure will be performed during OFF-HOURS and DOWNTIME will extend for no more than 30 minutes. In the event that a general maintenance item creates an unintended problem with the system, based on the severity of the problem it will be escalated to either (2) Planned Downtime or (3) Emergency Maintenance & Response.

    2. Planned Downtime

      In many instances involving the installation of new applications, upgrade of a particular service or resource to a new version, or elaborate maintenance procedures requiring the assistance of personnel outside of ITS during BUSINESS HOURS, it is necessary to make resources temporarily unavailable. Planned downtime must be communicated to each identified user of any resource that will be made unavailable to that user during BUSINESS HOURS.

    3. Emergency Maintenance & Response

      When an unexpected condition arises that has the potential to cause damage to equipment or data, compromise system security, or cause massive DOWNTIME if not repaired, ITS will immediately invoke EMERGENCY MAINTENANCE procedures in response to these threats. EMERGENCY MAINTENANCE is the only mechanism by which PRODUCTION SYSTEMS may be made unavailable without notice to the customer base.

      If EMERGENCY MAINTENANCE procedures cannot be completed within 24 hours from the start of resulting DOWNTIME, the SYSTEM ADMINISTRATOR will be responsible for contacting all users of affected services via the same mechanism defined for PLANNED MAINTENANCE (see (4) below). Once an EMERGENCY MAINTENANCE procedure has been completed, a review by all members of the SYSTEM ADMINISTRATOR group and the CIO will determine whether the actions taken have been appropriate and effective. Customers wishing to appeal the decision to invoke EMERGENCY MAINTENANCE procedures may do so by calling the Office of the CIO at (404) 712-9628.

      If EMERGENCY MAINTENANCE procedures are deemed unsuccessful or inappropriate, ITS will contact and provide, at its own expense, an expert resolution of the problem from a third party. A secondary review will also take place to determine whether general maintenance and backup/archival procedures could have been performed more effectively to prevent current and future emergency situations. ITS cannot, however, be held responsible for data loss or business disruption due to emergency problems beyond the control or influence of ITS staff, for example the infestation of a heretofore unknown virus or worm, or the exploitation of an undocumented system vulnerability.

    4. Notification of Downtime

      The SYSTEM ADMINISTRATOR of any ITS-provided PRODUCTION SYSTEM is responsible for communicating with all individuals identified by their sponsoring department or office as users of the PRODUCTION SYSTEM whenever DOWNTIME occurs on any service. The administrator will do so via email, using mailing lists containing the preferred address of every user that the sponsoring office has identified to ITS as a user of the resource. All users are responsible for checking email in the event of DOWNTIME for notification and system status.

      This notification requirement will vary depending on the level of procedure below:

      1. GENERAL MAINTENANCE

        If general maintenance procedures will cause DOWNTIME during business hours, the SYSTEM ADMINISTRATOR must notify users 24 hours prior, and downtime may not last more than 30 minutes.

      2. PLANNED MAINTENANCE

        PLANNED MAINTENANCE must be announced 10 business days in advance for every day of system unavailability. If a period of PLANNED DOWNTIME coincides with a critical need for the resource, an appeal can be made to the CIO.

      3. EMERGENCY MAINTENANCE

        Depending on the severity of the emergency, the SYSTEM ADMINISTRATOR must use his discretion to determine if time should be allocated to user notification. In these cases, time (even the few minutes it takes to notify users) is often of the essence in successfully resolving the issue or preventing further compromise of system integrity. Users who are experiencing extended periods of DOWNTIME without notification should assume that EMERGENCY MAINTENANCE procedures are underway, but still report the DOWNTIME to the ITS Office. In many cases, the SYSTEM ADMINISTRATOR may not know the extent of the damage until reports have come in from all users.

        Upon completion of EMERGENCY MAINTENANCE procedures, or after 24 hours without successful resolution, the SYSTEM ADMINISTRATOR should contact all users and provide detailed system status, including additional anticipated DOWNTIME.
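
The PLANNED MAINTENANCE lead time above can be worked out mechanically. The sketch below is illustrative only: it reads the rule as 10 business days of notice multiplied by the number of days of unavailability, counts Monday through Friday as business days, and ignores University Holidays. The function names are hypothetical.

```python
from datetime import date, timedelta

NOTICE_DAYS_PER_DAY_DOWN = 10  # business days of notice per day of unavailability


def subtract_business_days(start: date, business_days: int) -> date:
    """Step backwards from `start`, counting only Monday-Friday dates."""
    current = start
    remaining = business_days
    while remaining > 0:
        current -= timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday-Friday
            remaining -= 1
    return current


def latest_planned_announcement(start: date, days_unavailable: int) -> date:
    """Return the last date on which PLANNED MAINTENANCE can be announced."""
    if days_unavailable < 1:
        raise ValueError("days_unavailable must be at least 1")
    return subtract_business_days(start, NOTICE_DAYS_PER_DAY_DOWN * days_unavailable)
```

For maintenance starting Monday, July 18, 2005, this reading gives an announcement deadline of July 4 for one day of unavailability and June 20 for two days. A production version would also need to skip University Holidays.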


      The following items will be variable and will have parameters matching the requirements of each system:

    5. Backup and Archival

    6. Restoration of Backup/Archive Data

    7. Granting and Revocation of User Rights

    8. System Security

    9. Non-Production Systems and Resources


Site Designed and Maintained by School of Medicine Information Technology Services
© 2005 Emory University. Last Update: July-05