PRISM System Specification Handbook

Part II: Requirements, Design Options and Constraints (REDOC)


Design Options for PRISM System Architecture

Version 1.4



Authors

C. Larsson, N. Wedi, H. Thiemann

Table of contents

  1. Introduction
  2. Terminology and concepts
  3. System model
  4. Software life cycle
  5. Security
  6. Specific Design Choices and Constraints
  7. Investigated software packages
  8. Discussion on the architectural choices
  9. Cooperating systems
  10. Risks
  11. Definitions, Acronyms and Abbreviations
  12. REFERENCES

1.0 Introduction

The purpose of the PRISM system is to enable users to perform numerical experiments that couple interchangeable model components (e.g. atmosphere, ocean, biosphere, chemistry) using standardised interfaces as outlined in section (I.3). The general architecture provides the infrastructure to configure, submit and monitor these coupled model experiments, and subsequently to postprocess, archive and diagnose their results.
There is an emphasis on choosing an architectural design that allows these activities to be done remotely, i.e. without the user physically being in the place where the numerical computations take place.

The required features of the PRISM system are analysed with respect to the processes involved, the actions they take and where they happen.

The general architecture influences the security models that can be applied. It defines the possible operations of a user from a remote site. An architectural design is presented that satisfies the security and configurability demands required by all processes. Existing technologies are investigated to assess the constraining implications of their use.

The cost and capacity of computing and network technologies are expected to keep changing. It is therefore important to design an adaptable and scalable architecture.

2.0 Terminology and concepts

2.1 Experiment

An experiment is an ensemble of tasks running on a supercomputer, defined by a configuration process. Three levels of communication exist within such a coupled experiment:


Tasks are under the control of a scheduler. The function of the scheduler is to coordinate the execution of these tasks while preserving any dependencies between them.
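
As an illustration of what "preserving dependencies" means for a scheduler, the following minimal sketch derives a valid execution order from a dependency table. The task names are hypothetical and are not taken from PRISM; the sketch uses Python's standard `graphlib` module and says nothing about how the actual scheduler is implemented.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks of one experiment: each task maps to the tasks it depends on.
dependencies = {
    "prepare_input": set(),
    "coupled_model": {"prepare_input"},
    "postprocess": {"coupled_model"},
    "archive": {"postprocess"},
    "diagnostics": {"postprocess"},
}


def execution_order(deps):
    """Return one order in which every task comes after all of its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Tasks without a mutual dependency, such as the hypothetical `archive` and `diagnostics` above, may appear in either order, which is where a real scheduler could run them in parallel.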

2.2 Task

A task is an individual job step of an experiment that needs to be executed. Figure 1 is a hierarchical view of an experiment in the tool Xcdp, showing a collection of tasks. The boxes represent different tasks and the colour code shows the status of each task. Note that the coupled model, i.e. coupler and component models, represents a single task of an experiment.

Xcdp graphical view of an experiment
Figure 1 This figure shows the status of tasks as different colours in an overview of an experiment.

2.3 Coupler

One major objective of the PRISM project is to develop a standard interface for each possible pair of models constituting the global climate system. The models will exchange information through standard interfaces with a universal coupler or directly with the other model components. The coupler is the program responsible for controlling the coupled model formed by the different component models, and controlling the exchanges and transformations of physical data between them. This is detailed further in section II.2.

2.4 Configuration

Three basic phases of configuration can be identified:


The definition phase comprises the definition of all component models to be coupled (model interfaces and metadata - PMIOD), transformation entities, I/O options, post-processing options, diagnostic options, statistic options, etc. In the PRISM system this is provided by the PRISM model administrator and presented to the PRISM user through the user interface.

During the composition phase the PRISM user sets up a specific coupled experiment through the user interface by

During the deployment phase an abstract compact description of an experiment is generated. This is defined as a configuration instance. A configuration instance details how to run the coupled experiment on a computer, in a format that can be understood by the computer's operating system. It also contains information on the coupling communication between models and on the internal communication of each model component on the chosen platform. This is further detailed in section II.2. Consistency checking before deployment ensures a correct configuration for each task.
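
To make the idea of consistency checking before deployment concrete, here is a minimal sketch. The class `ConfigurationInstance`, its fields and the single rule checked are assumptions made for illustration only; the real configuration instance format is defined in section II.2.

```python
from dataclasses import dataclass, field


@dataclass
class ConfigurationInstance:
    """Hypothetical, heavily simplified description of one coupled experiment."""
    experiment: str
    platform: str
    components: list = field(default_factory=list)  # names of component models
    exchanges: list = field(default_factory=list)   # (source, target, field) tuples


def check_consistency(config):
    """Return a list of problems; an empty list means the instance may be deployed."""
    problems = []
    if not config.components:
        problems.append("no component models selected")
    known = set(config.components)
    for source, target, name in config.exchanges:
        for model in (source, target):
            if model not in known:
                problems.append(
                    f"exchange of '{name}' refers to unknown model '{model}'"
                )
    return problems


config = ConfigurationInstance(
    experiment="exp001",
    platform="hypothetical-hpc",
    components=["atmosphere", "ocean"],
    exchanges=[("ocean", "atmosphere", "sst"), ("atmosphere", "ice", "heat_flux")],
)
problems = check_consistency(config)
print(problems)
```

Here the second exchange names a model that was never selected, so the check reports one problem and deployment would be refused.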

2.5 User interface

The system should allow the domain activities to happen by remote access, i.e. the users do not have to be physically in the place where the model is executed or the data is located. The interaction between the users and the system takes place through a user interface. This interface establishes the identity of the user and allows access to the system's functionality. The functionality is provided by a number of specialised servers accessed by the client user interface (UI) detailed in section II.4.

2.6 Administration

An administrator will be provided by each institution maintaining and developing a particular model component of the coupled PRISM system. The administrator has several tasks:


This is accomplished through an administration user interface. Since there is a need for varying expertise to accomplish all administration tasks, the administration may be performed by several persons. Due to the likely distribution of expertise for the different model components, administrators will be physically located at different sites, which further emphasises the need for a distributed system.

2.7 Results

The experiment results in the output of data fields, statistics and diagnostics. This output needs to be archived, catalogued and made accessible to the modeller who will need to visualise the diagnostics and data to understand the results. It is the task of the archiving and data management system to accomplish this and the details are explained in the section II.3 documents.

2.8 Client and server processes

The partitioning of functionality allowing a client to perform operations outside its own capability with the help of a more powerful server is called client/server computing. The client and the server may reside on physically different computers and they communicate by accessing computer networks.

3.0 System model

3.1 System actors and their activities

Three actors on the PRISM system can be identified: users, developers and administrators.

Table of PRISM Actors and Activities
Use Case 1
Table 1 This figure shows the PRISM actors and their actions on the system.

It is important to distinguish between the behaviour of these actors:
The groups represent diverse and contradictory demands on the system, and one system cannot satisfy all of them. We therefore suggest that the system be developed in two phases and as two different products configured from the same software base:

The exact differences will be outlined in the ARCDI document.
Table of PRISM actors and main activities
Actor               | Main activity and interaction with system | Acts on
PRISM administrator | Executes administration tasks             | Definitions of administration entities
PRISM administrator | Provides definitions of all entities      | Model component interfaces and metadata
PRISM user          | Composes coupled experiments              | Definitions of all entities
PRISM user          | Visualises, queries and manages           | Model results
PRISM developer     | Develops model components                 | Model components
Table 2
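
The actor/activity mapping of Table 2 can be read as the starting point for an access model. The sketch below encodes the table as data and checks a request against it. The short activity labels and the function `is_allowed` are illustrative assumptions, not part of the PRISM design.

```python
# Hypothetical encoding of Table 2: actor -> set of (activity, acts on) pairs.
PERMISSIONS = {
    "administrator": {
        ("execute administration tasks", "administration entities"),
        ("provide definitions", "model component interfaces and metadata"),
    },
    "user": {
        ("compose coupled experiment", "definitions of all entities"),
        ("visualise", "model results"),
        ("query", "model results"),
        ("manage", "model results"),
    },
    "developer": {
        ("develop", "model components"),
    },
}


def is_allowed(actor, activity, target):
    """True if the table lists this activity on this target for the actor."""
    return (activity, target) in PERMISSIONS.get(actor, set())


print(is_allowed("user", "query", "model results"))
print(is_allowed("user", "develop", "model components"))
```

Keeping the mapping as plain data means the level of control can be adjusted by the PRISM partners without changing the checking code.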

3.2 Process view

The actors realise their activities by means of a user interface. The following processes can be identified from the activities above:

  1. Client configuration processes
  2. Configuration provider processes
  3. Execution processes

Following is a more detailed list of the activities above, grouped by the processes 1, 2 and 3:
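Table of client configuration processes (1)
Where activity takes place | What activity
User interface             | Configuration
User interface             | Visualisation
User interface             | Authentication
User interface             | Archive query
User interface             | Documentation
User interface             | Configuration instance
User interface             | Monitoring
Table 3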

Table of configuration provider processes (2)
Where activity takes place | What activity
Configuration server       | Configuration
Configuration server       | Configuration instances
Documentation server       | Documentation
Authentication server      | Authentication
Administration server      | Model build configuration
Administration server      | Model build configuration instance
Experiment database server | Configuration instances for the experiments
Visualisation server       | Visualisation
Monitoring server          | Start/stop/state information
Table 4

Table of execution processes (3)
Where activity takes place | What activity
Scheduling server          | Configuration instances
Execution server           | Coupled model (coupler + component models)
Execution server           | Data pre/post processing
Archiving server           | Archiving
Table 5

Each process is either distributed over the participating PRISM sites or local. Note that initially the different model components are not to be distributed in the system.

3.3 Proposed architecture

The client configuration process (UI) is accessed through the Internet. The configuration provider processes are accessed through a central site but the services can be distributed to other sites without functional difference. The execution process is local to the model provider. This is described as directory centric, web enabled and distributed from local PRISM sites.
Variations of the architecture are shown below in three figures A, B and C. The figures show a Central PRISM site and a Local PRISM site. The local site is where the execution, scheduling and archiving servers are located. The central site is one of the participating PRISM sites, where the configuration provider processes listed in Table 4 are located and used by all client processes. The component boxes in the figures represent the following:


The figures show the architecture when components are moved from the central PRISM site to the local PRISM site.

Central Site Architecture, directory centric.
Arch A
Figure 2

Common Data Architecture, model provider centric.
Arch B
Figure 3

Full replication Architecture, no central repository.
Arch C
Figure 4

3.4 Data View

The data in the system consists of:


The architecture has to provide for the movement of this data. A data access policy and a security policy determine who can move what data and where it can be moved. This requires interaction with many subsystems such as archiving processes and network providers.

4.0 Software life cycle

Interoperating components in collaborative and distributed environments are deployed and maintained by multiple administrators, so upgrades and maintenance are likely to be uncoordinated. With no central authority to plan and execute upgrades, some clients will always be out of synchronisation. This applies to the domain model, i.e. the scientific model, as well as to the computing model, i.e. the infrastructure components such as application servers. For the domain model, this problem should be addressed by allowing the different software (component model) providers all to act as administrators, enabling them to start the distribution of upgrades to all sites.
For the computing model, i.e. infrastructure software such as service providers, the problem is further complicated by the fact that services can call each other. A service should:


PRISM software deployment model
Use Case 1
Figure 5 This figure shows the concept of a central service provider deploying new versions of components on local sites.

The figure above represents a possible architecture in which services deploy themselves through a central service provider. Authentication and security mechanisms must allow the service provider to supply new versions of software for automatic deployment. Failure to provide this semiautomatic life cycle management will lead to service failures and increased administration costs.
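
One small building block of such life cycle management is detecting which sites lag behind the versions published by the central service provider. The sketch below is only an illustration: the service names, site names, dotted version scheme and the function `outdated_sites` are all assumptions.

```python
def parse_version(text):
    """Turn '1.10' into (1, 10) so that versions compare numerically, not as text."""
    return tuple(int(part) for part in text.split("."))


def outdated_sites(published, installed):
    """Return {site: [service, ...]} for services older than the published version."""
    result = {}
    for site, services in installed.items():
        stale = sorted(
            name
            for name, version in services.items()
            if name in published
            and parse_version(version) < parse_version(published[name])
        )
        if stale:
            result[site] = stale
    return result


# Hypothetical state: versions offered centrally and versions found on local sites.
published = {"configuration": "1.4", "monitoring": "2.0"}
installed = {
    "site_a": {"configuration": "1.4", "monitoring": "1.10"},
    "site_b": {"configuration": "1.3", "monitoring": "2.0"},
}
pending = outdated_sites(published, installed)
print(pending)
```

A deployment service could use such a report to decide which upgrades to push, subject to the authentication requirements described above.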

5.0 Security

5.1 Overview

System security is made up of the following components:

  1. Authentication - Proof of identity
  2. Authorisation - The user can only access information he is allowed to see
  3. Confidentiality - Information transmitted is not read by third parties
  4. Non-repudiation - The sender cannot deny that the message was sent
  5. Integrity - Data transmitted is not tampered with

Authentication can be made by:
  1. Intellectual property, something you know such as a password
  2. Physical property, something you have such as a certificate
  3. Biological property, something unique to you such as a fingerprint

The strength of security is normally determined by the number of factors involved, i.e. a password is one factor, a certificate a second. Combining the two raises the strength by orders of magnitude.
Authentication normally means that two computers can verify the identity of each other, not that the operators are who they claim to be. The combination of biometrics (fingerprints, for example) with a physical property is a very strong form of authentication and could solve this problem. A different form of authentication is needed when two computers must authenticate each other without any operator intervention. This situation arises frequently in a service-oriented system when two services depend on each other or when a service instigates new operations that require further authentication.

There are in principle the following security solutions in operation today:
  1. Password based - These solutions require an operator to supply the password. Various levels of sophistication involving rotation, time of validity and reuse of passwords can be found. One example is the S/Key system.
  2. Physical token based - These solutions require you to hold a certificate, smart card or similar. Various commercial and free offerings are available, building on X509 certificates or proprietary smart cards.
  3. Public key/private key based - These solutions build on public and private keys being able to encrypt and decrypt messages, thus allowing the two parties to verify each other. They can be combined with passwords or physical tokens and offer encryption of communication. One system building on these is Secure Shell (SSH); Kerberos offers comparable mutual authentication and encryption, although it relies mainly on shared secret keys and tickets.
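
The one-time password idea behind systems such as S/Key can be sketched with a hash chain. The code below is a simplified illustration only: it uses SHA-256 from Python's standard library, whereas the S/Key system itself specifies a different hash function and encoding. The class `Verifier` and the pass phrase are hypothetical.

```python
import hashlib


def hash_once(value: bytes) -> bytes:
    return hashlib.sha256(value).digest()


def chain(secret: bytes, n: int) -> bytes:
    """Apply the hash function n times to the secret."""
    value = secret
    for _ in range(n):
        value = hash_once(value)
    return value


class Verifier:
    """Server side: stores only the last accepted value, never the secret itself."""

    def __init__(self, initial: bytes):
        self.stored = initial

    def verify(self, candidate: bytes) -> bool:
        if hash_once(candidate) == self.stored:
            self.stored = candidate  # the next login must present the preceding value
            return True
        return False


secret = b"hypothetical pass phrase and seed"
server = Verifier(chain(secret, 100))        # server starts with the 100-fold hash
first = server.verify(chain(secret, 99))     # accepted
replay = server.verify(chain(secret, 99))    # rejected: this password was already used
second = server.verify(chain(secret, 98))    # accepted
print(first, replay, second)
```

Because each password is the hash pre-image of the previous one, an eavesdropper who captures one password cannot derive the next, which is the property that makes the scheme attractive for operator logins over untrusted networks.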

5.2 Requirements

There should be an access model enforced in the system to ensure proper authorisation. The level of control is to be set to be practical in terms of administration and confidentiality and to be determined by the PRISM partners.
The level of integrity of system transmissions is to be high, but confidentiality is less important, as messages will mainly consist of configuration information, which is of little use without access to the software being configured.
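
A requirement of high integrity with modest confidentiality matches what a message authentication code provides: the message travels in clear text, but any modification is detected. The following minimal sketch uses Python's standard `hmac` module; the shared key and the message content are hypothetical, and key distribution between services is not addressed here.

```python
import hashlib
import hmac

# Assumption: both services already hold this key; how it is distributed is out of scope.
SHARED_KEY = b"hypothetical key shared by two PRISM services"


def sign(message: bytes, key: bytes = SHARED_KEY) -> str:
    """Return a hex tag that depends on both the message and the key."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify(message: bytes, tag: str, key: bytes = SHARED_KEY) -> bool:
    """Check the tag in constant time to avoid leaking information via timing."""
    return hmac.compare_digest(sign(message, key), tag)


message = b"experiment=exp001;component=ocean;resolution=2deg"
tag = sign(message)
print(verify(message, tag))
print(verify(message.replace(b"2deg", b"1deg"), tag))
```

The second check fails because the configuration text was altered after signing, while the original message remains readable to anyone, which is acceptable under the stated requirement.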

The service to service authentication needs to be solved in a scalable manner consistent with the administration resources available.

6.0 Specific Design Choices and Constraints

6.1 Design Choices

6.2 Constraints

7.0 Investigated software packages

7.1 PALM

The PALM project aims to provide a general structure for a modular implementation of a data assimilation system. In this system, a data assimilation algorithm is split up into elementary "units" such as the observation operator, the computation of the correlation matrix of observational errors, the forecast model, etc. PALM ensures the synchronisation of the units, drives the communication of the fields exchanged by the units and performs elementary algebra if required. This goal has to be achieved without a significant loss of performance compared to a standard implementation. It is therefore necessary to design the PALM software in view of the following objectives and constraints:



PALM should be considered as a coupler designed for dynamic coupled simulations. The PALM GUI (Pre-Palm) is a sophisticated GUI required to describe a dynamic simulation and the complex relations that can occur between the different dynamic units in the task. For static runs this level of sophistication is probably not required and can incur performance penalties.

7.2 UNICORE

UNICORE is a meta-computing framework based on an Abstract Job Definition that can be submitted to different sites from Java clients. Gateways receive the jobs, then translate and schedule the definition for execution on the available hosts. Security is based on certificates. The client is downloaded once together with the definitions. Developing new job definition interfaces (plugins) requires programming skills and considerable effort. UNICORE mostly lacks comprehensive scheduling and monitoring mechanisms and is not yet used in a production environment.

7.3 GLOBUS

Quoting from www.globus.org:

The Globus Project is a multi-institutional research and development effort creating fundamental technologies for computational grids. Grids are persistent environments that enable software applications to integrate instruments, displays, computational and information resources that are managed by diverse organisations in widespread locations. A primary product of the Globus Project is the open source Globus Toolkit, which is being used in numerous large Grid deployment and application projects in the United States, Europe, and around the world.

Parts of the Globus project software relate to the tasks at hand in PRISM, such as security mechanisms (certificates), resource lookup, scheduling and Message Passing (MP) technology. A benefit is that many important institutions and commercial interests support the Globus initiative.

7.4 PrepIFS/SMS

PrepIFS is an interactive meteorological application for preparing research experiments using the Integrated Forecasting System (IFS) at ECMWF. Both researchers at ECMWF and scientists in institutions anywhere in Europe (subject to prior permission) can access the complex computer environment at ECMWF via the Java application PrepIFS or, over the Internet, via the PrepIFS Java applet and any standard WWW browser. Forecast and analysis experiments can be prepared and submitted remotely.

The system uses a combination of web servers and application brokers/directories/providers to communicate with the preparation client application which contains functionality to validate the prepared experiment before it is submitted for processing.

Supervisor Monitor Scheduler (SMS) is an application that enables users to run a large number of programs, which may have dependencies on one another and on time, in a controlled environment with reasonable tolerance of both hardware and software failures, combined with good restart capabilities. SMS submits tasks and receives acknowledgements from the tasks when they change status and when they send events. SMS knows the relationships between tasks and is able to submit dependent tasks when a given task changes its status, for example when it finishes. An associated application, Xcdp, allows jobs in the scheduler to be monitored and changed through a GUI. The scheduling application is currently used only within local area networks.

7.5 SSH

SSH (Secure Shell) is a protocol for authenticated remote host access. It works with public and private key authentication and encrypts transferred data. It provides commands for file transfer and remote login and may be a useful tool for administration.

7.6 Technology trends

It has been suggested (ref 1) that the rate of technology change, i.e. the time in which capacity doubles or price halves, is around 9, 12 and 18 months for networks, storage and computing power respectively. If network performance doubles relative to computing power every 18 months, it will, relative to computing, become essentially free. From this point of view it is important to select an architecture that can exploit this advantage.
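
The arithmetic behind this argument can be checked with a few lines of code. The sketch assumes idealised exponential growth with exactly the doubling times quoted above; the function names are illustrative.

```python
# Doubling times in months, as quoted in the text above.
DOUBLING_MONTHS = {"network": 9, "storage": 12, "computing": 18}


def growth_factor(months, doubling_months):
    """Capacity multiplier after `months` for a given doubling time."""
    return 2.0 ** (months / doubling_months)


def relative_growth(months, fast="network", slow="computing"):
    """How much faster `fast` has grown than `slow` after `months`."""
    return growth_factor(months, DOUBLING_MONTHS[fast]) / growth_factor(
        months, DOUBLING_MONTHS[slow]
    )


for months in (18, 36, 60):
    print(
        months,
        round(growth_factor(months, DOUBLING_MONTHS["network"]), 1),
        round(growth_factor(months, DOUBLING_MONTHS["computing"]), 1),
        round(relative_growth(months), 1),
    )
```

With doubling times of 9 and 18 months, the ratio of network to computing capacity is 2^(t/9 - t/18) = 2^(t/18), so it doubles every 18 months, which is the relative rate the text refers to.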

8.0 Discussion on the architectural choices

The best designs for achieving remote access, modularity and extendibility are the directory centric (A) and the model provider centric (B) architectures outlined in section 3.3 (Proposed architecture).

The directory centric (A) architecture has the benefit that it minimises the duplication of static or semi-static resources (e.g. land and sea masks). It also allows central content to grow, while local content can still be chosen if appropriate. For deployment, PRISM sites do not need full web and application servers, which makes management easier. Future cooperative techniques, such as client visualisation displaying on many clients, can be used from the central site.

The drawback of the directory centric (A) architecture is that complexity increases, as resources need to be advertised and discovered by clients. Some concentration of processing power may be required to serve all clients.

The final architecture will show that combinations of local and central resources are possible as they will not compromise the system.

The administration effort required for the central site architecture is likely to be lower, as duplication of data and software is not necessary and thus fewer physical copies need to be accessed.

It is important to understand that most of the system service communication takes place over the Internet, a network over which we do not have full control. As a result, response times will vary considerably for messages and for the actions invoked through the user interface. Recoverability is limited, as it is often difficult to diagnose where errors occur. If a certain level of performance is deemed essential, a virtual private network should be set up with a service level agreement.

9.0 Cooperating systems

9.1 Technology

The technology that realises the proposed architecture is known as "Web Services". This includes the use of web servers, application servers, resource directories, discovery mechanisms and message services, together with Java clients and servers. For security, certificates and Secure Socket Layers (SSL) as well as encryption can be used. Web services as a technology is service centric, allowing clients dynamic service discovery over networks such as the Internet. It is usually deployed as a three tier system in which a front end presentation layer, such as a browser or Java client, communicates with a remote domain application (service) through a web server. The web services infrastructure benefits from the integration of diverse software, made possible by standardisation, and from directory technologies that enable service providers to publish their services irrespective of implementation technology.
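
The publish and discover pattern described above can be sketched as a small in-memory directory. This is a stand-in for illustration only and does not use any real directory standard; the class `ServiceDirectory`, the site names and the example.org endpoints are hypothetical.

```python
class ServiceDirectory:
    """Hypothetical in-memory stand-in for a resource directory."""

    def __init__(self):
        self._entries = {}

    def publish(self, service_type, site, endpoint):
        """A service provider advertises an endpoint for a type of service."""
        self._entries.setdefault(service_type, []).append(
            {"site": site, "endpoint": endpoint}
        )

    def discover(self, service_type, preferred_site=None):
        """Return an entry at the preferred site if one exists, else the first published."""
        candidates = self._entries.get(service_type, [])
        for entry in candidates:
            if entry["site"] == preferred_site:
                return entry
        return candidates[0] if candidates else None


directory = ServiceDirectory()
directory.publish("configuration", "central", "https://central.example.org/config")
directory.publish("configuration", "local", "https://local.example.org/config")

print(directory.discover("configuration", preferred_site="local"))
print(directory.discover("configuration"))
print(directory.discover("visualisation"))
```

Because clients ask the directory rather than hard-coding endpoints, a service can move between the central and a local site without any change on the client side.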

9.2 Standards

The standardisation of interfaces in complex and configurable systems becomes very important when deploying distributed architectures for scalability, extendibility and future success. A key factor making interoperability between software possible is the development of XML, the eXtensible Markup Language. This language allows for standardisation of messages between systems, enabling clients and servers to interoperate over networks. The development of XML promises to standardise several other important technologies such as:


In the future this standardisation will make it possible for systems to share and exchange information in a structured way. PRISM will be one of the projects ready for the future by the use of these technologies.
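
As a small illustration of XML as a shared message format, the sketch below builds a configuration message on one side and parses it on the other, using Python's standard `xml.etree.ElementTree` module. The element and attribute names are invented for the example and do not represent a PRISM schema.

```python
import xml.etree.ElementTree as ET


def build_message(experiment, components):
    """Serialise an experiment description; the tag names are hypothetical."""
    root = ET.Element("configuration", {"experiment": experiment})
    for name, resolution in components.items():
        ET.SubElement(root, "component", {"name": name, "resolution": resolution})
    return ET.tostring(root, encoding="unicode")


def parse_message(text):
    """Recover the experiment name and components from the XML text."""
    root = ET.fromstring(text)
    components = {c.get("name"): c.get("resolution") for c in root.findall("component")}
    return root.get("experiment"), components


sent = {"atmosphere": "T42", "ocean": "2deg"}
text = build_message("exp001", sent)
name, received = parse_message(text)
print(text)
print(name, received)
```

The sender and receiver only need to agree on the element and attribute names, not on programming language or platform, which is the interoperability benefit discussed above.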

9.3 PRISM implemented infrastructure software

It would be possible to implement all server components in any suitable language such as Perl or C++. Currently there is no client software that can be used with browsers that does not build on Java technology. From a system maintenance point of view, using one technology, Java, is preferred, as this simplifies the task of adhering to multiple standards. The best choice is therefore to implement all infrastructure software in Java, without excluding other technologies where there is a case for using them. Java supports all the mechanisms needed for implementing web services using available standards. Other technologies are Microsoft's .NET and HTML. Today .NET technology is very new and is also proprietary in nature. The use of an HTML client severely limits the intelligence that can be built into the client and is therefore seen as less useful.

Other projects such as Globus have published similar ideas (Open Grid Services Architecture) building on the web services concept. The web services concept is expected to be the dominant paradigm over the next 5 years and, together with standardisation of technologies, increased network speed and cooperative efforts, the systems that are ready for this interaction will have an advantage.

10.0 Risks


Table of risks
Risk | Risk magnitude | Description | Impact
Security demands incompatible on some sites. | Severe | Multiple security solutions may be necessary. | If sites cannot agree on one security solution, costly separate solutions may be introduced, affecting the client experience.
Lack of infrastructure resources. | Severe | Hardware and software must be available for web services implementation. | Slow or nonexistent services.
Table 6

Other risks?

11.0 Definitions, Acronyms and Abbreviations



Table of Definitions, Acronyms and Abbreviations
Keyword  | Description
WSDL     | Web Services Description Language, an XML format for describing network services as a set of endpoints operating on messages containing either document-oriented or procedure-oriented information.
XML      | The Extensible Markup Language is the universal format for structured documents and data on the Web. It is a human-readable, machine-understandable, general syntax for describing hierarchical data, applicable to a wide range of applications. Custom tags enable the definition, transmission, validation, and interpretation of data between applications and between organisations.
SOAP     | Simple Object Access Protocol, a lightweight, XML-based protocol for the exchange of information in a decentralised, distributed environment.
Java     | A programming language invented by Sun which runs on any platform that provides a Java virtual machine and supports web services.
SSL      | Secure Sockets Layer, a secure communications protocol invented by Netscape.
SSH      | Secure Shell, a secure remote access protocol enabling remote logins; the OpenSSH implementation is developed by the OpenBSD project.
Kerberos | Network authentication system developed by MIT.
S/Key    | One-time password system.
UDDI     | Universal Description, Discovery and Integration. Enables dynamic lookup and advertising of services.
X509     | Standard for certificates used by SSL for authentication and encryption.
Table 7

12.0 REFERENCES


  1. GLOBUS - www.globus.org
  2. UNICORE - www.unicore.de
  3. SOAP - http://www.w3.org/TR/SOAP
  4. XML - http://www.w3.org/XML/1999/XML-in-10-points
  5. UDDI - http://www.uddi.org/
  6. WSDL - http://www.w3.org/TR/wsdl
  7. JAVA - http://www.sun.java
  8. SSH - http://www.openssh.com/
  9. SSL - http://www.openssl.org/
  10. Kerberos - http://web.mit.edu/kerberos/www/
  11. Skey - http://www.freesoft.org/CIE/RFC/Orig/rfc1760.txt
  12. X509 - http://www.openssl.org/docs/apps/x509.html
  13. Dr Foster, P.42 Physics Today, Feb 2002
  14. R. W. Ford and G. D. Riley, Model Coupling Requirements, Flume Report, Met Office, January 2002.
  15. R. W. Ford and G. D. Riley, Model Coupling Review, Flume Report, Met Office, January 2002.