Network Monitoring, Visualization, and Control for High-Speed, Large-Scale Networks
Abstract
This document describes the design of a highly scalable
differentiated services (DS) testbed and its associated network
monitoring, visualization, and control (NMVC) middleware. This
integrated system is designed to allow end-users, applications, and
administrators to monitor, visualize, and control the performance of
their quality of service (QoS) across multiple autonomous networks
(AN)s by integrating (1) advanced networking hardware, (2) QoS-enabled
middleware, and (3) scalable visualization algorithms to support
differentiated services across ANs.
This document also describes the first of a series of NMVC
demonstration systems, each building on the experience gained by its
predecessor. The initial demonstration system combines our real-time
CORBA Audio/Video Streaming, Trading, and Naming services, the
Distributed Object Visualization Environment (DOVE) framework, and the
Distributed Object-Oriented Reliable Service (DOORS). This demonstration
system will build our experience integrating these advanced
components, define a rapid prototyping environment from which we can
obtain valuable baseline performance and resource costs, and provide
feedback on the middleware components and hooks necessary to extend
the system capabilities statically and dynamically.
The key benefits of our NMVC middleware are to (1) ensure adequate
end-to-end QoS to applications while (2) maintaining high levels of
network and endsystem resource utilization. In addition, NMVC allows
network administrators to calibrate and fine-tune network and
application parameters in real-time according to observed traffic
patterns. These capabilities of NMVC help to ensure that network
elements, middleware network services, and applications adapt
efficiently to dynamically changing conditions.
1. Introduction
1.1. Context: Next-Generation Internet Applications
Next Generation Internet (NGI) applications require advanced networks
and middleware to support a wide mix of dynamically changing
multimedia streams. Internet-spanning activities, such as
high-bandwidth data acquisition, transparent data cache updating, and
remote collaboration, are representative sources for these streams. For
instance, BaBar at SLAC and CMS at CERN are
high-energy physics experiments that generate petabytes of testbeam
data that must be filtered and stored in large-scale databases.
Groups of geographically distributed scientists collaborate to view,
analyze, and interpret the data and to design new experiments. The
quality, and sometimes the feasibility, of these activities are
directly related to the ability of the networks and middleware to
provide these applications with adequate end-to-end quality of service
(QoS), such as guaranteed bandwidth and bounds on delay and jitter.
1.2. Problem: Specifying, Enforcing, and Managing End-to-End Quality of Service for NGI Applications
Conventional internets that span multiple autonomous networks (ANs)
cannot deliver the end-to-end QoS to NGI applications like those
described above. Moreover, applications will only be able to use the
NGI effectively if middleware tools, technologies, and communication
frameworks can be developed and deployed to program, provision,
monitor, and control applications and services that span multiple ANs.
Although researchers and commercial vendors have produced network
elements that offer both high performance and some QoS features,
capabilities for programming, provisioning, monitoring, and
controlling end-to-end QoS across multiple ANs are only beginning to
be addressed.
Supporting end-to-end QoS guarantees across multiple ANs is
challenging. It is difficult even within one AN where a single
administrative entity is responsible for delivering services.
Managing these services requires satisfying multiple stakeholders. In
particular, service providers must deliver end-to-end QoS
guarantees consistently. Likewise, end-users and network
administrators must verify that the QoS requests specified by
applications are actually delivered.
The complexity of managing applications in a multi-AN environment is
exacerbated by the following three forces:
- The exponential growth in demand from users for ever richer
network services.
- The desire of providers to fully utilize network resources.
- The complexity of the interactions between advanced networking
components, which themselves are becoming more complex.
These forces yield inherently unstable distributed applications and
systems, where defects are often triggered as service providers
attempt to maximize their resource utilization. Troubleshooting in
this context is equivalent to debugging a large, complex distributed
program with limited meters and controls, and with limited knowledge
and control of advanced networking and middleware technology.
Providing these services across multiple ANs adds another dimension of
difficulty that involves cooperation between different administrative
domains. This cooperation includes such things as (1) the negotiation
of service level agreements (SLAs), (2) resource allocation for QoS
flows, (3) traffic scheduling consistent with QoS guarantees, and (4)
traffic shaping and policing to enforce SLAs. Furthermore, this
cooperation must be conducted between networks that may utilize
different hardware, different operating systems, and different
resource management policies.
1.3. Solution: Integrated Network Monitoring, Visualization, and Control (NMVC) Middleware
To alleviate the problems described above, we are developing an
integrated network monitoring, visualization, and control (NMVC)
middleware that can provide the uniform resource, service, and control
access interfaces necessary to support NGI applications and network
services. Our NMVC middleware supports the following features:
- Ubiquitous, standardized Object Request
Brokers (ORB)s that help increase the productivity in delivering
and managing demanding services and complex network components.
Using an ORB approach will give uniform access to ANs which themselves
may be composed of heterogeneous components and to network resources
that span multiple heterogeneous ANs.
- Higher-level management services that build upon the ORB
middleware to adapt to rapidly evolving and demanding network services
and problems. These management services support a rich set of
monitoring, visualization, and control features to meet end-user
demands. In addition, they can be (re)configured flexibly and
efficiently to respond to the rapidly evolving complexity of advanced
network components and interactions.
Only integrated middleware, such as NMVC, can be (1) general enough to
support stringent NGI application bandwidth, delay, and jitter
requirements, (2) adaptable enough to react to changing user demands
and technological evolution, and (3) stable enough for the development
of significant amounts of advanced network and middleware
services.
2. Introduction to Our Integrated Demonstrations
2.1. General Goals
Our integrated demonstrations are designed to illustrate how to
assemble and deploy a network management system that can monitor,
visualize, and control applications that possess stringent QoS
requirements for performance, predictability, and reliability over
autonomous networks (AN)s using standards-based middleware, such as CORBA. These
demonstrations will involve (1) integrating and enhancing hardware and
software components we have developed, (2) conducting systematic
traffic measurements, and (3) measuring the overhead of the network
management system and benchmarking tools themselves.
When complete, our integrated demos will illustrate a range of
scenarios that reflect the multi-dimensional aspects, e.g.,
service provider responsibilities and end-user expectations, of QoS
specification, enforcement, and management. The following are some of
the general network management research issues that will be
addressed by experiments conducted with our NMVC demonstration
system:
- Efficiency and functionality of network probes (hardware and
software)
- Utility of passive monitoring in troubleshooting network
problems
- Effectiveness of ORB-based middleware communication in real-time
network management
By addressing these questions, we will also address many other
interesting questions that will be useful in designing future
demonstration systems. Moreover, the experience gained from
developing the demonstration systems will require us to address many
system design and optimization issues.
2.2. Initial Demonstration Goals
The first integrated demonstration will focus on the following two
complementary scenarios:
- Operations Support Systems (OSS) monitoring, visualization,
and control -- In this scenario, we will monitor and visualize
system performance in support of service provider OSS groups who
manage various network and endsystem components. For this scenario,
our NMVC tools will help them run their business better,
e.g., by proactively ensuring that their networks are
providing the necessary QoS to end-user customers.
- End-user application monitoring, visualization, and
control -- In this scenario, we will provide monitoring and
visualization information to end-users. For this scenario, our NMVC
tools will help end-users and applications select appropriate services
and fine-tune their QoS specifications.
The primary focus of our initial demonstration will be on passive
measurement of network traffic and not on full feedback control
between networks, application, and end-users (which will be the
subject of subsequent demonstrations). The next section presents some
network management scenarios that will elaborate on the passive
measurement process. From a network management research perspective,
we will focus on conducting basic measurements that provide data on
the resource consumption of our measurement infrastructure and on the
operation of a commercial ATM switch. In addition to user traffic,
therefore, we
will also monitor network control traffic (e.g., signaling
and resource management cells) so that we can verify the operation of
vendor control software.
The following are some of the specific network management research
questions that will be addressed by experiments conducted with our
first NMVC demonstration system:
- What is the measurement capacity of network probes and
what are the factors that contribute to those limits?
- What useful measures can and cannot be obtained through passive
monitoring and what factors limit our ability to gather data?
- Can we map FORE's signaling protocol for the demonstration
applications?
- What are the costs of TAO's ORB-based communication
infrastructure, in terms of run-time overhead, priority
inversion and non-determinism, and fault tolerance features?
3. Scenarios for Initial Demonstration
Our first demo will illustrate the capabilities of our NMVC middleware
in supporting TAO's Audio/Video Streaming under various failure
scenarios within a single AN composed of an ATM network core. This
demonstration will showcase how the features of the Trading and Naming
services, the Distributed Object Visualization Environment (DOVE)
framework, and the Distributed Object-Oriented Reliable Service
(DOORS) can be applied to the network management scenarios described
below.
The demonstration system will allow: 1) end-users to
select MPEG movies from various multimedia servers; 2) network
administrators to visualize the system in order to troubleshoot
performance bottlenecks and network/application faults; and 3) system
developers to emulate simple failure modes. Fault
tolerance of key system components also will be supported.
Below, we describe two scenarios that guide our initial demonstration.
One scenario focuses on network monitoring, visualization, and control
from an end-user application perspective, whereas the other scenario
focuses on a more traditional OSS perspective.
3.1. End-user Application Scenario: Control and Management of A/V Streams
- Problem
User U would like to run an audio/video (A/V) application in
which U's host will receive video and audio from a remote video server.
Because the user is a general video consumer and not a networking
expert, he would like to set up this service without dealing with
the technical details.
The user has a menu of service levels that he can select on a trial
basis. From experience, he has selected a QoS service level of
Medium-Quality.
How will the system deliver this service to U?
- Solution
The following transparent activities must occur in order to deliver
the video/audio to user U:
- Translate the user-level QoS description to a network-level
QoS specification.
Two channels that can deliver at least 1.5 Mbps are
needed. Delay and jitter guarantees are not requested
because the semantics of the user-level QoS description
is that large buffers at the receiver will be used to
make the stream elastic, and late frames may be dropped
occasionally without harm.
- Determine if the QoS specification can be supported.
The application program and server reserve sufficient
buffers, and network admission control
determines whether there are sufficient resources
(bandwidth along the QoS path) to support this new
connection and the application.
- Establish the QoS route from the server to the endsystem.
Network signaling reserves sufficient resources
for the QoS guarantees. These reservation activities
include updating routing tables (ATM VC Tables), QoS
tables at the NOC (Network Operations Center), and updating
or installing traffic enforcers along the QoS route.
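The first two steps above can be sketched as follows. This is only a minimal illustration, not the NMVC implementation: the `NetworkQoS` type, the `SERVICE_LEVELS` table, and the `translate` and `admit` functions are hypothetical names, and the admission test assumes that both channels follow the same path and that each link reports its free bandwidth.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class NetworkQoS:
    channels: int                          # audio and video connections
    min_mbps_per_channel: float            # bandwidth floor per connection
    max_delay_ms: Optional[float] = None   # None: no delay bound requested
    max_jitter_ms: Optional[float] = None  # None: no jitter bound requested


# Hypothetical menu; only the Medium-Quality values come from the text.
SERVICE_LEVELS = {
    "Medium-Quality": NetworkQoS(channels=2, min_mbps_per_channel=1.5),
}


def translate(level: str) -> NetworkQoS:
    """Map a user-level QoS description to a network-level specification."""
    return SERVICE_LEVELS[level]


def admit(spec: NetworkQoS, free_mbps_per_link: List[float]) -> bool:
    """Admit only if every link on the QoS path can carry all channels."""
    needed = spec.channels * spec.min_mbps_per_channel
    return all(free >= needed for free in free_mbps_per_link)


spec = translate("Medium-Quality")
accepted = admit(spec, [10.0, 3.0, 155.0])   # every link has >= 3.0 Mbps free
rejected = admit(spec, [10.0, 2.9])          # second link is too small
```

Route establishment (the third step) is left out because it depends on the signaling protocol of the switch.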
- Discussion
The end result of User U selecting the A/V service is the establishment
of a contract between the user and the service provider.
The service provider will ensure that the service (at the selected QoS
level) is provided to the user at the agreed-upon price.
The user has an expectation of the service quality.
Problems arise when the components involved in delivering the service
do not deliver their part of the guarantee or do not cooperate properly.
As noted earlier, the delivery mechanisms are still immature, and the
complexity of their interactions leads to unforeseen QoS failures.
This situation is why network management is described as debugging
a huge, complex distributed program. Network monitoring forms the
basis for troubleshooting in this environment. Both proactive and reactive
network monitoring are needed to quickly identify problems and their
root causes and to increase management staff productivity. But because
of the complexity of the problem, a divide-and-conquer strategy
is often employed to troubleshoot network service problems.
3.2. OSS Monitoring Scenario: Control and Management of A/V Streams within a Single AN
- Problem
User U has recently started to complain about poor video quality and
is blaming it on the network.
- Solution
We take a divide-and-conquer approach to troubleshooting the problem.
First, we determine if the problem is truly a network problem.
Second, if the problem is with the network, we determine the scope of
the problem, i.e., how many potential users will be affected.
For example, congestion will affect all traffic crossing a congestion
point. Third, if the problem is isolated to only user U, we try to
pinpoint the source of the problem along the QoS route.
Some of the information that we need to assess network health in
general includes:
- Input/output link status (up/down?)
- Input/output switch port status (up/down?)
- Output queue size and change (stable/growing?)
- VC table entries (correct connection?)
- Switching fabric packet loss
- Output port packet loss due to buffer overflow
- Long term throughput
- Short term throughput
Some of this information may be available from switch registers.
Other information, such as throughput, must be derived by sampling
switch registers (e.g., VC connection, packet loss) or by monitoring
the traffic itself (e.g., link status, throughput).
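As a rough sketch of how throughput might be derived from sampled counters, the functions below turn periodic readings of a cumulative cell counter into short-term and long-term rates. The function names and sample format are assumptions made for illustration. The sketch counts the full 53-byte ATM cell (header included) and ignores counter wraparound.

```python
ATM_CELL_BYTES = 53  # 5-byte header plus 48-byte payload


def short_term_mbps(samples):
    """samples: ordered (timestamp_seconds, cumulative_cell_count) pairs.

    Returns one rate per sampling interval, in Mbps.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        bits = (c1 - c0) * ATM_CELL_BYTES * 8
        rates.append(bits / (t1 - t0) / 1e6)
    return rates


def long_term_mbps(samples):
    """Average rate over the whole sampled window, in Mbps."""
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    return (c1 - c0) * ATM_CELL_BYTES * 8 / (t1 - t0) / 1e6


# Three 15-second intervals; the stream stalls during the last one.
samples = [(0, 0), (15, 50000), (30, 100000), (45, 100000)]
rates = short_term_mbps(samples)
average = long_term_mbps(samples)
```

In this example the long-term average hides the stall that the last short-term sample exposes, which is why both measures appear in the list above.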
- Discussion
You might think that there is indeed a network problem since you would
expect that the user would have a feel for the effective bandwidth of
the connection. But this may not be the case, and only periodic
monitoring would give you hard numbers. But typically, permanently
installed probes collect aggregate (not per flow) statistics.
Furthermore, we cannot overemphasize that the complexity of the
advanced hardware and software (network and endsystems) and their
interactions can lead to service delivery problems.
Embedded probes record primitive data (e.g., counts) on flow
aggregates. Permanent network probes log additional statistics for
flow aggregates on an ongoing basis to provide a historical
perspective on network usage. The data will be dispersed by agents
residing on the probes and switch controllers using TAO to data
consumers such as the NOC and the traffic history archiver. At the
NOC, a DOVE agent collects incoming data and displays it as strip
charts showing short term (e.g., 15 minute) bandwidth usage
as a whole and usage for the top 5 categories (e.g., special
VCIs or protocols). Note that if details at the IP layer or higher
are to be shown, there must be probes that extract this information.
Embedded probes collect very primitive data (e.g., cell
counts).
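The strip-chart data described above amounts to a windowed aggregation. The following sketch, with a hypothetical `top_categories` function and record layout, totals usage in one 15-minute window and ranks the top categories. It says nothing about how DOVE itself stores or draws the data.

```python
from collections import Counter


def top_categories(records, window_start, window_seconds=900, top_n=5):
    """records: iterable of (timestamp_seconds, category, byte_count).

    Returns (total_bytes, [(category, bytes), ...]) for one window,
    with categories ordered from heaviest to lightest.
    """
    totals = Counter()
    for ts, category, nbytes in records:
        if window_start <= ts < window_start + window_seconds:
            totals[category] += nbytes
    return sum(totals.values()), totals.most_common(top_n)


records = [
    (10, "vci-1", 100),
    (20, "vci-2", 300),
    (30, "vci-1", 50),
    (950, "vci-1", 999),   # falls outside the first 15-minute window
]
total, ranked = top_categories(records, window_start=0)
```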
If the video problem is confined to user U, network probes will need
to be dynamically configured to collect statistics on the video and
audio connections (e.g., short term throughput). Ideally,
these probes will be installed in a stepwise manner along the QoS path
starting at the stream head inside the application and terminating at
the stream tail in the receiver application. This will require the
installation of filters in the probes that will capture cells from the
application. Several DOVE displays might be shown. One display might
show the short term bandwidth usage at 15 second intervals along with
the expected bandwidth usage (1.5 Mbps for each connection). This is
a traditional display applied to a specific connection. One display
might show the resources supporting the connection along with the
original resource allocation. Any disagreement indicates a
reconfiguration error.
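One way to read the stepwise probing described above is as a search for the first point on the QoS path where measured throughput falls below the expected rate. The sketch below assumes a hypothetical list of per-probe measurements ordered from stream head to stream tail, and an arbitrary 10% tolerance that is not taken from the text.

```python
def first_degraded_hop(measurements, expected_mbps=1.5, tolerance=0.1):
    """measurements: ordered (probe_name, measured_mbps) pairs, head to tail.

    Returns the name of the first probe whose throughput is below
    expected_mbps * (1 - tolerance), or None if every probe is healthy.
    """
    floor = expected_mbps * (1 - tolerance)
    for name, mbps in measurements:
        if mbps < floor:
            return name
    return None


path = [
    ("server-app", 1.50),
    ("switch-input", 1.49),
    ("switch-output", 0.90),   # throughput drops here
    ("client-app", 0.90),
]
suspect = first_degraded_hop(path)
```

The problem source then lies between the last healthy probe and the returned one.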
4. Structure of the Integrated Demonstration
Figure 1 illustrates the various components in our initial integrated
demonstration:

Figure 1: Integrated Demo Architecture
The demo contains a number of components that are described below.
4.1. High-speed Networking Components
From a hardware perspective, the demonstration system will consist of
a single FORE switch with associated control processor (CP), client
stations, servers, and APIC-based network probes, as shown in Figure
2.
Figure 2: Switch Configuration
Each network probe is built from an ATM Port Interconnect Controller
(APIC) Chip and a Smart PortCard (SPC) CPU-memory module. This probe
forms the fundamental traffic monitoring component of our NMVC
project. The APIC has two full-duplex ATM ports that can be easily
inserted in a link and can thus efficiently snoop traffic as the ATM
cells move from the input port to the output port.
Future network configurations include a larger number of FORE switches
and a network of our own gigabit IP routers (GIPR). The GIPR will
consist of a Washington
University Gigabit Switch (WUGS) with IP Processing Elements
(IPPE) at each input port. The IPPEs will use the APIC-SPC nodes
(described earlier as the network probe) to support IP routing,
monitoring, and specialized network control functions.
The SPC will house several software components. It will run a
NetBSD-based operating system that has been specially modified for use
in our Active Networking Node project. The node is active in the
sense that its functionality can be dynamically altered by code
modules loaded over the network itself. This OS can perform
high-speed IP packet classification in the kernel and can be easily
modified to perform high-speed event filtering of IP packets.
Furthermore, event filters can be dynamically installed by using the
dynamic module installation mechanisms of the ANN.
An embedded version of TAO will provide a minimum footprint ORB that
supports uniform access to network controls, communication services
and other useful services such as event channels. Events are filtered
from the incoming packet stream and pushed to or pulled from the
agent, as shown in Figure 3.
Figure 3: Agent Architecture
This agent will collect the events and send them to the NOC using the
TAO event channel services. In the future, various forms of event
channeling and event fusion can be incorporated into this
structure.
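The filter-then-forward structure of the agent can be sketched as below. `EventChannel` and `ProbeAgent` are toy stand-ins written for this illustration. They do not reproduce the TAO event channel interfaces or the ANN module-loading mechanism, and only the push model is shown.

```python
class EventChannel:
    """Toy event channel: fans each pushed event out to all consumers."""

    def __init__(self):
        self._consumers = []

    def connect_consumer(self, callback):
        self._consumers.append(callback)

    def push(self, event):
        for callback in self._consumers:
            callback(event)


class ProbeAgent:
    """Applies installed filters to each packet and forwards matches."""

    def __init__(self, channel):
        self._channel = channel
        self._filters = {}  # filter name -> predicate(packet) -> bool

    def install_filter(self, name, predicate):
        self._filters[name] = predicate

    def remove_filter(self, name):
        self._filters.pop(name, None)

    def on_packet(self, packet):
        for name, predicate in self._filters.items():
            if predicate(packet):
                self._channel.push({"filter": name, "packet": packet})


noc_log = []                                # stands in for the NOC consumer
channel = EventChannel()
channel.connect_consumer(noc_log.append)
agent = ProbeAgent(channel)
agent.install_filter("video-vc", lambda p: p["vci"] == 42)
for pkt in ({"vci": 42, "len": 48}, {"vci": 7, "len": 48}):
    agent.on_packet(pkt)
```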
4.2. A/V Streaming Components
Traditional distributed object computing middleware, such as CORBA, is
well-suited for objects that communicate using request/response
interactions between clients and servers. DOC middleware
traditionally has not been as well suited, however, for multimedia
streaming applications, where continuous streams of data are exchanged
between suppliers and consumers. The CORBA Audio/Video Streaming
Service alleviates much of this problem via a standard
architecture that supports multimedia applications efficiently using
CORBA. In addition, the A/V Streaming service helps implement
interoperable streams developed using different ORB
implementations.
4.3. Overview of CORBA A/V Streaming
Figure 4 illustrates the key components in the CORBA Audio/Video
Streaming service architecture.

Figure 4: CORBA A/V Streaming Components
A stream is a continuous media transfer between two or more virtual
Multimedia Devices (MMDevice)s. An MMDevice abstracts a multimedia
device, which could be logical (such as a file) or physical (such as a
video player). The Stream Control (StreamCtrl) object is used to bind
two multimedia devices to establish a stream. MMDevices create an
endpoint of communication (StreamEndpoint) and Virtual Device (VDev)
for each connection. The StreamEndpoint captures the network aspects
of the stream, such as the TCP/IP communication endpoint. The VDev
represents the properties of the device for the current stream, such
as the format of the stream data, e.g., MPEG-1 or MJPEG.
Additional information on the CORBA A/V Streaming service architecture
is available in the CORBA Audio/Video Streaming
specification and in a paper that appeared at HICSS '99.
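The relationships among these objects can be modeled in a few lines. The classes below borrow the names used in the text, but the method names (`create_connection`, `bind`) and their parameters are simplifications invented for this sketch and are not the IDL operations of the specification.

```python
from dataclasses import dataclass


@dataclass
class StreamEndpoint:
    address: str        # network aspect of the stream, e.g. "host:port"


@dataclass
class VDev:
    media_format: str   # device aspect for this stream, e.g. "MPEG-1"


class MMDevice:
    def __init__(self, name, host, formats):
        self.name, self.host, self.formats = name, host, set(formats)

    def create_connection(self, media_format, port):
        """Create the (StreamEndpoint, VDev) pair for one connection."""
        if media_format not in self.formats:
            raise ValueError(f"{self.name} does not support {media_format}")
        return StreamEndpoint(f"{self.host}:{port}"), VDev(media_format)


class StreamCtrl:
    def bind(self, a_party, b_party, media_format, port=5000):
        """Bind two devices into a stream; returns both endpoint pairs."""
        a = a_party.create_connection(media_format, port)
        b = b_party.create_connection(media_format, port)
        return a, b


server = MMDevice("video-server", "server.example", ["MPEG-1", "MJPEG"])
player = MMDevice("video-player", "client.example", ["MPEG-1"])
(a_ep, a_vdev), (b_ep, b_vdev) = StreamCtrl().bind(server, player, "MPEG-1")
```

Binding fails with a `ValueError` when either device lacks the requested format, which mirrors the role of the VDev in describing what a device can handle for a given stream.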
4.4. Server-side A/V Streaming Components
The components shown in Figure 5 are used on the server and described
below.

Figure 5: Server Components
- The MMDevice_Exporter is the component that exports the properties
of the Audio/Video Server, such as the number of connections, the
server name, and the maximum number of connections, to the Trading
Service. It also contains the Audio and Video MMDevice references.
- The Audio_MMDevice and Video_MMDevice are endpoint factories for
creating new audio and video stream connections, respectively, with
different strategies. Each endpoint consists of a pair of objects:
(1) a Virtual Device (VDev), which encapsulates the device-specific
parameters of the connection, and (2) the Stream Endpoint
(StreamEndpoint), which encapsulates the transport-specific
parameters of the connection.
- The TAO_Property_Exporter presents a uniform interface for
exporting static and dynamic properties. There are actually three
types of properties exportable by Property_Exporter: static, where
the value of the property is stored twice, once in the offer and once
in the MMDevice's CosProperty::PropertySet; semi-dynamic, where the
value of the property is stored in the CosProperty::PropertySet, and
a dynamic property in the service offer retrieves the value from the
property set (this is useful for MMDevice dynamic properties); and
dynamic, where the value of the property is not stored anywhere but
is lazily evaluated by a dynamic property in the offer (this is
useful for non-MMDevice properties).
- TAO_Video_Repository is a TAO_Dynamic_Property, in addition to
being a TAO_Exportable. In define_properties, it exports itself as a
dynamic property with the TAO_Property_Exporter it is passed. When
called back, it parses a manifest of available movies, obtains
information about them by parsing the headers of each of the media
files, and returns this information as a sequence of structs, each
containing the attributes of an individual movie. The IDL code for
this TAO_VR::Movie structure resides in VideoRepository.idl.
- TAO_Machine_Properties is also a dynamic property. For each of the
ten or so performance properties, such as CPU usage and disk usage,
it exports a dynamic property with itself as the evaluation interface
reference. When its evalDP method is called, it obtains the value for
the statistic whose name is in the prop_name parameter. The
statistics are gathered by Sun RPC from the rstatd daemon and cached.
The cache expires periodically and is then refreshed by another RPC
call, obviating the need for an RPC call on every call to evalDP.
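The caching behavior described for TAO_Machine_Properties is essentially a time-to-live cache in front of a remote call. The sketch below shows that pattern only. `CachedMachineProperties`, `eval_dp`, the 5-second lifetime, and the `fetch_stats` callable standing in for the rstatd query are all assumptions, not the TAO code.

```python
import time


class CachedMachineProperties:
    def __init__(self, fetch_stats, ttl_seconds=5.0, clock=time.monotonic):
        self._fetch_stats = fetch_stats   # stands in for the remote query
        self._ttl = ttl_seconds           # assumed lifetime of the cache
        self._clock = clock
        self._cache = None
        self._fetched_at = None

    def eval_dp(self, prop_name):
        """Return one statistic, refreshing the cache only when stale."""
        now = self._clock()
        if self._cache is None or now - self._fetched_at >= self._ttl:
            self._cache = self._fetch_stats()
            self._fetched_at = now
        return self._cache[prop_name]


calls = []


def fake_fetch():
    calls.append(1)
    return {"cpu_usage": 0.25 * len(calls), "disk_usage": 0.5}


fake_now = [0.0]   # controllable clock so the example is deterministic
props = CachedMachineProperties(fake_fetch, clock=lambda: fake_now[0])
first = props.eval_dp("cpu_usage")     # triggers one fetch
second = props.eval_dp("disk_usage")   # served from the cache
fake_now[0] = 6.0
third = props.eval_dp("cpu_usage")     # cache expired, fetches again
```

Three evaluations therefore cost two fetches, and all statistics read within one cache lifetime come from the same snapshot.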
4.5. Client-side A/V Streaming Components
The components shown in Figure 6 are used on the client and described
below.

Figure 6: Client Components
- The Java Interface Agent portion of the demo is actually a Java VM
embedded in a TAO Trader Client. The main program bootstraps
the VM, initializes the ORB and bootstraps to the Trading Service,
performs the initial query of the Trading Service, and invokes the
main method on the Java Server_Discovery class. Embedding the VM in
this way is possible because of the JNI Invocation interface. On the
C++ side, the Trader_Client class queries the Trading
Service and caches the results in a two-tier hashtable: the first tier
maps the name of the server to which the offer belongs to a second-tier
hashtable, which maps each property name in the offer to its
value.
- The MPEG player obtains the reference to the supplier from the
Trader Client and receives the A/V packets that the supplier sends.
It is responsible for decoding the streams and playing them in a
viewer.
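The two-tier cache kept by the Trader_Client can be pictured as a dictionary of dictionaries. The sketch below is an illustration in Python rather than the C++ class, and the property names and offer layout are made up for the example.

```python
def build_offer_cache(offers):
    """offers: iterable of (server_name, {property_name: value}) pairs.

    First tier: server name -> second-tier table.
    Second tier: property name -> property value.
    """
    cache = {}
    for server_name, properties in offers:
        cache.setdefault(server_name, {}).update(properties)
    return cache


offers = [
    ("server-a", {"Max_Connections": 10, "Video_Format": "MPEG-1"}),
    ("server-b", {"Max_Connections": 4, "Video_Format": "MPEG-1"}),
]
cache = build_offer_cache(offers)
limit = cache["server-b"]["Max_Connections"]   # two lookups, no new query
```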
4.6. DOVE Components
The Distributed
Object Visualization Environment (DOVE) is designed to monitor,
visualize, and control application components distributed throughout a
network. DOVE supports several scenarios, including (1) allowing
service providers to manage the QoS provided by network components and
(2) allowing end-users and applications the capability to apply
dynamic feedback about their realized QoS to help control applications
and system resources more effectively.
Figure 7 shows the key components in DOVE.

Figure 7: DOVE Components
Each component is described below:
- The DOVE Agent portion of the demo is implemented
using TAO's CORBA Event Service. As described below under
Dynamic Scenarios in the Integrated Demo, each server provides dynamic
callbacks used to obtain various performance characteristics from
the server's Management Information Base (MIB). Clients, servers,
or remote monitoring applications can obtain these dynamic callbacks
from the Trading Service and register them with the DOVE agent as
performance event suppliers.
- The DOVE-enabled Browser and DOVE Applet
are small graphics programs that run, respectively, as a
separate executable or as an applet in a general-purpose browser.
These graphics programs can be run either from an independent
monitoring location or on client applications in order to visualize
the performance of various parts of the entire system. To provide
a proactive response to degradations in QoS, service providers can
use a DOVE-enabled Browser to independently monitor the behavior
of network components. In addition, service providers can export this
information to end-users, allowing them to validate their
promised levels of service. End-users can then use DOVE-enabled
Browsers to monitor the end-to-end QoS seen by their applications.
- The applet and browser graphics programs register various
DOVE Visualization Components with the DOVE agent.
Each visualization component is registered as a consumer of specific
visualization events, so that it delivers an individually tailored
view of the overall visualization environment to its browser or applet.
4.7. DOORS Components
Distributed Object-Oriented Reliable Service (DOORS)
provides high availability to CORBA applications through replication
of CORBA servers. It maintains this availability even in the presence
of failures through failure detection and restart of individual
replicas in a server group. DOORS provides the following
functionality:
- Replication of servers to provide higher availability in the presence
of failures
- Failure detection of the replicas
- Restarting failed replicas
Addition of DOORS to the integrated demo of A/V Streaming and DOVE can
provide the following benefits:
- Fault tolerance for the DOVE component through the replication and
failure detection of the DOVE agent
- Fault tolerance for basic services, such as the Trading service,
through replication and failure detection of the Trading service server
- Fault tolerance for the multimedia server through replication and
failure detection
5. Dynamic Scenarios in the Integrated Demo
The components in the integrated demo that we described above will
interact as described in the following scenarios.
5.1. A/V Server Properties Export Scenario
Each A/V server will export an offer containing server properties and a
reference to its MMDevice_Exporter to a Trading Service
instance. The server will advertise three types of properties: device
configuration -- audio/video formats; server performance -- load, disk
usage, bandwidth available; and movie selection -- the contents of the
movie directory. Properties that change frequently, such as server
performance and movie availability, will be registered with the
Trading Service as dynamic properties. The server will implement
dynamic property callback interfaces to handle their evaluation.

Figure 8: Java Interface Agent
5.2. A/V Server Discovery Scenario
The client will discover and bind to a server whose properties best
match its device configuration, that has a movie the user finds worth
viewing, and that offers the best possible performance. A GUI will
prompt the user for the client's configuration, and allow the user to
choose from movies available on the servers that match the client's
configuration. After the user selects a movie, the GUI will allow the
user to select the server that has the best performance from among
those that offer the movie. Hence, the best-matched server
selection occurs in the following three phases:
- The client first does a query asking for the movie listings
of all servers that deliver video and audio in a format
understandable by the client, and can accept more
connections. The results are then presented to the user as in Figure 8.
- The client's user would then select an interesting movie
from a list compiled from the results of the first query.
In a second query, the client requests the performance
characteristics of those servers that both match the client's
configuration and are showing the selected movie. This would
happen when the user presses the "Performance Graph" field on the
movie properties in Figure 8.
- The trader will store the performance characteristics ---
load, CPU usage, disk access, network traffic, context
switches, etc. --- as callback interfaces to the server
(dynamic properties). The client can attach these interfaces
to DOVE as suppliers to be periodically polled, so the client can
visually display charts of each server's performance, updated
periodically by callbacks to those interfaces. Then, based on
the charts, the user selects the server with the most suitable
performance, and the client will open an audio/video
connection to that server using the MMDevice
object reference it retrieved from the trader.
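The three phases above can be sketched as successive filters over the offers held by the trader. The offer layout, the function names, and the use of a zero-argument callable to model a dynamic property callback are assumptions of this sketch. In the demo the final choice is made by the user from the charts, so the `least_loaded` helper only stands in for that decision.

```python
def movie_listings(offers, client_formats):
    """Phase 1: listings from servers the client can actually use."""
    return {
        o["name"]: o["movies"]
        for o in offers
        if o["format"] in client_formats
        and o["connections"] < o["max_connections"]
    }


def servers_showing(offers, client_formats, movie):
    """Phase 2: compatible servers that offer the selected movie."""
    usable = movie_listings(offers, client_formats)
    return [o for o in offers if o["name"] in usable and movie in o["movies"]]


def least_loaded(candidates):
    """Phase 3 stand-in: evaluate each load callback, pick the lowest."""
    return min(candidates, key=lambda o: o["load"]())["name"]


offers = [
    {"name": "a", "format": "MPEG-1", "connections": 2, "max_connections": 5,
     "movies": ["movie-1", "movie-2"], "load": lambda: 0.8},
    {"name": "b", "format": "MPEG-1", "connections": 1, "max_connections": 5,
     "movies": ["movie-1"], "load": lambda: 0.3},
    {"name": "c", "format": "MJPEG", "connections": 0, "max_connections": 5,
     "movies": ["movie-1"], "load": lambda: 0.1},   # wrong format
    {"name": "d", "format": "MPEG-1", "connections": 5, "max_connections": 5,
     "movies": ["movie-1"], "load": lambda: 0.05},  # no free connections
]
listings = movie_listings(offers, {"MPEG-1"})
candidates = servers_showing(offers, {"MPEG-1"}, "movie-1")
chosen = least_loaded(candidates)
```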
5.3. Movie Selection Scenario
The A/V server exports a sequence of movie information
structures to the trading service, each containing a movie
name, associated file name, frame rate, duration, and frame
size, for example. Also contained in this structure will be a
URL pointer to a web page on the movie. The Java Interface Agent,
written in Java and communicating with the trading service
using JavaIDL, compiles the list of movie names for each
server into a list GUI. When the user selects an item in the
list, the Java client generates a table containing all the technical
information about the movie --- duration, frame rate, frame
size --- as advertised by each compliant server that offers
the movie, and displays the table in the main frame.
Once the user has decided on a server and movie, the Java
Interface Agent passes the server IOR and the movie name to
the A/V client controller process.
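As a rough illustration of the data involved, the movie information structure and the per-movie table can be modeled as follows. This is a Python sketch rather than the demo's IDL or Java code: the MovieInfo field names, the placeholder server identifiers (stand-ins for real IORs), and the sample values are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class MovieInfo:
    """Hypothetical mirror of the structure a server exports to the trader."""
    name: str
    file_name: str
    frame_rate: float            # frames per second
    duration_s: int              # seconds
    frame_size: Tuple[int, int]  # width, height in pixels
    url: str                     # web page about the movie


def movie_names(exports: Dict[str, List[MovieInfo]]) -> List[str]:
    """Names shown in the list GUI, merged across all servers."""
    return sorted({m.name for movies in exports.values() for m in movies})


def technical_table(exports: Dict[str, List[MovieInfo]],
                    name: str) -> List[Tuple[str, int, float, Tuple[int, int]]]:
    """One row per server offering the movie: server, duration, rate, size."""
    rows = []
    for server in sorted(exports):
        for m in exports[server]:
            if m.name == name:
                rows.append((server, m.duration_s, m.frame_rate, m.frame_size))
    return rows


# Placeholder server identifiers and made-up movie data.
exports = {
    "server-a": [
        MovieInfo("Metropolis", "metro.mpg", 24.0, 9180, (352, 240),
                  "http://example.org/metropolis"),
    ],
    "server-b": [
        MovieInfo("Metropolis", "metropolis_lo.mpg", 15.0, 9180, (176, 120),
                  "http://example.org/metropolis"),
        MovieInfo("Nosferatu", "nosferatu.mpg", 24.0, 5640, (352, 240),
                  "http://example.org/nosferatu"),
    ],
}
names = movie_names(exports)
table = technical_table(exports, "Metropolis")
```

The selected row's server identifier and the movie name are what would be handed to the A/V client controller process.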
5.4. DOORS Fault Tolerance Scenario
In this demo, the fault tolerance of the DOVE agent is illustrated
by having the DOORS components interact with the various TAO services
and DOVE agent shown in Figure 9.
Figure 9: DOORS Components with DOVE agents
These interactions are described below:
-
The ReplicaManager is a CORBA server responsible for creating
and maintaining the replicas of the DOVE agents. The DOVE agent registers
with the ReplicaManager using the register() method call of the
ReplicaManager. As parameters to the register() call it specifies
the degree of replication (2 for this case) and the replication style (Warm).
In response to the call, the ReplicaManager interacts with the WatchDog
object to create two replicas of the DOVE agent.
-
The WatchDog component runs on each host in the network and
is responsible for starting up and monitoring replicas on its local host.
The ReplicaManager uses the WatchDog component to create and monitor
replicas of the DOVE agent.
-
The SuperWatchDog component is responsible for failure detection
at the host level. It does not directly interact with the DOVE agent.
DOORS does not provide automatic state transfer between replicas. The
application that is to be made fault tolerant will need to be modified
to support state transfer between replicas. This will prevent loss of
state when a failure occurs. The DOVE agent will be modified to
provide support for state transfer. Since the DOVE agent carries a
large state, it will be implemented as a Hot Backup, i.e.,
all state-affecting requests will be forwarded to the backup, which
will process each request to update its state but will suppress any
replies. The integrated demo will show how, when the primary DOVE
agent fails, the backup DOVE agent takes over without any loss of
state.
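The Hot Backup behavior described above can be sketched as follows. This is a minimal Python model, not the DOORS or DOVE implementation: the class names, the handle and submit methods, and the string references are hypothetical. Every state-affecting request is applied to both replicas, but only the primary's reply is returned to the caller.

```python
class DoveAgent:
    """Hypothetical agent whose state is its registered suppliers/consumers."""

    def __init__(self, host):
        self.host = host
        self.suppliers = []
        self.consumers = []

    def handle(self, kind, ref):
        """Apply a state-affecting request and produce a reply."""
        if kind == "supplier":
            self.suppliers.append(ref)
        elif kind == "consumer":
            self.consumers.append(ref)
        else:
            raise ValueError(f"unknown registration kind: {kind}")
        return f"{self.host} registered {kind} {ref}"


class HotBackupPair:
    """Forwards each request to both replicas; the backup's reply is dropped."""

    def __init__(self, primary, backup):
        self.primary = primary
        self.backup = backup

    def submit(self, kind, ref):
        reply = self.primary.handle(kind, ref)
        self.backup.handle(kind, ref)  # backup updates state, reply suppressed
        return reply

    def fail_over(self):
        """Promote the backup; it already holds the full state."""
        self.primary, self.backup = self.backup, None
        return self.primary


pair = HotBackupPair(DoveAgent("M1"), DoveAgent("M2"))
reply = pair.submit("supplier", "av-server-1")
pair.submit("consumer", "visualizer-1")
survivor = pair.fail_over()
```

Because the backup processed the same requests, the promoted replica holds the same registrations as the failed primary and needs no state transfer at failover time.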
5.5. Operation and Failure Scenario
-
The DOVE agent is registered with the ReplicaManager as described above.
-
Two copies of the DOVE agent are running in the network: the copy on
host M1 is the primary and the copy on host M2 is the backup.
-
All monitoring applications interested in the multimedia servers
will get the dynamic callbacks of the multimedia servers from the Trading
service and register them with the primary DOVE agent as suppliers of events.
Since the backup DOVE agent is implemented as a Hot Backup, all of these
suppliers will automatically be registered with the backup DOVE agent.
-
Similarly, registrations of DOVE visualization components as consumers
of events will be propagated to both the primary and the backup.
- Now, consider the scenario where the DOVE agent on host M1
crashes. The DOORS WatchDog component on M1 will detect the failure
and report the failure to the ReplicaManager. The ReplicaManager asks
the DOVE agent on M2 to take over as the primary. It also starts up
another copy of the DOVE Agent on one of the machines as the new
backup.
-
The new primary holds the references to the multimedia server suppliers
and the DOVE visualization agents (consumers) and can continue operation.
When a supplier's attempt to push a new event to the DOVE agent on M1
(the old primary) fails, the supplier will transparently connect to the
new primary DOVE agent on M2.
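The failure sequence above can be walked through with a small Python model. It is a sketch under stated assumptions rather than the DOORS interfaces: the register() call mirrors the degree and style parameters described in Section 5.4, but its Python signature, the report_failure method, the Supplier class, and the spare host M3 are all hypothetical. The text only says the new backup starts "on one of the machines".

```python
class ReplicaManager:
    """Hypothetical model: replicas[0] is the primary, the rest are backups."""

    def __init__(self, hosts):
        self.free_hosts = list(hosts)
        self.replicas = []
        self.style = None

    def register(self, degree, style):
        """Create `degree` replicas on the first free hosts."""
        self.style = style
        for _ in range(degree):
            self.replicas.append(self.free_hosts.pop(0))
        return self.primary

    @property
    def primary(self):
        return self.replicas[0]

    def report_failure(self, host):
        """Called on behalf of a WatchDog: promote a backup, start a new one."""
        self.replicas.remove(host)
        if self.free_hosts:
            self.replicas.append(self.free_hosts.pop(0))
        return self.primary


class Supplier:
    """Pushes events to the primary and reconnects when a push fails."""

    def __init__(self, manager):
        self.manager = manager
        self.target = manager.primary

    def push(self, event, alive_hosts, inbox):
        if self.target not in alive_hosts:       # push to old primary fails
            self.target = self.manager.primary   # reconnect to new primary
        inbox.setdefault(self.target, []).append(event)
        return self.target


manager = ReplicaManager(["M1", "M2", "M3"])
manager.register(degree=2, style="Warm")
supplier = Supplier(manager)
inbox = {}
first = supplier.push("load=0.3", {"M1", "M2", "M3"}, inbox)

new_primary = manager.report_failure("M1")  # M1 crashes
second = supplier.push("load=0.4", {"M2", "M3"}, inbox)
```

After the simulated crash, M2 is the primary, M3 hosts the new backup, and the supplier's second event is delivered to M2 without the supplier being told about the failure explicitly.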
6. Scenarios for Subsequent NMVC Demonstrations
Most of the discussion above focused on the first demonstration
scenario with a single AN composed of an ATM network core. Future
demonstrations will support multiple ANs. The scenario below is based
on an A/V Streaming application that crosses multiple ANs. The
description sketches how QoS connections are established and the
management scenario illustrates how monitoring is used to determine
the root cause of service failure.
6.1. Monitoring Scenario: A/V Stream Across Multiple ANs
We now consider the same problem of low video quality but where the
A/V stream now spans multiple ANs (e.g., St. Louis to San
Diego). In order for the connection to be established, the ANs along
the QoS path must cooperate to ensure that sufficient resources are
reserved. In the DiffServ model, service level agreements (SLAs) for
aggregated traffic are negotiated between the ANs and enforced at the
routers at the AN boundaries, i.e., the edge (ingress and egress)
routers (see Figure 10). For this example, we assume that our A/V
stream has been assigned to one of several controlled load aggregated
channels and that the market price for this connection accounts for
some appropriate qualitative delay distribution.

Figure 10: Management of Autonomous Networks
Troubleshooting in this environment is made more difficult because we
may not have the capabilities for fully exploring the cause of our
video quality failure. This inability is caused by traffic
aggregation and multiple administrative domains. The same techniques
as in Scenario 1 can be applied to determine if the fault is within
our own AN, but these techniques will not be available to us in the
transit networks along the QoS path. If the problem is not confined
to our individual flow, hopefully, there will be sufficient evidence
of network problems that alarms will initiate corrective actions in
the transit networks.
However, application-level tools analogous to ping, traceroute, and
pathchar need to be developed to allow users (and service managers)
to investigate QoS problems that span multiple ANs. In our example,
the video server should allow users to determine whether their
connection is alive and what bandwidth the server is attempting to
deliver. The user should also be able to initiate a burst of probe
packets from the server that records the transit times from the
server to the receiver.
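As an illustration of what such a probe burst could report, the Python sketch below summarizes per-packet transit times and losses from send and receive timestamps. It is not an existing tool: the function name, the report fields, and the sample timestamps are hypothetical. It also assumes the server and receiver clocks are synchronized, which one-way transit measurements require and which the text does not address.

```python
from statistics import mean


def transit_report(sent_ms, received_ms):
    """Summarize one probe burst.

    sent_ms:     sequence number -> send time at the server (ms)
    received_ms: sequence number -> arrival time at the receiver (ms)
    Assumes both clocks are synchronized (an assumption of this sketch).
    """
    transit = {seq: received_ms[seq] - sent_ms[seq]
               for seq in sent_ms if seq in received_ms}
    lost = sorted(seq for seq in sent_ms if seq not in received_ms)
    times = list(transit.values())
    return {
        "delivered": len(times),
        "lost": lost,
        "min_ms": min(times) if times else None,
        "mean_ms": mean(times) if times else None,
        "max_ms": max(times) if times else None,
    }


# Made-up burst: four probes sent 20 ms apart, probe 2 never arrives.
sent = {0: 0, 1: 20, 2: 40, 3: 60}
received = {0: 35, 1: 58, 3: 101}
report = transit_report(sent, received)
```

A rising transit time across the burst, or losses concentrated in it, would suggest congestion somewhere along the QoS path even when the transit networks themselves cannot be inspected.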
Additional Information
If you have any questions about this integrated demo please contact:
Yamuna Krishnamurthy
Nagarajan Surendran
Chris Gill
For questions about DOORS please contact:
Aniruddha Gokhale
Shalini Yajnik
Last modified 00:49:50 CST 09 March 1999