Network Monitoring, Visualization, and Control for High-Speed, Large-Scale Networks


This document describes the design of a highly scalable differentiated services (DS) testbed and its associated network monitoring, visualization, and control (NMVC) middleware. This integrated system is designed to allow end-users, applications, and administrators to monitor, visualize, and control the performance of their quality of service (QoS) across multiple autonomous networks (AN)s by integrating (1) advanced networking hardware, (2) QoS-enabled middleware, and (3) scalable visualization algorithms to support differentiated services across ANs.

This document also describes the first of a series of NMVC demonstration systems, each building on the experience gained by its predecessor. The initial demonstration system combines our real-time CORBA Audio/Video Streaming, Trading, and Naming services, Distributed Object Visualization Environment (DOVE) framework, and Distributed Object-Oriented Reliable Service (DOORS). This demonstration system will build our experience integrating these advanced components, define a rapid prototyping environment from which we can obtain valuable baseline performance and resource costs, and provide feedback on the middleware components and hooks necessary to extend the system capabilities statically and dynamically.

The key benefits of our NMVC middleware are to (1) ensure adequate end-to-end QoS to applications while (2) maintaining high levels of network and endsystem resource utilization. In addition, NMVC allows network administrators to calibrate and fine-tune network and application parameters in real-time according to observed traffic patterns. These capabilities of NMVC help to ensure that network elements, middleware network services, and applications adapt efficiently to dynamically changing conditions.

1. Introduction

1.1. Context: Next-Generation Internet Applications

Next Generation Internet (NGI) applications require advanced networks and middleware to support a wide mix of dynamically changing multimedia streams. Internet-spanning activities, such as high-bandwidth data acquisition, transparent data cache updating, and remote collaboration, are representative sources for these streams. For instance, BaBar at SLAC and CMS at CERN are high-energy physics experiments that generate petabytes of testbeam data that must be filtered and stored in large-scale databases. Groups of geographically distributed scientists collaborate to view, analyze, and interpret the data and to design new experiments. The quality, and sometimes the feasibility, of these activities is directly related to the ability of the networks and middleware to provide these applications with adequate end-to-end quality of service (QoS), such as guaranteed bandwidth and bounds on delay and jitter.

1.2. Problem: Specifying, Enforcing, and Managing End-to-End Quality of Service for NGI Applications

Conventional internets that span multiple autonomous networks (ANs) cannot deliver end-to-end QoS to NGI applications like those described above. Moreover, applications will only be able to use the NGI effectively if middleware tools, technologies, and communication frameworks can be developed and deployed to program, provision, monitor, and control applications and services that span multiple ANs. Although researchers and commercial vendors have produced network elements that offer both high performance and some QoS features, capabilities for programming, provisioning, monitoring, and controlling end-to-end QoS across multiple ANs are only beginning to be addressed.

Supporting end-to-end QoS guarantees across multiple ANs is challenging. It is difficult even within one AN where a single administrative entity is responsible for delivering services. Managing these services requires satisfying multiple stakeholders. In particular, service providers must deliver end-to-end QoS guarantees consistently. Likewise, end-users and network administrators must verify that the QoS requests specified by applications are actually delivered.

The complexity of managing applications in a multi-AN environment is exacerbated by the following three forces:

  1. The exponential growth in demand from users for ever richer network services.

  2. The desire of providers to fully utilize network resources.

  3. The complexity of the interactions between advanced networking components, which themselves are becoming more complex.

These forces yield inherently unstable distributed applications and systems, where defects are often triggered as service providers attempt to maximize their resource utilization. Troubleshooting in this context is equivalent to debugging a large, complex distributed program with limited meters and controls, and with limited knowledge and control of advanced networking and middleware technology.

Providing these services across multiple ANs adds another dimension of difficulty that involves cooperation between different administrative domains. This cooperation includes such things as (1) the negotiation of service level agreements (SLAs), (2) resource allocation for QoS flows, (3) traffic scheduling consistent with QoS guarantees, and (4) traffic shaping and policing to enforce SLAs. Furthermore, this cooperation must be conducted between networks that may utilize different hardware, different operating systems, and different resource management policies.
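As an illustration of item (4), a token bucket is one common way to police traffic against an agreed rate. The sketch below is a minimal example, not part of the NMVC design; the class name, the rate, and the burst size are hypothetical, and time is passed in explicitly so that the behavior is deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class TokenBucket:
    """Hypothetical SLA policer: admits packets while tokens remain."""
    rate_bytes_per_s: float   # committed rate (assumed SLA value)
    burst_bytes: float        # largest burst the SLA tolerates
    tokens: float = field(init=False)
    last_time: float = field(default=0.0, init=False)

    def __post_init__(self) -> None:
        self.tokens = self.burst_bytes  # start with a full bucket

    def conforms(self, now: float, packet_bytes: int) -> bool:
        """Return True if the packet is within the SLA at time `now`."""
        elapsed = now - self.last_time
        # Refill at the committed rate, never beyond the burst size.
        self.tokens = min(self.burst_bytes,
                          self.tokens + elapsed * self.rate_bytes_per_s)
        self.last_time = now
        if packet_bytes <= self.tokens:
            self.tokens -= packet_bytes
            return True
        return False  # out of profile: drop or mark, depending on policy
```

An edge router would call `conforms` once per arriving packet and drop or re-mark packets for which it returns False.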

1.3. Solution: Integrated Network Monitoring, Visualization, and Control (NMVC) Middleware

To alleviate the problems described above, we are developing an integrated network monitoring, visualization, and control (NMVC) middleware that can provide the uniform resource, service, and control access interfaces necessary to support NGI applications and network services. Our NMVC middleware supports the following features:

Only integrated middleware, such as NMVC, can be (1) general enough to support stringent NGI application bandwidth, delay, and jitter requirements, (2) adaptable enough to react to changing user demands and technological evolution, and (3) stable enough for the development of significant amounts of advanced network and middleware services.

2. Introduction to Our Integrated Demonstrations

2.1. General Goals

Our integrated demonstrations are designed to illustrate how to assemble and deploy a network management system that can monitor, visualize, and control applications that possess stringent QoS requirements for performance, predictability, and reliability over autonomous networks (AN)s using standards-based middleware, such as CORBA. These demonstrations will involve (1) integrating and enhancing hardware and software components we have developed, (2) conducting systematic traffic measurements, and (3) measuring the overhead of the network management system and benchmarking tools themselves.

When complete, our integrated demos will illustrate a range of scenarios that reflect the multi-dimensional aspects, e.g., service provider responsibilities and end-user expectations, of QoS specification, enforcement, and management. The following are some of the general network management research issues that will be addressed by experiments conducted with our NMVC demonstration system:

By addressing these questions, we will also address many other interesting questions that will be useful in designing future demonstration systems. Moreover, the experience gained from developing the demonstration systems will require us to address many system design and optimization issues.

2.2. Initial Demonstration Goals

The first integrated demonstration will focus on the following two complementary scenarios:

The primary focus of our initial demonstration will be on passive measurement of network traffic, not on full feedback control between networks, applications, and end-users (which will be the subject of subsequent demonstrations). The next section presents some network management scenarios that elaborate on the passive measurement process. From a network management research perspective, we will focus on conducting basic measurements that provide data on the resource consumption of our measurement infrastructure and on the operation of a commercial ATM switch. In addition to user traffic, therefore, we will also monitor network control traffic (e.g., signaling and resource management cells) so that we can verify the operation of vendor control software.

The following are some of the specific network management research questions that will be addressed by experiments conducted with our first NMVC demonstration system:

3. Scenarios for Initial Demonstration

Our first demo will illustrate the capabilities of our NMVC middleware in supporting TAO's Audio/Video Streaming under various failure scenarios within a single AN composed of an ATM network core. This demonstration will showcase how the features of the Trading and Naming services, the Distributed Object Visualization Environment (DOVE) framework, and the Distributed Object-Oriented Reliable Service (DOORS) can be applied to the network management scenarios described below. The demonstration system will allow (1) end-users to select MPEG movies from various multimedia servers, (2) network administrators to visualize the system in order to troubleshoot performance bottlenecks and network/application faults, and (3) system developers to emulate simple failure modes. Fault tolerance of key system components will also be supported.

Below, we describe two scenarios that guide our initial demonstration. One scenario focuses on network monitoring, visualization, and control from an end-user application perspective, whereas the other scenario focuses on a more traditional OSS perspective.

3.1. End-user Application Scenario: Control and Management of A/V Streams

3.2. OSS Monitoring Scenario: Control and Management of A/V Streams within a Single AN

4. Structure of the Integrated Demonstration

Figure 1 illustrates the various components in our initial integrated demonstration:

Figure 1: Integrated Demo Architecture

The demo contains a number of components that are described below.

4.1. High-speed Networking Components

From a hardware perspective, the demonstration system will consist of a single FORE switch with associated control processor (CP), client stations, servers, and APIC-based network probes, as shown in Figure 2.

Figure 2: Switch Configuration

Each network probe is built from an ATM Port Interconnect Controller (APIC) chip and a Smart PortCard (SPC) CPU-memory module. This probe forms the fundamental traffic monitoring component of our NMVC project. The APIC has two full-duplex ATM ports, so it can easily be inserted in a link and can thus efficiently snoop traffic as the ATM cells move from the input port to the output port. Future network configurations include a larger number of FORE switches and a network of our own gigabit IP routers (GIPR). The GIPR will consist of a Washington University Gigabit Switch (WUGS) with IP Processing Elements (IPPE) at each input port. The IPPEs will use the APIC-SPC nodes (described earlier as the network probe) to support IP routing, monitoring, and specialized network control functions.
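To make the snooping step concrete, the following sketch shows how software behind a probe might separate user cells from control cells by decoding the 5-byte ATM cell header. It is an illustration only, not the APIC's actual logic. It assumes the UNI header layout (GFC, VPI, VCI, PT, CLP, HEC), treats VPI 0 / VCI 5 as the signaling channel, and treats payload type 110 as a resource management cell. The function names are hypothetical, and the HEC byte is neither computed nor checked.

```python
def build_cell(vpi: int, vci: int, pti: int, clp: int = 0) -> bytes:
    """Build a 53-byte test cell with a UNI header (GFC = 0, HEC left at 0)."""
    b0 = (vpi >> 4) & 0x0F                      # GFC (0) | VPI high nibble
    b1 = ((vpi & 0x0F) << 4) | ((vci >> 12) & 0x0F)
    b2 = (vci >> 4) & 0xFF
    b3 = ((vci & 0x0F) << 4) | ((pti & 0x07) << 1) | (clp & 0x01)
    return bytes([b0, b1, b2, b3, 0]) + bytes(48)


def parse_uni_header(cell: bytes) -> dict:
    """Decode VPI, VCI, payload type, and CLP from a 53-byte cell."""
    if len(cell) != 53:
        raise ValueError("an ATM cell is exactly 53 bytes")
    b0, b1, b2, b3 = cell[0], cell[1], cell[2], cell[3]
    return {
        "vpi": ((b0 & 0x0F) << 4) | (b1 >> 4),
        "vci": ((b1 & 0x0F) << 12) | (b2 << 4) | (b3 >> 4),
        "pti": (b3 >> 1) & 0x07,
        "clp": b3 & 0x01,
    }


def classify(cell: bytes) -> str:
    """Coarse classification of a snooped cell."""
    h = parse_uni_header(cell)
    if h["vpi"] == 0 and h["vci"] == 5:
        return "signaling"
    if h["pti"] == 0b110:
        return "resource-management"
    if h["pti"] in (0b100, 0b101):
        return "oam"
    if h["pti"] == 0b111:
        return "reserved"
    return "user"
```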

The SPC will house several software components. It will run a NetBSD-based operating system that has been specially modified for use in our Active Networking Node (ANN) project. The node is active in the sense that its functionality can be dynamically altered by code modules loaded over the network itself. This OS can perform high-speed IP packet classification in the kernel and can be easily modified to perform high-speed event filtering of IP packets. Furthermore, event filters can be dynamically installed by using the dynamic module installation mechanisms of the ANN. An embedded version of TAO will provide a minimum-footprint ORB that supports uniform access to network controls, communication services, and other useful services such as event channels. Events are filtered from the incoming packet stream and pushed to or pulled from the agent, as shown in Figure 3.

Figure 3: Agent Architecture

This agent will collect the events and send them to the NOC using the TAO event channel services. In the future, various forms of event channeling and event fusion can be incorporated into this structure.
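The push and pull paths between the event filter and the agent can be sketched as follows. This is a simplified model, not the TAO event channel API. `FilterStage`, `Agent`, and the `send_to_noc` callback are hypothetical names, and packets are represented as plain dictionaries.

```python
from collections import deque
from typing import Callable, Deque, List, Optional

Packet = dict  # hypothetical packet representation, e.g. {"dst_port": 5004}


class FilterStage:
    """Keeps only packets that match a predicate and turns them into events."""

    def __init__(self, predicate: Callable[[Packet], bool],
                 push_to: Optional[Callable[[Packet], None]] = None) -> None:
        self.predicate = predicate
        self.push_to = push_to            # set: push model; None: pull model
        self.buffer: Deque[Packet] = deque()

    def on_packet(self, packet: Packet) -> None:
        if not self.predicate(packet):
            return                        # uninteresting traffic is ignored
        if self.push_to is not None:
            self.push_to(packet)          # push the event to the agent now
        else:
            self.buffer.append(packet)    # hold it until the agent pulls

    def pull(self) -> List[Packet]:
        events = list(self.buffer)
        self.buffer.clear()
        return events


class Agent:
    """Collects events and forwards them toward the NOC."""

    def __init__(self, send_to_noc: Callable[[Packet], None]) -> None:
        self.send_to_noc = send_to_noc    # stand-in for an event channel

    def on_event(self, event: Packet) -> None:
        self.send_to_noc(event)

    def poll(self, stage: FilterStage) -> None:
        for event in stage.pull():
            self.on_event(event)
```

In the push model the filter is constructed with `push_to=agent.on_event`; in the pull model the agent calls `poll` on its own schedule.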

4.2. A/V Streaming Components

Traditional distributed object computing (DOC) middleware, such as CORBA, is well-suited for objects that communicate using request/response interactions between clients and servers. DOC middleware traditionally has not been as well suited, however, for multimedia streaming applications, where continuous streams of data are exchanged between suppliers and consumers. The CORBA Audio/Video Streaming Service alleviates much of this problem via a standard architecture that supports multimedia applications efficiently using CORBA. In addition, the A/V Streaming service helps streams developed using different ORB implementations to interoperate.

4.3. Overview of CORBA A/V Streaming

Figure 4 illustrates the key components in the CORBA Audio/Video Streaming service architecture.

Figure 4: CORBA A/V Streaming Components

A stream is a continuous media transfer between two or more virtual multimedia devices (MMDevices). An MMDevice abstracts a multimedia device, which could be logical (such as a file) or physical (such as a video player). The Stream Control (StreamCtrl) object is used to bind two multimedia devices to establish a stream. MMDevices create an endpoint of communication (StreamEndpoint) and a Virtual Device (VDev) for each connection. The StreamEndpoint captures the network aspects of the stream, such as the TCP/IP communication endpoint. The VDev represents the properties of the device for the current stream, such as the format of the stream data, e.g., MPEG-1 or MJPEG. Additional information on the CORBA A/V Streaming service architecture is available in the CORBA Audio/Video Streaming specification and in a paper that appeared at HICSS '99.
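The relationships among these objects can be modeled in a few lines. The sketch below is a simplified stand-in written in plain Python, not the CORBA IDL interfaces. The `bind` method, the constructor arguments, and the rule of picking the first common format are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class StreamEndpoint:
    address: str        # network aspect of the stream, e.g. "host:port"


@dataclass
class VDev:
    media_format: str   # device properties for this stream, e.g. "MPEG-1"


class MMDevice:
    """Abstracts a logical or physical multimedia device."""

    def __init__(self, name: str, address: str, formats: List[str]) -> None:
        self.name = name
        self.address = address
        self.formats = formats

    def create_connection(self, media_format: str) -> Tuple[StreamEndpoint, VDev]:
        # One StreamEndpoint and one VDev are created per connection.
        if media_format not in self.formats:
            raise ValueError(f"{self.name} does not support {media_format}")
        return StreamEndpoint(self.address), VDev(media_format)


class StreamCtrl:
    """Binds two MMDevices to establish a stream."""

    def bind(self, a: MMDevice, b: MMDevice) -> str:
        common = [f for f in a.formats if f in b.formats]
        if not common:
            raise ValueError("no common media format")
        chosen = common[0]  # assumption: the first common format wins
        self.a_side = a.create_connection(chosen)
        self.b_side = b.create_connection(chosen)
        return chosen
```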

4.4. Server-side A/V Streaming Components

The components shown in Figure 5 are used on the server and described below.

Figure 5: Server Components

4.5. Client-side A/V Streaming Components

The components shown in Figure 6 are used on the client and described below.

Figure 6: Client Components

4.6. DOVE Components

The Distributed Object Visualization Environment (DOVE) is designed to monitor, visualize, and control application components distributed throughout a network. DOVE supports several scenarios, including (1) allowing service providers to manage the QoS provided by network components and (2) allowing end-users and applications the capability to apply dynamic feedback about their realized QoS to help control applications and system resources more effectively.

Figure 7 shows the key components in DOVE.

Figure 7: DOVE Components

Each component is described below:

4.7. DOORS Components

Distributed Object-Oriented Reliable Service (DOORS) provides high availability to CORBA applications through replication of CORBA servers. It maintains this availability even in the presence of failures through failure detection and restart of individual replicas in a server group.

Functionality provided by DOORS:

Addition of DOORS to the integrated demo of A/V Streaming and DOVE can provide the following benefits:

5. Dynamic Scenarios in the Integrated Demo

The components in the integrated demo that we described above will interact as described in the following scenarios.

5.1. A/V Server Properties Export Scenario

Each A/V server will export an offer containing server properties and a reference to its MMDevice_Exporter to a Trading Service instance. The server will advertise three types of properties: device configuration (audio/video formats), server performance (load, disk usage, and available bandwidth), and movie selection (the contents of the movie directory). Properties that change frequently, such as server performance and movie availability, will be registered with the Trading Service as dynamic properties. The server will implement dynamic property callback interfaces to handle their evaluation.
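The distinction between static and dynamic properties can be illustrated with a toy trader. This is a sketch under stated assumptions, not the CORBA Trading Service API. The `Trader` class, its `export` and `query` methods, and the property names are hypothetical. A dynamic property is modeled as a callable that the trader evaluates at query time.

```python
from typing import Callable, Dict, List, Tuple


class Trader:
    """Toy trader: stores offers and evaluates dynamic properties on query."""

    def __init__(self) -> None:
        self._offers: List[Tuple[str, Dict[str, object]]] = []

    def export(self, reference: str, properties: Dict[str, object]) -> None:
        # `reference` stands in for the MMDevice_Exporter object reference.
        self._offers.append((reference, dict(properties)))

    def query(self, constraint: Callable[[Dict[str, object]], bool]):
        results = []
        for reference, props in self._offers:
            # Static values are used as-is; callables are dynamic
            # properties and are evaluated by calling back to the server.
            snapshot = {k: (v() if callable(v) else v)
                        for k, v in props.items()}
            if constraint(snapshot):
                results.append((reference, snapshot))
        return results
```

Because the load value is fetched through the callback on every query, the trader never serves a stale copy of it.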

Figure 8: Java Interface Agent

5.2. A/V Server Discovery Scenario

The client will discover and bind to a server whose properties best match its device configuration, that has a movie the user finds worth viewing, and that offers the best possible performance. A GUI will prompt the user for the client's configuration, and allow the user to choose from movies available on the servers that match the client's configuration. After the user selects a movie, the GUI will allow the user to select the server that has the best performance from among those that offer the movie. Hence, the best-matched server selection occurs in the following three phases:

  1. The client first issues a query asking for the movie listings of all servers that deliver video and audio in a format understandable by the client and that can accept more connections. The result is then presented to the user as in Figure 8.

  2. The client's user then selects an interesting movie from a list compiled from the results of the first query. In a second query, the client requests the performance characteristics of those servers that both match the client's configuration and are showing the selected movie. This happens when the user presses the "Performance Graph" field on the movie properties in Figure 8.

  3. The trader will store the performance characteristics (load, CPU usage, disk access, network traffic, context switches, and so on) as callback interfaces to the server (dynamic properties). The client can attach these interfaces to DOVE as suppliers to be periodically polled, so the client can visually display charts of each server's performance, updated periodically by callbacks to those interfaces. Then, based on the charts, the user selects the server with the most suitable performance, and the client opens an audio/video connection to that server using the MMDevice object reference it retrieved from the trader.
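The three phases above can be sketched as successive filters over the offers returned by the trader. The offer data, field names, and function names below are invented for illustration. In the demonstration the user makes the final choice from the performance charts, so picking the lowest load automatically is only a stand-in for that step.

```python
from typing import Dict, List, Set

# Hypothetical offers as a trader might return them (all values invented).
OFFERS = [
    {"server": "av1", "formats": {"MPEG-1"}, "free_slots": 2,
     "movies": ["Alpha", "Beta"], "load": 0.7},
    {"server": "av2", "formats": {"MPEG-1", "MJPEG"}, "free_slots": 1,
     "movies": ["Beta"], "load": 0.3},
    {"server": "av3", "formats": {"MJPEG"}, "free_slots": 0,
     "movies": ["Beta"], "load": 0.1},
]


def _usable(offer: dict, client_formats: Set[str]) -> bool:
    """Server speaks a format the client understands and has capacity."""
    return bool(offer["formats"] & client_formats) and offer["free_slots"] > 0


def phase1_movie_listings(offers: List[dict],
                          client_formats: Set[str]) -> Dict[str, List[str]]:
    """Query 1: movie listings of all compatible servers."""
    return {o["server"]: o["movies"]
            for o in offers if _usable(o, client_formats)}


def phase2_performance(offers: List[dict], client_formats: Set[str],
                       movie: str) -> Dict[str, float]:
    """Query 2: performance of compatible servers offering the movie."""
    return {o["server"]: o["load"] for o in offers
            if _usable(o, client_formats) and movie in o["movies"]}


def phase3_choose(performance: Dict[str, float]) -> str:
    """Stand-in for the user's choice: pick the least loaded server."""
    return min(performance, key=performance.get)
```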

5.3. Movie Selection Scenario

The A/V server exports a sequence of movie information structures to the Trading Service, each containing, for example, a movie name, associated file name, frame rate, duration, and frame size. Each structure will also contain a URL for a web page about the movie. The Java Interface Agent, which communicates with the Trading Service using JavaIDL, compiles the movie names from each server into a list in the GUI. When the user selects an item in the list, the Java client generates a table containing all the technical information about the movie (duration, frame rate, and frame size) as advertised by each compliant server that offers the movie, and displays the table in the main frame.

Once the user has decided on a server and movie, the Java Interface Agent passes the server IOR and the movie name to the A/V client controller process.
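A possible shape for the movie information structure and the per-movie table is sketched below. The field names, units, and the `movie_table` helper are assumptions; the actual IDL structure is not given in this document.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class MovieInfo:
    """Hypothetical mirror of the exported movie information structure."""
    name: str
    file_name: str
    frame_rate: float   # frames per second (assumed unit)
    duration_s: int     # seconds (assumed unit)
    frame_size: str     # e.g. "352x240"
    url: str            # web page about the movie


def movie_table(offers: Dict[str, List[MovieInfo]],
                movie_name: str) -> List[Tuple[str, int, float, str]]:
    """One row per server that offers the movie, sorted by server name."""
    rows = []
    for server, movies in sorted(offers.items()):
        for m in movies:
            if m.name == movie_name:
                rows.append((server, m.duration_s, m.frame_rate, m.frame_size))
    return rows
```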

5.4. DOORS Fault Tolerance Scenario

In this demo, the fault tolerance of the DOVE agent is illustrated by having the DOORS components interact with the various TAO services and DOVE agent shown in Figure 9.

Figure 9: DOORS Components with DOVE agents

These interactions are described below:

DOORS does not provide automatic state transfer between replicas. An application that is to be made fault tolerant must therefore be modified to support state transfer between its replicas, which prevents loss of state when a failure occurs. The DOVE agent will be modified to provide support for state transfer. Since the DOVE agent carries a large state, it will be implemented as a hot backup, i.e., all state-affecting requests will be forwarded to the backup, which will process each request to update its state but will suppress any replies.

The integrated demo will show how, when the primary DOVE agent fails, the backup DOVE agent takes over without any loss of state.
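The hot backup idea can be sketched as follows: every state-affecting request is applied to both replicas, but only the primary's reply reaches the caller. The class and operation names are hypothetical, and the forwarding is modeled as a direct call rather than a CORBA invocation.

```python
class DoveAgentReplica:
    """Toy DOVE agent whose state is its supplier and consumer sets."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.suppliers = set()
        self.consumers = set()

    def handle(self, op: str, ref: str) -> str:
        if op == "register_supplier":
            self.suppliers.add(ref)
        elif op == "register_consumer":
            self.consumers.add(ref)
        else:
            raise ValueError(f"unknown operation: {op}")
        return f"{self.name}: {op} {ref} ok"


class HotBackupPair:
    """Forwards each state-affecting request to the backup as well."""

    def __init__(self, primary: DoveAgentReplica,
                 backup: DoveAgentReplica) -> None:
        self.primary = primary
        self.backup = backup

    def handle(self, op: str, ref: str) -> str:
        reply = self.primary.handle(op, ref)
        self.backup.handle(op, ref)   # backup updates state; reply discarded
        return reply                  # the caller only sees the primary
```

Because the backup has already processed every request, promoting it requires no bulk state transfer at failure time.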

5.5. Operation and Failure Scenario

  1. The DOVE agent is registered with the ReplicaManager as described above.

  2. Two copies of the DOVE agent are running in the network, the copy on host M1 being the primary and the copy on host M2 being the backup.

  3. All monitoring applications interested in the multimedia servers will get the dynamic callbacks of the multimedia servers from the Trading Service and register them with the primary DOVE agent as suppliers of events. Since the backup DOVE agent is implemented as a hot backup, all of these suppliers will automatically be registered with the backup DOVE agent.

  4. Similarly, registrations of DOVE visualization components as consumers of events will be propagated to both the primary and the backup.

  5. Now, consider the scenario where the DOVE agent on host M1 crashes. The DOORS WatchDog component on M1 will detect the failure and report the failure to the ReplicaManager. The ReplicaManager asks the DOVE agent on M2 to take over as the primary. It also starts up another copy of the DOVE Agent on one of the machines as the new backup.

  6. The new primary holds the references to the multimedia server suppliers and the DOVE visualization agents (consumers) and can continue operation.

    When a supplier tries to push a new event to the DOVE agent on M1 (the old primary) and it fails, it will transparently connect to the new primary DOVE Agent on M2.
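The failure handling in steps 5 and 6 can be sketched as a small state machine. This is not the DOORS API. `ReplicaManager`, `Supplier`, the host names, and the `deliver` callback are hypothetical, and failure detection by the WatchDog is reduced to a direct call to `report_failure`.

```python
from typing import Callable, List, Optional


class ReplicaManager:
    """Tracks which host runs the primary and which runs the backup."""

    def __init__(self, primary: str, backup: str,
                 spare_hosts: List[str]) -> None:
        self.primary: str = primary
        self.backup: Optional[str] = backup
        self.spares = list(spare_hosts)
        self.alive = {primary, backup}

    def _start_new_backup(self) -> None:
        self.backup = self.spares.pop(0) if self.spares else None
        if self.backup is not None:
            self.alive.add(self.backup)

    def report_failure(self, host: str) -> None:
        """Called by the WatchDog on the failed replica's host."""
        self.alive.discard(host)
        if host == self.primary and self.backup is not None:
            self.primary = self.backup    # backup takes over as primary
            self._start_new_backup()      # and a fresh backup is started
        elif host == self.backup:
            self._start_new_backup()


class Supplier:
    """Pushes events to the primary and rebinds when a push fails."""

    def __init__(self, manager: ReplicaManager) -> None:
        self.manager = manager
        self.target = manager.primary

    def push(self, event: str, deliver: Callable[[str, str], None]) -> None:
        try:
            deliver(self.target, event)
        except ConnectionError:
            self.target = self.manager.primary  # reconnect to new primary
            deliver(self.target, event)
```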

6. Scenarios for Subsequent NMVC Demonstrations

Most of the discussion above focused on the first demonstration scenario with a single AN composed of an ATM network core. Future demonstrations will support multiple ANs. The scenario below is based on an A/V Streaming application that crosses multiple ANs. The description sketches how QoS connections are established and the management scenario illustrates how monitoring is used to determine the root cause of service failure.

6.1. Monitoring Scenario: A/V Stream Across Multiple ANs

We now consider the same problem of low video quality, but where the A/V stream spans multiple ANs (e.g., St. Louis to San Diego). In order for the connection to be established, the ANs along the QoS path must cooperate to ensure that sufficient resources are reserved. In the DiffServ model, service level agreements (SLAs) for aggregated traffic are negotiated between the ANs and enforced at the routers at the AN boundaries, i.e., the edge (ingress and egress) routers (see Figure 10). For this example, we assume that our A/V stream has been assigned to one of several controlled load aggregated channels and that the market price for this connection accounts for some appropriate qualitative delay distribution.

Figure 10: Management of Autonomous Networks

Troubleshooting in this environment is made more difficult because we may not have the capabilities for fully exploring the cause of our video quality failure. This inability is caused by traffic aggregation and multiple administrative domains. The same techniques as in Scenario 1 can be applied to determine if the fault is within our own AN, but these techniques will not be available to us in the transit networks along the QoS path. If the problem is not confined to our individual flow, hopefully, there will be sufficient evidence of network problems that alarms will initiate corrective actions in the transit networks.

However, application-level tools analogous to ping, traceroute, and pathchar need to be developed to allow users (and service managers) to investigate QoS problems that span multiple ANs. In our example, the video server should allow users to determine whether their connection is alive and what bandwidth the server is attempting to deliver. The user should also be allowed to initiate a burst of probe packets from the server that records the transit times from the server to the receiver.
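As a sketch of what such a probe tool might report, the function below summarizes a burst of probe packets given their send and receive timestamps. It assumes that the server and receiver clocks are synchronized closely enough for one-way delays to be meaningful, uses integer microseconds, and defines jitter as the mean absolute difference between consecutive delays. The function name and the report fields are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Optional


def transit_stats(send_us: List[int],
                  recv_us: List[Optional[int]]) -> Dict[str, float]:
    """Summarize one probe burst; a recv entry of None marks a lost probe."""
    delays = [r - s for s, r in zip(send_us, recv_us) if r is not None]
    lost = sum(1 for r in recv_us if r is None)
    if not delays:
        return {"lost": lost}             # nothing arrived, no delay data
    # Jitter: average change in delay between consecutive received probes.
    diffs = [abs(b - a) for a, b in zip(delays, delays[1:])]
    return {
        "lost": lost,
        "min_us": min(delays),
        "mean_us": mean(delays),
        "max_us": max(delays),
        "jitter_us": mean(diffs) if diffs else 0,
    }
```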

Additional Information

If you have any questions about this integrated demo please contact:

Yamuna Krishnamurthy
Nagarajan Surendran
Chris Gill

For questions about DOORS please contact:

Aniruddha Gokhale
Shalini Yajnik

Back to Douglas C. Schmidt's home page.

Last modified 00:49:50 CST 09 March 1999

Last modified 11:34:22 CDT 28 September 2006