Cone Beam Tomography using MPI on Heterogeneous Workstation Clusters.

From: MPI Developers Conference and Users' Group Meeting. Proceedings, Second MPI Developer's Conference, pages 142-148. University of Notre Dame, Notre Dame, Indiana, 1-2 July 1996.

Cone Beam Tomography using MPI on Heterogeneous Workstation Clusters

David A. Reimann(1,2), Vipin Chaudhary(3), Michael J. Flynn(1), and Ishwar K. Sethi(2)
1 - Department of Radiology, Henry Ford Hospital
2 - Department of Computer Science, Wayne State University
3 - Department of Electrical and Computer Engineering, Wayne State University
Detroit, Michigan 48202

Introduction

Cone beam tomography is a computed tomography (CT) method which efficiently acquires projectional data through a single rotation. Using a 2D detector, a 3D volume can be reconstructed with isotropic voxel sizes without the mechanical translation required for conventional CT. Our current application area involves 3D CT with microscopic resolution [1].

With cone beam CT long computation and a large amount of memory is required. Reconstruction times of several hours have been required on conventional workstation. Large voxel arrays are used, typically 256³ to 512³. A pseudocoding of the cone beam algorithm is given by:

      1   Initialize each PE
      2      Read and Broadcast problem specifications
      3      Partition and allocate memory
      4      Allocate memory
      5      Precomputation of weights
      6   for each  (N  views):
      7      if (PE is ROOT)
      8         Read Projection
      9         Weight and Filter Projection
      10     Broadcast Projection
      11     Backproject Projection
      12  Gather Reconstructed Voxels on ROOT PE

The majority of the computation is in the backprojection (11). The majority of the communications is spent in sending the projection (10). We have investigated the use of parallel methods on workstation clusters to overcome both the long processing time and lessen memory requirements on individual workstations.

Because of the large amounts of data involved in this problem, it was our interest to use an explicit message passing library for the implementation to achieve maximum performance. Our earlier results showed 81.8% processor utilization on a heterogeneous cluster of 6 Sun workstations [3]. This high degree of utilization was achieved by load balancing in the backprojection (11) and in the filtering (9) steps.

In this work, we demonstrate improved performance using asynchronous communications to broadcast the projection (10) using MPICH.

Methods

The reconstruction problem is very suitable for asynchronous communication because of the repeated deterministic nature of the views in the problem. By using asynchronous communication to send projectional data while backprojecting the previous projection. the communications overhead can be eliminated on the receiving processors as long as the backprojection time is greater than or equal to the communication time.

We replaced the synchronous broadcast (MPI_Bcast) with repeated asynchronous send/receive pairs (MPI_Isend/MPI_Irecv) and with double buffering of the projection to assure correctness. For the large projection sizes, the MPI_Isend became non-asynchronous. To achieve asynchronous communication, a repeated MPI_Test during backprojection on the receiving processors was needed to complete the send.

Results

Use of asynchronous broadcast in the cone beam CT problem has improved utilization from 81.8% to 91.9% on a cluster of 6 heterogeneous workstations.

Discussion

Asynchronous communication was very useful for this problem. Further effort is needed in addressing the interplay between workstation performances and communication speeds in a heterogeneous environment.

In addition to asynchronous communication, another possibility is a multiphase broadcast. Since many networks now isolate traffic into physical subnetworks with bridges and routers a broadcast method using forwarding can be employed. In this method, the master processes broadcast to a head slave in each subnet which then forwards to each workstation on the subnet. This can be easily implemented using communicator groups. A disadvantage of this method requires apriori knowledge of the network topology. However, this topology is likely to change slowly and could be maintained in a configuration file with processor speed and availability. A multiphase broadcast could be done both synchronously and asynchronously.

Acknowledgements

Partial support of this work was provided by NIH Grant RO1-AR42101.

References

D.A. Reimann, M.J. Flynn, and S.M. Hames, "A Flexible Laboratory System for 3D X-Ray Microtomography of 3-50 mm Specimens," IS&T/SPIE Symposium on Electronic Imaging: Science and Technology, San Jose, California, February 5-10, 1995. Abstract. Compressed postscript article.

L.A. Feldkamp, L.C. Davis, and J.W. Kress. Practical cone-beam algorithm. Journal of the Optical Society of America A, 1:612-19, June 1984.

D.A. Reimann, V. Chaudhary, M.J. Flynn, and I.K. Sethi. Parallel Implementation of Cone Beam Tomography. 1996 International Conference on Parallel Processing, 12-16 August 1996, Indian Lakes Resort, Bloomingdale, Illinois. Abstract.