Pooch
Yes, Pooch launches multiple tasks per node to take advantage of multiple processing cores. The number of tasks allowed per node is equal to the total number of licensed nodes. For example, a 4-node Pooch will run 4 tasks per node, which is appropriate for 4-core Macs. Similarly, an 8-node Pooch will run 8 tasks per node, which is appropriate for 8-core Macs. Currently, 16-node and larger versions of Pooch are limited to 16 tasks per node.
Yes, version 1.7 of Pooch is the first Universal Application cluster solution that operates natively on both the new Intel-based Macs and PowerPC Macs. It is also the first and only parallel computing solution to support Universal Clustering: launching other Universal Applications in parallel on a cluster.
You may start with the materials on this web site, but where you begin
depends on your background.
The Getting Started page
lists starting points depending on your
knowledge, experience, and interest.
The links from that page may serve as your introduction.
Yes, Pooch has supported parallel computing using the distributed-memory MPI model from day one. Pooch currently supports seven MPIs. Information on using and compiling these MPIs can be found in the Cluster Software Development Kit and on the Compiling the MPIs page. We commonly recompile Fortran and C distributed-memory MPI codes, already portable across platforms like Cray, SGI, IBM SP, and Linux clusters, on the Mac platform without modification. Making that possible for our plasma physics simulations was a design requirement from the first day of the Mac cluster at UCLA Physics. We highly recommend that users start with MacMPI_X if possible because of its long history of stability and reliability, its flexibility in compiling under different programming environments, and its extremely helpful visualization tools.
We provide several links to help you with that. The Parallelization page provides an overview on the issues involved with designing and writing parallel code. You can then view the Parallel Knock, Parallel Adder, and Parallel Pascal's Triangle tutorials to get an idea of how to write parallel code or convert existing single-processor code into parallel code. While you are viewing the tutorials, you can use components of the Pooch Software Development Kit or follow the instructions on the Compiling MPI page to compile and write parallel code yourself. We also link to other references in those tutorial pages.
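As a taste of what those tutorials cover, here is a minimal "parallel adder" sketch in C. It is illustrative only, not code from the tutorials themselves: each MPI task sums its own share of the terms, and MPI_Reduce combines the partial sums on task 0.

    /* Minimal parallel adder sketch (illustrative, not from the tutorials).
       Compile with your MPI's compiler wrapper and launch via Pooch. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        const long N = 1000000;        /* total number of terms to add */
        int rank, size;
        double localSum = 0.0, totalSum = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each task takes every size-th term, starting at its rank. */
        for (long i = rank + 1; i <= N; i += size)
            localSum += (double)i;

        /* Combine the partial sums onto task 0. */
        MPI_Reduce(&localSum, &totalSum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("Sum of 1..%ld = %.0f\n", N, totalSum);

        MPI_Finalize();
        return 0;
    }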
Pooch is the shortest path to practical parallel computing, while Pooch Pro addresses the needs of those administering larger clusters for a large number of users. Pooch Pro was created because we recognized the need to manage a cluster's compute time for many users. At the same time, we saw that such features are unnecessary complications for someone who just wants to get their cluster up and running. This sort of bifurcation is not unusual in the industry; it is similar to the difference between, say, Final Cut Express and Final Cut Pro. Pooch Pro has features that some users of Pooch simply don't need. If you are a single user of a cluster, or one of a small number of users who can share their cluster, then Pooch is for you. If you are an administrator serving many users, then you should consider Pooch Pro.
Yes. Using Pooch, you can launch a parallel job taking advantage of all processors in your system using just MPI. Pooch will launch as many instances of the executable as there are processors on the included nodes and supply the appropriate information to the MPI library. This behavior, the default setting of the current Pooch, can be overridden using the Tasks per Computer menu in the Options... pop-up in the Job Window.
From a programming point of view, you can simply use the task count that MPI reports to your code at run time.
It just so happens that two MPI tasks are running on each dual-processor node.
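For example, a hedged sketch of a code that relies only on what MPI reports at run time; the work division below is illustrative, not code from any shipping app:

    /* Sketch: the code does not need to know the hardware layout.
       It asks MPI how many tasks were launched and which one it is,
       then divides the work accordingly. Pooch decides how many
       tasks land on each node. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, ntasks;
        const int totalWork = 1024;          /* hypothetical work items */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* which task am I?   */
        MPI_Comm_size(MPI_COMM_WORLD, &ntasks);  /* how many in total? */

        /* Contiguous block decomposition based only on the MPI count. */
        int chunk = totalWork / ntasks;
        int start = rank * chunk;
        int end   = (rank == ntasks - 1) ? totalWork : start + chunk;

        printf("Task %d of %d handles items %d..%d\n",
               rank, ntasks, start, end - 1);

        MPI_Finalize();
        return 0;
    }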
Yes, the machines need not be identical to run Pooch or parallel codes. Each would have to meet whatever minimum requirements (minimum RAM and so forth) the particular parallel app needs. For best overall performance, however, the parallel app would need to be able to adjust to differences in the processing performance of individual nodes. This behavior is sometimes called "load balancing". It is not always easy to implement, so not all parallel apps are written to take on the additional overhead of balancing their work. The Power Fractal app, for example, does not make any adjustments for different processor speeds, so it performs best when the nodes are identical. The Fresnel Diffraction Explorer, however, does adjust its workload depending on individual node performance. Pooch supports both categories of applications, but the parallel application has the last word in most efficiently utilizing the hardware.
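One common load-balancing approach, sketched below with hypothetical names and sizes (and not the method of any particular app mentioned above), is a "bag of tasks": a coordinator hands out work items as workers finish them, so faster nodes naturally complete more of them.

    /* Hedged sketch of dynamic "bag of tasks" load balancing with MPI. */
    #include <stdio.h>
    #include <mpi.h>

    #define NITEMS   100
    #define TAG_WORK 1
    #define TAG_STOP 2

    static void do_work(int item) { (void)item; /* placeholder computation */ }

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {                         /* coordinator */
            int next = 0, done = 0, signal;
            MPI_Status st;
            while (done < size - 1) {
                /* Wait for any worker to report ready or finished. */
                MPI_Recv(&signal, 1, MPI_INT, MPI_ANY_SOURCE,
                         MPI_ANY_TAG, MPI_COMM_WORLD, &st);
                if (next < NITEMS) {
                    MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE,
                             TAG_WORK, MPI_COMM_WORLD);
                    next++;
                } else {
                    MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE,
                             TAG_STOP, MPI_COMM_WORLD);
                    done++;
                }
            }
        } else {                                 /* worker */
            int item = -1;
            MPI_Status st;
            MPI_Send(&item, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
            for (;;) {
                MPI_Recv(&item, 1, MPI_INT, 0, MPI_ANY_TAG,
                         MPI_COMM_WORLD, &st);
                if (st.MPI_TAG == TAG_STOP) break;
                do_work(item);
                MPI_Send(&item, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
            }
        }
        MPI_Finalize();
        return 0;
    }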
Yes, those can be combined. One can think of shared-memory multiprocessing (MP), vector processing, and distributed-memory MPI as three "orthogonal axes" of parallelization. Vectorization can be accomplished directly using a compiler that supports the AltiVec macro instructions (such as Metrowerks CodeWarrior or gcc) or indirectly through a library (like Absoft's BLAS implementation). Apple's Multiprocessing libraries are available to distribute work between processors inside a box. To combine the three, partition the work at the highest level using distributed-memory MPI, then subdivide the work within each MPI task between the processors using shared-memory MP, and finally vectorize the inner loops within those routines.

A demonstration of all three is present in the Power Fractal app. By default, a single instance of that Fractal code will subdivide the work between the processors using Apple's MP. (Toggled using "Turn Multiprocessing Code On/Off" under the File menu.) When it's launched via Pooch, the Fractal code uses MPI to coordinate the work between physical nodes, but for each partition of the fractal, one copy of the code will distribute the work between the processors using MP. Within those subroutines, the work to compute four pixels is mapped to operations on a four-element floating-point vector.

Dr. Viktor Decyk combined MPI and MP earlier using the plasma physics codes at UCLA. The plasma code was a distributed-memory MPI code that ran on supercomputers and other parallel computing platforms; it divides the plasma particles among the MPI tasks. To take advantage of MP, he had each MPI task subdivide the loop over its portion of the particles among the available processors. To assist with the work, he wrote MacMP, available from the AppleSeed Multiprocessing Development page. MacMP uses Apple's MP to take a subroutine and push its work onto another CPU. In many cases, it's easier to do this using MacMP rather than calling Apple's MP library directly. The only other caveat about shared-memory MP is that one has to be careful that the routines are "thread safe", meaning they won't step on memory that other threads might need while they run.

For a code example demonstrating how vector and parallel processing can be combined, see the Parallel Pascal's Triangle tutorial.
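In the meantime, here is a hedged sketch of that layering, using POSIX threads as a stand-in for Apple's MP library and MacMP; the array name and sizes are illustrative, and the MPI-level decomposition is elided. MPI divides the domain among nodes, and each MPI task then splits its own slice across two local threads. Vectorizing the innermost loop (for example with AltiVec) would be a third, independent step.

    /* Hybrid sketch: MPI across boxes, threads within a box. */
    #include <pthread.h>
    #include <mpi.h>

    #define LOCAL_N 1000000
    static float data[LOCAL_N];               /* this task's slice of the domain */

    typedef struct { long begin, end; } Range;

    static void *process_range(void *arg)
    {
        Range *r = (Range *)arg;
        for (long i = r->begin; i < r->end; i++)
            data[i] = data[i] * 2.0f;          /* placeholder inner loop */
        return NULL;
    }

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        (void)rank; (void)size;  /* the MPI-level decomposition is elided here */

        /* Shared-memory level: split this task's slice across two threads. */
        pthread_t helper;
        Range first  = { 0, LOCAL_N / 2 };
        Range second = { LOCAL_N / 2, LOCAL_N };
        pthread_create(&helper, NULL, process_range, &first);
        process_range(&second);                /* main thread does the rest */
        pthread_join(helper, NULL);

        MPI_Finalize();
        return 0;
    }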
Grand Central Dispatch (GCD) is an easier way to do multithreading than OpenMP, Apple's previous multithreading API, or POSIX threads. The benefit is that, assuming a software writer can break their code down into separate smaller data-independent tasks, GCD provides the intelligence to get those tasks done on many cores. GCD is complementary to Pooch. Think of it as an "orthogonal axis" of parallelization (see above). Pooch and MPI can be used to parallelize across boxes, like an outer loop, while GCD could handle tasks within a box, like the inner loops. This is just like how the Fractal app parallelizes its inner loops using vectorization while working across boxes using MPI. You could also use just Pooch and MPI to parallelize across nodes and cores just fine and only worry about one API for parallel code.
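A hedged sketch of that division of labor follows: MPI (launched via Pooch) assigns bands of rows of a hypothetical image to nodes, and GCD's dispatch_apply spreads each node's rows across its cores. The image name and sizes are made up for illustration, and the band math assumes the row count divides evenly.

    /* Outer loop across boxes with MPI, inner loops across cores with GCD.
       Compile on Mac OS X with blocks support (e.g. clang). */
    #include <dispatch/dispatch.h>
    #include <mpi.h>

    #define WIDTH  1024
    #define HEIGHT 1024
    static float pixels[HEIGHT][WIDTH];

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Outer level: this MPI task owns one band of rows. */
        size_t rowsPerTask = HEIGHT / size;    /* assumes even division */
        size_t firstRow    = rank * rowsPerTask;

        /* Inner level: GCD farms the band's rows out to the local cores. */
        dispatch_apply(rowsPerTask,
                       dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0),
                       ^(size_t i) {
            size_t row = firstRow + i;
            for (size_t col = 0; col < WIDTH; col++)
                pixels[row][col] = (float)(row * col);   /* placeholder work */
        });

        MPI_Finalize();
        return 0;
    }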
If you know an application you'd like to see parallelized, we encourage you to suggest the idea to the developers of that app. In almost all cases, it is technically possible to parallelize a code to take advantage of cluster computing. We would consider the problem no more difficult than coding for dual processors within a box, but the benefits can be so much more than a 2x boost. Actually, the most difficult problem is convincing the developers to parallelize their source code. We encourage you to contact your app's developers to convince them that there is a demand for such capabilities. We are willing to help with the parallelization process. You may certainly refer them to this web site for information and inspiration.
We understand how desirable such a solution would be, but, for the foreseeable future, the answer is: No, it is not practical. There are two ways to explain why:

1. The high-level answer: Getting a typical code that has not been parallelized to run in parallel, and run well in parallel, is a very difficult thing to do. That has been tried in scientific computing for over a decade. After considerable work, special "autoparallelizing" compilers have taken non-parallel Fortran and C source, attempted to recognize independent work, and run it on parallel computers, all while getting the same answer as the original non-parallel code. (In principle the technology could probably be applied to PowerPC assembly, but that is probably more difficult.) Such codes did run, and they produced the right answer (very important in science, and not so easy to do), but they did not run well. A typical code achieved only 10-20% parallelism; that is, if you doubled the number of processors, it found the answer only 10-20% faster, which is nowhere near double performance. That low performance makes such a solution impractical. For comparison, when we hand-parallelize a code, we see 80-90% parallelism.

2. The low-level answer: Suppose you did somehow write a low-level code or kernel extension or some tool that, while an application was running, watched its raw instructions and recognized pieces that could be partitioned off into independent sections. Let's suppose it could recognize which pieces of memory those instructions needed and where they output their results. Let's suppose it could somehow figure out how to reassemble everything in memory afterward. And let's assume that this process of recognition, partitioning, and reassembly took zero time. What would happen? Typical loops and sections of independent code in a typical app are on the order of 10s to 1000s of instructions long. The modern PowerPC tries to push them through at a rate of one per cycle. Assuming a 1 GHz clock rate, this piece of code might take less than a microsecond to complete. It may be pushing data in and out of memory at, in round numbers, 100 MB/sec; in 1 microsecond, that's about 100 bytes. Each PowerPC instruction is four bytes long, so the size of the instructions plus data is 4 * 1000 + 100 = 4100 bytes, about 4 kB. You'd have to send this little parcel of instructions and data out to another computer over Ethernet, run it, then receive it back. On Gigabit Ethernet, we're seeing over 40 MB/sec throughput, so a 4 kB message would take about 100 microseconds to send. But at that small size, latency, the additional overhead time it takes to send any message at all, would dominate, and we've seen that latency is about 300 microseconds. So the send time is 100 + 300 = 400 microseconds, the compute time is 1 microsecond, and the time to send the output back is again dominated by latency, about 300 microseconds. The total time to send out this parcel of instructions and data, compute it, and send it back is 400 + 1 + 300 = about 701 microseconds, for a piece of code that, computed locally, would take about 1 microsecond. The point is that chopping up a code at such a low level would be dominated by communications time, effectively slowing down the overall computation.

(Perhaps you could design a 700-node system that would compute each piece, one after the other, so that after about 700 microseconds all the pieces would be done. But that too is impractical, because: 1. you would probably need hundreds of network cables from the one machine to the others to prevent network congestion; and 2. based on experience at Sandia National Laboratories, Lawrence Livermore National Laboratory, and the Ohio Supercomputing Center, that many nodes easily get out of step with each other, besides the fact that depending on 1000s of nodes quickly becomes unreliable (MTBF is around 8 hours) without extraordinarily clever management.)

We would have to see a tremendous breakthrough in communications performance to make this approach practical. The required level would be having messages reliably get from one node to another in less than a microsecond. Those kinds of speeds are seen within computers, moving pieces of data from RAM, to the bus, to the processor, but not between typical computers. Only the most advanced Cray T3-series parallel computers reached this regime, and their "network" costs $20,000 per node. Again, that price is impractical for most users. And neither Ethernet nor FireWire nor any other network technology seems close to reliably delivering that level (> 100x improvement in bandwidth and latency) until maybe a decade from now at the earliest. Remember that processor speeds will probably increase in the meantime, raising the bar further. So we must conclude that such a low-level parallelizer, while technically not impossible, would be impractical because communications time would dominate everything else. One could build such a tool, but few would actually use it because it wouldn't make the typical application faster. A fundamental and dramatic shift in computer technology would be required to change that conclusion.

When we design a parallel code, we parallelize at a much higher level. We get the best performance by having our code compute on the order of a second at a time in between sending possibly megabytes of data at a time, working around these latency problems, although such parameters can vary widely. In any case, this approach to parallelization requires a degree of intelligence to recognize these high-level pieces, or to form high-level pieces out of many low-level ones, and to organize them, but the important thing is: it works.
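For those who want to check the back-of-the-envelope arithmetic above, it can be reproduced with a few lines of C; the constants are the round numbers quoted in the answer, not measurements of any particular machine.

    /* Reproduces the communication-vs-compute estimate above. */
    #include <stdio.h>

    int main(void)
    {
        double instructions = 1000.0;     /* instructions in the code fragment  */
        double clock_hz     = 1.0e9;      /* 1 GHz, ~1 instruction per cycle    */
        double data_bytes   = 100.0;      /* data touched in ~1 microsecond     */
        double bandwidth    = 40.0e6;     /* ~40 MB/sec over Gigabit Ethernet   */
        double latency_s    = 300.0e-6;   /* ~300 microsecond message latency   */

        double compute_s = instructions / clock_hz;            /* ~1 us   */
        double parcel    = instructions * 4.0 + data_bytes;    /* ~4.1 kB */
        double send_s    = parcel / bandwidth + latency_s;     /* ~400 us */
        double return_s  = latency_s;                          /* ~300 us */

        printf("local compute: %.1f us\n", compute_s * 1e6);
        printf("remote total:  %.1f us\n", (send_s + compute_s + return_s) * 1e6);
        return 0;
    }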
Pooch is its own lock and key. You should keep track of your Pooch like you keep track of your keys. Before Pooch will accept commands from another Pooch, it must receive a passcode that matches its own. Then, all subsequent commands use a 512-bit encryption key that rotates for each message in a pseudo-random manner. Only those two Pooches can predict the next encryption and decryption keys. If a mistake in the passcode or commands is made at any time, Pooch will reject the connection. Since Pooch waits a second or two before it accepts another connection, an exhaustive search for the correct encryption keys (2^512 possibilities at one per second would take over 10^145 years) is extraordinarily unlikely to succeed.

The first passcode and the start of the rotating key are 512-bit pseudo-random numbers derived from the registration name of that Pooch (which is set at compile time). Therefore, only Pooches of the same registration will be able to communicate with one another. Because the registration name is unique for each Pooch customer, a copy of Pooch registered to, say, MIT, will not be able to communicate with a Pooch registered to UCSD. (For cross-registered Pooches or other customized configurations, encryption methods, or implementations, please email.)

Security for your cluster then becomes dependent on the security of your Pooch registered with your registration name. Your Pooch can be installed on the Macs of your cluster, and, if no additional copies of Pooch exist, no one can get in. But if you make a copy of that Pooch and bring it home to access the cluster, the security of that cluster depends on how securely you keep that extra copy of Pooch. This approach is also known as an "administrative domain". The nature of this security is analogous to having the ability to copy a key to a locked office. It is not uncommon to entrust a group of people with the keys to a shared resource, such as office equipment. The security of the equipment is shared by those who have copies of the key. These people understand the responsibility that comes with the privilege of that access. Access to Pooch can be shared in a similar way.

If you are using the downloadable demonstration version of Pooch, you should be aware that the same version can be downloaded by anyone else on the web. So, if they have the IP address of your Mac, they could access your Mac, to the extent that Pooch allows, over the Internet. Although guessing your Mac's IP address is unlikely, a uniquely registered Pooch makes for much better security.
You may remove Pooch and all its components by allowing Pooch to run normally, then holding down the Option key after launching the Pooch Installer. In the Pooch Installer, a dialog should appear that allows you to select either to upgrade or uninstall Pooch. Clicking on uninstall will delete the running Pooch and the components that allow it to run at logout. The latter process may require administrative authorization. Note:
If you were using the download version of Pooch and it expired, we suggest
downloading a current version and reinstalling it to overwrite
the expired version with a new one.
The uninstall function needs to detect a running Pooch to know where its components are.
Then you can uninstall it with the above procedure.
Yes, we are using our software with the Xserve G5, Power Mac G5, Mac OS X 10.3, and their predecessors. Pooch has combined Power Macs, PowerBooks, and Xserves, from 604e-based machines through the G5s, running OS 9 through 10.3.2. We have seen no problems using the new hardware and the new OS to run Pooch, MacMPI, and the other software on our site.
A user has said, "Pooch is Xgrid on steroids!" We couldn't agree more. With Pooch you can do what Xgrid can do, and much more. Pooch handles all major types of computing involving clusters. The kind of parallel computing that Xgrid focuses on is only a subset of what Pooch can address. The difference is that Xgrid handles problems requiring little communication, where centralized coordination is adequate. Pooch can handle that type of parallel computing plus the more demanding, tightly-coupled problems clusters are good for. Pooch handles cluster jobs, including grid-style jobs, while Xgrid is suitable for grid-style jobs only. See the Parallel Zoology page for more details on the differences between these types of jobs.

Clusters using Pooch are supercomputer-compatible. Pooch supports compliance with parallel computing standards. Pooch handles jobs that use MPI, the dominant programming interface used on clusters and supercomputers worldwide, while Xgrid does not. Pooch builds on the lessons already learned in scientific computing coinciding with MPI's wide adoption. Using Xgrid requires Xgrid-specific code to operate, making code written for Xgrid non-portable. MPI code written on a Mac cluster need not be Pooch-specific. An MPI code run on a 4000-processor supercomputer runs via Pooch on Mac clusters with only a recompile. We do that with MPI code all the time.

Pooch is more flexible with application code. Xgrid uses a plug-in architecture to accept application code. With Pooch, applications can stand alone or can choose to tap Pooch for cluster resources at run time. This feature is demonstrated in the Parallel menu of the Fresnel Diffraction Explorer. In addition, while Xgrid requires Cocoa-based code, Pooch accepts all the executable types Mac OS X and Mac OS 9 have to offer: Cocoa, Carbon, Mach-O, Classic, AppleScript, and Unix script. And some of these can be compiled using a variety of Fortran and C compilers. See the Pooch SDK for details.
MPI (Message-Passing Interface) is an industry-standard programming interface, not a program, meant for all forms of parallel computing. Several different implementations of MPI exist; mpich is one of them. mpich includes a job-launching utility called mpirun, but mpirun assumes that numerous non-intuitive settings, connections, and files, such as NFS, rsh or ssh, and machine lists, are all correctly configured, organized, and operating. Pooch requires only the simplest settings of a modern computer, such as those needed to run a web browser. And Pooch's capabilities extend far beyond those of mpirun: Pooch serves as a queuing system, scheduler, cluster management utility, graphical front end, and scripting interface, among many other functions. These functions are far beyond mpirun's scope and would otherwise require the user to integrate a host of easily incompatible command-line utilities. Pooch provides all that functionality in one convenient, reliable package.
To say that Pooch is an incremental improvement on Beowulf is like saying the original 1984 Macintosh was an incremental improvement on the IBM PC. Like the first computers to use graphical user interfaces, Pooch resulted from a complete rethinking of how to build and operate a parallel computer. We mean one that is designed, from the start, to be convenient, reliable, flexible, easy to use, and friendly, and, therefore, powerful. With Pooch, we reinvented the cluster computer.