The Performance API (PAPI) project specifies a standard
application programming interface (API) for accessing hardware
performance counters available on most modern microprocessors.
PAPI provides portability across different platforms and uses
the same routines with similar argument lists to control and
access the counters. To be successful, however, the PAPI library
needs some help from the operating system to gain access
to the information in the counters.
Presently, we have the latest version of PAPI (v3.5) running on
the Cluster. Recompiling the test harness and the DLL proved to
be relatively straightforward; most of the difficulty
came in sorting through the assembly-level portions of the kernel
driver that provides access to the counters. The WinPMC kernel driver
relied on inline assembly to access the hardware counters, but the
AMD64 environment provides no inline assembler. There was also
some inconsistency in the availability of compiler intrinsics for
the assembly instructions needed to access
the PMC registers, specifically in the implementations of the
cpuid instruction and the rdpmc (read performance-monitoring counter) instruction.
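As a rough illustration of replacing inline assembly with compiler intrinsics, the following C sketch reads the processor vendor string through cpuid. It is not the WinPMC driver code: the function name cpu_vendor is hypothetical, the Microsoft branch assumes the __cpuid intrinsic from <intrin.h>, and the GCC/Clang branch (included only so the sketch builds elsewhere) assumes __get_cpuid from <cpuid.h>. A counter read would use the matching __readpmc intrinsic in the same way; it is left out here because rdpmc faults in user mode unless the operating system has enabled user-level access.

```c
#include <assert.h>
#include <string.h>

#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>               /* Microsoft intrinsics: __cpuid, __readpmc */
#define HAVE_CPUID 1
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>                /* GCC/Clang helper: __get_cpuid */
#define HAVE_CPUID 1
#else
#define HAVE_CPUID 0              /* not an x86 build: no cpuid available */
#endif

/* Hypothetical helper: copies the 12-character CPU vendor string into
 * vendor (NUL-terminated). Returns 1 on success, 0 if cpuid is unavailable. */
int cpu_vendor(char vendor[13])
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    assert(vendor != NULL);
    vendor[0] = '\0';

#if !HAVE_CPUID
    return 0;
#elif defined(_MSC_VER)
    {
        int info[4];              /* receives eax, ebx, ecx, edx */
        __cpuid(info, 0);         /* leaf 0: maximum leaf and vendor string */
        eax = (unsigned int)info[0];
        ebx = (unsigned int)info[1];
        ecx = (unsigned int)info[2];
        edx = (unsigned int)info[3];
    }
#else
    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
        return 0;
#endif

    /* The vendor string is stored in the order ebx, edx, ecx. */
    memcpy(vendor,     &ebx, 4);
    memcpy(vendor + 4, &edx, 4);
    memcpy(vendor + 8, &ecx, 4);
    vendor[12] = '\0';
    return 1;
}
```

The point of the sketch is that the same source compiles for 32-bit and 64-bit targets without any inline assembly block, which is the property the AMD64 port of the driver needed.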
The C test programs provided with a normal PAPI distribution were
built and tested as appropriate for the Windows environment.
Most converted and ran cleanly in the Windows Server 2003 environment;
some had features that were no longer applicable. The Fortran test and
example programs were not converted because, at the time of this work,
a suitable replacement for the older Compaq Fortran
compiler had not been identified.
Remaining work revolves around two areas. The first involves completing the
test and example programs to bring them up to par with what’s available in
other PAPI distributions. The second is significantly more involved and
requires some explanation.
PAPI is primarily intended as a ‘first-person’ mechanism for attributing
hardware counter events to portions of program code. In order to do that,
the programmer (or a higher-level tool) inserts calls into the user code to start,
stop and read the hardware counters at specific points. This fundamentally assumes
that the counts occurring between the start call and the stop (or read) call can
all be attributed to the user’s code. That assumption holds only approximately
in a multitasking system, and the counts can be wildly inaccurate on a busy system. The only
way to guarantee that counts are properly attributed is for the operating system’s
context switch routine to save and restore the state of the performance monitoring
registers. This is how PAPI behaves on Linux systems. On Windows, the WinPMC driver
currently just controls the state of the counters and hopes for the best. This
works acceptably well on laptops or single-user systems, but not so well on clusters.
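The attribution problem can be made concrete with a small simulation. The C sketch below is a toy model, not PAPI or WinPMC code: the types cpu_t and task_t, the functions run, context_switch and measure, and the event counts are all hypothetical. It models one hardware counter shared by two tasks and compares what the measured task sees when the context switch does, and does not, save and restore the counter.

```c
#include <assert.h>

/* Toy model: a single performance counter register on one CPU. */
typedef struct {
    long long hw_counter;         /* simulated hardware counter */
} cpu_t;

/* Per-task counter state kept by the (simulated) operating system. */
typedef struct {
    long long saved_counter;
} task_t;

/* Whichever task is running adds its events to the shared register. */
void run(cpu_t *cpu, long long events)
{
    assert(events >= 0);
    cpu->hw_counter += events;
}

/* Context switch that saves the outgoing task's counter value and
 * restores the incoming task's value. */
void context_switch(cpu_t *cpu, task_t *from, task_t *to)
{
    from->saved_counter = cpu->hw_counter;
    cpu->hw_counter = to->saved_counter;
}

/* Task A brackets 1000 events of its own work with start/stop reads,
 * while task B runs 5000 events in the middle of the region.
 * Returns the count that task A observes. */
long long measure(int save_restore)
{
    cpu_t cpu = {0};
    task_t a = {0}, b = {0};
    long long start, stop;

    start = cpu.hw_counter;                   /* "start" call in task A */
    run(&cpu, 600);                           /* A, first time slice */
    if (save_restore) context_switch(&cpu, &a, &b);
    run(&cpu, 5000);                          /* B is scheduled in between */
    if (save_restore) context_switch(&cpu, &b, &a);
    run(&cpu, 400);                           /* A, second time slice */
    stop = cpu.hw_counter;                    /* "stop" call in task A */

    return stop - start;
}
```

In this model measure(0), with no save and restore, returns 6000 because task B's events are charged to task A, while measure(1) returns the 1000 events that task A actually generated.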
We would like to work with Microsoft engineers to determine the feasibility of
modifying the Compute Cluster kernel software to support functionality similar
to the open-source perfmon2 performance interface,
which is being incorporated into the Linux kernel and is rapidly being adopted as the
standard mechanism for accessing hardware performance counters.