Over the last couple of days I have been troubleshooting a critical server crash encountered by an airline, on behalf of a software consulting firm that requested my help. The symptom was that WebSphere 6.0.2 kept crashing with an Out of Memory (OOM) condition, sometimes also reaching 100% (or very high) CPU utilization. Although you can profile an application with excellent profilers such as JProbe, this becomes much more difficult when you cannot re-create the issue in any other environment and it keeps happening only on the live production instances, to which you have limited access.
Luckily, IBM JVMs generate a Portable Heap Dump (PHD) on an OOM, and IBM provides an array of extremely helpful tools for analyzing heap dumps, thread dumps and other information offline. It is therefore still possible to detect memory leaks in applications, even when they occur only on live production systems. Sometimes the cause turns out to be heap fragmentation rather than a leak: even though a considerable percentage of memory is still available, a contiguous chunk of the required size cannot be freed.
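To make the fragmentation point concrete, here is a toy sketch. It is not how the IBM JVM manages its heap; the class name `FragmentationSketch`, its methods and the chunk sizes are all made up for illustration. It only models the arithmetic: a request can fail even though the total free space is larger than the request, because no single free chunk is big enough.

```java
import java.util.Arrays;
import java.util.List;

// Toy model only: a "heap" described by the sizes of its free chunks.
// All names and numbers here are hypothetical.
public class FragmentationSketch {

    /** Sum of all free chunks, in bytes. */
    public static long totalFree(List<Long> freeChunks) {
        long sum = 0;
        for (long chunk : freeChunks) {
            sum += chunk;
        }
        return sum;
    }

    /** True only if a single contiguous free chunk can hold the request. */
    public static boolean canAllocate(List<Long> freeChunks, long requestBytes) {
        for (long chunk : freeChunks) {
            if (chunk >= requestBytes) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Long> free = Arrays.asList(4096L, 8192L, 2048L, 16384L);
        long request = 20000L;
        // Total free space exceeds the request, yet no single chunk fits it.
        System.out.println("total free bytes: " + totalFree(free));
        System.out.println("can allocate " + request + " bytes: " + canAllocate(free, request));
    }
}
```

In a real heap dump the same pattern shows up as plenty of free heap overall combined with a failed allocation of one large object, typically a big array.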
The above image shows the IBM Heap Analyzer detecting a memory leak in WebSphere DRS / session replication: 923MB of heap has been consumed by 14,011 HashMap#Entry objects held by the WebSphere Data Replication Service (DRS), which is used for HTTP session replication.
It is also worth checking whether any application code uses Xalan 2.6.0. On earlier occasions I have found memory leaks that are typically more difficult to trace, but which occur primarily because of a well-known bug in Xalan 2.6.0.
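One quick way to check is to ask Xalan itself which version is loaded. The sketch below assumes the Apache Xalan-J class `org.apache.xalan.Version` and its static `getVersion()` method; it calls them through reflection so the code still compiles and runs when Xalan is not on the classpath. The class name `XalanVersionCheck` and the helper `isSuspect` are my own, purely for illustration.

```java
import java.lang.reflect.Method;

// Minimal sketch: report the Xalan version visible to this class loader.
// XalanVersionCheck and isSuspect are hypothetical names for this example.
public class XalanVersionCheck {

    /** Returns the Xalan version string, or null if Xalan cannot be loaded. */
    public static String xalanVersion() {
        try {
            Class<?> versionClass = Class.forName("org.apache.xalan.Version");
            Method getVersion = versionClass.getMethod("getVersion");
            return (String) getVersion.invoke(null);
        } catch (ReflectiveOperationException | LinkageError e) {
            // Xalan is not visible from here, or its Version class is unusable.
            return null;
        }
    }

    /** True if the version string mentions the release discussed above. */
    public static boolean isSuspect(String version) {
        return version != null && version.contains("2.6.0");
    }

    public static void main(String[] args) {
        String version = xalanVersion();
        if (version == null) {
            System.out.println("Xalan not found on this classpath");
        } else {
            System.out.println(version + (isSuspect(version) ? " (check for the known leak)" : ""));
        }
    }
}
```

The result only reflects the class loader the check runs in, so on an application server it should be run from inside the application (for example from a diagnostic JSP or servlet), since each application may bundle its own copy of Xalan.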