Friday, September 20, 2013

Potamus update

The GAE cost profiling graphs have gotten a facelift, now using flot instead of Google visualizations. I rapidly hit the limit of the GViz capabilities (one notable shortcoming is the lack of support for sparsely-populated data). Most of the controls are now completely client-side, which makes it a lot easier to tweak the graph to get just the information you'd like.
Flot generally provides more CSS-level control over styling, and a nice plugin system to allow for mixing features.

Sunday, June 30, 2013

App Engine real-time cost profiling is available on github.  Some assembly required.

Wednesday, April 10, 2013

Cost profiling on Google App Engine

I've recently been measuring costs for various operations that are currently being performed on Google App Engine.  Google provides some cost estimates on the app engine dashboard, and you can get historical daily totals, but it's generally not straightforward to answer the question "How much does this operation cost (or is going to cost if I ramp up)?".

The google-provided appstats is fine for profiling individual requests, but sometimes you need a much more comprehensive view.

With a Chrome extension to monitor the app engine dashboard numbers, and a small app engine app to collect data, I've managed to collect some interesting intra-day profile data, as well as provide a means for fairly accurate estimates of discrete operations.

Group view (for multiple app IDs).  The artifact on the left is due to two days' worth of missing data.  The lower graph has an obvious daily cron job, while the upper has much more distributed activity:

 Zoomed view (detail for a single app ID).  On this graph, you can see some annotations have been added; the data collector provides an API for applications to post events that can be overlaid on the cost data, making it easy to pick start and end points and calculating the cost for the selected span:


This project is now available on github.  The Chrome extension is based on the OSE (Offline Statistics Estimator) which scrapes usage data and applies customizable usage rates from the GAE dashboard pages.

Wednesday, January 30, 2013

Enable cProfile in Google App Engine

If it's not readily apparent to you how to enable CPU profiling on Google App Engine (it certainly wasn't to me, aside from a few hand waves at cProfile), this code snippet should get you up and running so you can focus on finding the data you need rather than the implied interfaces you have to mimic.  It uses the standard WSGI middleware hook to wrap an incoming request in a cProfile call, formatting and dumping the resulting stats to the log when the request returns:


def cprofile_wsgi_middleware(app):
    """
    Call this middleware hook to enable cProfile on each request.  Statistics are dumped to
    the log at the end of the request.
    :param app: WSGI app object
    :return: WSGI middleware wrapper
    """
    def _cprofile_wsgi_wrapper(environ, start_response):
        import cProfile, cStringIO, pstats
        profile = cProfile.Profile()
        try:
            return profile.runcall(app, environ, start_response)
        finally:
            stream = cStringIO.StringIO()
            stats = pstats.Stats(profile, stream=stream)
            stats.strip_dirs().sort_stats('cumulative', 'time', 'calls').print_stats(25)
            logging.info('cProfile data:\n%s', stream.getvalue())
    return _cprofile_wsgi_wrapper

def webapp_add_wsgi_middleware(app):
    return cprofile_wsgi_middleware(app)

Saturday, September 22, 2012

Thread termination/exit handlers

I'd been hunting for a while for a good solution to automatically call cleanup operations on exit from threads I didn't own.  Windows (XP onward) has a pretty straightforward solution if you're working with a DLL, but for pthreads on most other platforms, the solution is not as obvious.

Some folks have been using JNA to allow Java code to be invoked as callbacks from various streaming (video, sound) libraries.  If these callbacks come from threads instantiated from native code, the JVM has to (at least temporarily) map the thread into Java space for the duration of the callback.  If the callbacks are frequent, and always come in on the same native thread, we don't want to incur the mapping overhead on every invocation.  The solution, then is to avoid detaching the thread when the callback finishes.

Now we have a new problem.  When the native thread actually terminates, the JVM has no idea that the thread went away because it's only really got a placeholder Thread object, and hasn't hooked up all the plumbing it normally does to detect that the thread has gone away and clean up/GC the mapped Thread object.  Thus the need for a thread exit handler.  In some cases, you may not care about the minimal object overhead, for instance if you've just got a few native threads.  However, with a lot of threads coming and going (we can't force folks to thread-pool).

Windows


On Windows, if you have a DllMain function defined, you'll get notices when threads and processes attach/detach your DLL's code.  This works out nicely, we can make the VM detach the current thread when we get that message:

// Store thread-local information required to detach the thread
// (in my case, only a JVM reference was required)
// TlsSetValue is only set when we recognize that we need the 
// extra cleanup
static DWORD dwTlsIndex;
BOOL WINAPI DllMain(HINSTANCE hDLL, 
                    DWORD fdwReason, 
                    LPVOID lpvReserved) {
  switch (fdwReason) {
  case DLL_PROCESS_ATTACH:
    dwTlsIndex = TlsAlloc();
    if (dwTlsIndex == TLS_OUT_OF_INDEXES) {
      return FALSE;
    }
    break;
  case DLL_PROCESS_DETACH:
    TlsFree(dwTlsIndex);
    break;
  case DLL_THREAD_ATTACH:
    break;
  case DLL_THREAD_DETACH: {
    extern void detach_thread();
    detach_thread(TlsGetValue(dwTlsIndex));
    break;
  }
  default:
    break;
  }
  return TRUE;
}
 Easy enough.  Note that TlsSetValue is called elsewhere only when the callback decides it doesn't want to detach immediately (rather than every time a native thread attaches).  Callbacks normally detach on exit if they weren't attached to begin with.

POSIX Threads (pthreads)

The pthreads library is a different beast.  Searches for "thread termination handler" don't turn up much, except for the stack-based pthread_cancel_push/pop, which seems to do the right thing, but have to arrive in pairs.  It's actually in the pthreads implementation of thread-local storage that we find a way to attach a termination handler to a given thread.

When you define a given thread-local variable in pthreads (a "key" in pthreads lingo), you can provide a destructor function to be used to clean up the storage...on thread exit.  It was a little hard to find since it's not a thread termination handler per se, but rather a mechanism to clean up thread local storage.  I'd overlooked it several times because I wasn't looking for thread local storage solutions.


// Basic plumbing to create a unique key identifying
// specific thread-local storage
static pthread_key_t key;
static void make_key() {
  extern void detach_thread();
  pthread_key_create(&key, detach_thread);
}

    ...
    // This code gets called whenever we identify that
    // a thread needs detach-on-exit behavior
    static pthread_once_t key_once = PTHREAD_ONCE_INIT;
    pthread_once(&key_once, make_key);
    if (!jvm || pthread_getspecific(key) == NULL) {
      pthread_setspecific(key, jvm);
    }
    ...

Now detach_thread will be called (with the value of the thread-local storage as its single argument) when the thread exits.

And voilĂ , you have a POSIX thread termination handler.

Saturday, October 10, 2009

WebStart unit tests

I had a few bits of JNA functionality that were only active when in a web start environment, which made it a bit tricky to add tests for them that could be run at the same time as all my other tests. The functionality also showed up as a big splot of missing code coverage under clover, so I decided to tinker just a bit to see if I could somehow simulate a web start environment for the unit tests.

I managed to come up with a JUnit-based test fixture which ensures its test methods are all run in a web start environment. This works well under Windows and OSX; unfortunately *nix variations use NetX which needs some extra hand-holding when first run.

Finding the right hooks involved standard hacking to find whether the most configurability was offered by invoking the javaws executable or adding the javaws classes to my classpath and hooking directly into the Java. I ended up doing a Runtime.exec to ensure the running environment wouldn't interfere with that of Web Start.

The main obstacle to be overcome was signed code. You can get web start to run unsigned code by tweaking the local policy file to allow the code we're testing, but when you include any native code via the <nativelib> tag, JNLP requires the <all-permission> tag, which seems to re-trigger the requirement for signed code. Self-signing the code is no big deal, but that triggers web-starts authorization dialogs to allow the unknown CA to be used. Fortunately, the somewhat obscure deployment properties file can be used to temporarily allow the self-signed certs and bypass the dialogs. The tests can be run entirely without user input (except in the case of *nix, where NetX is not sufficiently configurable -- you have to dismiss the dialog on the first test run).

Unfortunately the exit codes for javaws don't correspond to the codes passed to System.exit by Java code, so I had to have the test fixture communicate to the running javaws instance via socket. Failures are transmitted by the fixture in such a way that you can't tell that the test is running in a separate process.

I encapsulated the deployment config, JNLP file creation, javaws launch, and test case execution into a single class, WebStartTest, which runs a few self tests and can be extended to add whatever other tests you need to run under web start. The JNLP is partly hard-coded to include files set up sepecifically for JNA testing, but should be trivial to change to accommodate a different project.

There are probably other ways to test WebStart code, but this fit nicely with the project's existing tests, allowing its WebStart features to be tested by test methods identical to those for the rest of the project.

Full source is here. This has been tested under Sun and IBM JDKs, on Windows, OSX, Solaris/sparc, and GNU/Linux.

Thursday, September 11, 2008

JNA: increasing performance with large Structures

If you're using very large structures and using them often, here's a tip that can boost performance by several orders of magnitude. Note that you should follow this tip *only* if you really need the performance boost; otherwise you may wind up obfuscating your code.

By default, when JNA makes a native call it will copy the full contents of a Java Structure to native memory prior to the call and read it all back after the call. If your Structure is very large, this can result in significant overhead reflecting all the fields of the Structure. The reflection dwarfs the actual native communication time.

If you're only reading or writing a single field, it's much faster (although somewhat less elegant) to use the readField(String) and writeField(String) methods to access the data, while disabling the normal read and write. Depending on the size of your structure, you may see two orders of magnitude or more improvement in the native function call time.

Here's an example of performing the same operation two different ways:


class Big extends Structure {
public int toNative;
public int fromNative;
// plus lots more fields
// the more, the bigger the difference in performance

}

class FastBig extends Big {
public void read() { }
public void write() { }
}

Big big = new Big();
big.toNative = 42;
lib.callMyNativeFunction(big);
System.out.println("Got " + big.fromNative);

Big fast = new FastBig();
fast.toNative = 42;
fast.writeField("toNative");
lib.callMyNativeFunction(fast);
System.out.println("Got " + fast.readField("fromNative"));


If you wrap a loop and time these, you'll see what kind of difference it makes. On a test structure with 25 "int" fields, the fast version reduces time by a factor of 10.

Trivia: some other "struct" implementations (e.g. Javolution) use objects for all fields and require an explicit "write" or "set" on each. This reduces data transfer and/or reflection overhead, at the expense of simplicity of assignment and initialization.

JNA:
s.field = 1;
call(s.field);

Javolution:
s.field.set(1);
call(s.field.get());