Monday, March 28, 2011

Have you missed this presentation?

Have you missed 'Deadlocks: the beginning of the end' at EclipseCon last week?

Here's your chance to get your own personal copy in high definition!

26 comments:

JamesB said...

Interesting presentation... It highlights a few gotchas in the Eclipse concurrency API, but there are a couple of things that I don't quite see...

Job API + Scheduling rules: As scheduling rules must contain all rules that will be acquired, it's not possible to generate a deadlock using scheduling rules (they fail-fast). So the comment on the number of following locks should have no consequence, as they're a separate class of locks. Agreed this property makes them harder to use than pure locks...

There was a proposal to nest scheduling rules:
https://bugs.eclipse.org/bugs/show_bug.cgi?id=206664
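
As a concrete (hypothetical) illustration of that containment requirement, locking a whole project for a job looks roughly like this (project is assumed to be an IProject, which is itself an ISchedulingRule):

Job build = new Job("Build project") {
    protected IStatus run(IProgressMonitor monitor) {
        // everything this job locks must be contained in the rule declared below
        return Status.OK_STATUS;
    }
};
build.setRule(project); // the declared rule must contain every rule the job will acquire
build.schedule();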

The ThreadPool provided by the jobs API is surely a good thing? It prevents the proliferation of threads per consumer and amortizes the cost of creating and destroying threads.

I don't get how the thread on which Jobs are scheduled can cause issues? They're always scheduled from the root of one of the threads from the threadpool, and there aren't other locks held here => so it should have no runtime effect on the likelihood of deadlock?

The main source of deadlocks I see is to do with acquiring locks and scheduling rules on the UI thread. From my POV any blocking operation whether it's lock acquisition, I/O, whatever, should happen off the UI in a Job, and UI updates should be scheduled using an #asyncExec.
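
A minimal sketch of that pattern, assuming a hypothetical computeResult() and viewer:

Job job = new Job("Background work") {
    protected IStatus run(IProgressMonitor monitor) {
        final Object result = computeResult(); // the blocking part: locks, I/O, etc.
        Display.getDefault().asyncExec(new Runnable() {
            public void run() {
                viewer.setInput(result); // UI update runs back on the UI thread
            }
        });
        return Status.OK_STATUS;
    }
};
job.schedule();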

As you point out #syncExec is broken. Especially acquiring a scheduling rule in a #syncExec:
https://bugs.eclipse.org/bugs/show_bug.cgi?id=312179

Also Eclipse ILocks and SchedulingRules have the nice property that they're deadlock free. If two ILocks should happen to deadlock the platform lets one of them proceed so the user shouldn't end up with two blocked threads...


I think deadlocks are a small part of the far greater problem of incorrect locking. It's very easy, when reducing a lock scope, to create a situation where data structure accesses are thread-safe, but the code contains race conditions.
Or worse: it's very easy to incorrectly lock data structures. The compare editor is a great example of something that regularly breaks for me on a multi-core machine -- it doesn't deadlock, it just randomly breaks the editor when I perform actions too quickly.

Google have a project that would partially address the above. ThreadSanitizer checks that data structures are accessed using a consistent set of locks:
http://code.google.com/p/java-thread-sanitizer/

Finally I'm not sure what the proposal in bug 340819 buys that the existing UI thread + Worker Jobs doesn't give you? From the POV of someone extending Job, I don't see any difference in the flow / locks held. Surely the point of the Jobs API is to abstract the runnable away from the underlying thread? Furthermore if you're serializing CommonJobs to one thread, competing unrelated CommonJobs never run concurrently, so you've reduced the concurrency and throughput of the application to that single thread. The result is then that unrelated tasks can't easily exploit system parallelism? Moreover if an API relied on work happening on CommonJob for correctness, then that API would be inherently single-threaded and would compete with all other users of CommonJob.

Thanks for the neat tools and exploratory work you're doing on this Serge! :)

Serge Beauchamp said...

Hi James, thank you for your insightful comment, please see my comments below:

"Job API + Scheduling rules: As scheduling rules must contain all rules that will be acquired, it's not possible to generate deadlock using scheduling rules (they fail-fast). So the comment on number of following locks should have no consequence, as they're a separate class of locks. Agreed this property make them harder to use than pure locks...
"

It's possible to have a deadlock between a scheduling rule, and a simple synchronized block, for example. So scheduling rules are ultimately real locks that can cause concurrency problems, and their follower count matters.
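
A hedged sketch of such a mixed deadlock (sharedState, rule and the two code paths are made up; the job manager's deadlock detector can't see the Java monitor, so it can't resolve this):

// Thread A: runs inside a Job that already holds the scheduling rule 'rule'.
synchronized (sharedState) {
    // blocks if thread B already owns the monitor
}

// Thread B: owns the monitor, then asks for the same scheduling rule.
synchronized (sharedState) {
    Job.getJobManager().beginRule(rule, null); // blocks: thread A's Job holds 'rule'
    try {
        // work under both the monitor and the rule
    } finally {
        Job.getJobManager().endRule(rule);
    }
}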

"The ThreadPool provided by the jobs API is surely a good thing?"

I don't agree that having a thread pool transparently for Jobs is a good thing. Thread pools are good, but only when there's a valid use case for them.

Carelessly using a thread pool, as the Jobs API does, causes more harm than good, I believe, because it obfuscates the threading model, promotes the proliferation of threads, and makes lock acquisition order unpredictable between threads.

The only case for the thread pool is that it somewhat reduces the cost of spawning a new Job. The negligible cycles spent spawning threads in Eclipse are not the reason we end up with 9% CPU usage: the reason is that concurrency is unmanageable, and automatic thread pools are one of the things that increase the cost of managing concurrency.

Serge Beauchamp said...

"I don't get how the thread on which Jobs are scheduled can cause issues? They're always scheduled from the root of one of the threads from the threadpool, and there aren't other locks held here => so it should have no runtime effect on likeliness to deadlock?"

If two runnables acquire two locks in an inconsistent order, but they happen to be scheduled on the same thread because the two Jobs share a thread pool, no deadlock can occur.

But when the load on the system increases and those two runnables are suddenly executed on two different threads, they can deadlock.

This makes the threading model (what code acquires what locks on what threads) unpredictable, very difficult to debug, and unstable.

The thread pool decides whether to spawn a new thread based on whether there's an available idle thread, which itself depends on a race condition.

Systematically using thread pools amplifies the unpredictability and randomness of deadlocks.
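
A hedged illustration of that point (lockA, lockB and the two runnables are made up):

Runnable r1 = new Runnable() {
    public void run() {
        synchronized (lockA) { synchronized (lockB) { /* work */ } }
    }
};
Runnable r2 = new Runnable() {
    public void run() {
        synchronized (lockB) { synchronized (lockA) { /* work */ } }
    }
};
// If the pool happens to run r1 and r2 on the same worker thread, they execute one after
// the other and nothing bad happens. Under load the pool grows, they land on two different
// threads, and the inconsistent ordering can now deadlock.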

Serge Beauchamp said...

"The main source of deadlocks I see is to do with acquiring locks and scheduling rules on the UI thread. From my POV any blocking operation whether it's lock acquisition, I/O, whatever, should happen off the UI in a Job, and UI updates should be scheduled using an #asyncExec."

I believe this is due to the proportion of code actually executed on the main thread, and the fact that deadlocks happening on background threads might go unreported by customers (the UI might not freeze).

Also, the UIJob/Display.asyncExec problem makes deadlocks more likely, so the main thread is 'crippled' by design by the concurrency APIs.

But deferring runnables to other threads doesn't solve the deadlocks; the inconsistent lock acquisition will still be there, and will simply deadlock worker threads instead.

Serge Beauchamp said...

"Also Eclipse ILocks and SchedulingRules have the nice property that they're deadlock free. If two ILocks should happen to deadlock the platform lets one of them proceed so the user shouldn't end up with two blocked threads..."

Well, they aren't deadlock free when used with other synchronization primitives.

Scheduling rules are deadlock free (although that's not strictly true, I believe) because they are so restrictive (only one held per Job) that they can't be systematically used through the APIs.

Then, ILocks are deadlock free, because they spontaneously transform a deadlock into a memory corruption!

Worse, it doesn't display anything to the user, but silently logs the issue in the workspace .log file! How is the user to be aware that the locks designed to prevent data corruption were released by the Job Manager, and that his data has been corrupted in an unpredictable way? This can't be a solution to handling deadlock issues.

So I don't see the current Jobs API as a satisfactory API to avoid deadlocks.

Serge Beauchamp said...

Thanks for the link about the ThreadSanitizer.

There's a tool from IBM (MTRAT - http://www.alphaworks.ibm.com/tech/mtrat) that does the same thing for Java.

Serge Beauchamp said...

"Finally I'm not sure what the proposal in bug 340819 buys that the existing UI thread + Worker Jobs doesn't give you? From the POV of someone extending Job, I don't see any difference in the flow / locks held."

If you have 10 runnables, and you use 10 Jobs to schedule them, depending on random race conditions, they can end up all being executed on 10 different threads.

If you have 10 runnables, and you schedule them all using the CommonJob, they will never execute in parallel; they will be executed one after the other as a FIFO list on a single thread, the 'Common thread'.

This is a significant difference in threading model.

This means, for example, that if you have different runnables that acquire locks in inconsistent orders that are difficult to resolve, scheduling them on the CommonJob will avoid any possible deadlock between them, while keeping the main thread responsive.
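
If it helps, the behaviour I'm describing is roughly analogous to a single-threaded executor (this is only an analogy, not the actual bug 340819 API; runnables is a made-up list):

ExecutorService commonThread = Executors.newSingleThreadExecutor();
for (Runnable r : runnables) {
    commonThread.submit(r); // all runnables execute FIFO on the same single thread
}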

Does that make sense?

JamesB said...

> It's possible to have a deadlock between a scheduling rule, and a simple synchronized block, for example.

Oh sure. The platform concurrency API is only deadlock safe when used exclusively. Introducing any other locks will cause issues.

> Then, ILocks are deadlock free, because they spontaneously transform a deadlock into a memory corruption!

Hmm.. It depends on the definition of corruption. The API ensures that there aren't two threads *concurrently* accessing the same data structure (and there is a memory barrier between each run). The Locks API only yields control to one thread when a deadlock is detected while both threads are waiting to run.

Clearly results are unpredictable, and there is a race, however there isn't concurrent access.
Also it now logs an error to the error log so users can easily report the bug:
http://help.eclipse.org/helios/index.jsp?topic=/org.eclipse.platform.doc.isv/reference/api/org/eclipse/core/runtime/jobs/ILock.html

For this reason I think ILock should be preferred over basic synchronization in the platform.
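
For reference, the usual ILock pattern is something like this (a minimal sketch):

ILock lock = Job.getJobManager().newLock();
// ...
lock.acquire();
try {
    // touch the protected data structure
} finally {
    lock.release();
}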

JamesB said...

> Thanks for the link about the ThreadSanitizer.

ThreadSanitizer also exists for Java:
http://code.google.com/p/java-thread-sanitizer/
and it's open source.


> If two runnables acquire two locks in an inconsistent order, but they happen to be scheduled on the same thread because the two Jobs share a thread pool, no deadlock can occur.

But this is a real bug, right? Either you acquire locks consistently, or you suffer deadlocks, or you're limited to being single-threaded. You can't both allow inconsistent lock ordering and provide concurrency.

...
> This is a significant difference in threading model.

However, I think the result is you quickly enter a trap where all code has to be run on a single thread to be thread-safe... As soon as anyone wants to acquire more than one lock outside the CommonJob, they're at risk of deadlock against the bad jobs written to run on the CommonJob thread.

The problem I see is that this appears to boil down to making the application single threaded (or double threaded if you count trivial work being done on the UI). As soon as clients want to do work in parallel *and* interact with locks used by CommonJob users everything breaks.

This is basically where eclipse is today where lots of people seem to want to run stuff on the UI thread. As soon as you want to go parallel deadlocks are back.

Multi-thread is hard because the state space of the system explodes exponentially. Having a consistent API like Jobs means that developers are forced to consider lock ordering, and this is important. If instead we constrain part of the system, it may make deadlocks less likely (because there's less concurrency) but they'll still exist. Moreover when we want to exploit parallelism it's harder because parallel code will now deadlock with badly written code running on CommonJob.

Serge Beauchamp said...

"Hmm.. It depends on the definition of corruption. The API ensure that there aren't two threads *concurrently* accessing the same datastructure (and there is a memory barrier between each running). The Locks API only yields control to one thread when deadlock is detected while both threads are waiting to run."

Unless the programmer acquired locks unnecessarily, preventing concurrent data modification where no modification is actually performed, having a lock released by the Jobs API will corrupt the data structure.

Granted, locks are a mechanism available to the developer to prevent concurrent access to some data structure; since the Jobs API destroys this mechanism exactly when deadlocks are about to occur, it will likely corrupt the data structure, i.e. allow modifications that would make its internal consistency invalid.

"For this reason I think ILock should be preferred above basic synchronization in the platform."

I guess it depends on how valuable the data is, and I can certainly imagine that in some cases, for example Internet VoIP, having some level of data corruption isn't a fatal error.

Because ILocks are inherently unreliable, and generally speaking, data structure integrity is paramount, I strongly recommend not using ILocks.

Serge Beauchamp said...

"But this is a real bug, right? Either you acquire locks consistently, or your suffer deadlocks, or you're limited to being single threaded. You can't both allow inconsistent lock ordering and provide concurrency."

Of course, but if everyone did that, we'd never have deadlocks in the first place.

The point is that there are several ways to resolve a deadlock, and one way is to reduce concurrency, by either acquiring more locks (or coarser locks), or reducing the usage of threads.

The CommonJob pattern is a way to reduce the usage of threads for jobs that should not block the main thread, but do not offer a significant enough performance or efficiency benefit to justify being scheduled on an independent thread and hence creating a deadlock risk.

Serge Beauchamp said...

"Multi-thread is hard because the state space of the system explodes exponentially. Having a consistent API like Jobs means that developers are forced to consider lock ordering, and this is important. If instead we constrain part of the system, it may make deadlocks less likely (because there's less concurrency) but they'll still exist. Moreover when we want to exploit parallelism it's harder because parallel code will now deadlock with badly written code running on CommonJob."

Well, having badly written code running on independent threads will be worse.

Also, I wouldn't consider code that acquires locks in an inconsistent order "bad code", because that's something very difficult to avoid, especially when working with large systems and interdependent components.

My perspective is that some lock orderings are difficult to resolve, for example because some locks are in 3rd-party components that it isn't possible to change.

Having a CommonJob simplifies the threading model, allows different components to share a common thread to eliminate problems with lock ordering, and makes it easier to parallelize the tasks that are worth running on independent threads.

Serge Beauchamp said...

"ThreadSanitizer also exists for Java:
http://code.google.com/p/java-thread-sanitizer/
and it's open source."

Thanks, I'll take a look at it!

JamesB said...

> Because ILocks are inherently unreliable,

How are they unreliable? If there are bugs in them they should be fixed... They seem to work well for me in current Eclipses. Note that these are pretty widely used throughout Eclipse, and I've fixed a few bugs in ILock and the DeadlockDetector, so I'm pretty confident they're doing the right thing ;)

> and generally speaking, data structure integrity is paramount, I strongly recommend not using ILocks.

Deadlock occurs at synchronization points and the code doing the modification holds both locks taking part in the deadlock.

So the state of the system is like:

t1: Holds l1 -> does some work with l1 -> attempts l2
t2: Holds l2 -> does some work with l2 -> attempts l1
*Deadlock*

State++:
t1: Holds l1 -> given l2 -> does some work holding both locks
t2: waiting for l1 : waiting for l2.

I'd bet, in the majority of cases, that the threads are acquiring a new lock to do some work on data structures protected by that new lock. So releasing t2's lock at a synchronization point is a better result than just hanging forever.

Note that in real code I've rarely seen an ILock deadlock reported. However they are a good backstop for the 1/1,000,000 case where something happens in an unexpected way. It allows the user to continue working and the developer to fix the bug.

It's like a runtime version of your lock tracer where the work of testing lock interactions is distributed across the user-base ;)

Serge Beauchamp said...

"The problem I see is that this appears to boil down to making the application single threaded (or double threaded if you count trivial work being done on the UI). As soon as clients want to do work in parallel *and* interact with locks used by CommonJob users everything breaks.

This is basically where eclipse is today where lots of people seem to want to run stuff on the UI thread. As soon as you want to go parallel deadlocks are back.
"

I agree with you.

But right now, as soon as runnables that currently run on the main thread are put in independent Jobs, they cause deadlocks.

So it's better to be in a situation where the main thread remains responsive, where a lot of stuff runs on the CommonJob and the struggle is to move it off the CommonJob onto independent threads, than to struggle, as today, to get stuff off the main thread onto independent threads.

From the user's perspective, CommonJob traffic is only a performance problem, not a usability issue like today's, where too many runnables are kept on the main thread because they cause problems when running on independent threads.

Serge Beauchamp said...

"The API ensure that there aren't two threads *concurrently* accessing the same datastructure (and there is a memory barrier between each running)."

From the programmer's point of view, they are running concurrently.

Example:

synchronized (lockA) {
    int value = structure.getValue();
    value += structure.getDelta();
    structure.setValue(value);
}

Now, say that getDelta() acquires lockB, and can cause a deadlock with some other code acquiring first lockB, then lockA.

What the Jobs Manager effectively does is allow the value returned by structure.getValue() to suddenly change before the code has a chance to call structure.setValue().

So it doesn't matter that the modification was done during CPU cycles where the code was blocked on a lock; the only important thing is that the structure couldn't change while the lock was held.

And this is the guarantee the Jobs Manager violates.
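
For completeness, the conflicting path in this made-up example would be something like the following, assuming lockB is also a plain monitor:

// Acquires lockB first (the lock getDelta() needs), then lockA -- the reverse
// order of the snippet above, which is what sets up the deadlock.
synchronized (lockB) {
    synchronized (lockA) {
        structure.setValue(structure.getValue() + 1);
    }
}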

JamesB said...

OK, I see your point about a single-threaded non-UI CommonJob.
However I don't like the fact that it might encourage locks to be acquired in an arbitrary order, because that happens to work with the developer's own set of CommonJob-using plugins.

Would it be possible to have an API a bit like Display access? So something like:

CommonJob.asyncExec(new Runnable() {
....
});

and then the called APIs, which require this simplified concurrency model, would assert that they're only called on the CommonJob thread:

assert CommonJob.getThread() == Thread.currentThread()

That way client API can explicitly set their threading requirement rather than relying on an implicit contract which might break when locks are in play and multiple threads used in addition to the CommonJob.
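
A hedged sketch of what such a callee-side guard might look like (CommonJob.getThread() is the hypothetical accessor from above, in the spirit of SWT's own thread check):

private void checkCommonThread() {
    if (CommonJob.getThread() != Thread.currentThread())
        throw new IllegalStateException("Must be called from the CommonJob thread");
}

public void updateModel(Object newValue) {
    checkCommonThread(); // fail fast if a client calls from the wrong thread
    // safe to touch the single-threaded data structures here
}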

JamesB said...

> From the programmer's point of view, they are running concurrently.

Well yes, they are when a deadlock occurs :)

I agree deadlocks are bad and *must* be fixed.
However there's a choice between a hard lockup and continuing while logging a major health warning. The latter must be more user-friendly.

Serge Beauchamp said...

"Note that in real code I've rarely see an ILock deadlock reported. However they are a good backstop for the 1/1,000,000 case where something happens in an unexpected way. It allows the user to continue working and the developer to fix the bug. "

Well, maybe your estimate is good, maybe it's not, but it seems we agree that the ILock *is* unreliable.

Maybe for some customers, that number will jump to 1/10.

Maybe for some applications, a random data corruption rate might be acceptable, but imagine if every software API were designed that way (an unpredictable failure rate by design): we'd be buried under random crashes and data corruption.

My perspective is that this inherent and unpredictable API failure is unlike any other software API I know of. This is not "if you pass this parameter, the behaviour is undefined"; it is "under a random race condition, the lock might simply unlock and allow other code to change your data structure".

If the ILock caused a dialog to be displayed to the user, I'd understand; it would be more of a graceful recovery from a fatal error (such as an out-of-memory error). But as it is implemented now (which user routinely checks the .log file to see if his data was corrupted?), it transforms a bug (a deadlock) into an even worse bug (random data corruption).

Serge Beauchamp said...

"Would it be possible to have an API a bit like Display access? So something like:

CommonJob.asyncExec(new Runnable() {
....
});"

This would accomplish the exact same thing as the current API:

CommonJob job = new CommonJob(new Runnable() {
....
});
job.schedule();

I think it would be good, but is the point only to be able to replace Display.asyncExec() by CommonJob.asyncExec()?

JamesB said...

> I think it would be good, but is the point only to be able to replace Display.asyncExec() by CommonJob.asyncExec()?

Well the part I was most interested in is how the API callee-side behaves. If it asserted that it was only called from the CommonJob.getThread() then there are benefits:
a) The locks could be removed, as the data structures are only accessed from one thread
b) Deadlock is not possible as there isn't real multi-threaded access.

The issue I see is that currently this mechanism only prevents deadlock where every client subscribes to the pattern.

If the API doesn't enforce correct thread access then deadlock will still occur - and I believe is more likely to occur. CommonJob deliberately doesn't require a client to use consistent lock orderings (thus providing the property of no-deadlock). So CommonJob clients will collide with non-CommonJob clients unless thread locality is enforced.

Serge Beauchamp said...

"If the API doesn't enforce correct thread access then deadlock will still occur - and I believe is more likely to occur. CommonJob deliberately doesn't require a client to use consistent lock orderings (thus providing the property of no-deadlock). So CommonJob clients will collide with non-CommonJob clients unless thread locality is enforced."

That would be one way of using the CommonJob, and I think it would be a good thing for it to support such an API (I'll look into it tomorrow).

Although I believe that the CommonJob is useful even when APIs don't explicitly restrict their use to that thread (although it wouldn't necessarily be a bad thing, much as SWT restricts widget access to the UI thread).

JamesB said...

> If the ILock caused a dialog to be displayed to the user, I'd understand; it would be more of a graceful recovery from a fatal error (such as an out-of-memory error).

Agreed, there's no reason it shouldn't do this.

> But as it is implemented now (which user routinely checks the .log file to see if his data was corrupted?), it transforms a bug (a deadlock) into an even worse bug (random data corruption).

Note that for a real deadlock users have to provide a backtrace, which is harder to get them to do. Especially if the deadlock is not on the UI thread they may not even know that's the problem.

With the Eclipse .log you can easily say 'send me your .log file', or do as we do here and get Eclipse to periodically log errors and warnings to a central server.

JamesB said...

> Although I believe that the CommonJob is useful even when APIs don't explicitly restrict their use to that thread (although it wouldn't necessarily be a bad thing, much as SWT restricts widget access to the UI thread).

And clearly there's no way for the CommonJob API to enforce this. It can provide API so that methods can discover which thread is allocated to CommonJob statically. However it's up to the API providers to lay down the rules of engagement.

I'm just thinking that for this to be useful as a way to constrain multi-threaded complexity, such usage should be recommended in the JavaDoc.

JamesB said...

> Then, ILocks are deadlock free, because they spontaneously transform a deadlock into a memory corruption!

BTW I think the behaviour could be configurable. The Workbench already has a switch -allowDeadlocks which causes the UILockListener to not be installed (which prevents locks from being transferred to the UI thread during a syncExec from a thread holding locks).

The DeadlockDetector also contains a Debug option:
-debug /jobs/errorondeadlock. This causes one of the deadlocking threads to throw IllegalStateException.

I think we could promote this up to a workbench option rather than debug option so you get the best of both worlds in your product: you don't get a deadlock _and_ you prevent data corruption.

With a real deadlock, users are at the mercy of developers fixing the bugs -- it's a work-stopping bug. With an ILock deadlock, the default behaviour may translate the deadlock into a data race and the user can continue while a fix is worked on.

Getting the user to give us a backtrace is key, but a full crash is an unfriendly way to do it.

Serge Beauchamp said...

I agree; the ideal would be an ILock that notified the user that something bad happened and that he should restart the workbench, while at the same time allowing for a graceful shutdown, as opposed to a VM monitor that hard-freezes the process.